Robert Muir created LUCENE-5748:
-----------------------------------
Summary: SORTED_NUMERIC dv type
Key: LUCENE-5748
URL: https://issues.apache.org/jira/browse/LUCENE-5748
Project: Lucene - Core
Issue Type: New Feature
Reporter: Robert Muir
Currently for Strings you have SORTED and SORTED_SET, capable of single and
multiple values per document respectively.
For multi-numerics, there are only a few choices:
* encode with NumericUtils into byte[]s and store with SORTED_SET.
* encode yourself per-document into BINARY.
Both of these techniques have problems:
SORTED_SET isn't bad if you just want to do basic sorting (e.g. min/max) or
faceting counts: most of the bloat in the "terms dict" is compressed away, and
it optimizes the case where the data is actually single-valued. But it falls
apart performance-wise if you want to do more complex stuff like Solr's
analytics component or Elasticsearch's aggregations: the ordinals just get in
your way and cause additional work, dereferencing each to a byte[] and then
decoding that back to a number. Worst of all, any mathematical calculations are
off because it discards frequency (deduplicates).
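To make the deduplication problem concrete, here is a minimal sketch in plain
Java. It does not use Lucene at all: a TreeSet stands in for a set-based store,
and the class and method names are made up for illustration.

```java
import java.util.List;
import java.util.TreeSet;

public class DedupDemo {
    // Sum over every stored occurrence: what a multi-valued numeric
    // field would be expected to return.
    static long sumWithDuplicates(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Sum over distinct values only, mimicking a store that keeps a set
    // of values per document and so drops repeated occurrences.
    static long sumDistinct(List<Long> values) {
        long sum = 0;
        for (long v : new TreeSet<Long>(values)) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical document with the value 5 stored twice.
        List<Long> prices = List.of(5L, 5L, 10L);
        System.out.println(sumWithDuplicates(prices)); // 20
        System.out.println(sumDistinct(prices));       // 15
    }
}
```

Min and max come out the same either way, which is why basic sorting is
unaffected; sums and averages are what go wrong.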
Using your own custom encoding in BINARY removes the unnecessary ordinal
dereferencing, but in exchange you get bad compression and slow access: you
have no real choice but to do something like vInt within each byte[] for the
doc, which means even basic sorting (e.g. max) is slow, as it's not constant
time. There is no chance for the codec to optimize things like dates with GCD
compression or optimize the single-valued case, because it's just an opaque
byte[].
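As a rough illustration of the access cost, the sketch below packs a document's
values into one byte[] using a simple 7-bits-per-byte variable-length scheme.
This is an assumed encoding written for this example (non-negative values
only), not Lucene's own code. Finding the max means decoding every value.

```java
import java.io.ByteArrayOutputStream;

public class VLongBinaryDemo {
    // Encode non-negative values back to back, 7 payload bits per byte,
    // with the high bit set on every byte except the last of each value.
    static byte[] encode(long[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long v : values) {
            while ((v & ~0x7FL) != 0) {
                out.write((int) ((v & 0x7F) | 0x80));
                v >>>= 7;
            }
            out.write((int) v);
        }
        return out.toByteArray();
    }

    // There is no way to jump to the largest value: the whole byte[]
    // has to be walked and decoded, so the cost is linear in its length.
    static long max(byte[] bytes) {
        long max = Long.MIN_VALUE;
        int pos = 0;
        while (pos < bytes.length) {
            long v = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                v |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            max = Math.max(max, v);
        }
        return max;
    }

    public static void main(String[] args) {
        byte[] doc = encode(new long[] {3, 300, 70000});
        System.out.println(doc.length); // 6
        System.out.println(max(doc));   // 70000
    }
}
```

The codec only ever sees the resulting bytes, so it has nothing numeric to
compress or specialize on.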
So I think it would be good to explore a simple long[] type that solves these
problems.
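One possible shape for such a type, sketched in plain Java: each document's
values are kept sorted, with duplicates preserved, in one flat long[] plus an
offsets array. The class and method names here are hypothetical and are not a
proposed Lucene API.

```java
import java.util.Arrays;

public class SortedLongsSketch {
    private final long[] values;  // all documents' values, concatenated
    private final int[] offsets;  // doc d owns values[offsets[d] .. offsets[d + 1])

    public SortedLongsSketch(long[][] perDoc) {
        offsets = new int[perDoc.length + 1];
        for (int d = 0; d < perDoc.length; d++) {
            offsets[d + 1] = offsets[d] + perDoc[d].length;
        }
        values = new long[offsets[perDoc.length]];
        for (int d = 0; d < perDoc.length; d++) {
            long[] sorted = perDoc[d].clone();
            Arrays.sort(sorted); // sort within the doc, keeping duplicates
            System.arraycopy(sorted, 0, values, offsets[d], sorted.length);
        }
    }

    // Number of values stored for the document, duplicates included.
    public int count(int doc) {
        return offsets[doc + 1] - offsets[doc];
    }

    // The i-th smallest value of the document.
    public long valueAt(int doc, int i) {
        return values[offsets[doc] + i];
    }

    // Constant time; assumes the document has at least one value.
    public long min(int doc) {
        return values[offsets[doc]];
    }

    // Constant time; assumes the document has at least one value.
    public long max(int doc) {
        return values[offsets[doc + 1] - 1];
    }
}
```

Because the values stay plain longs, a codec would be free to apply numeric
compression to them or to special-case data where every count is 1.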