Robert Muir created LUCENE-5748:
-----------------------------------

             Summary: SORTED_NUMERIC dv type
                 Key: LUCENE-5748
                 URL: https://issues.apache.org/jira/browse/LUCENE-5748
             Project: Lucene - Core
          Issue Type: New Feature
            Reporter: Robert Muir


Currently for Strings you have SORTED and SORTED_SET, capable of single and 
multiple values per document respectively.

For multi-valued numerics, there are only a few choices:
* encode with NumericUtils into byte[]s and store with SORTED_SET.
* encode yourself per-document into BINARY.
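
To make the first option concrete, here is a minimal sketch of a sortable
long-to-byte[] encoding. It is not Lucene's NumericUtils; the class and method
names are hypothetical, and it only illustrates the general idea (flip the sign
bit, write big-endian) that lets byte-wise term order match numeric order.

```java
public class SortableLongBytes {
    // Hypothetical helper, NOT Lucene's NumericUtils API.
    // Flipping the sign bit makes negative values sort before positive ones
    // when the resulting bytes are compared as unsigned, big-endian.
    static byte[] encode(long value) {
        long flipped = value ^ Long.MIN_VALUE;
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) flipped;
            flipped >>>= 8;
        }
        return out;
    }

    // Inverse of encode(): rebuild the long, then flip the sign bit back.
    static long decode(byte[] bytes) {
        long flipped = 0;
        for (int i = 0; i < 8; i++) {
            flipped = (flipped << 8) | (bytes[i] & 0xFF);
        }
        return flipped ^ Long.MIN_VALUE;
    }

    // Unsigned lexicographic comparison of two 8-byte encodings,
    // which is how a terms dictionary would order them.
    static int compareBytes(byte[] a, byte[] b) {
        for (int i = 0; i < 8; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        return 0;
    }
}
```

Every value read back from such a field has to go through decode() again,
which is part of the overhead discussed below.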

Both of these techniques have problems: 

SORTED_SET isn't bad if you just want to do basic sorting (e.g. min/max) or 
faceting counts: most of the bloat in the "terms dict" is compressed away, and 
it optimizes the case where the data is actually single-valued. But it falls 
apart performance-wise if you want to do more complex stuff like Solr's 
analytics component or Elasticsearch's aggregations: the ordinals just get in 
your way and cause additional work, dereferencing each to a byte[] and then 
decoding that back to a number. Worst of all, any mathematical calculations are 
off because it discards frequency (it deduplicates values within a document).
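
The deduplication problem is easy to see with a small sketch. This is plain
Java with hypothetical names, not Lucene code: a TreeSet stands in for the
set semantics of SORTED_SET, and a raw long[] stands in for a type that keeps
duplicates.

```java
import java.util.TreeSet;

public class DedupDemo {
    // Average over the values as a set: duplicates within the document
    // are collapsed, which is what set semantics imply.
    static double averageDeduplicated(long[] values) {
        TreeSet<Long> unique = new TreeSet<>();
        for (long v : values) {
            unique.add(v);
        }
        long sum = 0;
        for (long v : unique) {
            sum += v;
        }
        return (double) sum / unique.size();
    }

    // Average over the values as a multiset: every occurrence counts.
    static double average(long[] values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return (double) sum / values.length;
    }
}
```

For a document holding the values 5, 5 and 20, average() gives 10.0 while
averageDeduplicated() gives 12.5, so any statistic that depends on frequency
comes out wrong under set semantics.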

Using your own custom encoding in BINARY removes the unnecessary ordinal 
dereferencing, but in exchange you get poor compression and slow access: you 
have no real choice but to do something like vInt within each byte[] for the 
doc, which means even basic sorting (e.g. max) is slow because it's not 
constant time. There is no chance for the codec to optimize things like dates 
with GCD compression, or to optimize the single-valued case, because it's just 
an opaque byte[].
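
As a rough illustration of why access is not constant time, here is a sketch
of a per-document variable-length encoding. The layout (7 bits per byte, high
bit as continuation flag, non-negative values only) and the names are
assumptions for illustration, not a format Lucene prescribes for BINARY.

```java
import java.io.ByteArrayOutputStream;

public class VLongDoc {
    // Packs all values of one document into a single byte[] using a
    // variable-length encoding. Assumes non-negative values.
    static byte[] encode(long[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long v : values) {
            if (v < 0) {
                throw new IllegalArgumentException("negative value: " + v);
            }
            while ((v & ~0x7FL) != 0) {
                out.write((int) ((v & 0x7F) | 0x80)); // more bytes follow
                v >>>= 7;
            }
            out.write((int) v); // last byte, continuation bit clear
        }
        return out.toByteArray();
    }

    // Finding the max means decoding every value: entries have different
    // lengths, so there is no way to jump straight to a given one.
    // Returns Long.MIN_VALUE for an empty document.
    static long max(byte[] doc) {
        long max = Long.MIN_VALUE;
        int pos = 0;
        while (pos < doc.length) {
            long value = 0;
            int shift = 0;
            byte b;
            do {
                b = doc[pos++];
                value |= (long) (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            max = Math.max(max, value);
        }
        return max;
    }
}
```

Even if the values were written in sorted order, reaching the last one still
requires walking past all the others, and the codec sees nothing but bytes.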

So I think it would be good to explore a simple long[] type that solves these 
problems.
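
A minimal sketch of what such a type could offer, assuming each document
exposes its values as a sorted long[] that keeps duplicates. The class and
method names are hypothetical and do not reflect any actual Lucene API; the
GCD helper only hints at the kind of optimization a codec could apply once it
sees real numbers instead of opaque bytes.

```java
public class SortedLongsDoc {
    // With a sorted array, min and max are constant-time lookups.
    // Both assume the document has at least one value.
    static long min(long[] sorted) {
        return sorted[0];
    }

    static long max(long[] sorted) {
        return sorted[sorted.length - 1];
    }

    // Duplicates are preserved, so sums and averages are correct.
    static long sum(long[] sorted) {
        long total = 0;
        for (long v : sorted) {
            total += v;
        }
        return total;
    }

    // Greatest common divisor of all values (assumes non-negative input).
    // A codec could store value / divisor, e.g. for day-granularity dates
    // expressed in milliseconds.
    static long commonDivisor(long[] values) {
        long g = 0;
        for (long v : values) {
            g = gcd(g, v);
        }
        return g;
    }

    private static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return Math.abs(a);
    }
}
```

For three timestamps falling on day 3, day 3 and day 10 (in milliseconds),
commonDivisor() returns 86400000, so only the small quotients 3, 3 and 10
would need to be stored.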





