[ 
https://issues.apache.org/jira/browse/LUCENE-5748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026380#comment-14026380
 ] 

Adrien Grand commented on LUCENE-5748:
--------------------------------------

+1 I like it!

> SORTED_NUMERIC dv type
> ----------------------
>
>                 Key: LUCENE-5748
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5748
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Robert Muir
>         Attachments: LUCENE-5748.patch
>
>
> Currently for Strings you have SORTED and SORTED_SET, capable of single and 
> multiple values per document respectively.
> For multi-numerics, there are only a few choices:
> * encode with NumericUtils into byte[]'s and store with SORTED_SET.
> * encode yourself per-document into BINARY.
> Both of these techniques have problems: 
> SORTED_SET isn't bad if you just want to do basic sorting (e.g. min/max) or 
> faceting counts: most of the bloat in the "terms dict" is compressed away, 
> and it optimizes the case where the data is actually single-valued, but it 
> falls apart performance-wise if you want to do more complex stuff like Solr's 
> analytics component or Elasticsearch's aggregations: the ordinals just get in 
> your way and cause additional work, deref'ing each to a byte[] and then 
> decoding that back to a number. Worst of all, any mathematical calculations 
> are off because it discards frequency (deduplicates).
> Using your own custom encoding in BINARY removes the unnecessary ordinal 
> dereferencing, but the trade-off is poor compression and slow access: you have 
> no real choice but to do something like vInt within each byte[] for the doc, 
> which means even basic sorting (e.g. max) is slow as it's not constant time. 
> There is no chance for the codec to optimize things like dates with GCD 
> compression or optimize the single-valued case because it's just an opaque byte[].
> So I think it would be good to explore a simple long[] type that solves these 
> problems.
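
The two problems described in the issue can be illustrated with a minimal
plain-Java sketch. It does not use any Lucene API and is not taken from the
attached patch: the class and method names below (SortedNumericSketch,
averageDeduplicated, sortedCopy, and so on) are hypothetical, and every
document is assumed to have at least one value. The sketch only shows why
deduplication skews calculations such as an average, and why a sorted
per-document long[] gives constant-time min/max.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Illustrative only: hypothetical names, no Lucene classes involved.
final class SortedNumericSketch {

    // SORTED_SET-like view: values are deduplicated, so frequency is lost.
    static double averageDeduplicated(long[] values) {
        TreeSet<Long> unique = new TreeSet<>();
        for (long v : values) {
            unique.add(v);
        }
        long sum = 0;
        for (long v : unique) {
            sum += v;
        }
        return (double) sum / unique.size();
    }

    // long[]-like view: values are sorted per document, duplicates are kept.
    static long[] sortedCopy(long[] values) {
        long[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Average over all values, including duplicates.
    static double average(long[] sorted) {
        long sum = 0;
        for (long v : sorted) {
            sum += v;
        }
        return (double) sum / sorted.length;
    }

    // With sorted values, min and max are single array reads.
    static long min(long[] sorted) {
        return sorted[0];
    }

    static long max(long[] sorted) {
        return sorted[sorted.length - 1];
    }

    public static void main(String[] args) {
        long[] doc = {5, 1, 5, 5};            // one document, four values
        long[] sorted = sortedCopy(doc);      // {1, 5, 5, 5}
        System.out.println(averageDeduplicated(doc)); // 3.0 (frequency lost)
        System.out.println(average(sorted));          // 4.0
        System.out.println(min(sorted) + " " + max(sorted)); // 1 5
    }
}
```

For the document {5, 1, 5, 5}, the deduplicated view averages {1, 5} and
reports 3.0, while the view that keeps duplicates reports the true average
of 4.0.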



