[jira] [Commented] (LUCENE-5596) Support for index/search large numeric field

Uwe Schindler (JIRA) Tue, 15 Apr 2014 02:24:14 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-5596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969364#comment-13969364
 ]


Uwe Schindler commented on LUCENE-5596:
---------------------------------------

bq. I'm also interested in how we plan to encode terms. NumericUtils currently 
encodes shifts on a single byte but if we aim at supporting variable-length 
data, I guess we either need to enforce a maximum length of 255 on terms or use 
a different encoding.

I agree! There are 2 problems:
- The idea behind the "prefixed" shift value byte is that terms with the same 
shift value are grouped together in the index. So 
BytesRef.compareTo(otherBytesRef) should behave like this: terms with a smaller 
shift value are binary smaller, and the full-length terms (shift value = 0) 
come first. Based on this, term enumerations can easily seek, and stuff like 
FieldCache can stop iterating terms after visiting all zero-shift terms. Also 
we can make the MultiTermQuery only seek forward in the TermsEnum (very 
important). This is the reason for the whole setup we currently have!
- If we need terms longer than 255, we would need 2 bytes to encode the maximum 
shift. This is wasteful not only because of the additional byte per term, but 
also because of the number of terms: it is unlikely that many terms differ 
only in the last few bytes. Prefix encoding only makes sense for common 
prefixes which appear millions of times in your index. Maybe, instead of 
storing the shift value in the first byte, we should store the (inverse) 
number of preserved bits (255-preservedBits)! The important point is: byte 0 
=> full precision. Longer terms are only prefixed up to a maximum length: we 
apply prefix terms only to the first n bytes of the term, and everything 
longer gets stored in full precision only. I think it makes no sense to have 
prefixes longer than maybe 8 bytes in the index.
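
To make the proposal above concrete, here is a minimal sketch of one possible 
reading of it. It is not taken from the attached patches and does not use any 
Lucene API. The class name, the method names, the byte-granular prefix step 
and the cap of 8 prefix bytes are all assumptions made for illustration. The 
header byte is 0 for the full-precision term and 255-preservedBits for prefix 
terms, so an unsigned byte-wise comparison (the order BytesRef.compareTo 
gives) puts full-precision terms first and longer prefixes before shorter 
ones.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PrefixTermSketch {
    // Assumption: cap taken from the "maybe 8 bytes" remark in the comment.
    static final int MAX_PREFIX_BYTES = 8;

    /** Full-precision term: header byte 0 followed by all value bytes. */
    static byte[] fullTerm(byte[] value) {
        byte[] term = new byte[value.length + 1];
        term[0] = 0;
        System.arraycopy(value, 0, term, 1, value.length);
        return term;
    }

    /** Prefix term: header (255 - preservedBits) followed by the leading bytes. */
    static byte[] prefixTerm(byte[] value, int preservedBytes) {
        byte[] term = new byte[preservedBytes + 1];
        term[0] = (byte) (255 - preservedBytes * 8);
        System.arraycopy(value, 0, term, 1, preservedBytes);
        return term;
    }

    /** Full-precision term plus prefixes of min(length - 1, 8) down to 1 bytes. */
    static List<byte[]> termsFor(byte[] value) {
        List<byte[]> terms = new ArrayList<>();
        terms.add(fullTerm(value));
        int maxPrefix = Math.min(MAX_PREFIX_BYTES, value.length - 1);
        for (int n = maxPrefix; n >= 1; n--) {
            terms.add(prefixTerm(value, n));
        }
        return terms;
    }

    public static void main(String[] args) {
        byte[] ipv6 = new byte[16];           // a 128-bit value, e.g. an IPv6 address
        ipv6[0] = 0x20;
        ipv6[1] = 0x01;
        List<byte[]> terms = termsFor(ipv6);
        terms.sort(Arrays::compareUnsigned);  // unsigned byte-wise term order
        for (byte[] t : terms) {
            System.out.println("header=" + (t[0] & 0xFF) + " length=" + t.length);
        }
    }
}
```

With this layout a 16-byte value produces one full-precision term and eight 
prefix terms, however long the value is. Whether the header should really be 
computed this way, and which precision step to use, is left open by the 
comment above.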

> Support for index/search large numeric field
> --------------------------------------------
>
>                 Key: LUCENE-5596
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5596
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Kevin Wang
>            Assignee: Uwe Schindler
>         Attachments: LUCENE-5596.patch, LUCENE-5596.patch
>
>
> Currently, if a number is larger than Long.MAX_VALUE, we can't index/search 
> it in Lucene as a number. For example, an IPv6 address is a 128-bit number, 
> so we can't index it as a numeric field and do numeric range queries etc.
> It would be good to support BigInteger / BigDecimal.
> I've tried using BigInteger for IPv6 in Elasticsearch and that works fine, but 
> there are still lots of things to do:
> https://github.com/elasticsearch/elasticsearch/pull/5758



