[
https://issues.apache.org/jira/browse/LUCENE-5596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969364#comment-13969364
]
Uwe Schindler commented on LUCENE-5596:
---------------------------------------
bq. I'm also interested in how we plan to encode terms. NumericUtils currently
encodes shifts on a single byte but if we aim at supporting variable-length
data, I guess we either need to enforce a maximum length of 255 on terms or use
a different encoding.
I agree! There are 2 problems:
- The idea behind the "prefixed" shift value byte is that terms with the same
shift value should be grouped together in the index. So
BytesRef.compareTo(otherBytesRef) should behave like this: terms with a smaller
shift value should be binary smaller. The full-length terms must come first
(shift value = 0). Based on this, term enumerations can easily seek, and stuff
like FieldCache can stop iterating terms after visiting all zero-shift terms.
Also we can make the MultiTermQuery only seek forward in the TermsEnum (very
important) - this is the reason for the whole setup we currently have!
- If we need terms longer than 255, we would need 2 bytes to encode the maximum
shift. On the other hand, this is wasteful, not only because of the additional
byte per term but also because of the number of terms: it is unlikely that
many terms differ only in the last few bytes. Prefix encoding only makes sense
for common prefixes which appear millions of times in your index. Maybe,
instead of storing the shift value in the first byte, we should store the
(inverse) number of preserved bits (255-preservedBits)! The important thing
is: byte 0 => full precision. Longer terms are only prefixed up to a maximum
length; the rest is stored in full precision only. In other words, we only
apply prefix terms to the first n bytes of the term, and everything longer
gets stored in full precision only. I think it makes no sense to have prefixes
longer than maybe 8 bytes in the index.
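
To make the ordering argument concrete, here is a minimal sketch of such a
term layout. It is not the NumericUtils encoding and not a proposed patch: the
class PrefixTermSketch, its method names and the byte-granular header are all
hypothetical, and the 8-byte cap is just the value floated above. It assumes
terms sort in unsigned byte order, so the leading header byte decides the
grouping: header 0 (full precision) sorts first, and prefix terms that keep
fewer bytes get larger headers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: one header byte, then the kept bytes.
class PrefixTermSketch {
    /** Assumed cap: prefix terms never keep more than this many leading bytes. */
    static final int MAX_PREFIX_BYTES = 8;

    /** Full-precision term: header byte 0 followed by the whole value. */
    static byte[] fullPrecision(byte[] value) {
        byte[] term = new byte[value.length + 1];
        term[0] = 0;
        System.arraycopy(value, 0, term, 1, value.length);
        return term;
    }

    /** Prefix term keeping the first keptBytes bytes; fewer kept bytes means a larger header. */
    static byte[] prefix(byte[] value, int keptBytes) {
        if (keptBytes < 1 || keptBytes > MAX_PREFIX_BYTES || keptBytes >= value.length) {
            throw new IllegalArgumentException("keptBytes out of range: " + keptBytes);
        }
        byte[] term = new byte[keptBytes + 1];
        term[0] = (byte) (1 + MAX_PREFIX_BYTES - keptBytes); // 1..8, never 0
        System.arraycopy(value, 0, term, 1, keptBytes);
        return term;
    }

    /** All terms for one value: full precision first, then every capped prefix. */
    static List<byte[]> termsFor(byte[] value) {
        List<byte[]> terms = new ArrayList<>();
        terms.add(fullPrecision(value));
        for (int kept = Math.min(MAX_PREFIX_BYTES, value.length - 1); kept >= 1; kept--) {
            terms.add(prefix(value, kept));
        }
        return terms;
    }

    /** Unsigned byte-wise comparison, the sort order assumed for index terms. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

With this layout a 16-byte value (an IPv6 address, say) produces 9 terms
instead of one per shift over all 128 bits, and every full-precision term
sorts before every prefix term regardless of the value bytes.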
> Support for index/search large numeric field
> --------------------------------------------
>
> Key: LUCENE-5596
> URL: https://issues.apache.org/jira/browse/LUCENE-5596
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Kevin Wang
> Assignee: Uwe Schindler
> Attachments: LUCENE-5596.patch, LUCENE-5596.patch
>
>
> Currently, if a number is larger than Long.MAX_VALUE, we can't index/search
> it in Lucene as a number. For example, an IPv6 address is a 128-bit number,
> so we can't index it as a numeric field and do numeric range queries etc.
> It would be good to support BigInteger / BigDecimal.
> I've tried using BigInteger for IPv6 in Elasticsearch and that works fine, but
> there are still lots of things to do
> https://github.com/elasticsearch/elasticsearch/pull/5758