[
https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701587#comment-15701587
]
Adrien Grand commented on LUCENE-7563:
--------------------------------------
It seems we always delta-code against the split value of the parent level. For
the multi-dimensional case, I think it would be better to delta-code against
the last split value that was on the same dimension. Otherwise compression
would be very poor when the dimensions store very different ranges of values.
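
To illustrate the difference, here is a minimal sketch that compares the two
delta schemes on a toy sequence of splits. It is not the patch's code: the class
and method names are hypothetical, and split values are modeled as plain longs
rather than the byte[] values the BKD index actually stores.

```java
// Hypothetical helper, for illustration only; not part of Lucene.
final class SplitDeltaSketch {

  /** Delta of each split value against the previous (parent) split, whatever its dimension. */
  static long[] deltasVsParent(long[] values) {
    long[] out = new long[values.length];
    long parent = 0; // assumption: the root is coded against 0
    for (int i = 0; i < values.length; i++) {
      out[i] = values[i] - parent;
      parent = values[i];
    }
    return out;
  }

  /** Delta of each split value against the most recent split on the same dimension. */
  static long[] deltasVsSameDim(int[] dims, long[] values, int numDims) {
    long[] lastPerDim = new long[numDims]; // assumption: each dimension starts at 0
    long[] out = new long[values.length];
    for (int i = 0; i < values.length; i++) {
      out[i] = values[i] - lastPerDim[dims[i]];
      lastPerDim[dims[i]] = values[i];
    }
    return out;
  }
}
```

With splits alternating between dimension 0 (values around 1,000,000) and
dimension 1 (values around 10), e.g. dims = {0, 1, 0, 1} and values =
{1000000, 10, 1000500, 12}, the parent-based deltas are {1000000, -999990,
1000490, -1000488}, while the same-dimension deltas are {1000000, 10, 500, 2}.
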
Something else I was wondering is whether we can make bigger gains. For
instance, we use whole bytes to store the split dimension and the prefix
length, while they only need 3 and 4 bits respectively. In the
multi-dimensional case we could store both in a single byte. Maybe we can do
even better; I haven't thought much about it.
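
As a rough sketch of the single-byte idea, assuming the split dimension fits in
3 bits (0..7) and the prefix length fits in 4 bits (0..15) as stated above. The
class name and bit layout are hypothetical, not what the patch does:

```java
// Hypothetical helper, for illustration only; not part of Lucene.
final class SplitHeader {

  /** Packs splitDim into the low 3 bits and prefixLen into the next 4 bits. */
  static byte pack(int splitDim, int prefixLen) {
    if (splitDim < 0 || splitDim > 7) {
      throw new IllegalArgumentException("splitDim must be in 0..7, got " + splitDim);
    }
    if (prefixLen < 0 || prefixLen > 15) {
      throw new IllegalArgumentException("prefixLen must be in 0..15, got " + prefixLen);
    }
    return (byte) ((prefixLen << 3) | splitDim);
  }

  /** Reads the split dimension back from the low 3 bits. */
  static int splitDim(byte packed) {
    return packed & 0x07;
  }

  /** Reads the prefix length back from bits 3..6. */
  static int prefixLen(byte packed) {
    return ((packed & 0xFF) >>> 3) & 0x0F;
  }
}
```

This uses 7 of the 8 bits, so one bit would remain free for something else.
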
It doesn't need to be done in the same patch, but it would also be nice for
SimpleText not to use the legacy format of the index. I'm not sure how to
proceed, however.
> BKD index should compress unused leading bytes
> ----------------------------------------------
>
> Key: LUCENE-7563
> URL: https://issues.apache.org/jira/browse/LUCENE-7563
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Fix For: master (7.0), 6.4
>
> Attachments: LUCENE-7563.patch, LUCENE-7563.patch
>
>
> Today the BKD (points) in-heap index always uses {{dimensionNumBytes}} per
> dimension, but if e.g. you are indexing {{LongPoint}} yet only use the bottom
> two bytes in a given segment, we shouldn't store all those leading 0s in the
> index.