[ 
https://issues.apache.org/jira/browse/LUCENE-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15701587#comment-15701587
 ] 

Adrien Grand commented on LUCENE-7563:
--------------------------------------

It seems we always delta-code against the split value of the parent level, 
but in the multi-dimensional case I think it would be better to delta-code 
against the last split value that was on the same dimension. Otherwise 
compression would be very poor whenever the dimensions store very different 
ranges of values.
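
To illustrate the concern, here is a minimal sketch (not code from the patch) 
that compares the shared prefix of a new split value against the parent's 
split value and against the last split value seen on the same dimension. The 
class name, the method names and the lastSplitPerDim array are hypothetical and 
only serve the example.

```java
final class SameDimPrefix {
    // Number of leading bytes that a and b have in common.
    static int commonPrefix(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        int i = 0;
        while (i < n && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    // Hypothetical helper: lastSplitPerDim[dim] holds the most recent split
    // value seen on dimension dim on the path from the root to this node.
    static int prefixVsSameDim(byte[][] lastSplitPerDim, int dim, byte[] split) {
        return commonPrefix(lastSplitPerDim[dim], split);
    }
}
```

With dimension 0 holding small values such as {0, 0, 0, 5} and dimension 1 
holding large ones such as {0x7f, 0x10, 0x22, 0x33}, a new split {0, 0, 0, 9} 
on dimension 0 shares 0 leading bytes with a parent that split on dimension 1, 
but 3 leading bytes with the last split on dimension 0.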

Something else I was wondering is whether we can make bigger gains. For 
instance, we use whole bytes to store the split dimension and the prefix 
length, while they only need 3 and 4 bits respectively. In the 
multi-dimensional case we could store both in a single byte. Maybe we can do 
even better; I haven't thought much about it.
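
As a rough sketch of the single-byte idea (an illustration only, not the 
patch's encoding): assuming the split dimension fits in 3 bits and the prefix 
length in 4 bits, as stated above, both can be packed into the low 7 bits of 
one byte. The class name and bit layout here are hypothetical.

```java
final class PackedSplitHeader {
    // Assumed layout: bits 0-2 hold the split dimension (0..7),
    // bits 3-6 hold the prefix length (0..15), bit 7 is unused.
    static int pack(int splitDim, int prefixLen) {
        if (splitDim < 0 || splitDim > 7) {
            throw new IllegalArgumentException("splitDim must fit in 3 bits: " + splitDim);
        }
        if (prefixLen < 0 || prefixLen > 15) {
            throw new IllegalArgumentException("prefixLen must fit in 4 bits: " + prefixLen);
        }
        return (prefixLen << 3) | splitDim;
    }

    // Recover the split dimension from the low 3 bits.
    static int splitDim(int packed) {
        return packed & 0x7;
    }

    // Recover the prefix length from the next 4 bits.
    static int prefixLen(int packed) {
        return (packed >>> 3) & 0xF;
    }
}
```

Packing splitDim=5 and prefixLen=12 this way gives 101, which fits in one 
byte and decodes back to the same two values.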

It doesn't need to be done in the same patch, but it would also be nice for 
SimpleText to not use the legacy format of the index. I'm not sure how to 
proceed however.

> BKD index should compress unused leading bytes
> ----------------------------------------------
>
>                 Key: LUCENE-7563
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7563
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>             Fix For: master (7.0), 6.4
>
>         Attachments: LUCENE-7563.patch, LUCENE-7563.patch
>
>
> Today the BKD (points) in-heap index always uses {{dimensionNumBytes}} per 
> dimension, but if e.g. you are indexing {{LongPoint}} yet only use the bottom 
> two bytes in a given segment, we shouldn't store all those leading 0s in the 
> index.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

