[ 
https://issues.apache.org/jira/browse/LUCENE-8868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ignacio Vera updated LUCENE-8868:
---------------------------------
    Description: 
Currently if a leaf on the BKD tree contains only few values, then the leaf is 
treated the same way as it all values are different. It many cases it can be 
much more efficient to store the distinct values with the cardinality.

The strategy is the following:

1.  When writing a leaf block the cardinality is computed.
2. Perform some naive calculation to compute if it is better to store the leaf 
as a low cardinality leaf. The storage cost are calculated as follows:
*   low cardinality: leafCardinality * (packedBytesLength - prefixLenSum + 2) 
where two is the estimated size of storing the cardinality. This is an 
overestimation as in some cases you will only need one byte to store the 
cardinality.
* High cardinality: count * (packedBytesLength - prefixLenSum). We are not 
taking into account the runlen compression.

3. If the tree has low cardinality then we set the compressed dim to -2. Note 
that -1 is when all values are equal.




  was:
Currently if a leaf on the BKD tree contains only few values, then the leaf is 
treated the same way as it all values are different. It many cases it can be 
much more efficient to store the distinct values with the cardinality.

The strategy is the following:

1.  When writing a leaf block the cardinality is computed.
2. Perform some naive calculation to compute if it is better to store the leaf 
as a low cardinality leaf. The storage cost are calculated as follows:
*   low cardinality: leafCardinality * (packedBytesLength - prefixLenSum + 2) 
where two is the estimated size of storing the cardinality. This is an 
overestimation as in some cases you will only need one byte to store the 
cardinality.
* High cardinality: count * (packedBytesLength - prefixLenSum). We are not 
taking into account the runlen compression.
3.  If the tree has low cardinality then we set the compressed dim to -2. Note 
that -1 is when all values are equal.





> New storing strategy for BKD tree leaves with low cardinality
> -------------------------------------------------------------
>
>                 Key: LUCENE-8868
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8868
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ignacio Vera
>            Priority: Major
>
> Currently if a leaf on the BKD tree contains only few values, then the leaf 
> is treated the same way as it all values are different. It many cases it can 
> be much more efficient to store the distinct values with the cardinality.
> The strategy is the following:
> 1.  When writing a leaf block the cardinality is computed.
> 2. Perform some naive calculation to compute if it is better to store the 
> leaf as a low cardinality leaf. The storage cost are calculated as follows:
> *   low cardinality: leafCardinality * (packedBytesLength - prefixLenSum + 2) 
> where two is the estimated size of storing the cardinality. This is an 
> overestimation as in some cases you will only need one byte to store the 
> cardinality.
> * High cardinality: count * (packedBytesLength - prefixLenSum). We are not 
> taking into account the runlen compression.
> 3. If the tree has low cardinality then we set the compressed dim to -2. Note 
> that -1 is when all values are equal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to