[ 
https://issues.apache.org/jira/browse/LUCENE-6863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980687#comment-14980687
 ] 

Adrien Grand commented on LUCENE-6863:
--------------------------------------

I ran some benchmarks with the Geonames dataset, which has a few sparse fields:
 - cc2: 3.2% of documents have this field, which has 573 unique values
 - admin4: 4.3% of documents have this field, which has 102950 unique values
 - admin3: 10.2% of documents have this field, which has 73120 unique values
 - admin2: 45.3% of documents have this field, which has 30603 unique values

First I enabled sparse compression on all fields, regardless of density, to see 
how this compares to the delta compression that we use by default. I then ran 
two kinds of queries:
 - queries on a random partition of the index, which I guess would be the case 
when you have true sparse fields
 - a query only on documents that have a value, which I guess would be more 
realistic if you store several types of data in the same index that don't have 
the same fields

||Field||disk usage for ordinals||memory usage with sparse compression enabled||sort performance on a MatchAllDocsQuery||sort performance on a term query that matches 10% of docs||sort performance on a term query that matches 1% of docs||sort performance on a term query that matches docs that have the field||
|cc2|-88%|1680 bytes|-27%|+25%|+58%|+208%|
|admin4|-86%|568 bytes|-20%|+7%|-20%|+214%|
|admin3|-67%|1312 bytes|+11%|+57%|+42%|+236%|
|admin2|+17%|2904 bytes|+132%|+275%|+331%|+221%|

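The reduction in disk usage is significant for the three sparsest fields 
(admin2, at 45.3% density, grows by 17%), but so is the slowdown (positive 
percentages in the sort columns mean slower sorts), especially when running a 
query that only matches docs that have a value. However, memory usage looks 
acceptable to me for 10M docs.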

I couldn't test with a 3% threshold, as even the rarest field is present in 
3.2% of documents, but I updated the heuristic to require at least 1024 docs in 
the segment (as Robert suggested) and fewer than 5% of docs to have a value:

||Field||memory usage due to sparse compression||sort performance on a MatchAllDocsQuery||sort performance on a term query that matches 10% of docs||sort performance on a term query that matches 1% of docs||sort performance on a term query that matches docs that have the field||
|cc2|1680 bytes|-10%|+34%|+62%|+214%|
|admin4|568 bytes|-7%|+20%|-14%|+241%|
|admin3|576 bytes|+9%|+7%|+11%|+10%|
|admin2|1008 bytes|+1%|+8%|+9%|+11%|
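
The heuristic used for this second run can be sketched roughly as below. This 
is only an illustration of the two conditions described above (at least 1024 
docs in the segment, fewer than 5% of docs with a value); the class, method and 
constant names are made up and are not taken from the patch.

```java
// Hypothetical sketch; names are illustrative and not taken from the patch.
final class SparseHeuristic {
    static final int MIN_DOCS_IN_SEGMENT = 1024; // segment must have at least this many docs
    static final int MAX_DENSITY_PERCENT = 5;    // field must be in fewer than 5% of docs

    /** Decides, for one field in one segment, whether to use sparse compression. */
    static boolean useSparseCompression(int maxDoc, int docsWithValue) {
        if (maxDoc < MIN_DOCS_IN_SEGMENT) {
            return false; // segment is too small for sparse compression to be worth it
        }
        // Integer arithmetic avoids floating-point rounding at the 5% boundary.
        return 100L * docsWithValue < (long) MAX_DENSITY_PERCENT * maxDoc;
    }

    public static void main(String[] args) {
        System.out.println(useSparseCompression(100_000, 3_200));  // 3.2% density: true
        System.out.println(useSparseCompression(100_000, 45_300)); // 45.3% density: false
        System.out.println(useSparseCompression(500, 5));          // small segment: false
    }
}
```

Note that the decision is made per segment, so the density that matters is the 
one observed in each segment, not the density over the whole index.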

To my surprise, admin2 and admin3 were still using sparse compression on some 
segments. The reason is that documents with sparse values are not uniformly 
distributed in the dataset but rather clustered. I suspect this partially 
explains the slowdown for admin2/admin3; HotSpot may also not like having more 
implementations to deal with.
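
To illustrate how clustering can push individual segments under the threshold 
even when the field is dense overall, here is a small sketch. The segment sizes 
and per-segment counts are invented for illustration and do not come from the 
Geonames index; the class and method names are hypothetical too.

```java
// Hypothetical sketch with made-up numbers; not measured on the real index.
final class ClusteredSegments {
    /** Same two conditions as the heuristic: >= 1024 docs and < 5% of docs with a value. */
    static boolean isSparse(int maxDoc, int docsWithValue) {
        return maxDoc >= 1024 && 100L * docsWithValue < 5L * maxDoc;
    }

    /** Counts the segments in which the field would be written with sparse compression. */
    static int countSparseSegments(int[] maxDocs, int[] docsWithValue) {
        int count = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            if (isSparse(maxDocs[i], docsWithValue[i])) {
                count++;
            }
        }
        return count;
    }

    /** Fraction of docs that have a value across all segments. */
    static double overallDensity(int[] maxDocs, int[] docsWithValue) {
        long total = 0;
        long withValue = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            total += maxDocs[i];
            withValue += docsWithValue[i];
        }
        return (double) withValue / total;
    }

    public static void main(String[] args) {
        // Ten segments of 10,000 docs; values are clustered in the first five.
        int[] maxDocs = {10_000, 10_000, 10_000, 10_000, 10_000,
                         10_000, 10_000, 10_000, 10_000, 10_000};
        int[] withValue = {10_000, 10_000, 10_000, 10_000, 5_000,
                           200, 200, 200, 200, 200};
        System.out.println("overall density: " + overallDensity(maxDocs, withValue));
        System.out.println("sparse segments: " + countSparseSegments(maxDocs, withValue));
    }
}
```

With these invented numbers the field is present in 46% of docs overall, yet 
five of the ten segments fall under the 5% threshold.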

> Store sparse doc values more efficiently
> ----------------------------------------
>
>                 Key: LUCENE-6863
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6863
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>         Attachments: LUCENE-6863.patch
>
>
> For both NUMERIC fields and ordinals of SORTED fields, we store data in a 
> dense way. As a consequence, if you have only 1000 documents out of 1B that 
> have a value, and 8 bits are required to store those 1000 numbers, we will 
> not require 1KB of storage, but 1GB.
> I suspect this mostly happens in abuse cases, but still it's a pity that we 
> explode storage requirements. We could try to detect sparsity and compress 
> accordingly.
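
For reference, the arithmetic in the issue description above can be checked 
with a few lines. This sketch only counts the bits needed for the values 
themselves; a real sparse encoding would also need to record which documents 
have a value, which is not modeled here. The class and method names are 
hypothetical.

```java
// Hypothetical sketch of the storage arithmetic; not Lucene code.
final class DenseVsSparseStorage {
    /** Bytes needed to store `count` values at `bitsPerValue` bits each, rounded up. */
    static long packedBytes(long count, int bitsPerValue) {
        return (count * bitsPerValue + 7) / 8;
    }

    public static void main(String[] args) {
        long maxDoc = 1_000_000_000L; // 1B documents in the index
        long docsWithValue = 1_000L;  // only 1000 of them have a value
        int bitsPerValue = 8;

        // Dense: one slot per document, whether or not it has a value.
        System.out.println("dense bytes:  " + packedBytes(maxDoc, bitsPerValue));
        // Values only: one slot per document that actually has a value.
        System.out.println("values bytes: " + packedBytes(docsWithValue, bitsPerValue));
    }
}
```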



