[ https://issues.apache.org/jira/browse/LUCENE-6863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980687#comment-14980687 ]
Adrien Grand commented on LUCENE-6863:
--------------------------------------

I ran some benchmarks with the geoname dataset which has a few sparse fields:
 - cc2: 3.2% of documents have this field, which has 573 unique values
 - admin4: 4.3% of documents have this field, which has 102950 unique values
 - admin3: 10.2% of documents have this field, which has 73120 unique values
 - admin2: 45.3% of documents have this field, which has 30603 unique values

First I enabled sparse compression on all fields, regardless of density, to see how this compares to the delta compression that we use by default, and then ran two kinds of queries:
 - queries on a random partition of the index, which I guess would be the case when you have true sparse fields
 - a query only on documents that have a value, which I guess would be more realistic if you store several types of data in the same index that don't have the same fields

||Field||disk usage for ordinals||memory usage with sparse compression enabled||sort performance on a MatchAllDocsQuery||sort performance on a term query that matches 10% of docs||sort performance on a term query that matches 1% of docs||sort performance on a term query that matches docs that have the field||
|cc2|-88%|1680 bytes|-27%|+25%|+58%|+208%|
|admin4|-86%|568 bytes|-20%|+7%|-20%|+214%|
|admin3|-67%|1312 bytes|+11%|+57%|+42%|+236%|
|admin2|+17%|2904 bytes|+132%|+275%|+331%|+221%|

The reduction in disk usage is significant, but so is the slowdown, especially when running a query that only matches docs that have a value. However memory usage looks acceptable to me for 10M docs.
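
To make the trade-off between disk usage and sort speed more concrete, here is a minimal sketch of a dense lookup next to a sparse one. It is only an illustration under assumptions: the class name, the method names and the binary-search layout are hypothetical and are not taken from the attached patch.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the encoding used by LUCENE-6863.patch.
final class SparseLookupSketch {

    /** Dense layout: one slot per document, so a lookup is a single array read. */
    static long denseGet(long[] valuesByDoc, int docId) {
        return valuesByDoc[docId];
    }

    /**
     * Sparse layout (assumed): only documents that have a value are stored,
     * as a sorted array of doc IDs plus a parallel array of values.
     * A lookup has to search for the doc ID first.
     */
    static long sparseGet(int[] docsWithValue, long[] values, int docId, long missing) {
        int idx = Arrays.binarySearch(docsWithValue, docId);
        return idx >= 0 ? values[idx] : missing;
    }

    public static void main(String[] args) {
        int[] docs = {3, 17, 42};   // sorted IDs of docs that have a value
        long[] vals = {7L, 1L, 9L}; // value for each of those docs
        System.out.println(sparseGet(docs, vals, 17, 0L)); // doc 17 has a value
        System.out.println(sparseGet(docs, vals, 18, 0L)); // doc 18 has none
    }
}
```

In a layout like this, space scales with the number of documents that have a value rather than with the total number of documents, while every read pays for a search, which is one plausible reading of the numbers in the table above.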
I couldn't test with a 3% threshold, since even the rarest field is present in 3.2% of documents, but I updated the heuristic to require at least 1024 docs in the segment (as Robert suggested) and fewer than 5% of docs with a value:

||Field||memory usage due to sparse compression||sort performance on a MatchAllDocsQuery||sort performance on a term query that matches 10% of docs||sort performance on a term query that matches 1% of docs||sort performance on a term query that matches docs that have the field||
|cc2|1680 bytes|-10%|+34%|+62%|+214%|
|admin4|568 bytes|-7%|+20%|-14%|+241%|
|admin3|576 bytes|+9%|+7%|+11%|+10%|
|admin2|1008 bytes|+1%|+8%|+9%|+11%|

To my surprise, admin2 and admin3 were still using sparse compression on some segments. The reason is that documents with sparse values are not uniformly distributed in the dataset but rather clustered. I suspect this partially explains the slowdown for admin2/admin3; maybe HotSpot also doesn't like having more implementations to deal with.

> Store sparse doc values more efficiently
> ----------------------------------------
>
>                 Key: LUCENE-6863
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6863
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>         Attachments: LUCENE-6863.patch
>
>
> For both NUMERIC fields and ordinals of SORTED fields, we store data in a
> dense way. As a consequence, if you have only 1000 documents out of 1B that
> have a value, and 8 bits are required to store those 1000 numbers, we will
> not require 1KB of storage, but 1GB.
> I suspect this mostly happens in abuse cases, but still it's a pity that we
> explode storage requirements. We could try to detect sparsity and compress
> accordingly.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
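
The per-segment heuristic from the comment (at least 1024 docs in the segment, fewer than 5% of docs with a value) and the storage arithmetic from the quoted issue description can be sketched as follows. This is a rough illustration, not code from the patch: the class, constant and method names are hypothetical, and the sparse estimate counts only the values themselves, ignoring whatever extra structure is needed to record which documents have one.

```java
// Hypothetical sketch; names and structure are assumptions, not the patch's code.
final class SparseHeuristicSketch {
    static final long MIN_DOCS_IN_SEGMENT = 1024; // "at least 1024 docs in the segment"
    static final long MAX_DENSITY_PERCENT = 5;    // "less than 5% of docs have a value"

    /** Decides per segment whether the sparse encoding would be chosen. */
    static boolean useSparse(long docsInSegment, long docsWithValue) {
        // Integer arithmetic avoids floating-point rounding at the 5% boundary.
        return docsInSegment >= MIN_DOCS_IN_SEGMENT
            && docsWithValue * 100 < docsInSegment * MAX_DENSITY_PERCENT;
    }

    /** Bytes needed when every document gets a slot (dense storage). */
    static long denseBytes(long docsInSegment, int bitsPerValue) {
        return (docsInSegment * bitsPerValue + 7) / 8;
    }

    /** Bytes needed for the values alone when only docs with a value are stored. */
    static long sparseValueBytes(long docsWithValue, int bitsPerValue) {
        return (docsWithValue * bitsPerValue + 7) / 8;
    }

    public static void main(String[] args) {
        // Example from the issue description: 1000 docs out of 1B, 8 bits each.
        System.out.println(denseBytes(1_000_000_000L, 8)); // about 1GB
        System.out.println(sparseValueBytes(1_000L, 8));   // about 1KB
        System.out.println(useSparse(1_000_000_000L, 1_000L));
    }
}
```

Because the check runs per segment, a field that is dense overall (such as admin2 at 45.3%) can still fall under the 5% limit in segments where its documents are scarce, which matches the clustering effect described in the comment.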