[ https://issues.apache.org/jira/browse/LUCENE-6863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982832#comment-14982832 ]
Adrien Grand commented on LUCENE-6863:
--------------------------------------

bq. Did you consider a hash lookup instead of binary-search, as was done in LUCENE-5688?

I just read the comments there and it seems promising for very sparse data. I took this approach because I saw a couple of setups that had lots of fields, many of them sparse, and the fact that sparse fields require as many resources as dense ones is a bit frustrating. The issue description focuses on disk usage, but I think memory usage is even more important. Obviously I had to increase memory usage (since numerics don't require any memory at all today), but I wanted to keep it very low. For instance, if you look at the cc2 field, about 320k documents have a value for this field and yet it only takes 1680 bytes of memory for the entire index, so about 0.005 bytes per document. With a hashtable, you would either put the data in memory, where you could hardly avoid using one or more bytes per document, or on disk, where the random-access pattern could be a problem if not everything fits into your filesystem cache. In contrast, the patch keeps memory usage very low and keeps the access pattern sequential by keeping track of the previous/next documents that have a value and using [exponential search|https://en.wikipedia.org/wiki/Exponential_search] (a variant of binary search) to seek forward.

bq. Does +214% mean the whole query, search & top-10 doc retrieval took over twice as long?

I computed {noformat}(${new response time} - ${old response time}) / ${old response time}{noformat} so it actually means more than 3x slower. However, this does not include loading stored fields, just computing the top document by calling IndexSearcher.search(query, 1, sort).

bq. How fast was this query anyway?

Here are the times it takes to run these queries on trunk (in milliseconds).
||Field||sort time on a MatchAllDocsQuery||sort time on a term query that matches 10% of docs||sort time on a term query that matches 1% of docs||sort time on a term query that matches docs that have the field||
|cc2|{color:green}128{color}|20|{color:red}2{color}|{color:red}6{color}|
|admin4|{color:green}122{color}|19|3|{color:red}7{color}|
|admin3|116|18|3|17|
|admin2|121|19|3|78|

I've put the response times that we are making significantly slower in red and those that we are making faster in green. So as you can see, the queries that we are speeding up are among the slower ones, and those that we are slowing down are among the faster ones.

> Store sparse doc values more efficiently
> ----------------------------------------
>
>          Key: LUCENE-6863
>          URL: https://issues.apache.org/jira/browse/LUCENE-6863
>      Project: Lucene - Core
>   Issue Type: Improvement
>     Reporter: Adrien Grand
>     Assignee: Adrien Grand
>  Attachments: LUCENE-6863.patch, LUCENE-6863.patch
>
> For both NUMERIC fields and ordinals of SORTED fields, we store data in a
> dense way. As a consequence, if you have only 1000 documents out of 1B that
> have a value, and 8 bits are required to store those 1000 numbers, we will
> not require 1KB of storage, but 1GB.
> I suspect this mostly happens in abuse cases, but still it's a pity that we
> explode storage requirements. We could try to detect sparsity and compress
> accordingly.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
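For reference, the exponential search linked above can be sketched as follows. This is a standalone illustration under stated assumptions, not Lucene's actual implementation; the array and method names are made up for the example. Starting from a known position, the step size is doubled until the target is overshot, and a binary search then runs over the bracketed window, which keeps forward seeks cheap when consecutive lookups land close together.

```java
// Sketch of exponential (a.k.a. galloping) search over the sorted list of
// doc IDs that have a value. Hypothetical names, not Lucene's API.
public class ExponentialSearch {

  /** Returns the index of the first element in docs[from..] that is >= target,
   *  or docs.length if there is none. Assumes docs is sorted ascending. */
  static int searchCeil(long[] docs, int from, long target) {
    // Gallop: double the step (1, 2, 4, 8, ...) until we pass the target.
    int bound = 1;
    while (from + bound < docs.length && docs[from + bound] < target) {
      bound <<= 1;
    }
    // The answer now lies in [from + bound/2, min(from + bound, length - 1)].
    int lo = from + (bound >> 1);
    int hi = Math.min(from + bound, docs.length - 1);
    // Standard lower-bound binary search within the bracketed window.
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (docs[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return lo;
  }

  public static void main(String[] args) {
    long[] docsWithValue = {4, 17, 20, 45, 1000, 1003};
    // First doc >= 21 is docsWithValue[3] = 45.
    System.out.println(searchCeil(docsWithValue, 0, 21)); // prints 3
  }
}
```

Because the gallop starts from the previous position rather than from 0, seeking forward by a small distance costs O(log d) where d is the distance moved, which is what keeps the access pattern effectively sequential.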
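To make the +214% reading concrete, the quoted formula can be written as a small standalone helper; the input numbers below are hypothetical, chosen only to produce a 214% change.

```java
// Worked example of the formula quoted in the comment:
// (newTime - oldTime) / oldTime. A result of +214% means the new response
// time is 3.14x the old one, i.e. "more than 3x slower".
public class RelativeChange {

  static double relativeChange(double oldMs, double newMs) {
    return (newMs - oldMs) / oldMs;
  }

  public static void main(String[] args) {
    // Hypothetical timings: 2 ms before, 6.28 ms after.
    double change = relativeChange(2.0, 6.28);
    System.out.printf("%.0f%%%n", change * 100); // prints 214%
  }
}
```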