[ https://issues.apache.org/jira/browse/LUCENE-6863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982832#comment-14982832 ]

Adrien Grand commented on LUCENE-6863:
--------------------------------------

bq. Did you consider a hash lookup instead of binary-search, as was done in 
LUCENE-5688? I just read the comments there and it seems promising for very 
sparse data. 

I took this approach because I have seen a couple of setups with lots of fields, 
many of them sparse, and it is a bit frustrating that sparse fields require as 
many resources as dense ones. The issue description focuses on disk usage, but I 
think memory usage is even more important. Obviously I had to increase memory 
usage (since numerics don't require any memory at all today), but I wanted to 
keep it very low. For instance, if you look at the cc2 field, about 320k 
documents have a value for this field and yet it only takes 1680 bytes of memory 
for the entire index, so about 0.005 bytes per document. With a hashtable, you 
would either put the data in memory, where you could hardly avoid using one or 
more bytes per document, or on disk, where the random-access pattern could be a 
problem if not everything fits into your filesystem cache. In contrast, the 
patch keeps memory usage very low and keeps the access pattern sequential by 
keeping track of the previous/next documents that have a value and using 
[exponential search|https://en.wikipedia.org/wiki/Exponential_search] 
(a variant of binary search) to seek forward.
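For readers unfamiliar with the technique, here is an illustrative sketch of a forward-seeking exponential search (not the actual patch code; the array name and method are made up): starting from a known position in the sorted array of docIds that have a value, double the step size until the window brackets the target, then binary-search inside that window. The cost is logarithmic in the distance advanced, which is why it suits a sequential access pattern.

```java
public class ExponentialSearch {

  /**
   * Returns the index of the first element >= target in sortedDocIds,
   * searching forward from index 'from'.
   */
  static int searchFrom(int[] sortedDocIds, int from, int target) {
    int bound = 1;
    // Grow the window exponentially until it brackets the target.
    while (from + bound < sortedDocIds.length && sortedDocIds[from + bound] < target) {
      bound <<= 1;
    }
    int lo = from + (bound >> 1);
    int hi = Math.min(from + bound, sortedDocIds.length - 1);
    // Plain binary search inside the bracketed window.
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (sortedDocIds[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return lo;
  }

  public static void main(String[] args) {
    int[] docsWithField = {3, 7, 15, 42, 43, 100, 250};
    System.out.println(searchFrom(docsWithField, 0, 42));  // index of first docId >= 42
    System.out.println(searchFrom(docsWithField, 3, 101)); // resumes from a previous position
  }
}
```

Because each seek resumes from the previous position, iterating over matching documents in docId order touches the array mostly sequentially, unlike a hashtable lookup.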

bq. Does +214% mean the whole query, search & top-10 doc retrieval took over 
twice as long?

I computed {noformat}(${new response time} - ${old response time}) / ${old 
response time}{noformat} so +214% actually means the new response time is more 
than 3x the old one. However this does not include loading stored fields; it 
only measures computing the top document by calling 
IndexSearcher.search(query, 1, sort).
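To make the formula concrete (the millisecond figures below are made up for illustration, not from the benchmark):

```java
public class RelativeChange {

  /** Percentage change of newMs relative to oldMs, per the formula above. */
  static double pctChange(double oldMs, double newMs) {
    return (newMs - oldMs) / oldMs * 100.0;
  }

  public static void main(String[] args) {
    // A query going from 20 ms to 62.8 ms is +214%, i.e. 3.14x the old time.
    System.out.println(pctChange(20.0, 62.8)); // approximately 214
  }
}
```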

bq. How fast was this query any way?

Here are the times these queries take to run on trunk (in milliseconds).

||Field||sort time on a MatchAllDocsQuery||sort time on a term query that 
matches 10% of docs||sort time on a term query that matches 1% of docs||sort 
time on a term query that matches docs that have the field||
|cc2 |{color:green}128{color}|20|{color:red}2{color}|{color:red}6{color}|
|admin4|{color:green}122{color}|19|3|{color:red}7{color}|
|admin3|116|18|3|17|
|admin2 |121|19|3|78|

I've put the response times that we are making significantly slower in red and 
those that we are making faster in green. So as you can see, the queries that 
we are speeding up are among the slower ones, and those that we are slowing 
down are among the faster ones.

> Store sparse doc values more efficiently
> ----------------------------------------
>
>                 Key: LUCENE-6863
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6863
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>         Attachments: LUCENE-6863.patch, LUCENE-6863.patch
>
>
> For both NUMERIC fields and ordinals of SORTED fields, we store data in a 
> dense way. As a consequence, if you have only 1000 documents out of 1B that 
> have a value, and 8 bits are required to store those 1000 numbers, we will 
> require not 1KB of storage, but 1GB.
> I suspect this mostly happens in abuse cases, but still it's a pity that we 
> explode storage requirements. We could try to detect sparsity and compress 
> accordingly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
