[jira] [Created] (LUCENE-6891) Lucene30DimensionalFormat should use block prefix coding when writing values

Michael McCandless (JIRA) Wed, 11 Nov 2015 02:58:57 -0800

Michael McCandless created LUCENE-6891:
------------------------------------------


             Summary: Lucene30DimensionalFormat should use block prefix coding 
when writing values
                 Key: LUCENE-6891
                 URL: https://issues.apache.org/jira/browse/LUCENE-6891
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Michael McCandless
            Assignee: Michael McCandless
             Fix For: Trunk


Today we write the whole value for every doc in one leaf block in the BKD tree, 
but that's crazy because the whole point of that leaf block is all the docs 
inside it have values that are very close together.

So I changed this to write the common prefix for the whole block up front in 
each block.  This requires more index-time and search-time work, but gives nice 
index size reductions:

On the 2D (London, UK) lat/lon benchmark:
  * Indexing time was a wee bit slower (743 -> 747 seconds)
  * Index size was ~11% smaller (704 MB -> 630 MB)
  * Query time was ~7% slower (2.84 sec -> 3.05 sec)
  * Heap usage is the same

On the 1D (just "lat" from the above test) benchmark:
  * Indexing time was a wee bit slower (363 -> 364 sec)
  * Index size was ~23% smaller (472 MB -> 363 MB)
  * Query time was a wee bit slower (5.39 -> 5.41 sec)
  * Heap usage is the same

Index time can be a bit slower since there are two passes now per leaf block 
(first to find the common prefix per dimension, and second pass must then strip 
those prefixes).

Query time is slower because there's more work per hit that needs value 
filtering, i.e. collating the suffixes onto the prefixes, per dimension.  This 
affects 2D much more than 1D because 1D has fewer leaf blocks that need 
filtering (typically 0, 1 or 2, unless there are many duplicate values in the 
index).

I suspect the index size savings is use-case dependent, e.g. if you index a 
bunch of ipv4 addresses along with a few ipv6 addresses, you'd probably see 
sizable savings.

I also suspect the more docs you index, the greater the savings, because the 
cells will generally be smaller.

Net/net I think the opto is worth it, even if it slows query time a bit.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Created] (LUCENE-6891) Lucene30DimensionalFormat should use block prefix coding when writing values

Reply via email to