Michael McCandless created LUCENE-6891:
------------------------------------------
Summary: Lucene30DimensionalFormat should use block prefix coding
when writing values
Key: LUCENE-6891
URL: https://issues.apache.org/jira/browse/LUCENE-6891
Project: Lucene - Core
Issue Type: Improvement
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: Trunk
Today we write the whole value for every doc in each leaf block in the BKD tree,
but that's crazy because the whole point of a leaf block is that all the docs
inside it have values that are very close together.
So I changed this to write the common prefix (per dimension) once, up front in
each leaf block, followed by only the remaining suffix for each doc. This
requires more index-time and search-time work, but gives nice index size
reductions:
On the 2D (London, UK) lat/lon benchmark:
* Indexing time was a wee bit slower (743 -> 747 sec)
* Index size was ~11% smaller (704 MB -> 630 MB)
* Query time was ~7% slower (2.84 sec -> 3.05 sec)
* Heap usage is the same
On the 1D (just "lat" from the above test) benchmark:
* Indexing time was a wee bit slower (363 -> 364 sec)
* Index size was ~23% smaller (472 MB -> 363 MB)
* Query time was a wee bit slower (5.39 -> 5.41 sec)
* Heap usage is the same
Index time can be a bit slower since there are now two passes per leaf block
(the first finds the common prefix per dimension, and the second strips those
prefixes from each value before writing it).
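
As a rough illustration of those two passes, here is a minimal sketch. It is
not the code from the patch: the class and method names are hypothetical, and
it assumes each value is a fixed-width byte[] holding numDims dimensions of
bytesPerDim bytes each, and that the block is non-empty.

```java
import java.util.Arrays;

/** Hypothetical sketch of per-leaf-block prefix coding; not the actual Lucene code. */
final class LeafBlockPrefixSketch {

  /** Pass 1: per dimension, count the leading bytes shared by every value in the block. */
  static int[] commonPrefixLengths(byte[][] values, int numDims, int bytesPerDim) {
    int[] prefixLengths = new int[numDims];
    Arrays.fill(prefixLengths, bytesPerDim);
    byte[] first = values[0]; // assumes at least one value in the block
    for (int i = 1; i < values.length; i++) {
      for (int dim = 0; dim < numDims; dim++) {
        int offset = dim * bytesPerDim;
        int len = 0;
        // The shared prefix can only shrink as more values are examined.
        while (len < prefixLengths[dim] && values[i][offset + len] == first[offset + len]) {
          len++;
        }
        prefixLengths[dim] = len;
      }
    }
    return prefixLengths;
  }

  /** Pass 2: drop each dimension's shared prefix, keeping only this value's suffix bytes. */
  static byte[] stripPrefixes(byte[] value, int[] prefixLengths, int bytesPerDim) {
    int suffixTotal = 0;
    for (int prefixLength : prefixLengths) {
      suffixTotal += bytesPerDim - prefixLength;
    }
    byte[] suffixes = new byte[suffixTotal];
    int upto = 0;
    for (int dim = 0; dim < prefixLengths.length; dim++) {
      int suffixLen = bytesPerDim - prefixLengths[dim];
      System.arraycopy(value, dim * bytesPerDim + prefixLengths[dim], suffixes, upto, suffixLen);
      upto += suffixLen;
    }
    return suffixes;
  }
}
```

In this sketch the writer would emit the prefix lengths and prefix bytes once
per block, then the output of stripPrefixes for each doc.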
Query time is slower because there's more work per hit that needs value
filtering, i.e. re-joining each suffix to its prefix, per dimension, before the
value can be compared. This affects 2D much more than 1D because 1D has fewer
leaf blocks that need filtering (typically 0, 1 or 2, unless there are many
duplicate values in the index).
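
To make the extra per-hit work concrete, here is a minimal sketch of the
search-time side. The names are hypothetical, and it assumes the per-dimension
prefixes were read once for the block, the suffixes are packed in dimension
order, and values compare as unsigned bytes; none of this is taken from the
patch itself.

```java
/** Hypothetical sketch of search-time value filtering; not the actual Lucene code. */
final class LeafBlockFilterSketch {

  /** Re-joins one doc's packed suffixes onto the block's per-dimension prefixes. */
  static byte[] restore(byte[][] prefixes, byte[] packedSuffixes, int bytesPerDim) {
    int numDims = prefixes.length;
    byte[] value = new byte[numDims * bytesPerDim];
    int upto = 0;
    for (int dim = 0; dim < numDims; dim++) {
      int prefixLen = prefixes[dim].length;
      int suffixLen = bytesPerDim - prefixLen;
      System.arraycopy(prefixes[dim], 0, value, dim * bytesPerDim, prefixLen);
      System.arraycopy(packedSuffixes, upto, value, dim * bytesPerDim + prefixLen, suffixLen);
      upto += suffixLen;
    }
    return value;
  }

  /** Returns true if the value lies within [min, max] (inclusive) in every dimension. */
  static boolean inRange(byte[] value, byte[] min, byte[] max, int numDims, int bytesPerDim) {
    for (int dim = 0; dim < numDims; dim++) {
      int offset = dim * bytesPerDim;
      if (compareUnsigned(value, min, offset, bytesPerDim) < 0
          || compareUnsigned(value, max, offset, bytesPerDim) > 0) {
        return false;
      }
    }
    return true;
  }

  /** Compares one dimension of two packed values, treating each byte as unsigned. */
  static int compareUnsigned(byte[] a, byte[] b, int offset, int len) {
    for (int i = offset; i < offset + len; i++) {
      int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    return 0;
  }
}
```

Only leaf blocks that straddle the query boundary need this per-value check,
which is why the cost shows up more in 2D than in 1D.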
I suspect the index size savings are use-case dependent, e.g. if you index a
bunch of IPv4 addresses along with a few IPv6 addresses, you'd probably see
sizable savings.
I also suspect the more docs you index, the greater the savings, because the
cells will generally be smaller.
Net/net I think the opto is worth it, even if it slows query time a bit.