[
https://issues.apache.org/jira/browse/LUCENE-6891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless updated LUCENE-6891:
---------------------------------------
Attachment: LUCENE-6891.patch
Patch, tests/precommit passes, I think it's ready.
> Lucene30DimensionalFormat should use block prefix coding when writing values
> ----------------------------------------------------------------------------
>
> Key: LUCENE-6891
> URL: https://issues.apache.org/jira/browse/LUCENE-6891
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Fix For: Trunk
>
> Attachments: LUCENE-6891.patch
>
>
> Today we write the whole value for every doc in one leaf block in the BKD
> tree, but that's crazy because the whole point of that leaf block is all the
> docs inside it have values that are very close together.
> So I changed this to write the common prefix for the whole block up front in
> each block. This requires more index-time and search-time work, but gives
> nice index size reductions:
> On the 2D (London, UK) lat/lon benchmark:
> * Indexing time was a wee bit slower (743 -> 747 seconds)
> * Index size was ~11% smaller (704 MB -> 630 MB)
> * Query time was ~7% slower (2.84 sec -> 3.05 sec)
> * Heap usage is the same
> On the 1D (just "lat" from the above test) benchmark:
> * Indexing time was a wee bit slower (363 -> 364 sec)
> * Index size was ~23% smaller (472 MB -> 363 MB)
> * Query time was a wee bit slower (5.39 -> 5.41 sec)
> * Heap usage is the same
> Index time can be a bit slower since there are two passes now per leaf block
> (first to find the common prefix per dimension, and second pass must then
> strip those prefixes).
> Query time is slower because there's more work per hit that needs value
> filtering, i.e. collating the suffixes onto the prefixes, per dimension.
> This affects 2D much more than 1D because 1D has fewer leaf blocks that need
> filtering (typically 0, 1 or 2, unless there are many duplicate values in the
> index).
> I suspect the index size savings is use-case dependent, e.g. if you index a
> bunch of ipv4 addresses along with a few ipv6 addresses, you'd probably see
> sizable savings.
> I also suspect the more docs you index, the greater the savings, because the
> cells will generally be smaller.
> Net/net I think the opto is worth it, even if it slows query time a bit.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]