[jira] [Updated] (LUCENE-6891) Lucene30DimensionalFormat should use block prefix coding when writing values

Michael McCandless (JIRA) Wed, 11 Nov 2015 03:12:35 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-6891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael McCandless updated LUCENE-6891:
---------------------------------------
    Attachment: LUCENE-6891.patch

Patch attached; tests and precommit pass. I think it's ready.

> Lucene30DimensionalFormat should use block prefix coding when writing values
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-6891
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6891
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: Trunk
>
>         Attachments: LUCENE-6891.patch
>
>
> Today we write the whole value for every doc in one leaf block in the BKD 
> tree, but that's crazy because the whole point of a leaf block is that all 
> the docs inside it have values that are very close together.
> So I changed this to write the common prefix for the whole block once, up 
> front in each block.  This requires more index-time and search-time work, but 
> gives nice index size reductions:
> On the 2D (London, UK) lat/lon benchmark:
>   * Indexing time was a wee bit slower (743 sec -> 747 sec)
>   * Index size was ~11% smaller (704 MB -> 630 MB)
>   * Query time was ~7% slower (2.84 sec -> 3.05 sec)
>   * Heap usage is the same
> On the 1D (just "lat" from the above test) benchmark:
>   * Indexing time was a wee bit slower (363 sec -> 364 sec)
>   * Index size was ~23% smaller (472 MB -> 363 MB)
>   * Query time was a wee bit slower (5.39 sec -> 5.41 sec)
>   * Heap usage is the same
> Index time can be a bit slower since there are now two passes per leaf block 
> (the first finds the common prefix per dimension, and the second strips 
> those prefixes).
> Query time is slower because there's more work per hit that needs value 
> filtering, i.e. re-attaching each suffix to its prefix, per dimension.  
> This affects 2D much more than 1D because 1D has fewer leaf blocks that need 
> filtering (typically 0, 1 or 2, unless there are many duplicate values in the 
> index).
> I suspect the index size savings are use-case dependent, e.g. if you index a 
> bunch of IPv4 addresses along with a few IPv6 addresses, you'd probably see 
> sizable savings.
> I also suspect the more docs you index, the greater the savings, because the 
> cells will generally be smaller.
> Net/net I think the optimization is worth it, even if it slows query time a 
> bit.
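
The two-pass write and the read-side reassembly described in the quoted issue can be illustrated with a minimal sketch. This is not the code in LUCENE-6891.patch: it assumes a single dimension, fixed-width values of fewer than 128 bytes, and a made-up block layout (one length byte, the shared prefix, then the suffixes). The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration only; not Lucene's on-disk format or API.
class BlockPrefixSketch {

    /** Pass 1: length of the byte prefix shared by every value in the block. */
    public static int commonPrefixLength(byte[][] values) {
        if (values.length == 0) {
            return 0;
        }
        int len = values[0].length;
        for (int i = 1; i < values.length && len > 0; i++) {
            int j = 0;
            while (j < len && values[i][j] == values[0][j]) {
                j++;
            }
            len = j; // the shared prefix can only shrink as more values are seen
        }
        return len;
    }

    /** Pass 2: write the prefix once, then only the suffix of each value. */
    public static byte[] encode(byte[][] values, int bytesPerValue) {
        int prefixLen = commonPrefixLength(values);
        int suffixLen = bytesPerValue - prefixLen;
        byte[] out = new byte[1 + prefixLen + values.length * suffixLen];
        out[0] = (byte) prefixLen; // assumed layout: prefix length fits in one byte
        if (prefixLen > 0) {
            System.arraycopy(values[0], 0, out, 1, prefixLen);
        }
        int pos = 1 + prefixLen;
        for (byte[] v : values) {
            System.arraycopy(v, prefixLen, out, pos, suffixLen);
            pos += suffixLen;
        }
        return out;
    }

    /** Read side: re-attach each suffix to the shared prefix. */
    public static byte[][] decode(byte[] block, int bytesPerValue, int count) {
        int prefixLen = block[0];
        int suffixLen = bytesPerValue - prefixLen;
        byte[][] values = new byte[count][bytesPerValue];
        int pos = 1 + prefixLen;
        for (int i = 0; i < count; i++) {
            System.arraycopy(block, 1, values[i], 0, prefixLen);
            System.arraycopy(block, pos, values[i], prefixLen, suffixLen);
            pos += suffixLen;
        }
        return values;
    }

    public static void main(String[] args) {
        // Three 4-byte values that share their first two bytes.
        byte[][] block = { {10, 20, 1, 2}, {10, 20, 3, 4}, {10, 20, 5, 6} };
        byte[] encoded = encode(block, 4);
        System.out.println("raw bytes: " + (block.length * 4)
            + ", encoded bytes: " + encoded.length);
        System.out.println("round trip ok: "
            + Arrays.deepEquals(block, decode(encoded, 4, block.length)));
    }
}
```

In this toy layout the saving per block is roughly `prefixLen * (count - 1)` bytes minus the one length byte, which is consistent with the suspicion above that smaller, tighter cells yield larger savings. The extra copy per value in `decode` corresponds to the added per-hit query work.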



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

