[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

Nicholas Knize (JIRA) Wed, 15 Apr 2015 12:53:49 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496796#comment-14496796
 ]


Nicholas Knize commented on LUCENE-6422:
----------------------------------------

bq. I think a stack vs completely stream one by one is unlikely to be 
noticeable in a benchmark.

That's a possibility. It definitely warrants a benchmark to know for sure. I 
think it could be noticeable for "deep" QuadTrees? (aka: anything with high 
precision) as the string representation for one child can get upwards of 29 
bytes vs. 8. But its really speculation without benchmark confirmation.  Also, 
just eliminating the need for stack-based DFS logic is a more straightforward 
approach, no? (no crazy data structures with transient state needed since we're 
just lexicographically incrementing a compact bit representation on demand)

bq. ...couldn't you provide this via overriding 
SpatialPrefixTree.getTreeCellIterator instead of overriding the SpatialStrategy?

I like this suggestion.  If I set pruneLeafyBranches to false and override 
getTreeCellIterator in PackedQuadPrefixTree then, yes, RPT will call the 
PackedQuadPrefixTree.getTreeCellIterator from RPT.createCellIteratorToIndex and 
I can eliminate the StreamingPrefixTreeStrategy class altogether.  At that 
point I would suggest we rename RecursivePrefixTreeStrategy to something more 
"accurate", but that can be an API discussion.

bq. I think we may be talking about different things.

We're talking about the same thing.  With one difference in this 
implementation.  At the moment leaves that are on the edge (e.g., intersects) 
are not pruned. They probably could be - and this optimization would save even 
more space than its already saving. 

bq. and the end effect is semantically equivalent in terms of the search 
predicates.

I forgot to mention, this enhancement allows one to disable that "memory saving 
error factor" (by setting distance_error_pct to 0) and the result is quad cell 
precision regardless of shape size (with minimal increase in index size).  This 
was the initial driver for the improvement.  Allowing one to eliminate false 
positives - achieve results that are within the precision they specify without 
having a fuzzy error factor based on the size of the shape.  It doesn't 
eliminate that factor, it just provides the ability to disable it and achieve 
your desired precision (subject to earth curvature error) without saturating 
memory and blowing out the index.

bq. Yes but I need to apply the patch and see how much code is at stake here.

I honestly think leaving that to another issue is the best strategy (progress 
not perfection).  That's something I can look at as a next step.  Until that 
time comes I don't think there's a problem having this enhancement subclass 
QuadCell and QuadPrefixTree.  Refactoring can be deferred to a separate issue.

> Add StreamingQuadPrefixTree
> ---------------------------
>
>                 Key: LUCENE-6422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6422
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/spatial
>    Affects Versions: 5.x
>            Reporter: Nicholas Knize
>         Attachments: LUCENE-6422.patch
>
>
> To conform to Lucene's inverted index, SpatialStrategies use strings to 
> represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
> bits per GeoHash cell, respectively.  To create the terms representing a 
> Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
> ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
> invert() which creates a second lexicographically sorted array of Terms. This 
> doubles the memory consumption when indexing a shape.
> This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
> accomplish the following:
> 1.  Create a packed 8byte representation for a QuadCell
> 2.  Build the Packed cells 'on demand' when incrementToken is called
> Improvements over this approach include the generation of the packed cells 
> using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

Reply via email to