[
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496796#comment-14496796
]
Nicholas Knize commented on LUCENE-6422:
----------------------------------------
bq. I think a stack vs completely stream one by one is unlikely to be
noticeable in a benchmark.
That's a possibility. It definitely warrants a benchmark to know for sure. I
think it could be noticeable for "deep" QuadTrees? (aka: anything with high
precision) as the string representation for one child can get upwards of 29
bytes vs. 8. But its really speculation without benchmark confirmation. Also,
just eliminating the need for stack-based DFS logic is a more straightforward
approach, no? (no crazy data structures with transient state needed since we're
just lexicographically incrementing a compact bit representation on demand)
bq. ...couldn't you provide this via overriding
SpatialPrefixTree.getTreeCellIterator instead of overriding the SpatialStrategy?
I like this suggestion. If I set pruneLeafyBranches to false and override
getTreeCellIterator in PackedQuadPrefixTree then, yes, RPT will call the
PackedQuadPrefixTree.getTreeCellIterator from RPT.createCellIteratorToIndex and
I can eliminate the StreamingPrefixTreeStrategy class altogether. At that
point I would suggest we rename RecursivePrefixTreeStrategy to something more
"accurate", but that can be an API discussion.
bq. I think we may be talking about different things.
We're talking about the same thing. With one difference in this
implementation. At the moment leaves that are on the edge (e.g., intersects)
are not pruned. They probably could be - and this optimization would save even
more space than its already saving.
bq. and the end effect is semantically equivalent in terms of the search
predicates.
I forgot to mention, this enhancement allows one to disable that "memory saving
error factor" (by setting distance_error_pct to 0) and the result is quad cell
precision regardless of shape size (with minimal increase in index size). This
was the initial driver for the improvement. Allowing one to eliminate false
positives - achieve results that are within the precision they specify without
having a fuzzy error factor based on the size of the shape. It doesn't
eliminate that factor, it just provides the ability to disable it and achieve
your desired precision (subject to earth curvature error) without saturating
memory and blowing out the index.
bq. Yes but I need to apply the patch and see how much code is at stake here.
I honestly think leaving that to another issue is the best strategy (progress
not perfection). That's something I can look at as a next step. Until that
time comes I don't think there's a problem having this enhancement subclass
QuadCell and QuadPrefixTree. Refactoring can be deferred to a separate issue.
> Add StreamingQuadPrefixTree
> ---------------------------
>
> Key: LUCENE-6422
> URL: https://issues.apache.org/jira/browse/LUCENE-6422
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/spatial
> Affects Versions: 5.x
> Reporter: Nicholas Knize
> Attachments: LUCENE-6422.patch
>
>
> To conform to Lucene's inverted index, SpatialStrategies use strings to
> represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5
> bits per GeoHash cell, respectively. To create the terms representing a
> Shape, the BytesRefIteratorTokenStream first builds all of the terms into an
> ArrayList of Cells in memory, then passes the ArrayList.Iterator back to
> invert() which creates a second lexicographically sorted array of Terms. This
> doubles the memory consumption when indexing a shape.
> This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to
> accomplish the following:
> 1. Create a packed 8byte representation for a QuadCell
> 2. Build the Packed cells 'on demand' when incrementToken is called
> Improvements over this approach include the generation of the packed cells
> using an AutoPrefixAutomaton
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]