[
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496135#comment-14496135
]
David Smiley commented on LUCENE-6422:
--------------------------------------
*Awesome work Nick!* It's so nice to see meaty spatial contributions like this
(Geo3d is another example).
RE "Streaming" (transient memory use while indexing): I appreciate that the
out-of-the-box configuration of RPT with either LegacyPrefixTree (be it quad or
geohash) will use a lot of memory for indexing. But since... I don't know how
long now, this only occurs if the "leafy branch pruning" optimization is
enabled on RPT. That algorithm, existing on RecursivePrefixTreeStrategy,
unfortunately buffers all the cells. It's somewhat simple; it could be improved
to not buffer all cells but it would need to buffer some. Recently I did some
benchmarking and found that the leafy branch pruning yielded lots of index size
savings, particularly with the quad tree. I'd love to chat with you about the
subject of "leaves" on the SPT and an idea I have on doing better. Anyway, I
suggest you do another memory benchmark with leafy branch pruning disabled with
the PackedQuadTree but not the StreamingQuad...Strategy. With it disabled, the
underlying BytesRefIteratorTokenStream will consume an Iterator<Cell> that is a
direct instance of TreeCellIterator, and then you get the "streaming" effect.
The existing TreeCellIterator is quite similar to the
Streaming...PrefixTreeIterator here. If I'm right about there being no
appreciable memory savings, then this part of the patch can be removed as it's
redundant.
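To illustrate the buffering vs. streaming distinction above, here is a toy Java sketch. It is not Lucene's TreeCellIterator or the patch's iterator; the class and method names are hypothetical, and for simplicity it walks a full quad tree instead of only the cells that intersect a shape. The lazy iterator keeps only a small stack of pending cells, while the buffered variant holds every cell before the consumer sees the first one.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration only; none of these names exist in Lucene.
final class StreamingCellsSketch {

    /** Lazily walks a full quad tree depth-first, yielding tokens such as "A", "AA", "AB". */
    static Iterator<String> streaming(final int maxLevel) {
        return new Iterator<String>() {
            // Only the cells still waiting to be visited are held in memory.
            private final Deque<String> stack = new ArrayDeque<>();

            { pushChildren(""); }

            private void pushChildren(String parent) {
                // Push in reverse order so that child 'A' is popped first.
                for (char c = 'D'; c >= 'A'; c--) {
                    stack.push(parent + c);
                }
            }

            @Override
            public boolean hasNext() {
                return !stack.isEmpty();
            }

            @Override
            public String next() {
                if (stack.isEmpty()) {
                    throw new NoSuchElementException();
                }
                String cell = stack.pop();
                if (cell.length() < maxLevel) {
                    pushChildren(cell); // descend only when the consumer asks for more
                }
                return cell;
            }
        };
    }

    /** Buffered variant: materializes every cell up front, as a pruning pass over all cells would need. */
    static List<String> buffered(int maxLevel) {
        List<String> all = new ArrayList<>();
        Iterator<String> it = streaming(maxLevel);
        while (it.hasNext()) {
            all.add(it.next());
        }
        return all;
    }
}
```

Both variants produce the same cells in the same order; the difference is only in how many of them are alive at once, which is what the suggested memory benchmark would measure.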
I really like the new PackedQuadPrefixTree.java. (IMO that's what this JIRA
issue is mostly about) Can you consider _not_ subclassing Legacy* ? I'd like
to leave the legacy trees as-is and new SPTs not inherit from it. Can you base
your next patch off of trunk? And can you *either* post on
reviewboard.apache.org or use a GitHub fork & branch so I can provide
line-by-line feedback?
> Add StreamingQuadPrefixTree
> ---------------------------
>
> Key: LUCENE-6422
> URL: https://issues.apache.org/jira/browse/LUCENE-6422
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/spatial
> Affects Versions: 5.x
> Reporter: Nicholas Knize
> Attachments: LUCENE-6422.patch
>
>
> To conform to Lucene's inverted index, SpatialStrategies use strings to
> represent QuadCells and GeoHash cells, yielding 1 byte per QuadCell and 5
> bits per GeoHash cell, respectively. To create the terms representing a
> Shape, the BytesRefIteratorTokenStream first builds all of the terms into an
> ArrayList of Cells in memory, then passes the ArrayList.Iterator back to
> invert() which creates a second lexicographically sorted array of Terms. This
> doubles the memory consumption when indexing a shape.
> This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to
> accomplish the following:
> 1. Create a packed 8byte representation for a QuadCell
> 2. Build the Packed cells 'on demand' when incrementToken is called
> Improvements over this approach include the generation of the packed cells
> using an AutoPrefixAutomaton
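To make item 1 above concrete, here is a minimal Java sketch of one way a quad cell path could be packed into a single 8-byte long. The bit layout (2 bits per level in the high bits, the level count in the low 6 bits) and all names are assumptions for illustration; the issue description does not spell out the layout the patch actually uses.

```java
// Hypothetical illustration only; not the PackedQuadPrefixTree from the patch.
final class PackedQuadCellSketch {

    // 29 levels * 2 bits = 58 bits for the path, leaving 6 bits for the level.
    static final int MAX_LEVELS = 29;
    static final long LEVEL_MASK = 0x3FL;

    /** Packs a path of quadrant numbers (0..3, root first) into one long. */
    static long pack(int[] quadrants) {
        if (quadrants.length > MAX_LEVELS) {
            throw new IllegalArgumentException("too many levels: " + quadrants.length);
        }
        long term = 0L;
        for (int i = 0; i < quadrants.length; i++) {
            int q = quadrants[i];
            if (q < 0 || q > 3) {
                throw new IllegalArgumentException("quadrant out of range: " + q);
            }
            // Level 1 occupies bits 63..62, level 2 bits 61..60, and so on.
            term |= ((long) q) << (64 - 2 * (i + 1));
        }
        return term | quadrants.length;
    }

    /** Number of levels encoded in the packed term. */
    static int level(long term) {
        return (int) (term & LEVEL_MASK);
    }

    /** Quadrant (0..3) at the given 1-based depth. */
    static int quadrantAt(long term, int depth) {
        return (int) ((term >>> (64 - 2 * depth)) & 0x3L);
    }
}
```

Under this assumed layout, a cell of any depth up to 29 costs 8 bytes, whereas the string form described above costs 1 byte per level. Placing the path in the high bits also means an ancestor's path bits are a bit-prefix of its descendants' path bits.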