[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498476#comment-14498476
 ] 

Nicholas Knize commented on LUCENE-6422:
----------------------------------------

bq. Please don't call what StreamingPrefixTreeStrategy does as leafy branch 
pruning; it confuses the important distinction I'm trying to make.

That's fair. pruneLeafyBranches isn't a Geo text book term and I didn't have a 
common nomenclature within this context for discussion re: the behavior so I 
used the closest shared term, for discussion.  If I change it so the leaves are 
pruned on intersects then its the same behavior. I'll add that to the patch for 
benchmarking and it will result in a smaller index.  Thinking about it more, I 
think it's the right way to go and a simple enough change that can be iterated 
in future improvements.

bq. All SPTs should stop traversing when the relation is within – that's 
expected/normal.

I want to make sure there is no confusion here (especially for anyone else 
willing to participate and the sensitivity surrounding my use of the 
pruneLeafyBranches term).  SPT's in the context lucene-spatial's definition, 
no? (since SPT is not a common geo data structure its a derivative)  There are 
use-cases for traversing beyond the within state in GeoSpatial data structures. 
So someone might come along and contribute the "unexpected".  Just want to be 
prepared for that future discussion.

bq. I just don't yet understand how this packed quad encoding is going to allow 
for distErrPct=0

That helps me better understand the confusion. Your description of the 
transient memory behavior surrounding pruneLeafyBranches helped me understand 
why high res shapes w/ distErrPct=0 causes QuadTree to OOM on even reasonably 
sized shapes (thank you for that clarification).  In your benchmark there's a 
closer performance to the PackedQuad with it disabled (so I still don't 
understand why enabled is the default).  In essence, the rough benchmark you 
performed doesn't benchmark default Lucene spatial behavior. Is that not a 
problem?

bq. I would love to have more realistic shapes to test with.

Awesome! I'll work on adding some of the realistic data to the benchmark. A lot 
of it is quite big so I'll have to add to the build.xml to pull a compressed 
version from somewhere.  What's cool is that It includes a healthy amount of 
exotic shapes which better reflect complex geospatial use-cases.  Again, that's 
where you see drastic performance improvements with PackedQuad.  The Leaves are 
always 8 bytes regardless of depth (to a max depth of 29)  It introduces at 
most a 7 byte overhead at low precision (which are fewer terms anyway), so just 
use QuadTree in those cases, but there's a 21 byte savings on leaves (again, 
exotic shapes).  This is how it improves over QuadTree with distErrPct=0. You 
don't need to reduce resolution for those large shapes.

bq. Are you and Mike in a hurry to see this committed?

Apologies if you perceived a level of urgency on my behalf. Regarding Mike, I 
was giving credit for his quote because I agree re: non-destructive 
enhancements such as this. Other than that, I can't speak on his behalf and I 
don't think its fair to speculate on his level of interest re: this single 
contribution.

(soapbox)  So you know where I stand, I would simply prefer this (and future 
spatial contributions) not drag out as long as the geo3d contribution has. 
IMHO, I agree with you 100% that peer review and discussion is healthy and 
necessary.  I would add, though, there's a tipping point where it becomes a 
blocker and prevents others from participating.  And for this package to 
improve beyond the great work already done, there needs to be more involvement 
and some level of organic growth (iterative improvements).

> Add StreamingQuadPrefixTree
> ---------------------------
>
>                 Key: LUCENE-6422
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6422
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/spatial
>    Affects Versions: 5.x
>            Reporter: Nicholas Knize
>         Attachments: LUCENE-6422.patch, 
> LUCENE-6422_with_SPT_factory_and_benchmark.patch
>
>
> To conform to Lucene's inverted index, SpatialStrategies use strings to 
> represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
> bits per GeoHash cell, respectively.  To create the terms representing a 
> Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
> ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
> invert() which creates a second lexicographically sorted array of Terms. This 
> doubles the memory consumption when indexing a shape.
> This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
> accomplish the following:
> 1.  Create a packed 8byte representation for a QuadCell
> 2.  Build the Packed cells 'on demand' when incrementToken is called
> Improvements over this approach include the generation of the packed cells 
> using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to