[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-22 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14507064#comment-14507064
 ] 

David Smiley commented on LUCENE-6422:
--

My patch last night was also on trunk.  Is there anything in your -TRUNK patch 
of interest or can I ignore it?

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422-TRUNK.patch, LUCENE-6422.patch, 
 LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-22 Thread Nicholas Knize (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14507076#comment-14507076
 ] 

Nicholas Knize commented on LUCENE-6422:


Good deal. That was not clear so -TRUNK.patch is redundant and can be ignored.  
Nothing additional since yesterday's changes so I'm good as is for both trunk 
and _5x

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422-TRUNK.patch, LUCENE-6422.patch, 
 LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-20 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502886#comment-14502886
 ] 

David Smiley commented on LUCENE-6422:
--

I'll check out your patch tonight, tomorrow at the latest.  Karl/Geo3d has kept 
me busy :-)

RE naming: in both cases it seems the current names actually aren't bad 
relative to your suggestions.  prune is a suffix of pruneLeafyBranches 
(current name is more descriptive; in any case one would still need to look at 
the javadocs to understand), and SpatialTrie is synonymous with 
SpatialPrefixTree given that Trie and PrefixTree are synonyms.  I'm +1 to 
rename these as you want to 6.x if you think it's worth it.  There are 
back-compat issues with renaming them _now_.  Again, we agree more javadocs 
(including suggested alternative names) to add clarification now would be 
great.  I'll create a patch and seek your input.

RE sandbox: It's not clear to me what is really needed/useful.  If someone 
comes along with some newfangled index/search spatial approach, it could go in 
the module and not hook into any existing interface... except a Lucene Query 
class, and something like a Lucene TokenStream/Field for indexing.

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch, LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499520#comment-14499520
 ] 

Michael McCandless commented on LUCENE-6422:


bq. baselined on Lucene trunk (standard practice for contributing to Lucene)

Please don't require this of contributors: it is not standard practice.  Make 
the bar as low as possible!

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-17 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499814#comment-14499814
 ] 

David Smiley commented on LUCENE-6422:
--

That's fair; you're right that the bar shouldn't be the same for committers vs. 
contributors.

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-17 Thread Nicholas Knize (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500327#comment-14500327
 ] 

Nicholas Knize commented on LUCENE-6422:


bq.  If you can suggest a better name to what leafy branch pruning does, then 
at a minimum it could be expressed in the javadocs

++.  Just 'prune' is probably more clear since its universally used all over 
data structures.  We can add a javadoc comment that describes it in further 
detail if necessary.

bq. if SpatialPrefixTree might have a better literature/industry based name 
then I'd love to know what that is.

There are trie based spatial trees all over (kd-Trie, kd-b-trie, buddy tree) 
industry and the literature. The one you call QuadPrefixTree was originally 
introduced in 1991 called the QuadTrie.  (reference: Gaston H. Gonnet and 
Ricardo Baeza-Yates, Handbook of Algorithms and Data Structures -- in Pascal 
and C, 2nd edition, Addison-Wesley, 1991.)  Dr. Hanan Samet from UMD has a 
great section on MX and PR QuadTrees (same as QuadPrefixTree and a name someone 
mentioned to you in another issue).  He provides a nice discussion on the 
differences between MX, PR and their point based counterparts (compared by the 
decomposition methods).  There's certainly nothing wrong with an implementation 
specific name. If you are asking for suggestions then I offer: SpatialTrie, 
GeoHashTrie, QuadTrie as being shorter, more to the point, and probably more 
relate-able to other spatial SMEs (whom I'm hoping would be willing to get more 
involved). 

bq. It's not obvious to me but where in the code of PackedQuadCell are the 5 
depth bits encoded  decoded?

PackedQuadCell.getLevel() decodes, and its encoded in PackedQuadCell.nextTerm()

bq. Preferably it would stay enabled but I think you indicated it's not 
supported by PackedQuadTree? I didn't look closer.

That's correct.

bq. I also wonder whether we need a new, lighter weight spatial module 
(spatial2? spatia_light?), or maybe spatial_sandbox, where the barrier is lower?

+++ I think this is a great idea for experimental/research features we don't 
want cluttering up the spatial module.

bq. RE abstractions: I respect your opinion although I don't agree that there 
are too many here.

IMHO, this is a slippery slope.  There are so many diverse spatial data 
structures we should be taking a look at for improving spatial search in higher 
order dimensions (4D space-time for just a start).  That's a personal interest 
area for me, in how the most powerful high dimension structures (that already 
exist) can fit within the design and implementation of lucene-core (a green 
field to explore).  Something like this does require a sophisticated 
abstraction framework and this particular one has a bit of a learning curve. I 
think that can work itself out over time with a bit of refactoring (which it 
sounds like all are open to?).  In the meantime it does set the bar rather high 
for new contributors. This is another +1 for a spatial sandbox for experimental 
research (heck make it a separate repo). 

bq. Sigh; these conversations are stressfull 

They're very verbose, but maybe that's the kick in the pants needed to help the 
spatial module really take off. That is, after all, the common goal of the 
community?




 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch, LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-16 Thread Nicholas Knize (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498096#comment-14498096
 ] 

Nicholas Knize commented on LUCENE-6422:


Awesome for putting together the benchmark (even a rough one).  Can you attach 
the spatial.alg file you used so I can verify (I'm assuming its a variant on 
one you've posted before?) I'm getting very different numbers with production 
data sets (e.g., high res peninsulas, islands, global political boundaries, 
planet osm data - e.g., more exotic shapes than circles larger than a few 
hundred KM)

A few questions... and other observations.

bq. chosen arbitrarily; with 27 it choked on memory given 2GB heap

What choked specifically? I'm using PackedQuad with depth between 26 and 29. 
1GB heap size using the shapes I described above.

bq. disabling leafy branch pruning to compare apples to apples

Out of curiosity, why is this option enabled by default if it uses transient 
storage that doubles memory consumption? Seems backwards to me. 

bq. I was skeptical there would be index size savings and the benchmark shows 
there aren't any.

IMHO I would avoid these kinds of absolute statements (especially with the 
highly variable nature of spatial use-cases). In this situation your numbers do 
not surprise me when disabling that leafyBranchPrune option (which still 
confuses me why its there?), and using a single shape type with variable size.  
There is an outstanding issue in the existing patch (I'll see if I can't push 
out a fix today) - the TermsEnum is returning Terms with BytesRef containing 
bytes[] that are double the size than they should be (e.g., 16 bytes instead of 
8 - all padded w/ zeros). I suspect its some improper configuration in the 
reader?  So for every high res cell (e.g.), the term will be 16 bytes (still 
better than, say, 27).  

I think we can do better on simulated test data in the test framework. I love 
the randomization and what minimal real data sets that are there are great. 
It does not provide the coverage necessary, though, to best simulate some real 
world scenarios. That's okay, to steal Mike's quote progress not perfection. 
I'll definitely work to provide some more real world tests so we have better 
coverage and benchmarking options using real world data. Its a good way to 
recommend one indexing structure over another (this one's just the beginning. 
There are more indexing structures in trial mode, and even more improvements 
for the packed version)

Let's keep this going... Since this patch is non-destructive I don't see a 
reason it can't be committed as another option and I can submit enhancement 
patches to this feature.  That would be up to the community. 

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-16 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498260#comment-14498260
 ] 

David Smiley commented on LUCENE-6422:
--

bq. Can you attach the spatial.alg file you used so I can verify 

It's in the patch.

bq. What choked specifically? I'm using PackedQuad with depth between 26 and 
29. 1GB heap size using the shapes I described above.

-Xmx2G resulted in an OutOfMemoryError, be it for the legacy quad or this new 
quad one.  

bq. Out of curiosity, why is this option enabled by default if it uses 
transient storage that doubles memory consumption? Seems backwards to me.

leafy branch pruning is enabled by default for RPT, although 
StreamingPrefixTreeStrategy overrides the method that would trigger it.  In 
order to compare another tree (legacy quad in this case) fairly to packed-quad, 
I disabled leafy branch pruning.

Please don't call what StreamingPrefixTreeStrategy does as leafy branch 
pruning; it confuses the important distinction I'm trying to make.  All SPTs 
should stop traversing when the relation is _within_ -- that's expected/normal.

bq. IMHO I would avoid these kinds of absolute statements (especially with the 
highly variable nature of spatial use-cases).

I'm sorry.  To be more clear, I just don't yet understand how this packed quad 
encoding is going to allow for distErrPct=0.  I'd like to understand; please 
help me.  Knowing what I do know about the SPTs, the results were what I 
expected -- similar disk size to existing quad.  

bq. I think we can do better on simulated test data in the test framework.

Yes!  I would love to have more realistic shapes to test with.

RE progress not perfection:
Yes, Mike uses that quote constantly and I even saw it in your code :-)  I have 
my catch-phrases too.  Are you and Mike in a hurry to see this committed?  It's 
very normal, in this open-source project anyway, that there is back  forth  
peer-review and changes that are asked of the contributor.  Don't worry; 
something is going to get committed -- the query speed is a nice improvement!  
It's not quite done -- that's all.

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-16 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498567#comment-14498567
 ] 

David Smiley commented on LUCENE-6422:
--

I welcome any suggestions you have pertaining naming (or again on anything).  
If you can suggest a better name to what leafy branch pruning does, then at a 
minimum it could be expressed in the javadocs.  Not long ago it had another 
name but it was even worse ;-).  Naming is hard.  Likewise... if 
SpatialPrefixTree might have a better literature/industry based name then I'd 
love to know what that is.  When I named it as-such I looked around but didn't 
seem to find anything that was perfect.  It's a type of Trie, a spatial trie, 
and PrefixTree is a synonym for Trie, so... there you go.  Maybe I didn't 
look hard enough.  I'm sorry if I _seemed_ touchy on the terminology; I merely 
want to ensure we mutually understood what we were and weren't talking about -- 
that's all.  So when you proposed that what StreamingQuadPrefixTree did was 
leafy branch pruning, and as the coiner of the term I can see it didn't meet my 
definition; clarification is in order.

bq. I want to make sure there is no confusion here (especially for anyone else 
willing to participate and the sensitivity surrounding my use of the 
pruneLeafyBranches term). SPT's in the context of lucene-spatial's definition, 
no? (since SPT is not a common name for a geo data structure its a derivative)

Yes, in the context of lucene-spatial's definition.  

bq. There are use-cases for traversing beyond the within state in GeoSpatial 
data structures. So someone might come along and contribute the unexpected. 
Just want to be prepared for that future discussion.

Interesting; I'm curious what that might be.

bq. A lot of it is quite big so I'll have to add to the build.xml to pull a 
compressed version from somewhere.

Yes, as you may have observed, this is how all existing test data is handled in 
the benchmark module:  fetched  decompressed on-demand.  I added the Geonames 
point data.  And I added facilities to SpatialDocMaker to automatically turn 
the points into circles  rects of random size.

I'm looking forward to seeing PackedQuadPrefixTree kick ass in being compressed 
allowing distErrPct=0 -- I'm still puzzled but I look forward to being 
pleasantly astonished. I do understand that it uses all 8 bits (instead of just 
2 as the legacy impl) of the first number of bytes for the quadrant info, and 
that should lead to shorter terms / prefixes for higher-precision data.  But I 
don't see that there would be any change in _the number_ of terms, which is, as 
I see, the scalability challenge.  I have an idea on solving this floating in 
my head but maybe I needn't ponder it longer if PackedQuadPrefixTree handles it 
somehow.  It's not obvious to me but where in the code of PackedQuadCell are 
the 5 depth bits encoded  decoded?

bq. In essence, the rough benchmark you performed doesn't benchmark default 
Lucene spatial behavior. Is that not a problem?

It is a problem -- it's not ideal, that is.  Preferably it would stay enabled 
but I think you indicated it's not supported by PackedQuadTree?  I didn't look 
closer.

RE Geo3d:  As an example for anything with me; it's an outlier, not an example 
that proves the rule.  If I am blind to facts I can't  see then show me.  
Hopefully you noticed that Karl and I are working together and it's come real 
far now (lots of discussion on ReviewBoard  bugs I helped Karl find via 
randomized testing).  Your patch here has nothing in common with it -- 
PackedQuadTree _obviously_ belongs here, and it should be quite evident that 
I'm here helping by providing feedback and running a benchmark. And yes, being 
critical.  Anyway... lets get back to work.

The only thing about this patch that is a blocker (-1) for me is 
StreamingPrefixTreeStrategy.  Not most of the code in it, but it's existence as 
a SpatialStrategy subclass instead of an SPT subclass.  I don't mind that it 
optimizes something that I don't think needed optimization, though I suspect if 
you noticed TreeCellIterator originally you wouldn't have bothered.

I have other by-line feedback I'd like to give (and would _prefer_ to do so via 
ReviewBoard/GitHub) but we needn't steamroll this in  under the mantra of 
progress not perfection.  

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per 

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-16 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498829#comment-14498829
 ] 

Michael McCandless commented on LUCENE-6422:


bq. It's very normal, in this open-source project anyway, that there is back  
forth  peer-review and changes that are asked of the contributor. Don't worry; 
something is going to get committed – the query speed is a nice improvement! 
It's not quite done – that's all.

I agree iterating/peer review is healthy, but growing the community is
also *very* important in open-source, and at least for the geo3d issue
and now this one it worries me when I see barriers being put up that I
feel should not be blockers issues for committing.

Especially when bus factor is essentially one, in an area as important
as spatial, I think encouraging contributions / growing community
becomes incredibly important.  It's like when we humans intervene for
an endangered species... after having caused their predicament in
the first place ... sigh.

Of course, if there are real technical objection/problems/quality
issues for a given patch, those *should* be addressed before committing.

It's important to show new people we are eager and excited for their
contributions, that the bar is not so high for them to have an impact.
We can always review/iterate/benchmark after they are committed, as
long as net/net the patch is a step forward as (I think?) this one is.

I also wonder whether we need a new, lighter weight spatial module
(spatial2?  spatia_light?), or maybe spatial_sandbox, where the
barrier is lower?  The levels of abtractions in the current module
look excessive to me and with both the geo3d issue and this issue,
correctly fitting in to the existing abstractions seemed to be one
of the barriers (e.g. your only blocker here (The only thing about
this patch that is a blocker (-1) for me is
StreamingPrefixTreeStrategy..)  seems to be such an issue).  So if we
had a more free sandbox the barrier is lower by design.


 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-16 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499138#comment-14499138
 ] 

David Smiley commented on LUCENE-6422:
--

bq. We can always review/iterate/benchmark after they are committed, as long as 
net/net the patch is a step forward as (I think?) this one is.

Mike, this project is defacto run as review then commit, not the other way 
around.  This isn't news to you, of course; it's weird to feel I need to remind 
you -- we all know this for good  bad.  I have the same expectation of patches 
I send elsewhere.  I'm not looking for perfection, just moving one thing around 
(a quality issue as-is IMO), and the rest is minor.  I appreciate the points 
you make generally, but I don't feel I'm setting the bar too high on this 
issue.  Because of the fuss being made here; it's going to be harder for me to 
give quality feedback on future patches without fear of inducing needless 
drama.  And that's not good; Lucene improves through solid peer review that I 
see on many patches (particularly yours, by the way).

bq. It's important to show new people we are eager and excited for their 
contributions

Let it be known that weeks ago, upon discovery that Elastic had a new spatial 
SME, I contacted Nick to do a video chat so that I could basically welcome him 
into the Lucene/Spatial4j spatial area, that I looked forward to seeing what he 
will contribute, and that I seek'ed his honest impressions / feedback.  I 
simply can't be more sincere, and I hope this demonstrates that I don't want to 
be the only spatial person around here.

RE abstractions: I respect your opinion although I don't agree that there are 
too many here.  On the Geo3D front, it's extremely close now to being 
committed, and the _only_ abstraction it is fitting into is a Spatial4j Shape 
interface (BTW Geo3D internally has a bunch of abstractions and I didn't 
complain to Karl as to the merits of them).  It would be good to have one more 
Spatial4j abstraction hook but it's not a blocker.  On this patch, I'm asking 
Nick _to scale it back_ one, to just the SpatialPrefixTree (which technically 
includes Cell as that's what SPTs generate).  And because of these 
abstractions, both of them in fact (Shape  SpatialPrefixTree), it's possible 
to re-use indexing and search predicates (Intersects, Within, Contains) and 
other code that work with these abstractions, without either implementations 
needing to know they exist.  I think it's very powerful and awesome that these 
things can be combined / leveraged.  Of course I don't believe any of these 
APIs are perfect; the specifics are debatable and I wish there were more people 
improving it than just me to help make it even better.  Now it looks like there 
are :-)

_Sigh_; these conversations are stressfull -- I look forward to getting back to 
the technical matters of the patch and moving it forward.  [~nknize] can you 
please post a new patch:
* baselined on Lucene trunk (standard practice for contributing to Lucene)
* with the streamlined cell iterator implementation (what you call streaming) 
moved to the PackedQuadPrefixTree

My added patch had two little extras: the factory  benchmark module 
integration.  It would be nice if you could add those.  If you don't, I will 
any how, perhaps with a factory test which I didn't yet do.  In fact if I sadly 
never hear from you again, I'll do all that is spoken of here because we'd all 
love for this awesomeness to get into Lucene spatial.  You've done all the hard 
work; what remains is small.

You certainly don't have to prove to me that the indexes are lean... or rather 
can be lean under realistic circumstances.  I remain very curious about that.

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements 

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-16 Thread Nicholas Knize (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498476#comment-14498476
 ] 

Nicholas Knize commented on LUCENE-6422:


bq. Please don't call what StreamingPrefixTreeStrategy does as leafy branch 
pruning; it confuses the important distinction I'm trying to make.

That's fair. pruneLeafyBranches isn't a Geo text book term and I didn't have a 
common nomenclature within this context for discussion re: the behavior so I 
used the closest shared term, for discussion.  If I change it so the leaves are 
pruned on intersects then its the same behavior. I'll add that to the patch for 
benchmarking and it will result in a smaller index.  Thinking about it more, I 
think it's the right way to go and a simple enough change that can be iterated 
in future improvements.

bq. All SPTs should stop traversing when the relation is within – that's 
expected/normal.

I want to make sure there is no confusion here (especially for anyone else 
willing to participate and the sensitivity surrounding my use of the 
pruneLeafyBranches term).  SPT's in the context lucene-spatial's definition, 
no? (since SPT is not a common geo data structure its a derivative)  There are 
use-cases for traversing beyond the within state in GeoSpatial data structures. 
So someone might come along and contribute the unexpected.  Just want to be 
prepared for that future discussion.

bq. I just don't yet understand how this packed quad encoding is going to allow 
for distErrPct=0

That helps me better understand the confusion. Your description of the 
transient memory behavior surrounding pruneLeafyBranches helped me understand 
why high res shapes w/ distErrPct=0 causes QuadTree to OOM on even reasonably 
sized shapes (thank you for that clarification).  In your benchmark there's a 
closer performance to the PackedQuad with it disabled (so I still don't 
understand why enabled is the default).  In essence, the rough benchmark you 
performed doesn't benchmark default Lucene spatial behavior. Is that not a 
problem?

bq. I would love to have more realistic shapes to test with.

Awesome! I'll work on adding some of the realistic data to the benchmark. A lot 
of it is quite big so I'll have to add to the build.xml to pull a compressed 
version from somewhere.  What's cool is that It includes a healthy amount of 
exotic shapes which better reflect complex geospatial use-cases.  Again, that's 
where you see drastic performance improvements with PackedQuad.  The Leaves are 
always 8 bytes regardless of depth (to a max depth of 29)  It introduces at 
most a 7 byte overhead at low precision (which are fewer terms anyway), so just 
use QuadTree in those cases, but there's a 21 byte savings on leaves (again, 
exotic shapes).  This is how it improves over QuadTree with distErrPct=0. You 
don't need to reduce resolution for those large shapes.

bq. Are you and Mike in a hurry to see this committed?

Apologies if you perceived a level of urgency on my behalf. Regarding Mike, I 
was giving credit for his quote because I agree re: non-destructive 
enhancements such as this. Other than that, I can't speak on his behalf and I 
don't think its fair to speculate on his level of interest re: this single 
contribution.

(soapbox)  So you know where I stand, I would simply prefer this (and future 
spatial contributions) not drag out as long as the geo3d contribution has. 
IMHO, I agree with you 100% that peer review and discussion is healthy and 
necessary.  I would add, though, there's a tipping point where it becomes a 
blocker and prevents others from participating.  And for this package to 
improve beyond the great work already done, there needs to be more involvement 
and some level of organic growth (iterative improvements).

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the 

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-15 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496135#comment-14496135
 ] 

David Smiley commented on LUCENE-6422:
--

*Awesome work Nick!*  It's so nice to see meaty spatial contributions like this 
(Geo3d is another example).

RE Streaming (transient memory use while indexing):  I appreciate that the 
out-of-the box configuration of RPT with either LegacyPrefixTree (be it quad or 
geohash)  will use a lot of memory for indexing.  But since... I don't know how 
long now, this only occurs if the leafy branch pruning optimization is 
enabled on RPT.  That algorithm, existing on RecursivePrefixTreeStrategy, 
unfortunately buffers all the cells. It's somewhat simple; it could be improved 
to not buffer all cells but it would need to buffer some.  Recently I did some 
benchmarking and found that the leafy branch pruning yielded lots of index size 
savings, particularly with the quad tree.  I'd love to chat with you about the 
subject of leaves on the SPT and an idea I have on doing better.  Any way, I 
suggest you do another memory benchmark with leafy branch pruning disabled with 
the PackedQuadTree but not the StreamingQuad...Strategy.  With it disabled, the 
underlying BytesRefIteratortokenStream will consume a IteratorCell that is a 
direct instance of TreeCellIterator, and then you get the streaming effect.  
The existing TreeCellIterator is quite similar to the 
Streaming...PrefixTreeIterator here.  If I'm right about there being no 
appreciable memory savings, then this part of the patch can be removed as it's 
redundant.

I really like the new PackedQuadPrefixTree.java.  (IMO that's what this JIRA 
issue is mostly about)  Can you consider _not_ subclassing Legacy* ?  I'd like 
to leave the legacy trees as-is and new SPTs not inherit from it.  Can you base 
your next patch off of trunk?  And can you *either* post on 
reviewboard.apache.org or use a GitHub fork  branch so I can provide by-line 
feedback?

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-15 Thread Nicholas Knize (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496326#comment-14496326
 ] 

Nicholas Knize commented on LUCENE-6422:


Thanks David. I will certainly be doing a more thorough benchmark and will 
start with the suggestions. I imagine the savings will not be as extreme as in 
the situation with Wales (that was just the most interesting case.)  

bq. With it disabled, the underlying BytesRefIteratortokenStream will consume a 
IteratorCell that is a direct instance of TreeCellIterator, and then you get 
the streaming effect.

Just a few thoughts, the StreamingPrefixTreeIterator gives the benefit of a few 
worlds:  
1. It uses an on-demand DFS through bit shifting completely eliminating the 
need for the stack DFS logic in TreeCellIterator.hasNext.  I suppose code-wise 
it would be cleaner to subclass TreeCellIterator and just override hasNext (and 
possibly next since I don't set thisCell/current in next)?  That's a good idea 
for code maintenance and reuse.
2. The on-demand DFS traversal already achieves a leafy branch pruning effect 
by not descending on Cells that already fall within the shape. This gives you 
pruning without having to buffer anything (other than the current and next 
cell). This does vary a little bit in that the RPT simply prunes all 4 leaves 
that intersect the shape.

bq. Can you consider not subclassing Legacy* ?

I'll certainly take a look at this. I saw the comment about not subclassing and 
thought about it, but since there is so much reuse with the bytes[], b_off, and 
b_len (which could be a BytesRef) it didn't make much sense duplicating code. 
Are you suggesting duplicating code and eventually deprecating the LegacyCell?



 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-15 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496658#comment-14496658
 ] 

David Smiley commented on LUCENE-6422:
--

bq. 1. It uses an on-demand DFS through bit shifting completely eliminating the 
need for the stack DFS logic in TreeCellIterator.hasNext. I suppose code-wise 
it would be cleaner to subclass TreeCellIterator and just override hasNext (and 
possibly next since I don't set thisCell/current in next)? That's a good idea 
for code maintenance and reuse.

Interesting; I need to look at that code closer.  Nonetheless, I think a stack 
vs completely stream one by one is unlikely to be noticeable in a benchmark.  
The memory use is capped at a small amount either way.  If we find there are 
measurable advantages here, then couldn't you provide this via overriding 
SpatialPrefixTree.getTreeCellIterator instead of overriding the SpatialStrategy?

bq. 2. The on-demand DFS traversal already achieves a leafy branch pruning 
effect by not descending on Cells that already fall within the shape. This 
gives you pruning without having to buffer anything (other than the current and 
next cell). This does vary a little bit in that the RPT simply prunes all 4 
leaves that intersect the shape.

I think we may be talking about different things.  A leafy branch as I 
defined it in Lucene spatial is a parent cell in which all sub-cells that could 
theoretically exist are there for this shape and are also leaves.  So for a 
quad-tree, it's a parent with 4 sub-cells that are all leaves.  A leaf is a 
cell that is _either_ within the shape from which it was derived _or_ it 
overlaps the edge but we've reached a precision threshold.  Indexing the 4 
leaves produces a larger index than not emitting those leaves at all and 
instead marking the parent as a leaf -- and the end effect is semantically 
equivalent in terms of the search predicates.  There are some trade-offs; but I 
won't digress now.  

bq. Are you suggesting duplicating code and eventually deprecating the 
LegacyCell?

Yes but I need to apply the patch and see how much code is at stake here.  At 
the time I introduced LegacyCell, I refined the SpatialPrefixTree API and was 
unsatisfied with the only two implementations at that time, in terms of 
efficiencies, so I called it LegacyPrefixTree and LegacyCell.  Perhaps these 
could still stick around still; I welcome your input on what it might be named 
or if there is too little code to worry about.  But keeping having this extend 
LegacyCell is problematic because RecursivePrefixTreeStrategy makes an 
assumption for the branch pruning you can see there, right at the beginning.

I welcome whatever thoughts you may have on the API to make it more clear, 
faster, whatever.


 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496686#comment-14496686
 ] 

Michael McCandless commented on LUCENE-6422:


Can we commit both approaches for streaming and refactor later (progress not 
perfection)?

I like that the new streaming approach has fewer abstractions on quick glance.

I also think further benchmarks need not block progress: they can come later.

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-15 Thread Nicholas Knize (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496796#comment-14496796
 ] 

Nicholas Knize commented on LUCENE-6422:


bq. I think a stack vs completely stream one by one is unlikely to be 
noticeable in a benchmark.

That's a possibility. It definitely warrants a benchmark to know for sure. I 
think it could be noticeable for deep QuadTrees? (aka: anything with high 
precision) as the string representation for one child can get upwards of 29 
bytes vs. 8. But its really speculation without benchmark confirmation.  Also, 
just eliminating the need for stack-based DFS logic is a more straightforward 
approach, no? (no crazy data structures with transient state needed since we're 
just lexicographically incrementing a compact bit representation on demand)

bq. ...couldn't you provide this via overriding 
SpatialPrefixTree.getTreeCellIterator instead of overriding the SpatialStrategy?

I like this suggestion.  If I set pruneLeafyBranches to false and override 
getTreeCellIterator in PackedQuadPrefixTree then, yes, RPT will call the 
PackedQuadPrefixTree.getTreeCellIterator from RPT.createCellIteratorToIndex and 
I can eliminate the StreamingPrefixTreeStrategy class altogether.  At that 
point I would suggest we rename RecursivePrefixTreeStrategy to something more 
accurate, but that can be an API discussion.

bq. I think we may be talking about different things.

We're talking about the same thing.  With one difference in this 
implementation.  At the moment leaves that are on the edge (e.g., intersects) 
are not pruned. They probably could be - and this optimization would save even 
more space than its already saving. 

bq. and the end effect is semantically equivalent in terms of the search 
predicates.

I forgot to mention, this enhancement allows one to disable that memory saving 
error factor (by setting distance_error_pct to 0) and the result is quad cell 
precision regardless of shape size (with minimal increase in index size).  This 
was the initial driver for the improvement.  Allowing one to eliminate false 
positives - achieve results that are within the precision they specify without 
having a fuzzy error factor based on the size of the shape.  It doesn't 
eliminate that factor, it just provides the ability to disable it and achieve 
your desired precision (subject to earth curvature error) without saturating 
memory and blowing out the index.

bq. Yes but I need to apply the patch and see how much code is at stake here.

I honestly think leaving that to another issue is the best strategy (progress 
not perfection).  That's something I can look at as a next step.  Until that 
time comes I don't think there's a problem having this enhancement subclass 
QuadCell and QuadPrefixTree.  Refactoring can be deferred to a separate issue.

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-15 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497475#comment-14497475
 ] 

David Smiley commented on LUCENE-6422:
--

Just a quick comment... I can see this patch assumes Java 8, yet Lucene 5x is 
on Java 7.  Long.BYTES doesn't exist, and there is no default remove method 
on Iterator (for PrefixTreeIterator).

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org