[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14507064#comment-14507064 ] David Smiley commented on LUCENE-6422: -- My patch last night was also on trunk. Is there anything in your -TRUNK patch of interest or can I ignore it? Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422-TRUNK.patch, LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14507076#comment-14507076 ] Nicholas Knize commented on LUCENE-6422: Good deal. That was not clear so -TRUNK.patch is redundant and can be ignored. Nothing additional since yesterday's changes so I'm good as is for both trunk and _5x Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422-TRUNK.patch, LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502886#comment-14502886 ] David Smiley commented on LUCENE-6422: -- I'll check out your patch tonight, tomorrow at the latest. Karl/Geo3d has kept me busy :-) RE naming: in both cases it seems the current names actually aren't bad relative to your suggestions. prune is a suffix of pruneLeafyBranches (current name is more descriptive; in any case one would still need to look at the javadocs to understand), and SpatialTrie is synonymous with SpatialPrefixTree given that Trie and PrefixTree are synonyms. I'm +1 to rename these as you want to 6.x if you think it's worth it. There are back-compat issues with renaming them _now_. Again, we agree more javadocs (including suggested alternative names) to add clarification now would be great. I'll create a patch and seek your input. RE sandbox: It's not clear to me what is really needed/useful. If someone comes along with some newfangled index/search spatial approach, it could go in the module and not hook into any existing interface... except a Lucene Query class, and something like a Lucene TokenStream/Field for indexing. Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499520#comment-14499520 ] Michael McCandless commented on LUCENE-6422: bq. baselined on Lucene trunk (standard practice for contributing to Lucene) Please don't require this of contributors: it is not standard practice. Make the bar as low as possible! Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499814#comment-14499814 ] David Smiley commented on LUCENE-6422: -- That's fair; you're right that the bar shouldn't be the same for committers vs. contributors. Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500327#comment-14500327 ] Nicholas Knize commented on LUCENE-6422: bq. If you can suggest a better name to what leafy branch pruning does, then at a minimum it could be expressed in the javadocs ++. Just 'prune' is probably more clear since its universally used all over data structures. We can add a javadoc comment that describes it in further detail if necessary. bq. if SpatialPrefixTree might have a better literature/industry based name then I'd love to know what that is. There are trie based spatial trees all over (kd-Trie, kd-b-trie, buddy tree) industry and the literature. The one you call QuadPrefixTree was originally introduced in 1991 called the QuadTrie. (reference: Gaston H. Gonnet and Ricardo Baeza-Yates, Handbook of Algorithms and Data Structures -- in Pascal and C, 2nd edition, Addison-Wesley, 1991.) Dr. Hanan Samet from UMD has a great section on MX and PR QuadTrees (same as QuadPrefixTree and a name someone mentioned to you in another issue). He provides a nice discussion on the differences between MX, PR and their point based counterparts (compared by the decomposition methods). There's certainly nothing wrong with an implementation specific name. If you are asking for suggestions then I offer: SpatialTrie, GeoHashTrie, QuadTrie as being shorter, more to the point, and probably more relate-able to other spatial SMEs (whom I'm hoping would be willing to get more involved). bq. It's not obvious to me but where in the code of PackedQuadCell are the 5 depth bits encoded decoded? PackedQuadCell.getLevel() decodes, and its encoded in PackedQuadCell.nextTerm() bq. Preferably it would stay enabled but I think you indicated it's not supported by PackedQuadTree? I didn't look closer. That's correct. bq. I also wonder whether we need a new, lighter weight spatial module (spatial2? spatia_light?), or maybe spatial_sandbox, where the barrier is lower? +++ I think this is a great idea for experimental/research features we don't want cluttering up the spatial module. bq. RE abstractions: I respect your opinion although I don't agree that there are too many here. IMHO, this is a slippery slope. There are so many diverse spatial data structures we should be taking a look at for improving spatial search in higher order dimensions (4D space-time for just a start). That's a personal interest area for me, in how the most powerful high dimension structures (that already exist) can fit within the design and implementation of lucene-core (a green field to explore). Something like this does require a sophisticated abstraction framework and this particular one has a bit of a learning curve. I think that can work itself out over time with a bit of refactoring (which it sounds like all are open to?). In the meantime it does set the bar rather high for new contributors. This is another +1 for a spatial sandbox for experimental research (heck make it a separate repo). bq. Sigh; these conversations are stressfull They're very verbose, but maybe that's the kick in the pants needed to help the spatial module really take off. That is, after all, the common goal of the community? Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498096#comment-14498096 ] Nicholas Knize commented on LUCENE-6422: Awesome for putting together the benchmark (even a rough one). Can you attach the spatial.alg file you used so I can verify (I'm assuming its a variant on one you've posted before?) I'm getting very different numbers with production data sets (e.g., high res peninsulas, islands, global political boundaries, planet osm data - e.g., more exotic shapes than circles larger than a few hundred KM) A few questions... and other observations. bq. chosen arbitrarily; with 27 it choked on memory given 2GB heap What choked specifically? I'm using PackedQuad with depth between 26 and 29. 1GB heap size using the shapes I described above. bq. disabling leafy branch pruning to compare apples to apples Out of curiosity, why is this option enabled by default if it uses transient storage that doubles memory consumption? Seems backwards to me. bq. I was skeptical there would be index size savings and the benchmark shows there aren't any. IMHO I would avoid these kinds of absolute statements (especially with the highly variable nature of spatial use-cases). In this situation your numbers do not surprise me when disabling that leafyBranchPrune option (which still confuses me why its there?), and using a single shape type with variable size. There is an outstanding issue in the existing patch (I'll see if I can't push out a fix today) - the TermsEnum is returning Terms with BytesRef containing bytes[] that are double the size than they should be (e.g., 16 bytes instead of 8 - all padded w/ zeros). I suspect its some improper configuration in the reader? So for every high res cell (e.g.), the term will be 16 bytes (still better than, say, 27). I think we can do better on simulated test data in the test framework. I love the randomization and what minimal real data sets that are there are great. It does not provide the coverage necessary, though, to best simulate some real world scenarios. That's okay, to steal Mike's quote progress not perfection. I'll definitely work to provide some more real world tests so we have better coverage and benchmarking options using real world data. Its a good way to recommend one indexing structure over another (this one's just the beginning. There are more indexing structures in trial mode, and even more improvements for the packed version) Let's keep this going... Since this patch is non-destructive I don't see a reason it can't be committed as another option and I can submit enhancement patches to this feature. That would be up to the community. Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498260#comment-14498260 ] David Smiley commented on LUCENE-6422: -- bq. Can you attach the spatial.alg file you used so I can verify It's in the patch. bq. What choked specifically? I'm using PackedQuad with depth between 26 and 29. 1GB heap size using the shapes I described above. -Xmx2G resulted in an OutOfMemoryError, be it for the legacy quad or this new quad one. bq. Out of curiosity, why is this option enabled by default if it uses transient storage that doubles memory consumption? Seems backwards to me. leafy branch pruning is enabled by default for RPT, although StreamingPrefixTreeStrategy overrides the method that would trigger it. In order to compare another tree (legacy quad in this case) fairly to packed-quad, I disabled leafy branch pruning. Please don't call what StreamingPrefixTreeStrategy does as leafy branch pruning; it confuses the important distinction I'm trying to make. All SPTs should stop traversing when the relation is _within_ -- that's expected/normal. bq. IMHO I would avoid these kinds of absolute statements (especially with the highly variable nature of spatial use-cases). I'm sorry. To be more clear, I just don't yet understand how this packed quad encoding is going to allow for distErrPct=0. I'd like to understand; please help me. Knowing what I do know about the SPTs, the results were what I expected -- similar disk size to existing quad. bq. I think we can do better on simulated test data in the test framework. Yes! I would love to have more realistic shapes to test with. RE progress not perfection: Yes, Mike uses that quote constantly and I even saw it in your code :-) I have my catch-phrases too. Are you and Mike in a hurry to see this committed? It's very normal, in this open-source project anyway, that there is back forth peer-review and changes that are asked of the contributor. Don't worry; something is going to get committed -- the query speed is a nice improvement! It's not quite done -- that's all. Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498567#comment-14498567 ] David Smiley commented on LUCENE-6422: -- I welcome any suggestions you have pertaining naming (or again on anything). If you can suggest a better name to what leafy branch pruning does, then at a minimum it could be expressed in the javadocs. Not long ago it had another name but it was even worse ;-). Naming is hard. Likewise... if SpatialPrefixTree might have a better literature/industry based name then I'd love to know what that is. When I named it as-such I looked around but didn't seem to find anything that was perfect. It's a type of Trie, a spatial trie, and PrefixTree is a synonym for Trie, so... there you go. Maybe I didn't look hard enough. I'm sorry if I _seemed_ touchy on the terminology; I merely want to ensure we mutually understood what we were and weren't talking about -- that's all. So when you proposed that what StreamingQuadPrefixTree did was leafy branch pruning, and as the coiner of the term I can see it didn't meet my definition; clarification is in order. bq. I want to make sure there is no confusion here (especially for anyone else willing to participate and the sensitivity surrounding my use of the pruneLeafyBranches term). SPT's in the context of lucene-spatial's definition, no? (since SPT is not a common name for a geo data structure its a derivative) Yes, in the context of lucene-spatial's definition. bq. There are use-cases for traversing beyond the within state in GeoSpatial data structures. So someone might come along and contribute the unexpected. Just want to be prepared for that future discussion. Interesting; I'm curious what that might be. bq. A lot of it is quite big so I'll have to add to the build.xml to pull a compressed version from somewhere. Yes, as you may have observed, this is how all existing test data is handled in the benchmark module: fetched decompressed on-demand. I added the Geonames point data. And I added facilities to SpatialDocMaker to automatically turn the points into circles rects of random size. I'm looking forward to seeing PackedQuadPrefixTree kick ass in being compressed allowing distErrPct=0 -- I'm still puzzled but I look forward to being pleasantly astonished. I do understand that it uses all 8 bits (instead of just 2 as the legacy impl) of the first number of bytes for the quadrant info, and that should lead to shorter terms / prefixes for higher-precision data. But I don't see that there would be any change in _the number_ of terms, which is, as I see, the scalability challenge. I have an idea on solving this floating in my head but maybe I needn't ponder it longer if PackedQuadPrefixTree handles it somehow. It's not obvious to me but where in the code of PackedQuadCell are the 5 depth bits encoded decoded? bq. In essence, the rough benchmark you performed doesn't benchmark default Lucene spatial behavior. Is that not a problem? It is a problem -- it's not ideal, that is. Preferably it would stay enabled but I think you indicated it's not supported by PackedQuadTree? I didn't look closer. RE Geo3d: As an example for anything with me; it's an outlier, not an example that proves the rule. If I am blind to facts I can't see then show me. Hopefully you noticed that Karl and I are working together and it's come real far now (lots of discussion on ReviewBoard bugs I helped Karl find via randomized testing). Your patch here has nothing in common with it -- PackedQuadTree _obviously_ belongs here, and it should be quite evident that I'm here helping by providing feedback and running a benchmark. And yes, being critical. Anyway... lets get back to work. The only thing about this patch that is a blocker (-1) for me is StreamingPrefixTreeStrategy. Not most of the code in it, but it's existence as a SpatialStrategy subclass instead of an SPT subclass. I don't mind that it optimizes something that I don't think needed optimization, though I suspect if you noticed TreeCellIterator originally you wouldn't have bothered. I have other by-line feedback I'd like to give (and would _prefer_ to do so via ReviewBoard/GitHub) but we needn't steamroll this in under the mantra of progress not perfection. Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per
[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498829#comment-14498829 ] Michael McCandless commented on LUCENE-6422: bq. It's very normal, in this open-source project anyway, that there is back forth peer-review and changes that are asked of the contributor. Don't worry; something is going to get committed – the query speed is a nice improvement! It's not quite done – that's all. I agree iterating/peer review is healthy, but growing the community is also *very* important in open-source, and at least for the geo3d issue and now this one it worries me when I see barriers being put up that I feel should not be blockers issues for committing. Especially when bus factor is essentially one, in an area as important as spatial, I think encouraging contributions / growing community becomes incredibly important. It's like when we humans intervene for an endangered species... after having caused their predicament in the first place ... sigh. Of course, if there are real technical objection/problems/quality issues for a given patch, those *should* be addressed before committing. It's important to show new people we are eager and excited for their contributions, that the bar is not so high for them to have an impact. We can always review/iterate/benchmark after they are committed, as long as net/net the patch is a step forward as (I think?) this one is. I also wonder whether we need a new, lighter weight spatial module (spatial2? spatia_light?), or maybe spatial_sandbox, where the barrier is lower? The levels of abtractions in the current module look excessive to me and with both the geo3d issue and this issue, correctly fitting in to the existing abstractions seemed to be one of the barriers (e.g. your only blocker here (The only thing about this patch that is a blocker (-1) for me is StreamingPrefixTreeStrategy..) seems to be such an issue). So if we had a more free sandbox the barrier is lower by design. Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499138#comment-14499138 ] David Smiley commented on LUCENE-6422: -- bq. We can always review/iterate/benchmark after they are committed, as long as net/net the patch is a step forward as (I think?) this one is. Mike, this project is defacto run as review then commit, not the other way around. This isn't news to you, of course; it's weird to feel I need to remind you -- we all know this for good bad. I have the same expectation of patches I send elsewhere. I'm not looking for perfection, just moving one thing around (a quality issue as-is IMO), and the rest is minor. I appreciate the points you make generally, but I don't feel I'm setting the bar too high on this issue. Because of the fuss being made here; it's going to be harder for me to give quality feedback on future patches without fear of inducing needless drama. And that's not good; Lucene improves through solid peer review that I see on many patches (particularly yours, by the way). bq. It's important to show new people we are eager and excited for their contributions Let it be known that weeks ago, upon discovery that Elastic had a new spatial SME, I contacted Nick to do a video chat so that I could basically welcome him into the Lucene/Spatial4j spatial area, that I looked forward to seeing what he will contribute, and that I seek'ed his honest impressions / feedback. I simply can't be more sincere, and I hope this demonstrates that I don't want to be the only spatial person around here. RE abstractions: I respect your opinion although I don't agree that there are too many here. On the Geo3D front, it's extremely close now to being committed, and the _only_ abstraction it is fitting into is a Spatial4j Shape interface (BTW Geo3D internally has a bunch of abstractions and I didn't complain to Karl as to the merits of them). It would be good to have one more Spatial4j abstraction hook but it's not a blocker. On this patch, I'm asking Nick _to scale it back_ one, to just the SpatialPrefixTree (which technically includes Cell as that's what SPTs generate). And because of these abstractions, both of them in fact (Shape SpatialPrefixTree), it's possible to re-use indexing and search predicates (Intersects, Within, Contains) and other code that work with these abstractions, without either implementations needing to know they exist. I think it's very powerful and awesome that these things can be combined / leveraged. Of course I don't believe any of these APIs are perfect; the specifics are debatable and I wish there were more people improving it than just me to help make it even better. Now it looks like there are :-) _Sigh_; these conversations are stressfull -- I look forward to getting back to the technical matters of the patch and moving it forward. [~nknize] can you please post a new patch: * baselined on Lucene trunk (standard practice for contributing to Lucene) * with the streamlined cell iterator implementation (what you call streaming) moved to the PackedQuadPrefixTree My added patch had two little extras: the factory benchmark module integration. It would be nice if you could add those. If you don't, I will any how, perhaps with a factory test which I didn't yet do. In fact if I sadly never hear from you again, I'll do all that is spoken of here because we'd all love for this awesomeness to get into Lucene spatial. You've done all the hard work; what remains is small. You certainly don't have to prove to me that the indexes are lean... or rather can be lean under realistic circumstances. I remain very curious about that. Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements
[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498476#comment-14498476 ] Nicholas Knize commented on LUCENE-6422: bq. Please don't call what StreamingPrefixTreeStrategy does as leafy branch pruning; it confuses the important distinction I'm trying to make. That's fair. pruneLeafyBranches isn't a Geo text book term and I didn't have a common nomenclature within this context for discussion re: the behavior so I used the closest shared term, for discussion. If I change it so the leaves are pruned on intersects then its the same behavior. I'll add that to the patch for benchmarking and it will result in a smaller index. Thinking about it more, I think it's the right way to go and a simple enough change that can be iterated in future improvements. bq. All SPTs should stop traversing when the relation is within – that's expected/normal. I want to make sure there is no confusion here (especially for anyone else willing to participate and the sensitivity surrounding my use of the pruneLeafyBranches term). SPT's in the context lucene-spatial's definition, no? (since SPT is not a common geo data structure its a derivative) There are use-cases for traversing beyond the within state in GeoSpatial data structures. So someone might come along and contribute the unexpected. Just want to be prepared for that future discussion. bq. I just don't yet understand how this packed quad encoding is going to allow for distErrPct=0 That helps me better understand the confusion. Your description of the transient memory behavior surrounding pruneLeafyBranches helped me understand why high res shapes w/ distErrPct=0 causes QuadTree to OOM on even reasonably sized shapes (thank you for that clarification). In your benchmark there's a closer performance to the PackedQuad with it disabled (so I still don't understand why enabled is the default). In essence, the rough benchmark you performed doesn't benchmark default Lucene spatial behavior. Is that not a problem? bq. I would love to have more realistic shapes to test with. Awesome! I'll work on adding some of the realistic data to the benchmark. A lot of it is quite big so I'll have to add to the build.xml to pull a compressed version from somewhere. What's cool is that It includes a healthy amount of exotic shapes which better reflect complex geospatial use-cases. Again, that's where you see drastic performance improvements with PackedQuad. The Leaves are always 8 bytes regardless of depth (to a max depth of 29) It introduces at most a 7 byte overhead at low precision (which are fewer terms anyway), so just use QuadTree in those cases, but there's a 21 byte savings on leaves (again, exotic shapes). This is how it improves over QuadTree with distErrPct=0. You don't need to reduce resolution for those large shapes. bq. Are you and Mike in a hurry to see this committed? Apologies if you perceived a level of urgency on my behalf. Regarding Mike, I was giving credit for his quote because I agree re: non-destructive enhancements such as this. Other than that, I can't speak on his behalf and I don't think its fair to speculate on his level of interest re: this single contribution. (soapbox) So you know where I stand, I would simply prefer this (and future spatial contributions) not drag out as long as the geo3d contribution has. IMHO, I agree with you 100% that peer review and discussion is healthy and necessary. I would add, though, there's a tipping point where it becomes a blocker and prevents others from participating. And for this package to improve beyond the great work already done, there needs to be more involvement and some level of organic growth (iterative improvements). Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the
[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496135#comment-14496135 ] David Smiley commented on LUCENE-6422: -- *Awesome work Nick!* It's so nice to see meaty spatial contributions like this (Geo3d is another example). RE Streaming (transient memory use while indexing): I appreciate that the out-of-the box configuration of RPT with either LegacyPrefixTree (be it quad or geohash) will use a lot of memory for indexing. But since... I don't know how long now, this only occurs if the leafy branch pruning optimization is enabled on RPT. That algorithm, existing on RecursivePrefixTreeStrategy, unfortunately buffers all the cells. It's somewhat simple; it could be improved to not buffer all cells but it would need to buffer some. Recently I did some benchmarking and found that the leafy branch pruning yielded lots of index size savings, particularly with the quad tree. I'd love to chat with you about the subject of leaves on the SPT and an idea I have on doing better. Any way, I suggest you do another memory benchmark with leafy branch pruning disabled with the PackedQuadTree but not the StreamingQuad...Strategy. With it disabled, the underlying BytesRefIteratortokenStream will consume a IteratorCell that is a direct instance of TreeCellIterator, and then you get the streaming effect. The existing TreeCellIterator is quite similar to the Streaming...PrefixTreeIterator here. If I'm right about there being no appreciable memory savings, then this part of the patch can be removed as it's redundant. I really like the new PackedQuadPrefixTree.java. (IMO that's what this JIRA issue is mostly about) Can you consider _not_ subclassing Legacy* ? I'd like to leave the legacy trees as-is and new SPTs not inherit from it. Can you base your next patch off of trunk? And can you *either* post on reviewboard.apache.org or use a GitHub fork branch so I can provide by-line feedback? Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496326#comment-14496326 ] Nicholas Knize commented on LUCENE-6422: Thanks David. I will certainly be doing a more thorough benchmark and will start with the suggestions. I imagine the savings will not be as extreme as in the situation with Wales (that was just the most interesting case.) bq. With it disabled, the underlying BytesRefIteratortokenStream will consume a IteratorCell that is a direct instance of TreeCellIterator, and then you get the streaming effect. Just a few thoughts, the StreamingPrefixTreeIterator gives the benefit of a few worlds: 1. It uses an on-demand DFS through bit shifting completely eliminating the need for the stack DFS logic in TreeCellIterator.hasNext. I suppose code-wise it would be cleaner to subclass TreeCellIterator and just override hasNext (and possibly next since I don't set thisCell/current in next)? That's a good idea for code maintenance and reuse. 2. The on-demand DFS traversal already achieves a leafy branch pruning effect by not descending on Cells that already fall within the shape. This gives you pruning without having to buffer anything (other than the current and next cell). This does vary a little bit in that the RPT simply prunes all 4 leaves that intersect the shape. bq. Can you consider not subclassing Legacy* ? I'll certainly take a look at this. I saw the comment about not subclassing and thought about it, but since there is so much reuse with the bytes[], b_off, and b_len (which could be a BytesRef) it didn't make much sense duplicating code. Are you suggesting duplicating code and eventually deprecating the LegacyCell? Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496658#comment-14496658 ] David Smiley commented on LUCENE-6422: -- bq. 1. It uses an on-demand DFS through bit shifting completely eliminating the need for the stack DFS logic in TreeCellIterator.hasNext. I suppose code-wise it would be cleaner to subclass TreeCellIterator and just override hasNext (and possibly next since I don't set thisCell/current in next)? That's a good idea for code maintenance and reuse. Interesting; I need to look at that code closer. Nonetheless, I think a stack vs completely stream one by one is unlikely to be noticeable in a benchmark. The memory use is capped at a small amount either way. If we find there are measurable advantages here, then couldn't you provide this via overriding SpatialPrefixTree.getTreeCellIterator instead of overriding the SpatialStrategy? bq. 2. The on-demand DFS traversal already achieves a leafy branch pruning effect by not descending on Cells that already fall within the shape. This gives you pruning without having to buffer anything (other than the current and next cell). This does vary a little bit in that the RPT simply prunes all 4 leaves that intersect the shape. I think we may be talking about different things. A leafy branch as I defined it in Lucene spatial is a parent cell in which all sub-cells that could theoretically exist are there for this shape and are also leaves. So for a quad-tree, it's a parent with 4 sub-cells that are all leaves. A leaf is a cell that is _either_ within the shape from which it was derived _or_ it overlaps the edge but we've reached a precision threshold. Indexing the 4 leaves produces a larger index than not emitting those leaves at all and instead marking the parent as a leaf -- and the end effect is semantically equivalent in terms of the search predicates. There are some trade-offs; but I won't digress now. bq. Are you suggesting duplicating code and eventually deprecating the LegacyCell? Yes but I need to apply the patch and see how much code is at stake here. At the time I introduced LegacyCell, I refined the SpatialPrefixTree API and was unsatisfied with the only two implementations at that time, in terms of efficiencies, so I called it LegacyPrefixTree and LegacyCell. Perhaps these could still stick around still; I welcome your input on what it might be named or if there is too little code to worry about. But keeping having this extend LegacyCell is problematic because RecursivePrefixTreeStrategy makes an assumption for the branch pruning you can see there, right at the beginning. I welcome whatever thoughts you may have on the API to make it more clear, faster, whatever. Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496686#comment-14496686 ] Michael McCandless commented on LUCENE-6422: Can we commit both approaches for streaming and refactor later (progress not perfection)? I like that the new streaming approach has fewer abstractions on quick glance. I also think further benchmarks need not block progress: they can come later. Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496796#comment-14496796 ] Nicholas Knize commented on LUCENE-6422: bq. I think a stack vs completely stream one by one is unlikely to be noticeable in a benchmark. That's a possibility. It definitely warrants a benchmark to know for sure. I think it could be noticeable for deep QuadTrees? (aka: anything with high precision) as the string representation for one child can get upwards of 29 bytes vs. 8. But its really speculation without benchmark confirmation. Also, just eliminating the need for stack-based DFS logic is a more straightforward approach, no? (no crazy data structures with transient state needed since we're just lexicographically incrementing a compact bit representation on demand) bq. ...couldn't you provide this via overriding SpatialPrefixTree.getTreeCellIterator instead of overriding the SpatialStrategy? I like this suggestion. If I set pruneLeafyBranches to false and override getTreeCellIterator in PackedQuadPrefixTree then, yes, RPT will call the PackedQuadPrefixTree.getTreeCellIterator from RPT.createCellIteratorToIndex and I can eliminate the StreamingPrefixTreeStrategy class altogether. At that point I would suggest we rename RecursivePrefixTreeStrategy to something more accurate, but that can be an API discussion. bq. I think we may be talking about different things. We're talking about the same thing. With one difference in this implementation. At the moment leaves that are on the edge (e.g., intersects) are not pruned. They probably could be - and this optimization would save even more space than its already saving. bq. and the end effect is semantically equivalent in terms of the search predicates. I forgot to mention, this enhancement allows one to disable that memory saving error factor (by setting distance_error_pct to 0) and the result is quad cell precision regardless of shape size (with minimal increase in index size). This was the initial driver for the improvement. Allowing one to eliminate false positives - achieve results that are within the precision they specify without having a fuzzy error factor based on the size of the shape. It doesn't eliminate that factor, it just provides the ability to disable it and achieve your desired precision (subject to earth curvature error) without saturating memory and blowing out the index. bq. Yes but I need to apply the patch and see how much code is at stake here. I honestly think leaving that to another issue is the best strategy (progress not perfection). That's something I can look at as a next step. Until that time comes I don't think there's a problem having this enhancement subclass QuadCell and QuadPrefixTree. Refactoring can be deferred to a separate issue. Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497475#comment-14497475 ] David Smiley commented on LUCENE-6422: -- Just a quick comment... I can see this patch assumes Java 8, yet Lucene 5x is on Java 7. Long.BYTES doesn't exist, and there is no default remove method on Iterator (for PrefixTreeIterator). Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org