[jira] [Updated] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Knize updated LUCENE-6422: --- Attachment: LUCENE-6422-TRUNK.patch LGTM. I've also attached a patch off trunk as requested. I'll have a look on LOE for deprecating LegacyCell, and LegacyPrefixTree along with some general code cleanup and javadoc clarity. Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422-TRUNK.patch, LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Knize updated LUCENE-6422: --- Attachment: LUCENE-6422.patch Updated PackedQuadPrefixTree to iinclude optional leafyBranchPruning. Other minor code cleanup also included. Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Smiley updated LUCENE-6422: - Attachment: LUCENE-6422.patch Latest patch: * Removed some redundant casts * moved PrefixTreeIterator.leaves to a local variable of pruned() since that's the only place it was used. * implemented toString() so that the leafyPrune state could be printed * Removed modifications to RecursivePrefixTreeStrategy -- an instanceof check related to leafy branch pruning. PackedQuadPrefixTree does in fact support RPT doing the leaf pruning, so there's no instance-of check needed. It's up to the user to have this tree do it or have RPT do it. They aren't the same since RPT's impl is recursive whereas PQPT's impl is only at the last level (the level where it has the most benefit, yes). I also modified the fuzzy test to independently set the leafy branch prune option on them, just to test it works. * Enhanced getTreeCellIterator/PrefixTreeIterator to honor the detailLevel parameter instead of always going to maxLevels. This was a simple matter of passing through this parameter to the iterator and renaming maxLevels to detailLevel there. * PQPT.Cell.toString: moved Long.numberOfLeadingZeros out of the loop, and changed the O(N^2) String append to use a StringBuilder. * Optimized compareToNoLeaf to simply compare the longs instead of converting to bytes and comparing byte by byte. I put in an assert to check for parity with the old algorithm. * Added back in Benchmark SpatialDocMaker stuff, but with the leafy branch prune option set on PQPT grid if it's of that type instead of the strategy. I did some testing; notably running the fuzzy test with 10k iterations and temporarily set to test just the packed quad. I think it's ready from my point of view but I'd like to get your input on my changes Nick. Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Knize updated LUCENE-6422: --- Attachment: LUCENE-6422.patch Updated to remove StreamingPrefixTreeStrategy. PackedQuadTree is now self contained to one file and uses the RecursivePrefixTreeStrategy but ignores leafyBranchPruning. Still only integrated and tested on branch_5x, per discussion below. Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Smiley updated LUCENE-6422: - Attachment: LUCENE-6422_with_SPT_factory_and_benchmark.patch See this new patch. * Fixed compilation against Java 7 * Added SpatialPrefixTreeFactory support so this tree can be chosen by-name (in e.g. the benchmark module or Solr for that matter) * Added benchmark module support by enhancing SpatialDocMaker (which uses the aforementioned factory). * included a tweaked spatial.alg I did some benchmarking -- pretty quick rough right now. I set max levels to 20 (chosen arbitrarily; with 27 it choked on memory given 2GB heap), with distErrPct of 0.0, and indexing random circles up to 3 decimal degrees (a few hundred KM or so), and disabling leafy branch pruning to compare apples to apples. I ran it with quad and packedQuad with the same settings otherwise. * Index size: Quad: 1.4GB, PackedQuad 1.6GB * Index time: Quad: 1.46 rec/sec, PackedQuad: 1.91 rec/sec * Query time: Quad: 6.35 rec/sec, PackedQuad: 7.21 rec/sec * The benchmark module shows average memory use but I always look at that with a grain of salt. Seems PackedQuad *might* use a little more mem during indexing and less during search. Shrug. I was skeptical there would be index size savings and the benchmark shows there aren't any. Please prove me wrong, Nick! I like the indexing query speed improvements -- not surprised given the nice code here without the ugly recursion that was in legacy Quad. off to bed now... Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Knize updated LUCENE-6422: --- Issue Type: Improvement (was: Bug) Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Reporter: Nicholas Knize To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Knize updated LUCENE-6422: --- Affects Version/s: 5.x Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-6422) Add StreamingQuadPrefixTree
[ https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Knize updated LUCENE-6422: --- Attachment: LUCENE-6422.patch Developed and tested against branch_5x. I need to run more rigorous (and systematic) benchmarking but preliminary tests show a 90% reduction in index size on exotic cases (high precision). One particular shape (the political boundary of Wales): using QuadPrefixTree (w/ RecursivePrefixStrategy) consumed 1G of memory at TreeLevel 26 with distance_error_pct: 0. The new PackedQuadPrefixTree brought this down to just over 80mb with the same precision. There are many improvements remaining (including using variable byte array instead of 8 bytes for even the lowest levels). But this provides initial progress that should open the door for better precision on extreme shapes. Add StreamingQuadPrefixTree --- Key: LUCENE-6422 URL: https://issues.apache.org/jira/browse/LUCENE-6422 Project: Lucene - Core Issue Type: Improvement Components: modules/spatial Affects Versions: 5.x Reporter: Nicholas Knize Attachments: LUCENE-6422.patch To conform to Lucene's inverted index, SpatialStrategies use strings to represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 bits per GeoHash cell, respectively. To create the terms representing a Shape, the BytesRefIteratorTokenStream first builds all of the terms into an ArrayList of Cells in memory, then passes the ArrayList.Iterator back to invert() which creates a second lexicographically sorted array of Terms. This doubles the memory consumption when indexing a shape. This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to accomplish the following: 1. Create a packed 8byte representation for a QuadCell 2. Build the Packed cells 'on demand' when incrementToken is called Improvements over this approach include the generation of the packed cells using an AutoPrefixAutomaton -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org