[jira] [Updated] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-22 Thread Nicholas Knize (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Knize updated LUCENE-6422:
---
Attachment: LUCENE-6422-TRUNK.patch

LGTM. I've also attached a patch off trunk as requested. I'll have a look on 
LOE for deprecating LegacyCell, and LegacyPrefixTree along with some general 
code cleanup and javadoc clarity. 

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422-TRUNK.patch, LUCENE-6422.patch, 
 LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-21 Thread Nicholas Knize (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Knize updated LUCENE-6422:
---
Attachment: LUCENE-6422.patch

Updated PackedQuadPrefixTree to iinclude optional leafyBranchPruning.  Other 
minor code cleanup also included.

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-21 Thread David Smiley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated LUCENE-6422:
-
Attachment: LUCENE-6422.patch

Latest patch:
* Removed some redundant casts
* moved PrefixTreeIterator.leaves to a local variable of pruned() since that's 
the only place it was used.
* implemented toString() so that the leafyPrune state could be printed
* Removed modifications to RecursivePrefixTreeStrategy -- an instanceof check 
related to leafy branch pruning.  PackedQuadPrefixTree does in fact support RPT 
doing the leaf pruning, so there's no instance-of check needed.  It's up to the 
user to have this tree do it or have RPT do it.  They aren't the same since 
RPT's impl is recursive whereas PQPT's impl is only at the last level (the 
level where it has the most benefit, yes).  I also modified the fuzzy test to 
independently set the leafy branch prune option on them, just to test it works. 
 
* Enhanced getTreeCellIterator/PrefixTreeIterator to honor the detailLevel 
parameter instead of always going to maxLevels. This was a simple matter of 
passing through this parameter to the iterator and renaming maxLevels to 
detailLevel there.
* PQPT.Cell.toString: moved Long.numberOfLeadingZeros out of the loop, and 
changed the O(N^2) String append to use a StringBuilder.
* Optimized compareToNoLeaf to simply compare the longs instead of converting 
to bytes and comparing byte by byte.  I put in an assert to check for parity 
with the old algorithm.
* Added back in Benchmark SpatialDocMaker stuff, but with the leafy branch 
prune option set on PQPT grid if it's of that type instead of the strategy.

I did some testing; notably running the fuzzy test with 10k iterations and 
temporarily set to test just the packed quad.

I think it's ready from my point of view but I'd like to get your input on my 
changes Nick.

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422.patch, 
 LUCENE-6422.patch, LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-17 Thread Nicholas Knize (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Knize updated LUCENE-6422:
---
Attachment: LUCENE-6422.patch

Updated to remove StreamingPrefixTreeStrategy.  PackedQuadTree is now self 
contained to one file and uses the RecursivePrefixTreeStrategy but ignores 
leafyBranchPruning.  

Still only integrated and tested on branch_5x, per discussion below.

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch, LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-15 Thread David Smiley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated LUCENE-6422:
-
Attachment: LUCENE-6422_with_SPT_factory_and_benchmark.patch

See this new patch.
* Fixed compilation against Java 7
* Added SpatialPrefixTreeFactory support so this tree can be chosen by-name (in 
e.g. the benchmark module or Solr for that matter)
* Added benchmark module support by enhancing SpatialDocMaker (which uses the 
aforementioned factory).
* included a tweaked spatial.alg

I did some benchmarking -- pretty quick  rough right now.  I set max levels to 
20 (chosen arbitrarily; with 27 it choked on memory given 2GB heap), with 
distErrPct of 0.0, and indexing random circles up to 3 decimal degrees (a few 
hundred KM or so), and disabling leafy branch pruning to compare apples to 
apples.

I ran it with quad and packedQuad with the same settings otherwise.
* Index size: Quad: 1.4GB, PackedQuad 1.6GB
* Index time: Quad: 1.46 rec/sec, PackedQuad: 1.91 rec/sec
* Query time: Quad: 6.35 rec/sec, PackedQuad: 7.21 rec/sec
* The benchmark module shows average memory use but I always look at that with 
a grain of salt.  Seems PackedQuad *might* use a little more mem during 
indexing and less during search.  Shrug.

I was skeptical there would be index size savings and the benchmark shows there 
aren't any.  Please prove me wrong, Nick!  I like the indexing  query speed 
improvements -- not surprised given the nice code here without the ugly 
recursion that was in legacy Quad.

off to bed now...

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-14 Thread Nicholas Knize (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Knize updated LUCENE-6422:
---
Issue Type: Improvement  (was: Bug)

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Reporter: Nicholas Knize

 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-14 Thread Nicholas Knize (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Knize updated LUCENE-6422:
---
Affects Version/s: 5.x

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize

 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-14 Thread Nicholas Knize (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Knize updated LUCENE-6422:
---
Attachment: LUCENE-6422.patch

Developed and tested against branch_5x. I need to run more rigorous (and 
systematic) benchmarking but preliminary tests show a 90% reduction in index 
size on exotic cases (high precision).

One particular shape (the political boundary of Wales): using QuadPrefixTree 
(w/ RecursivePrefixStrategy) consumed 1G of memory at TreeLevel 26 with 
distance_error_pct: 0.  The new PackedQuadPrefixTree brought this down to just 
over 80mb with the same precision.

There are many improvements remaining (including using variable byte array 
instead of 8 bytes for even the lowest levels). But this provides initial 
progress that should open the door for better precision on extreme shapes.

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org