subject:"\[jira\] \[Commented\] \(LUCENE\-6422\) Add StreamingQuadPrefixTree"

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-22 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14507064#comment-14507064
 ] 

David Smiley commented on LUCENE-6422:
--

My patch last night was also on trunk.  Is there anything in your -TRUNK patch 
of interest or can I ignore it?

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422-TRUNK.patch, LUCENE-6422.patch, 
 LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-22 Thread Nicholas Knize (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14507076#comment-14507076
 ] 

Nicholas Knize commented on LUCENE-6422:


Good deal. That was not clear so -TRUNK.patch is redundant and can be ignored.  
Nothing additional since yesterday's changes so I'm good as is for both trunk 
and _5x

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422-TRUNK.patch, LUCENE-6422.patch, 
 LUCENE-6422.patch, LUCENE-6422.patch, LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-20 Thread David Smiley (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14502886#comment-14502886
]

David Smiley commented on LUCENE-6422:
--

I'll check out your patch tonight, tomorrow at the latest. Karl/Geo3d has kept
me busy :-)

RE naming: in both cases it seems the current names actually aren't bad
relative to your suggestions. prune is a suffix of pruneLeafyBranches
(current name is more descriptive; in any case one would still need to look at
the javadocs to understand), and SpatialTrie is synonymous with
SpatialPrefixTree given that Trie and PrefixTree are synonyms. I'm +1 to
rename these as you want to 6.x if you think it's worth it. There are
back-compat issues with renaming them _now_. Again, we agree more javadocs
(including suggested alternative names) to add clarification now would be
great. I'll create a patch and seek your input.

RE sandbox: It's not clear to me what is really needed/useful. If someone
comes along with some newfangled index/search spatial approach, it could go in
the module and not hook into any existing interface... except a Lucene Query
class, and something like a Lucene TokenStream/Field for indexing.

Add StreamingQuadPrefixTree
---

Key: LUCENE-6422
URL: https://issues.apache.org/jira/browse/LUCENE-6422
Project: Lucene - Core
Issue Type: Improvement
Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
Attachments: LUCENE-6422.patch, LUCENE-6422.patch,
LUCENE-6422_with_SPT_factory_and_benchmark.patch

To conform to Lucene's inverted index, SpatialStrategies use strings to
represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5
bits per GeoHash cell, respectively. To create the terms representing a
Shape, the BytesRefIteratorTokenStream first builds all of the terms into an
ArrayList of Cells in memory, then passes the ArrayList.Iterator back to
invert() which creates a second lexicographically sorted array of Terms. This
doubles the memory consumption when indexing a shape.
This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to
accomplish the following:
1. Create a packed 8byte representation for a QuadCell
2. Build the Packed cells 'on demand' when incrementToken is called
Improvements over this approach include the generation of the packed cells
using an AutoPrefixAutomaton

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-17 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499520#comment-14499520
 ] 

Michael McCandless commented on LUCENE-6422:


bq. baselined on Lucene trunk (standard practice for contributing to Lucene)

Please don't require this of contributors: it is not standard practice.  Make 
the bar as low as possible!

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-17 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499814#comment-14499814
 ] 

David Smiley commented on LUCENE-6422:
--

That's fair; you're right that the bar shouldn't be the same for committers vs. 
contributors.

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-17 Thread Nicholas Knize (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14500327#comment-14500327
 ] 

Nicholas Knize commented on LUCENE-6422:


bq.  If you can suggest a better name to what leafy branch pruning does, then 
at a minimum it could be expressed in the javadocs

++.  Just 'prune' is probably more clear since its universally used all over 
data structures.  We can add a javadoc comment that describes it in further 
detail if necessary.

bq. if SpatialPrefixTree might have a better literature/industry based name 
then I'd love to know what that is.

There are trie based spatial trees all over (kd-Trie, kd-b-trie, buddy tree) 
industry and the literature. The one you call QuadPrefixTree was originally 
introduced in 1991 called the QuadTrie.  (reference: Gaston H. Gonnet and 
Ricardo Baeza-Yates, Handbook of Algorithms and Data Structures -- in Pascal 
and C, 2nd edition, Addison-Wesley, 1991.)  Dr. Hanan Samet from UMD has a 
great section on MX and PR QuadTrees (same as QuadPrefixTree and a name someone 
mentioned to you in another issue).  He provides a nice discussion on the 
differences between MX, PR and their point based counterparts (compared by the 
decomposition methods).  There's certainly nothing wrong with an implementation 
specific name. If you are asking for suggestions then I offer: SpatialTrie, 
GeoHashTrie, QuadTrie as being shorter, more to the point, and probably more 
relate-able to other spatial SMEs (whom I'm hoping would be willing to get more 
involved). 

bq. It's not obvious to me but where in the code of PackedQuadCell are the 5 
depth bits encoded  decoded?

PackedQuadCell.getLevel() decodes, and its encoded in PackedQuadCell.nextTerm()

bq. Preferably it would stay enabled but I think you indicated it's not 
supported by PackedQuadTree? I didn't look closer.

That's correct.

bq. I also wonder whether we need a new, lighter weight spatial module 
(spatial2? spatia_light?), or maybe spatial_sandbox, where the barrier is lower?

+++ I think this is a great idea for experimental/research features we don't 
want cluttering up the spatial module.

bq. RE abstractions: I respect your opinion although I don't agree that there 
are too many here.

IMHO, this is a slippery slope.  There are so many diverse spatial data 
structures we should be taking a look at for improving spatial search in higher 
order dimensions (4D space-time for just a start).  That's a personal interest 
area for me, in how the most powerful high dimension structures (that already 
exist) can fit within the design and implementation of lucene-core (a green 
field to explore).  Something like this does require a sophisticated 
abstraction framework and this particular one has a bit of a learning curve. I 
think that can work itself out over time with a bit of refactoring (which it 
sounds like all are open to?).  In the meantime it does set the bar rather high 
for new contributors. This is another +1 for a spatial sandbox for experimental 
research (heck make it a separate repo). 

bq. Sigh; these conversations are stressfull 

They're very verbose, but maybe that's the kick in the pants needed to help the 
spatial module really take off. That is, after all, the common goal of the 
community?




 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch, LUCENE-6422.patch, 
 LUCENE-6422_with_SPT_factory_and_benchmark.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-16 Thread Nicholas Knize (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498096#comment-14498096
]

Nicholas Knize commented on LUCENE-6422:

Awesome for putting together the benchmark (even a rough one). Can you attach
the spatial.alg file you used so I can verify (I'm assuming its a variant on
one you've posted before?) I'm getting very different numbers with production
data sets (e.g., high res peninsulas, islands, global political boundaries,
planet osm data - e.g., more exotic shapes than circles larger than a few
hundred KM)

A few questions... and other observations.

bq. chosen arbitrarily; with 27 it choked on memory given 2GB heap

What choked specifically? I'm using PackedQuad with depth between 26 and 29.
1GB heap size using the shapes I described above.

bq. disabling leafy branch pruning to compare apples to apples

Out of curiosity, why is this option enabled by default if it uses transient
storage that doubles memory consumption? Seems backwards to me.

bq. I was skeptical there would be index size savings and the benchmark shows
there aren't any.

IMHO I would avoid these kinds of absolute statements (especially with the
highly variable nature of spatial use-cases). In this situation your numbers do
not surprise me when disabling that leafyBranchPrune option (which still
confuses me why its there?), and using a single shape type with variable size.
There is an outstanding issue in the existing patch (I'll see if I can't push
out a fix today) - the TermsEnum is returning Terms with BytesRef containing
bytes[] that are double the size than they should be (e.g., 16 bytes instead of
8 - all padded w/ zeros). I suspect its some improper configuration in the
reader? So for every high res cell (e.g.), the term will be 16 bytes (still
better than, say, 27).

I think we can do better on simulated test data in the test framework. I love
the randomization and what minimal real data sets that are there are great.
It does not provide the coverage necessary, though, to best simulate some real
world scenarios. That's okay, to steal Mike's quote progress not perfection.
I'll definitely work to provide some more real world tests so we have better
coverage and benchmarking options using real world data. Its a good way to
recommend one indexing structure over another (this one's just the beginning.
There are more indexing structures in trial mode, and even more improvements
for the packed version)

Let's keep this going... Since this patch is non-destructive I don't see a
reason it can't be committed as another option and I can submit enhancement
patches to this feature. That would be up to the community.

Add StreamingQuadPrefixTree
---

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-16 Thread David Smiley (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498260#comment-14498260
]

David Smiley commented on LUCENE-6422:
--

bq. Can you attach the spatial.alg file you used so I can verify

It's in the patch.

bq. What choked specifically? I'm using PackedQuad with depth between 26 and
29. 1GB heap size using the shapes I described above.

-Xmx2G resulted in an OutOfMemoryError, be it for the legacy quad or this new
quad one.

bq. Out of curiosity, why is this option enabled by default if it uses
transient storage that doubles memory consumption? Seems backwards to me.

leafy branch pruning is enabled by default for RPT, although
StreamingPrefixTreeStrategy overrides the method that would trigger it. In
order to compare another tree (legacy quad in this case) fairly to packed-quad,
I disabled leafy branch pruning.

Please don't call what StreamingPrefixTreeStrategy does as leafy branch
pruning; it confuses the important distinction I'm trying to make. All SPTs
should stop traversing when the relation is _within_ -- that's expected/normal.

bq. IMHO I would avoid these kinds of absolute statements (especially with the
highly variable nature of spatial use-cases).

I'm sorry. To be more clear, I just don't yet understand how this packed quad
encoding is going to allow for distErrPct=0. I'd like to understand; please
help me. Knowing what I do know about the SPTs, the results were what I
expected -- similar disk size to existing quad.

bq. I think we can do better on simulated test data in the test framework.

Yes! I would love to have more realistic shapes to test with.

RE progress not perfection:
Yes, Mike uses that quote constantly and I even saw it in your code :-) I have
my catch-phrases too. Are you and Mike in a hurry to see this committed? It's
very normal, in this open-source project anyway, that there is back forth
peer-review and changes that are asked of the contributor. Don't worry;
something is going to get committed -- the query speed is a nice improvement!
It's not quite done -- that's all.

Add StreamingQuadPrefixTree
---

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-16 Thread David Smiley (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498567#comment-14498567
]

David Smiley commented on LUCENE-6422:
--

I welcome any suggestions you have pertaining naming (or again on anything).
If you can suggest a better name to what leafy branch pruning does, then at a
minimum it could be expressed in the javadocs. Not long ago it had another
name but it was even worse ;-). Naming is hard. Likewise... if
SpatialPrefixTree might have a better literature/industry based name then I'd
love to know what that is. When I named it as-such I looked around but didn't
seem to find anything that was perfect. It's a type of Trie, a spatial trie,
and PrefixTree is a synonym for Trie, so... there you go. Maybe I didn't
look hard enough. I'm sorry if I _seemed_ touchy on the terminology; I merely
want to ensure we mutually understood what we were and weren't talking about --
that's all. So when you proposed that what StreamingQuadPrefixTree did was
leafy branch pruning, and as the coiner of the term I can see it didn't meet my
definition; clarification is in order.

bq. I want to make sure there is no confusion here (especially for anyone else
willing to participate and the sensitivity surrounding my use of the
pruneLeafyBranches term). SPT's in the context of lucene-spatial's definition,
no? (since SPT is not a common name for a geo data structure its a derivative)

Yes, in the context of lucene-spatial's definition.

bq. There are use-cases for traversing beyond the within state in GeoSpatial
data structures. So someone might come along and contribute the unexpected.
Just want to be prepared for that future discussion.

Interesting; I'm curious what that might be.

bq. A lot of it is quite big so I'll have to add to the build.xml to pull a
compressed version from somewhere.

Yes, as you may have observed, this is how all existing test data is handled in
the benchmark module: fetched decompressed on-demand. I added the Geonames
point data. And I added facilities to SpatialDocMaker to automatically turn
the points into circles rects of random size.

I'm looking forward to seeing PackedQuadPrefixTree kick ass in being compressed
allowing distErrPct=0 -- I'm still puzzled but I look forward to being
pleasantly astonished. I do understand that it uses all 8 bits (instead of just
2 as the legacy impl) of the first number of bytes for the quadrant info, and
that should lead to shorter terms / prefixes for higher-precision data. But I
don't see that there would be any change in _the number_ of terms, which is, as
I see, the scalability challenge. I have an idea on solving this floating in
my head but maybe I needn't ponder it longer if PackedQuadPrefixTree handles it
somehow. It's not obvious to me but where in the code of PackedQuadCell are
the 5 depth bits encoded decoded?

bq. In essence, the rough benchmark you performed doesn't benchmark default
Lucene spatial behavior. Is that not a problem?

It is a problem -- it's not ideal, that is. Preferably it would stay enabled
but I think you indicated it's not supported by PackedQuadTree? I didn't look
closer.

RE Geo3d: As an example for anything with me; it's an outlier, not an example
that proves the rule. If I am blind to facts I can't see then show me.
Hopefully you noticed that Karl and I are working together and it's come real
far now (lots of discussion on ReviewBoard bugs I helped Karl find via
randomized testing). Your patch here has nothing in common with it --
PackedQuadTree _obviously_ belongs here, and it should be quite evident that
I'm here helping by providing feedback and running a benchmark. And yes, being
critical. Anyway... lets get back to work.

The only thing about this patch that is a blocker (-1) for me is
StreamingPrefixTreeStrategy. Not most of the code in it, but it's existence as
a SpatialStrategy subclass instead of an SPT subclass. I don't mind that it
optimizes something that I don't think needed optimization, though I suspect if
you noticed TreeCellIterator originally you wouldn't have bothered.

I have other by-line feedback I'd like to give (and would _prefer_ to do so via
ReviewBoard/GitHub) but we needn't steamroll this in under the mantra of
progress not perfection.

Add StreamingQuadPrefixTree
---

To conform to Lucene's inverted index, SpatialStrategies use strings to
represent QuadCells and GeoHash cells. Yielding 1 byte per

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-16 Thread Michael McCandless (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498829#comment-14498829
]

Michael McCandless commented on LUCENE-6422:

bq. It's very normal, in this open-source project anyway, that there is back
forth peer-review and changes that are asked of the contributor. Don't worry;
something is going to get committed – the query speed is a nice improvement!
It's not quite done – that's all.

I agree iterating/peer review is healthy, but growing the community is
also *very* important in open-source, and at least for the geo3d issue
and now this one it worries me when I see barriers being put up that I
feel should not be blockers issues for committing.

Especially when bus factor is essentially one, in an area as important
as spatial, I think encouraging contributions / growing community
becomes incredibly important. It's like when we humans intervene for
an endangered species... after having caused their predicament in
the first place ... sigh.

Of course, if there are real technical objection/problems/quality
issues for a given patch, those *should* be addressed before committing.

It's important to show new people we are eager and excited for their
contributions, that the bar is not so high for them to have an impact.
We can always review/iterate/benchmark after they are committed, as
long as net/net the patch is a step forward as (I think?) this one is.

I also wonder whether we need a new, lighter weight spatial module
(spatial2? spatia_light?), or maybe spatial_sandbox, where the
barrier is lower? The levels of abtractions in the current module
look excessive to me and with both the geo3d issue and this issue,
correctly fitting in to the existing abstractions seemed to be one
of the barriers (e.g. your only blocker here (The only thing about
this patch that is a blocker (-1) for me is
StreamingPrefixTreeStrategy..) seems to be such an issue). So if we
had a more free sandbox the barrier is lower by design.

Add StreamingQuadPrefixTree
---

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-16 Thread David Smiley (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14499138#comment-14499138
]

David Smiley commented on LUCENE-6422:
--

bq. We can always review/iterate/benchmark after they are committed, as long as
net/net the patch is a step forward as (I think?) this one is.

Mike, this project is defacto run as review then commit, not the other way
around. This isn't news to you, of course; it's weird to feel I need to remind
you -- we all know this for good bad. I have the same expectation of patches
I send elsewhere. I'm not looking for perfection, just moving one thing around
(a quality issue as-is IMO), and the rest is minor. I appreciate the points
you make generally, but I don't feel I'm setting the bar too high on this
issue. Because of the fuss being made here; it's going to be harder for me to
give quality feedback on future patches without fear of inducing needless
drama. And that's not good; Lucene improves through solid peer review that I
see on many patches (particularly yours, by the way).

bq. It's important to show new people we are eager and excited for their
contributions

Let it be known that weeks ago, upon discovery that Elastic had a new spatial
SME, I contacted Nick to do a video chat so that I could basically welcome him
into the Lucene/Spatial4j spatial area, that I looked forward to seeing what he
will contribute, and that I seek'ed his honest impressions / feedback. I
simply can't be more sincere, and I hope this demonstrates that I don't want to
be the only spatial person around here.

RE abstractions: I respect your opinion although I don't agree that there are
too many here. On the Geo3D front, it's extremely close now to being
committed, and the _only_ abstraction it is fitting into is a Spatial4j Shape
interface (BTW Geo3D internally has a bunch of abstractions and I didn't
complain to Karl as to the merits of them). It would be good to have one more
Spatial4j abstraction hook but it's not a blocker. On this patch, I'm asking
Nick _to scale it back_ one, to just the SpatialPrefixTree (which technically
includes Cell as that's what SPTs generate). And because of these
abstractions, both of them in fact (Shape SpatialPrefixTree), it's possible
to re-use indexing and search predicates (Intersects, Within, Contains) and
other code that work with these abstractions, without either implementations
needing to know they exist. I think it's very powerful and awesome that these
things can be combined / leveraged. Of course I don't believe any of these
APIs are perfect; the specifics are debatable and I wish there were more people
improving it than just me to help make it even better. Now it looks like there
are :-)

_Sigh_; these conversations are stressfull -- I look forward to getting back to
the technical matters of the patch and moving it forward. [~nknize] can you
please post a new patch:
* baselined on Lucene trunk (standard practice for contributing to Lucene)
* with the streamlined cell iterator implementation (what you call streaming)
moved to the PackedQuadPrefixTree

My added patch had two little extras: the factory benchmark module
integration. It would be nice if you could add those. If you don't, I will
any how, perhaps with a factory test which I didn't yet do. In fact if I sadly
never hear from you again, I'll do all that is spoken of here because we'd all
love for this awesomeness to get into Lucene spatial. You've done all the hard
work; what remains is small.

You certainly don't have to prove to me that the indexes are lean... or rather
can be lean under realistic circumstances. I remain very curious about that.

Add StreamingQuadPrefixTree
---

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-16 Thread Nicholas Knize (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498476#comment-14498476
]

Nicholas Knize commented on LUCENE-6422:

bq. Please don't call what StreamingPrefixTreeStrategy does as leafy branch
pruning; it confuses the important distinction I'm trying to make.

That's fair. pruneLeafyBranches isn't a Geo text book term and I didn't have a
common nomenclature within this context for discussion re: the behavior so I
used the closest shared term, for discussion. If I change it so the leaves are
pruned on intersects then its the same behavior. I'll add that to the patch for
benchmarking and it will result in a smaller index. Thinking about it more, I
think it's the right way to go and a simple enough change that can be iterated
in future improvements.

bq. All SPTs should stop traversing when the relation is within – that's
expected/normal.

I want to make sure there is no confusion here (especially for anyone else
willing to participate and the sensitivity surrounding my use of the
pruneLeafyBranches term). SPT's in the context lucene-spatial's definition,
no? (since SPT is not a common geo data structure its a derivative) There are
use-cases for traversing beyond the within state in GeoSpatial data structures.
So someone might come along and contribute the unexpected. Just want to be
prepared for that future discussion.

bq. I just don't yet understand how this packed quad encoding is going to allow
for distErrPct=0

That helps me better understand the confusion. Your description of the
transient memory behavior surrounding pruneLeafyBranches helped me understand
why high res shapes w/ distErrPct=0 causes QuadTree to OOM on even reasonably
sized shapes (thank you for that clarification). In your benchmark there's a
closer performance to the PackedQuad with it disabled (so I still don't
understand why enabled is the default). In essence, the rough benchmark you
performed doesn't benchmark default Lucene spatial behavior. Is that not a
problem?

bq. I would love to have more realistic shapes to test with.

Awesome! I'll work on adding some of the realistic data to the benchmark. A lot
of it is quite big so I'll have to add to the build.xml to pull a compressed
version from somewhere. What's cool is that It includes a healthy amount of
exotic shapes which better reflect complex geospatial use-cases. Again, that's
where you see drastic performance improvements with PackedQuad. The Leaves are
always 8 bytes regardless of depth (to a max depth of 29) It introduces at
most a 7 byte overhead at low precision (which are fewer terms anyway), so just
use QuadTree in those cases, but there's a 21 byte savings on leaves (again,
exotic shapes). This is how it improves over QuadTree with distErrPct=0. You
don't need to reduce resolution for those large shapes.

bq. Are you and Mike in a hurry to see this committed?

Apologies if you perceived a level of urgency on my behalf. Regarding Mike, I
was giving credit for his quote because I agree re: non-destructive
enhancements such as this. Other than that, I can't speak on his behalf and I
don't think its fair to speculate on his level of interest re: this single
contribution.

(soapbox) So you know where I stand, I would simply prefer this (and future
spatial contributions) not drag out as long as the geo3d contribution has.
IMHO, I agree with you 100% that peer review and discussion is healthy and
necessary. I would add, though, there's a tipping point where it becomes a
blocker and prevents others from participating. And for this package to
improve beyond the great work already done, there needs to be more involvement
and some level of organic growth (iterative improvements).

Add StreamingQuadPrefixTree
---

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-15 Thread David Smiley (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496135#comment-14496135
]

David Smiley commented on LUCENE-6422:
--

*Awesome work Nick!* It's so nice to see meaty spatial contributions like this
(Geo3d is another example).

RE Streaming (transient memory use while indexing): I appreciate that the
out-of-the box configuration of RPT with either LegacyPrefixTree (be it quad or
geohash) will use a lot of memory for indexing. But since... I don't know how
long now, this only occurs if the leafy branch pruning optimization is
enabled on RPT. That algorithm, existing on RecursivePrefixTreeStrategy,
unfortunately buffers all the cells. It's somewhat simple; it could be improved
to not buffer all cells but it would need to buffer some. Recently I did some
benchmarking and found that the leafy branch pruning yielded lots of index size
savings, particularly with the quad tree. I'd love to chat with you about the
subject of leaves on the SPT and an idea I have on doing better. Any way, I
suggest you do another memory benchmark with leafy branch pruning disabled with
the PackedQuadTree but not the StreamingQuad...Strategy. With it disabled, the
underlying BytesRefIteratortokenStream will consume a IteratorCell that is a
direct instance of TreeCellIterator, and then you get the streaming effect.
The existing TreeCellIterator is quite similar to the
Streaming...PrefixTreeIterator here. If I'm right about there being no
appreciable memory savings, then this part of the patch can be removed as it's
redundant.

I really like the new PackedQuadPrefixTree.java. (IMO that's what this JIRA
issue is mostly about) Can you consider _not_ subclassing Legacy* ? I'd like
to leave the legacy trees as-is and new SPTs not inherit from it. Can you base
your next patch off of trunk? And can you *either* post on
reviewboard.apache.org or use a GitHub fork branch so I can provide by-line
feedback?

Add StreamingQuadPrefixTree
---

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-15 Thread Nicholas Knize (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496326#comment-14496326
]

Nicholas Knize commented on LUCENE-6422:

Thanks David. I will certainly be doing a more thorough benchmark and will
start with the suggestions. I imagine the savings will not be as extreme as in
the situation with Wales (that was just the most interesting case.)

bq. With it disabled, the underlying BytesRefIteratortokenStream will consume a
IteratorCell that is a direct instance of TreeCellIterator, and then you get
the streaming effect.

Just a few thoughts, the StreamingPrefixTreeIterator gives the benefit of a few
worlds:
1. It uses an on-demand DFS through bit shifting completely eliminating the
need for the stack DFS logic in TreeCellIterator.hasNext. I suppose code-wise
it would be cleaner to subclass TreeCellIterator and just override hasNext (and
possibly next since I don't set thisCell/current in next)? That's a good idea
for code maintenance and reuse.
2. The on-demand DFS traversal already achieves a leafy branch pruning effect
by not descending on Cells that already fall within the shape. This gives you
pruning without having to buffer anything (other than the current and next
cell). This does vary a little bit in that the RPT simply prunes all 4 leaves
that intersect the shape.

bq. Can you consider not subclassing Legacy* ?

I'll certainly take a look at this. I saw the comment about not subclassing and
thought about it, but since there is so much reuse with the bytes[], b_off, and
b_len (which could be a BytesRef) it didn't make much sense duplicating code.
Are you suggesting duplicating code and eventually deprecating the LegacyCell?

Add StreamingQuadPrefixTree
---

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-15 Thread David Smiley (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496658#comment-14496658
]

David Smiley commented on LUCENE-6422:
--

bq. 1. It uses an on-demand DFS through bit shifting completely eliminating the
need for the stack DFS logic in TreeCellIterator.hasNext. I suppose code-wise
it would be cleaner to subclass TreeCellIterator and just override hasNext (and
possibly next since I don't set thisCell/current in next)? That's a good idea
for code maintenance and reuse.

Interesting; I need to look at that code closer. Nonetheless, I think a stack
vs completely stream one by one is unlikely to be noticeable in a benchmark.
The memory use is capped at a small amount either way. If we find there are
measurable advantages here, then couldn't you provide this via overriding
SpatialPrefixTree.getTreeCellIterator instead of overriding the SpatialStrategy?

bq. 2. The on-demand DFS traversal already achieves a leafy branch pruning
effect by not descending on Cells that already fall within the shape. This
gives you pruning without having to buffer anything (other than the current and
next cell). This does vary a little bit in that the RPT simply prunes all 4
leaves that intersect the shape.

I think we may be talking about different things. A leafy branch as I
defined it in Lucene spatial is a parent cell in which all sub-cells that could
theoretically exist are there for this shape and are also leaves. So for a
quad-tree, it's a parent with 4 sub-cells that are all leaves. A leaf is a
cell that is _either_ within the shape from which it was derived _or_ it
overlaps the edge but we've reached a precision threshold. Indexing the 4
leaves produces a larger index than not emitting those leaves at all and
instead marking the parent as a leaf -- and the end effect is semantically
equivalent in terms of the search predicates. There are some trade-offs; but I
won't digress now.

bq. Are you suggesting duplicating code and eventually deprecating the
LegacyCell?

Yes but I need to apply the patch and see how much code is at stake here. At
the time I introduced LegacyCell, I refined the SpatialPrefixTree API and was
unsatisfied with the only two implementations at that time, in terms of
efficiencies, so I called it LegacyPrefixTree and LegacyCell. Perhaps these
could still stick around still; I welcome your input on what it might be named
or if there is too little code to worry about. But keeping having this extend
LegacyCell is problematic because RecursivePrefixTreeStrategy makes an
assumption for the branch pruning you can see there, right at the beginning.

I welcome whatever thoughts you may have on the API to make it more clear,
faster, whatever.

Add StreamingQuadPrefixTree
---

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-15 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496686#comment-14496686
 ] 

Michael McCandless commented on LUCENE-6422:


Can we commit both approaches for streaming and refactor later (progress not 
perfection)?

I like that the new streaming approach has fewer abstractions on quick glance.

I also think further benchmarks need not block progress: they can come later.

 Add StreamingQuadPrefixTree
 ---

 Key: LUCENE-6422
 URL: https://issues.apache.org/jira/browse/LUCENE-6422
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/spatial
Affects Versions: 5.x
Reporter: Nicholas Knize
 Attachments: LUCENE-6422.patch


 To conform to Lucene's inverted index, SpatialStrategies use strings to 
 represent QuadCells and GeoHash cells. Yielding 1 byte per QuadCell and 5 
 bits per GeoHash cell, respectively.  To create the terms representing a 
 Shape, the BytesRefIteratorTokenStream first builds all of the terms into an 
 ArrayList of Cells in memory, then passes the ArrayList.Iterator back to 
 invert() which creates a second lexicographically sorted array of Terms. This 
 doubles the memory consumption when indexing a shape.
 This task introduces a PackedQuadPrefixTree that uses a StreamingStrategy to 
 accomplish the following:
 1.  Create a packed 8byte representation for a QuadCell
 2.  Build the Packed cells 'on demand' when incrementToken is called
 Improvements over this approach include the generation of the packed cells 
 using an AutoPrefixAutomaton



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-15 Thread Nicholas Knize (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496796#comment-14496796
]

Nicholas Knize commented on LUCENE-6422:

bq. I think a stack vs completely stream one by one is unlikely to be
noticeable in a benchmark.

That's a possibility. It definitely warrants a benchmark to know for sure. I
think it could be noticeable for deep QuadTrees? (aka: anything with high
precision) as the string representation for one child can get upwards of 29
bytes vs. 8. But its really speculation without benchmark confirmation. Also,
just eliminating the need for stack-based DFS logic is a more straightforward
approach, no? (no crazy data structures with transient state needed since we're
just lexicographically incrementing a compact bit representation on demand)

bq. ...couldn't you provide this via overriding
SpatialPrefixTree.getTreeCellIterator instead of overriding the SpatialStrategy?

I like this suggestion. If I set pruneLeafyBranches to false and override
getTreeCellIterator in PackedQuadPrefixTree then, yes, RPT will call the
PackedQuadPrefixTree.getTreeCellIterator from RPT.createCellIteratorToIndex and
I can eliminate the StreamingPrefixTreeStrategy class altogether. At that
point I would suggest we rename RecursivePrefixTreeStrategy to something more
accurate, but that can be an API discussion.

bq. I think we may be talking about different things.

We're talking about the same thing. With one difference in this
implementation. At the moment leaves that are on the edge (e.g., intersects)
are not pruned. They probably could be - and this optimization would save even
more space than its already saving.

bq. and the end effect is semantically equivalent in terms of the search
predicates.

I forgot to mention, this enhancement allows one to disable that memory saving
error factor (by setting distance_error_pct to 0) and the result is quad cell
precision regardless of shape size (with minimal increase in index size). This
was the initial driver for the improvement. Allowing one to eliminate false
positives - achieve results that are within the precision they specify without
having a fuzzy error factor based on the size of the shape. It doesn't
eliminate that factor, it just provides the ability to disable it and achieve
your desired precision (subject to earth curvature error) without saturating
memory and blowing out the index.

bq. Yes but I need to apply the patch and see how much code is at stake here.

I honestly think leaving that to another issue is the best strategy (progress
not perfection). That's something I can look at as a next step. Until that
time comes I don't think there's a problem having this enhancement subclass
QuadCell and QuadPrefixTree. Refactoring can be deferred to a separate issue.

Add StreamingQuadPrefixTree
---

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

2015-04-15 Thread David Smiley (JIRA)

[
https://issues.apache.org/jira/browse/LUCENE-6422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497475#comment-14497475
]

David Smiley commented on LUCENE-6422:
--

Just a quick comment... I can see this patch assumes Java 8, yet Lucene 5x is
on Java 7. Long.BYTES doesn't exist, and there is no default remove method
on Iterator (for PrefixTreeIterator).

Add StreamingQuadPrefixTree
---

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

[jira] [Commented] (LUCENE-6422) Add StreamingQuadPrefixTree

18 matches

Site Navigation

Mail list logo

Footer information