[
https://issues.apache.org/jira/browse/LUCENE-4942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Smiley updated LUCENE-4942:
---------------------------------
Attachment: LUCENE-4942_non-point_excessive_terms.patch
The attached patch _does not_ have the "+" / "*" (approximated leaf vs
contained leaf) leaf type differentiation; that can wait.
Summary of patch changes:
* CellTokenStream: removed the dual/redundant indexing it was doing for leaf
cells. This simplified it, and I further simplified it to the point that CTS
is now really a generic TokenStream for a BytesRefIterator you give it. I have
a nocommit to rename CellTokenStream to BytesRefIteratorTokenStream.
* Related to the CellTokenStream change, I refactored PrefixTreeStrategy a
little to now have a protected createCellIteratorToIndex() and protected
newCellToBytesRefIterator(), and added a CellToBytesRefIterator class. The
particular arrangement paves the way for TokenStream re-use (LUCENE-5776),
although it leaves the actual re-use to a future patch on that issue.
* TermQueryPrefixTreeStrategy overrides newCellToBytesRefIterator to return a
CTBRI subclass that does not have the leaf byte (since this strategy doesn’t
query for them).
* Primary search-time changes were in AbstractVisitingPrefixTreeFilter (the
base of Intersects, Within, heatmaps), WithinPrefixTreeFilter, and
ContainsPrefixTreeFilter.
* ContainsPrefixTreeFilter now does more leap-frogging than it used to; it’s
probably a bit faster as a result.
* Enhanced the toString() methods in the Filters to include the query shape.
* (Refactoring) Cell.isLeaf() should always return true if its level ==
maxLevels, and I clarified that when cell.isLeaf() is false, the cell is a
“prefix” (effectively the opposite of a leaf), which means there are cells at
further resolutions (greater levels). For the Quad & Geohash PrefixTrees, it’s
an implementation detail that they don’t append the ‘+’ because doing so is
redundant/implied.
* (Refactoring) AbstractVisitingPrefixTreeFilter (the base of Intersects,
Within, heatmaps) no longer has a hasIndexedLeaves boolean flag to supposedly
make it faster for the all-points case. The checks where it might be relevant
are very cheap so I’d rather keep this class simpler.
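To make the indexing change concrete, here is a toy sketch of the term
emission before and after the patch. It is not Lucene code: the Cell class,
the method names, and the sample tokens are all hypothetical, and plain
Strings stand in for the BytesRef terms. It only illustrates the idea that a
leaf cell used to produce two terms and now produces one, with the leaf
marker omitted for a strategy that never queries for it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only, not Lucene code. Every name here is hypothetical. */
final class LeafTermSketch {

  /** A grid cell reduced to its token string plus a leaf flag. */
  static final class Cell {
    final String token;
    final boolean leaf;

    Cell(String token, boolean leaf) {
      this.token = token;
      this.leaf = leaf;
    }
  }

  /** Stand-in for the trailing leaf byte described in the issue. */
  static final char LEAF_MARKER = '+';

  /** Old behavior per the issue: leaf cells are indexed with and without the marker. */
  static List<String> termsWithDualLeafIndexing(List<Cell> cells) {
    List<String> terms = new ArrayList<>();
    for (Cell c : cells) {
      terms.add(c.token);
      if (c.leaf) {
        terms.add(c.token + LEAF_MARKER);
      }
    }
    return terms;
  }

  /**
   * One term per cell. The marker is appended to leaf cells only when
   * includeLeafMarker is true (false models a TermQuery-style strategy).
   */
  static List<String> termsWithSingleIndexing(List<Cell> cells, boolean includeLeafMarker) {
    List<String> terms = new ArrayList<>();
    for (Cell c : cells) {
      terms.add(c.leaf && includeLeafMarker ? c.token + LEAF_MARKER : c.token);
    }
    return terms;
  }

  /** Made-up sample: two prefix cells and three leaf cells. */
  static List<Cell> sampleCells() {
    return Arrays.asList(
        new Cell("9", false),
        new Cell("9q", false),
        new Cell("9q8", true),
        new Cell("9q9", true),
        new Cell("9qb", true));
  }

  public static void main(String[] args) {
    List<Cell> cells = sampleCells();
    System.out.println(termsWithDualLeafIndexing(cells));
    System.out.println(termsWithSingleIndexing(cells, true));
    System.out.println(termsWithSingleIndexing(cells, false));
  }
}
```

The term counts for this made-up input say nothing about real index sizes;
that still needs to be measured with lucene/benchmark.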
Tests pass; I'll try precommit later. I've yet to try lucene/benchmark and
examine the index size change.
> Indexed non-point shapes index excessive terms
> ----------------------------------------------
>
> Key: LUCENE-4942
> URL: https://issues.apache.org/jira/browse/LUCENE-4942
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/spatial
> Reporter: David Smiley
> Assignee: David Smiley
> Attachments: LUCENE-4942_non-point_excessive_terms.patch
>
>
> Indexed non-point shapes are comprised of a set of terms that represent grid
> cells. Cells completely within the shape or cells on the intersecting edge
> that are at the maximum detail depth being indexed for the shape are denoted
> as "leaf" cells. Such cells have a trailing '+' at the end. _Such tokens
> are actually indexed twice_, once with the leaf byte and once without.
> The TermQuery based PrefixTree Strategy doesn't consider the notion of 'leaf'
> cells and so the tokens with '+' are completely redundant.
> The Recursive [algorithm] based PrefixTree Strategy better supports correct
> search of indexed non-point shapes than TermQuery does and the distinction is
> relevant. However, the foundational search algorithms used by this strategy
> (Intersects & Contains; the other 2 are based on these) could each be
> upgraded to deal with this correctly. Not trivial but very doable.
> In the end, spatial non-point indexes can probably be trimmed by ~40% by
> doing this.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)