[ 
https://issues.apache.org/jira/browse/LUCENE-4942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635440#comment-13635440
 ] 

David Smiley commented on LUCENE-4942:
--------------------------------------

You don't ;-)   This is why I believe TermQueryStrategy is fundamentally flawed 
for indexing non-point shapes.  Yet AFAIK it's the choice ElasticSearch wants 
to use (or at least wanted).  In ES if you indexed a country and your search 
box is something small in the middle of that country, you *won't* match that 
country.
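
A toy sketch of that miss, not Lucene's actual API: cells are modeled as plain strings, where a longer string is a finer cell nested inside its prefixes. The class and method names are hypothetical, and it assumes the query looks up only the search box's own cells, without their ancestors.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: a cell is a string, and a longer string is a finer cell
// nested inside the cell named by each of its prefixes. Not Lucene's API.
final class TermMatchSketch {

    // Exact-term lookup: match only if a query cell was indexed verbatim.
    static boolean exactTermMatch(Set<String> indexed, List<String> queryCells) {
        for (String q : queryCells) {
            if (indexed.contains(q)) {
                return true;
            }
        }
        return false;
    }

    // Prefix-aware lookup: also match when an indexed leaf ('+') covers the
    // query cell, i.e. the leaf's cell is the query cell or one of its ancestors.
    static boolean prefixAwareMatch(Set<String> indexed, List<String> queryCells) {
        for (String q : queryCells) {
            for (int len = 1; len <= q.length(); len++) {
                if (indexed.contains(q.substring(0, len) + "+")) {
                    return true;
                }
            }
        }
        return exactTermMatch(indexed, queryCells);
    }

    public static void main(String[] args) {
        // Hypothetical country: cell "AB" lies wholly inside it, so indexing stops there.
        Set<String> country = new HashSet<>(Arrays.asList("A", "AB", "AB+"));
        // Hypothetical small search box deep inside "AB".
        List<String> box = Arrays.asList("ABCD");
        System.out.println(exactTermMatch(country, box));
        System.out.println(prefixAwareMatch(country, box));
    }
}
```

In this model the exact lookup has no term to hit because nothing finer than "AB" was indexed for the country, while a lookup that walks the query cell's ancestors finds the covering leaf.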

To be clear I'm recommending two things:
* Have TermQueryStrategy _not_ index its leaves with the '+' -- it doesn't use 
them.
* Have RecursivePrefixTreeStrategy _only_ index the leaf versions of those leaf 
cells, not a redundant non-leaf version.  Some non-trivial code needs to change 
in a few of the search algorithms.

In *both* cases, the semantics are the same; no more and no fewer documents 
match.  But I figure the spatial index would be ~40% smaller, with faster 
indexing as well.  It's _possible_ some of the search algorithms for 
RecursivePrefixTreeStrategy will be slightly slower, since at certain points 
they'll need to visit an additional token to check for both leaf and non-leaf 
indexed cells, but I think the cost will be negligible.
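
The two recommendations can be sketched as token-emission rules. This is a minimal illustration under assumptions, not the strategies' real code: `Cell`, `currentScheme`, `termQueryProposal` and `recursiveProposal` are hypothetical names, and the toy token counts say nothing about the ~40% estimate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the tokens emitted for one indexed shape.
final class LeafTokenSketch {

    // A grid cell token plus whether it is a "leaf" for the shape.
    static final class Cell {
        final String token;
        final boolean leaf;

        Cell(String token, boolean leaf) {
            this.token = token;
            this.leaf = leaf;
        }
    }

    // Behavior described in the issue: a leaf cell is indexed twice,
    // once without and once with the trailing '+'.
    static List<String> currentScheme(List<Cell> cells) {
        List<String> out = new ArrayList<>();
        for (Cell c : cells) {
            out.add(c.token);
            if (c.leaf) {
                out.add(c.token + "+");
            }
        }
        return out;
    }

    // Proposal for the TermQuery-based strategy: never emit the '+' form.
    static List<String> termQueryProposal(List<Cell> cells) {
        List<String> out = new ArrayList<>();
        for (Cell c : cells) {
            out.add(c.token);
        }
        return out;
    }

    // Proposal for the recursive strategy: a leaf emits only its '+' form.
    static List<String> recursiveProposal(List<Cell> cells) {
        List<String> out = new ArrayList<>();
        for (Cell c : cells) {
            out.add(c.leaf ? c.token + "+" : c.token);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Cell> cells = Arrays.asList(
                new Cell("A", false), new Cell("AB", true), new Cell("AC", true));
        System.out.println(currentScheme(cells));
        System.out.println(termQueryProposal(cells));
        System.out.println(recursiveProposal(cells));
    }
}
```

Under either proposal each cell contributes exactly one token, which is where the index savings would come from; the recursive variant is the one that forces the search algorithms to look for both forms of a cell.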
                
> Indexed non-point shapes index excessive terms
> ----------------------------------------------
>
>                 Key: LUCENE-4942
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4942
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/spatial
>            Reporter: David Smiley
>
> Indexed non-point shapes are composed of a set of terms that represent grid 
> cells.  Cells completely within the shape or cells on the intersecting edge 
> that are at the maximum detail depth being indexed for the shape are denoted 
> as "leaf" cells.  Such cells have a trailing '\+'.  _Such tokens are actually 
> indexed twice_, once with the leaf byte and once without.
> The TermQuery based PrefixTree Strategy doesn't consider the notion of 'leaf' 
> cells and so the tokens with '+' are completely redundant.
> The Recursive [algorithm] based PrefixTree Strategy better supports correct 
> search of indexed non-point shapes than TermQuery does and the distinction is 
> relevant.  However, the foundational search algorithms used by this strategy 
> (Intersects & Contains; the other 2 are based on these) could each be 
> upgraded to deal with this correctly.  Not trivial but very doable.
> In the end, spatial non-point indexes can probably be trimmed by ~40% by 
> doing this.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

