[ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880490#comment-16880490
 ] 

Michael Gibney commented on LUCENE-4312:
----------------------------------------

Thank you for the feedback, [~sokolov] and [~jpountz]!
{quote}Recording position lengths in the index is the easy part of the problem 
in my opinion.
{quote}
Yes, this is my view as well; and looking to the future, _respecting_ position 
length would certainly add complexity to phrase queries. But in terms of 
performance impact, the complexity of query execution would be driven by what's 
actually in the index (so for many use cases performance should be roughly 
equivalent to that of an implementation that ignores position length).
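
To make the distinction concrete, here is a minimal sketch of exact phrase matching over a small token graph, with and without respect for position length. It is plain Java and not Lucene code: the class `PositionLengthSketch`, its `Token` type, and the sample text are hypothetical names invented for illustration, and the recursive matcher is only a toy stand-in for a real phrase query. It assumes the text "fast wi fi network" with an indexed synonym "wifi" that spans the two positions of "wi fi".

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a token graph; hypothetical, not Lucene's API. */
final class PositionLengthSketch {

    /** A token that starts at `position` and spans `length` positions. */
    static final class Token {
        final String term;
        final int position;
        final int length;

        Token(String term, int position, int length) {
            this.term = term;
            this.position = position;
            this.length = length;
        }
    }

    /** Assumed sample: "fast wi fi network" plus the synonym "wifi" over "wi fi". */
    static List<Token> sample() {
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token("fast", 0, 1));
        tokens.add(new Token("wi", 1, 1));
        tokens.add(new Token("wifi", 1, 2)); // same start as "wi", spans two positions
        tokens.add(new Token("fi", 2, 1));
        tokens.add(new Token("network", 3, 1));
        return tokens;
    }

    /** True if the phrase terms can be chained end-to-start somewhere in the graph. */
    static boolean matchesPhrase(List<Token> tokens, List<String> phrase, boolean respectLength) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (Token t : tokens) {
            if (t.term.equals(phrase.get(0))
                    && matchesFrom(tokens, phrase, 1, end(t, respectLength), respectLength)) {
                return true;
            }
        }
        return false;
    }

    /** Backtracking step: try every token that starts where the previous one ended. */
    private static boolean matchesFrom(List<Token> tokens, List<String> phrase, int index,
                                       int expectedPosition, boolean respectLength) {
        if (index == phrase.size()) {
            return true;
        }
        for (Token t : tokens) {
            if (t.position == expectedPosition
                    && t.term.equals(phrase.get(index))
                    && matchesFrom(tokens, phrase, index + 1, end(t, respectLength), respectLength)) {
                return true;
            }
        }
        return false;
    }

    /** Ignoring position length is modeled as treating every token as one position wide. */
    private static int end(Token t, boolean respectLength) {
        return t.position + (respectLength ? t.length : 1);
    }

    public static void main(String[] args) {
        List<Token> tokens = sample();
        System.out.println(matchesPhrase(tokens, List.of("wifi", "network"), true));  // true
        System.out.println(matchesPhrase(tokens, List.of("wifi", "network"), false)); // false
        System.out.println(matchesPhrase(tokens, List.of("wifi", "fi"), true));       // false
        System.out.println(matchesPhrase(tokens, List.of("wifi", "fi"), false));      // true
    }
}
```

In this toy model the two modes only diverge on phrases that pass through the multi-position token; a phrase such as "wi fi network", where every token has length 1, matches identically either way, which mirrors the point that the added cost depends on what is actually in the index.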

Regarding the challenges of query implementation... I'm taking a fresh look at 
this issue in the context of work done on LUCENE-7398, which seeks to implement 
backtracking phrase queries in an efficient way (including sloppy, nested, 
etc.). Despite that issue being nominally about "nested Span queries", it's 
really more generally about "proximity search over variable-length subclauses", 
and the techniques used in the implementation for LUCENE-7398 would be 
transferable to interval queries as well.

It's a fair point about the arbitrariness of sloppy phrase queries with 
intervening multi-term synonyms, but I wouldn't call such queries 
"meaningless"; in any case, I think that problem already exists for multi-term 
indexed synonyms, and is not exacerbated by the introduction of indexed 
position length. Sloppy phrase queries (and, for that matter, tokenization 
itself) are somewhat arbitrary by nature. Following that tangent, I can imagine 
some potential ways to mitigate such arbitrariness ... all of which themselves 
rely on the ability to index token graph structure (i.e., position length).

> Index format to store position length per position
> --------------------------------------------------
>
>                 Key: LUCENE-4312
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4312
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 6.0
>            Reporter: Gang Luo
>            Priority: Minor
>              Labels: Suggestion
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs.
> The indexer ignores PositionLengthAttribute. We need to change the index format (and 
> Codec APIs) to store an additional int position length per position.



