[
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880490#comment-16880490
]
Michael Gibney commented on LUCENE-4312:
----------------------------------------
Thank you for the feedback, [~sokolov] and [~jpountz]!
{quote}Recording position lengths in the index is the easy part of the problem
in my opinion.
{quote}
Yes, this is my view as well; and looking to the future, _respecting_ position
length would certainly add complexity to phrase queries. But in terms of
performance impact, the complexity of query execution would be driven by what's
actually in the index (so for many use cases performance should be roughly
equivalent to that of an implementation that ignores position length).

Regarding the challenges of query implementation... I'm taking a fresh look at
this issue in the context of work done on LUCENE-7398, which seeks to implement
backtracking phrase queries in an efficient way (including sloppy, nested,
etc.). Despite that issue being nominally about "nested Span queries", it's
really more generally about "proximity search over variable-length subclauses",
and the techniques used in the implementation for LUCENE-7398 would be
transferable to interval queries as well.

It's a fair point about the arbitrariness of sloppy phrase queries with
intervening multi-term synonyms, but I wouldn't call such queries
"meaningless"; in any case, I think that problem already exists for multi-term
indexed synonyms, and is not exacerbated by the introduction of indexed
position length. Sloppy phrase queries (and, for that matter, tokenization
itself) are somewhat arbitrary by nature. Following that tangent, I can imagine
some potential ways to mitigate such arbitrariness ... all of which themselves
rely on the ability to index token graph structure (i.e., position length).
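
To make "respecting position length" concrete, here is a minimal sketch in plain Java with no Lucene dependency. The names Token, resolve, adjacent, and example are hypothetical and exist only for illustration; in Lucene the two per-token values would come from PositionIncrementAttribute and PositionLengthAttribute. The sketch assumes a single injected synonym, "dns", spanning the three positions of "domain name system", and compares which terms count as adjacent when position length is respected versus ignored.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenGraphSketch {

    /** Hypothetical token: a term, its absolute start position, and its position length. */
    static final class Token {
        final String term;
        final int start;
        final int posLen;

        Token(String term, int start, int posLen) {
            this.term = term;
            this.start = start;
            this.posLen = posLen;
        }
    }

    /** Turns (term, position increment, position length) triples into absolute positions. */
    static List<Token> resolve(String[] terms, int[] posIncs, int[] posLens) {
        List<Token> out = new ArrayList<>();
        int pos = -1; // the first increment of 1 lands on position 0
        for (int i = 0; i < terms.length; i++) {
            pos += posIncs[i];
            out.add(new Token(terms[i], pos, posLens[i]));
        }
        return out;
    }

    /**
     * Returns true if some occurrence of `second` starts exactly where some
     * occurrence of `first` ends. If respectPosLen is false, every token is
     * treated as spanning one position, which is all a reader can assume
     * when position length is not stored in the index.
     */
    static boolean adjacent(List<Token> tokens, String first, String second,
                            boolean respectPosLen) {
        for (Token a : tokens) {
            if (!a.term.equals(first)) {
                continue;
            }
            int end = a.start + (respectPosLen ? a.posLen : 1);
            for (Token b : tokens) {
                if (b.term.equals(second) && b.start == end) {
                    return true;
                }
            }
        }
        return false;
    }

    /** "domain name system is fragile", with "dns" stacked on "domain" and spanning 3 positions. */
    static List<Token> example() {
        String[] terms = {"dns", "domain", "name", "system", "is", "fragile"};
        int[] posIncs  = {1,     0,        1,      1,        1,    1};
        int[] posLens  = {3,     1,        1,      1,        1,    1};
        return resolve(terms, posIncs, posLens);
    }

    public static void main(String[] args) {
        List<Token> tokens = example();
        // Position length respected: "dns" ends where "is" starts.
        System.out.println(adjacent(tokens, "dns", "is", true));    // true
        System.out.println(adjacent(tokens, "dns", "name", true));  // false
        // Position length ignored: "dns" appears to be followed by "name".
        System.out.println(adjacent(tokens, "dns", "is", false));   // false
        System.out.println(adjacent(tokens, "dns", "name", false)); // true
    }
}
```

When every position length is 1, both modes give identical answers, which is the sense in which query cost would be driven by what is actually in the index.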
> Index format to store position length per position
> --------------------------------------------------
>
> Key: LUCENE-4312
> URL: https://issues.apache.org/jira/browse/LUCENE-4312
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: 6.0
> Reporter: Gang Luo
> Priority: Minor
> Labels: Suggestion
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs.
> The indexer ignores PositionLengthAttribute. We need to change the index format
> (and Codec APIs) to store an additional int position length per position.