[
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880635#comment-16880635
]
Michael Gibney commented on LUCENE-4312:
----------------------------------------
True, both good points. But it's something of a chicken-and-egg situation:
there has been no point in addressing these implied challenges so long as
position length is not recorded in the index (and is thus unavailable at
query time). That doesn't mean there _aren't_ ways to address the challenges.
Regarding the "A B C" example, I addressed this in the LUCENE-7398 work by
indexing next start position as a lookahead. As a proof of concept this was
done with Payloads, but in principle I could see slight modifications
(somewhere at the intersection of codecs and postings API) that would natively
read next start position "early" and expose it as a lookahead. This would avoid
the type of problematic call to {{PostingsEnum.nextPosition()}} that would (as
you correctly point out) result in the need to buffer all information
associated with _every_ position. I've described this approach in more detail
[here|https://michaelgibney.net/2018/09/lucene-graph-queries-2/#index-lookahead-don-t-buffer-positions-if-you-don-t-have-to].
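To make the lookahead idea concrete, here is a minimal, self-contained sketch.
It is not the LUCENE-7398 code and uses no Lucene APIs: {{LookaheadSketch}},
{{Entry}}, {{index}} and {{lastStartingAtOrBefore}} are hypothetical names, and
a plain in-memory list stands in for postings. The assumption is that each
occurrence is stored with its position length and the start position of the
following occurrence, so a forward-only consumer can decide whether to advance
without first reading (and then having to buffer) the next position.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical in-memory model; none of these types are Lucene APIs. */
final class LookaheadSketch {

    /** One indexed occurrence of a term within a single document. */
    static final class Entry {
        final int start;     // start position of this occurrence
        final int length;    // position length (1 for an ordinary token)
        final int nextStart; // lookahead: start of the next occurrence, or -1 if none

        Entry(int start, int length, int nextStart) {
            this.start = start;
            this.length = length;
            this.nextStart = nextStart;
        }

        int end() {
            return start + length;
        }
    }

    /** "Index time": record each occurrence together with the start of the one after it. */
    static List<Entry> index(int[] starts, int[] lengths) {
        List<Entry> postings = new ArrayList<>();
        for (int i = 0; i < starts.length; i++) {
            int nextStart = (i + 1 < starts.length) ? starts[i + 1] : -1;
            postings.add(new Entry(starts[i], lengths[i], nextStart));
        }
        return postings;
    }

    /**
     * "Query time": return the last occurrence starting at or before limit, or null
     * if the first occurrence already starts after limit (that entry is consumed).
     * The forward-only iterator is never advanced past the returned entry,
     * because the lookahead says in advance whether the next one is wanted.
     */
    static Entry lastStartingAtOrBefore(Iterator<Entry> positions, int limit) {
        if (!positions.hasNext()) {
            return null;
        }
        Entry current = positions.next();
        if (current.start > limit) {
            return null;
        }
        while (current.nextStart != -1 && current.nextStart <= limit) {
            current = positions.next();
        }
        return current;
    }
}
```

With starts 2, 5, 9 and lengths 1, 2, 1, a limit of 6 returns the occurrence
starting at 5 (ending at 7), and the occurrence at 9 is still unread on the
iterator. Without the stored lookahead, the consumer would have to read the
occurrence at 9 to learn that it is too far, and would need a buffered copy of
the one at 5.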
{quote}we can't advance positions on terms in the order we want anymore.
{quote}
Yes, I'd argue that's the toughest challenge. I addressed it indirectly by
constructing CommonGrams-style shingles used specifically for pre-filtering
conjunctions in the "approximation" phase of two-phase iteration (ensuring that
common terms at subclause index 0 don't kill performance). This is described in
more detail
[here|https://michaelgibney.net/2018/09/lucene-graph-queries-2/#shingle-based-pre-filtering-of-conjunctionspans].
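As a rough illustration of why shingle-based pre-filtering helps, the sketch
below compares the candidate set from a plain term conjunction with the one
from a conjunction over adjacent-term shingles. It is a simplification under
stated assumptions: every adjacent pair is shingled (CommonGrams proper only
shingles pairs involving common words), documents are plain token arrays, and
{{ShinglePrefilterSketch}} with its methods is a hypothetical name, not Lucene
API and not the LUCENE-7398 implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of conjunction pre-filtering; not Lucene code. */
final class ShinglePrefilterSketch {

    /** Adjacent-token shingles, e.g. [the, who] -> [the_who]. */
    static List<String> shingles(String[] tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            out.add(tokens[i] + "_" + tokens[i + 1]);
        }
        return out;
    }

    /** Inverted index from key (term or shingle) to the ids of docs containing it. */
    static Map<String, Set<Integer>> buildIndex(String[][] docs, boolean shingled) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            List<String> keys = shingled ? shingles(docs[docId]) : Arrays.asList(docs[docId]);
            for (String key : keys) {
                index.computeIfAbsent(key, k -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    /** Docs containing every key; empty if any key is missing or no keys are given. */
    static Set<Integer> intersect(Map<String, Set<Integer>> index, List<String> keys) {
        Set<Integer> result = null;
        for (String key : keys) {
            Set<Integer> docs = index.getOrDefault(key, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    /**
     * "Approximation" phase only: candidate docs for a phrase, which would still
     * need position verification. A single-term phrase yields no shingles.
     */
    static Set<Integer> candidates(String[][] docs, String[] phrase, boolean shingled) {
        List<String> keys = shingled ? shingles(phrase) : Arrays.asList(phrase);
        return intersect(buildIndex(docs, shingled), keys);
    }
}
```

For the three documents "the who played", "who is the cat" and "the cat who
ran", the phrase "the who" matches all three in a plain term conjunction, but
only the first when the conjunction is over the shingle {{the_who}}, so far
fewer documents reach the expensive position-checking phase.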
I'm not intending this to be about these particular solutions, and you might
take issue with the solutions themselves. The more general point I guess is
that indexed position length is fundamental, and is a prerequisite for the
development of ways to address these challenges.
> Index format to store position length per position
> --------------------------------------------------
>
> Key: LUCENE-4312
> URL: https://issues.apache.org/jira/browse/LUCENE-4312
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: 6.0
> Reporter: Gang Luo
> Priority: Minor
> Labels: Suggestion
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs.
> The indexer ignores PositionLengthAttribute. We need to change the index
> format (and Codec APIs) to store an additional int position length per position.