[jira] [Commented] (LUCENE-4312) Index format to store position length per position

Michael Gibney (JIRA) Mon, 08 Jul 2019 12:01:08 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880635#comment-16880635
 ]


Michael Gibney commented on LUCENE-4312:
----------------------------------------

True, both good points. But it's kind of a chicken-or-egg situation ... there 
would have been no point to address these implied challenges, so long as 
position length has not been recorded in the index (and is thus not available 
at query time). That doesn't mean there _aren't_ ways to address the challenges.

Regarding the "A B C" example, I addressed this in the LUCENE-7398 work by 
indexing next start position as a lookahead. As a proof of concept this was 
done with Payloads, but in principle I could see slight modifications 
(somewhere at the intersection of codecs and postings API) that would natively 
read next start position "early" and expose it as a lookahead. This would avoid 
the type of problematic call to {{PostingsEnum.nextPosition()}} that would (as 
you correctly point out) result in the need to buffer all information 
associated with _every_ position. I've described this approach in more detail 
[here|https://michaelgibney.net/2018/09/lucene-graph-queries-2/#index-lookahead-don-t-buffer-positions-if-you-don-t-have-to].
{quote}we can't advance positions on terms in the order we want anymore.
{quote}
Yes, I'd argue that's the toughest challenge. I addressed it indirectly by 
constructing CommonGrams-style shingles used specifically for pre-filtering 
conjunctions in the "approximation" phase of two-phase iteration (ensuring that 
common terms at subclause index 0 don't kill performance). This is described in 
more detail 
[here|https://michaelgibney.net/2018/09/lucene-graph-queries-2/#shingle-based-pre-filtering-of-conjunctionspans].

I'm not intending this to be about these particular solutions, and you might 
take issue with the solutions themselves. The more general point I guess is 
that indexed position length is fundamental, and is a prerequisite for the 
development of ways to address these challenges.

> Index format to store position length per position
> --------------------------------------------------
>
>                 Key: LUCENE-4312
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4312
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 6.0
>            Reporter: Gang Luo
>            Priority: Minor
>              Labels: Suggestion
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Mike Mccandless said:TokenStreams are actually graphs.
> Indexer ignores PositionLengthAttribute.Need change the index format (and 
> Codec APIs) to store an additional int position length per position.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-4312) Index format to store position length per position

Reply via email to