[ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892990#comment-16892990
 ] 

Michael Gibney commented on LUCENE-4312:
----------------------------------------

For the sake of facilitating discussion around something more concrete, I 
uploaded a patch ([^positionLength-postings.patch]) for a straw-man proposal 
for {{PostingsEnum}} modifications to support position length (also visible as 
a pseudo-PR here: [https://github.com/magibney/lucene-solr/pull/1]). The patch 
won't compile, of course (no corresponding modifications to subclasses of 
{{PostingsEnum}}).

The proposal goes a bit beyond simply adding a {{positionLength()}} method, 
with a few additional fundamental methods to support optimizations that proved 
helpful in implementing performant positional queries (for LUCENE-7398).

Any feedback would be much appreciated, especially given the acknowledged 
provisional (and potentially controversial) nature of this proposal.

[~jpountz], I've given some more thought to the challenge of not being able to 
"advance positions on terms in the order we want anymore". I think there should 
be a general-purpose way to preserve this ability (in a way that doesn't depend 
on the kind of corpus-specific shingle-based filtering that I previously 
suggested). I'm considering an approach that leverages something analogous to a 
reverse token filter, except that rather than reversing the token text, it 
(sort of) reverses start/end positions: the start position of the new token is 
the end position of the original token, and the end position of the new token 
is {{originalEndPosition + positionLength}}. You could then use the least-cost 
term as an entry point, and build forward with the original tokens and backward 
with the modified-position tokens. The query implementation would be 
responsible for properly interpreting the flipped positions.
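
To make the position arithmetic above concrete, here is a minimal sketch of the mapping only. It is not part of the attached patch: {{FlippedPositions}}, {{Token}}, and {{flip}} are hypothetical names invented for illustration, and the sketch assumes that a token's end position is its start position plus its position length.

```java
public class FlippedPositions {

    /** Hypothetical stand-in for an indexed token; not a Lucene class. */
    static final class Token {
        final String term;
        final int startPosition;
        final int positionLength;

        Token(String term, int startPosition, int positionLength) {
            this.term = term;
            this.startPosition = startPosition;
            this.positionLength = positionLength;
        }

        /** Assumption: end position = start position + position length. */
        int endPosition() {
            return startPosition + positionLength;
        }
    }

    /**
     * Builds the "flipped" token described above:
     * new start = original end, new end = originalEndPosition + positionLength.
     * Under the assumption above, this amounts to keeping the position length
     * and moving the start to the original end.
     */
    static Token flip(Token original) {
        return new Token(original.term, original.endPosition(), original.positionLength);
    }

    public static void main(String[] args) {
        Token original = new Token("wifi", 3, 2);   // spans positions 3..5
        Token flipped = flip(original);             // spans positions 5..7
        System.out.println("original: start=" + original.startPosition
                + " end=" + original.endPosition());
        System.out.println("flipped:  start=" + flipped.startPosition
                + " end=" + flipped.endPosition());
    }
}
```

How a query would consume the flipped tokens to build backward from the least-cost term is left open here, as it is in the proposal itself.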

 

> Index format to store position length per position
> --------------------------------------------------
>
>                 Key: LUCENE-4312
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4312
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 6.0
>            Reporter: Gang Luo
>            Priority: Minor
>              Labels: Suggestion
>         Attachments: positionLength-postings.patch
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs.
> The indexer ignores PositionLengthAttribute. We need to change the index 
> format (and Codec APIs) to store an additional int position length per 
> position.


