[ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16892990#comment-16892990 ]
Michael Gibney commented on LUCENE-4312:
----------------------------------------
To give the discussion something more concrete to work from, I uploaded a
patch ([^positionLength-postings.patch]) with a straw-man proposal for
{{PostingsEnum}} modifications to support position length (also visible as
a pseudo-PR here: [https://github.com/magibney/lucene-solr/pull/1]). The patch
won't compile, of course, because it makes no corresponding modifications to
the subclasses of {{PostingsEnum}}.

The proposal goes a bit beyond simply adding a {{positionLength()}} method: it
adds a few more fundamental methods to support optimizations that proved
helpful in implementing performant positional queries (for LUCENE-7398).

Any feedback would be much appreciated, especially given the acknowledged
provisional (and potentially controversial) nature of this proposal.

[~jpountz], I've given some more thought to the challenge of not being able to
"advance positions on terms in the order we want anymore". I think there should
be a general-purpose way to preserve this ability (one that doesn't depend
on the kind of corpus-specific shingle-based filtering that I previously
suggested). I'm considering an approach built on something analogous to a
reverse token filter, except that rather than reversing the token text, it
(sort of) reverses start/end positions: the start position of the new token is
the end position of the original token, and the end position of the new token
is {{originalEndPosition + positionLength}}. You could then use the least-cost
term as an entry point and build forward with the original tokens and backward
with the modified-position tokens. The query implementation would be
responsible for properly interpreting the flipped positions.
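
The position flip described above can be sketched roughly as follows. This is only an illustration under stated assumptions: {{FlippedPositions}}, {{Token}}, {{flip}} and {{endingAt}} are hypothetical names, tokens are modeled as plain objects rather than Lucene attributes, and nothing here is taken from the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; none of these names exist in Lucene or in the patch. */
final class FlippedPositions {

    /** A token in a token graph, leaving node {@code start} and arriving at node {@code end()}. */
    static final class Token {
        final String term;
        final int start;   // start position
        final int length;  // position length

        Token(String term, int start, int length) {
            this.term = term;
            this.start = start;
            this.length = length;
        }

        /** End position of this token: start + position length. */
        int end() {
            return start + length;
        }
    }

    /**
     * The flip from the comment: the new start is the original end, and the
     * new end is originalEndPosition + positionLength.
     */
    static Token flip(Token original) {
        return new Token(original.term, original.end(), original.length);
    }

    /**
     * Backward step: given flipped tokens, returns the original tokens that end
     * exactly at {@code position}. Because a flipped token is keyed by its
     * original end, this is a lookup by (flipped) start position.
     */
    static List<Token> endingAt(List<Token> flippedTokens, int position) {
        List<Token> result = new ArrayList<>();
        for (Token flipped : flippedTokens) {
            if (flipped.start == position) {
                // Undo the flip: original start = flipped start - position length.
                result.add(new Token(flipped.term, flipped.start - flipped.length, flipped.length));
            }
        }
        return result;
    }
}
```

With tokens {{wi}} (start 0, length 1), {{fi}} (1, 1) and {{wifi}} (0, 2) flipped, and an entry-point term starting at position 2, {{endingAt(flipped, 2)}} returns {{fi}} and {{wifi}}: the tokens a query would need in order to build backward from the entry point.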
> Index format to store position length per position
> --------------------------------------------------
>
> Key: LUCENE-4312
> URL: https://issues.apache.org/jira/browse/LUCENE-4312
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: 6.0
> Reporter: Gang Luo
> Priority: Minor
> Labels: Suggestion
> Attachments: positionLength-postings.patch
>
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs.
> The indexer ignores PositionLengthAttribute. We need to change the index
> format (and Codec APIs) to store an additional int position length per position.
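
To make "an additional int position length per position" concrete, here is a minimal sketch of one way such data could be laid out and read back. It is an assumption-laden toy, not Lucene's codec format and not the attached patch: {{PositionLengthPostings}}, {{encode}} and {{Reader}} are hypothetical names, and the interleaved delta layout is just one simple choice.

```java
/** Hypothetical sketch; not Lucene's index format or codec API. */
final class PositionLengthPostings {

    /**
     * Encodes the positions of one term in one document as interleaved
     * (position delta, position length) pairs. Positions must be non-decreasing.
     */
    static int[] encode(int[] positions, int[] lengths) {
        int[] out = new int[positions.length * 2];
        int previous = 0;
        for (int i = 0; i < positions.length; i++) {
            out[2 * i] = positions[i] - previous; // delta from the previous position
            out[2 * i + 1] = lengths[i];          // the additional int per position
            previous = positions[i];
        }
        return out;
    }

    /** Minimal reader that exposes a position length alongside each position. */
    static final class Reader {
        private final int[] data;
        private int offset = 0;
        private int position = 0;
        private int length = 1;

        Reader(int[] data) {
            this.data = data;
        }

        boolean hasNext() {
            return offset < data.length;
        }

        /** Advances to the next position and returns it. */
        int nextPosition() {
            position += data[offset];
            length = data[offset + 1];
            offset += 2;
            return position;
        }

        /** Position length of the position last returned by nextPosition(). */
        int positionLength() {
            return length;
        }
    }
}
```

A real format would presumably compress the common case where every position length is 1; this sketch ignores that and stores the length unconditionally.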