[ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879425#comment-16879425 ]
Michael Gibney commented on LUCENE-4312:
----------------------------------------
Following up on the discussion at Berlin Buzzwords with [~mikemccand], [~sokolov],
[~simonw], and [~romseygeek]:
A lot of context that is useful for, e.g., synonym generation is available at
index time but not at query time. Leveraging this context can result in
index-time TokenStream manipulations that produce token graphs. Because position
length is not indexed, the index-time TokenStream graph structure cannot be
reconstructed at query time.
Indexed position length is a prerequisite for any use case that calls for:
1. index-time graph TokenStreams
2. precise/accurate proximity queries (via spans, intervals, etc.)
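To illustrate why proximity matching depends on position length, here is a
minimal Java sketch. The Token class, the adjacent helper, and the sample
document ("fast wi fi network" with an index-time synonym "wifi" spanning
"wi fi") are all hypothetical and are not Lucene APIs. The sketch only contrasts
an end position assumed to be position+1 with one derived from a stored
position length.

```java
final class PhraseAdjacencySketch {

    /** One indexed token occurrence (hypothetical model, not a Lucene class). */
    static final class Token {
        final String term;
        final int position;       // start position in the token graph
        final int positionLength; // number of positions the token spans

        Token(String term, int position, int positionLength) {
            this.term = term;
            this.position = position;
            this.positionLength = positionLength;
        }
    }

    /** "fast wi fi network", with "wifi" added as a synonym spanning "wi fi". */
    static Token[] sampleDocument() {
        return new Token[] {
            new Token("fast", 0, 1),
            new Token("wi", 1, 1),
            new Token("wifi", 1, 2), // graph token: covers positions 1 and 2
            new Token("fi", 2, 1),
            new Token("network", 3, 1),
        };
    }

    /**
     * Returns true if some occurrence of {@code second} starts exactly where an
     * occurrence of {@code first} ends. When usePositionLength is false, every
     * token is assumed to end at position + 1, which is all that can be known
     * when position length is not indexed.
     */
    static boolean adjacent(Token[] tokens, String first, String second,
                            boolean usePositionLength) {
        for (Token a : tokens) {
            if (!a.term.equals(first)) {
                continue;
            }
            int end = a.position + (usePositionLength ? a.positionLength : 1);
            for (Token b : tokens) {
                if (b.term.equals(second) && b.position == end) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

In this sketch, adjacent(sampleDocument(), "wifi", "network", false) finds no
match because "wifi" is assumed to end at position 2 while "network" starts at
position 3. Passing true uses the stored length of 2 and the pair matches.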
Could we discuss adding first-class support for this structural "position
length" information?
Updating PostingsEnum to include endPosition() -- returning {{position+1}} by
default -- would be a meaningful first step. This would facilitate the
development of query implementations without requiring an API fork, and would
signal an intention to move in the direction of supporting index-time token
graphs.
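A rough sketch of what such a default could look like, in Java. The Positions
interface and both implementations below are hypothetical stand-ins written for
illustration; they are not Lucene's actual PostingsEnum, and currentPosition()
is an assumed helper that exists only so the default method has something to
build on.

```java
final class EndPositionSketch {

    /** Hypothetical stand-in for a positions iterator; not Lucene's PostingsEnum. */
    interface Positions {
        /** Advances to the next position and returns it, or -1 when exhausted. */
        int nextPosition();

        /** Returns the position most recently returned by nextPosition(). */
        int currentPosition();

        /** Default for linear token streams: every token spans one position. */
        default int endPosition() {
            return currentPosition() + 1;
        }
    }

    /** Implementation that knows nothing about position length. */
    static class LinearPositions implements Positions {
        protected final int[] positions;
        protected int index = -1;

        LinearPositions(int[] positions) {
            this.positions = positions;
        }

        @Override
        public int nextPosition() {
            index++;
            return index < positions.length ? positions[index] : -1;
        }

        @Override
        public int currentPosition() {
            // Only valid after nextPosition() has returned a real position.
            return positions[index];
        }
    }

    /** Implementation backed by (assumed) stored position lengths. */
    static final class GraphPositions extends LinearPositions {
        private final int[] positionLengths;

        GraphPositions(int[] positions, int[] positionLengths) {
            super(positions);
            this.positionLengths = positionLengths;
        }

        @Override
        public int endPosition() {
            return currentPosition() + positionLengths[index];
        }
    }
}
```

Query code written against the interface would call endPosition() either way,
so implementations without stored position length keep working unchanged while
graph-aware ones override the default.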
Beyond that, I'm optimistic that codecs could be enhanced to index position
length without introducing much additional overhead (I'd guess that position
length for the common case of linear/non-graph index-time token streams could
compress quite well).
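As a back-of-the-envelope illustration of that guess, here is one possible
encoding, sketched in Java. It is purely hypothetical and is not how any Lucene
codec stores data: a single flag byte marks the all-lengths-are-1 case, and
otherwise each (length - 1) is written as a variable-length integer.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

final class PositionLengthEncodingSketch {

    /** Encodes the position lengths for one term's positions in one document. */
    static byte[] encode(int[] positionLengths) {
        boolean allOnes = true;
        for (int len : positionLengths) {
            if (len < 1) {
                throw new IllegalArgumentException("position length must be >= 1");
            }
            if (len != 1) {
                allOnes = false;
            }
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (allOnes) {
            out.write(0); // flag 0: linear stream, no per-position data stored
            return out.toByteArray();
        }
        out.write(1); // flag 1: explicit lengths follow
        for (int len : positionLengths) {
            writeVInt(out, len - 1);
        }
        return out.toByteArray();
    }

    /** Decodes {@code count} position lengths written by encode(). */
    static int[] decode(byte[] data, int count) {
        int[] lengths = new int[count];
        if (data[0] == 0) {
            Arrays.fill(lengths, 1);
            return lengths;
        }
        int offset = 1;
        for (int i = 0; i < count; i++) {
            int value = 0;
            int shift = 0;
            byte b;
            do {
                b = data[offset++];
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            lengths[i] = value + 1;
        }
        return lengths;
    }

    /** Writes a non-negative int, 7 bits per byte, high bit set on all but the last. */
    private static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }
}
```

Under this scheme a linear token stream costs one byte per term per document
regardless of how many positions it has; a real codec could presumably fold
that flag into existing block metadata.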
> Index format to store position length per position
> --------------------------------------------------
>
> Key: LUCENE-4312
> URL: https://issues.apache.org/jira/browse/LUCENE-4312
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: 6.0
> Reporter: Gang Luo
> Priority: Minor
> Labels: Suggestion
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs.
> The indexer ignores PositionLengthAttribute. We need to change the index format
> (and Codec APIs) to store an additional int position length per position.