[ 
https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879425#comment-16879425
 ] 

Michael Gibney commented on LUCENE-4312:
----------------------------------------

Following up on discussion at Berlin Buzzwords with [~mikemccand], [~sokolov], 
[~simonw], and [~romseygeek]:

A lot of useful context (e.g., for synonym generation) is available at index 
time but not at query time. Leveraging this context can result in index-time 
TokenStream manipulations that produce token graphs. Because position length 
is not indexed, the index-time TokenStream "graph" structure cannot be 
reconstructed at query time.
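
To make the information loss concrete, here is a minimal sketch in plain Java. 
It is not Lucene code: the {{Token}} class, the {{flatten}} helper and the 
"wifi" / "wi fi" example are assumptions made up for illustration. It models 
an index that keeps only term and position, and shows two different token 
streams collapsing to the same postings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GraphFlatteningSketch {

    /** Hypothetical token: a term, its start position, and the number of positions it spans. */
    static final class Token {
        final String term;
        final int position;
        final int positionLength;

        Token(String term, int position, int positionLength) {
            this.term = term;
            this.position = position;
            this.positionLength = positionLength;
        }
    }

    /** Models what survives when only term and position are indexed: positionLength is dropped. */
    static List<String> flatten(List<Token> tokens) {
        List<String> postings = new ArrayList<>();
        for (Token t : tokens) {
            postings.add(t.term + "@" + t.position);
        }
        return postings;
    }

    /** "fast wifi network" where "wifi" spans the two positions occupied by "wi fi". */
    static List<Token> graphStream() {
        return Arrays.asList(
            new Token("fast", 0, 1),
            new Token("wi", 1, 1),
            new Token("wifi", 1, 2),
            new Token("fi", 2, 1),
            new Token("network", 3, 1));
    }

    /** Same terms at the same positions, but "wifi" spans a single position. */
    static List<Token> stackedStream() {
        return Arrays.asList(
            new Token("fast", 0, 1),
            new Token("wi", 1, 1),
            new Token("wifi", 1, 1),
            new Token("fi", 2, 1),
            new Token("network", 3, 1));
    }

    public static void main(String[] args) {
        // Two structurally different streams yield identical postings.
        System.out.println(flatten(graphStream()).equals(flatten(stackedStream()))); // prints "true"
    }
}
```

Since both streams flatten to the same list, nothing at query time can tell 
whether "wifi" was followed by "fi" or by "network".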

Indexed position length is a prerequisite for any use case that calls for:
1. index-time graph TokenStreams
2. precise/accurate proximity queries (via spans, intervals, etc.)

Could we discuss adding first-class support for this structural "position 
length" information?

Updating PostingsEnum to include endPosition() -- returning {{position+1}} by 
default -- would be a meaningful first step. This would facilitate the 
development of query implementations without requiring an API fork, and would 
signal an intention to move in the direction of supporting index-time token 
graphs.
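
As a rough sketch of what I have in mind, the following self-contained Java 
uses a hypothetical {{Positions}} interface as a stand-in for the positional 
part of PostingsEnum. The {{position()}} accessor, the two implementations and 
the {{adjacent}} helper are assumptions for illustration only, not existing 
Lucene API. The default {{endPosition()}} returns {{position+1}}, and a 
graph-aware implementation overrides it.

```java
final class EndPositionSketch {

    /** Hypothetical stand-in for the positional part of PostingsEnum. */
    interface Positions {
        /** Advances to the next position and returns it. */
        int nextPosition();

        /** Hypothetical accessor: the position most recently returned by nextPosition(). */
        int position();

        /** Proposed default: a token spans exactly one position. */
        default int endPosition() {
            return position() + 1;
        }
    }

    /** Positions only, as exposed when no position length is stored. */
    static class LinearPositions implements Positions {
        private final int[] positions;
        protected int index = -1;

        LinearPositions(int... positions) {
            this.positions = positions;
        }

        @Override
        public int nextPosition() {
            return positions[++index];
        }

        @Override
        public int position() {
            return positions[index];
        }
    }

    /** Graph-aware variant: derives endPosition() from a stored position length. */
    static final class GraphPositions extends LinearPositions {
        private final int[] lengths;

        GraphPositions(int[] positions, int[] lengths) {
            super(positions);
            this.lengths = lengths;
        }

        @Override
        public int endPosition() {
            return position() + lengths[index];
        }
    }

    /** True if the current token of 'second' starts where the current token of 'first' ends. */
    static boolean adjacent(Positions first, Positions second) {
        return first.endPosition() == second.position();
    }

    public static void main(String[] args) {
        // "wifi" starts at position 1 and spans 2 positions; "network" is at position 3.
        Positions flatWifi = new LinearPositions(1);
        Positions graphWifi = new GraphPositions(new int[] {1}, new int[] {2});
        Positions network = new LinearPositions(3);
        flatWifi.nextPosition();
        graphWifi.nextPosition();
        network.nextPosition();

        System.out.println(adjacent(flatWifi, network));  // prints "false"
        System.out.println(adjacent(graphWifi, network)); // prints "true"
    }
}
```

Query code written against {{endPosition()}} would behave exactly as it does 
today on linear streams, and would pick up graph structure once a codec can 
supply it.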

Beyond that, I'm optimistic that codecs could be enhanced to index position 
length without introducing much additional overhead (I'd guess that position 
length for the common case of linear/non-graph index-time token streams could 
compress quite well).
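
To illustrate why I'd expect the linear case to be cheap, here is one possible 
block encoding, sketched in plain Java. It is purely an assumption for 
discussion and not how any existing Lucene codec works: a block whose position 
lengths are all 1 costs a single marker byte, and only blocks containing graph 
tokens pay for per-position data (stored as variable-length ints of 
{{length-1}}). The number of positions in a block is assumed to be known from 
elsewhere, e.g. the term frequency.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

final class PositionLengthBlockSketch {

    /** Encodes one block of position lengths (each assumed to be >= 1). */
    static byte[] encode(int[] lengths) {
        boolean allOnes = true;
        for (int len : lengths) {
            if (len != 1) {
                allOnes = false;
                break;
            }
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (allOnes) {
            out.write(0); // linear block: marker byte only, no per-position data
            return out.toByteArray();
        }
        out.write(1); // graph block: marker byte followed by (length - 1) values
        for (int len : lengths) {
            writeVInt(out, len - 1);
        }
        return out.toByteArray();
    }

    /** Decodes a block produced by encode(); 'count' is the number of positions in the block. */
    static int[] decode(byte[] bytes, int count) {
        int[] lengths = new int[count];
        if (bytes[0] == 0) {
            Arrays.fill(lengths, 1);
            return lengths;
        }
        int offset = 1;
        for (int i = 0; i < count; i++) {
            int value = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[offset++];
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            lengths[i] = value + 1;
        }
        return lengths;
    }

    /** Writes a non-negative int using 7 bits per byte, low-order bits first. */
    private static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    public static void main(String[] args) {
        int[] linear = new int[128];
        Arrays.fill(linear, 1);
        System.out.println(encode(linear).length);                 // prints "1"
        System.out.println(encode(new int[] {1, 2, 1}).length);    // prints "4"
    }
}
```

Under this scheme a purely linear index would pay one byte per block, while 
the cost for graph blocks grows with the number of positions they contain.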

> Index format to store position length per position
> --------------------------------------------------
>
>                 Key: LUCENE-4312
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4312
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 6.0
>            Reporter: Gang Luo
>            Priority: Minor
>              Labels: Suggestion
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs.
> The indexer ignores PositionLengthAttribute. We need to change the index 
> format (and Codec APIs) to store an additional int position length per 
> position.



