implement PositionLengthAttribute for all tokenstreams where it's appropriate
-----------------------------------------------------------------------------

                 Key: LUCENE-3843
                 URL: https://issues.apache.org/jira/browse/LUCENE-3843
             Project: Lucene - Java
          Issue Type: Improvement
            Reporter: Robert Muir
             Fix For: 3.6, 4.0


LUCENE-3767 introduces PositionLengthAttribute, which extends the tokenstream API
from a sausage (a linear chain of positions on which tokens can only be stacked)
to a real graph.

Currently, tokenstreams such as WordDelimiterFilter and SynonymFilter theoretically
work at the graph level, but then serialize themselves to a sausage. For example:

wi-fi with WDF creates:
wi(posinc=1), fi(posinc=1), wifi(posinc=0)

The loss is that 'wifi' is simply stacked on top of 'fi', so nothing records that
it also covers 'wi'.

PositionLengthAttribute fixes this by allowing a token to declare how many positions
it "spans", so we don't lose any information.
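
To make the difference concrete, here is a minimal sketch in plain Java. It does
not use Lucene's API: the Token class, its fields, and the arcs() helper are
hypothetical stand-ins for the two numbers carried by the position increment and
position length attributes. The graph encoding of 'wifi' shown here is also an
assumption about what a fixed filter might emit, not the output of any existing
filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PositionLengthSketch {

    // Hypothetical token model (not a Lucene class).
    static final class Token {
        final String term;
        final int posInc;    // how far this token advances from the previous one
        final int posLength; // how many positions this token spans (1 = sausage)

        Token(String term, int posInc, int posLength) {
            this.term = term;
            this.posInc = posInc;
            this.posLength = posLength;
        }
    }

    // Renders each token as "term:from->to". The position counter starts at -1
    // so that a first token with posInc=1 lands on position 0.
    static List<String> arcs(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.posInc;
            out.add(t.term + ":" + pos + "->" + (pos + t.posLength));
        }
        return out;
    }

    public static void main(String[] args) {
        // The sausage from the example above: every token spans one position.
        List<Token> sausage = Arrays.asList(
            new Token("wi", 1, 1),
            new Token("fi", 1, 1),
            new Token("wifi", 0, 1));

        // Assumed graph encoding: 'wifi' starts where 'wi' starts and spans two
        // positions, so it covers both 'wi' and 'fi'.
        List<Token> graph = Arrays.asList(
            new Token("wi", 1, 1),
            new Token("wifi", 0, 2),
            new Token("fi", 1, 1));

        System.out.println(arcs(sausage)); // [wi:0->1, fi:1->2, wifi:1->2]
        System.out.println(arcs(graph));   // [wi:0->1, wifi:0->2, fi:1->2]
    }
}
```

In the sausage, 'wifi' occupies the same single position as 'fi'; in the graph
form it runs from position 0 to position 2, alongside the path 'wi', 'fi'.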

While the indexer can currently only support sausages anyway (and for performance
reasons, this is probably just fine!), other tokenstream consumers, such as
queryparsers and suggesters like LUCENE-3842, can actually make use of this
information for better behavior.
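
As an illustration of what such a consumer gains, the following plain-Java sketch
enumerates every path through a token graph, roughly what a queryparser might do
to find alternative phrasings. It is not Lucene code: the Token class and the
paths() helper are hypothetical, and it assumes every posLength is at least 1 and
that there are no position gaps (holes) in the stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class GraphPathsSketch {

    // Hypothetical token model (not a Lucene class).
    static final class Token {
        final String term;
        final int posInc;    // advance from the previous token's position
        final int posLength; // number of positions spanned; assumed >= 1

        Token(String term, int posInc, int posLength) {
            this.term = term;
            this.posInc = posInc;
            this.posLength = posLength;
        }
    }

    // Returns every sequence of terms leading from position 0 to the last position.
    static List<String> paths(List<Token> tokens) {
        int n = tokens.size();
        int[] from = new int[n];
        int[] to = new int[n];
        int pos = -1;
        int end = 0;
        for (int i = 0; i < n; i++) {
            Token t = tokens.get(i);
            pos += t.posInc;
            from[i] = pos;
            to[i] = pos + t.posLength;
            end = Math.max(end, to[i]);
        }
        List<String> out = new ArrayList<>();
        walk(tokens, from, to, 0, end, "", out);
        return out;
    }

    // Depth-first walk: follow every token that starts at the current position.
    private static void walk(List<Token> tokens, int[] from, int[] to,
                             int at, int end, String prefix, List<String> out) {
        if (at == end) {
            out.add(prefix.trim());
            return;
        }
        for (int i = 0; i < tokens.size(); i++) {
            if (from[i] == at) {
                walk(tokens, from, to, to[i], end, prefix + " " + tokens.get(i).term, out);
            }
        }
    }

    public static void main(String[] args) {
        List<Token> sausage = Arrays.asList(
            new Token("wi", 1, 1), new Token("fi", 1, 1), new Token("wifi", 0, 1));
        List<Token> graph = Arrays.asList(
            new Token("wi", 1, 1), new Token("wifi", 0, 2), new Token("fi", 1, 1));

        System.out.println(paths(sausage)); // [wi fi, wi wifi]
        System.out.println(paths(graph));   // [wi fi, wifi]
    }
}
```

With the sausage, the consumer sees the bogus alternative "wi wifi"; with position
lengths preserved, it sees the two intended readings, "wi fi" and "wifi".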

So I think it's ideal if the TokenStream API doesn't reflect the lossiness of the
index format but instead keeps all information. After LUCENE-3767 is committed, we
should fix tokenstreams to preserve this information for consumers that can use it.

