Alan Woodward reassigned LUCENE-3475:

             Assignee: Alan Woodward
    Affects Version/s:     (was: 3.4)

Here's a patch with a first attempt at a ShingleGraphFilter.  It still needs 
javadocs, more tests (including randomized testing), adding into RandomChains, 
etc., but it would be good to get some more eyes on it.

It has a slightly different output to ShingleFilter on non-graph tokenstreams, 
in that it emits shingles longest first, due to the way I went about it.  It 
might not be too difficult to change that though.

To support backtracking in the TokenStream, the underlying stream is read 
(lazily) into a linked list, with Token objects reused once the filter has 
moved past them.  The current shingle is just an array of references into the 
list.  Shingles are built using the nextTokenInGraph(Token) method, which uses 
a token's position length to move through the linked list and find its 
successor.  If multiple tokens share a position, incrementToken() iterates 
through each one, rebuilding the graph each time.
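
To illustrate the successor lookup described above, here is a minimal, 
self-contained sketch.  It is not the code from the patch: Tok, buffer, 
successors and shingles are hypothetical names, it assumes each token carries 
an absolute position and a position length, it uses a plain ArrayList rather 
than the lazily-filled linked list, and it enumerates shingles eagerly instead 
of streaming them from incrementToken().

```java
import java.util.ArrayList;
import java.util.List;

class ShingleGraphSketch {

    /** Hypothetical stand-in for a buffered token; not a Lucene class. */
    static final class Tok {
        final String term;
        final int position;   // absolute start position in the graph
        final int posLength;  // number of positions the token spans

        Tok(String term, int position, int posLength) {
            this.term = term;
            this.position = position;
            this.posLength = posLength;
        }
    }

    /** Turns parallel arrays of terms, position increments and position lengths into buffered tokens. */
    static List<Tok> buffer(String[] terms, int[] posIncs, int[] posLens) {
        List<Tok> out = new ArrayList<>();
        int pos = -1; // the first increment of 1 puts the first token at position 0
        for (int i = 0; i < terms.length; i++) {
            pos += posIncs[i];
            out.add(new Tok(terms[i], pos, posLens[i]));
        }
        return out;
    }

    /** All buffered tokens that start where the given token ends. */
    static List<Tok> successors(List<Tok> buffered, Tok current) {
        int target = current.position + current.posLength;
        List<Tok> next = new ArrayList<>();
        for (Tok t : buffered) {
            if (t.position == target) {
                next.add(t);
            }
        }
        return next;
    }

    /** Collects every shingle of exactly `size` tokens, following graph edges only. */
    static List<String> shingles(List<Tok> buffered, int size) {
        List<String> out = new ArrayList<>();
        for (Tok start : buffered) {
            extend(buffered, start, start.term, size - 1, out);
        }
        return out;
    }

    private static void extend(List<Tok> buffered, Tok last, String soFar,
                               int remaining, List<String> out) {
        if (remaining == 0) {
            out.add(soFar);
            return;
        }
        for (Tok next : successors(buffered, last)) {
            extend(buffered, next, soFar + " " + next.term, remaining - 1, out);
        }
    }

    public static void main(String[] args) {
        // "car" and "auto" share a position (increment 0), followed by "wash".
        List<Tok> buffered = buffer(
            new String[] {"car", "auto", "wash"},
            new int[] {1, 0, 1},
            new int[] {1, 1, 1});
        System.out.println(shingles(buffered, 2)); // prints [car wash, auto wash]
    }
}
```

Because a successor must start where the previous token ends, tokens stacked 
at the same position are never joined to each other, so no "car auto" shingle 
is produced in this sketch.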

> ShingleFilter should handle positionIncrement of zero, e.g. synonyms
> --------------------------------------------------------------------
>                 Key: LUCENE-3475
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3475
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Cameron
>            Assignee: Alan Woodward
>            Priority: Minor
>              Labels: newdev
>         Attachments: LUCENE-3475.patch
> ShingleFilter is creating shingles for a single term that has been expanded 
> by synonyms when it shouldn't. The position increment is 0.
> As an example, I have an Analyzer with a SynonymFilter followed by a 
> ShingleFilter. Assuming car and auto are synonyms, the SynonymFilter produces 
> two tokens at position 1: car, auto. The ShingleFilter then produces three 
> tokens (car, car auto, auto) when there should only be two (car, auto). This 
> behavior seems incorrect.

