[ 
https://issues.apache.org/jira/browse/LUCENE-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409382#comment-16409382
 ] 

Jim Ferenczi commented on LUCENE-8202:
--------------------------------------

+1 to setting the position length to 1. This is a fixed-size shingle filter, 
so the attribute carries no additional information.
Regarding the explosion in the number of terms: could you track the total 
number of tokens that still need to produce a shingle at the next position, 
and ignore new tokens with posIncr=0 once that number gets too high (1000?)?
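
To make the suggestion concrete, here is a rough, self-contained sketch of 
one way such a cap could work. It is not taken from the attached patch and 
does not use Lucene's TokenStream API. The names StackedTokenCap, Token, cap, 
shingleSize and maxPending are all hypothetical, and it assumes that "tokens 
that still need to produce a shingle" means the tokens accepted at the last 
shingleSize - 1 positions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class StackedTokenCap {

    /** Hypothetical stand-in for a token: a term plus its position increment. */
    public static final class Token {
        public final String term;
        public final int posIncr;

        public Token(String term, int posIncr) {
            this.term = term;
            this.posIncr = posIncr;
        }
    }

    /**
     * Returns the tokens that survive the cap. Tokens that start a new
     * position (posIncr > 0) are always kept; stacked tokens (posIncr == 0)
     * are dropped while the pending count is at or above maxPending.
     */
    public static List<Token> cap(List<Token> input, int shingleSize, int maxPending) {
        if (shingleSize < 2) {
            throw new IllegalArgumentException("shingleSize must be >= 2");
        }
        // One entry per position in the window: how many tokens were kept there.
        Deque<Integer> window = new ArrayDeque<>();
        // Assumption: "pending" = tokens kept at the last (shingleSize - 1) positions.
        int pending = 0;
        List<Token> kept = new ArrayList<>();

        for (Token t : input) {
            // Advance one slot per position increment, evicting positions that
            // can no longer contribute to a shingle.
            for (int i = 0; i < t.posIncr; i++) {
                window.addLast(0);
                if (window.size() > shingleSize - 1) {
                    pending -= window.removeFirst();
                }
            }
            // Tolerate a stream whose first token has posIncr == 0.
            if (window.isEmpty()) {
                window.addLast(0);
            }
            // Ignore new stacked tokens once too many tokens are pending.
            if (t.posIncr == 0 && pending >= maxPending) {
                continue;
            }
            window.addLast(window.removeLast() + 1);
            pending++;
            kept.add(t);
        }
        return kept;
    }
}
```

In this sketch the cap only ever drops stacked tokens, so every position 
keeps at least the token that opened it. A real filter would probably want a 
limit closer to the 1000 mentioned above.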


> Add a FixedShingleFilter
> ------------------------
>
>                 Key: LUCENE-8202
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8202
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: 7.4
>
>         Attachments: LUCENE-8202.patch, LUCENE-8202.patch, LUCENE-8202.patch
>
>
> In LUCENE-3475 I tried to make a ShingleGraphFilter that could accept and 
> emit arbitrary graphs, while duplicating all the functionality of the 
> existing ShingleFilter.  This ends up being extremely hairy, and doesn't play 
> well with query parsers.
> I'd like to step back and try and create a simpler shingle filter that can be 
> used for index-time phrase tokenization only.  It will have a single fixed 
> shingle size, can deal with single-token synonyms, and won't emit unigrams.



