[ 
https://issues.apache.org/jira/browse/LUCENE-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16400746#comment-16400746
 ] 

Adrien Grand commented on LUCENE-8202:
--------------------------------------

Thanks Alan, I agree this issue looks easier to address if we don't have to 
support all the features that ShingleFilter accumulated over the years and 
focus on index-time phrase tokenization. Some comments about the patch:
 - Can you make this filter experimental, ideally it would get merged with 
ShingleFilter eventually?
 - Can you clarify in the sentence about stacked shingles that this is how this 
filter differs from ShingleFilter. I think it's important for users to realize 
when it makes sense to use this one.
 - Let's not use upper case for variables that are not static? (GAP_TOKEN and 
END_TOKEN)
 - Let's make tokenPool an ArrayDeque instead for efficiency?
 - Token doesn't need the type attribute, does it?
 - advanceStack runs in quadratic time with shingleSize, is it something we can 
fix? If not then let's put an upper bound on the value that shingleSize may 
take (which might be a good move regardless).
 - Token.reset should use AttributeSource.copyTo rather than 
captureState+restoreState? This would save a lot of cloning.
 - If it only cares about index-time I'm confused why it tries to work on graph 
inputs, I think it's safe to assume and document that the FlattenGraphFilter 
must be applied prior to the FixedShingleFilter? Said otherwise, say you have 
an existing analyzer. I think we should have the goal that wrapping this 
analyzer (AnalyzerWrapper) with a FixedShingleFilter should emit exactly all 
the exact phrases of length `shingleSize` that you could search if you indexed 
with the wrapped analyzer directly? For instance if I look at the 
{{testMultiLengthPathShingles}} test, it indexes the shingle "the usa is". But 
this is not a phrase that you can search if you create the same analysis chain 
without the shingle filter in the end ("the" and "usa" are at position 0 and 1, 
but "is" is at position 6).

> Add a FixedShingleFilter
> ------------------------
>
>                 Key: LUCENE-8202
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8202
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>         Attachments: LUCENE-8202.patch
>
>
> In LUCENE-3475 I tried to make a ShingleGraphFilter that could accept and 
> emit arbitrary graphs, while duplicating all the functionality of the 
> existing ShingleFilter.  This ends up being extremely hairy, and doesn't play 
> well with query parsers.
> I'd like to step back and try and create a simpler shingle filter that can be 
> used for index-time phrase tokenization only.  It will have a single fixed 
> shingle size, can deal with single-token synonyms, and won't emit unigrams.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to