[
https://issues.apache.org/jira/browse/LUCENE-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16400746#comment-16400746
]
Adrien Grand commented on LUCENE-8202:
--------------------------------------
Thanks Alan, I agree this issue looks easier to address if we don't have to
support all the features that ShingleFilter accumulated over the years and
focus on index-time phrase tokenization. Some comments about the patch:
- Can you make this filter experimental, ideally it would get merged with
ShingleFilter eventually?
- Can you clarify in the sentence about stacked shingles that this is how this
filter differs from ShingleFilter. I think it's important for users to realize
when it makes sense to use this one.
- Let's not use upper case for variables that are not static? (GAP_TOKEN and
END_TOKEN)
- Let's make tokenPool an ArrayDeque instead for efficiency?
- Token doesn't need the type attribute, does it?
- advanceStack runs in quadratic time with shingleSize, is it something we can
fix? If not then let's put an upper bound on the value that shingleSize may
take (which might be a good move regardless).
- Token.reset should use AttributeSource.copyTo rather than
captureState+restoreState? This would save a lot of cloning.
- If it only cares about index-time I'm confused why it tries to work on graph
inputs, I think it's safe to assume and document that the FlattenGraphFilter
must be applied prior to the FixedShingleFilter? Said otherwise, say you have
an existing analyzer. I think we should have the goal that wrapping this
analyzer (AnalyzerWrapper) with a FixedShingleFilter should emit exactly all
the exact phrases of length `shingleSize` that you could search if you indexed
with the wrapped analyzer directly? For instance if I look at the
{{testMultiLengthPathShingles}} test, it indexes the shingle "the usa is". But
this is not a phrase that you can search if you create the same analysis chain
without the shingle filter in the end ("the" and "usa" are at position 0 and 1,
but "is" is at position 6).
> Add a FixedShingleFilter
> ------------------------
>
> Key: LUCENE-8202
> URL: https://issues.apache.org/jira/browse/LUCENE-8202
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Attachments: LUCENE-8202.patch
>
>
> In LUCENE-3475 I tried to make a ShingleGraphFilter that could accept and
> emit arbitrary graphs, while duplicating all the functionality of the
> existing ShingleFilter. This ends up being extremely hairy, and doesn't play
> well with query parsers.
> I'd like to step back and try and create a simpler shingle filter that can be
> used for index-time phrase tokenization only. It will have a single fixed
> shingle size, can deal with single-token synonyms, and won't emit unigrams.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]