[
https://issues.apache.org/jira/browse/LUCENE-8202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409361#comment-16409361
]
Alan Woodward commented on LUCENE-8202:
---------------------------------------
TestRandomChains has found two issues:
* positionLength should be 1, rather than the shingle length. We don't have
any intermediary tokens, only shingles, so we're not building graphs. TRC
found this by feeding the output into FlattenGraphFilter, which then complained.
* we need to somehow limit either the length of the shingle or the number of
stacked positions we iterate through, as we can otherwise get a combinatorial
explosion of terms. TRC found this by feeding long strings into a
decompounding filter, and then building shingles of length 11. The
decompounding filter was producing up to 50 tokens in the same position, which
led to 50^11 shingles being generated, resulting in OOM. I'm not sure of the
best way of dealing with this one though - we could just limit shingle length
to a maximum of 3 or 4, but that seems like too harsh a restriction for this.
The other possibility would be to have a (configurable) maximum number of
shingles emitted at a single position, and throw IllegalStateException if this
is hit.
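
To make the second point concrete, here is a minimal, standalone sketch of the arithmetic and of the "maximum shingles per position" idea. It is not the patch's code and does not use the Lucene API: the class name ShingleLimitSketch, the methods countShingles and buildShingles, and the maxShingles parameter are all hypothetical names chosen for illustration, and stacked tokens are modelled as plain lists of strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene or of the LUCENE-8202 patch.
final class ShingleLimitSketch {

    // Number of shingles starting at one position: the product of the
    // number of stacked tokens at each of the positions the shingle spans.
    static long countShingles(int[] stackSizes) {
        long total = 1;
        for (int size : stackSizes) {
            // multiplyExact throws ArithmeticException instead of silently overflowing
            total = Math.multiplyExact(total, (long) size);
        }
        return total;
    }

    // Builds every shingle over the given stacks of tokens (one stack per
    // position), and fails fast once more than maxShingles would be produced.
    static List<String> buildShingles(List<List<String>> stacks, int maxShingles) {
        List<String> shingles = new ArrayList<>();
        shingles.add("");
        boolean first = true;
        for (List<String> stack : stacks) {
            List<String> next = new ArrayList<>();
            for (String prefix : shingles) {
                for (String token : stack) {
                    if (next.size() >= maxShingles) {
                        throw new IllegalStateException(
                            "Too many shingles at a single position (limit " + maxShingles + ")");
                    }
                    next.add(first ? token : prefix + " " + token);
                }
            }
            shingles = next;
            first = false;
        }
        return shingles;
    }
}
```

With 11 positions of 50 stacked tokens each, countShingles returns 50^11, which is roughly 4.9 * 10^18 shingles for a single start position. The cap in buildShingles trades that unbounded growth for an early IllegalStateException, which is the behaviour the second option above describes.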
> Add a FixedShingleFilter
> ------------------------
>
> Key: LUCENE-8202
> URL: https://issues.apache.org/jira/browse/LUCENE-8202
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Alan Woodward
> Assignee: Alan Woodward
> Priority: Major
> Fix For: 7.4
>
> Attachments: LUCENE-8202.patch, LUCENE-8202.patch, LUCENE-8202.patch
>
>
> In LUCENE-3475 I tried to make a ShingleGraphFilter that could accept and
> emit arbitrary graphs, while duplicating all the functionality of the
> existing ShingleFilter. This ends up being extremely hairy, and doesn't play
> well with query parsers.
> I'd like to step back and try and create a simpler shingle filter that can be
> used for index-time phrase tokenization only. It will have a single fixed
> shingle size, can deal with single-token synonyms, and won't emit unigrams.