[
https://issues.apache.org/jira/browse/SOLR-13190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865172#comment-16865172
]
Mike Drob commented on SOLR-13190:
----------------------------------
It looks like I failed to mention in the initial report that the query causing
these issues for us was a mixed Japanese and English term. Based on
conversations and additional digging, I suspect that our algorithm
fundamentally doesn't work on multi-byte characters. When converting to
UTF-8, I suspect that the transitions are tearing the characters apart and might
be producing nonsense suggestions for deleting or adding "half" a character. I
don't have proof of this though, so for now I'll have to assume that the
algorithm works.
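To make the "half a character" concern concrete, here is a minimal sketch that
compares code point counts with UTF-8 byte counts for an English term and a
mixed English/Japanese term. The class name Utf8Width and the sample terms are
hypothetical, chosen only for illustration; they are not taken from the
failing query.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Width {
    /** Number of Unicode code points in the term. */
    static int codePoints(String term) {
        return term.codePointCount(0, term.length());
    }

    /** Number of bytes the term occupies once encoded as UTF-8. */
    static int utf8Bytes(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String english = "solr";
        // "solr" followed by two kanji (U+691C, U+7D22); illustrative sample only
        String mixed = "solr\u691c\u7d22";
        System.out.println(codePoints(english) + " code points, "
            + utf8Bytes(english) + " bytes");
        System.out.println(codePoints(mixed) + " code points, "
            + utf8Bytes(mixed) + " bytes");
    }
}
```

Each of the two kanji is a single code point but three UTF-8 bytes, so a
byte-level automaton has intermediate states that sit in the middle of a
character.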
That aside, the number of states scales linearly with term length by a factor
of 45, as can be confirmed in Lev2TParametricDescription. For English (or any
single-byte character text, I expect), that number of states does not change in
the UTF-32 to UTF-8 conversion. But for Japanese text, we get approximately 256
states per character post-conversion, starting with the 3rd character and
varying slightly with the exact text. I start to see
TooComplexToDeterminizeException come back with random strings of length 41 or
42.
Since we know that these aren't adversarial regular expressions, I think
we should be safe to pass in a dynamically determined maximum number of states.
For pure single-byte text, I don't think the automaton becomes
non-deterministic, so the bound doesn't matter. For other cases,
length*256+100 (the +100 for a little bit of buffer) might be good enough.
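A rough sketch of that bound, under a few assumptions of mine: "length" is
counted in code points, the result never drops below the existing limit of
10000 states seen in the stack trace, and the class and method names
(FuzzyStateBound, maxDeterminizedStates) are hypothetical rather than existing
Lucene or Solr API.

```java
public class FuzzyStateBound {
    /** The limit reported in the stack trace ("more than 10000 states"). */
    static final int DEFAULT_MAX_STATES = 10_000;

    /**
     * Proposed bound of length*256+100, where length is the number of code
     * points in the term (an assumption). Keeping the existing default as a
     * floor is also an assumption, so short terms behave as they do today.
     */
    static int maxDeterminizedStates(String term) {
        int length = term.codePointCount(0, term.length());
        long proposed = (long) length * 256 + 100; // long avoids int overflow
        long bounded = Math.max(DEFAULT_MAX_STATES, proposed);
        return (int) Math.min(Integer.MAX_VALUE, bounded);
    }

    public static void main(String[] args) {
        // 42 Japanese characters, the length where failures started to appear
        String term = "\u691c".repeat(42);
        System.out.println(maxDeterminizedStates(term)); // 42*256+100 = 10852
    }
}
```

With this formula a 41-character term gets a bound of 10596 and a 42-character
term gets 10852, both just above the current limit. Whether that is enough
headroom in practice would still need to be checked against real queries.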
> Fuzzy search treated as server error instead of client error when terms are
> too complex
> ---------------------------------------------------------------------------------------
>
> Key: SOLR-13190
> URL: https://issues.apache.org/jira/browse/SOLR-13190
> Project: Solr
> Issue Type: Bug
> Components: search
> Affects Versions: master (9.0)
> Reporter: Mike Drob
> Assignee: Mike Drob
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We've seen a fuzzy search end up breaking the automaton and getting reported
> as a server error. This usage should be improved by
> 1) reporting it as a client error, because it's similar to something like a
> too many boolean clauses error in how an operator should deal with it
> 2) reporting which field is causing the error, since that currently must be
> deduced from adjacent query logs and can be difficult if there are multiple
> terms in the search
> This trigger was added to defend against adversarial regex but somehow hits
> fuzzy terms as well. I don't understand enough about the automaton mechanisms
> to really know how to approach a fix there, but improving the operability is
> a good first step.
> Relevant stack trace:
> {noformat}
> org.apache.lucene.util.automaton.TooComplexToDeterminizeException:
> Determinizing automaton with 13632 states and 21348 transitions would result
> in more than 10000 states.
> at org.apache.lucene.util.automaton.Operations.determinize(Operations.java:746)
> at org.apache.lucene.util.automaton.RunAutomaton.<init>(RunAutomaton.java:69)
> at org.apache.lucene.util.automaton.ByteRunAutomaton.<init>(ByteRunAutomaton.java:32)
> at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:247)
> at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:133)
> at org.apache.lucene.search.FuzzyTermsEnum.<init>(FuzzyTermsEnum.java:143)
> at org.apache.lucene.search.FuzzyQuery.getTermsEnum(FuzzyQuery.java:154)
> at org.apache.lucene.search.MultiTermQuery$RewriteMethod.getTermsEnum(MultiTermQuery.java:78)
> at org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:58)
> at org.apache.lucene.search.TopTermsRewrite.rewrite(TopTermsRewrite.java:67)
> at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:310)
> at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:667)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:442)
> at org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:200)
> at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1604)
> at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1420)
> at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:567)
> at org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1435)
> at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:374)
> at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:298)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2559)
> {noformat}