[jira] [Commented] (SOLR-13190) Fuzzy search treated as server error instead of client error when terms are too complex

2019-06-16 Thread Mike Drob (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865172#comment-16865172
 ] 

Mike Drob commented on SOLR-13190:
--

Looks like I failed to mention in the initial report that the query causing 
these issues for us was a mixed Japanese and English term. Based on 
conversations and additional digging, I suspect that our algorithm 
fundamentally doesn't work on multi-code-point characters. When converting to 
UTF-8, I suspect that the transitions are tearing the characters apart and might 
be producing nonsense suggestions for deleting or adding "half" a character. I 
don't have proof of this though, so I'll have to assume that the algorithm 
works.

That aside, the number of states scales linearly with length by a factor of 45, 
as can be confirmed in Lev2TParametricDescription. For English (or any 
single-byte character text, I expect), that number of states does not change in 
the UTF-32 to UTF-8 conversion. But for Japanese text, we get approximately 256 
states per character post-conversion, starting with the 3rd character, and 
varying slightly with the exact text itself. I start to see 
TooComplexToDeterminizeException come back with random strings of length 41 or 
42.

Since we know that these aren't adversarial regular expressions, I think 
we should be safe to pass in a dynamically determined maximum number of states. 
For pure UTF-8 text, I don't think it becomes non-deterministic, so the bound 
doesn't matter. For other cases, length*256+100, to allow a little bit of 
buffer, might be good enough.
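
A minimal sketch of the proposed bound, assuming "length" means the number of code points in the term (the comment does not say whether characters, code points, or bytes are meant). FuzzyStateBound and maxDeterminizedStates are hypothetical names chosen for this example, and how the result would be passed into Lucene is left out.

```java
public class FuzzyStateBound {
    // Heuristic from the comment above: length * 256 + 100.
    // Assumption: length is counted in Unicode code points.
    public static int maxDeterminizedStates(String term) {
        int length = term.codePointCount(0, term.length());
        return length * 256 + 100;
    }

    public static void main(String[] args) {
        // 3 * 256 + 100 = 868
        System.out.println(maxDeterminizedStates("abc"));
    }
}
```

Counting code points rather than Java chars keeps the bound stable for characters outside the Basic Multilingual Plane, which occupy two chars in a String.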



> Fuzzy search treated as server error instead of client error when terms are 
> too complex
> ---
>
> Key: SOLR-13190
> URL: https://issues.apache.org/jira/browse/SOLR-13190
> Project: Solr
> Issue Type: Bug
> Components: search
> Affects Versions: master (9.0)
> Reporter: Mike Drob
> Assignee: Mike Drob
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We've seen a fuzzy search end up breaking the automaton and getting reported 
> as a server error. This usage should be improved by
> 1) reporting as a client error, because it's similar to something like too 
> many boolean clauses queries in how an operator should deal with it
> 2) reporting what field is causing the error, since that currently must be 
> deduced from adjacent query logs and can be difficult if there are multiple 
> terms in the search
> This trigger was added to defend against adversarial regex but somehow hits 
> fuzzy terms as well. I don't understand enough about the automaton mechanisms 
> to really know how to approach a fix there, but improving the operability is 
> a good first step.
> relevant stack trace:
> {noformat}
> org.apache.lucene.util.automaton.TooComplexToDeterminizeException: Determinizing automaton with 13632 states and 21348 transitions would result in more than 1 states.
>   at org.apache.lucene.util.automaton.Operations.determinize(Operations.java:746)
>   at org.apache.lucene.util.automaton.RunAutomaton.<init>(RunAutomaton.java:69)
>   at org.apache.lucene.util.automaton.ByteRunAutomaton.<init>(ByteRunAutomaton.java:32)
>   at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:247)
>   at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:133)
>   at org.apache.lucene.search.FuzzyTermsEnum.<init>(FuzzyTermsEnum.java:143)
>   at org.apache.lucene.search.FuzzyQuery.getTermsEnum(FuzzyQuery.java:154)
>   at org.apache.lucene.search.MultiTermQuery$RewriteMethod.getTermsEnum(MultiTermQuery.java:78)
>   at org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:58)
>   at org.apache.lucene.search.TopTermsRewrite.rewrite(TopTermsRewrite.java:67)
>   at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:310)
>   at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:667)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:442)
>   at org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:200)
>   at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1604)
>   at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1420)
>   at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:567)
>   at org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1435)
>   at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:374)
>   at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandl

[jira] [Commented] (SOLR-13190) Fuzzy search treated as server error instead of client error when terms are too complex

2019-02-19 Thread Mike Drob (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772074#comment-16772074
 ] 

Mike Drob commented on SOLR-13190:
--

bq. Does converting an already deterministic automaton necessarily destroy the 
deterministic state? 
I stepped through the code and it looks like a new automaton is assumed to be 
deterministic until there is a transition added that shows it is not. So that's 
not the solution here.
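
The behaviour described above (deterministic until a transition proves otherwise) can be pictured with a toy model. This is only a sketch under simplifying assumptions: ToyAutomaton is a hypothetical class, it uses single labels rather than label ranges, and it is not how Lucene's Automaton is implemented.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ToyAutomaton {
    // Maps (source state, label) to the set of destination states.
    private final Map<Long, Set<Integer>> transitions = new HashMap<>();
    // Assumed deterministic until a conflicting transition is added.
    private boolean deterministic = true;

    public void addTransition(int source, int label, int dest) {
        long key = ((long) source << 32) | (label & 0xFFFFFFFFL);
        Set<Integer> dests = transitions.computeIfAbsent(key, k -> new HashSet<>());
        dests.add(dest);
        // Two different destinations for the same (state, label) pair
        // mean the automaton is no longer deterministic.
        if (dests.size() > 1) {
            deterministic = false;
        }
    }

    public boolean isDeterministic() {
        return deterministic;
    }
}
```

In this model the flag only ever flips from true to false, so an automaton built from scratch reports itself as deterministic unless one of the added transitions conflicts with an earlier one.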

Poke [~mikemccand] - any thoughts on how to fix this?

> Fuzzy search treated as server error instead of client error when terms are 
> too complex
> ---
>
> Key: SOLR-13190
> URL: https://issues.apache.org/jira/browse/SOLR-13190
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Components: search
> Affects Versions: master (9.0)
> Reporter: Mike Drob
> Assignee: Mike Drob
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> We've seen a fuzzy search end up breaking the automaton and getting reported 
> as a server error. This usage should be improved by
> 1) reporting as a client error, because it's similar to something like too 
> many boolean clauses queries in how an operator should deal with it
> 2) reporting what field is causing the error, since that currently must be 
> deduced from adjacent query logs and can be difficult if there are multiple 
> terms in the search
> This trigger was added to defend against adversarial regex but somehow hits 
> fuzzy terms as well. I don't understand enough about the automaton mechanisms 
> to really know how to approach a fix there, but improving the operability is 
> a good first step.
> relevant stack trace:
> {noformat}
> org.apache.lucene.util.automaton.TooComplexToDeterminizeException: Determinizing automaton with 13632 states and 21348 transitions would result in more than 1 states.
>   at org.apache.lucene.util.automaton.Operations.determinize(Operations.java:746)
>   at org.apache.lucene.util.automaton.RunAutomaton.<init>(RunAutomaton.java:69)
>   at org.apache.lucene.util.automaton.ByteRunAutomaton.<init>(ByteRunAutomaton.java:32)
>   at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:247)
>   at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:133)
>   at org.apache.lucene.search.FuzzyTermsEnum.<init>(FuzzyTermsEnum.java:143)
>   at org.apache.lucene.search.FuzzyQuery.getTermsEnum(FuzzyQuery.java:154)
>   at org.apache.lucene.search.MultiTermQuery$RewriteMethod.getTermsEnum(MultiTermQuery.java:78)
>   at org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:58)
>   at org.apache.lucene.search.TopTermsRewrite.rewrite(TopTermsRewrite.java:67)
>   at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:310)
>   at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:667)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:442)
>   at org.apache.solr.search.SolrIndexSearcher.buildAndRunCollectorChain(SolrIndexSearcher.java:200)
>   at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1604)
>   at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1420)
>   at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:567)
>   at org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1435)
>   at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:374)
>   at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:298)
>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:2559)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-13190) Fuzzy search treated as server error instead of client error when terms are too complex

2019-02-11 Thread Mike Drob (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765499#comment-16765499
 ] 

Mike Drob commented on SOLR-13190:
--

We'd have to modify AutomatonQuery to handle multiple Automata, so maybe that's 
not the clearest path. 

[~mikemccand] - any further insight?




[jira] [Commented] (SOLR-13190) Fuzzy search treated as server error instead of client error when terms are too complex

2019-02-06 Thread Mike Drob (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761956#comment-16761956
 ] 

Mike Drob commented on SOLR-13190:
--

Can you expand on that a little more? Tracing through the code path, we...

1) Build a LevenshteinAutomaton
2) For each edit distance (0,1,2), create a separate instance of the Automaton 
class
3) Convert each Automaton with UTF32ToUTF8
4) Wrap each UTF-8 Automaton in a ByteRunAutomaton, which attempts to 
determinize.

The objects produced by 2) are deterministic; the ones produced by 3) are not.

So, a few questions and maybe a theory: Does converting an already 
deterministic automaton necessarily destroy the deterministic state? If so, are 
we sure that we need to be doing the conversion in the first place? The 
comments claim that PrefixQuery doesn't need the conversion, so maybe we can 
get away without it here too?

LUCENE-6367 makes me think that FuzzyQuery should subclass AutomatonQuery as 
well?




[jira] [Commented] (SOLR-13190) Fuzzy search treated as server error instead of client error when terms are too complex

2019-02-03 Thread Michael McCandless (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16759431#comment-16759431
 ] 

Michael McCandless commented on SOLR-13190:
---

+1 to improve the exception message to include the field and fuzzy term that 
led to this.

However, this exception is baffling because the way our FuzzyQuery works is to 
directly produce an already determinized and minimized automaton – that's the 
beauty of the (efficient) Levenshtein automaton construction algorithm.

So why are we then trying to determinize it again?  Something bad is lurking 
here – somehow we lost track that the automaton is already determinized?
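
One way to picture the improved message is a wrapper that catches the failure where the automaton is built and rethrows it with the field and term attached, marked as a client-side error. This is a sketch only: FuzzyErrorContext, TooComplexException (a stand-in for Lucene's TooComplexToDeterminizeException), ClientErrorException, and the status code 400 are all assumptions for illustration, not Solr or Lucene APIs.

```java
import java.util.function.Supplier;

public class FuzzyErrorContext {
    // Stand-in for Lucene's TooComplexToDeterminizeException (hypothetical).
    public static class TooComplexException extends RuntimeException {
        public TooComplexException(String message) {
            super(message);
        }
    }

    // Hypothetical client-side error carrying an HTTP-style status code.
    public static class ClientErrorException extends RuntimeException {
        public final int status;

        public ClientErrorException(int status, String message, Throwable cause) {
            super(message, cause);
            this.status = status;
        }
    }

    // Runs the automaton-building step and, on failure, rethrows with the
    // field and term attached so the offending clause can be identified.
    public static <T> T withContext(String field, String term, Supplier<T> build) {
        try {
            return build.get();
        } catch (TooComplexException e) {
            throw new ClientErrorException(400,
                "Fuzzy term '" + term + "' in field '" + field
                    + "' is too complex: " + e.getMessage(), e);
        }
    }
}
```

Keeping the original exception as the cause preserves the state and transition counts from the underlying message while the outer message names the field and term.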




[jira] [Commented] (SOLR-13190) Fuzzy search treated as server error instead of client error when terms are too complex

2019-02-01 Thread Mike Drob (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758713#comment-16758713
 ] 

Mike Drob commented on SOLR-13190:
--

[~mikemccand] - WDYT? You originally added this exception in LUCENE-6046; not 
sure if you knew that it was also affecting fuzzy terms when planning for 
direct regex construction.
