[ https://issues.apache.org/jira/browse/SOLR-5440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870482#comment-13870482 ]
Steve Rowe commented on SOLR-5440:
----------------------------------
[~bokkie] privately sent me a document that triggers this problem. The
document consists of an HTML snippet containing a {{<script>}} block, which
contains a 3-megabyte-long URL-encoded string in single-quotes, given as a
parameter to a javascript function defined elsewhere. (The purpose of the
javascript function is to URL-decode the string.)
When I run this text through {{UAX29URLEmailTokenizer}}, it doesn't actually
hang - it just tokenizes extremely slowly, consuming less than 100 characters
per second on my laptop. I didn't wait long enough to find out, but since the
scan should speed up as less text remains to look ahead through, I estimate the
average rate over the entire text is on the order of 200 characters per second,
so it would probably take about 4 hours to finish. (I also ran the same text
through {{StandardTokenizer}}, which fortunately does not exhibit the slow
tokenization behavior.) To convince myself that this is not an endless loop of
some kind, I ran shorter runs (a few hundred characters) of URL-encoded text
through {{UAX29URLEmailTokenizer}}, and they finished successfully.
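For anyone who wants to reproduce the measurement, something along these lines
should work - a minimal sketch, assuming the Lucene 4.x
{{UAX29URLEmailTokenizer(Version, Reader)}} constructor and
{{Version.LUCENE_46}}; the class name and the input path argument are just
placeholders for the offending document:
{noformat}
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

public class TokenizeTiming {
  public static void main(String[] args) throws Exception {
    // args[0]: path to the problem document (placeholder)
    String text = new String(Files.readAllBytes(Paths.get(args[0])), "UTF-8");
    UAX29URLEmailTokenizer tokenizer =
        new UAX29URLEmailTokenizer(Version.LUCENE_46, new StringReader(text));
    OffsetAttribute offsetAtt = tokenizer.addAttribute(OffsetAttribute.class);
    tokenizer.reset();
    long start = System.currentTimeMillis();
    long tokens = 0;
    while (tokenizer.incrementToken()) {
      if (++tokens % 1000 == 0) {  // periodic progress report
        double secs = (System.currentTimeMillis() - start) / 1000.0;
        System.out.printf("%d tokens, %d of %d chars, %.1f chars/sec%n",
            tokens, offsetAtt.endOffset(), text.length(), offsetAtt.endOffset() / secs);
      }
    }
    tokenizer.end();
    tokenizer.close();
  }
}
{noformat}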
I guessed that the problem was with email addresses, so I commented out that
part of the {{UAX29URLEmailTokenizer}} specification, and that caused the text
to be scanned at the same speed as {{StandardTokenizer}}.
The email rule in {{UAX29URLEmailTokenizer}} is basically the sequence
{{<local-part>, "@", <domain>}}. What's happening is that the entire 3-MB-long
URL-encoded string matches {{<local-part>}} (the stuff before the "@" in an
email address), so for each "%XX" URL-encoded byte, the scanner scans through
most of the remaining text looking for a "@" character, then gives up when it
reaches the end of the URL-encoded string without finding one, and finally
falls back to tokenizing "XX" as {{<ALPHANUM>}}. The scanner then starts over,
trying to match an email address against the remainder of the URL-encoded
string, and so on. The total work is therefore roughly quadratic in the length
of the URL-encoded run, so it's not much of a surprise that this is slow.
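The shape of the slowdown can also be checked without the original document by
timing synthetic runs of URL-encoded bytes at doubling lengths - if the
scan-ahead description above is right, the time should roughly quadruple each
time the input doubles. A sketch under the same assumptions as above ({{%41}}
is just an arbitrary encoded byte, and the sizes are arbitrary):
{noformat}
import java.io.StringReader;

import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.util.Version;

public class QuadraticCheck {
  public static void main(String[] args) throws Exception {
    for (int encodedBytes = 1000; encodedBytes <= 32000; encodedBytes *= 2) {
      // Build a long run of URL-encoded bytes with no '@' anywhere in it
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < encodedBytes; i++) {
        sb.append("%41");
      }
      UAX29URLEmailTokenizer tokenizer =
          new UAX29URLEmailTokenizer(Version.LUCENE_46, new StringReader(sb.toString()));
      tokenizer.reset();
      long start = System.currentTimeMillis();
      while (tokenizer.incrementToken()) {
        // drain the token stream
      }
      tokenizer.end();
      tokenizer.close();
      System.out.printf("%d chars: %d ms%n", sb.length(), System.currentTimeMillis() - start);
    }
  }
}
{noformat}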
[RFC5321|http://tools.ietf.org/search/rfc5321] says:
{noformat}
4.5.3.1. Size Limits and Minimums

   There are several objects that have required minimum/maximum sizes.
   Every implementation MUST be able to receive objects of at least
   these sizes.  Objects larger than these sizes SHOULD be avoided when
   possible.  However, some Internet mail constructs such as encoded
   X.400 addresses (RFC 2156 [35]) will often require larger objects.
   Clients MAY attempt to transmit these, but MUST be prepared for a
   server to reject them if they cannot be handled by it.  To the
   maximum extent possible, implementation techniques that impose no
   limits on the length of these objects should be used.

   Extensions to SMTP may involve the use of characters that occupy more
   than a single octet each.  This section therefore specifies lengths
   in octets where absolute lengths, rather than character counts, are
   intended.

4.5.3.1.1. Local-part

   The maximum total length of a user name or other local-part is 64
   octets.
{noformat}
So local-parts of email addresses that are going to work everywhere are
effectively limited to 64 bytes. ([Section 3 of
RFC3696|http://tools.ietf.org/html/rfc3696#section-3] says the same thing.)
One possible solution to this problem is to limit the allowable length of the
local-part. Currently the rule looks like:
{noformat}
EMAILquotedString = [\"] ([\u0001-\u0008\u000B\u000C\u000E-\u0021\u0023-\u005B\u005D-\u007E] | [\\] [\u0000-\u007F])* [\"]
EMAILatomText = [A-Za-z0-9!#$%&'*+-/=?\^_`{|}~]
EMAILlabel = {EMAILatomText}+ | {EMAILquotedString}
EMAILlocalPart = {EMAILlabel} ("." {EMAILlabel})*
{noformat}
When I try to limit {{EMAILlabel}} as follows, JFlex grinds for minutes trying
to generate the scanner and then eventually OOMs, even with env. var.
{{ANT_OPTS=-Xmx2g}} (I didn't try larger) - presumably because the {1,64}
bounded repetition is expanded into dozens of copies of the expression before
the NFA/DFA is built:
{noformat}
EMAILlabel = {EMAILatomText}{1,64} | {EMAILquotedString}
{noformat}
(Note that {{EMAILquotedString}} has the same unlimited length problem - really
long quoted ASCII strings could result in the same extremely slow tokenization
behavior.)
I think a solution could include a rule that matches a fixed-length,
longer-than-maximum local-part; its action would switch to a lexical state in
which email addresses aren't recognized and push the matched text back onto the
input stream. I haven't figured out exactly how to do this yet, though.
I'd welcome other ideas :)
> UAX29URLEmailTokenizer thread hangs on getNextToken - causes cloud to stop accepting updates
> ---------------------------------------------------------------------------------------------
>
> Key: SOLR-5440
> URL: https://issues.apache.org/jira/browse/SOLR-5440
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.5
> Reporter: Chris Geeringh
>
> This is a pretty nasty bug and causes the cluster to stop accepting updates.
> I'm not sure how to consistently reproduce it, but I have done so numerous
> times. Switching to a whitespace tokenizer improved indexing speed, and I
> never hit the issue again.
> I'm running a 4.6 snapshot - I had deadlock issues with numerous versions of
> Solr, and have finally narrowed the problem down to this code, which affects
> many/all(?) versions.
> When a thread hits this issue it uses 100% CPU; restarting the node that has
> the error allows indexing to continue until it is hit again. Here is the
> thread dump:
> http-bio-8080-exec-45 (201)
> org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerImpl.getNextToken(UAX29URLEmailTokenizerImpl.java:4343)
> org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer.incrementToken(UAX29URLEmailTokenizer.java:147)
> org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
> org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
> org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
> org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:453)
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1517)
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:217)
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:583)
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:719)
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:449)
> org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:89)
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:151)
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:131)
> org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:116)
> org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
> org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:112)
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:158)
> org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:99)
> org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
> org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
> org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
> org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:312)
> java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> java.lang.Thread.run(Unknown Source)