[ https://issues.apache.org/jira/browse/SOLR-5440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870482#comment-13870482 ]
Steve Rowe commented on SOLR-5440:
----------------------------------
[~bokkie] privately sent me a document that triggers this problem. The
document consists of an HTML snippet containing a {{<script>}} block, which
contains a 3-megabyte-long URL-encoded string in single-quotes, given as a
parameter to a javascript function defined elsewhere. (The purpose of the
javascript function is to URL-decode the string.)
When I run this text through {{UAX29URLEmailTokenizer}}, it doesn't actually
hang - it just tokenizes extremely slowly, consuming less than 100 characters
per second on my laptop. I didn't wait long enough to find out, but since the
scan should speed up as less text remains to look ahead through, I estimate the
average rate over the entire text is on the order of 200 characters per second,
so it would probably take about 4 hours to finish. (I also ran the same text
through {{StandardTokenizer}}, which fortunately does not exhibit the slow
tokenization behavior.) To convince myself that this is not an endless loop of
some kind, I ran shorter runs (a few hundred characters) of URL-encoded text
through {{UAX29URLEmailTokenizer}}, and they finished successfully.
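For anyone who wants to reproduce the measurement, something along these lines
should work - a minimal sketch, assuming the Lucene 4.x
{{UAX29URLEmailTokenizer(Version, Reader)}} constructor and
{{Version.LUCENE_46}}; the class name and the input path argument are just
placeholders for the offending document:
{noformat}
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

public class TokenizeTiming {
  public static void main(String[] args) throws Exception {
    // args[0]: path to the problem document (placeholder)
    String text = new String(Files.readAllBytes(Paths.get(args[0])), "UTF-8");
    UAX29URLEmailTokenizer tokenizer =
        new UAX29URLEmailTokenizer(Version.LUCENE_46, new StringReader(text));
    OffsetAttribute offsetAtt = tokenizer.addAttribute(OffsetAttribute.class);
    tokenizer.reset();
    long start = System.currentTimeMillis();
    long tokens = 0;
    while (tokenizer.incrementToken()) {
      if (++tokens % 1000 == 0) {  // periodic progress report
        double secs = (System.currentTimeMillis() - start) / 1000.0;
        System.out.printf("%d tokens, %d of %d chars, %.1f chars/sec%n",
            tokens, offsetAtt.endOffset(), text.length(), offsetAtt.endOffset() / secs);
      }
    }
    tokenizer.end();
    tokenizer.close();
  }
}
{noformat}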
I guessed that the problem was with email addresses, so I commented out that
part of the {{UAX29URLEmailTokenizer}} specification, and that caused the text
to be scanned at the same speed as {{StandardTokenizer}}.
The email rule in {{UAX29URLEmailTokenizer}} is basically the sequence
{{<local-part>, "@", <domain>}}. What's happening is that the entire 3-MB-long
URL-encoded string matches {{<local-part>}} (the stuff before the "@" in an
email address), so for each "%XX" URL-encoded byte, the scanner scans through
most of the remaining text looking for a "@" character, then gives up when it
reaches the end of the URL-encoded string without finding one, and finally
falls back to tokenizing "XX" as {{<ALPHANUM>}}. The scanner then starts over,
trying to match an email address against the remainder of the URL-encoded
string, and so on. The total work is therefore roughly quadratic in the length
of the URL-encoded run, so it's not much of a surprise that this is slow.
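The shape of the slowdown can also be checked without the original document by
timing synthetic runs of URL-encoded bytes at doubling lengths - if the
scan-ahead description above is right, the time should roughly quadruple each
time the input doubles. A sketch under the same assumptions as above ({{%41}}
is just an arbitrary encoded byte, and the sizes are arbitrary):
{noformat}
import java.io.StringReader;

import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.util.Version;

public class QuadraticCheck {
  public static void main(String[] args) throws Exception {
    for (int encodedBytes = 1000; encodedBytes <= 32000; encodedBytes *= 2) {
      // Build a long run of URL-encoded bytes with no '@' anywhere in it
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < encodedBytes; i++) {
        sb.append("%41");
      }
      UAX29URLEmailTokenizer tokenizer =
          new UAX29URLEmailTokenizer(Version.LUCENE_46, new StringReader(sb.toString()));
      tokenizer.reset();
      long start = System.currentTimeMillis();
      while (tokenizer.incrementToken()) {
        // drain the token stream
      }
      tokenizer.end();
      tokenizer.close();
      System.out.printf("%d chars: %d ms%n", sb.length(), System.currentTimeMillis() - start);
    }
  }
}
{noformat}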
[RFC5321|http://tools.ietf.org/search/rfc5321] says:
{noformat}
4.5.3.1. Size Limits and Minimums

   There are several objects that have required minimum/maximum sizes.
   Every implementation MUST be able to receive objects of at least
   these sizes.  Objects larger than these sizes SHOULD be avoided when
   possible.  However, some Internet mail constructs such as encoded
   X.400 addresses (RFC 2156 [35]) will often require larger objects.
   Clients MAY attempt to transmit these, but MUST be prepared for a
   server to reject them if they cannot be handled by it.  To the
   maximum extent possible, implementation techniques that impose no
   limits on the length of these objects should be used.

   Extensions to SMTP may involve the use of characters that occupy more
   than a single octet each.  This section therefore specifies lengths
   in octets where absolute lengths, rather than character counts, are
   intended.

4.5.3.1.1. Local-part

   The maximum total length of a user name or other local-part is 64
   octets.
{noformat}
So local-parts of email addresses that are going to work everywhere are
effectively limited to 64 bytes. ([Section 3 of
RFC3696|http://tools.ietf.org/html/rfc3696#section-3] says the same thing.)
One possible solution to this problem is to limit the allowable length of the
local-part. Currently the rule looks like:
{noformat}
EMAILquotedString = [\"] ([\u0001-\u0008\u000B\u000C\u000E-\u0021\u0023-\u005B\u005D-\u007E] | [\\] [\u0000-\u007F])* [\"]
EMAILatomText = [A-Za-z0-9!#$%&'*+-/=?\^_`{|}~]
EMAILlabel = {EMAILatomText}+ | {EMAILquotedString}
EMAILlocalPart = {EMAILlabel} ("." {EMAILlabel})*
{noformat}
When I try to limit {{EMAILlabel}} as follows, JFlex grinds for minutes trying
to generate the scanner and then eventually OOMs, even with env. var.
{{ANT_OPTS=-Xmx2g}} (I didn't try larger) - presumably because the {1,64}
bounded repetition is expanded into dozens of copies of the expression before
the NFA/DFA is built:
{noformat}
EMAILlabel = {EMAILatomText}{1,64} | {EMAILquotedString}
{noformat}
(Note that {{EMAILquotedString}} has the same unlimited length problem - really
long quoted ASCII strings could result in the same extremely slow tokenization
behavior.)
I think a solution could include a rule that matches a fixed-length,
longer-than-maximum local-part; its action would switch to a lexical state in
which email addresses aren't recognized and push the matched text back onto the
input stream. I haven't figured out exactly how to do this yet, though.
I'd welcome other ideas :)
> UAX29URLEmailTokenizer thread hangs on getNextToken - causes cloud to stop accepting updates
> ---------------------------------------------------------------------------------------------
>
> Key: SOLR-5440
> URL: https://issues.apache.org/jira/browse/SOLR-5440
> Project: Solr
> Issue Type: Bug
> Affects Versions: 4.5
> Reporter: Chris Geeringh
>
> This is a pretty nasty bug and causes the cluster to stop accepting updates.
> I'm not sure how to consistently reproduce it, but I have done so numerous
> times. Switching to a whitespace tokenizer improved indexing speed, and I
> never hit the issue again.
> I'm running a 4.6 snapshot - I had deadlock issues with numerous versions of
> Solr, and have finally narrowed the problem down to this code, which affects
> many/all(?) versions.
> When a thread hits this issue it uses 100% CPU; restarting the node that has
> the error allows indexing to continue until it is hit again. Here is the
> thread dump:
> http-bio-8080-exec-45 (201)
> org.apache.lucene.analysis.standard.UAX29URLEmailTokenizerImpl.getNextToken(UAX29URLEmailTokenizerImpl.java:4343)
> org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer.incrementToken(UAX29URLEmailTokenizer.java:147)
> org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
> org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
> org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
> org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:253)
> org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:453)
> org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1517)
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:217)
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:583)
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:719)
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:449)
> org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:89)
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:151)
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:131)
> org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:116)
> org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
> org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:112)
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:158)
> org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:99)
> org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:58)
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
> org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
> org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1023)
> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
> org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:312)
> java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> java.lang.Thread.run(Unknown Source)