[ https://issues.apache.org/jira/browse/OPENNLP-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461042#comment-17461042 ]
ASF GitHub Bot commented on OPENNLP-1350:
-----------------------------------------

jonmv opened a new pull request #399:
URL: https://github.com/apache/opennlp/pull/399

   Addresses OPENNLP-1350

   The `MAIL_REGEX` in `UrlCharSequenceNormalizer` causes `replaceAll(...)` to become extremely costly when given an input string that contains a long sequence of characters from the first character set in the regex, but which ultimately fails to match the whole regex.

   This pull request fixes that, and also another detail: it allows `+` in the local part and disallows `_` in the domain part. Other characters are allowed in the local part as well, but these are less common (https://en.wikipedia.org/wiki/Email_address).

   The speedup for unfortunate input is achieved by adding a negative lookbehind with a single character from the first character set. Currently, `replaceAll(" ")` on a string of ~100K characters from the set `[-_.0-9A-Za-z]` runs in ~1 minute on modern hardware; adding a negative lookbehind with one of the characters from that set reduces this to a few milliseconds, and is functionally equivalent. (Consider the current pattern and a match from position `i` to `k`. If the character at `i-1` is in the character set, there would also be a match from `i-1` to `k`, which would already have been replaced.)

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.
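The equivalence argument in the pull request description can be spot-checked with a small sketch. The pre-fix pattern below is an assumption, reconstructed from the description (first character set `[-_.0-9A-Za-z]`, `_` still allowed in the domain part); the class name, method name, and sample strings are hypothetical. Only the lookbehind is added here, with the character sets left unchanged, since that is the change the equivalence claim is about.

```java
import java.util.regex.Pattern;

class LookbehindEquivalenceSketch {
    // First character set, as quoted in the pull request description.
    static final String HEAD = "[-_.0-9A-Za-z]";

    // ASSUMPTION: reconstruction of the pre-fix MAIL_REGEX; not copied from the OpenNLP source.
    static final Pattern WITHOUT_LOOKBEHIND =
        Pattern.compile(HEAD + "+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");

    // Same pattern, prefixed with a negative lookbehind on one character from HEAD.
    static final Pattern WITH_LOOKBEHIND =
        Pattern.compile("(?<!" + HEAD + ")" + WITHOUT_LOOKBEHIND.pattern());

    // True if both patterns produce the same output when matches are replaced by a space.
    static boolean sameResult(String input) {
        String a = WITHOUT_LOOKBEHIND.matcher(input).replaceAll(" ");
        String b = WITH_LOOKBEHIND.matcher(input).replaceAll(" ");
        return a.equals(b);
    }

    public static void main(String[] args) {
        // Hypothetical sample inputs: matching addresses, near misses, and plain runs.
        String[] samples = {
            "contact john.doe@example.com today",
            "two in a row: a.b@cd.ef,g_h@ij.kl",
            "no at sign here_just-a.long.run",
            "dangling@ and @leading and a@b",
        };
        for (String s : samples) {
            System.out.println(sameResult(s) + " <- " + s);
        }
    }
}
```

A handful of samples does not prove equivalence; it only guards against an obvious slip in the reasoning above.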
To unsubscribe, e-mail: dev-unsubscr...@opennlp.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> MAIL_REGEX in UrlCharSequenceNormalizer causes quadratic complexity for certain input, and is also a bit imprecise
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: OPENNLP-1350
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1350
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Language Detector
>   Affects Versions: 1.9.3
>            Reporter: Jon Marius Venstad
>            Priority: Minor
>
> The regex used to strip email addresses from input, in
> UrlCharSequenceNormalizer, has quadratic complexity when used with
> {{String.replaceAll}} on input that contains a long sequence of characters
> from the first character set, i.e., {{[-_.0-9A-Za-z]}}, and that fails to
> match the whole regex: the regex is then evaluated again for each suffix of
> this sequence, with linear cost each time.
> This problem is readily solved by adding a negative lookbehind with a single
> character from that same set to the first part of the regex.
>
> Additionally, the character {{_}} is allowed in the domain part of the mail
> address, where it is in fact illegal. Likewise, the character {{+}} is
> disallowed in the local part (the first part), where it _is_ legal, and
> even quite common. The set of legal characters in the local part is actually
> quite bonkers, per the RFC, but such usage is probably less common. See
> [https://en.wikipedia.org/wiki/Email_address] for details.
>
> The suggested fix is to change the {{MAIL_REGEX}} declaration to
> {code:java}
> private static final Pattern MAIL_REGEX =
>     Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");
> {code}
> For a sequence of ~100k characters, the run time is ~1 minute "on my machine".
> With this change, it reduces to a few milliseconds.
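A rough way to observe both effects described in the issue is sketched below. The proposed pattern is copied from the issue; the pre-fix pattern is an assumption reconstructed from the issue text, and the class name, helper names, and input sizes are hypothetical. Timings depend on the machine and JVM, so the sketch prints them rather than asserting anything about them.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class MailRegexSketch {
    // ASSUMPTION: reconstruction of the pre-fix pattern; not copied from the OpenNLP source.
    static final Pattern OLD =
        Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");

    // Pattern proposed in the issue.
    static final Pattern PROPOSED =
        Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

    // Replaces every match with a single space, as the issue describes.
    static String strip(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll(" ");
    }

    // Builds a run of n characters from the first character set, containing no '@'.
    static String unfortunateInput(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'a');
        return new String(chars);
    }

    // Wall-clock time of one strip call, in milliseconds (a crude measurement, not a benchmark).
    static long millis(Pattern pattern, String text) {
        long start = System.nanoTime();
        strip(pattern, text);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // Kept well below the ~100k characters mentioned in the issue so the slow case stays short.
        String input = unfortunateInput(10_000);
        System.out.println("old pattern:      " + millis(OLD, input) + " ms");
        System.out.println("proposed pattern: " + millis(PROPOSED, input) + " ms");

        // '+' in the local part: the proposed pattern removes the whole address,
        // while the assumed old pattern can only start matching after the '+'.
        String mail = "write to jon+list@example.com now";
        System.out.println("[" + strip(OLD, mail) + "]");
        System.out.println("[" + strip(PROPOSED, mail) + "]");
    }
}
```

With the lookbehind in place, only the first position of the run attempts a full match; every later position is rejected after inspecting a single preceding character, which is where the linear behaviour comes from.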
-- This message was sent by Atlassian Jira (v8.20.1#820001)