[
https://issues.apache.org/jira/browse/OPENNLP-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461042#comment-17461042
]
ASF GitHub Bot commented on OPENNLP-1350:
-----------------------------------------
jonmv opened a new pull request #399:
URL: https://github.com/apache/opennlp/pull/399
Addresses OPENNLP-1350
The `MAIL_REGEX` in `UrlCharSequenceNormalizer` causes `replaceAll(...)` to
become extremely costly when given an input string with a long sequence of
characters from the first character set in the regex, but which ultimately
fails to match the whole regex. This pull request fixes that, and also another
detail:
Allow `+` in the local part, and disallow `_` in the domain part. There are
other characters that are allowed in the local part as well, but these are less
common (https://en.wikipedia.org/wiki/Email_address).
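To illustrate the character-set change, here is a minimal sketch that uses the pattern suggested in the issue description (the pattern finally merged by the pull request may differ). The class name `MailCharsetSketch` and the helper `isWholeMatch` are hypothetical and not part of OpenNLP.

```java
import java.util.regex.Pattern;

public class MailCharsetSketch {

  // Pattern as suggested in the OPENNLP-1350 issue description.
  static final Pattern MAIL_REGEX = Pattern.compile(
      "(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

  // Hypothetical helper: true only if the whole candidate is one match.
  static boolean isWholeMatch(String candidate) {
    return MAIL_REGEX.matcher(candidate).matches();
  }

  public static void main(String[] args) {
    // '+' is accepted in the local part.
    System.out.println(isWholeMatch("jon+test@example.com")); // true
    // '_' is rejected in the domain part.
    System.out.println(isWholeMatch("jon@ex_ample.com"));     // false
  }
}
```

Note that `matches()` is used here only to test whole strings; the normalizer itself uses `replaceAll`, which may still replace a prefix of an address whose domain contains `_`.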
The speedup for unfortunate input is achieved by adding a negative
lookbehind with a single character from the first character set.
Currently, `replaceAll(" ")` on a string of ~100K characters from the set
`[-_.0-9A-Za-z]` takes ~1 minute on modern hardware; adding a negative
lookbehind with one of the characters from that set reduces this to a few
milliseconds, and is functionally equivalent. (Consider the current pattern and
a match from position `i` to `k`. If the character at `i-1` is in the character
set, there would also be a match from `i-1` to `k`, which would already be
replaced.)
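The effect of the lookbehind can be sketched as follows. To isolate it, the sketch compares the pattern suggested in the issue description against the same pattern with only the lookbehind removed (this second pattern is an assumption made for the comparison, not the pattern currently in OpenNLP). The class name `LookbehindSketch` and its helpers are hypothetical, and the input is kept shorter than the ~100K characters mentioned above so the slow variant still finishes quickly.

```java
import java.util.regex.Pattern;

public class LookbehindSketch {

  // Suggested pattern with the lookbehind removed (for comparison only).
  static final Pattern WITHOUT_LOOKBEHIND = Pattern.compile(
      "[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

  // Pattern as suggested in the OPENNLP-1350 issue description.
  static final Pattern WITH_LOOKBEHIND = Pattern.compile(
      "(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

  // Replace every match with a single space, as the normalizer does.
  static String strip(Pattern pattern, String input) {
    return pattern.matcher(input).replaceAll(" ");
  }

  // Build a string of n copies of c (avoids String.repeat, which needs Java 11).
  static String repeat(char c, int n) {
    StringBuilder sb = new StringBuilder(n);
    for (int i = 0; i < n; i++) {
      sb.append(c);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Long run of local-part characters with no '@': nothing can match.
    String input = repeat('a', 20_000);
    for (Pattern pattern : new Pattern[] {WITHOUT_LOOKBEHIND, WITH_LOOKBEHIND}) {
      long start = System.nanoTime();
      String output = strip(pattern, input);
      long millis = (System.nanoTime() - start) / 1_000_000;
      System.out.println(pattern.pattern() + ": " + millis + " ms, unchanged="
          + output.equals(input));
    }
  }
}
```

Without the lookbehind, the matcher retries from every position in the run and scans to the end each time; with it, every position after the first is rejected immediately because the preceding character is in the set. Timings depend on the machine and JVM, so none are quoted here.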
> MAIL_REGEX in UrlCharSequenceNormalizer causes quadratic complexity for
> certain input, and is also a bit imprecise
> ------------------------------------------------------------------------------------------------------------------
>
> Key: OPENNLP-1350
> URL: https://issues.apache.org/jira/browse/OPENNLP-1350
> Project: OpenNLP
> Issue Type: Bug
> Components: Language Detector
> Affects Versions: 1.9.3
> Reporter: Jon Marius Venstad
> Priority: Minor
>
> The regex used to strip email addresses from input, in
> UrlCharSequenceNormalizer, has quadratic complexity when used with
> {{String.replaceAll}} on input containing a long sequence of characters
> from the first character set, i.e., {{[-_.0-9A-Za-z]}}, which fails to
> match the whole regex: the regex is then evaluated again for each suffix of
> this sequence, with linear cost each time.
> This problem is promptly solved by adding a negative lookbehind with a single
> character from that same set, to the first part of the regex.
>
> Additionally, the character {{_}} is allowed in the domain part of the mail
> address, where it is in fact illegal. Likewise, the character {{+}} is
> disallowed in the local part (the first part), where it _is_ legal, and
> even quite common. The set of legal characters in the first part is actually
> quite bonkers, per the RFC, but such usage is probably less common. See
> [https://en.wikipedia.org/wiki/Email_address] for details.
>
> The suggested fix is to change the {{MAIL_REGEX}} declaration to
> {code:java}
> private static final Pattern MAIL_REGEX =
>     Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");
> {code}
> For a sequence of ~100k characters, the run time is ~1 minute "on my machine".
> With this change, it reduces to a few milliseconds.