[jira] [Commented] (OPENNLP-1350) MAIL_REGEX in UrlCharSequenceNormalizer causes quadratic complexity for certain input, and is also a bit imprecise

ASF GitHub Bot (Jira) Thu, 16 Dec 2021 13:05:06 -0800


    [ 
https://issues.apache.org/jira/browse/OPENNLP-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461042#comment-17461042
 ]


ASF GitHub Bot commented on OPENNLP-1350:
-----------------------------------------

jonmv opened a new pull request #399:
URL: https://github.com/apache/opennlp/pull/399


   Addresses OPENNLP-1350
   
   The `MAIL_REGEX` in `UrlCharSequenceNormalizer` causes `replaceAll(...)` to 
become extremely costly when given an input string with a long sequence of 
characters from the first character set in the regex, but which ultimately 
fails to match the whole regex. This pull request fixes that, and also another 
detail:
   
   Allow `+` in the local part, and disallow `_` in the domain part. There are 
other characters that are allowed in the local part as well, but these are less 
common (https://en.wikipedia.org/wiki/Email_address).
   
   The speedup for unfortunate input is achieved by adding a negative 
lookbehind with a single character from the first character set. 
   Currently, `replaceAll(" ")` on a string of ~100K characters from the set 
`[-_.0-9A-Za-z]` runs in ~1 minute on modern hardware; adding a negative 
lookbehind with one of the characters from that set reduces this to a few 
milliseconds, and is functionally equivalent. (Consider the current pattern and 
a match from position `i` to `k`. If the character at `i-1` is in the character 
set, there would also be a match from `i-1` to `k`, which would already have 
been replaced.)
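
The equivalence argument above can be checked with a small sketch. The pattern body used here is an assumption about the shape of the current `MAIL_REGEX`, inferred from this description and not copied from the OpenNLP source. The class and helper names (`LookbehindEquivalence`, `strip`, `run`) are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class LookbehindEquivalence {
  // Assumption: the pre-fix pattern has this shape (inferred from the
  // description above, not copied from the OpenNLP source).
  static final String BODY = "[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+";
  static final Pattern WITHOUT_LOOKBEHIND = Pattern.compile(BODY);
  // Same pattern, but a match may not start directly after a character
  // from the first character set.
  static final Pattern WITH_LOOKBEHIND =
      Pattern.compile("(?<![-_.0-9A-Za-z])" + BODY);

  static String strip(Pattern pattern, String input) {
    return pattern.matcher(input).replaceAll(" ");
  }

  // Builds a string of n copies of c (kept Java 8 compatible).
  static String run(char c, int n) {
    char[] chars = new char[n];
    Arrays.fill(chars, c);
    return new String(chars);
  }

  public static void main(String[] args) {
    String[] inputs = {
      "contact john.doe@example.com today",
      "a@b.c,d@e.f",
      "no address here",
      run('a', 10_000) // long run that never reaches an '@'
    };
    for (String input : inputs) {
      long t0 = System.nanoTime();
      String slow = strip(WITHOUT_LOOKBEHIND, input);
      long t1 = System.nanoTime();
      String fast = strip(WITH_LOOKBEHIND, input);
      long t2 = System.nanoTime();
      System.out.printf("len=%d same=%b without=%dus with=%dus%n",
          input.length(), slow.equals(fast),
          (t1 - t0) / 1000, (t2 - t1) / 1000);
    }
  }
}
```

The last input is kept at 10,000 characters so the variant without the lookbehind still finishes quickly; timings will vary by machine and JVM warm-up.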


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@opennlp.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> MAIL_REGEX in UrlCharSequenceNormalizer causes quadratic complexity for 
> certain input, and is also a bit imprecise
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: OPENNLP-1350
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1350
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Language Detector
>    Affects Versions: 1.9.3
>            Reporter: Jon Marius Venstad
>            Priority: Minor
>
> The regex used to strip email addresses from input, in 
> UrlCharSequenceNormalizer, has quadratic complexity when used with 
> {{String.replaceAll}} on input containing a long sequence of characters 
> from the first character set, i.e., {{[-_.0-9A-Za-z]}}, that fails to 
> match the whole regex: the regex is then evaluated again for each suffix of 
> this sequence, with linear cost each time. 
> This problem is readily solved by adding a negative lookbehind with a single 
> character from that same set to the first part of the regex. 
>  
> Additionally, the character {{_}} is allowed in the domain part of the mail 
> address, where it is in fact illegal. Likewise, the character {{+}} is 
> disallowed in the local part (the first part), where it _is_ legal, and 
> even quite common. The set of legal characters in the first part is actually 
> quite bonkers, per the RFC, but such usage is probably less common. See 
> [https://en.wikipedia.org/wiki/Email_address] for details. 
>  
> The suggested fix is to change the {{MAIL_REGEX}} declaration to
> {code:java}
> private static final Pattern MAIL_REGEX =
>     Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");
> {code}
> For a sequence of ~100k characters, the run time is ~1 minute "on my machine". 
> With this change, it reduces to a few milliseconds. 
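
As a rough illustration of the character-set changes in the suggested pattern, the following sketch checks a few whole addresses against it. The class name `MailRegexCharacterSets`, the helper `isWholeMatch`, and the sample addresses are hypothetical; only the pattern string is taken from the issue text.

```java
import java.util.regex.Pattern;

public class MailRegexCharacterSets {
  // The pattern suggested in the issue text.
  static final Pattern MAIL_REGEX = Pattern.compile(
      "(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

  // True if the whole candidate string is accepted as an address.
  static boolean isWholeMatch(String candidate) {
    return MAIL_REGEX.matcher(candidate).matches();
  }

  public static void main(String[] args) {
    String[] candidates = {
      "first.last+tag@example.com", // '+' in the local part
      "first_last@example.com",     // '_' in the local part
      "user@exa_mple.com"           // '_' in the domain part
    };
    for (String candidate : candidates) {
      System.out.println(candidate + " -> " + isWholeMatch(candidate));
    }
  }
}
```

This checks whole strings with `matches()`. `replaceAll` searches for substrings instead, so for an input like the third candidate it would still replace the part before the underscore.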



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
