Jon Marius Venstad created OPENNLP-1350:
-------------------------------------------

             Summary: MAIL_REGEX in UrlCharSequenceNormalizer causes quadratic 
complexity for certain input, and is also a bit imprecise
                 Key: OPENNLP-1350
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1350
             Project: OpenNLP
          Issue Type: Bug
          Components: Language Detector
    Affects Versions: 1.9.3
            Reporter: Jon Marius Venstad


The regex used to strip email addresses from input, in 
UrlCharSequenceNormalizer, has quadratic complexity when used with 
{{{}String.replaceAll{}}}, and when input is a long sequence of characters from 
the first character set, i.e., {{[-_.0-9A-Za-z]}}, which fails to match the 
whole regex; then, the regex is evaluated again for each suffix of this 
sequence, with linear cost each time. 

This problem is promptly solved by adding a negative lookbehind with a single 
character from that same set, to the first part of the regex. 

 

Additionally, the character {{_}} is allowed in the domain part of the mail 
address, where it is in fact illegal. Likewise, the character {{+}} is 
disallowed in the local part (the first first), where it _is_ legal, and even 
quite common. The set of legal characters in the first part is actually quite 
bonkers, per the RFC, but such usage is probably less common. See 
[https://en.wikipedia.org/wiki/Email_address] for details. 

 

The suggested fix is to change the {{MAIL_REGEX}} declaration to
{code:java}
private static final Pattern MAIL_REGEX =
      
Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");
 {code}
For a sequence of ~10k characters, the run time is ~1minute "on my machine". 
With this change, it reduces to a few milliseconds. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to