[ https://issues.apache.org/jira/browse/OPENNLP-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bruno P. Kinoshita updated OPENNLP-1350: ---------------------------------------- Fix Version/s: 1.9.4 > MAIL_REGEX in UrlCharSequenceNormalizer causes quadratic complexity for > certain input, and is also a bit imprecise > ------------------------------------------------------------------------------------------------------------------ > > Key: OPENNLP-1350 > URL: https://issues.apache.org/jira/browse/OPENNLP-1350 > Project: OpenNLP > Issue Type: Bug > Components: Language Detector > Affects Versions: 1.9.3 > Reporter: Jon Marius Venstad > Assignee: Bruno P. Kinoshita > Priority: Minor > Fix For: 1.9.4 > > > The regex used to strip email addresses from input, in > UrlCharSequenceNormalizer, has quadratic complexity when used with > {{{}String.replaceAll{}}}, and when input is a long sequence of characters > from the first character set, i.e., {{{}[-_.0-9A-Za-z]{}}}, which fails to > match the whole regex; then, the regex is evaluated again for each suffix of > this sequence, with linear cost each time. > This problem is promptly solved by adding a negative lookbehind with a single > character from that same set, to the first part of the regex. > > Additionally, the character {{_}} is allowed in the domain part of the mail > address, where it is in fact illegal. Likewise, the character {{+}} is > disallowed in the local part (the first first), where it _is{_} legal, and > even quite common. The set of legal characters in the first part is actually > quite bonkers, per the RFC, but such usage is probably less common. See > [https://en.wikipedia.org/wiki/Email_address] for details. > > The suggested fix is to change the {{MAIL_REGEX}} declaration to > {code:java} > private static final Pattern MAIL_REGEX = > > Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+"); > {code} > For a sequence of ~100k characters, the run time is ~1minute "on my machine". > With this change, it reduces to a few milliseconds. -- This message was sent by Atlassian Jira (v8.20.1#820001)