jonmv opened a new pull request #399: URL: https://github.com/apache/opennlp/pull/399
Addresses OPENNLP-1350

The `MAIL_REGEX` in `UrlCharSequenceNormalizer` makes `replaceAll(...)` extremely costly when the input contains a long run of characters from the first character set in the regex but ultimately fails to match the whole regex.

This pull request fixes that, along with one other detail: it allows `+` in the local part and disallows `_` in the domain part. Other characters are allowed in the local part as well, but they are less common (https://en.wikipedia.org/wiki/Email_address).

The speedup for unfortunate input is achieved by adding a negative lookbehind on a single character from the first character set. Currently, `replaceAll(" ")` on a string of ~100K characters from the set `[-_.0-9A-Za-z]` runs in ~1 minute on modern hardware; adding a negative lookbehind on one character from that set reduces this to a few milliseconds, and the result is functionally equivalent. (Consider the current pattern and a match from position `i` to `k`. If the character at `i-1` is in the character set, there would also be a match from `i-1` to `k`, which would already have been replaced.)
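For reviewers who want to try the lookbehind idea in isolation, here is a minimal sketch. The class name `MailRegexSketch`, the helper names, and the two patterns are illustrative assumptions rather than the exact constants in `UrlCharSequenceNormalizer`. Both patterns deliberately use the same character sets, so the only difference between them is the negative lookbehind.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class MailRegexSketch {
    // Assumed character set for the local part (hypothetical, for illustration only).
    private static final String LOCAL = "[-_.0-9A-Za-z]";
    // Assumed mail pattern body: local part, '@', then a domain of at least two characters.
    private static final String BODY = LOCAL + "+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+";

    // Without the lookbehind, a failed match is retried from every start position.
    static final Pattern WITHOUT_LOOKBEHIND = Pattern.compile(BODY);
    // With the lookbehind, a match may only start where the previous character
    // is not in the local-part set, so a long run is scanned only once.
    static final Pattern WITH_LOOKBEHIND = Pattern.compile("(?<!" + LOCAL + ")" + BODY);

    // Replaces every match of the given pattern with a single space.
    static String normalize(Pattern pattern, CharSequence text) {
        return pattern.matcher(text).replaceAll(" ");
    }

    // Builds a string of n copies of c (avoids String.repeat, which needs Java 11+).
    static String run(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        String sample = "write to john.doe@example.com or jane_roe@example.org today";
        System.out.println("same result: "
                + normalize(WITHOUT_LOOKBEHIND, sample).equals(normalize(WITH_LOOKBEHIND, sample)));

        // Unfortunate input: ~100K characters from the local-part set and no '@'.
        String unlucky = run('a', 100_000);
        long start = System.nanoTime();
        String result = normalize(WITH_LOOKBEHIND, unlucky);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("unchanged: " + result.equals(unlucky) + ", elapsed ms: " + elapsedMs);
    }
}
```

Passing `WITHOUT_LOOKBEHIND` for the long input should reproduce the slowdown described above, since each of the ~100K start positions rescans the rest of the run before failing.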
