[GitHub] [opennlp] jonmv opened a new pull request #399: OPENNLP-1350 Improve normaliser MAIL_REGEX

GitBox Thu, 16 Dec 2021 13:04:05 -0800


jonmv opened a new pull request #399:
URL: https://github.com/apache/opennlp/pull/399



   Addresses OPENNLP-1350
   
   The `MAIL_REGEX` in `UrlCharSequenceNormalizer` causes `replaceAll(...)` to 
become extremely costly when given an input string with a long sequence of 
characters from the first character set in the regex, but which ultimately 
fails to match the whole regex. This pull request fixes that, and also another 
detail:
   
   Allow `+` in the local part, and disallow `_` in the domain part. There are 
other characters that are allowed in the local part as well, but these are less 
common (https://en.wikipedia.org/wiki/Email_address).
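
   As a rough illustration of the character-set change, the sketch below compares two patterns on a plus-addressed mail. Both patterns are assumptions reconstructed from the description above (only the first character set `[-_.0-9A-Za-z]` is quoted in this message), and `MailCharsetSketch`, `BEFORE`, `AFTER` and `mask` are hypothetical names, not OpenNLP API.

```java
import java.util.regex.Pattern;

final class MailCharsetSketch {
    // Assumed pre-change shape: no '+' in the local part, '_' allowed in the domain.
    static final Pattern BEFORE =
        Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");

    // Assumed post-change shape: '+' allowed in the local part, '_' removed from the domain.
    static final Pattern AFTER =
        Pattern.compile("[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

    // Replaces every match with a single space, as the normalizer is described to do.
    static String mask(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        String text = "write to john+lists@example.com";
        // Without '+' in the local set, only "lists@example.com" is masked
        // and "john+" is left behind.
        System.out.println("[" + mask(BEFORE, text) + "]");
        // With '+' in the local set, the whole address is masked.
        System.out.println("[" + mask(AFTER, text) + "]");
        // An underscore in the domain no longer matches under the assumed new pattern.
        System.out.println("[" + mask(AFTER, "a@b_c.com") + "]");
    }
}
```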
   
   The speedup for unfortunate input is achieved by adding a negative 
lookbehind containing a single character from the first character set. 
   Currently, `replaceAll(" ")` on a string of ~100K characters from the set 
`[-_.0-9A-Za-z]` runs in ~1 minute on modern hardware; adding a negative 
lookbehind for one character from that set reduces this to a few 
milliseconds, and is functionally equivalent. (Consider the current pattern and 
a match from position `i` to `k`. If the character at `i-1` is in the character 
set, there would also be a match from `i-1` to `k`, which would already have 
been replaced.)
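
   The lookbehind idea can be sketched in isolation as follows. This is a minimal example, not the code in the pull request: the pattern body is an assumed shape of the current regex (only its first character set is quoted above), and `MailLookbehindSketch`, `PLAIN`, `GUARDED`, `mask` and `repeat` are hypothetical names. The guarded variant only attempts a match at positions whose preceding character is outside the set, so a long run without `@` is scanned once instead of once per start position.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class MailLookbehindSketch {
    // First character set, as quoted in the description.
    static final String SET = "[-_.0-9A-Za-z]";
    // Assumed shape of the current pattern (the domain part is a guess).
    static final String BODY = SET + "+@[-_0-9A-Za-z]+" + SET + "+";

    static final Pattern PLAIN = Pattern.compile(BODY);
    // Same pattern, but a match may not start right after a character from SET.
    static final Pattern GUARDED = Pattern.compile("(?<!" + SET + ")" + BODY);

    static String mask(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll(" ");
    }

    // Builds a string of n copies of c (avoids String.repeat so it runs on Java 8).
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    public static void main(String[] args) {
        String sample = "cc: a.b@example.org, x_y@host-1.net";
        // Both variants should mask the same spans on ordinary input.
        System.out.println(mask(PLAIN, sample).equals(mask(GUARDED, sample)));

        // Unfortunate input: a long run from SET with no '@' anywhere.
        String run = repeat('a', 100_000);
        long start = System.nanoTime();
        mask(GUARDED, run);
        System.out.println("guarded ms: " + (System.nanoTime() - start) / 1_000_000);
        // Running PLAIN on the same input is the slow case described above.
    }
}
```

   Timing `PLAIN` on the same 100K-character run is deliberately left out of `main`, since that is the case reported above to take about a minute.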


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
