jonmv commented on a change in pull request #399:
URL: https://github.com/apache/opennlp/pull/399#discussion_r779358503
##########
File path:
opennlp-tools/src/main/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizer.java
##########
@@ -26,7 +26,7 @@
private static final Pattern URL_REGEX =
Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]+");
private static final Pattern MAIL_REGEX =
- Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");
+
Pattern.compile("(?<![-+_.0-9A-Za-z])[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");
Review comment:
Yes, the addition of `+` is just an improvement. Also, `_` is not valid
in the domain part—this is just an error in the current regex.
The _negative_ lookbehind ensures a regex is _not_ a match if the lookbehind
matches right before the pattern.
Consider the string `foo aaa@b`. With the current regex, all of `aaa@b`,
`aa@b`, and `a@b` would match. Because it's used with a `replaceAll`, however,
only the first (longest) match will actually be used, i.e., `aaa@b`, and we'll
retain `foo `.
With the negative lookbehind, `aaa@b` will still match, as the previous
character (` `) does _not_ match the lookbehind (so the match is _not_ ignored
(double negation, sorry)). However, `aa@b` is no longer a match, as the
previous character in this case is `a`, which _does_ match the lookbehind, and
the match is ignored. Likewise, `a@b` no longer matches.
For strings that _are_ email addresses, we get the same behaviour: the
longest match is replaced with ` `.
Now consider the string `foo aaa`. Neither of the regexes match anything
here. The current regex, however, evaluates the whole of `foo` as a possible
local-part match until it reaches ` `, which is not allowed. It then tries with
`oo`, and `o`. Likewise, it tries `aaa` until it reaches the end of input, then
`aa`, and `a`.
The new regex, on the other hand, tries `foo` as a local-part because the
start of input does _not_ match the lookbehind, terminates on ` ` (as the
current regex), but then it immediately fails on `oo` because `f` _does_ match
the lookbehind, so `oo` and `o` immediately fail. Likewise, only `aaa` is
examined until the end; `aa` and `a` fail early.
This does not make much of a difference with these short strings, but you
can see the complexity is quadratic (current) vs linear (proposed) with the
length of strings like `aaa`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]