[
https://issues.apache.org/jira/browse/OPENNLP-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865371#comment-16865371
]
ASF GitHub Bot commented on OPENNLP-1266:
-----------------------------------------
kottmann commented on pull request #355: OPENNLP-1266 -- Limit regexes in
UrlCharSequenceNormalizer
URL: https://github.com/apache/opennlp/pull/355#discussion_r294152668
##########
File path:
opennlp-tools/src/main/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizer.java
##########
@@ -24,9 +24,9 @@
public class UrlCharSequenceNormalizer implements CharSequenceNormalizer {
private static final Pattern URL_REGEX =
- Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]+");
+ Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]{1,10000}");
private static final Pattern MAIL_REGEX =
- Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");
+ Pattern.compile("[-_.0-9A-Za-z]{1,100}@[-_0-9A-Za-z]{1,100}[-_.0-9A-Za-z]{1,100}");
Review comment:
Do these limits restrict the length of the URLs and addresses that can be detected?
And why is it important to add them?
----------------------------------------------------------------
> Limit normalization regexes in UrlCharSequenceNormalizer
> --------------------------------------------------------
>
> Key: OPENNLP-1266
> URL: https://issues.apache.org/jira/browse/OPENNLP-1266
> Project: OpenNLP
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> The {{MAIL_REGEX}} in UrlCharSequenceNormalizer is unbounded and requires
> backtracking. In rare cases, this can cause eye-opening performance costs.
>
> I tested the other regexes in the other normalizers. I could be wrong, but
> they don't appear to require backtracking, and there are no surprising
> performance costs.
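
The cost described in the issue can be illustrated with a rough sketch. The two
patterns are copied from the diff in the review comment; the class name
BoundedRegexSketch and the helpers firstMatchLength and repeat are hypothetical
and are not part of OpenNLP. The timing loop assumes that a long run of
address-like characters with no '@' is the kind of input that triggers the
slowdown, which the issue does not state explicitly.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundedRegexSketch {
    // Both patterns are copied verbatim from the diff.
    static final Pattern MAIL_UNBOUNDED =
        Pattern.compile("[-_.0-9A-Za-z]+@[-_0-9A-Za-z]+[-_.0-9A-Za-z]+");
    static final Pattern MAIL_BOUNDED =
        Pattern.compile("[-_.0-9A-Za-z]{1,100}@[-_0-9A-Za-z]{1,100}[-_.0-9A-Za-z]{1,100}");

    // Hypothetical helper: length of the first match, or -1 if there is none.
    static int firstMatchLength(Pattern p, CharSequence text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.end() - m.start() : -1;
    }

    // Hypothetical helper: a string of n copies of c.
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // An address whose local part is longer than the 100-character bound.
        String longMail = repeat('a', 150) + "@example.com";
        System.out.println("unbounded: " + firstMatchLength(MAIL_UNBOUNDED, longMail));
        System.out.println("bounded:   " + firstMatchLength(MAIL_BOUNDED, longMail));

        // Assumed worst case: no '@' at all, so every start position fails
        // only after the leading quantifier has consumed and given back its input.
        String noAt = repeat('a', 5000);
        for (Pattern p : new Pattern[] {MAIL_UNBOUNDED, MAIL_BOUNDED}) {
            long start = System.nanoTime();
            int result = firstMatchLength(p, noAt);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println(p.pattern() + " -> " + result + " in " + micros + " us");
        }
    }
}
```

For the 150-character local part, the unbounded pattern matches all 162
characters, while the bounded pattern matches only the last 100 characters of
the local part plus "@example.com" (112 characters). So the bounds do cap how
much of a very long address is matched. In exchange, the work done per start
position on input without '@' is limited to about 100 characters instead of
the full remaining run.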