[ 
https://issues.apache.org/jira/browse/OPENNLP-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858648#comment-16858648
 ] 

ASF GitHub Bot commented on OPENNLP-1266:
-----------------------------------------

tballison commented on pull request #355: OPENNLP-1266 -- Limit regexes in 
UrlCharSequenceNormalizer
URL: https://github.com/apache/opennlp/pull/355#discussion_r291593496
 
 

 ##########
 File path: opennlp-tools/src/test/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizerTest.java
 ##########
 @@ -44,4 +44,15 @@ public void normalizeEmail() throws Exception {
         "asdf   2nnfdf  ", normalizer.normalize("asdf [email protected]" +
             " 2nnfdf [email protected]"));
   }
+
 
 Review comment:
   I don't like this test because it relies on timing.  I can remove it or 
substitute something better if you have recommendations.
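
One possible timing-free substitute, offered only as a sketch: wrap the input in a `CharSequence` that counts `charAt` calls and assert that the count grows roughly linearly with input length. The class name, the helper `countReads`, the two patterns, and the bound of 64 are all hypothetical and are not taken from the pull request; the exact number of reads is a JDK implementation detail, so the threshold is deliberately loose.

```java
import java.util.regex.Pattern;

public class CharAtCountingSketch {

  // Hypothetical patterns for illustration only; not the project's MAIL_REGEX.
  static final Pattern UNBOUNDED = Pattern.compile("[-+_.0-9A-Za-z]+@");
  static final Pattern BOUNDED = Pattern.compile("[-+_.0-9A-Za-z]{1,64}@");

  /** Delegating CharSequence that counts how often the regex engine reads a char. */
  static final class CountingCharSequence implements CharSequence {
    private final CharSequence delegate;
    long reads = 0;

    CountingCharSequence(CharSequence delegate) {
      this.delegate = delegate;
    }

    @Override
    public char charAt(int index) {
      reads++;
      return delegate.charAt(index);
    }

    @Override
    public int length() {
      return delegate.length();
    }

    @Override
    public CharSequence subSequence(int start, int end) {
      return delegate.subSequence(start, end);
    }

    @Override
    public String toString() {
      return delegate.toString();
    }
  }

  /** Runs a single find() over the input and returns the number of charAt calls made. */
  static long countReads(Pattern pattern, String input) {
    CountingCharSequence counted = new CountingCharSequence(input);
    pattern.matcher(counted).find();
    return counted.reads;
  }
}
```

A test built on this would feed a long run of local-part characters with no `@` and assert something like `countReads(BOUNDED, input) <= 4L * 64 * input.length()`. The count is deterministic for a given JDK, so the assertion does not depend on machine speed or load.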
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Limit normalization regexes in UrlCharSequenceNormalizer
> --------------------------------------------------------
>
>                 Key: OPENNLP-1266
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1266
>             Project: OpenNLP
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> The {{MAIL_REGEX}} in UrlCharSequenceNormalizer is unbounded and requires 
> backtracking. In rare cases, this can cause eye-opening performance costs.
>  
> I tested the other regexes in the other normalizers.  I could be wrong, but 
> they don't appear to require backtracking, and there are no surprising 
> performance costs.
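
The idea in the issue title can be illustrated with a minimal sketch. Both patterns below are hypothetical stand-ins, not the actual `MAIL_REGEX` from UrlCharSequenceNormalizer, and the limit of 64 and the single-space replacement are assumptions chosen for illustration. With an unbounded `+`, a long run of address-like characters that never reaches an `@` makes the engine consume the whole run and back off one character at a time from every start position. Capping the quantifier limits that work to a constant per start position.

```java
import java.util.regex.Pattern;

public class BoundedMailRegexSketch {

  // Hypothetical unbounded form: each quantifier may consume arbitrarily many chars.
  static final Pattern UNBOUNDED =
      Pattern.compile("[-+_.0-9A-Za-z]+@[-0-9A-Za-z]+[-.0-9A-Za-z]+");

  // Hypothetical bounded form: every quantifier is capped (64 is an assumed limit).
  static final Pattern BOUNDED =
      Pattern.compile("[-+_.0-9A-Za-z]{1,64}@[-0-9A-Za-z]{1,64}[-.0-9A-Za-z]{1,64}");

  /** Replaces each match of the bounded pattern with a single space. */
  static String normalize(CharSequence text) {
    return BOUNDED.matcher(text).replaceAll(" ");
  }

  public static void main(String[] args) {
    System.out.println(normalize("asdf jim@example.com 2nnfdf"));
  }
}
```

One trade-off of this sketch: an address whose local part exceeds the cap would be matched only from its last 64 characters, so a real change would need to decide how such input should be handled.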



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
