[jira] [Commented] (OPENNLP-1266) Limit normalization regexes in UrlCharSequenceNormalizer

ASF GitHub Bot (Jira) Tue, 02 Jun 2020 07:50:08 -0700


    [ 
https://issues.apache.org/jira/browse/OPENNLP-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123889#comment-17123889
 ]


ASF GitHub Bot commented on OPENNLP-1266:
-----------------------------------------

jzonthemtn commented on a change in pull request #355:
URL: https://github.com/apache/opennlp/pull/355#discussion_r433935594



##########
File path: 
opennlp-tools/src/test/java/opennlp/tools/util/normalizer/UrlCharSequenceNormalizerTest.java
##########
@@ -44,4 +44,15 @@ public void normalizeEmail() throws Exception {
         "asdf   2nnfdf  ", normalizer.normalize("asdf [email protected]" 
+
             " 2nnfdf [email protected]"));
   }
+

Review comment:
       I think this is a good change but I do worry about limiting the length 
of the URL in the regex. What if we added an argument to the 
`UrlCharSequenceNormalizer` constructor to make this an option to the user? 
That way the user can choose between the trade off of speed vs. potentially 
missing URLs and there won't be any risk of changing the expected behavior of 
OpenNLP language detector applications out in the wild. Thoughts?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


> Limit normalization regexes in UrlCharSequenceNormalizer
> --------------------------------------------------------
>
>                 Key: OPENNLP-1266
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1266
>             Project: OpenNLP
>          Issue Type: Task
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.9.3
>
>
> The {{MAIL_REGEX}} in UrlCharSequenceNormalizer is unbounded and requires 
> backtracking. In rare cases, this can cause eye-opening performance costs.
>  
> I tested the other regexes in the other normalizers.  I could be wrong, but 
> they don't appear to require backtracking, and there are no surprising 
> performance costs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (OPENNLP-1266) Limit normalization regexes in UrlCharSequenceNormalizer

Reply via email to