[ 
https://issues.apache.org/jira/browse/OPENNLP-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461048#comment-17461048
 ] 

ASF GitHub Bot commented on OPENNLP-1266:
-----------------------------------------

jonmv commented on pull request #355:
URL: https://github.com/apache/opennlp/pull/355#issuecomment-996195973


   Please consider https://github.com/apache/opennlp/pull/399 instead. I 
believe the URL regex shouldn't cause super-linear complexity the way the MAIL 
regex does. The problem is that the regex is used in `String.replaceAll(...)`, 
so a match is attempted at every start position, i.e. for each suffix of bad 
input. This is not an issue for the URL regex, where each attempt can only 
consume a few characters before it either fails or succeeds definitively. 
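
To illustrate the per-suffix cost described above, here is a minimal sketch. The class name `RegexSuffixDemo` and both patterns are hypothetical and simplified for illustration; they are not the actual `MAIL_REGEX` from `UrlCharSequenceNormalizer`, nor the change made in either pull request. On input with no `@`, the unbounded pattern scans to the end of the input from every start position, which should give roughly quadratic work, while the bounded pattern gives up after at most 64 characters per attempt.

```java
import java.util.regex.Pattern;

final class RegexSuffixDemo {
    // Hypothetical patterns for illustration only; NOT the real OpenNLP regexes.
    // Unbounded: the leading quantifier can run to the end of the input.
    static final Pattern UNBOUNDED_MAIL =
            Pattern.compile("[a-z0-9._-]+@[a-z0-9.-]+");
    // Bounded: each match attempt inspects a limited number of characters.
    static final Pattern BOUNDED_MAIL =
            Pattern.compile("[a-z0-9._-]{1,64}@[a-z0-9.-]{1,255}");

    // Replaces every match with a single space, the way a normalizer might.
    static String normalize(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll(" ");
    }

    // Builds a string of n copies of c (kept explicit for older Java versions).
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Bad" input: valid local-part characters, but never an '@'.
        String bad = repeat('a', 20_000);
        for (Pattern p : new Pattern[] {UNBOUNDED_MAIL, BOUNDED_MAIL}) {
            long start = System.nanoTime();
            String out = normalize(p, bad);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(p.pattern() + ": " + ms + " ms, unchanged=" + out.equals(bad));
        }
    }
}
```

Timings will vary by JVM and machine, so none are quoted here. Note that bounding the quantifiers also changes what is matched: an address whose local part exceeds the bound would only be partially replaced, so the limits are a trade-off rather than a free fix.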


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


> Limit normalization regexes in UrlCharSequenceNormalizer
> --------------------------------------------------------
>
>                 Key: OPENNLP-1266
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1266
>             Project: OpenNLP
>          Issue Type: Task
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.9.4
>
>
> The {{MAIL_REGEX}} in UrlCharSequenceNormalizer is unbounded and requires 
> backtracking. In rare cases, this can cause eye-opening performance costs.
>  
> I tested the other regexes in the other normalizers.  I could be wrong, but 
> they don't appear to require backtracking, and there are no surprising 
> performance costs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
