[ https://issues.apache.org/jira/browse/OPENNLP-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16853507#comment-16853507 ]
Tim Allison edited comment on OPENNLP-1265 at 6/7/19 2:10 PM:
--------------------------------------------------------------
Side issue...it looks like the URL normalizer uses unbounded regexes. This was
a problem on TIKA-2777 with a file that had a long, long string of DNA
(atcgcgat).
If you turn off all of the normalizers except the URL normalizer and remove
the spaces from the input string, the time goes to:
-...it has been 20 minutes...I'll update this when/if it finishes this year.-
it has been 90 minutes...the fan is now running full blast...out of pity for my
laptop, I give up...
If you bound the regexes to 100, the time is acceptable, but still
discomforting:
{noformat}
private static final Pattern URL_REGEX =
    Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]{1,100}");
private static final Pattern MAIL_REGEX =
    Pattern.compile("[-_.0-9A-Za-z]{1,100}@[-_0-9A-Za-z]{1,100}[-_.0-9A-Za-z]{1,100}");
{noformat}
25167 : lat=50
25537 : lat=50
25116 : lat=50
Bounding the regexes doesn't help on the regular string, of course, but guard
rails are good:
5938 : por=50
6331 : por=50
5989 : por=50
-Happy to open a separate ticket. Let me know how I can help...-
Opened OPENNLP-1266
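For anyone who wants to try the bounded case locally, here is a minimal sketch that applies the two bounded patterns quoted above to a space-free DNA-like string and prints the elapsed time. The class name, the dna() helper, the input length, and the replace-with-a-single-space behavior are assumptions made for illustration and are not taken from the OpenNLP source. The numbers it prints are not comparable to the timings above.

```java
import java.util.regex.Pattern;

public class BoundedUrlNormalizerSketch {

    // Bounded patterns, copied from the comment above.
    private static final Pattern URL_REGEX =
        Pattern.compile("https?://[-_.?&~;+=/#0-9A-Za-z]{1,100}");
    private static final Pattern MAIL_REGEX =
        Pattern.compile("[-_.0-9A-Za-z]{1,100}@[-_0-9A-Za-z]{1,100}[-_.0-9A-Za-z]{1,100}");

    // Assumption: each URL or mail match is replaced by a single space.
    public static String normalize(CharSequence text) {
        String modified = URL_REGEX.matcher(text).replaceAll(" ");
        return MAIL_REGEX.matcher(modified).replaceAll(" ");
    }

    // Hypothetical helper: builds a space-free run of DNA-like letters.
    public static String dna(int length) {
        String unit = "atcgcgat";
        StringBuilder sb = new StringBuilder(length + unit.length());
        while (sb.length() < length) {
            sb.append(unit);
        }
        sb.setLength(length);
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = dna(50_000); // arbitrary size chosen for the sketch
        long start = System.nanoTime();
        String out = normalize(input);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("length=" + out.length() + " elapsed_ms=" + elapsedMs);
    }
}
```

With the {1,100} bounds, each failed match attempt can backtrack over at most 100 characters, so the work per start position is capped regardless of how long the run of letters is.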
> Improve speed of lang detect
> ----------------------------
>
> Key: OPENNLP-1265
> URL: https://issues.apache.org/jira/browse/OPENNLP-1265
> Project: OpenNLP
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> Over on TIKA-2790, we found that OpenNLP's language detector is far, far
> slower than Optimaize and yalder.
> Let's use this ticket to see what we can do to improve lang detect's speed.