[
https://issues.apache.org/jira/browse/OPENNLP-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861584#comment-16861584
]
Tim Allison edited comment on OPENNLP-1261 at 6/11/19 10:58 PM:
----------------------------------------------------------------
Looks like improvement all around. I ran [~joern]'s 1261 branch with a new
language model he shared with me.
Performance doesn't tank as much with crazily large chunks of text, though it
still drops a smidge. I wonder if this is caused by saturation: with that many
observations, do some features get boosted to 1.0 and then flatline there (or
the opposite)?
Overall, though, this appears to be faster, more accurate on very short texts
and much more accurate on noisy text.
The speedup is slightly less than 2x, which matches what I found when swapping
StringList for String, as this patch does.
Key parts of reports: "Accuracy Across Languages -- Detector/Noise/Length"
+1
> Language Detector fails to predict language on long input texts
> ---------------------------------------------------------------
>
> Key: OPENNLP-1261
> URL: https://issues.apache.org/jira/browse/OPENNLP-1261
> Project: OpenNLP
> Issue Type: Improvement
> Components: Language Detector
> Reporter: Joern Kottmann
> Assignee: Joern Kottmann
> Priority: Major
> Attachments: langid_plus_minus_rollups.zip, opennlp_as_is_vs_1261.zip
>
>
> If the input text is very long, e.g. 100k chars, then the language detector
> component fails to detect the language correctly, even though the text is
> written in only one language.
> This issue was tracked down to the context generator, where the counts of
> the ngrams are ignored.
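
To make the description above concrete, here is a minimal sketch of the difference between presence-only ngram features and counted ngram features. It is not the OpenNLP context generator; the class name NgramCountSketch and both method names are hypothetical, and character ngrams of a fixed length are an assumption made only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not OpenNLP code.
class NgramCountSketch {

    // Count every character ngram of length n in the text.
    static Map<String, Integer> countNgrams(String text, int n) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Presence-only view: which ngrams occur, with frequency discarded.
    static Set<String> distinctNgrams(String text, int n) {
        return countNgrams(text, n).keySet();
    }

    public static void main(String[] args) {
        String text = "banana banana band";
        // Presence-only features cannot tell a frequent ngram from a rare one.
        System.out.println(distinctNgrams(text, 2));
        // Counted features keep that information.
        System.out.println(countNgrams(text, 2));
    }
}
```

With presence-only features, a long enough text tends to contain most of the ngrams that could plausibly appear in it, so the feature set says less and less about which ngrams dominate. Keeping the counts preserves that signal.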