[ 
https://issues.apache.org/jira/browse/OPENNLP-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861584#comment-16861584
 ] 

Tim Allison edited comment on OPENNLP-1261 at 6/11/19 10:58 PM:
----------------------------------------------------------------

Looks like improvement all around.  I ran [~joern]'s 1261 branch with a new 
language model he shared with me.

Performance doesn't tank as much with crazily large chunks of text, though it 
still drops a smidge. I wonder if this is caused by saturation: with that many 
observations, are some features boosted until they flatline at 1.0 (or the 
opposite)?

Overall, though, this appears to be faster, more accurate on very short texts 
and much more accurate on noisy text.

The speedup is slightly less than 2x, which matches what I found when swapping 
StringList for String, which this patch does.

Key parts of reports: "Accuracy Across Languages -- Detector/Noise/Length"

+1



> Language Detector fails to predict language on long input texts
> ---------------------------------------------------------------
>
>                 Key: OPENNLP-1261
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1261
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Language Detector
>            Reporter: Joern Kottmann
>            Assignee: Joern Kottmann
>            Priority: Major
>         Attachments: langid_plus_minus_rollups.zip, opennlp_as_is_vs_1261.zip
>
>
> If the input text is very long, e.g. 100k chars, then the lang detect 
> component fails to detect the language correctly, even though the text is 
> only written in one language.
> This issue was tracked down to the context generator, where the counts of the 
> ngrams are ignored.
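
To illustrate the description above, here is a minimal sketch of why ignoring ngram counts can hurt on long input. It is not the OpenNLP context generator; the class and method names (NgramCountSketch, distinctNgrams, ngramCounts) are hypothetical, and the trigram size and sample text are assumptions chosen only for the demonstration. With presence-only features, an ngram seen once carries the same weight as one seen thousands of times, so the longer the text, the more stray ngrams sit on equal footing with the dominant language's.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not taken from the OpenNLP code base.
public class NgramCountSketch {

    /** Presence-only features: each distinct character ngram is recorded once. */
    static Set<String> distinctNgrams(CharSequence text, int n) {
        Set<String> seen = new HashSet<>();
        for (int i = 0; i + n <= text.length(); i++) {
            seen.add(text.subSequence(i, i + n).toString());
        }
        return seen;
    }

    /** Count-aware features: each ngram is mapped to its number of occurrences. */
    static Map<String, Integer> ngramCounts(CharSequence text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.subSequence(i, i + n).toString(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // A long single-language text with one short foreign phrase appended.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("the cat sat on the mat ");
        }
        sb.append("der hund");
        String text = sb.toString();

        Set<String> present = distinctNgrams(text, 3);
        Map<String, Integer> counts = ngramCounts(text, 3);

        // Presence-only view: "the" and "hun" are indistinguishable.
        System.out.println(present.contains("the") + " " + present.contains("hun"));
        // Count-aware view: the dominant language's ngrams clearly outweigh the stray one.
        System.out.println(counts.get("the") + " vs " + counts.get("hun"));
    }
}
```

How the counts should feed into the model (raw, normalized, or capped) is a separate design choice that this sketch does not address.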



