[ https://issues.apache.org/jira/browse/OPENNLP-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861784#comment-16861784 ]

Joern Kottmann commented on OPENNLP-1261:
-----------------------------------------

One of the performance issues in the feature generation output is that it 
produces strings. I once worked on a prototype where the strings were hashed, 
and the model then used those ints for the prediction. That turned out to be 
a bit faster (because it doesn't need to create all those String objects).
If performance is an issue, we should try to get that released.
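
To illustrate the idea (this is only a rough sketch, not the actual 
prototype; the class and method names here are made up), the context 
strings could be mapped to int ids once, and the model would then work 
on those ids:

    import java.util.HashMap;
    import java.util.Map;

    class HashedContext {

        private final Map<String, Integer> featureIds = new HashMap<>();

        // Resolve a feature string to a stable int id.
        int idOf(String feature) {
            return featureIds.computeIfAbsent(feature, f -> featureIds.size());
        }

        // Convert a String[] context to int[] once; a model working on
        // int ids could then predict without allocating String objects
        // for every feature occurrence.
        int[] toIds(String[] context) {
            int[] ids = new int[context.length];
            for (int i = 0; i < context.length; i++) {
                ids[i] = idOf(context[i]);
            }
            return ids;
        }
    }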

GISModel.eval(String[], float[]): I still need to take a look at it, and if 
it can be used here we should run a test with it to see how much it helps.
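
For example (just a sketch against the MaxentModel interface, untested), 
the duplicate ngrams could be collapsed into counts and passed as the 
values array, so the counts are no longer lost:

    import java.util.LinkedHashMap;
    import java.util.Map;

    import opennlp.tools.ml.model.MaxentModel;

    class CountingEval {

        static double[] evalWithCounts(MaxentModel model, String[] ngrams) {
            // Aggregate duplicate ngrams into counts.
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (String ngram : ngrams) {
                counts.merge(ngram, 1, Integer::sum);
            }

            String[] context = counts.keySet().toArray(new String[0]);
            float[] values = new float[context.length];
            int i = 0;
            for (int count : counts.values()) {
                values[i++] = count;
            }

            // eval(String[], float[]) weights each feature by its value,
            // so each ngram contributes proportionally to its count.
            return model.eval(context, values);
        }
    }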

The context generator could try to limit the number of ngrams by removing 
very rare ones, e.g. via a cutoff or something similar.
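
Something along these lines could work (again just a sketch; the cutoff 
value is a made-up parameter, not an existing setting):

    import java.util.HashMap;
    import java.util.Map;

    class NgramCutoff {

        static String[] applyCutoff(String[] ngrams, int cutoff) {
            Map<String, Integer> counts = new HashMap<>();
            for (String ngram : ngrams) {
                counts.merge(ngram, 1, Integer::sum);
            }
            // Keep only ngrams seen at least 'cutoff' times in the input.
            return counts.entrySet().stream()
                    .filter(e -> e.getValue() >= cutoff)
                    .map(Map.Entry::getKey)
                    .toArray(String[]::new);
        }
    }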

Thanks for your help!


> Language Detector fails to predict language on long input texts
> ---------------------------------------------------------------
>
>                 Key: OPENNLP-1261
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1261
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Language Detector
>            Reporter: Joern Kottmann
>            Assignee: Joern Kottmann
>            Priority: Major
>         Attachments: langid_plus_minus_rollups.zip, opennlp_as_is_vs_1261.zip
>
>
> If the input text is very long, e.g. 100k chars, then the lang detect 
> component fails to detect the language correctly, even though the text is 
> written in only one language.
> This issue was tracked down to the context generator, where the counts of 
> the ngrams are ignored.


