corrupt language identifier tri files and bad language recognition for German
-----------------------------------------------------------------------------

         Key: NUTCH-144
         URL: http://issues.apache.org/jira/browse/NUTCH-144
     Project: Nutch
        Type: Improvement
    Versions: 0.8-dev    
    Reporter: Bernhard Messer
    Priority: Minor


Hi,

I had a look at the generated language guesser tri files. As far as I can tell, 
several of them (de.ngp, da.ngp, es.ngp) seem to be corrupt, which leads to a 
poor language recognition rate. For example, the German tri file should contain 
the German special characters "ä", "ö", "ü" with their frequencies. The text 
"grüne Hüte", which is typical German, is recognized as Danish. Maybe the 
problem comes from a wrong character encoding during training.
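
To illustrate the suspected cause, here is a minimal sketch in Python. It is 
not the Nutch language identifier code; the trigram_counts helper and the 
assumption that UTF-8 training text was read as ISO-8859-1 are hypothetical. 
It only shows how a wrong charset during training could make the umlaut 
trigrams disappear from a profile.

```python
from collections import Counter


def trigram_counts(text):
    """Count character trigrams per word, padding each word with '_'.

    Hypothetical helper for illustration; not the Nutch profile format.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


sample = "grüne Hüte"
raw = sample.encode("utf-8")  # assumed on-disk encoding of the training text

# Decoding with the correct charset keeps the umlauts in the trigrams.
good = trigram_counts(raw.decode("utf-8"))

# Decoding with the wrong charset splits each "ü" into two other characters,
# so no trigram containing "ü" ever reaches the profile.
bad = trigram_counts(raw.decode("iso-8859-1"))

has_umlaut_good = any("ü" in gram for gram in good)
has_umlaut_bad = any("ü" in gram for gram in bad)
print(has_umlaut_good, has_umlaut_bad)
```

If the real training run had this kind of mismatch, the de.ngp profile would 
lack exactly the trigrams that separate German from its neighbours.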

Jerome, could you provide the training files so that the language identifier 
can be retrained?

Regards,
 Bernhard


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
