Corrupt language identifier tri files and bad language recognition for German
-----------------------------------------------------------------------------
Key: NUTCH-144
URL: http://issues.apache.org/jira/browse/NUTCH-144
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Bernhard Messer
Priority: Minor
Hi,

I had a look at the generated language guesser tri files. As far as I can tell,
several of them (de.ngp, da.ngp, es.ngp) seem to be corrupt, which leads to a
poor language recognition rate. For example, the German tri file should contain
the German special characters "ä", "ö", "ü" with their frequencies. The text
"grüne Hüte", which is typical German, is recognized as Danish. Maybe the
problem comes from a wrong character encoding during training.
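
The suspected encoding problem can be illustrated with a minimal Python sketch.
It is not the Nutch training code: the helper name `trigrams` and the "_" word
padding are assumptions made for illustration only. It counts character
trigrams for "grüne Hüte" twice, once with the UTF-8 bytes decoded correctly
and once with the same bytes wrongly decoded as ISO-8859-1, which is one way
the umlauts could vanish from a trained profile.

```python
from collections import Counter


def trigrams(text):
    """Count character trigrams per word (hypothetical helper, not Nutch's API)."""
    counts = Counter()
    for word in text.lower().split():
        # Pad each word so that word boundaries become part of the trigrams.
        padded = "_" + word + "_"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts


sample = "grüne Hüte"
raw = sample.encode("utf-8")  # bytes as they might sit in a training file

# Correct case: the bytes are decoded with the encoding they were written in.
good = trigrams(raw.decode("utf-8"))

# Suspected failure: the same bytes are decoded as ISO-8859-1, so every
# umlaut turns into two unrelated characters.
bad = trigrams(raw.decode("iso-8859-1"))

print(sorted(good))
print(sorted(bad))
```

With the wrong decoding, no trigram contains "ü" any more, so a profile trained
this way could not match real German text on those characters.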

Jerome, could you provide the training files so that the language identifier
can be retrained?

Regards,
Bernhard
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers