Briggs wrote:
> Yeah, you are correct there.  How does this thing actually even
> remotely begin to work on a  predictable level?

One crucial aspect of language identification is that the input is properly
encoded. There was a patch that added icu4j character set encoding
detection into Nutch. I believe icu4j also offers language
identification in addition to character set detection. Has anyone
checked how usable the language identification from icu4j would be?
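
To make the encoding point concrete, here is a minimal sketch using only
the JDK. It does not use icu4j or the Nutch language identifier; the
class and method names (EncodingMismatchDemo, decode, countHan) are made
up for illustration. It decodes the same UTF-8 bytes once with the right
charset and once with ISO-8859-1, then counts how many Han characters
survive. With the wrong charset none do, so any identifier working on
that text has nothing CJK left to look at.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // Decode raw bytes with the given charset, as a fetcher or parser
    // would do before handing text to language identification.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // Count code points that belong to the Han script.
    static int countHan(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                count++;
            }
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        // Three Han characters, written as escapes so the source file
        // encoding cannot affect the demo.
        String original = "\u65E5\u672C\u8A9E";
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(raw, StandardCharsets.UTF_8);
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println("Han chars, correct charset: " + countHan(right));
        System.out.println("Han chars, wrong charset:   " + countHan(wrong));
        System.out.println("Length after wrong decode:  " + wrong.length());
    }
}
```

A charset detector would sit in front of decode() and pick the charset
instead of it being hard-coded as it is here.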

There are severe problems with the current language identification for
CJK, for example.

--
 Sami Siren

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
