Briggs wrote:
> Yeah, you are correct there. How does this thing actually even
> remotely begin to work on a predictable level?

One crucial aspect of language identification is that the input is properly encoded. There was a patch that added icu4j character set encoding detection to Nutch. I believe icu4j also offers language identification in addition to character set detection. Has anyone checked how usable the language identification from icu4j would be? There are severe problems with the current language identification for CJK, for example.

-- 
Sami Siren

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
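
To illustrate why the encoding matters so much, here is a rough sketch in Python. It is not Nutch's or icu4j's code. It assumes an identifier that compares character trigram profiles, and the helper names (trigram_profile, overlap) and the sample sentence are made up for this example. The same Shift_JIS bytes are decoded once with the right charset and once as Latin-1, and each result is compared against a reference profile.

```python
from collections import Counter


def trigram_profile(text: str) -> Counter:
    """Count overlapping character trigrams in text (hypothetical helper)."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def overlap(a: Counter, b: Counter) -> int:
    """Number of trigram occurrences the two profiles have in common."""
    return sum((a & b).values())


# Sample Japanese sentence, chosen only for illustration.
original = "これは日本語のテストです。"
raw = original.encode("shift_jis")  # bytes as they might arrive from a page

good = raw.decode("shift_jis")  # decoded with the correct charset
bad = raw.decode("latin-1")     # wrong charset: each byte becomes one Latin-1 char

reference = trigram_profile(original)
print("correct charset:", overlap(reference, trigram_profile(good)))
print("wrong charset:  ", overlap(reference, trigram_profile(bad)))
```

With the correct charset every trigram of the reference is found again. With the wrong one nothing matches, so any profile-based identifier would be working from noise before it even starts.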