On 6/4/07, Sami Siren <[EMAIL PROTECTED]> wrote:
> Briggs wrote:
> > Yeah, you are correct there. How does this thing actually even
> > remotely begin to work on a predictable level?
>
> One crucial aspect of language identification is that the input is
> properly encoded. There was a patch that added icu4j character set
> encoding detection into Nutch. I believe icu4j also offers language
> identification in addition to character set detection. Has anyone
> checked how usable the language identification from icu4j would be?
>
> There are severe problems with the current language identification
> for CJK, for example.
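
To illustrate the encoding point quoted above, here is a minimal sketch using only the JDK. It is not Nutch or icu4j code; the class name EncodingDemo and the helper decode are hypothetical, made up for this example. It shows that the same bytes decoded with the wrong charset yield different characters, so any character-based language profile built from them would presumably be thrown off.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: decode raw bytes with the charset we *assume*
    // the fetched page uses (the assumption may be wrong).
    static String decode(byte[] raw, Charset assumed) {
        return new String(raw, assumed);
    }

    public static void main(String[] args) {
        // "Doğacan Güney", written with Unicode escapes so the source
        // file's own encoding cannot interfere with the demonstration.
        String original = "Do\u011Facan G\u00FCney";
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(raw, StandardCharsets.UTF_8);
        String wrong = decode(raw, StandardCharsets.ISO_8859_1);

        // Correct charset: the text round-trips unchanged.
        System.out.println("UTF-8 round trip ok: " + right.equals(original));
        // Wrong charset: each multi-byte character becomes two unrelated
        // characters, so the string no longer matches the original.
        System.out.println("ISO-8859-1 matches:  " + wrong.equals(original));
        System.out.println("lengths: " + right.length() + " vs " + wrong.length());
    }
}
```

An identifier that compares character n-grams would see the second string as different text, which is why charset detection has to happen before language identification.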

Can you give a few links? I have looked at icu4j's API, but I haven't
found any info about language identification. IBM does have something
called Linguini
(http://www-306.ibm.com/software/globalization/topics/linguini/index.jsp).
It doesn't seem to be open source, though.

>
> --
> Sami Siren
>

--
Doğacan Güney

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers