On 6/4/07, Sami Siren <[EMAIL PROTECTED]> wrote:
Briggs wrote:
> Yeah, you are correct there.  How does this thing actually even
> remotely begin to work on a predictable level?

One crucial aspect of language identification is that the input is properly
encoded. There was a patch that added icu4j character set encoding
detection into Nutch. I believe icu4j also offers language
identification in addition to character set detection. Has anyone
checked how usable the language identification from icu4j would be?
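
To illustrate why the encoding matters, here is a minimal sketch using only
the JDK (it is not Nutch or icu4j code, and the class and method names are
made up for the example). It decodes the same UTF-8 bytes once with the right
charset and once with a wrong one. With the wrong charset the text comes out
as different characters, so any character n-gram statistics computed from it
would no longer resemble the real language.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Nutch or icu4j.
public class EncodingMismatchDemo {

    /** Encodes the text as UTF-8, then decodes the bytes with the assumed charset. */
    static String decodeAs(String text, Charset assumed) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        return new String(raw, assumed);
    }

    public static void main(String[] args) {
        // Three CJK characters, written as escapes so the source file stays ASCII.
        String original = "\u65e5\u672c\u8a9e";

        String right = decodeAs(original, StandardCharsets.UTF_8);
        String wrong = decodeAs(original, StandardCharsets.ISO_8859_1);

        // Correct charset: the text round-trips unchanged.
        System.out.println(right.equals(original)); // true
        // Wrong charset: each of the 9 UTF-8 bytes becomes its own character.
        System.out.println(wrong.length());         // 9
        System.out.println(wrong.equals(original)); // false
    }
}
```

A language identifier fed the second string sees nine Latin-1 characters
instead of three CJK ones, which is one way the CJK results can go wrong
before the identifier itself is even involved.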

There are severe problems with the current language identification for
CJK, for example.


Can you give a few links? I have looked at icu4j's API, but I haven't
found any info about language identification.

IBM does have something called Linguini
(http://www-306.ibm.com/software/globalization/topics/linguini/index.jsp).
It doesn't seem to be open source, though.


--
 Sami Siren



--
Doğacan Güney
