Hi Kevin,

I know a bunch of you have been curious about language categorization
for a long time now, and Java has lacked a solid way to solve this
problem.

Anyway.  This new library that I just released should be easy to tie
into your Lucene indexers.  Just run the library on a text (after
stripping the HTML), store the result in a new Lucene field called
LANG (or something), and then filter on JUST that language code before
you search.
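
Here's a rough sketch of what I mean, assuming a Lucene 2.x-era Field
API; the guessLanguage() call below is just a hypothetical placeholder
for the library's real entry point:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class LangFieldExample {

        // Hypothetical stand-in; swap in the library's actual call.
        static String guessLanguage(String plainText) {
            return "en";
        }

        // Index time: store the detected code in an untokenized field.
        static Document buildDocument(String plainText) {
            Document doc = new Document();
            doc.add(new Field("LANG", guessLanguage(plainText),
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("body", plainText,
                    Field.Store.NO, Field.Index.ANALYZED));
            return doc;
        }

        // Search time: restrict the user's query to one language code.
        static Query restrictToLanguage(Query userQuery, String lang) {
            BooleanQuery combined = new BooleanQuery();
            combined.add(userQuery, BooleanClause.Occur.MUST);
            combined.add(new TermQuery(new Term("LANG", lang)),
                    BooleanClause.Occur.MUST);
            return combined;
        }
    }

Since LANG is stored untokenized, a plain TermQuery on the exact
language code is enough to narrow results to one language.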

I'd love some help filling out the missing languages if anyone has
some spare time.  That would help make up for all the hard work I've
done here (nudge.. nudge).

I did a thorough survey of the language categorization space for Java,
and I think this is basically the only library out there.

[snip]

Recently I'd posted the following to the Nutch mailing list, since the topic of determining web page languages had come up there as well:

Given the recent discussion regarding charset/language detection on this list, people might find this IBM research paper interesting:

<ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf>

    Linguini: Language Identification for Multilingual Documents
    John M. Prager

Prager also uses an n-gram approach, so you might be able to take advantage of some of his research into optimal values for <n>.
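
For anyone who hasn't seen the technique, here's a generic sketch of
character n-gram counting in Java (just the basic building block shared
by these identifiers, not Prager's exact algorithm):

    import java.util.HashMap;
    import java.util.Map;

    public class NGramCounter {

        // Count the character n-grams of length n in the given text.
        static Map<String, Integer> countNGrams(String text, int n) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            String padded = " " + text.toLowerCase() + " ";
            for (int i = 0; i + n <= padded.length(); i++) {
                String gram = padded.substring(i, i + n);
                Integer c = counts.get(gram);
                counts.put(gram, c == null ? 1 : c + 1);
            }
            return counts;
        }

        public static void main(String[] args) {
            // Trigrams (n = 3) are a common default; Prager's paper
            // examines how the best choice of n varies.
            System.out.println(countNGrams("language identification", 3));
        }
    }

An identifier then compares a document's profile against per-language
profiles built from training data, which is where multilingual corpora
come in.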

The code for Linguini doesn't seem to be available (you have to purchase some IBM product(s) to get it), so what you've done is great for the open source community - thanks!

I could also post to the Unicode list about training data in multiple languages, as that's a good place to find out about multilingual corpora.

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
