Hi Kevin,

I know a bunch of you have been curious about language categorization
for a long time now, and Java has lacked a solid way to solve this
problem.

Anyway.  This new library that I just released should be easy to tie
into your Lucene indexers.  Just run the library on a text (after
stripping the HTML), store the result in a new Lucene field called
LANG (or something), and then filter on JUST that language code before
you search.
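
Here's a rough sketch of what I mean, assuming a Lucene 2.x-era Field
API; the guessLanguage() call below is just a hypothetical placeholder
for the library's real entry point:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class LangFieldExample {

        // Hypothetical stand-in; swap in the library's actual call.
        static String guessLanguage(String plainText) {
            return "en";
        }

        // Index time: store the detected code in an untokenized field.
        static Document buildDocument(String plainText) {
            Document doc = new Document();
            doc.add(new Field("LANG", guessLanguage(plainText),
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("body", plainText,
                    Field.Store.NO, Field.Index.ANALYZED));
            return doc;
        }

        // Search time: restrict the user's query to one language code.
        static Query restrictToLanguage(Query userQuery, String lang) {
            BooleanQuery combined = new BooleanQuery();
            combined.add(userQuery, BooleanClause.Occur.MUST);
            combined.add(new TermQuery(new Term("LANG", lang)),
                    BooleanClause.Occur.MUST);
            return combined;
        }
    }

Since LANG is stored untokenized, a plain TermQuery on the exact
language code is enough to narrow results to one language.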

I'd love some help filling out the missing languages if anyone has
some spare time.  That would help make up for all the hard work I've
done here (nudge.. nudge).

I did a thorough survey of the language categorization space for Java,
and I think this is basically the only library out there.

[snip]

Recently I'd posted the following to the Nutch mailing list, since the topic of determining web page languages had come up there as well:

Given the recent discussion regarding charset/language detection on this list, people might find this IBM research paper interesting:

<ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf>

    Linguini: Language Identification for Multilingual Documents
    John M. Prager

Prager also uses an n-gram approach, so you might be able to take advantage of some of his research into optimal values for <n>.
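
For anyone who hasn't seen the technique, here's a generic sketch of
character n-gram counting in Java (just the basic building block shared
by these identifiers, not Prager's exact algorithm):

    import java.util.HashMap;
    import java.util.Map;

    public class NGramCounter {

        // Count the character n-grams of length n in the given text.
        static Map<String, Integer> countNGrams(String text, int n) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            String padded = " " + text.toLowerCase() + " ";
            for (int i = 0; i + n <= padded.length(); i++) {
                String gram = padded.substring(i, i + n);
                Integer c = counts.get(gram);
                counts.put(gram, c == null ? 1 : c + 1);
            }
            return counts;
        }

        public static void main(String[] args) {
            // Trigrams (n = 3) are a common default; Prager's paper
            // examines how the best choice of n varies.
            System.out.println(countNGrams("language identification", 3));
        }
    }

An identifier then compares a document's profile against per-language
profiles built from training data, which is where multilingual corpora
come in.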

The code for Linguini doesn't seem to be available (you have to purchase some IBM product(s) to get it), so what you've done is great for the open source community - thanks!

I could also post to the Unicode list about training data in multiple languages, as that's a good place to find out about multilingual corpora.

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
