Hi Kevin,

On 8/19/05, Kevin Burton <[EMAIL PROTECTED]> wrote:
> Hey lucene guys.
>
> I know for a fact that a bunch of you have been curious about language
> categorization for a long time now and Java has lacked a solid way to
> solve this problem.
>
> Anyway. This new library that I just released should be easy to tie
> into your lucene indexers. Just use the library on a text (strip the
> HTML) and then create a new field in Lucene called LANG (or something)
> and then create a filter before you search with JUST that language
> code.
>
> I'd love some help with filling out missing languages if anyone has
> some spare time. That would help make up for all the hard work I've
> done here (nudge.. nudge)
>
> I did a full research of the lang categorization space for Java and I
> think this is basically the only library out there.
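
On the LANG field idea quoted above, here is a rough sketch of how I read it, in plain Java so it runs without Lucene on the classpath. Everything in it is an assumption for illustration: the class name LangFieldSketch, the two toy training sentences, and the use of cosine similarity over character trigram counts. It is not the API of your library or of any of the libraries mentioned in this thread, and a Map stands in for a Lucene Document with BODY and LANG fields.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: guess a language from character trigrams, store it
// in a LANG field at "index" time, and filter on that field at "search" time.
final class LangFieldSketch {

    // Count overlapping character trigrams of the lower-cased text,
    // with non-letters collapsed to single spaces.
    static Map<String, Integer> trigrams(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}]+", " ").trim() + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two trigram count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            nb += v * v;
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    // Return the code of the most similar profile, or "unknown" if nothing overlaps.
    static String guess(String text, Map<String, Map<String, Integer>> profiles) {
        Map<String, Integer> t = trigrams(text);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> p : profiles.entrySet()) {
            double score = cosine(t, p.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = p.getKey();
            }
        }
        return best;
    }

    // Toy profiles built from one made-up sentence per language (assumption:
    // a real profile would be trained on a much larger corpus).
    static Map<String, Map<String, Integer>> toyProfiles() {
        Map<String, Map<String, Integer>> p = new HashMap<>();
        p.put("en", trigrams("the quick brown fox jumps over the lazy dog and "
                + "the cat sat on the mat with the other animals"));
        p.put("de", trigrams("der schnelle braune fuchs springt auf den faulen hund "
                + "und die katze sitzt auf der matte mit den anderen tieren"));
        return p;
    }

    // "Index time": build a document with BODY and LANG fields.
    static Map<String, String> tag(String body, Map<String, Map<String, Integer>> profiles) {
        Map<String, String> doc = new HashMap<>();
        doc.put("BODY", body);
        doc.put("LANG", guess(body, profiles));
        return doc;
    }

    // "Search time": keep only documents whose LANG field matches the code.
    static List<Map<String, String>> filterByLang(List<Map<String, String>> docs, String lang) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> d : docs) {
            if (lang.equals(d.get("LANG"))) {
                kept.add(d);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> profiles = toyProfiles();
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(tag("the dog and the cat", profiles));
        docs.add(tag("der hund und die katze", profiles));
        for (Map<String, String> d : filterByLang(docs, "de")) {
            System.out.println(d.get("LANG") + ": " + d.get("BODY"));
        }
    }
}
```

In a real indexer the Map would be a Lucene Document and the last step a filter on the LANG field, as you describe. The point of the sketch is only that detection happens once per document at index time, so the search side needs nothing more than an exact match on a short code.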

I know of the following existing Java implementations of language
categorization:

* A Nutch implementation:
  http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/languageidentifier/
* A Lucene patch:
  http://issues.apache.org/bugzilla/show_bug.cgi?id=26763
* JTextCat (http://www.jedi.be/JTextCat/index.html), a Java wrapper for
  libtextcat
* NGramJ (http://ngramj.sourceforge.net/), a general n-gram Java library

Of these, the Nutch one is certainly under active development; the
others don't seem to be, as far as I can tell.

> Good luck
> ...
>
> I'm working on a blog post describing how blog search engines like
> Technorati, PubSub, and Feedster could/should use language
> categorization to help deal with the chaos of tagging and full-text
> search. Google has done this for a long time now and Technorati has it
> in beta.
>
> http://www.feedblog.org/2005/08/ngram_language_.html
>

I like your idea of using Wikipedia translations as the training corpus -
it's a good way to get fairly reliable sources for lots of languages.

Regards,
Tom

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]