Hey Lucene guys. I know a bunch of you have been curious about language categorization for a long time now, and Java has lacked a solid way to solve this problem.
Anyway, this new library that I just released should be easy to tie into your Lucene indexers. Run the library on the text of each document (strip the HTML first), store the result in a new Lucene field called LANG (or something), and then apply a filter before you search that matches JUST that language code.

I'd love some help filling out the missing languages if anyone has some spare time. That would help make up for all the hard work I've done here (nudge.. nudge).

I did a full survey of the language categorization space for Java and I think this is basically the only library out there.

Good luck ...

I'm working on a blog post describing how blog search engines like Technorati, PubSub, and Feedster could/should use language categorization to help deal with the chaos of tagging and full-text search. Google has done this for a long time now and Technorati has it in beta.

http://www.feedblog.org/2005/08/ngram_language_.html

-- 
Kevin A. Burton, Location - San Francisco, CA
AIM/YIM - sfburtonator, Web - http://www.feedblog.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
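
P.S. The indexing idea above (strip the HTML, store a LANG field, filter on it before searching) can be sketched roughly as follows. This is only an illustration under stated assumptions: LanguageGuesser is a hypothetical stand-in and not the released library's API, and plain maps and lists stand in for Lucene documents and the index so the sketch runs without any dependencies. In a real indexer the LANG value would go into an untokenized field and the filter would be a term match on that field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LangFilterSketch {

    // Hypothetical stand-in for the language categorizer.
    // NOT the real library's API; it only marks where the call would go.
    interface LanguageGuesser {
        String guess(String plainText);
    }

    // Crude tag stripping for illustration; a real indexer should use an HTML parser.
    static String stripHtml(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    // Build a "document" (field name -> value) with the extra LANG field.
    static Map<String, String> toDocument(String html, LanguageGuesser guesser) {
        String text = stripHtml(html);
        Map<String, String> doc = new HashMap<>();
        doc.put("BODY", text);
        doc.put("LANG", guesser.guess(text));
        return doc;
    }

    // Keep only documents with the requested language code, then do a
    // naive case-insensitive substring match on the body.
    static List<Map<String, String>> search(List<Map<String, String>> index,
                                            String lang, String term) {
        List<Map<String, String>> hits = new ArrayList<>();
        String needle = term.toLowerCase();
        for (Map<String, String> doc : index) {
            if (!lang.equals(doc.get("LANG"))) {
                continue; // the language filter runs before any text matching
            }
            if (doc.get("BODY").toLowerCase().contains(needle)) {
                hits.add(doc);
            }
        }
        return hits;
    }
}
```

To try it, pass any LanguageGuesser implementation to toDocument() for each page, collect the results in a list, and call search(index, "en", "some term") to get only the English hits.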