Re: NGram Language Categorization Source

Tom White Sat, 20 Aug 2005 13:29:21 -0700

Hi Kevin,

On 8/19/05, Kevin Burton <[EMAIL PROTECTED]> wrote:
> Hey lucene guys.
> 
> I know for a fact that a bunch of you have been curious about language
> categorization for a long time now and Java has lacked a solid way to
> solve this problem.
> 
> Anyway.  This new library that I just released should be easy to tie
> into your Lucene indexers.  Just run the library on the text (strip the
> HTML first), store the result in a new Lucene field called LANG (or
> something), and then apply a filter at search time that matches JUST
> that language code.
> 
> I'd love some help with filling out missing languages if anyone has
> some spare time.  That would help make up for all the hard work I've done
> here (nudge.. nudge)
> 
> I did a full survey of the language categorization space for Java and I
> think this is basically the only library out there.


I know of the following existing Java implementations of language
categorization:

* A Nutch implementation:
http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/languageidentifier/

* A Lucene patch: http://issues.apache.org/bugzilla/show_bug.cgi?id=26763

* JTextCat (http://www.jedi.be/JTextCat/index.html), a Java wrapper
for libtextcat

* NGramJ (http://ngramj.sourceforge.net/), a general n-gram Java library

Of these, the Nutch one is certainly under active development; the
others don't seem to be, as far as I can tell.

> 
> Good luck
> ...
> 
> I'm working on a blog post describing how blog search engines like
> Technorati, PubSub, and Feedster could/should use language
> categorization to help deal with the chaos of tagging and full-text
> search. Google has done this for a long time now and Technorati has it
> in beta.
> 
> http://www.feedblog.org/2005/08/ngram_language_.html
> 

I like your idea of using Wikipedia translations as the training
corpus - it's a good way to get fairly reliable sources for lots of
languages.
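
For anyone who hasn't seen how this kind of classifier works, here is a
minimal sketch of the rank-order n-gram approach (the Cavnar and Trenkle
method that libtextcat implements). It is not the code from Kevin's
library or from any of the projects listed above. The class name,
MAX_N, PROFILE_SIZE, and the toy training sentences are all assumptions
made up for illustration; a real profile would be built from a large
corpus such as the Wikipedia text mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class NGramSketch {
    static final int MAX_N = 3;          // longest n-gram considered (assumption)
    static final int PROFILE_SIZE = 300; // n-grams kept per profile (assumption)

    /** Maps each of the most frequent character n-grams to its rank (0 = most frequent). */
    static Map<String, Integer> profile(String text) {
        // Lowercase, collapse every run of non-letters to "_", and pad both ends
        String s = "_" + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", "_") + "_";
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= MAX_N; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Most frequent first; ties broken alphabetically so ranks are deterministic
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> ranks = new HashMap<>();
        for (int i = 0; i < entries.size() && i < PROFILE_SIZE; i++) {
            ranks.put(entries.get(i).getKey(), i);
        }
        return ranks;
    }

    /** "Out-of-place" distance: sum of rank differences, with a fixed penalty for missing n-grams. */
    static int distance(Map<String, Integer> doc, Map<String, Integer> lang) {
        int total = 0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            Integer rank = lang.get(e.getKey());
            total += (rank == null) ? PROFILE_SIZE : Math.abs(rank - e.getValue());
        }
        return total;
    }

    /** Returns the language code whose profile is closest to the text, or null if there are no profiles. */
    static String classify(String text, Map<String, Map<String, Integer>> profiles) {
        Map<String, Integer> doc = profile(text);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            int d = distance(doc, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> profiles = new HashMap<>();
        // Toy training text only; far too small for reliable results
        profiles.put("en", profile("the quick brown fox jumps over the lazy dog and then the cat"));
        profiles.put("de", profile("der schnelle braune fuchs springt in den wald und der hund bleibt im haus"));
        System.out.println(classify("the dog and the fox", profiles));
    }
}
```

The string returned by classify() is what would go into the LANG field
at index time. Short inputs and tiny training sets make the ranking
noisy, which is why a large, clean corpus per language matters.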

Regards,

Tom

