Hi,

sorry for being late; I don't follow this list all the time. I once started writing an n-gram library similar to TextCat for Java. I was soon distracted, so it is still somewhat raw, but it has long been published at http://ngramj.sf.net
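For anyone who has not seen the TextCat approach, here is a rough sketch of the idea in Java. It is not code from ngramj, and the class name, method names, profile size and n-gram lengths are all assumptions chosen for illustration. It works on characters rather than bytes, builds a ranked n-gram profile per sample, and picks the language whose profile has the smallest "out-of-place" rank distance.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of a TextCat-style classifier; this is NOT the ngramj API.
class NgramProfileSketch {
    static final int MAX_N = 5;          // assumed: n-gram lengths 1..5
    static final int PROFILE_SIZE = 300; // assumed: keep the 300 most frequent n-grams

    // Builds a ranked profile: n-gram -> rank, where rank 0 is the most frequent.
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Works on characters, not bytes; each word is padded with '_'.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (word.isEmpty()) continue;
            String padded = "_" + word + "_";
            for (int n = 1; n <= MAX_N; n++) {
                for (int i = 0; i + n <= padded.length(); i++) {
                    counts.merge(padded.substring(i, i + n), 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Most frequent first; ties broken alphabetically so ranks are deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> ranks = new LinkedHashMap<>();
        for (int i = 0; i < entries.size() && i < PROFILE_SIZE; i++) {
            ranks.put(entries.get(i).getKey(), i);
        }
        return ranks;
    }

    // "Out-of-place" distance: sum of rank differences, with a fixed
    // penalty for n-grams missing from the language profile.
    static int distance(Map<String, Integer> doc, Map<String, Integer> lang) {
        int sum = 0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            Integer rank = lang.get(e.getKey());
            sum += (rank == null) ? PROFILE_SIZE : Math.abs(rank - e.getValue());
        }
        return sum;
    }

    // Returns the name of the language profile closest to the text.
    static String classify(String text, Map<String, Map<String, Integer>> languages) {
        Map<String, Integer> doc = profile(text);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> e : languages.entrySet()) {
            int d = distance(doc, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy training samples; real profiles need far more text.
        Map<String, Map<String, Integer>> languages = new LinkedHashMap<>();
        languages.put("en", profile("the dog and the cat sleep in the garden"));
        languages.put("de", profile("der hund und die katze schlafen in dem garten"));
        System.out.println(classify("the cat and the dog", languages));
    }
}
```

The one-sentence samples in main are only there to make the sketch runnable; useful profiles would be built from much larger texts per language.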
As far as I remember, ngramj was able to parse TextCat resources, use variable-length n-grams, and generate n-gram sets from samples. It is intended to look at files as a byte stream.

Things that should be easy to do from there:

* Make it pluggable into a general stemming system.
* Look at character streams instead of byte streams (i.e. encoding handled by Java).
* Cosine-valued reporting and orthogonalization of n-gram space variables.

I could spend some time on this, but I'd need help, because it is not my only pastime.

Regards,
Frank

> -----Original Message-----
> From: karl wettin [mailto:[EMAIL PROTECTED]
> Sent: Sunday, February 01, 2004 10:07 PM
> To: [EMAIL PROTECTED]
> Subject: N-gram layer
>
> Hello list,
>
> I'm Karl, and I just started testing Lucene the other day.
> It's a great core engine, but I feel there are some things
> missing that I'd be happy to contribute.
>
> I started by writing a simple N-gram classifier to detect the
> language of a text in order to automatically cluster
> documents by language. The algorithm is very similar to the
> "TextCat" C library.
>
> And then I thought, maybe it would be possible to use the same N-gram
> classifier to make an automatic stemmer that works on all languages.
> Hopefully I'll have something up and running for tests by
> next weekend.
>
> The same classifier could be used for a simple metaphone index.
>
> However, I need some help on understanding the Analyzer.
> Where can I find some tutorials on how to write my own? I
> didn't check with Google, maybe I should before posting here.
> Since the stemmer (and metaphone) data would have to be
> indexed in their own field(?), querying the stemmed field would
> require one to stem the query too. Can I create a subclass of
> Query (or so), or do I need to create my own Query class that
> handles the stemming all the way for the user? The last
> option is my current approach, so I would appreciate some
> hints and pointers here.
>
> Great project!
>
> karl
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]