Hi Karl,

Over the past few days, I've developed such a language guesser myself as the basis for a Lucene analyzer. It is based on trigrams. It already works but is not yet in a "publishable" state. If you or others are interested, I can offer the sources.
KR,
Jean-Francois Halleux

-----Original Message-----
From: karl wettin [mailto:[EMAIL PROTECTED]
Sent: Sunday, 1 February 2004 22:07
To: [EMAIL PROTECTED]
Subject: N-gram layer

Hello list,

I'm Karl, and I just started testing Lucene the other day. It's a great core engine, but I feel there are some things missing that I'd be happy to contribute.

I started by writing a simple N-gram classifier to detect the language of a text in order to automatically cluster documents by language. The algorithm is very similar to the "TextCat" C library. And then I thought, maybe it would be possible to use the same N-gram classifier to make an automatic stemmer that works on all languages. Hopefully I'll have something up and running for tests by next weekend. The same classifier could be used for a simple metaphone index.

However, I need some help understanding the Analyzer. Where can I find some tutorials on how to write my own? I didn't check with Google; maybe I should have before posting here.

Since the stemmer (and metaphone) data would have to be indexed in their own field(?), querying the stemmed field would require one to stem the query too. Can I create a subclass of Query (or so), or do I need to create my own Query class that handles the stemming all the way for the user? The last option is my current approach, so I would appreciate some hints and pointers here.

Great project!

karl

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
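
For anyone following the thread, the trigram-based language guessing discussed above can be sketched roughly as follows. This is only an illustration, not Karl's or Jean-Francois's code: it assumes the rank-order ("out-of-place") comparison commonly associated with TextCat-style classifiers, restricted to trigrams. The class name `TrigramLanguageGuesser`, its methods, and the `PROFILE_SIZE` cut-off of 300 are all hypothetical choices made for this example, and it does not touch the Lucene API at all.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of a trigram language guesser (not code from this thread). */
class TrigramLanguageGuesser {
    // Assumed cut-off: keep only the most frequent trigrams of each text.
    private static final int PROFILE_SIZE = 300;

    // Language name -> (trigram -> rank, where 0 is the most frequent trigram).
    private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

    /** Builds a rank profile of the trigrams found in the given text. */
    static Map<String, Integer> profile(String text) {
        // Lower-case, collapse every run of non-letters to one space, pad with spaces.
        String s = " " + text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}]+", " ").trim() + " ";
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        // Most frequent first; ties are broken alphabetically so ranks are deterministic.
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> ranks = new HashMap<>();
        for (int r = 0; r < sorted.size() && r < PROFILE_SIZE; r++) {
            ranks.put(sorted.get(r).getKey(), r);
        }
        return ranks;
    }

    /** Registers a language from a sample text written in that language. */
    void train(String language, String sampleText) {
        profiles.put(language, profile(sampleText));
    }

    /** Out-of-place distance: sum of rank differences, with a fixed penalty for unknown trigrams. */
    static int distance(Map<String, Integer> doc, Map<String, Integer> lang) {
        int total = 0;
        for (Map.Entry<String, Integer> e : doc.entrySet()) {
            Integer rank = lang.get(e.getKey());
            total += (rank == null) ? PROFILE_SIZE : Math.abs(rank - e.getValue());
        }
        return total;
    }

    /** Returns the trained language with the smallest distance, or null if none is trained. */
    String guess(String text) {
        Map<String, Integer> doc = profile(text);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            int d = distance(doc, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

To try it, call `train("en", someEnglishText)` and `train("fr", someFrenchText)` on one instance, then `guess(unknownText)`. Very short inputs yield few trigrams, so a real classifier would likely want larger training samples and some confidence threshold before clustering documents by the guessed language.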