Hi there, this is a cross-post. I first sent this to the developers list, but somehow there has been no response yet. Maybe someone here can help me!
I am hoping to improve Lucene by adding a strategy for multilingual support. We already have stemmers for almost all European languages, so I think this is the next step. Any thoughts, please?

Maurits

> Dear all,
>
> Brad Wellington has created a language identifier which can be used in
> combination with the Snowball stemmers donated to Lucene by Alex Murzaku.
> I have built a solid language model for use with the language identifier,
> covering these languages: Danish, Dutch, English, Finnish, French, German,
> Italian, Norwegian, Portuguese, Spanish and Swedish.
>
> The language identifier is based on a Naive Bayes classifier. This is all
> nice, but I have some integration questions, and I hope you can help out.
>
> Basically, the process of indexing is:
>   Create an analyzer
>   Open an IndexWriter
>   Pass it the analyzer
>   Process a document
>   Add the document to the index
>   Optimize the writer
>   Close the writer
>
> The language identifier can automatically identify what language a
> document is written in. Based on the identifier's suggestion, an
> appropriate analyzer can be selected.
>
> This is all great, but...
>
> 1. Do we index all the terms from documents in various languages into
>    one index?
> 2. Do I build a specialised Analyzer that selects the stemmer based on the
>    language identifier, or leave that up to the custom indexing application?
>
> Your thoughts please...
>
> regards,
>
> Maurits
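
To make question 2 a bit more concrete, here is a minimal sketch of the "identify the language, then pick a stemmer" step. It is not Brad's identifier and it does not call any Lucene API. The class name, the character-trigram features, the tiny training sentences and the "snowball:..." labels are all assumptions made up for illustration. It only shows the general shape: a Naive Bayes guess over trigrams, followed by a lookup that the indexing application (or a wrapping Analyzer) could use to choose a per-language analyzer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy sketch: character-trigram Naive Bayes language guess plus stemmer routing. */
class LanguageRoutingSketch {

    // Trigram counts per language, total trigrams per language, and the set of all trigrams seen.
    private final Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
    private final Map<String, Integer> totals = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();

    /** Adds the trigrams of a sample text to the model for one language. */
    void train(String language, String text) {
        Map<String, Integer> langCounts = counts.computeIfAbsent(language, k -> new HashMap<>());
        for (String gram : trigrams(text)) {
            langCounts.merge(gram, 1, Integer::sum);
            totals.merge(language, 1, Integer::sum);
            vocabulary.add(gram);
        }
    }

    /** Returns the language with the highest log-likelihood (uniform prior, add-one smoothing). */
    String identify(String text) {
        List<String> grams = trigrams(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String language : counts.keySet()) {
            Map<String, Integer> langCounts = counts.get(language);
            double denominator = totals.getOrDefault(language, 0) + vocabulary.size();
            double score = 0.0;
            for (String gram : grams) {
                score += Math.log((langCounts.getOrDefault(gram, 0) + 1.0) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = language;
            }
        }
        return best; // null if the model has not been trained
    }

    /** Lower-cases the text, collapses non-letters to single spaces, and slides a 3-char window. */
    static List<String> trigrams(String text) {
        String s = " " + text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]+", " ").trim() + " ";
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            out.add(s.substring(i, i + 3));
        }
        return out;
    }

    /** Hypothetical routing table: language code to a label for the stemmer/analyzer to use. */
    static String analyzerFor(String language) {
        if (language == null) {
            return "standard";
        }
        switch (language) {
            case "en": return "snowball:English";
            case "nl": return "snowball:Dutch";
            case "de": return "snowball:German";
            default:   return "standard"; // unknown language: fall back to no stemming
        }
    }

    /** Builds a model from three made-up sentences; a real model needs far more text. */
    static LanguageRoutingSketch demoModel() {
        LanguageRoutingSketch model = new LanguageRoutingSketch();
        model.train("en", "the quick brown fox jumps over the lazy dog and the cat is with them");
        model.train("nl", "de snelle bruine vos springt over de luie hond en de kat is bij hen");
        model.train("de", "der schnelle braune fuchs springt auf den faulen hund und die katze ist bei ihnen");
        return model;
    }

    public static void main(String[] args) {
        LanguageRoutingSketch model = demoModel();
        for (String doc : new String[] {"the dog is with the cat", "de hond en de kat"}) {
            String language = model.identify(doc);
            System.out.println(doc + " -> " + language + " -> " + analyzerFor(language));
        }
    }
}
```

If the lookup lives in the indexing application, each document can be given its own analyzer, and the detected language can be stored alongside it so that queries can be stemmed the same way. If it lives inside a single specialised Analyzer, the choice is hidden from the application, but the query side still needs some way to know which stemmer to apply.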
