Language identifier, stemmers and analyzers

maurits van wijland Sun, 17 Nov 2002 15:36:31 -0800

Hi there,
this is a cross-post. I first sent this to the developers list, but somehow
there has been no response yet. Maybe someone here can help me!


I am hoping to improve Lucene by adding a strategy for multilingual
support. We already have stemmers for almost all European languages,
so I think this is the next step.

Any thoughts, please??

Maurits


> Dear all,
>
> Brad Wellington has created a language identifier which can be used in
> combination with the Snowball stemmers donated to Lucene by Alex Murzaku.
> I have built a solid language model for use with the language identifier
> for the languages: Danish, Dutch, English, Finnish, French, German,
> Italian, Norwegian, Portuguese, Spanish and Swedish.
>
> The language identifier is based on a Naive Bayes classifier. Now, this is
> all nice, but I have some integration questions, and I hope you can help
> out.
>
> Basically, the process of indexing is:
> Create an analyzer
> Open an IndexWriter
> Pass it the analyzer
> Process a document
> Add document to Index
> Optimize writer
> Close writer
>
> Now, the language identifier can help automatically identify what language
> a document is written in. Based on the suggestion of the identifier, an
> appropriate analyzer can be selected.
>
> This is all great, but...
>
> 1. Do we index all the terms from various documents in various languages
> into 1 index?
> 2. Do I build a specialised Analyzer that selects the stemmer based on the
> Language Identifier or leave that up to the custom indexing application?
>
> Your thoughts please...
>
> regards,
>
> Maurits
>
>
>
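To make the Naive Bayes idea in the quoted message concrete, here is a minimal sketch in Python of a classifier over character trigrams with add-one smoothing. This is an assumption about how such an identifier could work, not Brad Wellington's actual code: the class name, the trigram features and the tiny training sentences are all made up for illustration, and a real language model would be trained on far more text.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageIdentifier:
    """Toy multinomial Naive Bayes over character trigrams (hypothetical class)."""

    def __init__(self):
        self.ngram_counts = {}       # language -> Counter of trigram frequencies
        self.doc_counts = Counter()  # language -> number of training texts
        self.vocabulary = set()      # every trigram seen during training

    def train(self, language, text):
        grams = char_ngrams(text)
        self.ngram_counts.setdefault(language, Counter()).update(grams)
        self.doc_counts[language] += 1
        self.vocabulary.update(grams)

    def identify(self, text):
        """Return the language with the highest log posterior for the text."""
        grams = char_ngrams(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_language, best_score = None, float("-inf")
        for language, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[language] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_language, best_score = language, score
        return best_language


# Illustrative training data only; far too small for real use.
identifier = NaiveBayesLanguageIdentifier()
identifier.train("en", "this is a document written in english about the dog and the house")
identifier.train("nl", "dit is een document geschreven in het nederlands over de hond en het huis")
identifier.train("de", "dies ist ein dokument geschrieben in deutscher sprache ueber den hund und das haus")

print(identifier.identify("de hond is in het huis"))
```

The identifier only returns a language code, so the caller still has to decide what to do with it, which is exactly the integration question above.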

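One possible answer to questions 1 and 2 can be sketched as follows: keep a single index, but key every term by the identified language, and let a small wrapper pick the stemmer per document. The Python below is a toy model under stated assumptions, not Lucene code: MultilingualIndex, analyze, strip_suffixes and guess_language are hypothetical names, the suffix strippers stand in for the Snowball stemmers, and guess_language stands in for the real language identifier.

```python
from collections import defaultdict


def strip_suffixes(suffixes):
    """Build a toy stemmer that removes the first matching suffix (not Snowball)."""
    def stem(token):
        for suffix in suffixes:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[:-len(suffix)]
        return token
    return stem


# Hypothetical per-language stemmers, selected by language code.
STEMMERS = {
    "en": strip_suffixes(["ing", "ed", "s"]),
    "nl": strip_suffixes(["en", "e"]),
}


def analyze(text, language):
    """Lowercase, split on whitespace, and stem with the language's stemmer."""
    stem = STEMMERS.get(language, lambda token: token)  # unknown language: no stemming
    return [stem(token) for token in text.lower().split()]


class MultilingualIndex:
    """Toy inverted index: one index, terms keyed by (language, stem)."""

    def __init__(self, identify_language):
        self.identify_language = identify_language
        self.postings = defaultdict(set)  # (language, stem) -> document ids
        self.languages = {}               # document id -> identified language

    def add_document(self, doc_id, text):
        language = self.identify_language(text)
        self.languages[doc_id] = language
        for term in analyze(text, language):
            self.postings[(language, term)].add(doc_id)

    def search(self, query, language):
        """Return ids of documents in the given language matching all query terms."""
        results = None
        for term in analyze(query, language):
            matches = self.postings.get((language, term), set())
            results = matches if results is None else results & matches
        return set(results) if results else set()


def guess_language(text):
    # Stand-in for a real identifier: looks for a few Dutch function words.
    dutch_markers = {"de", "het", "een"}
    return "nl" if dutch_markers & set(text.lower().split()) else "en"


index = MultilingualIndex(guess_language)
index.add_document(1, "The dogs walked home")
index.add_document(2, "De honden lopen naar huis")
print(index.search("walking dog", "en"))
print(index.search("hond", "nl"))
```

Keying terms by language keeps stems from different languages from colliding in the shared index, at the cost that the query side must also know (or guess) the language so the same stemmer is applied to the query.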
