RE: Single Analyzer for multiple European languages

Madhu Satyanarayana Panitini Tue, 27 Sep 2005 05:06:25 -0700

Hi all,
One more idea would be using cryptograms to differentiate between
languages, and then u can use the delete stopwords and apply stemming
for particular language.

Regards
madhu 

-----Original Message-----
From: Endre Stølsvik [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 27, 2005 4:08 PM
To: java-user@lucene.apache.org
Subject: Re: Single Analyzer for multiple European languages

On Mon, 26 Sep 2005, Andrzej Bialecki wrote:

| Shashikant Kore wrote:
| 
| > Search:
| > - Get the superset of stopwords by merging the stopwords from all
the
| > languages.
| 
| This step doesn't make sense. Stopwords ARE language specific. A
stopword in
| one language may be a valid content word in another language - e.g.
English
| stopwords "is, by, far" mean "ice, village, father" in Swedish. And
vice
| versa, e.g. "den, men, man, sin, hans, era" are Swedish stopwords...
So, if
| you mix them and apply to all documents then you will surely loose a
lot of
| valid content.

Idea: Don't drop stopwords from the query - they aren't in the index for

the respective languages. However, you might get results from other 
languages..

It's the stemming that's difficult. You'd have to stem for each
language, 
and then do searches. But since you won't stem at all, this wouldn't be 
much of a problem in your case.

Maybe a language-selector drop-down would be nice?

Regards,
Endre.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: Single Analyzer for multiple European languages

Reply via email to