Greetings, everyone.

We have been using Lucene for English texts for some time, and it works
really well. But I recently spoke with someone from Germany, and they
raised an issue with that language that I wasn't sure Lucene could
tackle. The example they used was "black pen", which in German is
apparently written as a single compound word. The application domain is
an e-commerce catalog. So if the catalog uses the compound word but a
person is looking for any "pen", they will likely find nothing.
Similarly, if the catalog specifies "pen" and its color separately, but
the user searches with the compound word, they will again find nothing.
Stemming by itself doesn't seem able to solve this, because I don't
think it is designed to split compound words. Yet this seems like a
common issue that people would run into constantly. So I was wondering:

- Do German stemmers typically split compound words in addition to
  chopping them down to a root form?
- Does this processing require dictionary-based approaches, or are
  there enough clues in the word structure to split words purely
  algorithmically, à la the Porter stemmer? (See the P.S. below for
  the kind of thing I mean.)
- How is this problem typically solved, both by smaller search engines
  and by the Yahoos and Googles of the German landscape?

Thanks very much for any information that would help with this!

- Dmitry
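
P.S. To make the second question more concrete, here is the sort of
naive dictionary-based splitter I can imagine. This is plain Java of my
own, not anything from Lucene; the three-word dictionary and the
longest-match-first strategy are just my guesses at how such a thing
might work:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch of a dictionary-based decompounder, only to illustrate
    // the kind of processing I am asking about. Not Lucene code; the
    // word list in main() is made up.
    public class CompoundSplitter {

        private final Set<String> dictionary;

        public CompoundSplitter(Set<String> dictionary) {
            this.dictionary = dictionary;
        }

        // Returns the parts if the (lower-cased) word decomposes
        // entirely into dictionary words, or null if no split exists.
        public List<String> split(String word) {
            if (dictionary.contains(word)) {
                List<String> whole = new ArrayList<String>();
                whole.add(word);
                return whole;
            }
            // Longest-prefix-first with backtracking: try each prefix
            // that is itself a dictionary word, recurse on the rest.
            for (int i = word.length() - 1; i > 0; i--) {
                String head = word.substring(0, i);
                if (dictionary.contains(head)) {
                    List<String> tail = split(word.substring(i));
                    if (tail != null) {
                        tail.add(0, head);
                        return tail;
                    }
                }
            }
            return null;
        }

        public static void main(String[] args) {
            // "Staubsaugerbeutel" (vacuum-cleaner bag)
            //   = Staub + Sauger + Beutel
            Set<String> dict = new HashSet<String>(
                    Arrays.asList("staub", "sauger", "beutel"));
            System.out.println(
                    new CompoundSplitter(dict).split("staubsaugerbeutel"));
            // prints: [staub, sauger, beutel]
        }
    }

What I don't know is whether a word list plus backtracking like this is
enough in practice. For instance, German also inserts linking letters
between parts (the "Fugen-s", as in "Arbeitszimmer" = Arbeit + s +
Zimmer), which this sketch ignores entirely.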