Greetings, everyone.
We have been using Lucene for English texts for some time, and it works really well.
But I recently spoke with someone from Germany, and they raised an issue with that
language that I wasn't sure how Lucene could tackle. The example they used
was the word "black pen", which in German is apparently written as a single word. The
application domain is an e-commerce catalog. So if the catalog uses this compound word
but a person searches for any "pen", they will likely find nothing. Similarly, if
the catalog lists "pen" and its color separately, but the user searches with the
compound word, they will again find nothing.
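If I understand Lucene's model correctly, both the indexed text and the query are
reduced to exact terms, so the compound term in the index never equals the simple
query term. Here is a minimal sketch of the mismatch, assuming a recent Lucene
release (the analysis API differs across versions), with the real German compound
"Kugelschreiber" ("ballpoint pen") standing in for whatever word my colleague
actually had in mind:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class CompoundTokenDemo {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer();
            // The catalog side: the compound survives analysis as one term.
            printTokens(analyzer, "Kugelschreiber");  // prints: kugelschreiber
            // The user side: the simple word becomes a different term,
            // so an exact term match can never bring the compound back.
            printTokens(analyzer, "Schreiber");       // prints: schreiber
        }

        static void printTokens(Analyzer analyzer, String text) throws Exception {
            try (TokenStream ts = analyzer.tokenStream("name", text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term);
                }
                ts.end();
            }
        }
    }

If that picture is right, fixing it would mean splitting the compound into its
parts at index time, at query time, or both.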
Stemming by itself couldn't solve this problem, it seems, because I don't think it is
designed to split compound words. Yet this seems like a common issue that people
would run into constantly. So I was wondering:
- Do German stemmers typically split compound words in addition to chopping them down
to a root form?
- Does this processing require dictionary-based approaches, or are there enough clues
in the word structure to split words algorithmically, a la the Porter stemmer? (A
rough sketch of what I mean by "dictionary-based" follows these questions.)
- How is this problem typically solved, both by smaller search engines and by the
Yahoos and Googles of the German landscape?
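To make the second question concrete, here is the kind of dictionary-based splitting
I have in mind: a minimal sketch in plain Java, with a toy word list I made up. A
real implementation would presumably need a large German lexicon and handling for
linking elements (e.g., the "s" in "Verkaufspreis"):

    import java.util.*;

    // A toy dictionary-based decompounder: greedy, longest-match-first,
    // recursing on the remainder; returns the whole word if it cannot be
    // fully split into known parts. The word list is made up for illustration.
    public class Decompounder {
        private final Set<String> lexicon;

        public Decompounder(Set<String> lexicon) {
            this.lexicon = lexicon;
        }

        // Returns the parts of a full split, or the whole word unchanged.
        public List<String> split(String word) {
            List<String> parts = decompose(word.toLowerCase(Locale.GERMAN));
            return parts != null ? parts : Collections.singletonList(word);
        }

        // null means "no split of the remainder into known parts".
        private List<String> decompose(String rest) {
            if (rest.isEmpty()) {
                return new ArrayList<>();
            }
            for (int end = rest.length(); end >= 3; end--) {  // skip tiny fragments
                String head = rest.substring(0, end);
                if (lexicon.contains(head)) {
                    List<String> tail = decompose(rest.substring(end));
                    if (tail != null) {
                        tail.add(0, head);
                        return tail;
                    }
                }
            }
            return null;
        }

        public static void main(String[] args) {
            Set<String> words = new HashSet<>(
                    Arrays.asList("kugel", "schreiber", "blei", "stift"));
            Decompounder d = new Decompounder(words);
            System.out.println(d.split("Kugelschreiber"));  // [kugel, schreiber]
            System.out.println(d.split("Bleistift"));       // [blei, stift]
            System.out.println(d.split("Tisch"));           // [Tisch] (no split)
        }
    }

I imagine something like this could be dropped into the analysis chain so that both
documents and queries see the split terms, but I don't know whether that is how it
is done in practice, hence the questions above.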
Thanks very much for any information on this!
- Dmitry
--
_______________________________________________
Lucene-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/lucene-users