Re: Looking for a stemmer that can return all inflected forms

Bill Taylor Sat, 14 Oct 2006 20:09:32 -0700

On Oct 14, 2006, at 3:57 PM, Jong Kim wrote:

Hi,
I'm looking for a stemmer that is capable of returning all morphological variants of a query term (to be used for high-recall search). For example, given a query term of 'cares', I would like to be able to generate 'cares',
'care', 'cared', and 'caring'.

I looked at the Porter stemmer, Snowball stemmer, and the K-stem.
All of them provide a method that takes a surface string ('cares') as an
input and returns its base form/stem, which is 'care' in this example.

First of all, I would GREATLY appreciate it if you would tell me which of these is easiest to incorporate into Lucene. I have the same problem you do. I have solved the other end of it but do not know how to fit a stemmer into Lucene.

But it appears that I can not use the stemmer to generate all of the
inflected forms of a given query term.

Does anyone know of such a tool for Lucene?

I am writing one which is VERY SPECIAL PURPOSE and therefore my code is not likely to be of much use to you. HOWEVER, the basic idea is quite simple:


Idea 1:

1) Since you have to use the stemmer against something, you are reading words out of the index and extracting their stems.

2) Having done that for a word, find all "nearby" words which have the same stem. The simplest definition of "nearby" that I can think of is that the word starts with the stem, but you might want to drop the last character of the stem and look for all words that start with that. Thus, if the stem is "care" you would look at all words that start with "car" and if they have "care" as the stem, they are in the same family.

The advantage of this approach is that you do not ever offer any words that are not in your index. If you found cares and cared but not caring in your index, you would not want to suggest that someone search for caring because they won't find it. So you use the index as the source of words to stem.
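
A minimal sketch of Idea 1 in Java, under stated assumptions: the sorted set `terms` stands in for the words read out of the index, and `toyStem` is a made-up suffix stripper used only so the example runs (it is not Porter, Snowball, or K-stem). The class and method names are hypothetical. The scan drops the last character of the stem, walks the terms that start with that prefix, and keeps those whose stem matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

final class StemFamilyScan {

    // Returns every indexed term that shares a stem with the query term.
    // 'terms' stands in for the sorted term list read out of the index;
    // 'stemmer' is a placeholder for whichever stemmer you plug in.
    static List<String> family(String queryTerm,
                               NavigableSet<String> terms,
                               UnaryOperator<String> stemmer) {
        String stem = stemmer.apply(queryTerm);
        // Drop the last character of the stem so that "caring" (which does
        // not start with "care") is still among the candidates.
        String prefix = stem.length() > 1
                ? stem.substring(0, stem.length() - 1)
                : stem;

        List<String> result = new ArrayList<>();
        // tailSet jumps to the first term >= prefix; terms sharing the prefix
        // are contiguous in sorted order, so stop at the first mismatch.
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            if (stemmer.apply(term).equals(stem)) {
                result.add(term);
            }
        }
        return result;
    }

    // Toy stemmer for demonstration only; a real stemmer goes here.
    static String toyStem(String w) {
        if (w.endsWith("ing") && w.length() > 4) {
            return w.substring(0, w.length() - 3) + "e";
        }
        if (w.endsWith("ed") && w.length() > 3) {
            return w.substring(0, w.length() - 1);
        }
        if (w.endsWith("s") && w.length() > 2) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }
}
```

With the terms car, care, cared, cares, carpet, and cat in the set, `StemFamilyScan.family("cares", terms, StemFamilyScan::toyStem)` returns care, cared, and cares. "caring" is not offered because it is not in the set, which is the point of using the index as the word source.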


Idea 2:

Another way to do it is to build a hash map of tree sets keyed to the stem. Each stem has a tree set of all words which have it as a stem. The code would look something like


    HashMap<String, TreeSet<String>> stemmedWords = new HashMap<String, TreeSet<String>>();
    TreeSet<String> wordsForStem;

    for (String word : allWordsInIndex) {  // placeholder: all words in the index
        String stem = MagicStemmer(word);  // I left out code for words that do not have stems
        if ((wordsForStem = stemmedWords.get(stem)) == null) {
            wordsForStem = new TreeSet<String>();  // Tree set for the new stem
            stemmedWords.put(stem, wordsForStem);  // Now this stem has a set for its words
        }
        wordsForStem.add(word);  // Put the word into the tree set for its stem
    }

For each stem from all the words in your index, you get a tree set which contains all the words which have it as a stem. The tree set keeps its words in alphabetical order.

If you want the stems to be displayed in alphabetical order, use a TreeMap instead of a HashMap.
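
To get from such a map back to the original question (all variants of a query term), one option is to stem the query term and look up its set. The following is only a sketch: `StemIndex` and `variantsOf` are hypothetical names, the stemmer is passed in as a function because no particular stemmer is assumed, and it relies on Java 8 library methods (`computeIfAbsent`, `Collections.emptySortedSet`).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

final class StemIndex {
    // stem -> alphabetically ordered words that share it
    private final Map<String, SortedSet<String>> wordsByStem = new HashMap<>();
    private final UnaryOperator<String> stemmer;

    // 'indexedWords' stands in for the words read out of the index.
    StemIndex(Iterable<String> indexedWords, UnaryOperator<String> stemmer) {
        this.stemmer = stemmer;
        for (String word : indexedWords) {
            // Create the set for a stem the first time it is seen, then add the word.
            wordsByStem
                .computeIfAbsent(stemmer.apply(word), k -> new TreeSet<>())
                .add(word);
        }
    }

    // Returns the indexed words sharing the query term's stem,
    // or an empty set if no indexed word has that stem.
    SortedSet<String> variantsOf(String queryTerm) {
        SortedSet<String> family = wordsByStem.get(stemmer.apply(queryTerm));
        return family == null
                ? Collections.emptySortedSet()
                : Collections.unmodifiableSortedSet(family);
    }
}
```

Each word returned by `variantsOf` could then be added to the search as an alternative term, so that a query for 'cares' also matches 'care' and 'cared' when they occur in the index.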

Any help or pointer would be greatly appreciated.

I would appreciate your telling me which stemmer for English words is easiest to incorporate into Lucene and where to find it. Thanks.


Bill Taylor


