Spelling suggestions

2007-12-16 Thread Lukas Lipka
Hi,

Yesterday, as a result of my afternoon hacking, I commited a first-pass
of spelling suggestions into trunk. This is still far from finished and
needs some work. I have received a lot of emails about this, so I will
try to reply and sum up what I have done so far.

Spelling suggestions was an idea first worked on by Fredrik[1] in the
past. His implementation had to build and use a separate index for
storing all the stemmed words. This wasn't very efficient.

My first implementation was very similar to his, but it used a
FuzzyTermEnum to generate suggestions on the fly. FuzzyTermEnum uses
Levenshtein distance to find similar words. This eliminated the need for
maintaining a separate spelling index for each queryable.

In this implementation suggestions were always generated for each query.
Everything worked fine but I wanted to make sure there is no overhead
and that queries won't take much longer than usual.

Profiling showed that generating suggestions for relatively small
indexes doesn't make much of a difference (2%-8% of the whole query
time). However, to my dismay larger indexes proved to be a problem (45%
of the whole query time on FSQ! Whoa!).

That is why I chose another implementation, similar to how snippets work
- spelling suggestions only on-demand. Now you can request suggestions
only when you need them, by sending a SuggestionsReques which is
followed by SuggestionsResponse reply. In my personal point of view this
is much more efficient and doesn't affect the query responsiveness.

There are still some things that need to be done, before we can all
celebrate the Did you mean to search for... stuff. :-)

* Lucene only stores stemmed forms of the words (beagle becomes beagl)

We have to figure out a way to unstem the word:
1.) Hack the analyzer to get the unstemmed word
2.) Traverse through our TextCache and find a word which
which contains the stem part.
This is what I'll be looking into today/tomorrow.

* Rank the suggestions

We need to only return the highest relevant suggestions, based on:
1.) Term frequency in index
2.) Levenshtein distance score

* Suggestions limiter

I've currently limited the suggestions to those that have
the same first character. This is based on the assumption that
most likely the first letter the user types will be correct.
I'm not sure if this is correct, but it does generate better
results.

* Multi-term queries

We also need to look into handling multi-term queries,
where suggestions should contain the whole phrase and not
just the individual words.

Sorry, for the exhausting email and lets make Beagle rock! :-)

Best,
Lukas


___
Dashboard-hackers mailing list
Dashboard-hackers@gnome.org
http://mail.gnome.org/mailman/listinfo/dashboard-hackers


Re: Spelling suggestions

2007-12-16 Thread Debajyoti Bera
 * Lucene only stores stemmed forms of the words (beagle becomes beagl)

 We have to figure out a way to unstem the word:
   1.) Hack the analyzer to get the unstemmed word
   2.) Traverse through our TextCache and find a word which
   which contains the stem part.
 This is what I'll be looking into today/tomorrow.

You might want to check the Highlighter.net package (in Lucene.Net/contrib 
from their website). They highlight matched words. They use StandardAnalyzer 
in their example but I wrapped a PorterStemmer around it and asked it to 
highlight words with same stem and it was able to do it.
One way I had in mind was to create a tokenstream, check if the tokentext is 
the same as the suggested stem, if yes use the token.startoffset, 
token.endoffset to extract the actual text. Of course its easier said than 
done ;-)

 We need to only return the highest relevant suggestions, based on:
   1.) Term frequency in index
   2.) Levenshtein distance score

Add to that there could be multiple indexes so results from multiple indexes 
need to be intelligently merged.

 Sorry, for the exhausting email and lets make Beagle rock! :-)

Ya !!!

- dBera

-- 
-
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE fan
Mandriva / Inspiron-1100 user
___
Dashboard-hackers mailing list
Dashboard-hackers@gnome.org
http://mail.gnome.org/mailman/listinfo/dashboard-hackers