On Feb 9, 2005, at 7:23 AM, Aad Nales wrote:
In my Clipper days I could build an index on English words using a technique called Soundex. Searching that index returned words that sounded the same. From what I remember, this technique only worked for English. Has it ever been generalized?

I do not know how Soundex/Metaphone/Double Metaphone work with non-English languages, but these algorithms are in Jakarta Commons Codec. I used the Metaphone algorithm as a custom analyzer example in Lucene in Action. You'll see it in the source code distribution under src/lia/analysis/codec. I did a couple of variations: one that adds the metaphoned version as a token in the same position as the original, and one that simply replaces the original token in the token stream.
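
For readers without the book's source at hand, here is a minimal Python sketch of the two variations. It is not the code from src/lia/analysis/codec: it swaps in classic American Soundex for Metaphone because Soundex is short enough to write inline, and the names soundex, inject_codes and replace_with_codes are made up for this illustration. Tokens are modelled as (term, position increment) pairs, where an increment of 0 means "same position as the previous token".

```python
# Classic American Soundex, written out by hand (an English-oriented code;
# no claim is made here about how well it suits other languages).
CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        CODES[letter] = digit


def soundex(word):
    """Return the 4-character Soundex code of word ('' if it has no A-Z letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return (result + "000")[:4]


def inject_codes(terms):
    """Variation 1: keep each term and add its code at the same position."""
    for term in terms:
        yield (term, 1)           # original token advances the position
        yield (soundex(term), 0)  # code shares that position (increment 0)


def replace_with_codes(terms):
    """Variation 2: emit only the code in place of each term."""
    for term in terms:
        yield (soundex(term), 1)


print(list(inject_codes(["cool", "cat"])))
print(list(replace_with_codes(["col", "cat"])))
```

With the first variation an exact-term query still matches the original text; with the second, both index and query must go through the same encoder.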


I even envisioned this sounds-like feature being used for children. I was mulling over this idea while having lunch with my son one day last spring (he was 5 at the time). I asked him how to spell "cool cat" and he replied "c-o-l c-a-t". I tried it out with the metaphone algorithm and it matches!

        http://www.lucenebook.com/search?query=cool+cat

        Erik



What I am trying to solve is this. A customer is looking for a solution to spelling mistakes made by children (up to age 10) when typing in queries. The site is Dutch. A common mistake is 'sgool' when searching for 'school'. The 'normal' spellcheckers and suggesters typically generate a list in which the 'sounds like' candidates are too far away from the intended word. So what I am thinking about doing is this:


1. create a parser that takes a word and creates a soundindex entry.

2. create a list of 'correctly' spelled words, based either on the index of the website or on some kind of dictionary.
2a. perhaps create an n-gram index based on these words


3. accept a query, figure out that a spelling mistake has been made
3a. find alternatives by parsing the query and searching the 'sounds like' index, and then calculate and order the results
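
The steps above can be sketched roughly as follows. Everything here is an assumption made for illustration: toy_key is a throwaway placeholder for the step 1 parser, "not in the word list" stands in for detecting a spelling mistake, and ranking by character-bigram overlap (Dice coefficient) is just one plausible way to order candidates, loosely inspired by step 2a.

```python
import re
from collections import defaultdict


def toy_key(word):
    """Placeholder sound key (NOT real Dutch phonetics): 'ch' -> 'g', collapse repeats."""
    key = word.lower().replace("ch", "g")
    return re.sub(r"(.)\1+", r"\1", key)


def bigrams(word):
    """Set of character bigrams, padded so first and last letters count."""
    padded = f"_{word}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def dice(a, b):
    """Dice coefficient on bigram sets: 1.0 for identical words, 0.0 for no overlap."""
    ga, gb = bigrams(a), bigrams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def build_sound_index(words, sound_key):
    """Steps 1 and 2: map each sound key to the correctly spelled words behind it."""
    index = defaultdict(set)
    for word in words:
        index[sound_key(word)].add(word)
    return index


def suggest(query, index, known_words, sound_key):
    """Steps 3 and 3a: return alternatives for an unknown word, best match first."""
    if query in known_words:
        return []  # assumed spelled correctly, nothing to suggest
    candidates = index.get(sound_key(query), set())
    return sorted(candidates, key=lambda w: (-dice(query, w), w))


known = {"school", "schol", "schaal", "stoel"}
sound_index = build_sound_index(known, toy_key)
print(suggest("sgool", sound_index, known, toy_key))
```

The key function is passed in as a parameter so that a proper Dutch encoder could replace toy_key without touching the rest.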


Steps 2 and 3 have been discussed at length in this forum and have even made it to the sandbox. What I am left with is 1.

My thinking is to process a series of replacement rules along these lines:
--
g sounds like ch if the immediate predecessor is an s.
o sounds like oo if the immediate predecessor is a consonant.
--
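
A minimal sketch of how such context-dependent rules could be encoded, assuming an ordered list of regular-expression rewrites. Only the two rules quoted above are included; the function name sound_entry, the consonant set, and the guard that leaves an existing "oo" alone are assumptions added for the example, not a worked-out Dutch rule set.

```python
import re

CONSONANTS = "bcdfghjklmnpqrstvwxz"

# Ordered rewrite rules as (compiled pattern, replacement) pairs.
# A lookbehind inspects the preceding letter without consuming it.
RULES = [
    # "g sounds like ch if the immediate predecessor is an s"
    (re.compile(r"(?<=s)g"), "ch"),
    # "o sounds like oo if the immediate predecessor is a consonant";
    # the (?!o) guard is an added assumption so "oo" is not stretched to "ooo"
    (re.compile(rf"(?<=[{CONSONANTS}])o(?!o)"), "oo"),
]


def sound_entry(word):
    """Apply every rule in order and return the normalized sound-index entry."""
    entry = word.lower()
    for pattern, replacement in RULES:
        entry = pattern.sub(replacement, entry)
    return entry


for spelling in ("sgool", "sgol", "school"):
    print(spelling, "->", sound_entry(spelling))
```

Because the same rules run over both the dictionary words and the query, a misspelling and its target only need to end up with the same entry; the entry itself does not have to be a real word.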


But before I take this to the next step, I am wondering whether anybody has created or thought up alternative solutions?

Cheers,
Aad








--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]




