Re: [htdig] a flaw in search algorithm (synonyms)

Gilles Detillieux Mon, 17 Sep 2001 09:35:57 -0700
According to Alexander I. Lebedev:
> I found a flaw in the way HTDig applies the synonyms algorithm.
> The current algorithm searches only for the words listed in the synonyms
> database, and cannot find their related word forms.
> 
> Simple example:  the words "center" and "centre" are the US and GB
> spellings of the same word, so they are in the synonyms database.  The word
> forms for these words are created according to the following flags:
> center/DGJMRSZ, centre/DGMS.  So, the forms created with the /G flag should
> be "centering" and "centring".  An attempt to find these words in my
> document database results in 13 documents for "centering" and 0 documents
> for "centring", while it gives the same number of documents for "center"
> and "centre" (63 documents).
> 
> The flaw is that the word forms are looked up in all databases
> simultaneously (i.e. in the endings and synonyms databases), so the synonym
> list is known only after all word endings have already been found.  The
> correct solution would be the following:
>   1. Look into word2root database to find the root(s) of the word(s)
>      (centring->centre);
>   2. Look into synonyms database to find possible synonyms
>      (centre = center);
>   3. Find all word forms for the root(s) and _all_synonyms_ using root2word
>      database (..., centring, centering, ...).
> 
> Could these corrections be taken into account in the HTDig code?
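
The three steps proposed above can be sketched roughly as follows.  This is
only an illustration in Python, not ht://Dig code: the dictionaries are
hypothetical in-memory stand-ins for the word2root, synonyms and root2word
databases, the function name expand_word is made up, and the word forms
listed are a small illustrative sample rather than the full output of the
affix flags.

```python
# Hypothetical stand-ins for the three databases (illustrative data only).
WORD2ROOT = {
    "centre": ["centre"], "centres": ["centre"], "centring": ["centre"],
    "center": ["center"], "centers": ["center"], "centering": ["center"],
}
SYNONYMS = {"centre": ["center"], "center": ["centre"]}
ROOT2WORD = {
    "centre": ["centre", "centres", "centring"],
    "center": ["center", "centers", "centering"],
}


def expand_word(word):
    """Return all word forms of the word's roots and of their synonyms."""
    # Step 1: word -> root(s); fall back to the word itself if unknown.
    roots = set(WORD2ROOT.get(word, [word]))
    # Step 2: add the synonyms of every root.
    for root in list(roots):
        roots.update(SYNONYMS.get(root, []))
    # Step 3: root(s) and synonyms -> all word forms.
    forms = set()
    for root in roots:
        forms.update(ROOT2WORD.get(root, [root]))
    return sorted(forms)


if __name__ == "__main__":
    print(expand_word("centring"))
```

With this ordering a search for "centring" also reaches "centering", because
the synonym lookup happens on the root before the endings are generated.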

Yes, this is a known limitation of the current fuzzy match algorithms.
Fuzzy matches are only applied directly to the original search words,
and not to the fuzzy match words of other algorithms.  The same problem
exists with results from the endings algorithm not being also processed
by the accents algorithm.

I think the solution in general would be to run the fuzzy algorithms
iteratively until no new search words are generated.  These iterations may
only be necessary for dictionary-based algorithms, if these are processed
before any word database-based algorithms.  I'm not certain of this last
point - it just occurred to me.  Certainly, though, some sort of iterative
process would be needed.  I've given this some thought before, but I
don't think it's quite as easy as it sounds to implement this reliably.
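
To make the iterative idea concrete, here is a rough sketch in Python.  It is
not ht://Dig code: the two toy algorithms, their data, and the names
expand_until_stable and max_rounds are all assumptions for illustration.
Each round feeds only the newly generated words back into every algorithm,
and the loop stops when a round produces nothing new (or when the round limit
is hit, as a guard against runaway expansion).

```python
# Toy data, illustrative only: words grouped by root, and a synonym table.
FORMS = [
    {"centre", "centres", "centring"},
    {"center", "centers", "centering"},
]
SYNONYMS = {"centre": {"center"}, "center": {"centre"}}


def endings(word):
    """Toy endings algorithm: all forms that share a root with the word."""
    for group in FORMS:
        if word in group:
            return set(group)
    return set()


def synonyms(word):
    """Toy synonyms algorithm: direct dictionary lookup only."""
    return set(SYNONYMS.get(word, ()))


def expand_until_stable(words, algorithms, max_rounds=10):
    """Apply every algorithm to new words until no new words appear."""
    seen = set(words)
    frontier = set(words)
    for _ in range(max_rounds):
        generated = set()
        for word in frontier:
            for algorithm in algorithms:
                generated |= algorithm(word)
        frontier = generated - seen   # keep only words not seen before
        if not frontier:
            break                     # fixed point reached
        seen |= frontier
    return seen


if __name__ == "__main__":
    print(sorted(expand_until_stable({"centring"}, [endings, synonyms])))
```

With max_rounds=1 this behaves like the current single pass: "centring"
yields the "centre" forms but never reaches "centering".  Tracking the set of
words already seen is what guarantees termination here; with real
dictionaries the harder questions are how fast the word list grows and how
to weight words found several steps away from the original.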

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
