Quim Sanmarti <[EMAIL PROTECTED]> wrote:
[snip]
>> Gilles Detillieux <[EMAIL PROTECTED]> wrote:
>>
>> Yes, this is a known limitation of the current fuzzy match algorithms.
>> Fuzzy matches are only applied directly to the original search words,
>> and not to the fuzzy match words generated by other algorithms. The same
>> problem exists with results from the endings algorithm not also being
>> processed by the accents algorithm.
>>
>> I think the solution in general would be to run the fuzzy algorithms
>> iteratively until no new search words are generated. These
>> iterations may only be necessary for dictionary-based algorithms, if these
>> are processed before any word database-based algorithms. I'm not certain
>> of this last point - it just occurred to me. Certainly, though, some sort
>> of iterative process would be needed. I've given this some thought before,
>> but I don't think it's quite as easy as it sounds to implement this
>> reliably.
>>
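The fixed-point iteration Gilles describes could be sketched like this. The
toy endings/accents rules below are invented stand-ins for illustration, not
htdig's actual algorithms:

```python
def expand_to_fixed_point(words, algorithms):
    """Apply each algorithm to every new word until no new words appear."""
    seen = set(words)
    frontier = set(words)
    while frontier:
        generated = set()
        for algo in algorithms:
            for word in frontier:
                generated.update(algo(word))
        frontier = generated - seen  # only genuinely new words get re-expanded
        seen |= frontier
    return seen

def toy_endings(word):
    # pretend the endings algorithm just adds a plural 's'
    return set() if word.endswith("s") else {word + "s"}

def toy_accents(word):
    # pretend the accents algorithm strips the accent from 'e'
    return {word.replace("café"[3], "e")} if "café"[3] in word else set()

# "cafes" only appears because the output of one algorithm is fed back
# through the other -- exactly the cross-feeding the current code lacks.
print(sorted(expand_to_fixed_point({"café"}, [toy_endings, toy_accents])))
```

Note that "cafes" is generated only on the second pass, which is why a single
pass over the original search words misses it.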
>
>I also feel that the eventual evolution of fuzzy handling should be somewhat
>more flexible than Alexander's proposal. Not everybody will want to expand
>words like that.
>Take into account that extensive, uncontrolled word expansion may introduce
>unexpected semantic drift into the resulting query, which will probably
>produce nonsensical responses. The expanded queries risk becoming an ORed
>list of relatively unrelated words.
I don't see problems in combining the synonyms and endings algorithms
(provided the dictionaries are correct). But if you add spelling or soundex,
the number of irrelevant words may indeed become very high. These irrelevant
words result mainly from the nature of the spelling/soundex algorithms, not
from synonyms; correct processing of synonyms will cause only a minor
increase in the number of irrelevant words.
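The flood of irrelevant words from soundex is easy to demonstrate. A minimal
sketch of the classic Soundex coding (not htdig's exact implementation) maps
quite unrelated words to the same key:

```python
# Build the standard Soundex letter-to-digit table.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Classic 4-character Soundex code for an alphabetic word."""
    word = word.lower()
    result = [word[0].upper()]
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":              # h and w do not separate duplicate codes
            continue
        code = CODES.get(ch, "")    # vowels map to "" and reset prev
        if code and code != prev:
            result.append(code)
        prev = code
    return ("".join(result) + "000")[:4]

# Two semantically unrelated words collide on the same key,
# so a soundex match would pull both into the query.
print(soundex("light"), soundex("looked"))
```

Every word in the database sharing the key would be ORed into the query,
regardless of meaning.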
The real purpose of the changes I'm proposing is to teach HTDig to work
with more complex word structures needed for Russian.
In my work I'm using dictionaries in an extended ispell format, which
supports structures like
be -- am -- are -- is -- was -- were -- being -- been
within the endings algorithm alone. I changed the HTDig code a bit to
support this format; the changes are not publicly available because
the only dictionary that uses this format is my dictionary for Russian
(over 100,000 words). I have started similar work on an English dictionary,
but it is at the very beginning.
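The idea of linking irregular forms to one root could be sketched as below.
The "root: form form ..." line format is an invented illustration, not the
actual extended ispell format Alexander uses:

```python
def build_form_index(lines):
    """Map every surface form (and the root itself) to the full set of forms."""
    index = {}
    for line in lines:
        root, _, rest = line.partition(":")
        forms = {root.strip(), *rest.split()}
        for form in forms:
            index[form] = forms     # each form expands to all of its siblings
    return index

# A single dictionary entry covers the whole irregular paradigm of "be".
idx = build_form_index(["be: am are is was were being been"])
print(sorted(idx["was"]))
```

With an index like this, the endings algorithm can expand "was" to all eight
forms in one lookup, with no affix rules involved.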
The structures of Russian and English are very different. In English many
nouns have the same form as verbs, and most of the changes that produce these
forms can be described well with ispell flags. In Russian, a verb and its
participles can be described well only with up to 5 lines in ispell format
(the participles have different forms for past and present, and for passive
and active voice), and the nouns, as a rule, have a different form too. So my
idea was to use the synonyms algorithm not only to support normal synonyms,
but also to link these forms produced from a verb. BTW, I'm not sure that
this approach is what I need, as the number of links may be very high
(20,000 or so), and the corresponding database may become extremely
large (the normal root2word and word2root databases for Russian are about
70 MB). I suppose that using linked lists may be a better solution,
but I'd like to start with the existing algorithm with minor changes.
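The two-step root2word/word2root lookup mentioned above can be sketched with
plain dicts standing in for the database files; the sample data is invented
for illustration:

```python
# word2root: map each surface form to its root;
# root2word: map each root to all of its surface forms.
word2root = {"run": "run", "ran": "run", "runs": "run", "running": "run"}
root2word = {"run": ["run", "ran", "runs", "running"]}

def expand(word):
    """Expand a query word to all forms sharing its root, or leave it alone."""
    root = word2root.get(word)
    return root2word.get(root, [word]) if root else [word]

print(expand("ran"))
```

Linking the Russian participle and noun forms through such tables multiplies
the entries, which is why the databases grow so large; a linked-list layout
would store each paradigm once instead of once per form.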
- Alexander
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html