According to Neal Richter:
> Here are two possible approaches:
> 
> 1) Strip accents from all stored words & queries.  This is a fairly common
> practice in search engines & NLP systems.  The obvious disadvantage is
> that a user can't restrict results to contain that specific accent... they
> get back results with all of the different accents for a 'base letter'.

For some languages, though, accents are more than mere pronunciation
cues and can be quite significant.  If you stripped accents
unconditionally, I think it could lead to a lot of false matches in some
languages.  This approach would probably work fine for languages like
Spanish and Italian, but there might be problems with some French words
where a user wants to distinguish the accented form from the unaccented
one.  In Scandinavian languages, where for example ö (or ø) is a
completely different letter from o, I'd expect accent stripping to
generate some pretty bad results.
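
To make the false-match concern concrete, here is a minimal sketch in
Python.  It is not ht://Dig code: strip_accents is a hypothetical helper
that assumes "stripping" means Unicode decomposition followed by dropping
the combining marks, and the sample words are my own.

```python
import unicodedata

def strip_accents(word):
    """Decompose to NFD, then drop the combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# French: three different words collapse into a single index term.
french = ["cote", "côte", "côté"]
french_stripped = {strip_accents(w) for w in french}

# Swedish: "för" and "for" are different words, but stripping merges them.
swedish_stripped = strip_accents("för")

# "ø" has no canonical decomposition, so this particular method leaves it
# alone; a stripper that mapped it to "o" would need an explicit table.
danish_stripped = strip_accents("øl")
```

With this method the three French words all become "cote" and "för"
becomes "for", while "øl" is left unchanged, so the behaviour is not even
consistent across the Scandinavian letters.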

The advantage of treating accents via a fuzzy match method is that you
have search-time control over whether accented and unaccented letters
are treated as equivalent, and if so, how much weight the variants will
have in the search results.
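
As a rough illustration of that search-time control, the sketch below
expands a query term into its accent variants and gives them a reduced
weight.  The names expand_query and accent_weight are assumptions made
up for this example; they are not ht://Dig functions or attributes, and
the word list stays exactly as it was indexed.

```python
import unicodedata

def strip_accents(word):
    """Decompose to NFD, then drop the combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def expand_query(term, indexed_words, accent_weight=0.5):
    """Map each matching indexed word to a weight (hypothetical sketch).

    An exact match gets weight 1.0.  Words that differ only by accents
    get accent_weight; setting accent_weight to 0 disables the fuzzy
    match entirely, so the decision is made per search, not per index.
    """
    matches = {}
    if term in indexed_words:
        matches[term] = 1.0
    if accent_weight > 0:
        base = strip_accents(term)
        for word in indexed_words:
            if word != term and strip_accents(word) == base:
                matches[word] = accent_weight
    return matches

indexed_words = {"cote", "côte", "côté", "chat"}
strict = expand_query("côte", indexed_words, accent_weight=0)
loose = expand_query("côte", indexed_words, accent_weight=0.5)
```

Here strict contains only "côte", while loose also contains "cote" and
"côté" at half weight, all from the same unmodified word list.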

> 2) Store BOTH the accented word & unaccented/stripped word in the
> db.words.db.  Silently augment each search query with the stripped version
> of each word.
>   This steps around the disadvantage of #1 and still gets the
> 'generalization' of stripped accents.

I'm not sure exactly how this gets around the problem with the first
approach.  By putting the stripped words into the same database as the
original ones, you lose some ability to make the distinction between the
two at search time.  Also, if we "silently augment" the search query as
a fuzzy match method, then we still run into the need for chaining.
If it's via another mechanism, how is that mechanism to be controlled?
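
Here is a small sketch of what I mean about losing the distinction.  It
assumes one reading of approach 2, in which the stripped form is stored
as an ordinary entry in the same word table; build_index and the
documents are hypothetical and only there for illustration.

```python
from collections import defaultdict
import unicodedata

def strip_accents(word):
    """Decompose to NFD, then drop the combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_index(docs):
    """Store each word and its stripped form in one shared word table."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            index[word].add(doc_id)
            index[strip_accents(word)].add(doc_id)
    return index

# Document 1 really contains "cote"; document 2 only contains "côte".
docs = {1: ["cote"], 2: ["côte"]}
index = build_index(docs)

unaccented_hits = index["cote"]
accented_hits = index["côte"]
```

A search for the accented "côte" can still be exact, but a search for the
plain "cote" returns both documents, and nothing in the table says which
entry came from a stripped word.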

Also, it may well be that none of these changes will help Dominique,
if the source of his problem is indeed that the words he wants to be
automatically capitalized aren't even getting into his endings database
to begin with.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

