According to Neal Richter:
> Here are two possible approaches:
>
> 1) Strip accents from all stored words & queries.  This is a fairly common
> practice in search engines & NLP systems.  The obvious disadvantage is
> that a user can't restrict results to contain that specific accent... they
> get back results with all of the different accents for a 'base letter'.
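
To make approach 1 concrete, here is a minimal Python sketch of unconditional accent stripping.  It is only an illustration, not ht://Dig code: the name strip_accents is hypothetical, and it assumes accents are removed by Unicode NFD decomposition followed by dropping the combining marks.

```python
import unicodedata

def strip_accents(word):
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Four distinct French words end up under a single index key.
french = ["cote", "côte", "côté", "coté"]
keys = {strip_accents(w) for w in french}
print(keys)
```

With this method ö becomes o, while ø has no Unicode decomposition and is left untouched, so a real implementation would need an explicit policy for such letters.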

For some languages, though, accents are more than mere pronunciation cues and can be quite significant.  If you stripped accents unconditionally, I think it could lead to a lot of false matches in those languages.  This approach would probably work fine for languages like Spanish and Italian, but I think there might be problems with some French words, where a user might want to distinguish between the accented and unaccented forms.  In Scandinavian countries, where for example ö (or ø) is a completely different letter from o, I'd expect accent stripping to generate some pretty bad results.

The advantage of treating accents via a fuzzy match method is that you have search-time control over whether accented and unaccented letters are treated as equivalent and, if so, how much weight the variants will have in the search results.

> 2) Store BOTH the accented word & unaccented/stripped word in the
> db.words.db.  Silently augment each search query with the stripped version
> of each word.
> This steps around the disadvantage of #1 and still gets the
> 'generalization' of stripped accents.

I'm not sure exactly how this gets around the problem with the first approach.  By putting the stripped words into the same database as the original ones, you lose some ability to make the distinction between the two at search time.  Also, if we "silently augment" the search query as a fuzzy match method, then we still run into the need for chaining.  If it's via another mechanism, how is that mechanism to be controlled?

Also, it may well be that none of these changes will help Dominique, if the source of his problem is indeed that the words he wants to be automatically capitalized aren't even getting into his endings database to begin with.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U.
of Manitoba        Winnipeg, MB  R3E 3J7  (Canada)
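
P.S. The search-time control argued for above can be sketched roughly as follows.  This is a hedged Python illustration under stated assumptions, not ht://Dig's fuzzy-match code: build_variant_map, expand_query and accent_weight are hypothetical names, and the word index is assumed to keep the original accented spellings, with the stripped form used only as a lookup key in a separate map.

```python
import unicodedata
from collections import defaultdict

def strip_accents(word):
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_variant_map(indexed_words):
    """Map each stripped form to the set of original indexed spellings."""
    variants = defaultdict(set)
    for word in indexed_words:
        variants[strip_accents(word)].add(word)
    return variants

def expand_query(word, variants, accent_weight=0.5):
    """Return {indexed_word: weight} for one query word.

    An exact match gets weight 1.0; accent variants get accent_weight.
    Setting accent_weight to 0 disables accent equivalence entirely.
    """
    expanded = {}
    for candidate in variants.get(strip_accents(word), ()):
        if candidate == word:
            expanded[candidate] = 1.0
        elif accent_weight > 0:
            expanded[candidate] = accent_weight
    return expanded

variants = build_variant_map(["cote", "côte", "côté"])
loose = expand_query("côté", variants, accent_weight=0.5)
strict = expand_query("côté", variants, accent_weight=0)
```

Because the stripped forms never enter the word database itself, the distinction between accented and unaccented words is still available at search time, and the weight is a per-search choice rather than an indexing decision.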