On 06/21/2012 12:30 AM, Andreas wrote:
Hi,

Is there a similarity-function that minds national charsets?

Over here we've got some special cases that screw up the results on similarity().

Our characters: ä, ö, ü, ß
could as well be written as:  ae, oe, ue, ss

e.g.

SELECT similarity('Müller', 'Mueller');
returns: 0.363636

In normal cases everything below 0.5 would be too far apart to be considered a match.
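(For reference, similarity() here is pg_trgm's trigram overlap measure. A quick Python re-implementation, assuming single-word input as in the example, reproduces the 0.363636 figure exactly: the words share 4 of 11 distinct trigrams.)

```python
# Minimal sketch of pg_trgm-style trigram similarity, for illustration only.
# pg_trgm lower-cases each word and pads it with two leading spaces and one
# trailing space before extracting 3-character substrings; similarity is
# (shared trigrams) / (total distinct trigrams).
def trigrams(word):
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(round(similarity("Müller", "Mueller"), 6))  # 0.363636
```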

That's not just charset-aware, that's looking for awareness of language- and dialect-specific transliteration rules for representing accented chars in 7-bit ASCII. My understanding is that these rules and conventions vary and are specific to each language - or even region.

tsearch2 has big language dictionaries to try to handle some issues like this (though I don't know about this issue specifically). It's possible you could extend the tsearch2 dictionaries with synonyms, possibly algorithmically generated.

If you have what you consider to be an acceptable 1:1 translation rule, you could build a functional index on it and test against that, e.g.:

CREATE INDEX blah ON thetable ( (flatten_accent(target_column)) );
SELECT similarity( flatten_accent('Müller'), flatten_accent(target_column) ) FROM thetable;

Note that the flatten_accent function must be IMMUTABLE: it can't read data from other tables or columns, nor depend on SET (GUC) variables that might change at runtime.
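To sketch what such a flatten_accent might do (flatten_accent itself is a hypothetical function you'd have to write, e.g. as an IMMUTABLE SQL function): apply the usual German conventions ä→ae, ö→oe, ü→ue, ß→ss. Shown here in Python only to illustrate the effect on trigram similarity; after flattening, the two spellings match perfectly.

```python
# Illustration of a hypothetical flatten_accent() transliteration
# (German convention: ä→ae, ö→oe, ü→ue, ß→ss).  In the database this
# would be an IMMUTABLE function so it can back a functional index.
MAP = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
       "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}

def flatten_accent(s):
    return "".join(MAP.get(ch, ch) for ch in s)

# pg_trgm-style similarity, as above, to show the effect of flattening.
def trigrams(word):
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(flatten_accent("Müller"))                         # Mueller
print(similarity(flatten_accent("Müller"), "Mueller"))  # 1.0
```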
--
Craig Ringer

--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql