On 06/21/2012 12:30 AM, Andreas wrote:
Hi,

Is there a similarity-function that minds national charsets?

Over here we've got some special cases that screw up the results on similarity().

Our characters: ä, ö, ü, ß
could as well be written as:  ae, oe, ue, ss

e.g.

SELECT similarity('Müller', 'Mueller');
returns: 0.363636

In normal cases everything below 0.5 would be too far apart to be considered a match.
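(For reference, similarity() here is pg_trgm's trigram overlap measure. A quick Python re-implementation, assuming single-word input as in the example, reproduces the 0.363636 figure exactly: the words share 4 of 11 distinct trigrams.)

```python
# Minimal sketch of pg_trgm-style trigram similarity, for illustration only.
# pg_trgm lower-cases each word and pads it with two leading spaces and one
# trailing space before extracting 3-character substrings; similarity is
# (shared trigrams) / (total distinct trigrams).
def trigrams(word):
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(round(similarity("Müller", "Mueller"), 6))  # 0.363636
```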

That's not just charset-aware, that's looking for awareness of language- and dialect-specific transliteration rules for representing accented chars in 7-bit ASCII. My understanding is that these rules and conventions vary and are specific to each language - or even region.

tsearch2 has big language dictionaries to try to handle some issues like this (though I don't know about this issue specifically). It's possible you could extend the tsearch2 dictionaries with synonyms, possibly algorithmically generated.

If you have what you consider to be an acceptable 1:1 translation rule, you could build a functional index on it and test against that, e.g.:

CREATE INDEX blah ON thetable ( (flatten_accent(target_column)) );
SELECT similarity( flatten_accent('Müller'), flatten_accent(target_column) ) FROM thetable;

Note that the flatten_accent function must be IMMUTABLE: it can't read data from other tables or columns, nor depend on SET (GUC) variables that might change at runtime.
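To sketch what such a flatten_accent might do (flatten_accent itself is a hypothetical function you'd have to write, e.g. as an IMMUTABLE SQL function): apply the usual German conventions ä→ae, ö→oe, ü→ue, ß→ss. Shown here in Python only to illustrate the effect on trigram similarity; after flattening, the two spellings match perfectly.

```python
# Illustration of a hypothetical flatten_accent() transliteration
# (German convention: ä→ae, ö→oe, ü→ue, ß→ss).  In the database this
# would be an IMMUTABLE function so it can back a functional index.
MAP = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
       "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}

def flatten_accent(s):
    return "".join(MAP.get(ch, ch) for ch in s)

# pg_trgm-style similarity, as above, to show the effect of flattening.
def trigrams(word):
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(flatten_accent("Müller"))                         # Mueller
print(similarity(flatten_accent("Müller"), "Mueller"))  # 1.0
```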
--
Craig Ringer

--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql