On 21 June 2013 19:04, Thom Brown <t...@linux.com> wrote:

> Hi,
>
> The unaccent extension is great, especially with its customisability, but
> it's not always easy to recommend.  I witnessed a customer using no fewer
> than 56 nested replace functions in an SQL function.  I looked to see how
> much this could be mitigated by unaccent.  It turns out that not all of
> the characters they were replacing can be handled by unaccent, either
> because the replacement involves more than one character at a time, or
> because the character being replaced, for some reason, isn't processed by
> unaccent, even with a custom rules file.
>
> So there were 20 characters I could identify that they were replacing.  I
> made a custom rules file and compared its performance to the
> difficult-to-manage set of nested replace calls.
>
> CREATE OR REPLACE FUNCTION public.myunaccent(sometext text)
>  RETURNS text
>  LANGUAGE sql
>  IMMUTABLE
> AS $function$
> SELECT
> replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(sometext,'ą','a'),'Ą','A'),'ă','a'),'Ă','A'),'ā','a'),'Ā','A'),'æ','a'),'å','a'),'ä','a'),'ã','a'),'â','a'),'á','a'),'à','a'),'Æ','A'),'Å','A'),'Ä','A'),'Ã','A'),'Â','A'),'Á','A'),'À','A')
> ;
> $function$
>
> postgres=# SELECT myunaccent(sometext::text) FROM (SELECT
> 'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext FROM generate_series(1,1000000)) x OFFSET
> 999999 LIMIT 1;
>       myunaccent
> ----------------------
>  AAAAAAAaaaaaaaAaAaAa
> (1 row)
>
> Time: 726.282 ms
> postgres=# SELECT unaccent(sometext::text) FROM (SELECT
> 'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext FROM generate_series(1,1000000)) x OFFSET
> 999999 LIMIT 1;
>        unaccent
> ----------------------
>  AAAAAAAaaaaaaaAaAaAa
> (1 row)
>
> Time: 3305.252 ms
>
> The timings are actually pretty much the same even if I introduce 187
> nested replace calls, one for each line in the unaccent.rules file (187
> characters).  But with the same character set, unaccent increases to
> 7418.526 ms for the same type of query as above.  That's 10 times more
> expensive.
>
> Is there a way to boost the performance to make its adoption more
> palatable?
>
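For context, unaccent's rules file format is simply one mapping per line: the source character, whitespace, then its replacement (single-character replacements being all it handles here).  A cut-down custom file covering a few of the characters above might look like this:

```
À	A
Á	A
ą	a
Ą	A
```

Installed as e.g. my_rules.rules under SHAREDIR/tsearch_data, it can be wired up with CREATE TEXT SEARCH DICTIONARY my_rules (TEMPLATE = unaccent, RULES = 'my_rules') and then used as unaccent('my_rules', sometext).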

Another test passing in a string of 100000 characters gives the following
timings:

unaccent: 240619.395 ms
myunaccent: 785.505 ms

I guess this must indicate that unaccent is being run against every row,
while myunaccent is only being run on the single row the query returns?  I
can't otherwise account for myunaccent always taking almost the same time
regardless of string length.  So this is probably not a fair assessment of
performance.

Another test inserting long text strings into a text column of a table
100,000 times, then updating another column to have that unaccented value
using both methods:

unaccent: 3867.306 ms
myunaccent: 43611.732 ms
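Spelled out, that second test was along these lines (a sketch; the table
and column names here are made up, not the ones I actually used):

```sql
-- Throwaway table: raw holds the accented input, clean the result.
CREATE TABLE accent_test (raw text, clean text);

-- 100,000 rows of long accented text.
INSERT INTO accent_test (raw)
SELECT repeat('ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹', 50)
FROM generate_series(1, 100000);

-- Time each function doing the same work over every row (psql).
\timing on
UPDATE accent_test SET clean = unaccent(raw);
UPDATE accent_test SET clean = myunaccent(raw);
```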

So I guess this complaint about performance is all just noise.

However, pushing that pointless complaint to one side, I would like
unaccent to support more of the characters it currently doesn't, such as
bullet points, ellipses etc., and also replacements of more than one
character at a time.  Naturally these aren't appropriate to fall under the
unaccent function itself, but the rules file is a good starting point.  It
would be a bit like translate, except it would use a rules file instead of
strings of single characters to convert.

So say we wanted "(trademark)" to be converted into "™" just as an example,
or ";" to ".".  We can't do that with unaccent, but in order to avoid a
huge list of replace functions, a function like unaccent, with a few
adaptations, would solve the problem.

e.g.:

SELECT transform(my_custom_dictionary, 'Commodore Amiga(trademark);')

would return

Commodore Amiga™.

This would ideally somehow cater for replacing tabs and spaces too.
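In the absence of such a function, the behaviour can be faked today with a
mapping table and a PL/pgSQL loop.  This is only a sketch: the table name,
the single-argument signature, and the longest-match-first ordering are my
own choices here, not a proposal for the actual API.

```sql
-- A dictionary table standing in for the proposed rules file.
CREATE TABLE my_custom_dictionary (src text PRIMARY KEY, dst text);

INSERT INTO my_custom_dictionary VALUES
  ('(trademark)', '™'),
  (';', '.');

CREATE OR REPLACE FUNCTION transform(sometext text)
RETURNS text
LANGUAGE plpgsql STABLE
AS $$
DECLARE
  pair record;
BEGIN
  -- Apply longer source strings first so overlapping rules
  -- behave predictably.
  FOR pair IN
    SELECT src, dst
    FROM my_custom_dictionary
    ORDER BY length(src) DESC
  LOOP
    sometext := replace(sometext, pair.src, pair.dst);
  END LOOP;
  RETURN sometext;
END;
$$;

SELECT transform('Commodore Amiga(trademark);');
-- returns: Commodore Amiga™.
```

Of course, looping over replace calls row by row is exactly what a proper
rules-file-driven function could do in one pass in C.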

-- 
Thom
