On 21 June 2013 19:04, Thom Brown <t...@linux.com> wrote:
> Hi,
>
> The unaccent extension is great, especially with its customisability, but
> it's not always easy to recommend. I witnessed a customer using no fewer
> than 56 nested replace calls in a single SQL function. I looked to see how
> much of this could be mitigated by unaccent. It turns out that not all of
> the characters they were replacing can be handled by unaccent, either
> because they replace more than one character at a time, or because the
> character in question, for some reason, isn't processed by unaccent, even
> with a custom rules file.
>
> So there were 20 characters I could identify that they were replacing. I
> made a custom rules file and compared its performance against the
> difficult-to-manage set of nested replace calls:
>
> CREATE OR REPLACE FUNCTION public.myunaccent(sometext text)
>  RETURNS text
>  LANGUAGE sql
>  IMMUTABLE
> AS $function$
> SELECT
> replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(sometext,'ą','a'),'Ą','A'),'ă','a'),'Ă','A'),'ā','a'),'Ā','A'),'æ','a'),'å','a'),'ä','a'),'ã','a'),'â','a'),'á','a'),'à','a'),'Æ','A'),'Å','A'),'Ä','A'),'Ã','A'),'Â','A'),'Á','A'),'À','A')
> ;
> $function$
>
> postgres=# SELECT myunaccent(sometext::text)
>            FROM (SELECT 'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext
>                  FROM generate_series(1,1000000)) x
>            OFFSET 999999 LIMIT 1;
>       myunaccent
> ----------------------
>  AAAAAAAaaaaaaaAaAaAa
> (1 row)
>
> Time: 726.282 ms
>
> postgres=# SELECT unaccent(sometext::text)
>            FROM (SELECT 'ÀÁÂÃÄÅÆàáâãäåæĀāĂ㥹' sometext
>                  FROM generate_series(1,1000000)) x
>            OFFSET 999999 LIMIT 1;
>       unaccent
> ----------------------
>  AAAAAAAaaaaaaaAaAaAa
> (1 row)
>
> Time: 3305.252 ms
>
> The timings for myunaccent stay pretty much the same even if I introduce
> 187 nested replace calls, one for each line in the unaccent.rules file,
> covering 187 characters.
> But unaccent with that same 187-character set increases to 7418.526 ms
> using the same type of query as above. That's roughly 10 times more
> expensive.
>
> Is there a way to boost the performance to make its adoption more
> palatable?
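As context for the custom rules file mentioned above: each line of an unaccent rules file maps a source character, then whitespace, then its replacement. A minimal sketch covering a few of the characters quoted above might look like this (the exact selection of mappings here is illustrative, not the customer's actual file):

```
À A
Á A
Â A
Ã A
Ä A
à a
á a
ą a
```

Pointing unaccent at such a file is what allows the single-pass lookup, as opposed to one string traversal per nested replace call.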
Another test, passing in a string of 100,000 characters, gives the following
timings:

unaccent:   240619.395 ms
myunaccent:    785.505 ms

I guess this must indicate that unaccent is processing all rows, while
myunaccent is only being run on the one selected row; I can't otherwise
account for myunaccent always taking almost the same time regardless of
string length. So this is probably an incorrect assessment of performance.

Another test, inserting long text strings into a text column of a table
100,000 times, then updating another column to hold the unaccented value
using both methods:

unaccent:    3867.306 ms
myunaccent: 43611.732 ms

So I guess this complaint about performance is all just noise.

However, pushing that pointless complaint to one side, I would like
unaccent to support more characters than it currently seems to, such as
bullet points, ellipses etc., and also replacements of more than one
character at a time. Naturally these aren't appropriate to fall under the
unaccent function itself, but the rules file is a good starting point. It
would be a bit like translate, except it would use a rules file instead of
strings of single characters to convert. So say we wanted "(trademark)" to
be converted into "™", or ";" into ".". We can't do that with unaccent, but
in order to avoid a huge list of nested replace calls, a function like
unaccent, with a few adaptations, would solve the problem. e.g.:

SELECT transform(my_custom_dictionary, 'Commodore Amiga(trademark);')

would return "Commodore Amiga™." This would ideally somehow cater for
replacing tabs and spaces too.

--
Thom
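To make the proposed semantics concrete, here is a minimal sketch in Python of what such a transform() could do, with a small stand-in rules dictionary (the RULES mapping and the longest-match-first choice are my assumptions, not an existing PostgreSQL API): all source strings are matched in a single pass over the input, so 56 rules cost one traversal rather than 56.

```python
import re

# Hypothetical stand-in for "my_custom_dictionary": note the
# multi-character source "(trademark)", which unaccent cannot express.
RULES = {
    "(trademark)": "™",
    ";": ".",
    "…": "...",
}

def transform(rules, text):
    """Replace every occurrence of each source string in one pass.

    Sources are tried longest first, so a rule like "(trademark)" wins
    over any shorter rule that could match inside it.
    """
    pattern = re.compile(
        "|".join(re.escape(src)
                 for src in sorted(rules, key=len, reverse=True))
    )
    return pattern.sub(lambda m: rules[m.group(0)], text)

print(transform(RULES, "Commodore Amiga(trademark);"))
# → Commodore Amiga™.
```

The single compiled alternation is also what a rules-file-driven C implementation could build once per dictionary load, rather than per call.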