Re: [HACKERS] Latest on CITEXT 2.0

David E. Wheeler Thu, 26 Jun 2008 10:10:29 -0700

On Jun 26, 2008, at 10:02, Tom Lane wrote:

> BTW, I don't think you can use that same-length optimization for
> citext.  There's no reason to think that upper/lowercase pairs will
> have the same length all the time in multibyte encodings.

I was wondering about that. I had been thinking of canonically equivalent strings and combining marks. From a quick test, it looks like a character built with a combining mark is not treated as equal to its precomposed form. For example, this returns false:


  SELECT 'Ä'::text = 'Ä'::text;

At least with en_US.UTF-8. Hrm. It looks like my client normalizes both strings to the same canonical form, so I've attached a script demonstrating this issue.
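
Since the two literals above can end up identical once a client normalizes them, here is a minimal sketch of the same comparison with the code points written as escapes. It is Python rather than SQL and runs outside PostgreSQL, so it only illustrates the Unicode side of things; the variable names are mine and are not taken from the attached script.

```python
import unicodedata

# Two canonically equivalent spellings of the same character.
composed = "\u00C4"      # LATIN CAPITAL LETTER A WITH DIAERESIS (one code point)
decomposed = "A\u0308"   # "A" followed by COMBINING DIAERESIS (two code points)

# A plain code point comparison treats them as different strings.
print(composed == decomposed)                                          # False

# Their UTF-8 encodings differ in length as well.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))  # 2 3

# They compare equal only after normalizing to a common form (NFC here).
print(unicodedata.normalize("NFC", decomposed) == composed)            # True
```

So a comparison that works on the raw bytes or code points would see two different strings unless something normalizes them first.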

Anyway, I was aware of different byte counts for canonical equivalence, but not for differences between upper- and lowercase characters. I'd certainly defer to your knowledge of how these things truly work in PostgreSQL, Tom, and can of course easily remove that optimization. So, are you certain about this?
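
For what it's worth, a quick check of the Unicode case mappings suggests such pairs do exist. The sketch below uses Python's built-in case conversion, not PostgreSQL's upper() and lower(), which depend on the server's locale and C library, so take it as an illustration of the encoding issue rather than of what the server will return. The helper utf8_len and the list of characters are my own choices for the example.

```python
def utf8_len(s: str) -> int:
    """Return the number of bytes s occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))

# Each entry pairs a character with its case-mapped counterpart.
pairs = [
    ("\u0131", "\u0131".upper()),  # dotless i maps up to ASCII "I"
    ("\u212A", "\u212A".lower()),  # KELVIN SIGN maps down to ASCII "k"
    ("\u023A", "\u023A".lower()),  # A WITH STROKE maps down to U+2C65
]

for original, mapped in pairs:
    # The byte counts differ within every pair in this list.
    print(f"U+{ord(original):04X}: {utf8_len(original)} bytes -> "
          f"U+{ord(mapped):04X}: {utf8_len(mapped)} bytes")
```

If the server's case mapping behaves similarly for any of these, two strings that are equal case-insensitively could have different byte lengths, and a same-length shortcut would wrongly report them as unequal.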


Many thanks,

David

try.sql
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
