Re: [HACKERS] UPPER()/LOWER() and UTF-8

Alexey Mahotkin Sun, 09 Nov 2003 13:36:11 -0800

>>>>> "TL" == Tom Lane <[EMAIL PROTECTED]> writes:


    TL> upper/lower aren't going to work desirably in any
    TL> multi-byte character set encoding.

    >> Can you please point me at their implementation?  I do not
    >> understand why that's impossible.

    TL> Because they use <ctype.h>'s toupper() and tolower()
    TL> functions, which only work on single-byte characters.

Aha, that's in src/backend/utils/adt/formatting.c, right?

Yes, I see, it goes byte by byte and uses toupper().  I believe we
could look at the locale, and if it is UTF-8, then use (or copy)
e.g. g_utf8_strup/strdown, right?

     
http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html#g-utf8-strup
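
To make the byte-by-byte problem concrete, here is a minimal sketch (not the actual formatting.c code; bytewise_upper is a hypothetical name) of what such a loop does to a UTF-8 string. In the default "C" locale toupper() only maps a-z, so the two bytes that encode a non-ASCII letter pass through unchanged and the letter is never upper-cased. Under a single-byte locale each of those bytes could instead be mapped on its own, which would corrupt the sequence.

```c
#include <ctype.h>

/* Hypothetical illustration, not PostgreSQL source.
 * Upper-cases a NUL-terminated string in place, one byte at a time. */
void bytewise_upper(char *s)
{
    for (; *s != '\0'; s++)
    {
        /* Cast to unsigned char: passing a negative char to toupper()
         * is undefined behaviour. */
        *s = (char) toupper((unsigned char) *s);
    }
}

/* Usage sketch:
 *     char buf[] = "caf\xc3\xa9";     // "cafe" with e-acute, in UTF-8
 *     bytewise_upper(buf);
 * In the "C" locale the ASCII letters become "CAF", while the bytes
 * 0xC3 0xA9 are left as they were, so the accented letter stays
 * lower case. */
```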

I believe that patch could be written in a matter of hours.


    TL> There has been some discussion of using <wctype.h> where
    TL> available, but this has a number of issues, notably figuring
    TL> out the correct mapping from the server string encoding (eg
    TL> UTF-8) to unpacked wide characters.  At minimum we'd need to
    TL> know which charset the locale setting is expecting, and there
    TL> doesn't seem to be a portable way to find that out.
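
For reference, the <wctype.h> route mentioned above might look roughly like the sketch below. It uses only standard C calls (mbstowcs, towupper, wcstombs); wide_upper is a hypothetical helper name. The big assumption is exactly the issue raised in the quote: the caller must already have called setlocale(LC_CTYPE, ...) with a locale whose charset matches the encoding of the input string, and the sketch does nothing to verify that.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper, not PostgreSQL source.
 * Returns a malloc'd upper-cased copy of src, or NULL on failure.
 * ASSUMPTION: the current LC_CTYPE locale expects the same encoding
 * that src is stored in (e.g. a UTF-8 locale for UTF-8 data). */
char *wide_upper(const char *src)
{
    size_t   srclen = strlen(src);
    size_t   nwide, outsize, nbytes, i;
    wchar_t *wide;
    char    *out;

    /* A string never has more wide characters than bytes. */
    wide = malloc((srclen + 1) * sizeof(wchar_t));
    if (wide == NULL)
        return NULL;

    nwide = mbstowcs(wide, src, srclen + 1);
    if (nwide == (size_t) -1)       /* invalid sequence for this locale */
    {
        free(wide);
        return NULL;
    }

    for (i = 0; i < nwide; i++)
        wide[i] = (wchar_t) towupper((wint_t) wide[i]);

    /* Each wide character needs at most MB_CUR_MAX bytes. */
    outsize = nwide * MB_CUR_MAX + 1;
    out = malloc(outsize);
    if (out != NULL)
    {
        nbytes = wcstombs(out, wide, outsize);
        if (nbytes == (size_t) -1)
        {
            free(out);
            out = NULL;
        }
    }

    free(wide);
    return out;
}
```

If the locale and the server encoding disagree, mbstowcs() either fails or silently decodes the wrong characters, which is why knowing the locale's charset matters.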

    TL> IIRC, Peter thinks we must abandon use of libc's locale
    TL> functionality altogether and write our own locale layer before
    TL> we can really have all the locale-specific functionality we
    TL> want.

I believe that native Unicode strings (together with human-language
handling) should be introduced as an (almost) separate data type (one
that has nothing to do with locale), but maybe that's blue-sky thinking.

--alexm

