Re: [HACKERS] Unicode string literals versus the world

Sam Mason Thu, 16 Apr 2009 03:51:54 -0700

On Wed, Apr 15, 2009 at 11:19:42PM +0300, Marko Kreen wrote:
> On 4/15/09, Tom Lane <t...@sss.pgh.pa.us> wrote:
> > Given Martijn's complaint about more-than-16-bit code points, I think
> >  the \u proposal is not mature enough to go into 8.4.  We can think
> >  about some version of that later, if there's enough interest.
> 
> I think it would be a good idea. Basically we should pick one from
> a couple of pre-existing sane schemes.  Here is a quick summary
> of Python, Perl and Java:
> 
> Python [1]:
> 
>   \uXXXX         - 16-bit codepoint
>   \UXXXXXXXX     - 32-bit codepoint
>   \N{char-name}  - Character by name


Microsoft have also gone this way in C#[1], although named code points
are not supported there.
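
For reference, here is a small sketch of how the three Python forms look
in practice.  It assumes Python 3, where every string literal is a
Unicode string; the variable names are just for illustration.

```python
import unicodedata

# The three Python escape forms from the summary above, all naming U+03BB.
short_form = "\u03bb"                        # \uXXXX: exactly four hex digits
long_form = "\U000003bb"                     # \UXXXXXXXX: exactly eight hex digits
named_form = "\N{GREEK SMALL LETTER LAMDA}"  # \N{...}: Unicode character name

# A code point above U+FFFF does not fit in four hex digits,
# so it needs the eight-digit form.
g_clef = "\U0001d11e"  # MUSICAL SYMBOL G CLEF

print(short_form == long_form == named_form)
print(unicodedata.name(g_clef))
```

Note that the official Unicode name really is spelled "LAMDA", which is
one of the practical hazards of the by-name form.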

> Perl [2]:
> 
>   \x{XXXX..}     - {} contains hexadecimal codepoint
>   \N{char-name}  - Unicode char name

Looks OK, but the 'x' seems somewhat redundant.  Why not just:

  \{xxxx}

This would be following the BitC[2] project, especially if it was more
like:

  \{U+xxxx}

e.g.

  \{U+03BB}

would be the lowercase lambda character.  Added appeal is in the fact
that this (i.e. U+03BB) is how the Unicode consortium spells code
points.
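
To make the idea concrete, here is a rough sketch in Python of how such
an escape could be expanded.  The \{U+xxxx} syntax is only the
suggestion above, not anything PostgreSQL implements, and the function
name expand_escapes and the one-to-six hex digit limit are my own
assumptions.

```python
import re

# Hypothetical \{U+xxxx} escape as suggested above; one to six hex digits
# is an assumed limit, enough to reach the top code point U+10FFFF.
_ESCAPE = re.compile(r"\\\{U\+([0-9A-Fa-f]{1,6})\}")


def expand_escapes(text):
    """Replace each \\{U+xxxx} escape in text with the character it names."""
    def _replace(match):
        code_point = int(match.group(1), 16)
        # Reject values beyond the Unicode range and the surrogate block,
        # neither of which names a character.
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            raise ValueError("not a valid code point: U+%04X" % code_point)
        return chr(code_point)
    return _ESCAPE.sub(_replace, text)


print(expand_escapes(r"lambda is \{U+03BB}"))
```

Because the braces delimit the digits, the same form covers code points
above U+FFFF without needing a second escape letter.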

> Java [3]:
> 
>   \uXXXX         - 16-bit codepoint

AFAIK, Java isn't the best reference to choose; it assumed from an early
point in its design that Unicode characters were at most 16 bits wide and
hence later had to switch its internal representation to UTF-16.  I don't
program enough Java these days to know how it has all worked out, but it
would be interesting to hear from people who regularly have to deal with
characters outside the BMP (i.e. code points greater than 65535).
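
To show what a 16-bit-only \uXXXX escape implies for such characters,
here is a sketch (in Python, with a helper name of my own choosing) of
the standard UTF-16 arithmetic that splits a code point above 65535
into a surrogate pair.

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one code point."""
    if code_point < 0x10000:
        # Inside the BMP: a single code unit equal to the code point.
        return [code_point]
    # Outside the BMP: spread the 20 remaining bits over two surrogates.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
print([hex(unit) for unit in to_utf16_units(0x1D11E)])
```

With only a four-digit escape available, the writer has to work out and
spell both halves by hand, which is the sort of thing I'd hope we can
avoid.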

-- 
  Sam  http://samason.me.uk/

 [1] http://msdn.microsoft.com/en-us/library/aa664669(VS.71).aspx
 [2] http://www.bitc-lang.org/docs/bitc/spec.html#stringlit
