Re: AL32UTF8

2004-05-02 Thread Jarkko Hietaniemi

So the key question is... can we just do SvUTF8_on(sv) on either
of these kinds of Oracle UTF8 encodings? Seems like the answer is
yes, from what Jarkko says, because they are both valid UTF8.
We just need to document the issue.
 
 
 No, Oracle's UTF8 is very much not valid UTF-8. Valid UTF-8 cannot
 contain surrogates. If you mark a string like this as UTF-8 neither
 the Perl core nor other extension modules will be able to interpret
 it correctly.

Well, it depends what you mean by "interpret correctly"... they will
be perfectly fine as _separate_ characters.  But yes, they are pretty
useless -- the UTF-8 machinery of Perl 5 gets rather upset at seeing
these surrogate code points.  No croaks, true, as I said earlier, but
a lot of -w noise, and also deeper gurglings from e.g. the regex engine.
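
A rough Perl-level illustration of the "separate characters" point
(utf8::decode stands in here for the XS-level SvUTF8_on; a sketch,
not DBD::Oracle code):

    use strict;
    use warnings;
    no warnings 'utf8';   # silence the surrogate-related -w noise discussed above

    # U+10000 encoded the CESU-8 way: surrogate pair U+D800 U+DC00, 3 bytes each
    my $cesu = "\xED\xA0\x80\xED\xB0\x80";

    utf8::decode($cesu);  # flag the octets as (lax, Perl-style) UTF-8

    printf "chars=%d  codepoints=%vX\n", length($cesu), $cesu;
    # prints: chars=2  codepoints=D800.DC00
    # i.e. two separate surrogate code points, not the single U+10000
    # the data was meant to represent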

 (As people have pointed out earlier in the thread,
 if you want a standard name for this weird form of encoding, that's
 CESU: http://www.unicode.org/reports/tr26/.)
 
 You'll need to do a conversion pass before you can mark it as UTF-8.

I think an Encode translation table would be the best place to do this
kind of mapping.  Encode::CESU, anyone?
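
A sketch of what such a mapping might do (pure Perl rather than an
Encode table, and Encode::CESU itself is only hypothetical at this
point): collapse each surrogate pair back into the supplementary
character it encodes.

    use strict;
    use warnings;

    # Hypothetical helper -- not an existing module.
    # Takes CESU-8 octets, returns a proper Perl character string.
    sub decode_cesu8 {
        my ($octets) = @_;
        no warnings 'utf8';          # we knowingly handle surrogate code points
        my $chars = $octets;
        utf8::decode($chars)
            or die "input is not well-formed (lax) utf8";
        # collapse each high+low surrogate pair into the real code point
        $chars =~ s{ ([\x{D800}-\x{DBFF}]) ([\x{DC00}-\x{DFFF}]) }
                   { chr( 0x10000
                          + ((ord($1) - 0xD800) << 10)
                          +  (ord($2) - 0xDC00) ) }gex;
        return $chars;
    }

    # U+10000 as a CESU-8 six-byte sequence
    printf "%vX\n", decode_cesu8("\xED\xA0\x80\xED\xB0\x80");   # prints 10000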

-- 
Jarkko Hietaniemi [EMAIL PROTECTED] http://www.iki.fi/jhi/ There is this special
biologist word we use for 'stable'.  It is 'dead'. -- Jack Cohen


Re: AL32UTF8

2004-05-02 Thread Lincoln A. Baxter
On Sat, 2004-05-01 at 00:37, Lincoln A. Baxter wrote:
 On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:
  On Thu, Apr 29, 2004 at 10:42:18PM -0400, Lincoln A. Baxter wrote:
   On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
Am I right in thinking that perl's internal utf8 representation
represents surrogates as a single (4 byte) code point and not as
two separate code points?

This is the form that Oracle call AL32UTF8.

[snip]
  
  Were you using characters that require surrogates in UTF16?
  If not then you wouldn't see a difference between .UTF8 and .AL32UTF8.
 
 Hmmm...err.. probably not... I guess I need to hunt one up.

There is only one case in which 3 and 4 byte characters can be round
tripped.  After a bunch of other changes and fixups, I tested with the
following two new totally invented (by me) super wide characters:

row:   8: nice_string=\x{32263A}   byte_string=248|140|162|152|186 (3 byte wide char)
row:   9: nice_string=\x{2532263A} byte_string=252|165|140|162|152|186 (4 byte wide char)
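
(For reference, those byte_string values match what perl itself
produces for such code points; a quick sketch, not the actual test
script:)

    use strict;
    use warnings;
    no warnings 'utf8';            # code points above U+10FFFF are non-standard

    my $nice  = "\x{32263A}";      # row 8's invented super wide character
    my $bytes = $nice;
    utf8::encode($bytes);          # character string -> perl's extended utf8 octets
    print join('|', map { ord } split(//, $bytes)), "\n";   # 248|140|162|152|186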

In a database with ncharset=AL16UTF16, storage is as follows (NLS_NCHAR=UTF8 or AL32UTF8):

row 8: nch=Typ=1 Len=10: 255,253,255,253,255,253,255,253,255,253 
row 9: nch=Typ=1 Len=12: 255,253,255,253,255,253,255,253,255,253,255,253 

Values can NOT be round tripped.

In a database with ncharset=UTF8, storage is as follows (NLS_NCHAR=AL32UTF8):

row 8: nch=Typ=1 Len=15: 239,191,189,239,191,189,239,191,189,239,191,189,239,191,189
row 9: nch=Typ=1 Len=18: 239,191,189,239,191,189,239,191,189,239,191,189,239,191,189,239,191,189

Values can NOT be round tripped.

In a database with ncharset=UTF8 and NLS_NCHAR=UTF8, storage is as follows:

row 8: nch=Typ=1 Len=5: 248,140,162,152,186
row 9: nch=Typ=1 Len=6: 252,165,140,162,152,186

Values CAN be round tripped!

So, it would appear that UTF8 is the PREFERRED database NCHARSET, not AL16UTF16,
and that NLS_NCHAR=UTF8 is more portable than NLS_NCHAR=AL32UTF8.
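
(For anyone who wants to repeat the comparison: the client-side
setting is just an environment variable that OCI reads at connect
time. A minimal sketch, with a made-up DSN and credentials:)

    use strict;
    use warnings;
    use DBI;

    # OCI picks these up from the environment when the connection is made
    $ENV{NLS_LANG}  = 'AMERICAN_AMERICA.UTF8';   # client charset
    $ENV{NLS_NCHAR} = 'UTF8';                    # client national charset

    my $dbh = DBI->connect('dbi:Oracle:orcl', 'scott', 'tiger',
                           { RaiseError => 1, AutoCommit => 1 });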

[snip]
 Seems reasonable.  I think you made a good point about the cost of
 crawling through the data. I'm convinced. If you have not already
 changed it, I will. 
 
  p.s. If we do opt for defaulting NLS_NCHAR (effectively) if NLS_LANG
  and NLS_NCHAR are not defined then we should use AL32UTF8 if possible.
 
 I changed that last night (to use AL32UTF8).

But given the above results... perhaps I should change it back.

Lincoln




Re: AL32UTF8

2004-05-02 Thread Tim Bunce
On Sat, May 01, 2004 at 05:35:58PM -0400, Lincoln A. Baxter wrote:
 Hello Owen, 
 
 On Sat, 2004-05-01 at 16:46, Owen Taylor wrote:
  On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:
  
   You can use UTF8 and AL32UTF8 by setting NLS_LANG for OCI client
   applications. If you do not need supplementary characters, then it
   does not matter whether you choose UTF8 or AL32UTF8. However, if
   your OCI applications might handle supplementary characters, then
   you need to make a decision. Because UTF8 can require up to three
   bytes for each character, one supplementary character is represented
   in two code points, totalling six bytes. In AL32UTF8, one supplementary
   character is represented in one code point, totalling four bytes.
   
   So the key question is... can we just do SvUTF8_on(sv) on either
   of these kinds of Oracle UTF8 encodings? Seems like the answer is
   yes, from what Jarkko says, because they are both valid UTF8.
   We just need to document the issue.
  
  No, Oracle's UTF8 is very much not valid UTF-8. Valid UTF-8 cannot
  contain surrogates. If you mark a string like this as UTF-8 neither
  the Perl core nor other extension modules will be able to interpret
  it correctly.
  
  (As people have pointed out earlier in the thread,
  if you want a standard name for this weird form of encoding, that's
  CESU: http://www.unicode.org/reports/tr26/.)
  
  You'll need to do a conversion pass before you can mark it as UTF-8.
 
 Your message comes at a PERFECT time!
 
 I just spent about 3 hours coming to that same conclusion empirically:
 
 I made the changes to do what Tim had asked (just mark the string
 as UTF8), and it breaks a bunch of stuff, like the 8bit nchar test,
 and the long test when column type is LONG.
 
 I think I am going to back out (or rather... NOT COMMIT) those changes,
 leaving the code that inspects the fetched string to see if it (looks
 like) utf8 before setting the flag.
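
(The guard Lincoln describes lives in XS, but a Perl-level analogue of
the idea is roughly this -- $fetched here is hypothetical:)

    # only flip the UTF-8 flag if the fetched octets are well-formed utf8
    my $octets = $fetched;          # raw bytes from the database
    if (utf8::decode($octets)) {    # false if not well-formed (lax) utf8
        $fetched = $octets;         # now carries the UTF-8 flag
    }                               # else: leave $fetched as plain bytes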

I think we should always mark Oracle UTF8 strings as Perl UTF8.

Basically Oracle UTF8 is broken for non-BMP characters. Period.
So no one should be using the Oracle UTF8 character set for them.

It just needs a note in the docs.

Tim.


Re: AL32UTF8

2004-05-01 Thread Jungshik Shin
Tim Bunce wrote:
On Fri, Apr 30, 2004 at 10:58:19PM +0700, Martin Hosken wrote: 

IIRC AL32UTF8 was introduced at the behest of Oracle (a voting member of 
Unicode) because they were storing higher plane codes using the 
surrogate pair technique of UTF-16 mapped into UTF-8 (i.e. resulting in 
2 UTF-8 chars or 6 bytes) rather than the correct UTF-8 way of a single 
char of 4+ bytes. There is no real trouble doing it that way since 
anyone can convert between the 'wrong' UTF-8 and the correct form. But 
they found that if you do a simple binary based sort of a string in 
AL32UTF8 and compare it to a sort in true UTF-8 you end up with a subtly 
different order. On this basis they made a request to the UTC to have 
AL32UTF8 added as a kludge and out of the kindness of their hearts the 
UTC agreed thus saving Oracle from a whole heap of work. But all are 
agreed that UTF-8 and not AL32UTF8 is the way forward.


Um, now you've confused me.

The Oracle docs say "In AL32UTF8, one supplementary character is
represented in one code point, totalling four bytes", which you
say is the correct UTF-8 way. So the old Oracle ``UTF8'' charset
is what's now called CESU-8, and what Oracle calls ``AL32UTF8''
is the correct UTF-8 way.
 So did you mean CESU-8 when you said AL32UTF8?

I guess so.

Thank you for reminding me of this. I used to know that, but forgot it 
and was about to tell my colleague to use 'UTF8' (instead of 
'AL32UTF8') when she creates a database with Oracle for our project.

Oracle is notorious for using 'incorrect' and confusing character 
encoding names. Their 'AL32UTF8' is the true and only UTF-8 while 
__their__ 'UTF8' is CESU-8 (a beast that MUST be confined within Oracle 
and MUST NOT be leaked out to the world at large. Needless to say, it'd 
have been even better had it never been born.)

Oracle has no excuse whatsoever for failing to get their 'UTF8' right 
in the first place, because Unicode had been extended beyond the BMP a long 
time before they introduced UTF8 into their product(s) (let alone the 
fact that ISO 10646 had non-BMP planes from the very beginning in the 1980s, 
and that UTF-8 was devised to cover the full set of ISO 10646). However, 
they failed, and in their 'UTF8' a single character beyond the BMP was (and 
still is) encoded as a pair of 3-byte representations of surrogate code 
points. Apparently for the sake of backward compatibility (I wonder how 
many instances of Oracle databases existed with non-BMP characters 
stored in their 'UTF8' when they decided to follow this route), they 
decided to keep the designation 'UTF8' for CESU-8 and came up with a new 
designation 'AL32UTF8' for the true and only UTF-8.

Jungshik



Re: AL32UTF8

2004-04-30 Thread Tim Bunce
[The background to this is that Lincoln and I have been working on
Unicode support for DBD::Oracle. (Actually Lincoln's done most of
the heavy lifting, I've mostly been setting goals and directions
at the DBI API level and scratching at edge cases. Like this one.)]

On Thu, Apr 29, 2004 at 09:23:45PM +0300, Jarkko Hietaniemi wrote:
 Tim Bunce wrote:
 
  Am I right in thinking that perl's internal utf8 representation
  represents surrogates as a single (4 byte) code point and not as
  two separate code points?
 
 Mmmh.  Right and wrong... as a single code point, yes, since the real
 UTF-8 doesn't do surrogates which are only a UTF-16 thing.  4 bytes, no,
 3 bytes.
 
  This is the form that Oracle call AL32UTF8.
 
 Does this
 
 http://www.unicode.org/reports/tr26/
 
 look like Oracle's older (?) UTF8?

CESU-8 defines an encoding scheme for Unicode identical to UTF-8
except for its representation of supplementary characters. In CESU-8,
supplementary characters are represented as six-byte sequences
resulting from the transformation of each UTF-16 surrogate code
unit into an eight-bit form similar to the UTF-8 transformation, but
without first converting the input surrogate pairs to a scalar value.
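
(Worked example, for concreteness: U+10000 is the surrogate pair
D800 DC00 in UTF-16. CESU-8 encodes each surrogate as its own
three-byte sequence, giving ED A0 80 ED B0 80 -- six bytes -- whereas
real UTF-8 encodes the character directly as the four bytes F0 90 80 80.)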

Yes, that sounds like it.  But see my quote from Oracle docs in my
reply to Lincoln's email to make sure.

(I presume it dates from before UTF16 had surrogate pairs. When
they were added to UTF16, they gave the name CESU-8 to what old UTF16
to UTF8 conversion code would produce when given surrogate pairs.
A classic standards maneuver :)

  What would be the effect of setting SvUTF8_on(sv) on a valid utf8
  byte string that used surrogates? Would there be problems?
 
 You would get out the surrogate code points from the sv, not the
 supplementary plane code point the surrogate pairs are encoding.
 Depends what you do with the data: this might be okay, might not.
 Since it's valid UTF-8, nothing should croak perl-side.

Okay. Thanks.

Basically I need to document that Oracle AL32UTF8 should be used
as the client charset in preference to the older UTF8 because
UTF8 doesn't do the "best"? thing with surrogate pairs.

Seems like "best" is the, er, best word to use here as "right"
would be too strong. But then the shortest form requirement
is quite strong, so perhaps "modern standard" would be the right words.

Tim.


Re: AL32UTF8

2004-04-30 Thread Martin Hosken
Dear Tim,

CESU-8 defines an encoding scheme for Unicode identical to UTF-8
except for its representation of supplementary characters. In CESU-8,
supplementary characters are represented as six-byte sequences
resulting from the transformation of each UTF-16 surrogate code
unit into an eight-bit form similar to the UTF-8 transformation, but
without first converting the input surrogate pairs to a scalar value.
Yes, that sounds like it.  But see my quote from Oracle docs in my
reply to Lincoln's email to make sure.
(I presume it dates from before UTF16 had surrogate pairs. When
they were added to UTF16, they gave the name CESU-8 to what old UTF16
to UTF8 conversion code would produce when given surrogate pairs.
A classic standards maneuver :)
IIRC AL32UTF8 was introduced at the behest of Oracle (a voting member of 
Unicode) because they were storing higher plane codes using the 
surrogate pair technique of UTF-16 mapped into UTF-8 (i.e. resulting in 
2 UTF-8 chars or 6 bytes) rather than the correct UTF-8 way of a single 
char of 4+ bytes. There is no real trouble doing it that way since 
anyone can convert between the 'wrong' UTF-8 and the correct form. But 
they found that if you do a simple binary based sort of a string in 
AL32UTF8 and compare it to a sort in true UTF-8 you end up with a subtly 
different order. On this basis they made a request to the UTC to have 
AL32UTF8 added as a kludge and out of the kindness of their hearts the 
UTC agreed thus saving Oracle from a whole heap of work. But all are 
agreed that UTF-8 and not AL32UTF8 is the way forward.

Yours,
Martin


Re: AL32UTF8

2004-04-30 Thread Tim Bunce
On Fri, Apr 30, 2004 at 03:49:13PM +0300, Jarkko Hietaniemi wrote:
  
  Okay. Thanks.
  
  Basically I need to document that Oracle AL32UTF8 should be used
  as the client charset in preference to the older UTF8 because
  UTF8 doesn't do the "best"? thing with surrogate pairs.
 
 because what Oracle calls UTF8 is not conformant with the modern
 definition of UTF8

Thanks Jarkko.

Tim.

  Seems like "best" is the, er, best word to use here as "right"
  would be too strong. But then the shortest form requirement
  is quite strong, so perhaps "modern standard" would be the right words.
  
  Tim.
 
 
 -- 
 Jarkko Hietaniemi [EMAIL PROTECTED] http://www.iki.fi/jhi/ There is this special
 biologist word we use for 'stable'.  It is 'dead'. -- Jack Cohen


Re: AL32UTF8

2004-04-29 Thread Jarkko Hietaniemi
Tim Bunce wrote:

 Am I right in thinking that perl's internal utf8 representation
 represents surrogates as a single (4 byte) code point and not as
 two separate code points?

Mmmh.  Right and wrong... as a single code point, yes, since the real
UTF-8 doesn't do surrogates which are only a UTF-16 thing.  4 bytes, no,
3 bytes.

 This is the form that Oracle call AL32UTF8.

Does this

http://www.unicode.org/reports/tr26/

look like Oracle's older (?) UTF8?

 What would be the effect of setting SvUTF8_on(sv) on a valid utf8
 byte string that used surrogates? Would there be problems?

You would get out the surrogate code points from the sv, not the
supplementary plane code point the surrogate pairs are encoding.
Depends what you do with the data: this might be okay, might not.
Since it's valid UTF-8, nothing should croak perl-side.

 (For example, a string returned from Oracle when using the UTF8
 character set instead of the newer AL32UTF8 one.)
 
 Tim.