Re: AL32UTF8

2004-05-02 Thread Jarkko Hietaniemi

  So the key question is... can we just do SvUTF8_on(sv) on either
  of these kinds of Oracle UTF8 encodings? Seems like the answer is
  yes, from what Jarkko says, because they are both valid UTF8.
  We just need to document the issue.
 
 
 No, Oracle's UTF8 is very much not valid UTF-8. Valid UTF-8 cannot
 contain surrogates. If you mark a string like this as UTF-8 neither
 the Perl core nor other extension modules will be able to interpret
 it correctly.

Well, it depends on what you mean by interpret correctly... they will
be perfectly fine as _separate_ characters.  But yes, they are pretty
useless -- the UTF-8 machinery of Perl 5 gets rather upset at seeing
these surrogate code points.  No croaks, true, as I said earlier, but
a lot of -w noise, and also deeper gurglings from e.g. the regex engine.
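
For instance (perl 5.8 here; the exact wording of the warning varies
by version):

    $ perl -we 'binmode STDOUT, ":utf8"; print chr(0xD800)'
    UTF-16 surrogate 0xd800 at -e line 1.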

 (As people have pointed out earlier in the thread,
 if you want a standard name for this weird form of encoding, that's
 CESU: http://www.unicode.org/reports/tr26/.)
 
 You'll need to do a conversion pass before you can mark it as UTF-8.

I think an Encode translation table would be the best place to do this
kind of mapping.  Encode::CESU, anyone?
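
Until someone writes it, the conversion pass itself is small.  An
untested sketch at the octet level (CESU-8 encodes each half of the
UTF-16 surrogate pair as its own 3-byte sequence, so e.g. U+10400
comes out as ED A0 81 ED B0 80 instead of the real UTF-8 F0 90 90 80):

    # Fold CESU-8 surrogate pairs back into 4-byte UTF-8 sequences.
    # Takes and returns raw octets (SvUTF8 flag off); lone surrogates
    # are left untouched.
    sub cesu8_to_utf8 {
        my ($bytes) = @_;
        $bytes =~ s{
            \xED ([\xA0-\xAF]) ([\x80-\xBF])   # high surrogate U+D800..U+DBFF
            \xED ([\xB0-\xBF]) ([\x80-\xBF])   # low surrogate  U+DC00..U+DFFF
        }{
            my $hi = 0xD800 | ((ord($1) & 0x0F) << 6) | (ord($2) & 0x3F);
            my $lo = 0xDC00 | ((ord($3) & 0x0F) << 6) | (ord($4) & 0x3F);
            my $cp = 0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00);
            pack 'C4',                         # re-encode as real UTF-8
                0xF0 |  ($cp >> 18),
                0x80 | (($cp >> 12) & 0x3F),
                0x80 | (($cp >>  6) & 0x3F),
                0x80 |  ($cp        & 0x3F);
        }gex;
        return $bytes;    # now valid UTF-8, safe to SvUTF8_on
    }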

-- 
Jarkko Hietaniemi [EMAIL PROTECTED] http://www.iki.fi/jhi/
There is this special biologist word we use for 'stable'.
It is 'dead'. -- Jack Cohen


Re: AL32UTF8

2004-05-02 Thread Lincoln A. Baxter
On Sat, 2004-05-01 at 00:37, Lincoln A. Baxter wrote:
 On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:
  On Thu, Apr 29, 2004 at 10:42:18PM -0400, Lincoln A. Baxter wrote:
   On Thu, 2004-04-29 at 11:16, Tim Bunce wrote:
    Am I right in thinking that perl's internal utf8 representation
    represents a character that needs a UTF-16 surrogate pair as a
    single (4 byte) code point, and not as two separate code points?

    This is the form that Oracle calls AL32UTF8.

[snip]
  
  Were you using characters that require surrogates in UTF16?
  If not then you wouldn't see a difference between .UTF8 and .AL32UTF8.
 
 Hmmm...err.. probably not... I guess I need to hunt one up.

There is only one configuration in which 3- and 4-byte-wide characters
can be round tripped.  After a bunch of other changes and fixups, I
tested with the following two totally invented (by me) super-wide
characters:

row:   8: nice_string=\x{32263A}   byte_string=248|140|162|152|186 (3-byte-wide char)
row:   9: nice_string=\x{2532263A} byte_string=252|165|140|162|152|186 (4-byte-wide char)
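
(For the record, those byte strings come straight out of perl's lax
"utf8" encoding, which is happy to encode code points beyond U+10FFFF.
An untested sketch that reproduces the byte_string columns:

    use Encode qw(encode);
    no warnings 'utf8';     # chr() above U+10FFFF warns otherwise
    for my $cp (0x32263A, 0x2532263A) {
        my $bytes = encode('utf8', chr($cp));   # lax utf8, not strict UTF-8
        printf "\\x{%X} => %s\n", $cp, join '|', map ord, split //, $bytes;
    }
)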

In a database with ncharset=AL16UTF16 (NLS_NCHAR=UTF8 or AL32UTF8),
storage is as follows:

row 8: nch=Typ=1 Len=10: 255,253,255,253,255,253,255,253,255,253
row 9: nch=Typ=1 Len=12: 255,253,255,253,255,253,255,253,255,253,255,253

Values can NOT be round tripped: every character came back as the
UTF-16 replacement character U+FFFD (255,253).

In a database with ncharset=UTF8 (NLS_NCHAR=AL32UTF8), storage is as follows:

row 8: nch=Typ=1 Len=15: 239,191,189,239,191,189,239,191,189,239,191,189,239,191,189
row 9: nch=Typ=1 Len=18: 239,191,189,239,191,189,239,191,189,239,191,189,239,191,189,239,191,189

Values can NOT be round tripped: again U+FFFD for every character,
this time encoded as UTF-8 (239,191,189).

In a database with ncharset=UTF8 and NLS_NCHAR=UTF8, storage is as follows:

row 8: nch=Typ=1 Len=5: 248,140,162,152,186
row 9: nch=Typ=1 Len=6: 252,165,140,162,152,186

Values CAN be round tripped!

So it would appear that UTF8 is the PREFERRED database NCHARSET, not
AL16UTF16, and that NLS_NCHAR=UTF8 is more portable than
NLS_NCHAR=AL32UTF8.
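
(In DBD::Oracle terms that would mean something like this before
connecting -- a sketch with a made-up DSN; NLS_NCHAR has to be in the
environment before the first OCI call:

    use DBI;
    $ENV{NLS_NCHAR} = 'UTF8';     # per the results above, not AL32UTF8
    my $dbh = DBI->connect('dbi:Oracle:orcl', $user, $pass);
)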

[snip]
 Seems reasonable.  I think you made a good point about the cost of
 crawling through the data. I'm convinced. If you have not already
 changed it, I will. 
 
  p.s. If we do opt for defaulting NLS_NCHAR (effectively) if NLS_LANG
  and NLS_NCHAR are not defined then we should use AL32UTF8 if possible.
 
 I changed that last night (to use AL32UTF8).

But given the above results... perhaps I should change it back.

Lincoln


Re: AL32UTF8

2004-05-02 Thread Tim Bunce
On Sat, May 01, 2004 at 05:35:58PM -0400, Lincoln A. Baxter wrote:
 Hello Owen, 
 
 On Sat, 2004-05-01 at 16:46, Owen Taylor wrote:
  On Fri, 2004-04-30 at 08:03, Tim Bunce wrote:
  
   You can use UTF8 and AL32UTF8 by setting NLS_LANG for OCI client
   applications. If you do not need supplementary characters, then it
   does not matter whether you choose UTF8 or AL32UTF8. However, if
   your OCI applications might handle supplementary characters, then
   you need to make a decision. Because UTF8 can require up to three
   bytes for each character, one supplementary character is represented
   in two code points, totalling six bytes. In AL32UTF8, one supplementary
   character is represented in one code point, totalling four bytes.
   
   So the key question is... can we just do SvUTF8_on(sv) on either
   of these kinds of Oracle UTF8 encodings? Seems like the answer is
   yes, from what Jarkko says, because they are both valid UTF8.
   We just need to document the issue.
  
  No, Oracle's UTF8 is very much not valid UTF-8. Valid UTF-8 cannot
  contain surrogates. If you mark a string like this as UTF-8 neither
  the Perl core nor other extension modules will be able to interpret
  it correctly.
  
  (As people have pointed out earlier in the thread,
  if you want a standard name for this weird form of encoding, that's
  CESU: http://www.unicode.org/reports/tr26/.)
  
  You'll need to do a conversion pass before you can mark it as UTF-8.
 
 Your message comes at a PERFECT time!
 
 I just spent about 3 hours coming to that same conclusion empirically:
 
 I made the changes to do what Tim had asked (just mark the string
 as UTF8), and it breaks a bunch of stuff, like the 8bit nchar test,
 and the long test when the column type is LONG.
 
 I think I am going to back out (or rather... NOT COMMIT) those
 changes, leaving the code that inspects the fetched string to see if
 it looks like UTF-8 before setting the flag.

I think we should always mark Oracle UTF8 strings as Perl UTF8.

Basically Oracle UTF8 is broken for non-BMP characters. Period.
So no one should be using the Oracle UTF8 character set for them.

It just needs a note in the docs.
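
At the Perl level the two approaches look roughly like this (a sketch
only -- the real code is XS, and $fetched/$client_cs are made-up names):

    use Encode ();

    # What's in the tree now: crawl the bytes, flag the string only if
    # they happen to be valid UTF-8 (XS equivalent: is_utf8_string()).
    utf8::decode($fetched);

    # What I'm suggesting: trust the negotiated client charset and
    # just flip the flag (XS equivalent: SvUTF8_on(sv)).
    Encode::_utf8_on($fetched) if $client_cs =~ /^(?:UTF8|AL32UTF8)$/;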

Tim.