Re: [OT] Unicode-compatible SQL?

2001-02-05 Thread J M Sykes

I have heard a rumour (i.e. my source is not involved in the reported
activity) that:

quote
SAP, PeopleSoft, Siebel, Oracle and others are actually
in the process of proposing a new format of UTF that will cause a UTF-16
surrogate pair to become two 3-byte UTF-8 codepoints so that UTF-8 will
have the same behaviour as UTF-16, that is, a surrogate will be two UTF-8
code points.
/quote

Can anyone corroborate this, and, if it's true, offer an opinion on it?

I may add that, as some of you already know, a small group in the UK (which
includes me) is working on a proposal intended to improve the SQL standard
specification with regard to the treatment of Unicode data by an
SQL-implementation.

The competent bodies are ISO/IEC SC 32/WG 3, ANSI NCITS H2, BSI IST/40 and
other national bodies.

We expect that most of the parties most interested, principally SQL
implementors, are already represented either directly or indirectly on one
or more competent bodies. But if anyone else is interested, please feel free
to download the current, incomplete, provisional draft of the proposal from:

ftp://jerry.ece.umassd.edu/pub/SC32/WG3/TEMPdocs

where the files containing two different versions are jms01v6 and jms01v7
each of which is in both w97.doc and .pdf format.

All comments will be seriously considered.

Mike Sykes

***

J M Sykes  Email: [EMAIL PROTECTED]
97 Oakdale Drive
Heald Green
CHEADLE
Cheshire   SK8 3SN
UK         Tel: (44) 161 437 5413

***





Re: [OT] Unicode-compatible SQL?

2001-02-05 Thread Michael \(michka\) Kaplan

Using UTF-8 to handle characters in the supplementary planes by way of using
two separate code points in the surrogate range is NOT considered
acceptable.

Currently it is legal to interpret them but *not* to generate them (multiple
refs on the Unicode site). Therefore, I hope you are mistaken about the
rumor since this would be a Bad Thing (tm).

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/








Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)

2001-02-05 Thread DougEwell2

In a message dated 2001-02-05 5:19:59 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

   I have heard a rumour (i.e. my source is not involved in the reported
   activity) that:
  
   quote
   SAP, PeopleSoft, Siebel, Oracle and others are actually
   in the process of proposing a new format of UTF that will cause a UTF-16
   surrogate pair to become two 3-byte UTF-8 codepoints so that UTF-8 will
   have the same behaviour as UTF-16, that is, a surrogate will be two UTF-8
   code points.
   /quote
  
   Can anyone corroborate this, and, if it's true, offer an opinion on it?

  Using UTF-8 to handle characters in the supplementary planes by way of 
  using two separate code points in the surrogate range is NOT considered
  acceptable.
  
  Currently it is legal to interpret them but *not* to generate them
  (multiple refs on the Unicode site). Therefore, I hope you are mistaken
  about the rumor since this would be a Bad Thing (tm).

This is laziness, intended to get around the "problem" of supplementary code 
points instead of handling them like any other code points.  This reminds me 
of the Java bastardization of UTF-8, in which U+0000 is encoded 0xC0 0x80 so 
that no character string will ever contain the byte 0x00.  (Nobody has ever 
explained to me why a character string would contain U+0000 in the first 
place.)

I have argued in the past that in some cases, semi-conformant Unicode 
implementations might be better than non-Unicode solutions.  But creating a 
new UTF to get around your product's lack of real Unicode support *and then 
expecting others to use your hack* is a different matter entirely.  Just bite 
the bullet and support UTF-8.  It's not that hard.

-Doug Ewell
 Fullerton, California



Re: [OT] Unicode-compatible SQL?

2001-02-05 Thread Mark Davis

The topic came up in a UTC meeting some time ago, a "UTF-8S". The motivation
was for performance (having a form that reproduces the binary order of
UTF-16). We have yet to see a formal proposal for this, though.
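The binary-order point can be illustrated with a small Python sketch. It is only an illustration under assumptions: `utf8_per_utf16_unit` is a hypothetical helper that encodes each UTF-16 code unit separately, as the rumoured format is described in this thread, and is not taken from any formal proposal.

```python
def utf8_per_utf16_unit(text: str) -> bytes:
    """Encode each UTF-16 code unit of `text` as its own UTF-8-style sequence.

    Hypothetical helper: a supplementary character becomes two 3-byte
    sequences, one per surrogate, instead of one 4-byte sequence.
    """
    utf16 = text.encode("utf-16-be")
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    # "surrogatepass" makes Python emit the 3-byte form for a lone surrogate,
    # which the strict UTF-8 encoder refuses to do.
    return "".join(map(chr, units)).encode("utf-8", "surrogatepass")


bmp = "\uff5e"       # U+FF5E, a BMP character above the surrogate range
supp = "\U00010000"  # first supplementary character

# UTF-16 binary order: the surrogate pair D800 DC00 sorts before FF5E.
print(supp.encode("utf-16-be") < bmp.encode("utf-16-be"))    # True
# Standard UTF-8: F0 90 80 80 sorts after EF BD 9E, so the order flips.
print(supp.encode("utf-8") < bmp.encode("utf-8"))            # False
# Per-code-unit form: ED A0 80 ED B0 80 sorts before EF BD 9E again.
print(utf8_per_utf16_unit(supp) < utf8_per_utf16_unit(bmp))  # True
```

The only characters whose relative order differs between the two forms are supplementary characters compared against U+E000..U+FFFF.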

Mark







Macintosh OS8.6, OS9

2001-02-05 Thread P. T. Rourke

A communication with someone offlist (though I think he is on the list)
suggested that Unicode is not supported at all in Macintosh OS8.6 or OS9,
not even to the degree that it is supported in Windows 9x, except by means
of Windows emulation (if I'm characterizing the message correctly; it is on
another computer at the moment).  Is this true? (Obviously I don't use Mac
OS much).  I question this statement because the last time I used a Mac with
OS8.6 (I don't have many opportunities, I'm afraid), Netscape 4 did include
the UTF-8 encoding among the permitted encodings; but as it wasn't my
computer, I didn't install a Unicode font to test it.

PTR





Re: Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)

2001-02-05 Thread John O'Conner

Within a String, the encoding of char values is practically irrelevant. It is a
hidden encoding that is never exposed to the user...or developer. When you access
String char values, you use an index to 16-bit Unicode values. To my knowledge,
Sun does not claim that its internal encoding of String is UTF-8 in any of its API
documentation.

Any component or converter that claims to produce a UTF-8 encoding should not
behave as you describe. For example, Java's UTF-8 converter does not encode U+0000
as 0xC0 0x80. If it ever does, please file a bug.

Regards,
John O'Conner

[EMAIL PROTECTED] wrote:

 This is laziness, intended to get around the "problem" of supplementary code
 points instead of handling them like any other code points.  This reminds me
 of the Java bastardization of UTF-8, in which U+0000 is encoded 0xC0 0x80 so
 that no character string will ever contain the byte 0x00.  (Nobody has ever
 explained to me why a character string would contain U+0000 in the first
 place.)




Re: Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)

2001-02-05 Thread John Cowan

John O'Conner wrote:

 Within a String, the encoding of char values is practically irrelevant. It is a
 hidden encoding that is never exposed to the user...or developer. When you access
 String char values, you use an index to 16-bit Unicode values. To my knowledge,
 Sun does not claim that its internal encoding of String is UTF-8 in any of its API
 documentation.

The internal encoding is exposed by the regrettably named readUTF and
writeUTF methods of java.io.Data{Input,Output}Stream, which should have
been named readString and writeString.  People have assumed that they
are general-purpose UTF-8 read/write functions.

At one point, this was a FAQ on this list.


-- 
There is / one art || John Cowan [EMAIL PROTECTED]
no more / no less  || http://www.reutershealth.com
to do / all things || http://www.ccil.org/~cowan
with art- / lessness   \\ -- Piet Hein




Re: Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)

2001-02-05 Thread Tex Texin

John,
It does impact developers.

The API for DataInputStream defines FSS_UTF, which includes the funky
null behavior.

http://java.sun.com/products/jdk/1.2/docs/api/java/io/DataInputStream.html

Since this API and others use this UTF, it gets into file formats, and
applications end up supporting it.

tex


-- 
According to Murphy, nothing goes according to Hoyle.
--
Tex Texin                    Director, International Business
mailto:[EMAIL PROTECTED]     +1-781-280-4271  Fax: +1-781-280-4655
Progress Software Corp.      14 Oak Park, Bedford, MA 01730

http://www.Progress.com      #1 Embedded Database

Globalization Program
http://www.Progress.com/partners/globalization.htm
---



FW: Question on IBM's Unicode enabled product Support.

2001-02-05 Thread Magda Danish (Unicode)



-Original Message-
From: William Palmer [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 01, 2001 3:45 PM
To: '[EMAIL PROTECTED]'
Subject: Question on Unicode enabled product Support.


Hello...!

Question:

Do the IBM OS/390 Ver 2.6, 2.7 or 2.9 MVS/ESA operating systems support
Unicode?

Is the IBM RACF Security Facility Unicode-enabled?

And is IBM COBOL for MVS Unicode-enabled?

Thank you,



Bill Palmer
Enterprise Server Verification
Access360
Phone (949) 450-6589
Fax (949) 585-0198
www.access360.com
 
access360
A Better Way to Manage Access Rights




Re: Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)

2001-02-05 Thread John O'Conner

Perhaps the methods readUTF and writeUTF should be deprecated in favor of
read/writeString. I will submit an RFE (request for enhancement) for this.

I noticed that the Data{Input,Output} interface clearly says that
write/readUTF handles a "Java modified UTF-8". However, the actual javadoc in
DataOutputStream says that writeUTF writes the String as UTF-8. Also, the doc for
UTFDataFormatException is confusing on the issue, saying UTF-8 in one place and
"modified UTF-8" in the doc for DataInputStream.

That's 1 RFE for better method names and 2 bugs in the API documentation! I'll
submit all 3...if they don't already exist in the db.

Regards,
John O'Conner



John Cowan wrote:

 The internal encoding is exposed by the regrettably named readUTF and
 writeUTF methods of java.io.Data{Input,Output}Stream, which should have
 been named readString and writeString.  People have assumed that they
 are general-purpose UTF-8 read/write functions.





Re: Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)

2001-02-05 Thread Tex Texin

John,

I am not clear from your comments which is the bug, since the doc
goes both ways. Are the doc bugs that they say 
it is UTF-8, or that they say it is modified UTF-8?

It would be great to learn that the functions are actually unmodified
UTF-8, as I know of some interfaces that are writing non-Java
code and are forced to deal with specialized handling of the modified
UTF-8.
It would be great to inform them they can use standard UTF-8 library
routines.
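For non-Java code that has to read such data, the specialized handling might look like the Python sketch below. It assumes the format differs from standard UTF-8 in three ways: a 2-byte big-endian byte-count prefix, U+0000 stored as 0xC0 0x80, and supplementary characters stored as two 3-byte surrogate sequences. `read_java_utf` is a hypothetical helper name, not part of any library.

```python
import struct


def read_java_utf(data: bytes) -> str:
    """Decode one length-prefixed "modified UTF-8" string (sketch only)."""
    if len(data) < 2:
        raise ValueError("missing 2-byte length prefix")
    (length,) = struct.unpack(">H", data[:2])
    body = data[2:2 + length]
    if len(body) != length:
        raise ValueError("truncated string body")
    # 0xC0 is never a continuation byte, so in well-formed input the pair
    # C0 80 can only be the two-byte form of U+0000.
    body = body.replace(b"\xc0\x80", b"\x00")
    # Let 3-byte surrogate sequences through as lone surrogates ...
    halves = body.decode("utf-8", "surrogatepass")
    # ... then join each high/low pair by round-tripping through UTF-16.
    return halves.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


print(read_java_utf(b"\x00\x03A\xc0\x80") == "A\x00")
print(read_java_utf(b"\x00\x06\xed\xa0\x80\xed\xb0\x80") == "\U00010000")
```

Malformed input (for example an unpaired surrogate) raises `UnicodeDecodeError` rather than being silently repaired.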

tex




-- 
According to Murphy, nothing goes according to Hoyle.
--
Tex Texin                    Director, International Business
mailto:[EMAIL PROTECTED]     +1-781-280-4271  Fax: +1-781-280-4655
Progress Software Corp.      14 Oak Park, Bedford, MA 01730

http://www.Progress.com      #1 Embedded Database

Globalization Program
http://www.Progress.com/partners/globalization.htm
---



Re: Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)

2001-02-05 Thread John Cowan

Tex Texin wrote:


 I am not clear from your comments which is the bug, since the doc
 goes both ways. Are the doc bugs that they say 
 it is UTF-8, or that they say it is modified UTF-8?

It uses modified UTF-8, modified in three ways:

1) U+0000 is encoded in two bytes as 0xc0 0x80;

2) values above U+FFFF are encoded in six bytes as the UTF-8 encoding
of their UTF-16 equivalent form;

3) the whole string is prefixed with a byte count represented
as a 2-byte big-endian binary integer.
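The three modifications can be sketched as a small Python encoder. This is an illustration of the format as listed above, not Sun's implementation; `write_java_utf` is a hypothetical name, and raising `ValueError` for strings that overflow the 2-byte count is an assumption of this sketch.

```python
import struct


def write_java_utf(text: str) -> bytes:
    """Encode `text` in the modified form described above (sketch only)."""
    # Work on UTF-16 code units so supplementary characters arrive as
    # surrogate pairs; "surrogatepass" also tolerates unpaired surrogates.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        if unit == 0:
            out += b"\xc0\x80"  # 1) U+0000 as two bytes
        else:
            # 2) each surrogate becomes its own 3-byte sequence, so a
            #    supplementary character takes six bytes in total
            out += chr(unit).encode("utf-8", "surrogatepass")
    if len(out) > 0xFFFF:
        raise ValueError("encoded form does not fit a 2-byte count")
    # 3) byte count as a 2-byte big-endian integer, then the bytes
    return struct.pack(">H", len(out)) + bytes(out)


print(write_java_utf("A\x00").hex())       # 000341c080
print(write_java_utf("\U00010000").hex())  # 0006eda080edb080
```

For strings that contain neither U+0000 nor supplementary characters, the bytes after the count are identical to standard UTF-8.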


 It would be great to learn that the functions are actually unmodified
 UTF-8, as I know of some interfaces that are writing non-Java
 code and are forced to deal with specialized handling of the modified
 UTF-8.
 It would be great to inform them they can use standard UTF-8 library
 routines.

*chomp* No such luck Doc!

-- 
There is / one art || John Cowan [EMAIL PROTECTED]
no more / no less  || http://www.reutershealth.com
to do / all things || http://www.ccil.org/~cowan
with art- / lessness   \\ -- Piet Hein




Re: Macintosh OS8.6, OS9

2001-02-05 Thread Sebastian Hagedorn

-- "P. T. Rourke" [EMAIL PROTECTED] is rumored to have mumbled on Monday, 
5 February 2001, 8:47 -0800 regarding Macintosh OS8.6, OS9:

 A communication with someone offlist (though I think he is on the list)
 suggested that Unicode is not supported at all in Macintosh OS8.6 or OS9,

It's not as simple as that.

 not even to the degree that it is supported in Windows 9x, except by means
 of Windows emulation (if I'm characterizing the message correctly; it is
 on another computer at the moment).  Is this true? (Obviously I don't use
 Mac OS much).

To some extent. See below.

 I question this statement because the last time I used a
 Mac with OS8.6 (I don't have many opportunities, I'm afraid), Netscape 4
 did include the UTF-8 encoding among the permitted encodings; but as it
 wasn't my computer, I didn't install a Unicode font to test it.

A Unicode font wouldn't have helped you there. The "old" way to provide UTF-8 
compatibility requires Apple's Text Encoding Converter, which maps characters to 
*different* fonts. So you cannot use just one Unicode font; instead you need to 
install all kinds of fonts, and even then you won't be able to view all Unicode 
characters, because some blocks don't have a corresponding representation in 
Apple's old script system. One example of this is the Extended Greek block. 
AFAIK you cannot get that with the TEC.

However, 8.5 and above actually contain ATSUI, the Apple Type Services for 
Unicode Imaging. It can read all kinds of font formats, including fonts like 
Microsoft's Arial Unicode. But that's only one half of the equation. This fine 
API doesn't do you any good without applications that support it. Unfortunately 
the only mainstream application that seems to do so (I don't know for sure) is 
Adobe InDesign 1.5.

Mac OS 9.1 comes with a small word processor called WorldText that supports 
ATSUI... it's a first step. Here's hoping for the Carbon (or Cocoa?) release of 
Microsoft Office...

Cheers, Sebastian
--
Sebastian Hagedorn
Ehrenfeldgürtel 156, 50823 Köln, Germany
http://www.spinfo.uni-koeln.de/~hgd/

taz muss sein. [EMAIL PROTECTED] http://www.taz.de



Re: [OT] Unicode-compatible SQL?

2001-02-05 Thread Marcin 'Qrczak' Kowalczyk

Mon, 5 Feb 2001 08:20:43 -0800 (GMT-0800), Mark Davis [EMAIL PROTECTED] wrote:

 The topic came up in a UTC meeting some time ago, a "UTF-8S". The
 motivation was for performance (having a form that reproduces the
 binary order of UTF-16).

This is unfair: it slows down the conversion UTF-8 <-> UTF-32.

In both cases the speed difference is almost none, and it's a big
portability problem. I hope that such trash will not be accepted.
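The extra work on the UTF-32 side is the surrogate arithmetic that a standard UTF-8 decoder never needs. A minimal Python sketch of that step, with `combine_surrogates` as a hypothetical helper name:

```python
def combine_surrogates(high: int, low: int) -> int:
    """Join a UTF-16 surrogate pair into one code point (UTF-32 value)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    # Each surrogate carries 10 bits of the value above U+FFFF.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(hex(combine_surrogates(0xD800, 0xDC00)))  # 0x10000
print(hex(combine_surrogates(0xDBFF, 0xDFFF)))  # 0x10ffff
```

A converter for the surrogate-based form would have to decode two 3-byte sequences and then apply this step, where standard UTF-8 yields the code point directly from one 4-byte sequence.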

-- 
 __("  Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
 \__/
  ^^  SYGNATURA ZASTĘPCZA
QRCZAK




Re: Macintosh OS8.6, OS9

2001-02-05 Thread Patrick T. Rourke

Thanks to everyone who responded, especially Mr. Hagedorn; as it is
precisely the Extended Greek, Basic Greek, and Combining diacriticals blocks
that interest me, this was very important information.

Patrick Rourke

 Netscape, Internet Explorer and iCab (another browser for Mac OS)
 use UTF-8 (Internet Explorer and iCab via the TEC).

 In Mac OS 9.1, with an application which supports the Multilingual Text
 Engine, you can even use a Windows TrueType font like Cyberbit
 without converting it to Macintosh format.





Re: [OT] Unicode-compatible SQL?

2001-02-05 Thread Michael \(michka\) Kaplan

Unfortunately, the issue at this point is that some companies have either
already accepted it or are in the process of accepting it now.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/







Re: Macintosh OS8.6, OS9

2001-02-05 Thread Bertrand Laidain

Mac OS did support Unicode in 8.6 and 9.x; see

Apple Type Services for Unicode Imaging Reference
http://developer.apple.com/techpubs/macos8/TextIntlSvcs/ATSUI/ATSUI_ref/index.html

Handling Unicode Text Editing With Multilingual Text Engine
http://developer.apple.com/techpubs/macosx/Carbon/text/MultilingualTextEngine/Multilingual_Text_Engine/index.html

Programming With the Text Encoding Conversion Manager
http://developer.apple.com/techpubs/macos8/TextIntlSvcs/TextEncodingConversionManager/TEC1.5/index.html

Netscape, Internet Explorer and iCab (another browser for Mac OS) 
use UTF-8 (Internet Explorer and iCab via the TEC).

In Mac OS 9.1, with an application which supports the Multilingual Text 
Engine, you can even use a Windows TrueType font like Cyberbit 
without converting it to Macintosh format.

There are still some problems, but you can't say it's not supported at all.

Bertrand

-- 
Bertrand Laidain
1, rue Stendhal
75020 Paris



Re: Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)

2001-02-05 Thread John O'Conner

Here's what I see about the Java API docs:
1. The Data{Input, Output}Stream methods {read, write}UTF could be named better. More
appropriate names are {read, write}String. Strictly speaking, this is not a bug, but
it could be better. That's why I call it an RFE (request for enhancement).
2. DataOutputStream's writeUTF() method says it writes UTF-8, when clearly this is a
"modified" UTF-8. The implementation is fine...the documentation is incorrect since it
doesn't write UTF-8 but something slightly different.
3. DataInputStream's readUTF() method is clear that it reads a "modified" UTF-8, but
the doc also says it can throw a UTFDataFormatException if the input stream isn't
valid UTF-8. The error is that it says UTF-8, not "modified" UTF-8 or FSS_UTF.

Later,
John O'Conner

