Re: [OT] Unicode-compatible SQL?
I have heard a rumour (i.e. my source is not involved in the reported activity) that:

<quote>
SAP, PeopleSoft, Siebel, Oracle and others are actually in the process of proposing a new format of UTF that will cause a UTF-16 surrogate pair to become two 3-byte UTF-8 codepoints so that UTF-8 will have the same behaviour as UTF-16, that is, a surrogate will be two UTF-8 code points.
</quote>

Can anyone corroborate this, and, if it's true, offer an opinion on it?

I may add that, as some of you already know, a small group in the UK (which includes me) is working on a proposal intended to improve the SQL standard specification with regard to the treatment of Unicode data by an SQL-implementation. The competent bodies are ISO/IEC SC 32/WG 3, ANSI NCITS H2, BSI IST/40 and other national bodies.

We expect that most of the parties most interested, principally SQL implementors, are already represented either directly or indirectly on one or more competent bodies. But if anyone else is interested, please feel free to download the current, incomplete, provisional draft of the proposal from: ftp://jerry.ece.umassd.edu/pub/SC32/WG3/TEMPdocs where the files containing two different versions are jms01v6 and jms01v7 each of which is in both w97.doc and .pdf format. All comments will be seriously considered.

Mike Sykes

***
J M Sykes    Email: [EMAIL PROTECTED]
97 Oakdale Drive
Heald Green
CHEADLE
Cheshire SK8 3SN
UK    Tel: (44) 161 437 5413
***
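The rumoured form is easiest to see on a single supplementary character. What follows is a minimal Python 3 sketch, not taken from any proposal: the helper name `utf8_via_surrogates` is hypothetical, and it leans on Python's `surrogatepass` error handler to emit the 3-byte sequences that standard UTF-8 does not allow an encoder to produce.

```python
def utf8_via_surrogates(text: str) -> bytes:
    """Encode each UTF-16 code unit of `text` as its own UTF-8 sequence.

    Hypothetical helper for illustration only; this is NOT standard UTF-8.
    """
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit a 3-byte form for D800..DFFF.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U00010400"                    # a character outside the BMP
standard = ch.encode("utf-8")        # one 4-byte sequence
rumoured = utf8_via_surrogates(ch)   # two 3-byte sequences

print(standard.hex())   # f0909080
print(rumoured.hex())   # eda081edb080
```

The same character therefore has two different byte representations, which is the interoperability concern raised in the replies.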
Re: [OT] Unicode-compatible SQL?
Using UTF-8 to handle characters in the supplementary planes by way of using two separate code points in the surrogate range is NOT considered acceptable. Currently it is legal to interpret them but *not* to generate them (multiple refs on the Unicode site).

Therefore, I hope you are mistaken about the rumor since this would be a Bad Thing (tm).

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/

- Original Message -
From: "J M Sykes" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Monday, February 05, 2001 3:50 AM
Subject: Re: [OT] Unicode-compatible SQL?

> I have heard a rumour (i.e. my source is not involved in the reported activity) that:
>
> <quote>
> SAP, PeopleSoft, Siebel, Oracle and others are actually in the process of proposing a new format of UTF that will cause a UTF-16 surrogate pair to become two 3-byte UTF-8 codepoints so that UTF-8 will have the same behaviour as UTF-16, that is, a surrogate will be two UTF-8 code points.
> </quote>
>
> Can anyone corroborate this, and, if it's true, offer an opinion on it?
Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)
In a message dated 2001-02-05 5:19:59 Pacific Standard Time, [EMAIL PROTECTED] writes:

> > I have heard a rumour (i.e. my source is not involved in the reported activity) that:
> >
> > <quote>
> > SAP, PeopleSoft, Siebel, Oracle and others are actually in the process of proposing a new format of UTF that will cause a UTF-16 surrogate pair to become two 3-byte UTF-8 codepoints so that UTF-8 will have the same behaviour as UTF-16, that is, a surrogate will be two UTF-8 code points.
> > </quote>
> >
> > Can anyone corroborate this, and, if it's true, offer an opinion on it?
>
> Using UTF-8 to handle characters in the supplementary planes by way of using two separate code points in the surrogate range is NOT considered acceptable. Currently it is legal to interpret them but *not* to generate them (multiple refs on the Unicode site).
>
> Therefore, I hope you are mistaken about the rumor since this would be a Bad Thing (tm).

This is laziness, intended to get around the "problem" of supplementary code points instead of handling them like any other code points. This reminds me of the Java bastardization of UTF-8, in which U+0000 is encoded 0xC0 0x80 so that no character string will ever contain the byte 0x00. (Nobody has ever explained to me why a character string would contain U+0000 in the first place.)

I have argued in the past that in some cases, semi-conformant Unicode implementations might be better than non-Unicode solutions. But creating a new UTF to get around your product's lack of real Unicode support *and then expecting others to use your hack* is a different matter entirely.

Just bite the bullet and support UTF-8. It's not that hard.

-Doug Ewell
Fullerton, California
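The two-byte encoding of U+0000 mentioned above can be sketched in a few lines of Python 3. The function name `encode_nul_java_style` is an assumption made for this illustration; it covers only the NUL substitution, not the rest of Java's string serialization. Because 0xC0 0x80 is an overlong form, a strict UTF-8 decoder (Python's, in this case) refuses it.

```python
def encode_nul_java_style(text: str) -> bytes:
    """UTF-8, except that U+0000 becomes the two bytes C0 80.

    Hypothetical helper: in UTF-8 the byte 0x00 only ever stands for
    U+0000, so a plain byte replacement is enough for this sketch.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


encoded = encode_nul_java_style("a\x00b")
print(encoded.hex())   # 61c08062

# A strict UTF-8 decoder treats the overlong form C0 80 as an error.
try:
    encoded.decode("utf-8")
    strict_decoder_accepts = True
except UnicodeDecodeError:
    strict_decoder_accepts = False

print(strict_decoder_accepts)   # False
```

The output never contains a 0x00 byte, which is the property described above, at the cost of no longer being valid UTF-8.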
Re: [OT] Unicode-compatible SQL?
The topic came up in a UTC meeting some time ago, a "UTF-8S". The motivation was performance (having a form that reproduces the binary order of UTF-16). We have yet to see a formal proposal for this, though.

Mark

- Original Message -
From: "J M Sykes" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Monday, February 05, 2001 03:50
Subject: Re: [OT] Unicode-compatible SQL?

> I have heard a rumour (i.e. my source is not involved in the reported activity) that:
>
> <quote>
> SAP, PeopleSoft, Siebel, Oracle and others are actually in the process of proposing a new format of UTF that will cause a UTF-16 surrogate pair to become two 3-byte UTF-8 codepoints so that UTF-8 will have the same behaviour as UTF-16, that is, a surrogate will be two UTF-8 code points.
> </quote>
>
> Can anyone corroborate this, and, if it's true, offer an opinion on it?
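The binary-order point can be checked directly. The Python 3 sketch below is an illustration under assumptions, not the UTF-8S design itself (no formal proposal is described in this thread): `surrogate_form` is a hypothetical helper that encodes every UTF-16 code unit as its own UTF-8 sequence, and the three sample strings are chosen so that one lies in the BMP above the surrogate range.

```python
def surrogate_form(text: str) -> bytes:
    """Encode each UTF-16 code unit separately as UTF-8 (illustrative only)."""
    units = text.encode("utf-16-be")
    return b"".join(
        chr(int.from_bytes(units[i:i + 2], "big")).encode("utf-8", "surrogatepass")
        for i in range(0, len(units), 2)
    )


# U+10400 is supplementary; U+FF5E is a BMP character above the surrogates.
samples = ["\U00010400", "\uFF5E", "A"]

by_utf8 = sorted(samples, key=lambda s: s.encode("utf-8"))
by_utf16 = sorted(samples, key=lambda s: s.encode("utf-16-be"))
by_surrogate_form = sorted(samples, key=surrogate_form)

# Standard UTF-8 sorts in code point order; UTF-16 puts the supplementary
# character first because its lead surrogate D801 is less than FF5E.
print(by_utf8 == by_utf16)             # False
print(by_surrogate_form == by_utf16)   # True
```

Only strings mixing supplementary characters with BMP characters in U+E000..U+FFFF sort differently, so the two orders agree for most real data.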
Macintosh OS8.6, OS9
A communication with someone offlist (though I think he is on the list) suggested that Unicode is not supported at all in Macintosh OS8.6 or OS9, not even to the degree that it is supported in Windows 9x, except by means of Windows emulation (if I'm characterizing the message correctly; it is on another computer at the moment). Is this true? (Obviously I don't use Mac OS much). I question this statement because the last time I used a Mac with OS8.6 (I don't have many opportunities, I'm afraid), Netscape 4 did include the UTF-8 encoding among the permitted encodings; but as it wasn't my computer, I didn't install a Unicode font to test it. PTR
Re: Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)
Within a String, the encoding of char values is practically irrelevant. It is a hidden encoding that is never exposed to the user...or developer. When you access String char values, you use an index to 16-bit Unicode values. To my knowledge, Sun does not claim that its internal encoding of String is UTF-8 in any of its API documentation.

Any component or converter that claims to produce a UTF-8 encoding should not behave as you describe. For example, Java's UTF-8 converter does not encode U+0000 as 0xC0 0x80. If it ever does, please file a bug.

Regards,
John O'Conner

[EMAIL PROTECTED] wrote:

> This is laziness, intended to get around the "problem" of supplementary code points instead of handling them like any other code points. This reminds me of the Java bastardization of UTF-8, in which U+0000 is encoded 0xC0 0x80 so that no character string will ever contain the byte 0x00. (Nobody has ever explained to me why a character string would contain U+0000 in the first place.)
Re: Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)
John O'Conner wrote:

> Within a String, the encoding of char values is practically irrelevant. It is a hidden encoding that is never exposed to the user...or developer. When you access String char values, you use an index to 16-bit Unicode values. To my knowledge, Sun does not claim that its internal encoding of String is UTF-8 in any of its API documentation.

The internal encoding is exposed by the regrettably named readUTF and writeUTF methods of java.io.Data{Input,Output}Stream, which should have been named readString and writeString. People have assumed that they are general-purpose UTF-8 read/write functions. At one point, this was a FAQ on this list.

--
There is / one art      || John Cowan [EMAIL PROTECTED]
no more / no less       || http://www.reutershealth.com
to do / all things      || http://www.ccil.org/~cowan
with art- / lessness    \\ -- Piet Hein
Re: Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)
John,

It does impact developers. The API for DataInputStream defines FSS_UTF, which includes the funky null behavior.
http://java.sun.com/products/jdk/1.2/docs/api/java/io/DataInputStream.html

Since this API and others use this UTF, it gets into file formats, and applications end up supporting it.

tex

John O'Conner wrote:

> Within a String, the encoding of char values is practically irrelevant. It is a hidden encoding that is never exposed to the user...or developer. When you access String char values, you use an index to 16-bit Unicode values. To my knowledge, Sun does not claim that its internal encoding of String is UTF-8 in any of its API documentation.

--
According to Murphy, nothing goes according to Hoyle.
--
Tex Texin                Director, International Business
mailto:[EMAIL PROTECTED]
+1-781-280-4271  Fax: +1-781-280-4655
Progress Software Corp.  14 Oak Park, Bedford, MA 01730
http://www.Progress.com  #1 Embedded Database
Globalization Program    http://www.Progress.com/partners/globalization.htm
FW: Question on IBM's Unicode enabled product Support.
-Original Message-
From: William Palmer [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 01, 2001 3:45 PM
To: '[EMAIL PROTECTED]'
Subject: Question on Unicode enabled product Support.

Hello...!

Question: Does the IBM OS/390 Ver 2.6, 2.7 or 2.9 MVS/ESA operating system support Unicode? Is the IBM RACF Security Facility supported for Unicode? And is IBM COBOL for MVS Unicode enabled?

Thank you,

Bill Palmer
Enterprise Server Verification
Access360
Phone (949) 450-6589
Fax (949) 585-0198
www.access360.com
access360: A Better Way to Manage Access Rights
Re: Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)
Perhaps the methods readUTF and writeUTF should be deprecated in favor of read/writeString. I will submit an RFE (request for enhancement) for this.

I noticed that although the Data{Input,Output} interfaces clearly say that write/readUTF handle a "Java modified UTF-8", the actual javadoc in DataOutputStream says that writeUTF writes the String as UTF-8. Also, the doc for UTFDataFormatException is confusing on the issue, saying UTF-8 in one place and "modified UTF-8" in the doc for DataInputStream.

That's 1 RFE for better method names and 2 bugs in the API documentation! I'll submit all 3...if they don't already exist in the db.

Regards,
John O'Conner

John Cowan wrote:

> The internal encoding is exposed by the regrettably named readUTF and writeUTF methods of java.io.Data{Input,Output}Stream, which should have been named readString and writeString. People have assumed that they are general-purpose UTF-8 read/write functions.
Re: Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)
John,

I am not clear from your comments which is the bug, since the doc goes both ways. Are the doc bugs that they say it is UTF-8, or that they say it is modified UTF-8?

It would be great to learn that the functions are actually unmodified UTF-8, as I know of some interfaces that are writing non-Java code and are forced to deal with specialized handling of the modified UTF-8. It would be great to inform them they can use standard UTF-8 library routines.

tex

--
According to Murphy, nothing goes according to Hoyle.
--
Tex Texin                Director, International Business
mailto:[EMAIL PROTECTED]
+1-781-280-4271  Fax: +1-781-280-4655
Progress Software Corp.  14 Oak Park, Bedford, MA 01730
http://www.Progress.com  #1 Embedded Database
Globalization Program    http://www.Progress.com/partners/globalization.htm
Re: Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)
Tex Texin wrote:

> I am not clear from your comments which is the bug, since the doc goes both ways. Are the doc bugs that they say it is UTF-8, or that they say it is modified UTF-8?

It uses modified UTF-8, modified in three ways:

1) U+0000 is encoded in two bytes as 0xc0 0x80;
2) values above U+FFFF are encoded in six bytes as the UTF-8 encoding of their UTF-16 equivalent form;
3) the whole string is prefixed with a byte count represented as a 2-byte big-endian binary integer.

> It would be great to learn that the functions are actually unmodified UTF-8, as I know of some interfaces that are writing non-Java code and are forced to deal with specialized handling of the modified UTF-8. It would be great to inform them they can use standard UTF-8 library routines.

*chomp* No such luck Doc!

--
There is / one art      || John Cowan [EMAIL PROTECTED]
no more / no less       || http://www.reutershealth.com
to do / all things      || http://www.ccil.org/~cowan
with art- / lessness    \\ -- Piet Hein
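For non-Java code that has to produce this format, the three modifications listed above can be sketched in Python 3 as follows. This is an illustration written from that list alone, not a port of the Java source: the name `write_utf_like` and the choice of `ValueError` for over-long strings are assumptions.

```python
import struct


def write_utf_like(text: str) -> bytes:
    """Serialize `text` following the three modifications described above.

    Hypothetical sketch; verify against real writeUTF output before relying on it.
    """
    units = text.encode("utf-16-be")
    body = bytearray()
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        if unit == 0:
            # 1) U+0000 becomes the two bytes C0 80.
            body += b"\xc0\x80"
        else:
            # 2) Each UTF-16 code unit, surrogates included, is encoded on
            #    its own, so a supplementary character takes 3 + 3 bytes.
            body += chr(unit).encode("utf-8", "surrogatepass")
    if len(body) > 0xFFFF:
        raise ValueError("encoded string does not fit a 2-byte length prefix")
    # 3) Prefix the byte count as a 2-byte big-endian integer.
    return struct.pack(">H", len(body)) + bytes(body)


record = write_utf_like("A\x00\U00010400")
print(record.hex())   # 000941c080eda081edb080
```

A reader for the format would reverse the steps: read the 2-byte count, decode that many bytes, map C0 80 back to U+0000 and recombine surrogate pairs.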
Re: Macintosh OS8.6, OS9
-- "P. T. Rourke" [EMAIL PROTECTED] is rumored to have mumbled on Monday, 5 February 2001, 8:47 -0800 regarding Macintosh OS8.6, OS9:

> A communication with someone offlist (though I think he is on the list) suggested that Unicode is not supported at all in Macintosh OS8.6 or OS9,

It's not as simple as that.

> not even to the degree that it is supported in Windows 9x, except by means of Windows emulation (if I'm characterizing the message correctly; it is on another computer at the moment). Is this true? (Obviously I don't use Mac OS much).

To some extent. See below.

> I question this statement because the last time I used a Mac with OS8.6 (I don't have many opportunities, I'm afraid), Netscape 4 did include the UTF-8 encoding among the permitted encodings; but as it wasn't my computer, I didn't install a Unicode font to test it.

A Unicode font wouldn't have helped you there. The "old" way to provide UTF-8 compatibility requires Apple's Text Encoding Converter, which maps characters to *different* fonts. So you cannot use just one Unicode font, but instead you need to install all kinds of fonts, and even then you won't be able to view all Unicode characters, because some blocks don't have a corresponding representation in Apple's old script system. One example of this is the Extended Greek block. AFAIK you cannot get that with the TEC.

However, 8.5 and above actually contain ATSUI, the Apple Type Services for Unicode Imaging. It can read all kinds of font formats, including fonts like Microsoft's Arial Unicode. But that's only one half of the equation. This fine API doesn't do you any good without applications that support it. Unfortunately the only mainstream application that seems to do so (I don't know for sure) is Adobe InDesign 1.5.

Mac OS 9.1 comes with a small word processor called WorldText that supports ATSUI... it's a first step. Here's hoping for the Carbon (or Cocoa?) release of Microsoft Office...

Cheers, Sebastian

--
Sebastian Hagedorn
Ehrenfeldgürtel 156, 50823 Köln, Germany
http://www.spinfo.uni-koeln.de/~hgd/
taz muss sein. [EMAIL PROTECTED] http://www.taz.de
Re: [OT] Unicode-compatible SQL?
Mon, 5 Feb 2001 08:20:43 -0800 (GMT-0800), Mark Davis [EMAIL PROTECTED] writes:

> The topic came up in a UTC meeting some time ago, a "UTF-8S". The motivation was for performance (having a form that reproduces the binary order of UTF-16).

This is unfair: it slows down the conversion UTF-8 <-> UTF-32. In both cases the speed difference is almost none, and it's a big portability problem.

I hope that such trash will not be accepted.

--
__(" Marcin Kowalczyk * [EMAIL PROTECTED] http://qrczak.ids.net.pl/
\__/
^^ SYGNATURA ZASTĘPCZA
QRCZAK
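The extra work on the way to UTF-32 is the recombination of surrogate pairs, which standard UTF-8 never needs. A minimal Python 3 sketch, assuming a hypothetical function name `code_points_from_surrogate_form` and input in which every lead surrogate that has a trail surrogate after it forms a pair:

```python
def code_points_from_surrogate_form(data: bytes) -> list:
    """Decode the surrogate-pair byte form into a list of code points.

    Illustrative only: unpaired surrogates are passed through unchanged.
    """
    units = [ord(c) for c in data.decode("utf-8", "surrogatepass")]
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_lead = 0xD800 <= u <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            # Extra step: merge the two 16-bit units into one code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out


paired = code_points_from_surrogate_form(bytes.fromhex("41eda081edb080"))
direct = [ord(c) for c in bytes.fromhex("41f0909080").decode("utf-8")]

print([hex(cp) for cp in paired])   # ['0x41', '0x10400']
print(paired == direct)             # True
```

Whether the pairing step costs anything measurable is not something this sketch can show; it only makes the additional branch visible.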
Re: Macintosh OS8.6, OS9
Thanks to everyone who responded, especially Mr. Hagedorn; as it is precisely the Extended Greek, Basic Greek, and Combining diacriticals blocks that interest me, this was very important information.

Patrick Rourke

> Netscape, Internet Explorer and iCab (another browser for the Mac OS) use UTF-8 (Internet Explorer and iCab via the TEC).
>
> In Mac OS 9.1, with an application which supports the Multilingual Text Engine, you can even use a Windows TrueType font like Cyberbit without converting it to Macintosh format.
Re: [OT] Unicode-compatible SQL?
Unfortunately, the issue at this point is that some companies have either already accepted it or are in the process of accepting it now.

MichKa

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/

- Original Message -
From: "Marcin 'Qrczak' Kowalczyk" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Monday, February 05, 2001 1:49 PM
Subject: Re: [OT] Unicode-compatible SQL?

> Mon, 5 Feb 2001 08:20:43 -0800 (GMT-0800), Mark Davis [EMAIL PROTECTED] writes:
>
> > The topic came up in a UTC meeting some time ago, a "UTF-8S". The motivation was for performance (having a form that reproduces the binary order of UTF-16).
>
> This is unfair: it slows down the conversion UTF-8 <-> UTF-32. In both cases the speed difference is almost none, and it's a big portability problem.
>
> I hope that such trash will not be accepted.
Re: Macintosh OS8.6, OS9
The Mac OS did support Unicode in 8.6 and 9.x; see:

Apple Type Services for Unicode Imaging Reference
http://developer.apple.com/techpubs/macos8/TextIntlSvcs/ATSUI/ATSUI_ref/index.html

Handling Unicode Text Editing With Multilingual Text Engine
http://developer.apple.com/techpubs/macosx/Carbon/text/MultilingualTextEngine/Multilingual_Text_Engine/index.html

Programming With the Text Encoding Conversion Manager
http://developer.apple.com/techpubs/macos8/TextIntlSvcs/TextEncodingConversionManager/TEC1.5/index.html

Netscape, Internet Explorer and iCab (another browser for the Mac OS) use UTF-8 (Internet Explorer and iCab via the TEC).

In Mac OS 9.1, with an application which supports the Multilingual Text Engine, you can even use a Windows TrueType font like Cyberbit without converting it to Macintosh format.

There are still some problems, but you can't say it's not supported at all.

Bertrand

> A communication with someone offlist (though I think he is on the list) suggested that Unicode is not supported at all in Macintosh OS8.6 or OS9, not even to the degree that it is supported in Windows 9x, except by means of Windows emulation (if I'm characterizing the message correctly; it is on another computer at the moment). Is this true?
>
> PTR

--
Bertrand Laidain
1, rue Stendhal
75020 Paris
Re: Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)
Here's what I see about the Java API docs:

1. The Data{Input, Output}Stream methods {read, write}UTF could be named better. More appropriate names are {read, write}String. Strictly speaking, this is not a bug, but it could be better. That's why I call it an RFE (request for enhancement).

2. DataOutputStream's writeUTF() method says it writes UTF-8, when clearly this is a "modified" UTF-8. The implementation is fine...the documentation is incorrect since it doesn't write UTF-8 but something slightly different.

3. DataInputStream's readUTF() method is clear that it reads a "modified" UTF-8, but the doc also says it can throw a UTFDataFormatException if the input stream isn't valid UTF-8. The error is that it says UTF-8, not "modified" UTF-8 or FSS_UTF.

Later,
John O'Conner

Tex Texin wrote:

> John, I am not clear from your comments which is the bug, since the doc goes both ways. Are the doc bugs that they say it is UTF-8, or that they say it is modified UTF-8?