Re: UCS2, UCS4, UTF-8 ?

IVANCSÓ Krisztián Wed, 04 Jun 2003 22:51:31 -0700

Sven Köhler:
> UTF-8 ?
> I don't see, why it should matter which format the database uses
> internally - as long as it is Unicode -, because usually the
> database-api your application converts the strings into the charset your
> application uses - for Java this would be UCS2 (char is 2 byte in Java)
> and for Windows' Unicode it would be either UCS2.
>
> In addition, UCS2 uses exactly 2 bytes for each char - and UTF-8 uses
> 1-3 (or even more) bytes per char.
> You can imagine, that UCS2 might be very good for sorting stuff etc.

I think UTF-8 is as good as UCS-2 for sorting, because sorting in UTF-8 is just as complicated as in UCS-2: in both cases you must use functions that do not depend on the system locale, and you have to tell the function which locale to use.
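
To illustrate, a minimal Python sketch (the sample words are my own, nothing here is SAP DB specific). A plain binary sort gives code point order for UTF-8 and for UCS-2 alike, and that is not the order a Hungarian reader expects, so a locale-aware collation is needed either way:

```python
# Hypothetical sample data: two plain ASCII words and two accented ones.
words = ["zebra", "Árvíz", "alma", "ökör"]

# Binary sort on the UTF-8 bytes.
by_utf8 = sorted(words, key=lambda w: w.encode("utf-8"))

# Binary sort on big-endian 16-bit code units (same as UCS-2 for BMP text).
by_ucs2 = sorted(words, key=lambda w: w.encode("utf-16-be"))

# Sort by code point, which is Python's default string comparison.
by_code_point = sorted(words)

# All three agree, and all three put the accented words after "zebra".
print(by_utf8)
print(by_ucs2)
print(by_code_point)
```

A dictionary order (alma, Árvíz, ökör, zebra) needs collation rules for a specific locale, whichever encoding the bytes are stored in.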

UCS-2 covers all languages in general use, and exactly the characters those languages need, so it is a good choice. However, I think the DBA/resource should be able to change the encoding and/or the language (locale) used internally by a field (e.g. VARCHAR(32) hu_HU.UTF-8).

>
> I couldn't explain, why MySQL and PostgreSQL use UTF-8 - can anybody
> explain that to me?
>
> If i'm correct, UCS2 is a real subset of UCS4 - so a DB should think
> about UCS4 - although i don't know any language that uses UCS4 or even
> UTF-8.

Dittmar, Daniel wrote:

UTF-8 strings are on average smaller than their UCS2 equivalents, which leads to less I/O.

Zabach, Elke wrote:

correct, if you think about storing mainly 7bit-ASCII (then shorter)
but if you think about Asian languages, UTF-8 needs more space.

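A UTF-8 char can be 1 to 6 bytes long in the original definition (RFC 2279); the code points Unicode actually assigns (up to U+10FFFF) need at most 4 bytes.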

UTF-8 can be a good choice if you use characters that occur in European languages (except Greek). E.g. with Hungarian (e.g. ÁÉŐŰÍ) or German characters (e.g. öüß), the average string is smaller than in UCS-2 (because most characters are 7-bit ASCII and the special chars take 2 bytes), so it leads to less I/O.
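
A rough way to check this is to encode a few sample strings both ways and compare the sizes. This is only a sketch: the sample strings are my own, and UTF-16 stands in for UCS-2, which is byte-identical as long as the text stays within the BMP.

```python
# Hypothetical sample strings, all within the BMP.
samples = {
    "Hungarian": "árvíztűrő tükörfúrógép",
    "German": "Größe, Übung, Straße",
    "Greek": "αλφάβητο",
    "Japanese": "日本語の文字",
}


def encoded_sizes(text):
    """Return (UTF-8 size, UCS-2 size) in bytes for BMP-only text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    utf8_size, ucs2_size = encoded_sizes(text)
    print(f"{name}: UTF-8 {utf8_size} bytes, UCS-2 {ucs2_size} bytes")
```

The Hungarian and German samples come out smaller in UTF-8, the Greek one is the same size in both, and the Japanese one is larger in UTF-8.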

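1-byte UTF-8 chars (U+0000-U+007F):
7-bit ASCII

2-byte UTF-8 chars (U+0080-U+07FF):
Latin-1 Supplement, Latin Extended-A, Latin Extended-B, IPA Extensions, Spacing Modifiers, Combining Diacritics, Greek, Cyrillic, Cyrillic Sup., Armenian, Hebrew, Arabic, Syriac

3-byte UTF-8 chars (U+0800-U+FFFF):
the rest of the BMP, e.g. Devanagari, Thai, Hangul, CJK ideographs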
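
The number of bytes UTF-8 needs depends only on the code point value, so it is easy to check for any character. A small sketch, assuming the 4-byte limit of RFC 3629 (the helper name utf8_length and the sample characters are my own):

```python
def utf8_length(code_point):
    """Bytes needed to encode one code point in UTF-8 (RFC 3629 limits).

    Surrogates (U+D800-U+DFFF) are not rejected here, although they
    cannot be encoded on their own.
    """
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    if code_point <= 0x10FFFF:
        return 4
    raise ValueError("outside the Unicode range")


# Latin, Hungarian, Greek, Cyrillic, Hebrew, CJK, and a Plane 2 ideograph.
for ch in ["a", "ő", "α", "ж", "א", "中", "\U00020000"]:
    # Cross-check the rule against Python's own encoder.
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}: {utf8_length(ord(ch))} byte(s)")
```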

Scripts that need more than 2 bytes in UCS encoding:
(more information on http://www.unicode.org/roadmaps)
Plane 1
00010000-000102FF Aegean scripts
00010300-000107FF Alphabetic and syllabic LTR scripts
00010800-00010FFF Alphabetic and syllabic RTL scripts
00011000-000117FF Brahmic scripts
00011800-00011FFF African and other syllabic scripts
00012000-000127FF Scripts for invented languages
00012800-00012DFF Cuneiform and other Near Eastern scripts
00012E00-000133FF Undeciphered scripts
00013400-00013FFF North American ideographs and pictograms
00014000-00016BFF Egyptian and Mayan hieroglyphs
00016C00-00016FFF Sumerian pictograms
00017000-0001B5FF Large Asian scripts
0001B600-0001CFFF unassigned
0001D000-0001FFFD Notational systems

Plane 2
00020000-0002A6DF CJK Unified Ideographs Extension B
0002A6E0-0002F7FF unassigned
0002F800-0002FA1F CJK Compatibility Ideographs Supplement
0002FA20-0002FFFD unassigned

It's true that these are dead or not widely used languages, scripts, etc., but it may be worth thinking about UCS-4.
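
To show what such a character costs in each encoding, here is a small Python sketch using U+20000, the first code point of CJK Unified Ideographs Extension B (the choice of character is mine). It does not fit into a single 16-bit unit, so UTF-16 has to use a surrogate pair, while UCS-4 (UTF-32) stores it directly:

```python
ch = "\U00020000"  # first code point of CJK Unified Ideographs Extension B

utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")  # same layout as UCS-4 for this code point

# Split the UTF-16 form into its two 16-bit code units (a surrogate pair).
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")

print(f"UTF-8:  {len(utf8)} bytes")
print(f"UTF-16: {len(utf16)} bytes, code units {high:04X} {low:04X}")
print(f"UCS-4:  {len(utf32)} bytes")
```

All three encodings need 4 bytes here. Strict UCS-2 has no surrogate pairs, so it cannot store this character at all.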


Regards,
Krisztián IVANCSÓ


_______________________________________________
sapdb.general mailing list
[EMAIL PROTECTED]
http://listserv.sap.com/mailman/listinfo/sapdb.general
