Re: UTF-8 woes on z/OS, a solution - comments invited

Pew, Curtis G Tue, 05 Sep 2017 05:37:08 -0700

On Sep 4, 2017, at 9:02 PM, Paul Gilmartin 
<0000000433f07816-dmarc-requ...@listserv.ua.edu> wrote:
> 
> Why is there UTF-16?
> 
> o It's a variable-length encoding, involving the same complexities as UTF-8.
> 
> o It lacks the compactness of UTF-8 in the case of Latin text.
> 
> Is it because it's (sort of) an extension of UCS-2?
> 
> (What does Java use internally?)


Unicode was originally supposed to be a fixed-width, 16-bit encoding. 
Fixed width was actually a design criterion for the original developers. It was 
only after it became clear that there was no possible way to fit all the needed 
characters into 16 bits that the “astral planes”[1] were (reluctantly) added to 
Unicode and the various UTF encodings defined. In this light, UTF-16 is the 
closest thing to the original version of Unicode. Also, if your text includes 
few or no Latin characters, UTF-16 may be just as compact as UTF-8, or even 
more compact, and can probably be processed more easily.
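
To make the compactness point concrete, here is a small Python sketch that counts encoded bytes for a few sample strings. The sample strings and the helper name encoded_sizes are my own choices for illustration, and the "utf-16-le" codec is used only so that no byte-order mark is counted.

```python
# Compare the encoded size of the same text in UTF-8 and UTF-16.
# The sample strings below are arbitrary illustrations.
samples = {
    "Latin": "Hello, world",
    "Greek": "αβγδ",
    "Japanese": "こんにちは",
}


def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    # "utf-16-le" fixes the byte order, so no byte-order mark is added.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")
```

For these samples the Latin text is twice as large in UTF-16, the Greek text is the same size in both, and the Japanese text is smaller in UTF-16 (10 bytes against 15).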

Since Java was developed when Unicode was still supposed to be a 16-bit 
encoding, the early versions at least used what we would now call UTF-16. As I 
recall, there was a significant period of time after Unicode abandoned a 
fixed-width 16-bit representation before Java implementations really supported 
characters from the “astral planes”.
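
As a rough illustration of why a 16-bit character type is not enough for the “astral planes”, the Python sketch below lists the 16-bit code units UTF-16 needs for a character. The function names are hypothetical helpers of mine; the arithmetic in surrogate_pair is the standard UTF-16 surrogate calculation.

```python
def utf16_code_units(ch):
    """Return the 16-bit code units that encode ch in UTF-16."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def surrogate_pair(code_point):
    """Compute the surrogate pair for a code point above U+FFFF by hand."""
    offset = code_point - 0x10000        # 20 bits remain after removing the BMP
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return [high, low]


# A BMP character fits in one code unit; U+1F600 needs two.
print([hex(u) for u in utf16_code_units("A")])
print([hex(u) for u in utf16_code_units("\U0001F600")])
print([hex(u) for u in surrogate_pair(0x1F600)])
```

A character such as U+1F600 comes out as the two code units 0xD83D and 0xDE00, so it cannot be held in a single 16-bit value such as a Java char.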


[1] Unicode is still organized into 64K ranges called “planes”. The original 
0x0000–0xFFFF range is called the “Basic Multilingual Plane” (BMP) and “astral 
planes” is a convenient nickname for the other ranges.
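
Since each plane covers 64K code points, the plane number is just the code point shifted right by 16 bits. A minimal Python sketch (the function names are mine, chosen for illustration):

```python
def plane(code_point):
    """Return the plane number of a code point (0 is the BMP)."""
    return code_point >> 16


def is_astral(code_point):
    """True if the code point lies outside the Basic Multilingual Plane."""
    return plane(code_point) != 0


for cp in (0x0041, 0xFFFF, 0x10000, 0x1F600, 0x10FFFF):
    print(f"U+{cp:04X}: plane {plane(cp)}, astral = {is_astral(cp)}")
```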

-- 
Pew, Curtis G
curtis....@austin.utexas.edu
ITS Systems/Core/Administrative Services


----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
