``A Short Intro ...'' - comments, suggestions?

Brian Foster Wed, 11 Dec 2002 21:48:02 -0800

  the company I work for is prototyping a software platform to
 be embedded within a product intended for homes in CJK-land.
 a Java application, developed by native CJK programmers, will
 run on the platform.


  I have raised some concerns about the use of CJK-characters to
 name objects in non-volatile storage (which below is identified
 only as the «FS» --- short for filesystem).  to do so, I filed
 an admittedly somewhat obscure defect report.  in response to
 several people asking me what I was going on about, I quickly
 wrote a version of the note below.

  it occurred to me people on this list may have useful input
 they are willing to share.  the note is aimed at the technical
 management, so whilst pedantic details are not relevant, tech-
 speak is Ok(-ish).   but the note must be short!

  what I am wondering is if there are any gross errors, and/or
 if the note may be useful to anyone else:  if there's sufficient
 interest, I'm happy to post a summary of replies/comments.

 ( apologies for slightly obscuring several points for legal or
  commercial reasons.  I do not believe that materially changes
  the note. )

  as far as I am concerned, this version of the note is in the
 Public Domain and may be abused as you see fit.   ;-)

=====(cut here and below)=====(CJK non-volatile object names)=====

            A QUICK INTRO TO FILES NAMED «スッ•txt»
         (Unicode, UCS-2 and UTF-8, served with Java.)
                         - Brian Foster <[EMAIL PROTECTED]>
                           2002.12.11, Montpellier France

This file, which is itself encoded in UTF-8, attempts to illustrate
the issue(s) with [ Java/JVM on the system and its interaction with
the FS filesystem ].

The problems surround the issue of encoding.  Each character needed to
write most of the world's languages has been assigned a fixed numeric
value in ISO standard 10646, commonly referred to as Unicode.  The
Unicode value assigned to any character is written  U+nnnn  where
<nnnn> is the assigned value, in hex.

By deliberate design, the first 128 Unicode values, U+0000 to U+007F,
are the usual US-ASCII characters.  Hence, U+0000 is another way of
writing the NUL character, which in C/C++ is written \0.  Similarly,
U+0041 is capital A, and so on.

Unicode also assigns official names for each value.  For instance, the
official name of U+0041 is «LATIN CAPITAL LETTER A».
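
A minimal sketch of the notation, in Python purely for brevity: the
standard unicodedata module knows the official names, so the U+nnnn
form and the name can be printed for any character.  The helper name
describe() is made up for this illustration.

```python
import unicodedata

def describe(ch):
    """Return 'U+nnnn  OFFICIAL NAME' for a single character."""
    return "U+%04X  %s" % (ord(ch), unicodedata.name(ch))

# Capital A, and the currency sign (U+00A4).
for ch in "A\u00a4":
    print(describe(ch))
```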

There are a huge number of Unicode names/values, with some of the
recently added values being larger than 65535 (decimal), or 2^16 - 1.
ISO 10646 is currently defined as a 31-bit space, although Unicode
itself assigns values only up to U+10FFFF.

That obviously makes it impractical, in many circumstances, to process
text in the natural encoding, 4-byte (32-bit) integers.  The simple
string «Hi!», which consists of the three characters:

    H  U+0048  LATIN CAPITAL LETTER H
    i  U+0069  LATIN SMALL LETTER I
    !  U+0021  EXCLAMATION MARK

would require three 4-byte words (12 bytes!) to store:

    0x00000048 0x00000069 0x00000021                UCS-4

This natural 4-byte encoding is called UCS-4.  Please notice there are
0x00-valued bytes in UCS-4, and hence the UCS-4 encoding of «Hi!» is
not a C/C++ string.  It contains \0.

Most Unicode values are no larger than 65535 (U+FFFF), so a truncated
form, storing only the low-order 2 bytes of each value, is commonly used:

    0x0048 0x0069 0x0021                            UCS-2

This natural 2-byte encoding is called UCS-2.  Note that UCS-2 also
contains \0 (embedded 0x00 bytes), and hence is not a C/C++ string.
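
The two fixed-width forms can be reproduced with a short Python
sketch.  It assumes big-endian byte order, and it uses Python's
"utf-32-be" and "utf-16-be" codecs, which give the same bytes as
UCS-4 and UCS-2 for these characters.  The helpers words() and
c_strlen() are made up here; c_strlen() only mimics where C's
strlen() would stop.

```python
def words(data, size):
    """Format bytes as space-separated hex words of `size` bytes each."""
    return " ".join(data[i:i + size].hex() for i in range(0, len(data), size))

def c_strlen(data):
    """Length a C string function would see: bytes before the first 0x00."""
    return data.index(0) if 0 in data else len(data)

text = "Hi!"
ucs4 = text.encode("utf-32-be")   # 4 bytes per character
ucs2 = text.encode("utf-16-be")   # 2 bytes per character

print(words(ucs4, 4), " C length:", c_strlen(ucs4))
print(words(ucs2, 2), " C length:", c_strlen(ucs2))
```

In both cases the very first byte is 0x00, so code that treats the
buffer as a C/C++ string would see an empty string.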

However, even UCS-2 occupies twice the space of the traditional one-
character-per-byte encoding:

    0x48 0x69 0x21                                  various

There are many different one-character-per-byte encodings (which are
also called 8-bit encodings).  Most use the first 128 values, 0x00 to
0x7F, to encode U+0000 to U+007F, in the natural manner:  U+0041 is
byte 0x41.  Where they differ is in the second 128 values, 0x80 to 0xFF.

As just one example, ISO-8859-1 uses byte 0xA4 to encode U+00A4, ¤,
the CURRENCY SIGN.  But ISO-8859-15 uses 0xA4 to encode U+20AC, €,
the EURO SIGN.
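
The difference is easy to see by decoding the same byte twice.  A
small Python sketch, relying only on the conversion tables shipped
with Python's standard codecs (which map 0xA4 in ISO-8859-15 to
U+20AC, EURO SIGN):

```python
import unicodedata

byte = bytes([0xA4])

# Same byte, two different 8-bit encodings, two different characters.
latin1 = byte.decode("iso-8859-1")
latin9 = byte.decode("iso-8859-15")

for codec, ch in (("iso-8859-1", latin1), ("iso-8859-15", latin9)):
    print("%-12s U+%04X  %s" % (codec, ord(ch), unicodedata.name(ch)))
```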

Various encodings of Unicode values which are more space-efficient
than UCS-4 have been designed.  The best is definitely UTF-8, which
is a self-synchronizing multi-byte encoding.  UTF-8 encodes the
example «Hi!» string as the three obvious bytes:

    0x48 0x69 0x21                                  UTF-8

UTF-8 encoded text is a valid C/C++ string.  The one and only time a
0x00-valued byte can occur in a UTF-8 encoding is as the character \0
(U+0000) itself.  Unlike both UCS-4 and UCS-2, no other character
produces 0x00 bytes in UTF-8.

Unlike UCS-2, UTF-8 can also encode the entire 31-bit Unicode space.
Unlike the 8-bit encodings, UTF-8 encoding is entirely computational
(and bi-directional).  If you know the Unicode value, you can compute
the UTF-8 encoding; if you have some UTF-8 encoded text, you can
recreate all the Unicode values.  UTF-8 is a lossless reversible
Unicode encoding.  (In contrast, ISO-8859-15 requires a conversion
table:  How else would one convert U+20AC to 0xA4 or vice versa?)
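
To make "entirely computational" concrete, here is a rough Python
sketch of the bit-shuffling, with no lookup table anywhere.  It is
an illustration only:  the function names are made up, it stops at
4-byte sequences (values up to U+10FFFF) rather than the full 31-bit
scheme, and the decoder does no validation of malformed input.

```python
def utf8_encode(cp):
    """Encode one Unicode value (0 to 0x10FFFF) as UTF-8 bytes."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp < 0x110000:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("value too large for this sketch")

def utf8_decode(data):
    """Recover the list of Unicode values from well-formed UTF-8 bytes."""
    values, i = [], 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            extra, cp = 0, lead
        elif lead >> 5 == 0b110:
            extra, cp = 1, lead & 0x1F
        elif lead >> 4 == 0b1110:
            extra, cp = 2, lead & 0x0F
        elif lead >> 3 == 0b11110:
            extra, cp = 3, lead & 0x07
        else:
            raise ValueError("not a UTF-8 lead byte")
        # Each continuation byte contributes its low six bits.
        for b in data[i + 1:i + 1 + extra]:
            cp = (cp << 6) | (b & 0x3F)
        values.append(cp)
        i += 1 + extra
    return values

encoded = b"".join(utf8_encode(ord(ch)) for ch in "Hi!\u20ac")
print(encoded.hex(), ["U+%04X" % v for v in utf8_decode(encoded)])
```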


The issue with [ FS ] and [ Java/JVM ] is that both are naturally
Unicode-based.   [ FS ] filenames (like those of the obsolete VFAT)
and Java both use Unicode.  (Java strings and character constants are,
I think, UCS-2.)  However, the "thing in the middle", [ the system ],
does not use Unicode.  It uses C/C++ strings, assumes bytes 0x00 to
0x7F are U+0000 to U+007F (US-ASCII), and makes no obvious assumptions
about 0x80 to 0xFF bytes.

A perfectly natural filename might be «スッ•txt».  That is the six
Unicode characters:

   U+30B9  ス  KATAKANA LETTER SU
   U+30C3  ッ  KATAKANA LETTER SMALL TU
   U+2022  •   BULLET
   U+0074  t   LATIN SMALL LETTER T
   U+0078  x   LATIN SMALL LETTER X
   U+0074  t   LATIN SMALL LETTER T

A programmer could write that in his programme, or a user may enter
or select it, and expect it to work.

What would actually happen?

Suppose such a file is being opened.  What bytes are passed as the
name of the file?  This is an unknown.  It obviously depends on the
Java/JVM implementation.
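
To show how much the answer can vary, the Python sketch below lists
the bytes the same name would become under a few candidate encodings.
The choice of candidates is purely an assumption for illustration;
which one (if any) a given Java/JVM implementation actually uses is
exactly the unknown.  The helper name candidates() is made up.

```python
NAME = "\u30b9\u30c3\u2022txt"   # the six-character filename discussed here

def candidates(name, codecs=("utf-8", "utf-16-be", "iso-8859-1")):
    """Map each codec to the encoded bytes, or None if it cannot encode."""
    result = {}
    for codec in codecs:
        try:
            result[codec] = name.encode(codec)
        except UnicodeEncodeError:
            result[codec] = None
    return result

for codec, data in candidates(NAME).items():
    if data is None:
        print("%-11s cannot represent this name" % codec)
    else:
        note = " (contains 0x00 bytes)" if 0 in data else ""
        print("%-11s %s%s" % (codec, data.hex(), note))
```

Only the UTF-8 form is both complete and free of 0x00 bytes;  the
2-byte form would be cut short by anything treating it as a C/C++
string, and ISO-8859-1 cannot express the name at all.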

[ FS ], like VFAT, understands the C/C++ string «:30b9:30c3:2022txt»
to mean the above six Unicode characters.  That is, [ FS ] uses another
multi-byte encoding, which it calls the «Linux encoding», to make almost
the full range U+0001 to U+FFFF available for filenames.

Conversely, if [ Java/JVM ] were to scan a directory (or the user
does a simple «ls»), the file «:30b9:30c3:2022txt» would be listed.
Whilst [ Java/JVM ] may understand that means «スッ•txt», most users
would not!

The Linux encoding used by [ FS ] is a hack.  It is not as space-
efficient as UTF-8, nor is it a de jure (official) standard.
It is also slightly ambiguous.  Whilst «:00a4» means U+00A4 (and
is thus unambiguous), [ FS ] only uses the «:»-escape to encode
UCS-2 values 0x0100 to 0xFFFF (characters U+0100 to U+FFFF).

So UCS-2 value 0x00A4 becomes the byte 0xA4.  Is that ISO-8859-1's
U+00A4, ¤, the CURRENCY SIGN, or ISO-8859-15's U+20AC, €, the EURO
SIGN?  (Or perhaps something else?)  As used by [ FS ],
the Linux encoding depends on the locale and font (glyphs) in effect.
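
The behaviour described above can be approximated by the Python
sketch below.  It is a guess at the scheme from the description in
this note, not the [ FS ] source:  the function names are invented,
values below 0x0100 are assumed to be written as a single raw byte,
and a literal «:» followed by four hex digits in a real name is not
handled at all.

```python
import re

def linux_encode(name):
    """Escape U+0100 to U+FFFF as ':xxxx'; pass lower values as one byte."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError("outside the UCS-2 range")
        if cp >= 0x100:
            out += (":%04x" % cp).encode("ascii")
        else:
            out.append(cp)
    return bytes(out)

def linux_decode(data):
    """Inverse of linux_encode, assuming raw bytes mean U+0000 to U+00FF."""
    text = data.decode("iso-8859-1")
    return re.sub(r":([0-9a-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), text)

print(linux_encode("\u30b9\u30c3\u2022txt"))   # the name as [ FS ] stores it
print(linux_encode("\u00a4"))                  # a lone byte 0xA4, no escape
```

Note the decoder has to pick some meaning for the raw byte 0xA4;
this sketch picks ISO-8859-1, which is precisely the locale-dependent
choice the scheme leaves open.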

The Linux encoding is used because the system is not ready for UTF-8.
Unlike UTF-8, the «:»-escapes of the Linux encoding consist solely of
US-ASCII bytes.

=====(cut here and above)=====(CJK non-volatile object names)=====

cheers!
        -blf-
--
«How many surrealists does it take to    |  Brian Foster      Montpellier,
 change a lightbulb?  Three.  One calms  |  [EMAIL PROTECTED]      France
 the warthog, and two fill the bathtub   |    Stop E$$o (ExxonMobile)!
 with brightly-colored machine tools.»   |        http://www.stopesso.com
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
