This discussion started out on using APL characters as executable in J. I'm
not sure I would want to make many equivalences between APL symbols and J
primitives; however, representing APL characters and international
characters gets into the way J handles these characters with the character
types literal, unicode and UTF-8.

Those not interested can bail out now, as the rest is kind of boring, but it
is my soap-box.

About the time mini-computers and personal computers became common, 7-bit
ASCII was a well-established standard. But by this time computers had
standardized on 8 bits to the character. The extra bit allowed for
supporting international characters while still fitting in a byte. In
addition, APL systems used those extra codes to support APL characters. But
this led to confusion, since those characters varied between countries and
systems.

Unicode was created to attempt to clean this mess up. It took 7-bit ASCII
and one fairly accepted 8-bit extension of it (ISO 8859-1, Latin-1) as its
first 256 code points, keeping their numeric values and in effect adding
leading zeros to make wider characters (16 bits in early Unicode, up to 32
bits in UTF-32). Now there is all kinds of room to support many languages in
a compatible manner.

Enter UCS Transformation Format, in particular UTF-8. Fixed-width Unicode
has practical problems: it makes ASCII files much larger, so they take
longer to send over slow communications lines, and there is the endian issue
between different computers. UTF-8 is an ingenious technique to compress
unicode into a sequence of bytes in a manner that is completely compatible
with 7-bit ASCII. The endian problem is eliminated. It is not compatible
with the 8-bit ASCII extensions. 7-bit ASCII text looks identical to UTF-8
text. Text using the 8-bit ASCII extensions does not: those characters
become two bytes each under the UTF-8 compression algorithm.
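
As a small illustration of that last point, here is a sketch in J of how one
character from the extended range, þ (code point 254), becomes two bytes.
The two-byte UTF-8 layout is 110xxxxx 10xxxxxx: the code point is split into
a high part and a low 6-bit part, and these are added to the marker values
192 and 128. Only the two-byte case (code points 128 to 2047) is covered,
and the last line assumes the session takes keyboard input as UTF-8.

   64 64 #: 254            NB. split 254 into high bits and low 6 bits
3 62
   192 128 + 64 64 #: 254  NB. add the lead-byte and continuation-byte markers
195 190
   a. i. 'þ'               NB. the bytes J holds for a typed þ
195 190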

J converts literal to unicode by simply putting a zero byte in front of each
character, extending it to the 16-bit version of Unicode implemented in
Windows and Unix. This is perfectly valid, as the numeric values of the
first 256 Unicode characters match the 8-bit ASCII extension. UTF-8,
however, assumes that the _128{.a. characters in literal are part of its
compression algorithm and do not represent extended ASCII. But J stores
UTF-8 as literal, making it impossible to tell whether those characters
represent extended ASCII or UTF-8 compression.
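
The two readings of the same literal can be sketched with u: directly. This
assumes that 2 u: widens each literal atom on its own (the zero-byte
extension just described), that 7 u: decodes the literal as UTF-8, and that
3 u: shows the resulting code points. The name b is only for illustration.

   b =: 'þ'          NB. typed at the keyboard, so two UTF-8 bytes
   3 u: 2 u: b       NB. each byte widened separately: two characters
195 190
   3 u: 7 u: b       NB. bytes decoded as UTF-8: one character
254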

UTF-8 is a compressed version of Unicode that J fits in literal. J treats
literal as 8-bit extended ASCII when combining and converting to/from
unicode (wide). It treats literal as UTF-8 when entered from the keyboard
and displayed. Got a bit of an inconsistency here.

   U =: 7 u: u =: 'þ'
   3!:0 u   NB. u is literal
2
   3!:0 U   NB. U is unicode
131072
   #u       NB. u takes 2 atoms
2
   #U       NB. U takes 1 atom
1
   'abc',u  NB. ASCII literals catenate with UTF-8
abcþ
   'abc',U  NB. ASCII literals catenate with unicode
abcþ
   u,U      NB. UTF-8 literals do not catenate well with unicode
þþ
   a.i.u,U  NB. Here we have þ in two forms
195 190 254

So, when programming in J, one must never mix UTF-8 and unicode without
being extremely careful and aware of what can happen. It is easiest to use
ASCII and UTF-8 together. That is not a problem, as one cannot get any
unicode into J without specifically converting to unicode using u: .

The alternative is to make sure all text that might contain UTF-8 is
converted to unicode. That can be difficult at times.

The trouble with mixing ASCII and UTF-8 is that J primitives work on the
atoms of literal. Any UTF-8 bytes are treated as 8-bit extended ASCII.
Counting characters and reshaping fail with UTF-8. Searching for UTF-8
characters is harder. An example of character counting failing with UTF-8
is the display of boxed literals.

   <u
+--+
|þ|
+--+

Notice that þ is treated as two characters but displays as one.
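
The same thing happens with any primitive that rearranges atoms. As a rough
sketch (the name s is made up for illustration), reversing a literal that
holds UTF-8 reverses its bytes, which splits the two-byte sequence for þ.
Reversing after 7 u: keeps the character whole.

   s =: 'aþ'
   a. i. s              NB. one ASCII byte, then the two bytes of þ
97 195 190
   a. i. |. s           NB. bytes reversed: 190 195 is not a valid UTF-8 sequence
190 195 97
   3 u: |. 7 u: s       NB. reversed as characters: þ stays intact
254 97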

I choose to make sure everything that might contain UTF-8 is run through 7
u: which will convert it to unicode if it contains any UTF-8, or leave it
literal otherwise. Now all the J primitives work as expected. A character
fits in an atom. I never worry about the possibility of UTF-8 characters
being garbled. When I'm through, I simply convert my final result back to
UTF-8.
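
A minimal sketch of that workflow follows. The names towide, toutf8 and rev
are made up for this illustration and are not standard library verbs, and
the sketch assumes 8 u: as the conversion back to UTF-8.

   towide =: 7&u:                     NB. UTF-8 literal to unicode (pure ASCII stays literal)
   toutf8 =: 8&u:                     NB. unicode back to UTF-8 literal
   rev    =: toutf8 @: |. @: towide   NB. do the work on whole characters
   a. i. rev 'aþ'                     NB. the result is valid UTF-8 again
195 190 97
   3!:0 rev 'aþ'                      NB. and it is literal, ready to write out
2

Any other verb that counts, reshapes or searches characters could take the
place of |. in the middle of rev.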