x=: 8 u: 97 243 98 NB. same as entering x=: 'aób'
y=: 9 u: x
z=: 10 u: 97 195 179 98
x
aób
y
aób
z
aÃ³b
x-:y
0
NB. ??? they look the same
x-:z
1
NB. ??? they look different
$x
4
NB. ??? it looks like 3 characters, not 4
Well, this is unicode. There are good reasons why two things that look
the same might not actually be the same. For instance:
]p=: 10 u: 97 243 98
aób
]q=: 10 u: 97 111 769 98
aób
p-:q
0
But in the above case, x doesn't match y for stupid reasons. And x
matches z for stupider ones.
J's default (1-byte) character representation is a weird hodge-podge of
'UCS-1' (each byte read as a code point below 256; I don't know what else
to call it) and UTF-8, and it does not
seem well thought through. The dictionary page for u: seems confused as
to whether the 1-byte representation corresponds to ASCII or UTF-8, and
similarly as to whether the 2-byte representation is coded as UCS-2 or
UTF-16.
Most charitably, this is a deliberate exposure of low-level aspects of the
encoding to users; but if so, that is unsuitable for a high-level language
such as j, and it is applied inconsistently. I do not have to worry that 0 1 1 0 1 1 0 1 will
suddenly turn into 36169536663191680, nor that 2.718 will suddenly turn
into 4613302810693613912, but _that is exactly what is happening in the
above code_.
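(You can see the shape-shifting directly. datatype, from the standard
library, names the internal type reported by 3!:0; I leave the exact output
to your own session, since it may vary by J version:)
datatype&.> x;y;z NB. expect x to be plain 1-byte literal, y and z to be wide unicode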
I give you the crowning WTF (maybe it is not so surprising at this
point...):
x;y;x,y NB. pls j
┌───┬───┬───────┐
│aób│aób│aÃ³baób│
└───┴───┴───────┘
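(What appears to have happened: the catenation promoted each byte of x to a
code point, 'UCS-1'-style, instead of decoding it as utf-8. If you want to
check in your own session:)
$ x,y NB. expect 7, not 6: each of the four bytes of x became a code point
3 u: x,y NB. expect 97 195 179 98 97 243 98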
Unicode is delicate and skittish, and must be approached with care. I
think that there are some essential conflicts between unicode and j--as
the above example with the combining character demonstrates--but also that
pandora's box is open: literal data _exists_ in j. That being the case, I
think it is possible and desirable to do much better than the current
scheme.
---
Unicode text can be broken up in a number of ways. Graphemes, characters,
code points, code units...
The composition of code units into code points is the only such
demarcation which is stable and can be counted upon. It is also a
demarcation which is necessary for pretty much any interesting text
processing (to the point that I would suggest any form of 'text
processing' which does not consider code points is not actually processing
text). Therefore, I suggest that, at a minimum, no user-exposed
representation of text should acknowledge a delineation below that of the
code point. If there is any primitive which deals in code units, it
should be a foreign: scary, obscure, not for everyday use.
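(To make the distinction concrete with the strings from earlier, both of
which display as aób:)
# 8 u: 97 243 98 NB. 4 code units (utf-8 bytes)
# 10 u: 97 243 98 NB. 3 code points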
A non-obvious but good result of the above is that all strings are
correctly formed by construction. Not every sequence of code units is
correctly formed and corresponds to a valid string of text. But all
sequences of code points _are_, of necessity, correctly formed, otherwise
there would be ... problems following additions to unicode. J currently
allows us to create malformed strings, but then complains when we use them
in certain ways:
9 u: 1 u: 10 u: 254 255
|domain error
| 9 u:1 u:10 u:254 255
---
There is also the question of whether j should natively recognise
delineations above the code point. It pains me to suggest that it should
not.
Raku (a pointer-chasing language) has the best-thought-out strings of any
programming language I have encountered. (Unsurprising, given it was
written by perl hackers.) In raku, operations on strings are
grapheme-oriented. Raku also normalizes all text by default (which solves
the problem I presented above with combining characters--but rest assured,
it cannot solve all such problems). They even have a scheme for
space-efficient random access to strings on this basis.
But j is not raku, and it is telling that, though raku has
multidimensional arrays, its strings are _not_ arrays, and it does not
have characters. The principal problem is a violation of the rules of
conformability. For instance, it is not guaranteed that, for vectors x
and y, (#x,y) -: x +&# y: 'o' and a lone combining acute are one grapheme
each, yet their catenation is a single grapheme, so the lengths do not add.
This is not _so_ terrible (though it is pretty bad), but from it follows an
obvious problem with catenating higher-rank arrays. Similar concerns apply
at least to i., e., E., and }. That said,
I would support the addition of primitives to perform normalization (as
well as casefolding etc.) and identification of grapheme boundaries.
---
It would be wise of me to address the elephant in the room. Characters
are not only used to represent text, but also arbitrary binary data, e.g.
from the network or files, which may in fact be malformed as text. I
submit that characters are clearly the wrong way to represent such data;
the right way to represent a sequence of _octets_ is using _integers_.
But people persist, and there are two issues: the first is compatibility,
and the second is performance.
Regarding the second, performance: an obvious solution is to add a 1-byte
integer representation (as Marshall has suggested on at least one
occasion), but this represents a potentially nontrivial development
effort. Therefore I suggest an alternative solution, at least for the
interim: foreigns (scary
and obscure, per above) that will _intentionally misinterpret_ data from
the outside world as 'UCS-1' and represent it compactly (or do the
opposite).
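(For a sense of the space at stake: 7!:5 reports the bytes used by named
nouns; the names and figures here are illustrative, and the exact numbers
depend on version and headers, but the ratio should be roughly eight to
one:)
octets =: 256 | i. 10000 NB. ten thousand octet values held as integers
asbytes =: a. {~ octets NB. the same values held as 1-byte literals
7!:5 ;: 'octets asbytes' NB. expect roughly 8 bytes per atom vs 1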
Regarding the issue of backwards compatibility, I propose the addition of
256 'meta-characters', each corresponding to an octet. Attempts to decode
correctly formed utf-8 from the outside world will succeed and produce
corresponding unicode; attempts to decode malformed utf-8 may map each
incorrect code unit to the corresponding meta-character. When encoded,
real characters will be utf-8 encoded, but each meta-character will be
encoded as its corresponding octet. In this way, arbitrary byte streams
may be passed through j strings; but byte streams which consist entirely
or partly of valid utf-8 can be sensibly interpreted. This is similar to
raku's utf8-c8, and to python's surrogateescape.
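(A toy model of the octet-to-character mapping in today's J. The names esc
and unesc and the choice of the private use block at u+e000 are mine,
purely for illustration; real meta-characters would live outside unicode
proper, and the decoder, not the user, would apply the mapping:)
esc =: 10&u: @: (57344 + a.&i.) NB. octet -> stand-in character at u+e000 plus the octet's value
unesc =: a. {~ (57344 -~ 3&u:) NB. stand-in character -> octet
a. -: unesc esc a. NB. all 256 octets round-trip; expect 1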
---
An implementation detail, sort of. Variable-width representations (such
as utf-8) should not be used internally. Many fundamental array
operations require constant-time random access (with the corresponding
obvious caveats), which variable-width representations cannot provide; and
even operations which are inherently sequential--like E., i., ;., #--may
be more difficult or impossible to optimize to the same degree.
Fixed-width representations therefore provide more predictable
performance, better performance in nearly all cases, and better asymptotic
performance for many interesting applications.
(The UCS-1 misinterpretation mentioned above is a loophole which allows
people who really care about space to do the variable-width part
themselves.)
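(To restate the random-access point with the strings from earlier: indexing
a utf-8 byte string by position hands back a code unit, not a character:)
1 { 8 u: 97 243 98 NB. the second *byte* of the utf-8 for 'aób': half of the ó
1 { 10 u: 97 243 98 NB. the second code point: ó itself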
---
I therefore suggest the following language changes, probably to be
deferred to version 10:
- 1, 2, and 4-byte character representations are still used internally.
They are fixed-width, with each code unit representing one code point.
In the 4-byte representation, because there are more 32-bit values than
unicode code points, some 32-bit values may correspond to passed-through
bytes of misencoded utf-8. In this way, a j literal can round-trip
arbitrary byte sequences. The remainder of the 32-bit value space is
completely inaccessible.
- A new primitive verb U:, to replace u:. u: is removed. U: has a
different name so that old code will break loudly, rather than quietly.
If y is an array of integers, then U:y is an array of characters with the
corresponding code points; and if y is an array of characters, then U:y
is an array of their code points. (A rough model in terms of today's
primitives follows this list.) (Alternatively, make a. impractically
large and rely on a.i.y and x{a. for everything. I disrecommend this
for the same reason that we have j. and r., and do not write x = 0j1 * y
or x * ^ 0j1 * y.)
- Foreigns for reading from files, like 1!:1 and 1!:11, permit 3 modes of
operation; foreigns for writing to files, 1!:2, 1!:3, and 1!:12, permit
2 modes of operation. The reading modes are:
1. Throw on misencoded utf-8 (default).
2. Pass through misencoded bytes as meta-characters.
3. Intentionally misinterpret the file as being 'UCS-1' encoded rather
than utf-8 encoded.
The writing modes are:
1. Encode as utf-8, passing through meta-characters as the corresponding
octets (default).
2. Misinterpret output as 'UCS-1' and perform no encoding. Only valid
for 1-byte characters.
A recommendation: the UCS-1 misinterpretation should be removed if 1-byte
integers are ever added.
- A new foreign is provided to 'sneeze' character arrays. This is
largely cosmetic, but may be useful for some. If some string uses a
4-byte representation, but in fact, all of its elements' code points are
below 65536, then the result will use a smaller representation. (This
can also be made to work on integers, converting them to a boolean
representation if they are all 0 or 1; this is, again, marginal.)
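As promised above, here is a rough model of monadic U: in terms of today's
primitives. UC is a stand-in name (U: cannot be spelled by a user today);
1 and 4 are the 3!:0 type codes for boolean and integer. This sketches the
intended meaning, not an implementation:
UC =: 3 : 0
if. (3!:0 y) e. 1 4 do.
  10 u: y NB. integers (or booleans) -> characters with those code points
else.
  3 u: y NB. characters -> their code points
end.
)
Under this model, UC 97 243 98 would give the three characters aób, and UC
applied to that result would give 97 243 98 back.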
Future directions:
Provide functionality for unicode normalization, casefolding, grapheme
boundary identification, unicode character properties, and others. Maybe
this should be done by turning U: into a trenchcoat function; or maybe it
should be done by library code. There is the potential to reuse existing
primitives, e.g. <.y might be a lowercased y, but I am wary of such puns.
Thoughts? Comments?
-E
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm