x=: 8 u: 97 243 98 NB. same as entering x=: 'aób'
y=: 9 u: x
z=: 10 u: 97 195 179 98
x
aób
y
aób
z
aÃ³b
x-:y
0
NB. ??? they look the same
x-:z
1
NB. ??? they look different
$x
4
NB. ??? it looks like 3 characters, not 4
Well, this is unicode. There are good reasons why two things that look
the same might not actually be the same. For instance:
]p=: 10 u: 97 243 98
aób
]q=: 10 u: 97 111 769 98
aób
p-:q
0
But in the above case, x doesn't match y for stupid reasons. And x
matches z for stupider ones.
J's default (1-byte) character representation is a weird hodge-podge of
'UCS-1' (each byte read as a code point below 256; I don't know what else
to call it) and UTF-8, and it does not
seem well thought through. The dictionary page for u: seems confused as
to whether the 1-byte representation corresponds to ASCII or UTF-8, and
similarly as to whether the 2-byte representation is coded as UCS-2 or
UTF-16.
Most charitably, this is a deliberate exposure of low-level aspects of the
encoding to users; but if so, that is unsuitable for a high-level language
such as j, and it is applied inconsistently. I do not have to worry that 0 1 1 0 1 1 0 1 will
suddenly turn into 36169536663191680, nor that 2.718 will suddenly turn
into 4613302810693613912, but _that is exactly what is happening in the
above code_.
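(You can see the shape-shifting directly. datatype, from the standard
library, names the internal type reported by 3!:0; I leave the exact output
to your own session, since it may vary by J version:)
datatype&.> x;y;z NB. expect x to be plain 1-byte literal, y and z to be wide unicode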
I give you the crowning WTF (maybe it is not so surprising at this
point...):
x;y;x,y NB. pls j
┌───┬───┬───────┐
│aób│aób│aÃ³baób│
└───┴───┴───────┘
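(What appears to have happened: the catenation promoted each byte of x to a
code point, 'UCS-1'-style, instead of decoding it as utf-8. If you want to
check in your own session:)
$ x,y NB. expect 7, not 6: each of the four bytes of x became a code point
3 u: x,y NB. expect 97 195 179 98 97 243 98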
Unicode is delicate and skittish, and must be approached with care. I
think that there are some essential conflicts between unicode and j--as
the above example with the combining character demonstrates--but also that
pandora's box is open: literal data _exists_ in j. That being the case, I
think it is possible and desirable to do much better than the current
scheme.
---
Unicode text can be broken up in a number of ways. Graphemes, characters,
code points, code units...
The composition of code units into code points is the only such
demarcation which is stable and can be counted upon. It is also a
demarcation which is necessary for pretty much any interesting text
processing (to the point that I would suggest any form of 'text
processing' which does not consider code points is not actually processing
text). Therefore, I suggest that, at a minimum, no user-exposed
representation of text should acknowledge a delineation below that of the
code point. If there is any primitive which deals in code units, it
should be a foreign: scary, obscure, not for everyday use.
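(To make the distinction concrete with the strings from earlier, both of
which display as aób:)
# 8 u: 97 243 98 NB. 4 code units (utf-8 bytes)
# 10 u: 97 243 98 NB. 3 code points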
A non-obvious but good result of the above is that all strings are
correctly formed by construction. Not every sequence of code units is
correctly formed and corresponds to a valid string of text. But all
sequences of code points _are_, of necessity, correctly formed, otherwise
there would be ... problems following additions to unicode. J currently
allows us to create malformed strings, but then complains when we use them
in certain ways:
9 u: 1 u: 10 u: 254 255
|domain error
| 9 u:1 u:10 u:254 255
---
There is also the question of whether j should natively recognise
delineations above the code point. It pains me to suggest that it should
not.
Raku (a pointer-chasing language) has the best-thought-out strings of any
programming language I have encountered. (Unsurprising, given it was
written by perl hackers.) In raku, operations on strings are
grapheme-oriented. Raku also normalizes all text by default (which solves
the problem I presented above with combining characters--but rest assured,
it cannot solve all such problems). They even have a scheme for
space-efficient random access to strings on this basis.
But j is not raku, and it is telling that, though raku has
multidimensional arrays, its strings are _not_ arrays, and it does not
have characters. The principal problem is a violation of the rules of
conformability. For instance, it is not guaranteed that, for vectors x
and y, (#x,y) -: x +&# y: 'o' and a lone combining acute are one grapheme
each, yet their catenation is a single grapheme, so the lengths do not add.
This is not _so_ terrible (though it is pretty bad), but from it follows an
obvious problem with catenating higher-rank arrays. Similar concerns apply
at least to i., e., E., and }. That said,
I would support the addition of primitives to perform normalization (as
well as casefolding etc.) and identification of grapheme boundaries.
---
It would be wise of me to address the elephant in the room. Characters
are not only used to represent text, but also arbitrary binary data, e.g.
from the network or files, which may in fact be malformed as text. I
submit that characters are clearly the wrong way to represent such data;
the right way to represent a sequence of _octets_ is using _integers_.
But people persist, and there are two issues: the first is compatibility,
and the second is performance.
Regarding the second, performance: an obvious solution is to add a 1-byte
integer representation (as Marshall has suggested on at least one
occasion), but this represents a potentially nontrivial development
effort. Therefore I suggest an alternative solution, at least for the
interim: foreigns (scary
and obscure, per above) that will _intentionally misinterpret_ data from
the outside world as 'UCS-1' and represent it compactly (or do the
opposite).
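(For a sense of the space at stake: 7!:5 reports the bytes used by named
nouns; the names and figures here are illustrative, and the exact numbers
depend on version and headers, but the ratio should be roughly eight to
one:)
octets =: 256 | i. 10000 NB. ten thousand octet values held as integers
asbytes =: a. {~ octets NB. the same values held as 1-byte literals
7!:5 ;: 'octets asbytes' NB. expect roughly 8 bytes per atom vs 1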
Regarding the issue of backwards compatibility, I propose the addition of
256 'meta-characters', each corresponding to an octet. Attempts to decode
correctly formed utf-8 from the outside world will succeed and produce
corresponding unicode; attempts to decode malformed utf-8 may map each
incorrect code unit to the corresponding meta-character. When encoded,
real characters will be utf-8 encoded, but each meta-character will be
encoded as its corresponding octet. In this way, arbitrary byte streams
may be passed through j strings; but byte streams which consist entirely
or partly of valid utf-8 can be sensibly interpreted. This is similar to
raku's utf8-c8, and to python's surrogateescape.
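(A toy model of the octet-to-character mapping in today's J. The names esc
and unesc and the choice of the private use block at u+e000 are mine,
purely for illustration; real meta-characters would live outside unicode
proper, and the decoder, not the user, would apply the mapping:)
esc =: 10&u: @: (57344 + a.&i.) NB. octet -> stand-in character at u+e000 plus the octet's value
unesc =: a. {~ (57344 -~ 3&u:) NB. stand-in character -> octet
a. -: unesc esc a. NB. all 256 octets round-trip; expect 1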
---
An implementation detail, sort of. Variable-width representations (such
as utf-8) should not be used internally. Many fundamental array
operations require constant-time random access (with the corresponding
obvious caveats), which variable-width representations cannot provide; and
even operations which are inherently sequential--like E., i., ;., #--may
be more difficult or impossible to optimize to the same degree.
Fixed-width representations therefore provide more predictable
performance, better performance in nearly all cases, and better asymptotic
performance for many interesting applications.
(The UCS-1 misinterpretation mentioned above is a loophole which allows
people who really care about space to do the variable-width part
themselves.)
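(To restate the random-access point with the strings from earlier: indexing
a utf-8 byte string by position hands back a code unit, not a character:)
1 { 8 u: 97 243 98 NB. the second *byte* of the utf-8 for 'aób': half of the ó
1 { 10 u: 97 243 98 NB. the second code point: ó itself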
---
I therefore suggest the following language changes, probably to be
deferred to version 10:
- 1, 2, and 4-byte character representations are still used internally.
They are fixed-width, with each code unit representing one code point.
In the 4-byte representation, because there are more 32-bit values than
unicode code points, some 32-bit values may correspond to passed-through
bytes of misencoded utf-8. In this way, a j literal can round-trip
arbitrary byte sequences. The remainder of the 32-bit value space is
completely inaccessible.
- A new primitive verb U:, to replace u:. u: is removed. U: has a
different name so that old code will break loudly, rather than quietly.
If y is an array of integers, then U:y is an array of characters with the
corresponding code points; and if y is an array of characters, then U:y
is an array of their code points. (A rough model in terms of today's
primitives follows this list.) (Alternatively, make a. impractically
large and rely on a.i.y and x{a. for everything. I disrecommend this
for the same reason that we have j. and r., and do not write x = 0j1 * y
or x * ^ 0j1 * y.)
- Foreigns for reading from files, like 1!:1 and 1!:11, permit 3 modes of
operation; foreigns for writing to files, 1!:2, 1!:3, and 1!:12, permit
2 modes of operation. The reading modes are:
1. Throw on misencoded utf-8 (default).
2. Pass through misencoded bytes as meta-characters.
3. Intentionally misinterpret the file as being 'UCS-1' encoded rather
than utf-8 encoded.
The writing modes are:
1. Encode as utf-8, passing through meta-characters as the corresponding
octets (default).
2. Misinterpret output as 'UCS-1' and perform no encoding. Only valid
for 1-byte characters.
A recommendation: the UCS-1 misinterpretation should be removed if 1-byte
integers are ever added.
- A new foreign is provided to 'sneeze' character arrays. This is
largely cosmetic, but may be useful for some. If some string uses a
4-byte representation, but in fact, all of its elements' code points are
below 65536, then the result will use a smaller representation. (This
can also be made to work on integers, converting them to a boolean
representation if they are all 0 or 1; this is, again, marginal.)
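As promised above, here is a rough model of monadic U: in terms of today's
primitives. UC is a stand-in name (U: cannot be spelled by a user today);
1 and 4 are the 3!:0 type codes for boolean and integer. This sketches the
intended meaning, not an implementation:
UC =: 3 : 0
if. (3!:0 y) e. 1 4 do.
  10 u: y NB. integers (or booleans) -> characters with those code points
else.
  3 u: y NB. characters -> their code points
end.
)
Under this model, UC 97 243 98 would give the three characters aób, and UC
applied to that result would give 97 243 98 back.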
Future directions:
Provide functionality for unicode normalization, casefolding, grapheme
boundary identification, unicode character properties, and others. Maybe
this should be done by turning U: into a trenchcoat function; or maybe it
should be done by library code. There is the potential to reuse existing
primitives, e.g. <.y might be a lowercased y, but I am wary of such puns.
Thoughts? Comments?
-E
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm