I use UTF-16 and UTF-32 to try to get the code points of UTF-8 characters, so that each character lands on exactly one atom and I don't have to worry about how many atoms a character takes. Unfortunately, UTF-16 and UTF-32 don't guarantee one character per atom either. It would be nice if u: had an option that gives the code points of unicode characters directly.
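
For example, any code point above U+FFFF takes two UTF-16 code units, so that character spans two atoms no matter how the conversion is done. The surrogate arithmetic can be checked directly in J (a worked sketch, using U+1F600 as the test code point):

   cp =: 128512                          NB. U+1F600, above U+FFFF
   hi =: 55296 + <. 1024 %~ cp - 65536   NB. high surrogate (0xD83D)
   lo =: 56320 + 1024 | cp - 65536       NB. low surrogate (0xDE00)
   hi , lo                               NB. one character, two UTF-16 atoms
55357 56832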

On Sat, Mar 19, 2022 at 8:49 AM bill lam <[email protected]> wrote:

> Further clarification: the J language itself knows nothing about the
> unicode standard. u: is the only place where utf8, utf16, etc. are
> relevant.
>
> On Sat, 19 Mar 2022 at 10:17 PM bill lam <[email protected]> wrote:
>
> > I think the current behavior of u: is correct and intended.
> > First of all, J utf8 is not a unicode datatype; it is merely an
> > interpretation of 1-byte literals. Similarly, 2-byte and 4-byte
> > literals aren't exactly ucs2 and utf32, and this is intended.
> > Operations and comparisons between different types of literal are
> > done by promotion, atom by atom. This explains the results that you
> > quoted.
> >
> > The handling of unicode in J is not perfect, but it is consistent
> > with J's fundamental concepts, such as rank.
> >
> > On Sat, 19 Mar 2022 at 7:17 AM Elijah Stone <[email protected]> wrote:
> >
> >>    x=: 8 u: 97 243 98   NB. same as entering  x=: 'aób'
> >>    y=: 9 u: x
> >>    z=: 10 u: 97 195 179 98
> >>    x
> >> aób
> >>    y
> >> aób
> >>    z
> >> aób
> >>
> >>    x-:y
> >> 0
> >>    NB. ??? they look the same
> >>    x-:z
> >> 1
> >>    NB. ??? they look different
> >>    $x
> >> 4
> >>    NB. ??? it looks like 3 characters, not 4
> >>
> >> Well, this is unicode. There are good reasons why two things that
> >> look the same might not actually be the same. For instance:
> >>
> >>    ]p=: 10 u: 97 243 98
> >> aób
> >>    ]q=: 10 u: 97 111 769 98
> >> aób
> >>    p-:q
> >> 0
> >>
> >> But in the above case, x doesn't match y for stupid reasons. And x
> >> matches z for stupider ones.
> >>
> >> J's default (1-byte) character representation is a weird hodge-podge
> >> of 'UCS-1' (I don't know what else to call it) and UTF-8, and it
> >> does not seem well thought through. The dictionary page for u: seems
> >> confused as to whether the 1-byte representation corresponds to
> >> ASCII or UTF-8, and similarly as to whether the 2-byte
> >> representation is coded as UCS-2 or UTF-16.
> >>
> >> Most charitably, this is exposing low-level aspects of the encoding
> >> to users; but if so, that is unsuitable for a high-level language
> >> such as j, and it is inconsistent. I do not have to worry that
> >> 0 1 1 0 1 1 0 1 will suddenly turn into 36169536663191680, nor that
> >> 2.718 will suddenly turn into 4613302810693613912, but _that is
> >> exactly what is happening in the above code_.
> >>
> >> I give you the crowning WTF (maybe it is not so surprising at this
> >> point...):
> >>
> >>    x;y;x,y   NB. pls j
> >> ┌───┬───┬───────┐
> >> │aób│aób│aÃ³baób│
> >> └───┴───┴───────┘
> >>
> >> Unicode is delicate and skittish, and must be approached delicately.
> >> I think that there are some essential conflicts between unicode and
> >> j--as the above example with the combining character
> >> demonstrates--but also that pandora's box is open: literal data
> >> _exists_ in j. Given that that is the case, I think it is possible
> >> and desirable to do much better than the current scheme.
> >>
> >> ---
> >>
> >> Unicode text can be broken up in a number of ways. Graphemes,
> >> characters, code points, code units...
> >>
> >> The composition of code units into code points is the only such
> >> demarcation which is stable and can be counted upon. It is also a
> >> demarcation which is necessary for pretty much any interesting text
> >> processing (to the point that I would suggest that any form of 'text
> >> processing' which does not consider code points is not actually
> >> processing text).
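> >>
> >> To make the unit/point distinction concrete with the variables
> >> defined above (a sketch, assuming 3 u: reports the numeric value of
> >> each atom, which is my reading of its current behaviour):
> >>
> >>    a. i. x   NB. four utf-8 code units; the ó takes two
> >> 97 195 179 98
> >>    3 u: y    NB. but only three code points
> >> 97 243 98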
> >>
> >> Therefore, I suggest that, at a minimum, no user-exposed
> >> representation of text should acknowledge a delineation below that
> >> of the code point. If there is any primitive which deals in code
> >> units, it should be a foreign: scary, obscure, not for everyday use.
> >>
> >> A non-obvious but good result of the above is that all strings are
> >> correctly formed by construction. Not all sequences of code units
> >> are correctly formed and correspond to valid strings of text. But
> >> all sequences of code points _are_, of necessity, correctly formed;
> >> otherwise there would be ... problems following additions to
> >> unicode. J currently allows us to create malformed strings, but then
> >> complains when we use them in certain ways:
> >>
> >>    9 u: 1 u: 10 u: 254 255
> >> |domain error
> >> |   9 u:1 u:10 u:254 255
> >>
> >> ---
> >>
> >> It is a question whether j should natively recognise delineations
> >> above the code point. It pains me to suggest that it should not.
> >>
> >> Raku (a pointer-chasing language) has the best-thought-out strings
> >> of any programming language I have encountered. (Unsurprising, given
> >> it was written by perl hackers.) In raku, operations on strings are
> >> grapheme-oriented. Raku also normalizes all text by default (which
> >> solves the problem I presented above with combining characters--but
> >> rest assured, it cannot solve all such problems). They even have a
> >> scheme for space-efficient random access to strings on this basis.
> >>
> >> But j is not raku, and it is telling that, though raku has
> >> multidimensional arrays, its strings are _not_ arrays, and it does
> >> not have characters. The principal problem is a violation of the
> >> rules of conformability. For instance, it is not guaranteed that,
> >> for vectors x and y, (#x,y) -: x +&# y. This is not _so_ terrible
> >> (though it is pretty bad), but from it follows an obvious problem
> >> with catenating higher-rank arrays. Similar concerns apply at least
> >> to i., e., E., and }. That said, I would support the addition of
> >> primitives to perform normalization (as well as casefolding etc.)
> >> and identification of grapheme boundaries.
> >>
> >> ---
> >>
> >> It would be wise of me to address the elephant in the room.
> >> Characters are not only used to represent text, but also arbitrary
> >> binary data, e.g. from the network or files, which may in fact be
> >> malformed as text. I submit that characters are clearly the wrong
> >> way to represent such data; the right way to represent a sequence of
> >> _octets_ is using _integers_. But people persist, and there are two
> >> issues: the first is compatibility, and the second is performance.
> >>
> >> Regarding the second, an obvious solution is to add a 1-byte integer
> >> representation (as Marshall has suggested on at least one occasion),
> >> but this represents a potentially nontrivial development effort.
> >> Therefore I suggest an alternate solution, at least for the interim:
> >> foreigns (scary and obscure, per above) that will _intentionally
> >> misinterpret_ data from the outside world as 'UCS-1' and represent
> >> it compactly (or do the opposite).
> >>
> >> Regarding the issue of backwards compatibility, I propose the
> >> addition of 256 'meta-characters', each corresponding to an octet.
> >> Attempts to decode correctly formed utf-8 from the outside world
> >> will succeed and produce the corresponding unicode; attempts to
> >> decode malformed utf-8 may map each incorrect code unit to the
> >> corresponding meta-character. When encoded, real characters will be
> >> utf-8 encoded, but each meta-character will be encoded as its
> >> corresponding octet. In this way, arbitrary byte streams may be
> >> passed through j strings; but byte streams which consist entirely or
> >> partly of valid utf-8 can be sensibly interpreted. This is similar
> >> to raku's utf8-c8, and to python's surrogateescape.
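> >>
> >> (To illustrate just the escape mapping, not a full decoder: python
> >> reserves the lone-surrogate range for this, so a bad octet b becomes
> >> code point 56320+b, a value no well-formed text can contain, and
> >> comes back out unchanged. In J arithmetic, with hypothetical names:)
> >>
> >>    esc =: 56320 + ]       NB. octet -> meta code point
> >>    unesc =: -&56320       NB. meta code point -> octet
> >>    unesc esc 254 255      NB. round trip: the bad bytes survive
> >> 254 255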
> >>
> >> ---
> >>
> >> An implementation detail, sort of: variable-width representations
> >> (such as utf-8) should not be used internally. Many fundamental
> >> array operations require constant-time random access (with the
> >> corresponding obvious caveats), which variable-width representations
> >> cannot provide; and even operations which are inherently
> >> sequential--like E., i., ;., #--may be more difficult or impossible
> >> to optimize to the same degree. Fixed-width representations
> >> therefore provide more predictable performance, better performance
> >> in nearly all cases, and better asymptotic performance for many
> >> interesting applications.
> >>
> >> (The UCS-1 misinterpretation mentioned above is a loophole which
> >> allows people who really care about space to do the variable-width
> >> part themselves.)
> >>
> >> ---
> >>
> >> I therefore suggest the following language changes, probably to be
> >> deferred to version 10:
> >>
> >> - 1-, 2-, and 4-byte character representations are still used
> >>   internally. They are fixed-width, with each code unit representing
> >>   one code point. In the 4-byte representation, because there are
> >>   more 32-bit values than unicode code points, some 32-bit values
> >>   may correspond to passed-through bytes of misencoded utf8. In this
> >>   way, a j literal can round-trip arbitrary byte sequences. The
> >>   remainder of the 32-bit value space is completely inaccessible.
> >>
> >> - A new primitive verb U:, to replace u:, which is removed. U: has a
> >>   different name so that old code will break loudly rather than
> >>   quietly. If y is an array of integers, then U:y is an array of
> >>   characters with the corresponding code points; and if y is an
> >>   array of characters, then U:y is an array of their code points.
> >>   (A rough model in terms of current primitives follows this list.)
> >>   (Alternately, make a. impractically large and rely on a.i.y and
> >>   x{a. for everything. I disrecommend this for the same reason that
> >>   we have j. and r., and do not write x = 0j1 * y or x * ^ 0j1 * y.)
> >>
> >> - Foreigns for reading from files, like 1!:1 and 1!:11, permit 3
> >>   modes of operation; foreigns for writing to files, 1!:2, 1!:3, and
> >>   1!:12, permit 2. The reading modes are:
> >>
> >>   1. Throw on misencoded utf-8 (default).
> >>   2. Pass through misencoded bytes as meta-characters.
> >>   3. Intentionally misinterpret the file as being 'UCS-1' encoded
> >>      rather than utf-8 encoded.
> >>
> >>   The writing modes are:
> >>
> >>   1. Encode as utf-8, passing through meta-characters as the
> >>      corresponding octets (default).
> >>   2. Misinterpret output as 'UCS-1' and perform no encoding. Only
> >>      valid for 1-byte characters.
> >>
> >>   A recommendation: the UCS-1 misinterpretation should be removed if
> >>   1-byte integers are ever added.
> >>
> >> - A new foreign is provided to 'sneeze' character arrays. This is
> >>   largely cosmetic, but may be useful for some. If some string uses
> >>   a 4-byte representation, but in fact all of its elements' code
> >>   points are below 65536, then the result will use a smaller
> >>   representation. (This can also do work on integers, as it can
> >>   convert them to a boolean representation if they are all 0 or 1;
> >>   this is, again, marginal.)
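> >>
> >> As promised, a rough model of U: in terms of today's primitives
> >> (assuming monadic u: builds characters from code points and 3 u:
> >> reads each atom's value back out; sketch names, not a spec):
> >>
> >>    toCP =: 3&u:              NB. characters -> code points
> >>    fromCP =: u:              NB. code points -> characters
> >>    toCP fromCP 97 243 98     NB. U: would round-trip exactly
> >> 97 243 98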
> >>
> >> Future directions:
> >>
> >> Provide functionality for unicode normalization, casefolding,
> >> grapheme boundary identification, unicode character properties, and
> >> others. Maybe this should be done by turning U: into a trenchcoat
> >> function; or maybe it should be done by library code. There is the
> >> potential to reuse existing primitives, e.g. <.y might be a
> >> lowercased y, but I am wary of such puns.
> >>
> >> Thoughts? Comments?
> >>
> >> -E

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
