I don't get it. Can you demo with an example?

On Sat, Mar 19, 2022 at 11:15 PM Don Guinn <[email protected]> wrote:
> I use UTF-16 and UTF-32 to try to get the code points of UTF-8 characters
> so I can get each character onto one atom. That way I don't have to worry
> about how many atoms each character takes. Unfortunately, UTF-16 and
> UTF-32 don't guarantee the characters are in one atom each. It would be
> nice if u: had an option to give the code points of Unicode characters.
>
> On Sat, Mar 19, 2022 at 8:49 AM bill lam <[email protected]> wrote:
>
> > Further clarification: the J language itself knows nothing about the
> > Unicode standard. u: is the only place where UTF-8, UTF-16, etc. are
> > relevant.
> >
> > On Sat, 19 Mar 2022 at 10:17 PM bill lam <[email protected]> wrote:
> >
> > > I think the current behavior of u: is correct and intended. First of
> > > all, J's utf8 is not a Unicode datatype; it is merely an
> > > interpretation of 1-byte literals. Similarly, 2-byte and 4-byte
> > > literals aren't exactly UCS-2 and UTF-32, and this is intended.
> > > Operations and comparisons between different types of literal are
> > > done by promotion, atom by atom. This explains the results that you
> > > quoted.
> > >
> > > The handling of Unicode in J is not perfect, but it is consistent
> > > with J's fundamental concepts, such as rank.
> > >
> > > On Sat, 19 Mar 2022 at 7:17 AM Elijah Stone <[email protected]> wrote:
> > >
> > >>    x=: 8 u: 97 243 98  NB. same as entering x=: 'aób'
> > >>    y=: 9 u: x
> > >>    z=: 10 u: 97 195 179 98
> > >>    x
> > >> aób
> > >>    y
> > >> aób
> > >>    z
> > >> aób
> > >>
> > >>    x-:y
> > >> 0
> > >> NB. ??? they look the same
> > >>
> > >>    x-:z
> > >> 1
> > >> NB. ??? they look different
> > >>
> > >>    $x
> > >> 4
> > >> NB. ??? it looks like 3 characters, not 4
> > >>
> > >> Well, this is Unicode. There are good reasons why two things that
> > >> look the same might not actually be the same.
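The x/y/z puzzle in the quote above turns on the split between code units (bytes) and code points. For comparison, the same split is easy to see in Python, used here only as a neutral illustration since the J behavior is the thing under debate: 'aób' is three code points, but ó needs two bytes in UTF-8, so the byte-level view has length 4.

```python
s = "a\u00f3b"                   # 'aób': three code points
utf8 = list(s.encode("utf-8"))   # code-unit (byte) view of the same text
cps = [ord(c) for c in s]        # code-point view

print(utf8)        # [97, 195, 179, 98] -- the 97 195 179 98 given to z above
print(cps)         # [97, 243, 98]      -- the 97 243 98 given to x above
print(len(utf8))   # 4 -- why $x is 4: the 1-byte literal holds bytes, not code points
```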
> > >> For instance:
> > >>
> > >>    ]p=: 10 u: 97 243 98
> > >> aób
> > >>    ]q=: 10 u: 97 111 769 98
> > >> aób
> > >>    p-:q
> > >> 0
> > >>
> > >> But in the above case, x doesn't match y for stupid reasons. And x
> > >> matches z for stupider ones.
> > >>
> > >> J's default (1-byte) character representation is a weird hodge-podge
> > >> of 'UCS-1' (I don't know what else to call it) and UTF-8, and it
> > >> does not seem well thought through. The dictionary page for u: seems
> > >> confused as to whether the 1-byte representation corresponds to
> > >> ASCII or UTF-8, and similarly as to whether the 2-byte
> > >> representation is coded as UCS-2 or UTF-16.
> > >>
> > >> Most charitably, this is exposing low-level aspects of the encoding
> > >> to users; but if so, that is unsuitable for a high-level language
> > >> such as J, and it is inconsistent. I do not have to worry that
> > >> 0 1 1 0 1 1 0 1 will suddenly turn into 36169536663191680, nor that
> > >> 2.718 will suddenly turn into 4613302810693613912, but _that is
> > >> exactly what is happening in the above code_.
> > >>
> > >> I give you the crowning WTF (maybe it is not so surprising at this
> > >> point...):
> > >>
> > >>    x;y;x,y  NB. pls j
> > >> ┌───┬───┬───────┐
> > >> │aób│aób│aóbaób│
> > >> └───┴───┴───────┘
> > >>
> > >> Unicode is delicate and skittish, and must be approached delicately.
> > >> I think that there are some essential conflicts between Unicode and
> > >> J--as the above example with the combining character
> > >> demonstrates--but also that Pandora's box is open: literal data
> > >> _exists_ in J. Given that that is the case, I think it is possible
> > >> and desirable to do much better than the current scheme.
> > >>
> > >> ---
> > >>
> > >> Unicode text can be broken up in a number of ways: graphemes,
> > >> characters, code points, code units...
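The p/q pair quoted above is the legitimate kind of looks-the-same-but-isn't: precomposed ó (U+00F3) versus o plus a combining acute accent (U+0301). A sketch of how normalization reconciles the two, using Python's unicodedata module as a stand-in for the normalization primitives the thread discusses:

```python
import unicodedata

p = "a\u00f3b"    # precomposed: a, ó, b -- code points 97 243 98
q = "ao\u0301b"   # decomposed: a, o, combining acute, b -- 97 111 769 98

print(p == q)                                # False: different code points
print(unicodedata.normalize("NFC", q) == p)  # True: NFC composes o + accent into ó
print([ord(c) for c in unicodedata.normalize("NFD", p)])  # [97, 111, 769, 98]
```

This is why Raku-style normalize-by-default makes the p/q case come out equal, at the cost of not preserving the original code-point sequence.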
> > >> The composition of code units into code points is the only such
> > >> demarcation which is stable and can be counted upon. It is also a
> > >> demarcation which is necessary for pretty much any interesting text
> > >> processing (to the point that I would suggest any form of 'text
> > >> processing' which does not consider code points is not actually
> > >> processing text). Therefore, I suggest that, at a minimum, no
> > >> user-exposed representation of text should acknowledge a delineation
> > >> below that of the code point. If there is any primitive which deals
> > >> in code units, it should be a foreign: scary, obscure, not for
> > >> everyday use.
> > >>
> > >> A non-obvious but good result of the above is that all strings are
> > >> correctly formed by construction. Not all sequences of code units
> > >> are correctly formed and correspond to valid strings of text. But
> > >> all sequences of code points _are_, of necessity, correctly formed;
> > >> otherwise there would be ... problems following additions to
> > >> Unicode. J currently allows us to create malformed strings, but then
> > >> complains when we use them in certain ways:
> > >>
> > >>    9 u: 1 u: 10 u: 254 255
> > >> |domain error
> > >> |       9 u:1 u:10 u:254 255
> > >>
> > >> ---
> > >>
> > >> It is a question whether J should natively recognise delineations
> > >> above the code point. It pains me to suggest that it should not.
> > >>
> > >> Raku (a pointer-chasing language) has the best-thought-out strings
> > >> of any programming language I have encountered. (Unsurprising, given
> > >> it was written by Perl hackers.) In Raku, operations on strings are
> > >> grapheme-oriented. Raku also normalizes all text by default (which
> > >> solves the problem I presented above with combining characters--but
> > >> rest assured, it cannot solve all such problems).
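The `254 255` domain error quoted above is exactly the units-versus-points boundary in action: the code points U+00FE and U+00FF form a perfectly valid string, but the raw octets 254 255 can never occur in well-formed UTF-8 (0xFE and 0xFF are not legal start bytes). A Python sketch of the same boundary:

```python
s = "\u00fe\u00ff"               # code points 254 and 255: a valid string
print(list(s.encode("utf-8")))   # [195, 190, 195, 191] -- its real UTF-8 encoding

try:
    bytes([254, 255]).decode("utf-8")   # the octets 254 255 are never valid UTF-8
except UnicodeDecodeError as e:
    print("malformed:", e.reason)
```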
> > >> They even have a scheme for space-efficient random access to
> > >> strings on this basis.
> > >>
> > >> But J is not Raku, and it is telling that, though Raku has
> > >> multidimensional arrays, its strings are _not_ arrays, and it does
> > >> not have characters. The principal problem is a violation of the
> > >> rules of conformability. For instance, it is not guaranteed that,
> > >> for vectors x and y, (#x,y) -: x +&# y. This is not _so_ terrible
> > >> (though it is pretty bad), but from it follows an obvious problem
> > >> with catenating higher-rank arrays. Similar concerns apply at least
> > >> to i., e., E., and }. That said, I would support the addition of
> > >> primitives to perform normalization (as well as casefolding etc.)
> > >> and identification of grapheme boundaries.
> > >>
> > >> ---
> > >>
> > >> It would be wise of me to address the elephant in the room.
> > >> Characters are not only used to represent text, but also arbitrary
> > >> binary data, e.g. from the network or files, which may in fact be
> > >> malformed as text. I submit that characters are clearly the wrong
> > >> way to represent such data; the right way to represent a sequence of
> > >> _octets_ is using _integers_. But people persist, and there are two
> > >> issues: the first is compatibility, and the second is performance.
> > >>
> > >> Regarding the second, an obvious solution is to add a 1-byte integer
> > >> representation (as Marshall has suggested on at least one occasion),
> > >> but this represents a potentially nontrivial development effort.
> > >> Therefore I suggest an alternate solution, at least for the interim:
> > >> foreigns (scary and obscure, per above) that will _intentionally
> > >> misinterpret_ data from the outside world as 'UCS-1' and represent
> > >> it compactly (or do the opposite).
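The proposed "intentionally misinterpret as 'UCS-1'" foreigns have a close analogue in Python's latin-1 codec, which maps each octet to the code point of the same value; and the meta-character scheme Elijah describes next is close to Python's surrogateescape error handler (which the thread itself cites). A sketch showing that both round-trip arbitrary bytes:

```python
data = bytes([97, 254, 255, 98])   # arbitrary octets; not valid UTF-8

# 'UCS-1' misinterpretation: octet n becomes code point n, losslessly
s1 = data.decode("latin-1")
print([ord(c) for c in s1])        # [97, 254, 255, 98]
assert s1.encode("latin-1") == data

# surrogateescape: valid UTF-8 decodes normally; each bad octet becomes a
# reserved escape code point, recovered on re-encoding
s2 = data.decode("utf-8", errors="surrogateescape")
print([hex(ord(c)) for c in s2])   # ['0x61', '0xdcfe', '0xdcff', '0x62']
assert s2.encode("utf-8", errors="surrogateescape") == data
```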
> > >> Regarding the issue of backwards compatibility, I propose the
> > >> addition of 256 'meta-characters', each corresponding to an octet.
> > >> Attempts to decode correctly formed UTF-8 from the outside world
> > >> will succeed and produce corresponding Unicode; attempts to decode
> > >> malformed UTF-8 may map each incorrect code unit to the
> > >> corresponding meta-character. When encoded, real characters will be
> > >> UTF-8 encoded, but each meta-character will be encoded as its
> > >> corresponding octet. In this way, arbitrary byte streams may be
> > >> passed through J strings; but byte streams which consist entirely or
> > >> partly of valid UTF-8 can be sensibly interpreted. This is similar
> > >> to Raku's utf8-c8, and to Python's surrogateescape.
> > >>
> > >> ---
> > >>
> > >> An implementation detail, sort of. Variable-width representations
> > >> (such as UTF-8) should not be used internally. Many fundamental
> > >> array operations require constant-time random access (with the
> > >> corresponding obvious caveats), which variable-width representations
> > >> cannot provide; and even operations which are inherently
> > >> sequential--like E., i., ;., #--may be more difficult or impossible
> > >> to optimize to the same degree. Fixed-width representations
> > >> therefore provide more predictable performance, better performance
> > >> in nearly all cases, and better asymptotic performance for many
> > >> interesting applications.
> > >>
> > >> (The UCS-1 misinterpretation mentioned above is a loophole which
> > >> allows people who really care about space to do the variable-width
> > >> part themselves.)
> > >>
> > >> ---
> > >>
> > >> I therefore suggest the following language changes, probably to be
> > >> deferred to version 10:
> > >>
> > >> - 1-, 2-, and 4-byte character representations are still used
> > >>   internally. They are fixed-width, with each code unit representing
> > >>   one code point.
> > >>   In the 4-byte representation, because there are more 32-bit
> > >>   values than Unicode code points, some 32-bit values may correspond
> > >>   to passed-through bytes of misencoded UTF-8. In this way, a J
> > >>   literal can round-trip arbitrary byte sequences. The remainder of
> > >>   the 32-bit value space is completely inaccessible.
> > >>
> > >> - A new primitive verb U:, to replace u:. u: is removed. U: has a
> > >>   different name so that old code will break loudly, rather than
> > >>   quietly. If y is an array of integers, then U:y is an array of
> > >>   characters with corresponding code points; and if y is an array of
> > >>   characters, then U:y is an array of their code points.
> > >>   (Alternately, make a. impractically large and rely on a.i.y and
> > >>   x{a. for everything. I disrecommend this for the same reason that
> > >>   we have j. and r., and do not write x = 0j1 * y or x * ^ 0j1 * y.)
> > >>
> > >> - Foreigns for reading from files, like 1!:1 and 1!:11, permit 3
> > >>   modes of operation; foreigns for writing to files, 1!:2, 1!:3, and
> > >>   1!:12, permit 2 modes of operation. The reading modes are:
> > >>
> > >>   1. Throw on misencoded UTF-8 (default).
> > >>   2. Pass through misencoded bytes as meta-characters.
> > >>   3. Intentionally misinterpret the file as being 'UCS-1' encoded
> > >>      rather than UTF-8 encoded.
> > >>
> > >>   The writing modes are:
> > >>
> > >>   1. Encode as UTF-8, passing through meta-characters as the
> > >>      corresponding octets (default).
> > >>   2. Misinterpret output as 'UCS-1' and perform no encoding. Only
> > >>      valid for 1-byte characters.
> > >>
> > >>   A recommendation: the UCS-1 misinterpretation should be removed if
> > >>   1-byte integers are ever added.
> > >>
> > >> - A new foreign is provided to 'sneeze' character arrays. This is
> > >>   largely cosmetic, but may be useful for some.
> > >>   If some string uses a 4-byte representation, but in fact all of
> > >>   its elements' code points are below 65536, then the result will
> > >>   use a smaller representation. (This can also do work on integers,
> > >>   as it can convert them to a boolean representation if they are all
> > >>   0 or 1; this is, again, marginal.)
> > >>
> > >> Future directions:
> > >>
> > >> Provide functionality for Unicode normalization, casefolding,
> > >> grapheme boundary identification, Unicode character properties, and
> > >> others. Maybe this should be done by turning U: into a trenchcoat
> > >> function; or maybe it should be done by library code. There is the
> > >> potential to reuse existing primitives, e.g. <.y might be a
> > >> lowercased y, but I am wary of such puns.
> > >>
> > >> Thoughts? Comments?
> > >>
> > >> -E
> > >> ----------------------------------------------------------------------
> > >> For information about J forums see http://www.jsoftware.com/forums.htm
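To close the loop on the 'sneeze' proposal above, here is a sketch of what such a foreign would compute, written in Python with an invented name (no such foreign exists): the narrowest fixed-width bytes-per-code-point size that can still hold every code point in a string.

```python
def sneeze_width(s):
    """Narrowest fixed-width element size (1, 2, or 4 bytes) for string s."""
    m = max((ord(c) for c in s), default=0)
    return 1 if m < 256 else 2 if m < 65536 else 4

print(sneeze_width("a\u00f3b"))    # 1: ó is U+00F3, which fits in one byte
print(sneeze_width("\u20ac10"))    # 2: € is U+20AC, which needs two
print(sneeze_width("\U0001f600"))  # 4: outside the BMP
```

The interesting design point is that this is a pure storage optimization: the result is -: to its argument, just held more compactly, exactly like the proposed integer-to-boolean demotion.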
