I use UTF-16 and UTF-32 to try to get the code points of UTF-8 characters, so that each character lands on exactly one atom and I don't have to worry about how many atoms a character takes. Unfortunately, UTF-16 and UTF-32 don't guarantee one character per atom either. It would be nice if u: had an option that gives the code points of unicode characters directly.
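
For example, any code point above U+FFFF takes two UTF-16 code units, so that character spans two atoms no matter how the conversion is done. The surrogate arithmetic can be checked directly in J (a worked sketch, using U+1F600 as the test code point):

   cp =: 128512                          NB. U+1F600, above U+FFFF
   hi =: 55296 + <. 1024 %~ cp - 65536   NB. high surrogate (0xD83D)
   lo =: 56320 + 1024 | cp - 65536       NB. low surrogate (0xDE00)
   hi , lo                               NB. one character, two UTF-16 atoms
55357 56832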

On Sat, Mar 19, 2022 at 8:49 AM bill lam <[email protected]> wrote:

> Further clarification: the J language itself knows nothing about the
> unicode standard. u: is the only place where utf8, utf16, etc. are
> relevant.
>
> On Sat, 19 Mar 2022 at 10:17 PM bill lam <[email protected]> wrote:
>
> > I think the current behavior of u: is correct and intended.
> > First of all, J utf8 is not a unicode datatype; it is merely an
> > interpretation of 1-byte literals. Similarly, 2-byte and 4-byte
> > literals aren't exactly ucs2 and utf32, and this is intended.
> > Operations and comparisons between different types of literal are
> > done by promotion, atom by atom. This explains the results that you
> > quoted.
> >
> > The handling of unicode in J is not perfect, but it is consistent
> > with J's fundamental concepts, such as rank.
> >
> > On Sat, 19 Mar 2022 at 7:17 AM Elijah Stone <[email protected]> wrote:
> >
> >>    x=: 8 u: 97 243 98   NB. same as entering  x=: 'aób'
> >>    y=: 9 u: x
> >>    z=: 10 u: 97 195 179 98
> >>    x
> >> aób
> >>    y
> >> aób
> >>    z
> >> aób
> >>
> >>    x-:y
> >> 0
> >>    NB. ??? they look the same
> >>    x-:z
> >> 1
> >>    NB. ??? they look different
> >>    $x
> >> 4
> >>    NB. ??? it looks like 3 characters, not 4
> >>
> >> Well, this is unicode. There are good reasons why two things that
> >> look the same might not actually be the same. For instance:
> >>
> >>    ]p=: 10 u: 97 243 98
> >> aób
> >>    ]q=: 10 u: 97 111 769 98
> >> aób
> >>    p-:q
> >> 0
> >>
> >> But in the above case, x doesn't match y for stupid reasons. And x
> >> matches z for stupider ones.
> >>
> >> J's default (1-byte) character representation is a weird hodge-podge
> >> of 'UCS-1' (I don't know what else to call it) and UTF-8, and it
> >> does not seem well thought through. The dictionary page for u: seems
> >> confused as to whether the 1-byte representation corresponds to
> >> ASCII or UTF-8, and similarly as to whether the 2-byte
> >> representation is coded as UCS-2 or UTF-16.
> >>
> >> Most charitably, this is exposing low-level aspects of the encoding
> >> to users; but if so, that is unsuitable for a high-level language
> >> such as j, and it is inconsistent. I do not have to worry that
> >> 0 1 1 0 1 1 0 1 will suddenly turn into 36169536663191680, nor that
> >> 2.718 will suddenly turn into 4613302810693613912, but _that is
> >> exactly what is happening in the above code_.
> >>
> >> I give you the crowning WTF (maybe it is not so surprising at this
> >> point...):
> >>
> >>    x;y;x,y   NB. pls j
> >> ┌───┬───┬───────┐
> >> │aób│aób│aÃ³baób│
> >> └───┴───┴───────┘
> >>
> >> Unicode is delicate and skittish, and must be approached delicately.
> >> I think that there are some essential conflicts between unicode and
> >> j--as the above example with the combining character
> >> demonstrates--but also that pandora's box is open: literal data
> >> _exists_ in j. Given that that is the case, I think it is possible
> >> and desirable to do much better than the current scheme.
> >>
> >> ---
> >>
> >> Unicode text can be broken up in a number of ways. Graphemes,
> >> characters, code points, code units...
> >>
> >> The composition of code units into code points is the only such
> >> demarcation which is stable and can be counted upon. It is also a
> >> demarcation which is necessary for pretty much any interesting text
> >> processing (to the point that I would suggest that any form of 'text
> >> processing' which does not consider code points is not actually
> >> processing text).
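> >>
> >> To make the unit/point distinction concrete with the variables
> >> defined above (a sketch, assuming 3 u: reports the numeric value of
> >> each atom, which is my reading of its current behaviour):
> >>
> >>    a. i. x   NB. four utf-8 code units; the ó takes two
> >> 97 195 179 98
> >>    3 u: y    NB. but only three code points
> >> 97 243 98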
> >>
> >> Therefore, I suggest that, at a minimum, no user-exposed
> >> representation of text should acknowledge a delineation below that
> >> of the code point. If there is any primitive which deals in code
> >> units, it should be a foreign: scary, obscure, not for everyday use.
> >>
> >> A non-obvious but good result of the above is that all strings are
> >> correctly formed by construction. Not all sequences of code units
> >> are correctly formed and correspond to valid strings of text. But
> >> all sequences of code points _are_, of necessity, correctly formed;
> >> otherwise there would be ... problems following additions to
> >> unicode. J currently allows us to create malformed strings, but then
> >> complains when we use them in certain ways:
> >>
> >>    9 u: 1 u: 10 u: 254 255
> >> |domain error
> >> |   9 u:1 u:10 u:254 255
> >>
> >> ---
> >>
> >> It is a question whether j should natively recognise delineations
> >> above the code point. It pains me to suggest that it should not.
> >>
> >> Raku (a pointer-chasing language) has the best-thought-out strings
> >> of any programming language I have encountered. (Unsurprising, given
> >> it was written by perl hackers.) In raku, operations on strings are
> >> grapheme-oriented. Raku also normalizes all text by default (which
> >> solves the problem I presented above with combining characters--but
> >> rest assured, it cannot solve all such problems). They even have a
> >> scheme for space-efficient random access to strings on this basis.
> >>
> >> But j is not raku, and it is telling that, though raku has
> >> multidimensional arrays, its strings are _not_ arrays, and it does
> >> not have characters. The principal problem is a violation of the
> >> rules of conformability. For instance, it is not guaranteed that,
> >> for vectors x and y, (#x,y) -: x +&# y. This is not _so_ terrible
> >> (though it is pretty bad), but from it follows an obvious problem
> >> with catenating higher-rank arrays. Similar concerns apply at least
> >> to i., e., E., and }. That said, I would support the addition of
> >> primitives to perform normalization (as well as casefolding etc.)
> >> and identification of grapheme boundaries.
> >>
> >> ---
> >>
> >> It would be wise of me to address the elephant in the room.
> >> Characters are not only used to represent text, but also arbitrary
> >> binary data, e.g. from the network or files, which may in fact be
> >> malformed as text. I submit that characters are clearly the wrong
> >> way to represent such data; the right way to represent a sequence of
> >> _octets_ is using _integers_. But people persist, and there are two
> >> issues: the first is compatibility, and the second is performance.
> >>
> >> Regarding the second, an obvious solution is to add a 1-byte integer
> >> representation (as Marshall has suggested on at least one occasion),
> >> but this represents a potentially nontrivial development effort.
> >> Therefore I suggest an alternate solution, at least for the interim:
> >> foreigns (scary and obscure, per above) that will _intentionally
> >> misinterpret_ data from the outside world as 'UCS-1' and represent
> >> it compactly (or do the opposite).
> >>
> >> Regarding the issue of backwards compatibility, I propose the
> >> addition of 256 'meta-characters', each corresponding to an octet.
> >> Attempts to decode correctly formed utf-8 from the outside world
> >> will succeed and produce the corresponding unicode; attempts to
> >> decode malformed utf-8 may map each incorrect code unit to the
> >> corresponding meta-character. When encoded, real characters will be
> >> utf-8 encoded, but each meta-character will be encoded as its
> >> corresponding octet. In this way, arbitrary byte streams may be
> >> passed through j strings; but byte streams which consist entirely or
> >> partly of valid utf-8 can be sensibly interpreted. This is similar
> >> to raku's utf8-c8, and to python's surrogateescape.
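> >>
> >> (To illustrate just the escape mapping, not a full decoder: python
> >> reserves the lone-surrogate range for this, so a bad octet b becomes
> >> code point 56320+b, a value no well-formed text can contain, and
> >> comes back out unchanged. In J arithmetic, with hypothetical names:)
> >>
> >>    esc =: 56320 + ]       NB. octet -> meta code point
> >>    unesc =: -&56320       NB. meta code point -> octet
> >>    unesc esc 254 255      NB. round trip: the bad bytes survive
> >> 254 255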
> >>
> >> ---
> >>
> >> An implementation detail, sort of: variable-width representations
> >> (such as utf-8) should not be used internally. Many fundamental
> >> array operations require constant-time random access (with the
> >> corresponding obvious caveats), which variable-width representations
> >> cannot provide; and even operations which are inherently
> >> sequential--like E., i., ;., #--may be more difficult or impossible
> >> to optimize to the same degree. Fixed-width representations
> >> therefore provide more predictable performance, better performance
> >> in nearly all cases, and better asymptotic performance for many
> >> interesting applications.
> >>
> >> (The UCS-1 misinterpretation mentioned above is a loophole which
> >> allows people who really care about space to do the variable-width
> >> part themselves.)
> >>
> >> ---
> >>
> >> I therefore suggest the following language changes, probably to be
> >> deferred to version 10:
> >>
> >> - 1-, 2-, and 4-byte character representations are still used
> >>   internally. They are fixed-width, with each code unit representing
> >>   one code point. In the 4-byte representation, because there are
> >>   more 32-bit values than unicode code points, some 32-bit values
> >>   may correspond to passed-through bytes of misencoded utf8. In this
> >>   way, a j literal can round-trip arbitrary byte sequences. The
> >>   remainder of the 32-bit value space is completely inaccessible.
> >>
> >> - A new primitive verb U:, to replace u:, which is removed. U: has a
> >>   different name so that old code will break loudly rather than
> >>   quietly. If y is an array of integers, then U:y is an array of
> >>   characters with the corresponding code points; and if y is an
> >>   array of characters, then U:y is an array of their code points.
> >>   (A rough model in terms of current primitives follows this list.)
> >>   (Alternately, make a. impractically large and rely on a.i.y and
> >>   x{a. for everything. I disrecommend this for the same reason that
> >>   we have j. and r., and do not write x = 0j1 * y or x * ^ 0j1 * y.)
> >>
> >> - Foreigns for reading from files, like 1!:1 and 1!:11, permit 3
> >>   modes of operation; foreigns for writing to files, 1!:2, 1!:3, and
> >>   1!:12, permit 2. The reading modes are:
> >>
> >>   1. Throw on misencoded utf-8 (default).
> >>   2. Pass through misencoded bytes as meta-characters.
> >>   3. Intentionally misinterpret the file as being 'UCS-1' encoded
> >>      rather than utf-8 encoded.
> >>
> >>   The writing modes are:
> >>
> >>   1. Encode as utf-8, passing through meta-characters as the
> >>      corresponding octets (default).
> >>   2. Misinterpret output as 'UCS-1' and perform no encoding. Only
> >>      valid for 1-byte characters.
> >>
> >>   A recommendation: the UCS-1 misinterpretation should be removed if
> >>   1-byte integers are ever added.
> >>
> >> - A new foreign is provided to 'sneeze' character arrays. This is
> >>   largely cosmetic, but may be useful for some. If some string uses
> >>   a 4-byte representation, but in fact all of its elements' code
> >>   points are below 65536, then the result will use a smaller
> >>   representation. (This can also do work on integers, as it can
> >>   convert them to a boolean representation if they are all 0 or 1;
> >>   this is, again, marginal.)
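> >>
> >> As promised, a rough model of U: in terms of today's primitives
> >> (assuming monadic u: builds characters from code points and 3 u:
> >> reads each atom's value back out; sketch names, not a spec):
> >>
> >>    toCP =: 3&u:              NB. characters -> code points
> >>    fromCP =: u:              NB. code points -> characters
> >>    toCP fromCP 97 243 98     NB. U: would round-trip exactly
> >> 97 243 98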
> >>
> >> Future directions:
> >>
> >> Provide functionality for unicode normalization, casefolding,
> >> grapheme boundary identification, unicode character properties, and
> >> others. Maybe this should be done by turning U: into a trenchcoat
> >> function; or maybe it should be done by library code. There is the
> >> potential to reuse existing primitives, e.g. <.y might be a
> >> lowercased y, but I am wary of such puns.
> >>
> >> Thoughts? Comments?
> >>
> >> -E

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
