I don't get it. Can you demo with an example?

On Sat, Mar 19, 2022 at 11:15 PM Don Guinn <[email protected]> wrote:
> I use UTF-16 and UTF-32 to try to get the code points of UTF-8 characters
> so I can get each character onto one atom. That way I don't have to worry
> about how many atoms each character takes. Unfortunately, UTF-16 and
> UTF-32 don't guarantee the characters are in one atom each. It would be
> nice if u: had an option to give the code points of Unicode characters.
>
> On Sat, Mar 19, 2022 at 8:49 AM bill lam <[email protected]> wrote:
>
> > Further clarification: the J language itself knows nothing about the
> > Unicode standard. u: is the only place where UTF-8, UTF-16, etc. are
> > relevant.
> >
> > On Sat, 19 Mar 2022 at 10:17 PM bill lam <[email protected]> wrote:
> >
> > > I think the current behavior of u: is correct and intended. First of
> > > all, J's utf8 is not a Unicode datatype; it is merely an
> > > interpretation of 1-byte literals. Similarly, 2-byte and 4-byte
> > > literals aren't exactly UCS-2 and UTF-32, and this is intended.
> > > Operations and comparisons between different types of literal are
> > > done by promotion, atom by atom. This explains the results that you
> > > quoted.
> > >
> > > The handling of Unicode in J is not perfect, but it is consistent
> > > with J's fundamental concepts, such as rank.
> > >
> > > On Sat, 19 Mar 2022 at 7:17 AM Elijah Stone <[email protected]> wrote:
> > >
> > >>    x=: 8 u: 97 243 98  NB. same as entering x=: 'aób'
> > >>    y=: 9 u: x
> > >>    z=: 10 u: 97 195 179 98
> > >>    x
> > >> aób
> > >>    y
> > >> aób
> > >>    z
> > >> aób
> > >>
> > >>    x-:y
> > >> 0
> > >> NB. ??? they look the same
> > >>
> > >>    x-:z
> > >> 1
> > >> NB. ??? they look different
> > >>
> > >>    $x
> > >> 4
> > >> NB. ??? it looks like 3 characters, not 4
> > >>
> > >> Well, this is Unicode. There are good reasons why two things that
> > >> look the same might not actually be the same.
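The x/y/z puzzle in the quote above turns on the split between code units (bytes) and code points. For comparison, the same split is easy to see in Python, used here only as a neutral illustration since the J behavior is the thing under debate: 'aób' is three code points, but ó needs two bytes in UTF-8, so the byte-level view has length 4.

```python
s = "a\u00f3b"                   # 'aób': three code points
utf8 = list(s.encode("utf-8"))   # code-unit (byte) view of the same text
cps = [ord(c) for c in s]        # code-point view

print(utf8)        # [97, 195, 179, 98] -- the 97 195 179 98 given to z above
print(cps)         # [97, 243, 98]      -- the 97 243 98 given to x above
print(len(utf8))   # 4 -- why $x is 4: the 1-byte literal holds bytes, not code points
```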
> > >> For instance:
> > >>
> > >>    ]p=: 10 u: 97 243 98
> > >> aób
> > >>    ]q=: 10 u: 97 111 769 98
> > >> aób
> > >>    p-:q
> > >> 0
> > >>
> > >> But in the above case, x doesn't match y for stupid reasons. And x
> > >> matches z for stupider ones.
> > >>
> > >> J's default (1-byte) character representation is a weird hodge-podge
> > >> of 'UCS-1' (I don't know what else to call it) and UTF-8, and it
> > >> does not seem well thought through. The dictionary page for u: seems
> > >> confused as to whether the 1-byte representation corresponds to
> > >> ASCII or UTF-8, and similarly as to whether the 2-byte
> > >> representation is coded as UCS-2 or UTF-16.
> > >>
> > >> Most charitably, this is exposing low-level aspects of the encoding
> > >> to users; but if so, that is unsuitable for a high-level language
> > >> such as J, and it is inconsistent. I do not have to worry that
> > >> 0 1 1 0 1 1 0 1 will suddenly turn into 36169536663191680, nor that
> > >> 2.718 will suddenly turn into 4613302810693613912, but _that is
> > >> exactly what is happening in the above code_.
> > >>
> > >> I give you the crowning WTF (maybe it is not so surprising at this
> > >> point...):
> > >>
> > >>    x;y;x,y  NB. pls j
> > >> ┌───┬───┬───────┐
> > >> │aób│aób│aóbaób│
> > >> └───┴───┴───────┘
> > >>
> > >> Unicode is delicate and skittish, and must be approached delicately.
> > >> I think that there are some essential conflicts between Unicode and
> > >> J--as the above example with the combining character
> > >> demonstrates--but also that Pandora's box is open: literal data
> > >> _exists_ in J. Given that that is the case, I think it is possible
> > >> and desirable to do much better than the current scheme.
> > >>
> > >> ---
> > >>
> > >> Unicode text can be broken up in a number of ways: graphemes,
> > >> characters, code points, code units...
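The p/q pair quoted above is the legitimate kind of looks-the-same-but-isn't: precomposed ó (U+00F3) versus o plus a combining acute accent (U+0301). A sketch of how normalization reconciles the two, using Python's unicodedata module as a stand-in for the normalization primitives the thread discusses:

```python
import unicodedata

p = "a\u00f3b"    # precomposed: a, ó, b -- code points 97 243 98
q = "ao\u0301b"   # decomposed: a, o, combining acute, b -- 97 111 769 98

print(p == q)                                # False: different code points
print(unicodedata.normalize("NFC", q) == p)  # True: NFC composes o + accent into ó
print([ord(c) for c in unicodedata.normalize("NFD", p)])  # [97, 111, 769, 98]
```

This is why Raku-style normalize-by-default makes the p/q case come out equal, at the cost of not preserving the original code-point sequence.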
> > >> The composition of code units into code points is the only such
> > >> demarcation which is stable and can be counted upon. It is also a
> > >> demarcation which is necessary for pretty much any interesting text
> > >> processing (to the point that I would suggest any form of 'text
> > >> processing' which does not consider code points is not actually
> > >> processing text). Therefore, I suggest that, at a minimum, no
> > >> user-exposed representation of text should acknowledge a delineation
> > >> below that of the code point. If there is any primitive which deals
> > >> in code units, it should be a foreign: scary, obscure, not for
> > >> everyday use.
> > >>
> > >> A non-obvious but good result of the above is that all strings are
> > >> correctly formed by construction. Not all sequences of code units
> > >> are correctly formed and correspond to valid strings of text. But
> > >> all sequences of code points _are_, of necessity, correctly formed;
> > >> otherwise there would be ... problems following additions to
> > >> Unicode. J currently allows us to create malformed strings, but then
> > >> complains when we use them in certain ways:
> > >>
> > >>    9 u: 1 u: 10 u: 254 255
> > >> |domain error
> > >> |       9 u:1 u:10 u:254 255
> > >>
> > >> ---
> > >>
> > >> It is a question whether J should natively recognise delineations
> > >> above the code point. It pains me to suggest that it should not.
> > >>
> > >> Raku (a pointer-chasing language) has the best-thought-out strings
> > >> of any programming language I have encountered. (Unsurprising, given
> > >> it was written by Perl hackers.) In Raku, operations on strings are
> > >> grapheme-oriented. Raku also normalizes all text by default (which
> > >> solves the problem I presented above with combining characters--but
> > >> rest assured, it cannot solve all such problems).
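The `254 255` domain error quoted above is exactly the units-versus-points boundary in action: the code points U+00FE and U+00FF form a perfectly valid string, but the raw octets 254 255 can never occur in well-formed UTF-8 (0xFE and 0xFF are not legal start bytes). A Python sketch of the same boundary:

```python
s = "\u00fe\u00ff"               # code points 254 and 255: a valid string
print(list(s.encode("utf-8")))   # [195, 190, 195, 191] -- its real UTF-8 encoding

try:
    bytes([254, 255]).decode("utf-8")   # the octets 254 255 are never valid UTF-8
except UnicodeDecodeError as e:
    print("malformed:", e.reason)
```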
> > >> They even have a scheme for space-efficient random access to
> > >> strings on this basis.
> > >>
> > >> But J is not Raku, and it is telling that, though Raku has
> > >> multidimensional arrays, its strings are _not_ arrays, and it does
> > >> not have characters. The principal problem is a violation of the
> > >> rules of conformability. For instance, it is not guaranteed that,
> > >> for vectors x and y, (#x,y) -: x +&# y. This is not _so_ terrible
> > >> (though it is pretty bad), but from it follows an obvious problem
> > >> with catenating higher-rank arrays. Similar concerns apply at least
> > >> to i., e., E., and }. That said, I would support the addition of
> > >> primitives to perform normalization (as well as casefolding etc.)
> > >> and identification of grapheme boundaries.
> > >>
> > >> ---
> > >>
> > >> It would be wise of me to address the elephant in the room.
> > >> Characters are not only used to represent text, but also arbitrary
> > >> binary data, e.g. from the network or files, which may in fact be
> > >> malformed as text. I submit that characters are clearly the wrong
> > >> way to represent such data; the right way to represent a sequence of
> > >> _octets_ is using _integers_. But people persist, and there are two
> > >> issues: the first is compatibility, and the second is performance.
> > >>
> > >> Regarding the second, an obvious solution is to add a 1-byte integer
> > >> representation (as Marshall has suggested on at least one occasion),
> > >> but this represents a potentially nontrivial development effort.
> > >> Therefore I suggest an alternate solution, at least for the interim:
> > >> foreigns (scary and obscure, per above) that will _intentionally
> > >> misinterpret_ data from the outside world as 'UCS-1' and represent
> > >> it compactly (or do the opposite).
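The proposed "intentionally misinterpret as 'UCS-1'" foreigns have a close analogue in Python's latin-1 codec, which maps each octet to the code point of the same value; and the meta-character scheme Elijah describes next is close to Python's surrogateescape error handler (which the thread itself cites). A sketch showing that both round-trip arbitrary bytes:

```python
data = bytes([97, 254, 255, 98])   # arbitrary octets; not valid UTF-8

# 'UCS-1' misinterpretation: octet n becomes code point n, losslessly
s1 = data.decode("latin-1")
print([ord(c) for c in s1])        # [97, 254, 255, 98]
assert s1.encode("latin-1") == data

# surrogateescape: valid UTF-8 decodes normally; each bad octet becomes a
# reserved escape code point, recovered on re-encoding
s2 = data.decode("utf-8", errors="surrogateescape")
print([hex(ord(c)) for c in s2])   # ['0x61', '0xdcfe', '0xdcff', '0x62']
assert s2.encode("utf-8", errors="surrogateescape") == data
```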
> > >> Regarding the issue of backwards compatibility, I propose the
> > >> addition of 256 'meta-characters', each corresponding to an octet.
> > >> Attempts to decode correctly formed UTF-8 from the outside world
> > >> will succeed and produce corresponding Unicode; attempts to decode
> > >> malformed UTF-8 may map each incorrect code unit to the
> > >> corresponding meta-character. When encoded, real characters will be
> > >> UTF-8 encoded, but each meta-character will be encoded as its
> > >> corresponding octet. In this way, arbitrary byte streams may be
> > >> passed through J strings; but byte streams which consist entirely or
> > >> partly of valid UTF-8 can be sensibly interpreted. This is similar
> > >> to Raku's utf8-c8, and to Python's surrogateescape.
> > >>
> > >> ---
> > >>
> > >> An implementation detail, sort of. Variable-width representations
> > >> (such as UTF-8) should not be used internally. Many fundamental
> > >> array operations require constant-time random access (with the
> > >> corresponding obvious caveats), which variable-width representations
> > >> cannot provide; and even operations which are inherently
> > >> sequential--like E., i., ;., #--may be more difficult or impossible
> > >> to optimize to the same degree. Fixed-width representations
> > >> therefore provide more predictable performance, better performance
> > >> in nearly all cases, and better asymptotic performance for many
> > >> interesting applications.
> > >>
> > >> (The UCS-1 misinterpretation mentioned above is a loophole which
> > >> allows people who really care about space to do the variable-width
> > >> part themselves.)
> > >>
> > >> ---
> > >>
> > >> I therefore suggest the following language changes, probably to be
> > >> deferred to version 10:
> > >>
> > >> - 1-, 2-, and 4-byte character representations are still used
> > >>   internally. They are fixed-width, with each code unit representing
> > >>   one code point.
> > >>   In the 4-byte representation, because there are more 32-bit
> > >>   values than Unicode code points, some 32-bit values may correspond
> > >>   to passed-through bytes of misencoded UTF-8. In this way, a J
> > >>   literal can round-trip arbitrary byte sequences. The remainder of
> > >>   the 32-bit value space is completely inaccessible.
> > >>
> > >> - A new primitive verb U:, to replace u:. u: is removed. U: has a
> > >>   different name so that old code will break loudly, rather than
> > >>   quietly. If y is an array of integers, then U:y is an array of
> > >>   characters with corresponding code points; and if y is an array of
> > >>   characters, then U:y is an array of their code points.
> > >>   (Alternately, make a. impractically large and rely on a.i.y and
> > >>   x{a. for everything. I disrecommend this for the same reason that
> > >>   we have j. and r., and do not write x = 0j1 * y or x * ^ 0j1 * y.)
> > >>
> > >> - Foreigns for reading from files, like 1!:1 and 1!:11, permit 3
> > >>   modes of operation; foreigns for writing to files, 1!:2, 1!:3, and
> > >>   1!:12, permit 2 modes of operation. The reading modes are:
> > >>
> > >>   1. Throw on misencoded UTF-8 (default).
> > >>   2. Pass through misencoded bytes as meta-characters.
> > >>   3. Intentionally misinterpret the file as being 'UCS-1' encoded
> > >>      rather than UTF-8 encoded.
> > >>
> > >>   The writing modes are:
> > >>
> > >>   1. Encode as UTF-8, passing through meta-characters as the
> > >>      corresponding octets (default).
> > >>   2. Misinterpret output as 'UCS-1' and perform no encoding. Only
> > >>      valid for 1-byte characters.
> > >>
> > >>   A recommendation: the UCS-1 misinterpretation should be removed if
> > >>   1-byte integers are ever added.
> > >>
> > >> - A new foreign is provided to 'sneeze' character arrays. This is
> > >>   largely cosmetic, but may be useful for some.
> > >>   If some string uses a 4-byte representation, but in fact all of
> > >>   its elements' code points are below 65536, then the result will
> > >>   use a smaller representation. (This can also do work on integers,
> > >>   as it can convert them to a boolean representation if they are all
> > >>   0 or 1; this is, again, marginal.)
> > >>
> > >> Future directions:
> > >>
> > >> Provide functionality for Unicode normalization, casefolding,
> > >> grapheme boundary identification, Unicode character properties, and
> > >> others. Maybe this should be done by turning U: into a trenchcoat
> > >> function; or maybe it should be done by library code. There is the
> > >> potential to reuse existing primitives, e.g. <.y might be a
> > >> lowercased y, but I am wary of such puns.
> > >>
> > >> Thoughts? Comments?
> > >>
> > >> -E
> > >> ----------------------------------------------------------------------
> > >> For information about J forums see http://www.jsoftware.com/forums.htm
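To close the loop on the 'sneeze' proposal above, here is a sketch of what such a foreign would compute, written in Python with an invented name (no such foreign exists): the narrowest fixed-width bytes-per-code-point size that can still hold every code point in a string.

```python
def sneeze_width(s):
    """Narrowest fixed-width element size (1, 2, or 4 bytes) for string s."""
    m = max((ord(c) for c in s), default=0)
    return 1 if m < 256 else 2 if m < 65536 else 4

print(sneeze_width("a\u00f3b"))    # 1: ó is U+00F3, which fits in one byte
print(sneeze_width("\u20ac10"))    # 2: € is U+20AC, which needs two
print(sneeze_width("\U0001f600"))  # 4: outside the BMP
```

The interesting design point is that this is a pure storage optimization: the result is -: to its argument, just held more compactly, exactly like the proposed integer-to-boolean demotion.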
