The issue with emoji ZWJ sequences is that they are sequences of
unicode code points, not single code points. So a mechanism that
operates on individual code points is not sufficient to treat these
sequences as single entities. (Unicode adopted what might be thought
of as some of APL's "overstrike" mechanisms.)

See also: https://emojipedia.org/emoji-zwj-sequence/
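
To make this concrete, here is the decomposition in Python (chosen
because it exposes code points directly); the emoji is the
"woman: red hair" sequence discussed below:

```python
# An emoji ZWJ sequence: WOMAN (U+1F469), ZERO WIDTH JOINER (U+200D),
# and RED HAIR (U+1F9B0) render as one glyph, but remain three code
# points (eleven UTF-8 code units) under the hood.
s = "\U0001F469\u200D\U0001F9B0"

print(len(s))                     # 3 code points
print(len(s.encode("utf-8")))     # 11 bytes, matching the 3 u: result below
print([hex(ord(c)) for c in s])
```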

Generally speaking, a programmer working with character sequences is
going to have to develop a toolkit of abstractions for dealing with
the domains of interest. (And, in the general case, this might mean
bringing in humans to make judgement calls.)
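
As a sketch of what one piece of such a toolkit might look like, here
is a deliberately simplified grapheme-clustering function in Python.
It handles only combining marks and ZWJ joins; full segmentation per
UAX #29 has many more rules, so treat this as an illustration rather
than production code.

```python
import unicodedata

def clusters(s):
    """Group code points into approximate user-perceived characters.

    Simplified sketch: attaches combining marks to their base and glues
    code points together across ZERO WIDTH JOINER (U+200D). Real
    grapheme segmentation (UAX #29) involves many more rules.
    """
    out = []
    for ch in s:
        join = out and (
            unicodedata.combining(ch) != 0     # combining mark
            or ch == "\u200d"                  # the joiner itself
            or out[-1].endswith("\u200d")      # previous unit ended in ZWJ
        )
        if join:
            out[-1] += ch
        else:
            out.append(ch)
    return out

print(clusters("ao\u0301b"))                          # 3 units: a, o+acute, b
print(len(clusters("\U0001F469\u200D\U0001F9B0")))    # 1 joined cluster
```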

Thanks,

-- 
Raul

On Sat, Mar 19, 2022 at 4:29 PM Elijah Stone <[email protected]> wrote:
>
> That is definitely not beyond unicode.  My proposed grapheme boundary
> identification functionality would take care of that.  Though I think that
> belongs in library code.
>
> On Sat, 19 Mar 2022, Don Guinn wrote:
>
> > I apologise for not looking at the changes to unicode. Now limited to 21
> > bits (I think), unicode4 now covers all unicode code-points in one atom.
> > But then I stumbled onto this: 👩‍🦰 - woman with red hair. An emoji ZWJ
> > sequence has characters that require more than one unicode code point. I
> > guess that this is a little beyond unicode. Oh well.
> >
> > 3 u: '👩‍🦰'
> > 240 159 145 169 226 128 141 240 159 166 176
> >
> >   ucpcount '👩‍🦰'
> > 5
> >
> > On Sat, Mar 19, 2022 at 9:36 AM bill lam <[email protected]> wrote:
> >
> >> I don't get it. Can you demo with an example?
> >>
> >> On Sat, Mar 19, 2022 at 11:15 PM Don Guinn <[email protected]> wrote:
> >>
> >> > I use UTF-16 and UTF-32 to try to get the code-point of UTF-8 characters
> >> so
> >> > I can get each character into one atom. That way I don't have to worry
> >> > about how many atoms each character takes. Unfortunately UTF-16 and
> >> UTF-32
> >> > don't guarantee the characters are in one atom each. It would be nice if
> >> U:
> >> > had an option to give the code-points of unicode characters.
> >> >
> >> > On Sat, Mar 19, 2022 at 8:49 AM bill lam <[email protected]> wrote:
> >> >
> >> > > Further clarification: the J language itself knows nothing about the
> >> > > unicode standard.
> >> > > u: is the only place where utf8, utf16, etc. are relevant.
> >> > >
> >> > >
> >> > > On Sat, 19 Mar 2022 at 10:17 PM bill lam <[email protected]> wrote:
> >> > >
> >> > > > I think the current behavior of u: is correct and intended.
> >> > > > First of all, J utf8 is not a unicode datatype; it is merely an
> >> > > > interpretation of a 1-byte literal.
> >> > > > Similarly, 2-byte and 4-byte literals aren't exactly ucs2 and utf32,
> >> > > > and this is intended.
> >> > > > Operation and comparison between different types of literal are done
> >> by
> >> > > > promotion atom by atom. This will explain the results that you
> >> quoted.
> >> > > >
> >> > > > The handling of unicode in J is not perfect but it is consistent
> >> with J
> >> > > > fundamental concepts such as rank.
> >> > > >
> >> > > > On Sat, 19 Mar 2022 at 7:17 AM Elijah Stone <[email protected]>
> >> > wrote:
> >> > > >
> >> > > >>     x=: 8 u: 97 243 98      NB. same as entering x=: 'aób'
> >> > > >>     y=: 9 u: x
> >> > > >>     z=: 10 u: 97 195 179 98
> >> > > >>     x
> >> > > >> aób
> >> > > >>     y
> >> > > >> aób
> >> > > >>     z
> >> > > >> aób
> >> > > >>
> >> > > >>     x-:y
> >> > > >> 0
> >> > > >>     NB. ??? they look the same
> >> > > >>
> >> > > >>     x-:z
> >> > > >> 1
> >> > > >>     NB. ??? they look different
> >> > > >>
> >> > > >>     $x
> >> > > >> 4
> >> > > >>     NB. ??? it looks like 3 characters, not 4
> >> > > >>
> >> > > >> Well, this is unicode.  There are good reasons why two things that
> >> > look
> >> > > >> the same might not actually be the same.  For instance:
> >> > > >>
> >> > > >>     ]p=: 10 u: 97 243 98
> >> > > >> aób
> >> > > >>     ]q=: 10 u: 97 111 769 98
> >> > > >> aób
> >> > > >>     p-:q
> >> > > >> 0
> >> > > >>
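
A Python rendering of the same look-alike pair, for concreteness; NFC
normalization is what reconciles the two forms:

```python
import unicodedata

p = "a\u00F3b"     # precomposed: LATIN SMALL LETTER O WITH ACUTE
q = "ao\u0301b"    # decomposed: 'o' followed by COMBINING ACUTE ACCENT

print(p, q)        # both display as aób
print(p == q)      # False: different code-point sequences
print(unicodedata.normalize("NFC", q) == p)   # True after normalization
```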
> >> > > >> But in the above case, x doesn't match y for stupid reasons.  And x
> >> > > >> matches z for stupider ones.
> >> > > >>
> >> > > >> J's default (1-byte) character representation is a weird hodge-podge
> >> > of
> >> > > >> 'UCS-1' (I don't know what else to call it) and UTF-8, and it does
> >> not
> >> > > >> seem well thought through.  The dictionary page for u: seems
> >> confused
> >> > as
> >> > > >> to whether the 1-byte representation corresponds to ASCII or UTF-8,
> >> > and
> >> > > >> similarly as to whether the 2-byte representation is coded as UCS-2
> >> or
> >> > > >> UTF-16.
> >> > > >>
> >> > > >> Most charitably, this is exposing low-level aspects of the encoding
> >> to
> >> > > >> users, but if so, that is unsuitable for a high-level language such
> >> as
> >> > > j,
> >> > > >> and it is inconsistent.  I do not have to worry that 0 1 1 0 1 1 0 1
> >> > > will
> >> > > >> suddenly turn into 36169536663191680, nor that 2.718 will suddenly
> >> > turn
> >> > > >> into 4613302810693613912, but _that is exactly what is happening in
> >> > the
> >> > > >> above code_.
> >> > > >>
> >> > > >> I give you the crowning WTF (maybe it is not so surprising at this
> >> > > >> point...):
> >> > > >>
> >> > > >>     x;y;x,y                NB. pls j
> >> > > >> ┌───┬───┬───────┐
> >> > > >> │aób│aób│aóbaób│
> >> > > >> └───┴───┴───────┘
> >> > > >>
> >> > > >> Unicode is delicate and skittish, and must be approached delicately.
> >> > I
> >> > > >> think that there are some essential conflicts between unicode and
> >> > j--as
> >> > > >> the above example with the combining character demonstrates--but
> >> also
> >> > > >> that
> >> > > >> pandora's box is open: literal data _exists_ in j.  Given that that
> >> is
> >> > > >> the
> >> > > >> case, I think it is possible and desirable to do much better than
> >> the
> >> > > >> current scheme.
> >> > > >>
> >> > > >> ---
> >> > > >>
> >> > > >> Unicode text can be broken up in a number of ways.  Graphemes,
> >> > > >> characters,
> >> > > >> code points, code units...
> >> > > >>
> >> > > >> The composition of code units into code points is the only such
> >> > > >> demarcation which is stable and can be counted upon.  It is also a
> >> > > >> demarcation which is necessary for pretty much any interesting text
> >> > > >> processing (to the point that I would suggest any form of 'text
> >> > > >> processing' which does not consider code points is not actually
> >> > > >> processing
> >> > > >> text).  Therefore, I suggest that, at a minimum, no user-exposed
> >> > > >> representation of text should acknowledge a delineation below that
> >> of
> >> > > the
> >> > > >> code point.  If there is any primitive which deals in code units, it
> >> > > >> should be a foreign: scary, obscure, not for everyday use.
> >> > > >>
> >> > > >> A non-obvious but good result of the above is that all strings are
> >> > > >> correctly-formed by construction.  Not all sequences of code units
> >> are
> >> > > >> correctly formed and correspond to valid strings of text.  But all
> >> > > >> sequences of code points _are_, of necessity, correctly formed,
> >> > > otherwise
> >> > > >> there would be ... problems following additions to unicode.  J
> >> > currently
> >> > > >> allows us to create malformed strings, but then complains when we
> >> use
> >> > > >> them
> >> > > >> in certain ways:
> >> > > >>
> >> > > >>     9 u: 1 u: 10 u: 254 255
> >> > > >> |domain error
> >> > > >> |   9     u:1 u:10 u:254 255
> >> > > >>
> >> > > >> ---
> >> > > >>
> >> > > >> It is a question whether j should natively recognise delineations
> >> > above
> >> > > >> the code point.  It pains me to suggest that it should not.
> >> > > >>
> >> > > >> Raku (a pointer-chasing language) has the best-thought-out strings
> >> of
> >> > > any
> >> > > >> programming language I have encountered.  (Unsurprising, given it
> >> was
> >> > > >> written by perl hackers.)  In raku, operations on strings are
> >> > > >> grapheme-oriented.  Raku also normalizes all text by default (which
> >> > > >> solves
> >> > > >> the problem I presented above with combining characters--but rest
> >> > > >> assured,
> >> > > >> it can not solve all such problems).  They even have a scheme for
> >> > > >> space-efficient random access to strings on this basis.
> >> > > >>
> >> > > >> But j is not raku, and it is telling that, though raku has
> >> > > >> multidimensional arrays, its strings are _not_ arrays, and it does
> >> not
> >> > > >> have characters.  The principal problem is a violation of the rules
> >> of
> >> > > >> conformability.  For instance, it is not guaranteed that, for
> >> vectors
> >> > x
> >> > > >> and y, (#x,y) -: x +&# y.  This is not _so_ terrible (though it is
> >> > > pretty
> >> > > >> bad), but from it follows an obvious problem with catenating
> >> > higher-rank
> >> > > >> arrays.  Similar concerns apply at least to i., e., E., and }.  That
> >> > > >> said,
> >> > > >> I would support the addition of primitives to perform normalization
> >> > (as
> >> > > >> well as casefolding etc.) and identification of grapheme boundaries.
> >> > > >>
> >> > > >> ---
> >> > > >>
> >> > > >> It would be wise of me to address the elephant in the room.
> >> > Characters
> >> > > >> are not only used to represent text, but also arbitrary binary data,
> >> > > e.g.
> >> > > >> from the network or files, which may in fact be malformed as text.
> >> I
> >> > > >> submit that characters are clearly the wrong way to represent such
> >> > data;
> >> > > >> the right way to represent a sequence of _octets_ is using
> >> _integers_.
> >> > > >> But people persist, and there are two issues: the first is
> >> > > compatibility,
> >> > > >> and the second is performance.
> >> > > >>
> >> > > >> Regarding the second, an obvious solution is to add a 1-byte integer
> >> > > >> representation (as Marshall has suggested on at least one occasion),
> >> > but
> >> > > >> this represents a potentially nontrivial development effort.
> >> > Therefore
> >> > > I
> >> > > >> suggest an alternate solution, at least for the interim: foreigns
> >> > (scary
> >> > > >> and obscure, per above) that will _intentionally misinterpret_ data
> >> > from
> >> > > >> the outside world as 'UCS-1' and represent it compactly (or do the
> >> > > >> opposite).
> >> > > >>
> >> > > >> Regarding the issue of backwards compatibility, I propose the
> >> addition
> >> > > of
> >> > > >> 256 'meta-characters', each corresponding to an octet.  Attempts to
> >> > > >> decode
> >> > > >> correctly formed utf-8 from the outside world will succeed and
> >> produce
> >> > > >> corresponding unicode; attempts to decode malformed utf-8 may map
> >> each
> >> > > >> incorrect code unit to the corresponding meta-character.  When
> >> > encoded,
> >> > > >> real characters will be utf-8 encoded, but each meta-character will
> >> be
> >> > > >> encoded as its corresponding octet.  In this way, arbitrary byte
> >> > streams
> >> > > >> may be passed through j strings; but byte streams which consist
> >> > entirely
> >> > > >> or partly of valid utf-8 can be sensibly interpreted.  This is
> >> similar
> >> > > to
> >> > > >> raku's utf8-c8, and to python's surrogateescape.
> >> > > >>
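
For concreteness, Python's surrogateescape error handler (mentioned
just above) shows the round-trip behavior this proposal is after:
malformed bytes survive a decode/encode cycle unchanged.

```python
raw = b"valid \xfe\xff bytes"    # 0xFE and 0xFF can never occur in UTF-8

# Decode: malformed bytes become sentinel code points (lone surrogates),
# analogous to the proposed 256 meta-characters.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))               # \udcfe and \udcff mark the bad bytes

# Encode: the sentinels turn back into the original octets.
back = text.encode("utf-8", errors="surrogateescape")
print(back == raw)               # True: arbitrary byte streams round-trip
```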
> >> > > >> ---
> >> > > >>
> >> > > >> An implementation detail, sort of.  Variable-width representations
> >> > (such
> >> > > >> as utf-8) should not be used internally.  Many fundamental array
> >> > > >> operations require constant-time random access (with the
> >> corresponding
> >> > > >> obvious caveats), which variable-width representations cannot
> >> provide;
> >> > > >> and
> >> > > >> even operations which are inherently sequential--like E., i., ;.,
> >> > #--may
> >> > > >> be more difficult or impossible to optimize to the same degree.
> >> > > >> Fixed-width representations therefore provide more predictable
> >> > > >> performance, better performance in nearly all cases, and better
> >> > > >> asymptotic
> >> > > >> performance for many interesting applications.
> >> > > >>
> >> > > >> (The UCS-1 misinterpretation mentioned above is a loophole which
> >> > allows
> >> > > >> people who really care about space to do the variable-width part
> >> > > >> themselves.)
> >> > > >>
> >> > > >> ---
> >> > > >>
> >> > > >> I therefore suggest the following language changes, probably to be
> >> > > >> deferred to version 10:
> >> > > >>
> >> > > >> - 1, 2, and 4-byte character representations are still used
> >> > internally.
> >> > > >>    They are fixed-width, with each code unit representing one code
> >> > > point.
> >> > > >>    In the 4-byte representation, because there are more 32-bit
> >> values
> >> > > than
> >> > > >>    unicode code points, some 32-bit values may correspond to
> >> > > >> passed-through
> >> > > >>    bytes of misencoded utf8.  In this way, a j literal can
> >> round-trip
> >> > > >>    arbitrary byte sequences.  The remainder of the 32-bit value
> >> space
> >> > is
> >> > > >>    completely inaccessible.
> >> > > >>
> >> > > >> - A new primitive verb U:, to replace u:.  u: is removed.  U: has a
> >> > > >>    different name, so that old code will break loudly, rather than
> >> > > >> quietly.
> >> > > >>    If y is an array of integers, then U:y is an array of characters
> >> > with
> >> > > >>    corresponding codepoints; and if y is an array of characters,
> >> then
> >> > > U:y
> >> > > >>    is an array of their code points.  (Alternately, make a.
> >> > > impractically
> >> > > >>    large and rely on a.i.y and x{a. for everything.  I disrecommend
> >> > this
> >> > > >>    for the same reason that we have j. and r., and do not write x =
> >> > 0j1
> >> > > *
> >> > > >> y
> >> > > >>    or x * ^ 0j1 * y.)
> >> > > >>
> >> > > >> - Foreigns for reading from files, like 1!:1 and 1!:11 permit 3
> >> modes
> >> > of
> >> > > >>    operation; foreigns for writing to files, 1!:2, 1!:3, and 1!:12,
> >> > > permit
> >> > > >>    2 modes of operation.  The reading modes are:
> >> > > >>
> >> > > >>    1. Throw on misencoded utf-8 (default).
> >> > > >>    2. Pass-through misencoded bytes as meta characters.
> >> > > >>    3. Intentionally misinterpret the file as being 'UCS-1' encoded
> >> > > rather
> >> > > >>       than utf-8 encoded.
> >> > > >>
> >> > > >>    The writing modes are:
> >> > > >>
> >> > > >>    1. Encode as utf-8, passing through meta characters as the
> >> > > >> corresponding
> >> > > >>       octets (default).
> >> > > >>    2. Misinterpret output as 'UCS-1' and perform no encoding.  Only
> >> > > valid
> >> > > >>       for 1-byte characters.
> >> > > >>
> >> > > >> A recommendation: the UCS-1 misinterpretation should be removed if
> >> > > 1-byte
> >> > > >> integers are ever added.
> >> > > >>
> >> > > >> - A new foreign is provided to 'sneeze' character arrays.  This is
> >> > > >>    largely cosmetic, but may be useful for some.  If some string
> >> uses
> >> > a
> >> > > >>    4-byte representation, but in fact, all of its elements' code
> >> > points
> >> > > >> are
> >> > > >>    below 65536, then the result will use a smaller representation.
> >> > > (This
> >> > > >>    can also do work on integers, as it can convert them to a boolean
> >> > > >>    representation if they are all 0 or 1; this is, again, marginal.)
> >> > > >>
> >> > > >> Future directions:
> >> > > >>
> >> > > >> Provide functionality for unicode normalization, casefolding,
> >> grapheme
> >> > > >> boundary identification, unicode character properties, and others.
> >> > > Maybe
> >> > > >> this should be done by turning U: into a trenchcoat function; or
> >> maybe
> >> > > it
> >> > > >> should be done by library code.  There is the potential to reuse
> >> > > existing
> >> > > >> primitives, e.g. <.y might be a lowercased y, but I am wary of such
> >> > > puns.
> >> > > >>
> >> > > >> Thoughts?  Comments?
> >> > > >>
> >> > > >>   -E
> >> > > >>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
