Unless J has separate datatypes for bytes and characters, as Java does, J cannot store strings as UTF-16. The only sane choice is what is being used now, which leaves the burden on users.
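
For anyone following along, the code-unit versus code-point distinction in the quoted message can be sketched roughly in Python. This is only an analogy and not J: Python's bytes stands in for J's 'literal', a list of ints stands in for 'literal4', and the variable names are made up for illustration.

```python
# Python analogy only (not J): bytes plays the role of J's 'literal',
# and a list of ints plays the role of 'literal4'.
suits = "♥♦♣♠"
literal = suits.encode("utf-8")      # 12 octets, i.e. 12 UTF-8 code units

# Rough analogue of 10 u: one output atom per input atom, no decoding.
widened = list(literal)

# Rough analogue of 9 u: decode the octets as UTF-8 into code points.
decoded = [ord(ch) for ch in literal.decode("utf-8")]

print(len(literal), len(widened), len(decoded))   # 12 12 4
print([hex(cp) for cp in decoded])
```

The widened list keeps all 12 atoms because nothing was decoded, while the decoded list has 4 atoms, one per suit.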

On Thu, Jan 5, 2023 at 2:12 PM Elijah Stone <[email protected]> wrote:

> And no one believes me when I say strings in j are irreperably broken and
> need to be thrown out and redesigned from scratch...
>
> Your '♥♦♣♠' is intentionally (to borrow from kent pitman) a utf8-encoded
> string, comprising 12 utf8 code units, where each aligned group of three
> encodes a unicode code point representing a suit. The j datatype associated
> therewith is 'literal', i.e., a sequence of octets. The display of such
> objects is literal, and your environment is (correctly) interpreting the
> data as utf-8 encoded.
>
> 10 u:y takes y an array of integers, however represented, and gives back an
> array of 'literal4' data of the same length, where each atom of the result
> corresponds to one atom of the input.
>
> Display of literal4 data assumes that they are ucs4-encoded, as you say, and
> further assumes that the environment is utf8-oriented, so je treats each
> atom of a literal4 as representing a code point, and encodes it as utf8. In
> other words, your code _units_ are being cast as code _points_ (but note
> that 10 u: itself does no interpretation).
>
> 9 u: applied to a literal array interprets it as utf8 and attempts to decode
> it, producing code points represented as literal4. I expect this is what you
> are looking for.
>
>
> On Thu, 5 Jan 2023, Raul Miller wrote:
>
> >    10 u:'♥♦♣♠'
> > ♥♦♣♠
> >    #10 u:'♥♦♣♠'
> > 12
> >
> > I can't make heads nor tails of this result.
> >
> > nuvoc suggests that 10 u: should be used to generate unicode4 (which
> > probably means that it would use the ucs-4 encoding, containing a
> > utf-32 representation of the argument characters), but while it's
> > literally the case that the result is in J's unicode4 format:
> >
> >    datatype 10 u:'♥♦♣♠'
> > unicode4
> >
> > ... it does not look like the argument characters were encoded in this
> > format.
> >
> > --
> > Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
