But J doesn't have 8-bit integers.

On Thu, Jan 5, 2023 at 3:28 PM Elijah Stone <[email protected]> wrote:
> 'byte' is a representational implementation detail.  We have integers.
>
> On Thu, 5 Jan 2023, bill lam wrote:
>
> > Unless J has a separate datatype for byte and character, as in Java, J
> > cannot store strings as utf16.
> > The only sane choice is what is being used now, and let users bear the
> > burden.
> >
> > On Thu, Jan 5, 2023 at 2:12 PM Elijah Stone <[email protected]> wrote:
> >
> >> And no one believes me when I say strings in j are irreparably broken
> >> and need to be thrown out and redesigned from scratch...
> >>
> >> Your '♥♦♣♠' is intentionally (to borrow from kent pitman) a utf8-encoded
> >> string, comprising 12 utf8 code units, where each aligned group of three
> >> encodes a unicode code point representing a suit.  The j datatype
> >> associated therewith is 'literal', i.e., a sequence of octets.  The
> >> display of such objects is literal, and your environment is (correctly)
> >> interpreting the data as utf-8 encoded.
> >>
> >> 10 u:y takes y an array of integers, however represented, and gives back
> >> an array of 'literal4' data of the same length, where each atom of the
> >> result corresponds to one atom of the input.
> >>
> >> Display of literal4 data assumes that they are ucs4-encoded, as you say,
> >> and further assumes that the environment is utf8-oriented, so je treats
> >> each atom of a literal4 as representing a code point, and encodes it as
> >> utf8.  In other words, your code _units_ are being cast as code _points_
> >> (but note that 10 u: itself does no interpretation).
> >>
> >> 9 u: applied to a literal array interprets it as utf8 and attempts to
> >> decode it, producing code points represented as literal4.  I expect this
> >> is what you are looking for.
> >>
> >>
> >> On Thu, 5 Jan 2023, Raul Miller wrote:
> >>
> >> >    10 u:'♥♦♣♠'
> >> > ♥♦♣♠
> >> >    #10 u:'♥♦♣♠'
> >> > 12
> >> >
> >> > I can't make heads nor tails of this result.
> >> >
> >> > nuvoc suggests that 10 u: should be used to generate unicode4 (which
> >> > probably means that it would use the ucs-4 encoding, containing a
> >> > utf-32 representation of the argument characters), but while it's
> >> > literally the case that the result is in J's unicode4 format:
> >> >
> >> >    datatype 10 u:'♥♦♣♠'
> >> > unicode4
> >> >
> >> > ... it does not look like the argument characters were encoded in this
> >> > format.
> >> >
> >> > --
> >> > Raul
> >> > ----------------------------------------------------------------------
> >> > For information about J forums see http://www.jsoftware.com/forums.htm
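
For readers who want to see the code-unit versus code-point distinction from the thread outside of J, here is a rough Python analogue. It is only a sketch, not how J implements 9 u: or 10 u:, and the variable names (suits, code_units, cast, decoded) are made up for illustration. Casting each UTF-8 code unit to its own code point, as 10 u: is described as doing, keeps 12 atoms; decoding the octets as UTF-8, as 9 u: is described as doing, yields 4 code points.

```python
# Illustrative analogue only; none of these names come from J.
suits = "♥♦♣♠"

# What a J 'literal' holds, per the thread: the UTF-8 octets of the string.
code_units = list(suits.encode("utf-8"))

# Analogue of 10 u: as described in the thread: each code unit becomes one
# code point, with no decoding, so the length is unchanged.
cast = "".join(chr(b) for b in code_units)

# Analogue of 9 u: as described in the thread: decode the octets as UTF-8.
decoded = bytes(code_units).decode("utf-8")

print(len(code_units))            # 12
print(len(cast))                  # 12
print(len(decoded))               # 4
print([ord(c) for c in decoded])  # [9829, 9830, 9827, 9824]
```

The 12 in the second print corresponds to the 12 that Raul reported for #10 u:'♥♦♣♠'; the 4 in the third is the count one would expect if the argument had been decoded first.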
