If I have to do much processing of UTF8, I convert to unicode first, do my processing, then convert back to UTF8. That way, primitives and definitions originally intended for literal will most likely work as expected. I have started using 3&u: instead of a.&i. for converting to numeric, as it works for literal, UTF8 and unicode. I just find it easier to deal with a character always counting as one thing instead of usually one, but often two or three things.
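A minimal sketch of that round trip, assuming a J session in which a typed string arrives as a UTF8 literal. The names s, u and t are only for illustration, and the rotate stands in for whatever processing is needed:

```j
   s =. 'ఝaఝ'        NB. literal holding UTF8 bytes (assumed session input)
   # s                NB. counts bytes, not characters
7
   u =. 7 u: s        NB. convert to unicode: one item per character
   # u
3
   3 u: u             NB. code points, one integer per character
3101 97 3101
   3 u: s             NB. the same verb on the literal gives the byte values
224 176 157 97 224 176 157
   t =. 8 u: 1 |. u   NB. process as characters (rotate by one), then back to UTF8
   # t
7
```

The rotate moves whole characters because it is applied to u. Applied to s it would move single bytes and break the encoding of the first character.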
On Thu, Jun 16, 2016 at 12:52 PM, robert therriault <[email protected]> wrote:

> Thanks Pascal,
>
> Using the original example
>
>    [s=. 2 6 $ 'ఝ' ,'a','ఝ'
> ఝa��
> �ఝa�
>
>    8 <@(a.i.u:)("0) 7 u: s  NB. Arrays need to be dealt with as rank 1
> |rank error
> |   8<@(a.i.u:)("0)7 u:s
>    8 <@(a.i.u:)("0) 7 u:"1 s  NB. Issues still arise with the partial encodings
> |domain error
> |   8<@(a.i.u:)("0)7 u:"1 s
>    8 <@(a.i.u:)("0) 7 u:"1 {. {: s  NB. Issue with the non valid encoding that J displays as �
> |domain error
> |   8<@(a.i.u:)("0)7 u:"1{.{:s
>
> I think that the challenge is the partial encodings. The J IDE displays
> these, but the 7 u: gives errors and even using :: for error exceptions I
> haven't found a nice way around the issues.
>
> Cheers, bob
>
> On Jun 16, 2016, at 11:37 AM, 'Pascal Jasmin' via Programming <[email protected]> wrote:
> >
> >    8 <@(a.i.u:)("0) 7 u: 'ఝ' ,'a','ఝ'
> > ┌───────────┬──┬───────────┐
> > │224 176 157│97│224 176 157│
> > └───────────┴──┴───────────┘
> >
> > ----- Original Message -----
> > From: robert therriault <[email protected]>
> > To: [email protected]
> > Sent: Thursday, June 16, 2016 2:33 PM
> > Subject: Re: [Jprogramming] Unicode (UTF8) string deconstruction
> >
> > You are quite right Don,
> >
> > I should change the request to displaying unicode in UTF8 I suppose.
> > Converting to unicode as you have done also allows manipulation of
> > characters within arrays, but I am looking ways to show the results when
> > reshaping breaks UTF8 representation.
> >
> > Do you have a way to take a literal array in UTF8 and box the encodings
> > for each character?
> >
> > I have seen your posts in the past and they have helped as I work
> > through this process. Thank you.
> >
> > One of the ways that I am looking at dealing with the width issue is to
> > have the character display display in a smaller font so that some of the
> > unicode display width issues can be resolved.
> >
> > Cheers, bob
> >
> >> On Jun 16, 2016, at 11:25 AM, Don Guinn <[email protected]> wrote:
> >>
> >> You are not dealing with unicode. You have UTF8.
> >>
> >>    ]s=. 7 u: 'ఝ' ,'a','ఝ'  NB. s is converted to unicode.
> >> ఝaఝ
> >>    $s
> >> 3
> >>    <"0 s
> >> +---+-+---+
> >> |ఝ|a|ఝ|
> >> +---+-+---+
> >>
> >> But the display still is messed up because the display first converts the
> >> unicode to UTF8. Then does a byte count to determine how many boxing
> >> characters to put around the data. But there is still a problem as many
> >> unicode/UTF8 characters beyond ASCII are proportional. Notice how wide the
> >> first and last characters are compared to the "a".
> >>
> >> On Thu, Jun 16, 2016 at 12:08 PM, robert therriault <[email protected]>
> >> wrote:
> >>
> >>> I am in the process of extending some of the type and shape visualizations
> >>> that I have done in the past [0] into the realm of unicode.
> >>>
> >>> If you look through the archives of these message lists you will find that
> >>> unicode can be quite confounding, but my question is relatively simple.
> >>>
> >>> I would like to take
> >>>
> >>>    [s=. 2 6 $ 'ఝ' ,'a','ఝ'  NB. � results from 224 176 157 being broken across dimensions
> >>> ఝa��
> >>> �ఝa�
> >>>    [encode=. a. i. s  NB. shape of 2 6 refers to the encoding numbers not the number of characters displayed
> >>> 224 176 157  97 224 176
> >>> 157 224 176 157  97 224
> >>>
> >>> and convert encode to a form where the encoding for each character is in
> >>> it's own box. Of course, this would be a verb that can work with any
> >>> literal array not just the example given.
> >>>
> >>>    [r=. 2 4 $ 224 176 157 ; 97 ; 224 ; 176 ; 157 ; 224 176 157 ; 97 ; 224
> >>> ┌───────────┬───────────┬───┬───┐
> >>> │224 176 157│97         │224│176│
> >>> ├───────────┼───────────┼───┼───┤
> >>> │157        │224 176 157│97 │224│
> >>> └───────────┴───────────┴───┴───┘
> >>>
> >>> which could be converted back to
> >>>
> >>>    {&a. each r
> >>> ┌───┬───┬─┬─┐
> >>> │ఝ│a  │�│�│
> >>> ├───┼───┼─┼─┤
> >>> │�  │ఝ│a│�│
> >>> └───┴───┴─┴─┘
> >>>
> >>> With this in place it may be possible to have the literal view of unicode
> >>> display a little more consistently
> >>>
> >>> Any suggestions would be welcome.
> >>>
> >>> Cheers, bob
> >>>
> >>> [0] Video of Enhanced display of literals
> >>> https://www.youtube.com/watch?v=BzjfJjGb5cs
> >>> ----------------------------------------------------------------------
> >>> For information about J forums see http://www.jsoftware.com/forums.htm
