I see. But J can only handle Unicode in the BMP, i.e. code points below
65536, which take at most 3 bytes in UTF-8.

   u: 65536
|index error
|   u:65536
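
For anyone who wants to check the 3-byte bound without counting bits, here is a rough sketch in Python (not J, chosen only because it is easy to run anywhere). The helper name utf8_len and the sample code points are made up for illustration; the ranges are the standard UTF-8 ones, and the loop compares them against Python's own encoder.

```python
# Sketch: how many bytes UTF-8 needs for a code point, by range.
# utf8_len is a hypothetical helper name, not part of any library.
def utf8_len(cp: int) -> int:
    """Return the UTF-8 byte count for code point cp (0..0x10FFFF)."""
    if cp < 0x80:        # ASCII
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:     # rest of the BMP, i.e. below 65536
        return 3
    return 4             # supplementary planes

# Sample code points (surrogates 0xD800..0xDFFF are deliberately avoided,
# since they cannot be encoded on their own).
samples = [0x61, 0xE9, 0x0C1D, 0xFFFF, 0x10000]
for cp in samples:
    encoded = chr(cp).encode("utf-8")
    assert len(encoded) == utf8_len(cp)
    print(f"U+{cp:04X}: {list(encoded)}")
```

The third sample, U+0C1D, is the Telugu letter from the thread below, and it encodes to the bytes 224 176 157 seen there. The first code point outside the BMP, 65536, is the first one that needs 4 bytes.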
Also, the display width of a Unicode character can vary from 0 to 2.

On Fri, 17 Jun 2016, robert therriault wrote:
> Yes there are certainly illegal utf8 characters in 8 6$ 'ఝ' ,'a','ఝ', but
> what I am attempting is to reveal the illegal characters for what they are.
> Along the lines of the shape and type display that I had used incorporating
> svg. Once i have that information in a format that I can separate the
> illegal characters from the legal and allow a viewer to see the information
> by hovering over the character, then the reasons for 8 6$ 'ఝ' ,'a','ఝ'
> looking the way that it does on the j display becomes more apparent.
>
> Also, being able to distinguish between the 1, 2, 3, and 4 byte utf8
> representations may allow a bit more consistency in the way that the boxed
> versions of these characters display.
>
> It remains to be seen how far I get with this, but the ability to show the
> representation framework of a utf8 array is a step. :-)
>
> Cheers, bob
>
> > On Jun 17, 2016, at 5:13 PM, bill lam <[email protected]> wrote:
> >
> > But your s contains illegal utf8 characters.
> >
> >    isutf8=: 1:@(7&u:) ::0:
> >
> >    isutf8 'ఝ' ,'a','ఝ'
> > 1
> >    isutf8"1[ 8 6$ 'ఝ' ,'a','ఝ'
> > 0 0 0 1 0 0 0 0
> >
> >    isutf8"1[ 8 7$ 'ఝ' ,'a','ఝ'
> > 1 1 1 1 1 1 1 1
> >
> > Since the 3 wide characters string is a 7 byte in utf8
> >    a.i.'ఝ' ,'a','ఝ'
> > 224 176 157 97 224 176 157
> > 8 6 $ .... is not what you would expected. perhaps you meant
> >
> >    [s=: 8 6 $ 7 u: 'ఝ' ,'a','ఝ'
> > ఝaఝఝaఝ
> > ఝaఝఝaఝ
> > ఝaఝఝaఝ
> > ఝaఝఝaఝ
> > ఝaఝఝaఝ
> > ఝaఝఝaఝ
> > ఝaఝఝaఝ
> > ఝaఝఝaఝ
> > On Jun 18, 2016 7:30 AM, "robert therriault" <[email protected]> wrote:
> >
> >> Thanks for all the suggestions everyone.
> >>
> >> In the end I took a more explicit approach than I normally would, but it
> >> seems to work.
> >>
> >> I am not sure if this is useful for Henry, but it is one approach.
> >>
> >>    [s=. 8 6 $ 'ఝ' ,'a','ఝ'
> >> ఝa��
> >> �ఝa�
> >> ��ఝa
> >> ఝఝ
> >> aఝ��
> >> �aఝ�
> >> ��aఝ
> >> ఝa��
> >>    boxutf s
> >> ┌───────────┬───────────┬───────────┬───────────┐
> >> │224 176 157│97         │224        │176        │
> >> ├───────────┼───────────┼───────────┼───────────┤
> >> │157        │224 176 157│97         │224        │
> >> ├───────────┼───────────┼───────────┼───────────┤
> >> │176        │157        │224 176 157│97         │
> >> ├───────────┼───────────┼───────────┼───────────┤
> >> │224 176 157│224 176 157│           │           │
> >> ├───────────┼───────────┼───────────┼───────────┤
> >> │97         │224 176 157│224        │176        │
> >> ├───────────┼───────────┼───────────┼───────────┤
> >> │157        │97         │224 176 157│224        │
> >> ├───────────┼───────────┼───────────┼───────────┤
> >> │176        │157        │97         │224 176 157│
> >> ├───────────┼───────────┼───────────┼───────────┤
> >> │224 176 157│97         │224        │176        │
> >> └───────────┴───────────┴───────────┴───────────┘
> >>    {&a. each boxutf s
> >> ┌───┬───┬───┬───┐
> >> │ఝ│a  │�  │�  │
> >> ├───┼───┼───┼───┤
> >> │�  │ఝ│a  │�  │
> >> ├───┼───┼───┼───┤
> >> │�  │�  │ఝ│a  │
> >> ├───┼───┼───┼───┤
> >> │ఝ│ఝ│   │   │
> >> ├───┼───┼───┼───┤
> >> │a  │ఝ│�  │�  │
> >> ├───┼───┼───┼───┤
> >> │�  │a  │ఝ│�  │
> >> ├───┼───┼───┼───┤
> >> │�  │�  │a  │ఝ│
> >> ├───┼───┼───┼───┤
> >> │ఝ│a  │�  │�  │
> >> └───┴───┴───┴───┘
> >>    boxutf
> >> }:@utf@(3&u:)@":
> >>    utf
> >> 3 : 0"1
> >> if. y-:'' do. return. end.
> >> try. ((utf@:((1<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (1<.#) {. ]))) y
> >> catch. try. ((utf@:((2<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (2<.#) {. ]))) y
> >> catch. try. ((utf@:((3<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (3<.#) {. ]))) y
> >> catch. try. ((utf@:((4<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (4<.#) {. ]))) y
> >> catch. ({. ; utf@}.) y
> >> end.
> >> end.
> >> end.
> >> end.
> >> )
> >>
> >> Row by row I am just grabbing up to 4 UTF8 numbers and boxing them.
> >> Whenever the numbers are valid I box them and move on with the remaining
> >> part of the row.
> >>
> >> I am sure others will find a more elegant approach, but this seems to work.
> >>
> >> Cheers, bob
> >>
> >>> On Jun 16, 2016, at 4:27 PM, bill lam <[email protected]> wrote:
> >>>
> >>> internal representation of utf8 array is no different from regular
> >>> character array, utf8 only applies external interface. If you want to
> >>> manipulate unicode within j, you should use the wide character data type
> >>> (131072) as suggested by Don.
> >>> On Jun 17, 2016 2:33 AM, "robert therriault" <[email protected]> wrote:
> >>>
> >>>> You are quite right Don,
> >>>>
> >>>> I should change the request to displaying unicode in UTF8 I suppose.
> >>>> Converting to unicode as you have done also allows manipulation of
> >>>> characters within arrays, but I am looking ways to show the results when
> >>>> reshaping breaks UTF8 representation.
> >>>>
> >>>> Do you have a way to take a literal array in UTF8 and box the encodings
> >>>> for each character?
> >>>>
> >>>> I have seen your posts in the past and they have helped as I work through
> >>>> this process. Thank you.
> >>>>
> >>>> One of the ways that I am looking at dealing with the width issue is to
> >>>> have the character display display in a smaller font so that some of the
> >>>> unicode display width issues can be resolved.
> >>>>
> >>>> Cheers, bob
> >>>>
> >>>>> On Jun 16, 2016, at 11:25 AM, Don Guinn <[email protected]> wrote:
> >>>>>
> >>>>> You are not dealing with unicode. You have UTF8.
> >>>>>
> >>>>>    ]s=. 7 u: 'ఝ' ,'a','ఝ' NB. s is converted to unicode.
> >>>>> ఝaఝ
> >>>>>    $s
> >>>>> 3
> >>>>>    <"0 s
> >>>>> +---+-+---+
> >>>>> |ఝ|a|ఝ|
> >>>>> +---+-+---+
> >>>>>
> >>>>> But the display still is messed up because the display first converts the
> >>>>> unicode to UTF8. Then does a byte count to determine how many boxing
> >>>>> characters to put around the data. But there is still a problem as many
> >>>>> unicode/UTF8 characters beyond ASCII are proportional. Notice how wide the
> >>>>> first and last characters are compared to the "a".
> >>>>>
> >>>>> On Thu, Jun 16, 2016 at 12:08 PM, robert therriault <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> I am in the process of extending some of the type and shape visualizations
> >>>>>> that I have done in the past [0] into the realm of unicode.
> >>>>>>
> >>>>>> If you look through the archives of these message lists you will find that
> >>>>>> unicode can be quite confounding, but my question is relatively simple.
> >>>>>>
> >>>>>> I would like to take
> >>>>>>
> >>>>>>    [s=. 2 6 $ 'ఝ' ,'a','ఝ' NB. � results from 224 176 157 being broken across dimensions
> >>>>>> ఝa��
> >>>>>> �ఝa�
> >>>>>>    [encode=. a. i. s NB. shape of 2 6 refers to the encoding numbers not the number of characters displayed
> >>>>>> 224 176 157 97 224 176
> >>>>>> 157 224 176 157 97 224
> >>>>>>
> >>>>>> and convert encode to a form where the encoding for each character is in
> >>>>>> it's own box. Of course, this would be a verb that can work with any
> >>>>>> literal array not just the example given.
> >>>>>>
> >>>>>>    [r=. 2 4 $ 224 176 157 ; 97 ; 224 ; 176 ; 157 ; 224 176 157 ; 97 ; 224
> >>>>>> ┌───────────┬───────────┬───┬───┐
> >>>>>> │224 176 157│97         │224│176│
> >>>>>> ├───────────┼───────────┼───┼───┤
> >>>>>> │157        │224 176 157│97 │224│
> >>>>>> └───────────┴───────────┴───┴───┘
> >>>>>>
> >>>>>> which could be converted back to
> >>>>>>
> >>>>>>    {&a. each r
> >>>>>> ┌───┬───┬─┬─┐
> >>>>>> │ఝ│a  │�│�│
> >>>>>> ├───┼───┼─┼─┤
> >>>>>> │�  │ఝ│a│�│
> >>>>>> └───┴───┴─┴─┘
> >>>>>>
> >>>>>> With this in place it may be possible to have the literal view of unicode
> >>>>>> display a little more consistently
> >>>>>>
> >>>>>> Any suggestions would be welcome.
> >>>>>>
> >>>>>> Cheers, bob
> >>>>>>
> >>>>>> [0] Video of Enhanced display of literals
> >>>>>> https://www.youtube.com/watch?v=BzjfJjGb5cs

--
regards,
====================================================
GPG key 1024D/4434BAB3 2008-08-24
gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3
gpg --keyserver subkeys.pgp.net --armor --export 4434BAB3
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
