Yes, there are certainly illegal UTF8 characters in 8 6$ 'ఝ' ,'a','ఝ', but what I am attempting is to reveal the illegal characters for what they are, along the lines of the shape and type display that I did earlier using SVG. Once I have that information in a format where I can separate the illegal characters from the legal ones, and let a viewer see the details by hovering over a character, the reasons why 8 6$ 'ఝ' ,'a','ఝ' looks the way it does in the J display become more apparent.
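As a rough sketch of how the legal and illegal bytes might be told apart, each byte can be classified by its role in the standard UTF8 bit layout. The name utf8class is hypothetical (it is not from this thread or the J library), and the verb only looks at lead-byte ranges: it does not reject overlong forms such as lead bytes 192 and 193, or out-of-range leads 245 to 247.

```j
NB. utf8class: hypothetical helper, an assumption for illustration only.
NB. Result per byte of y (same shape as y):
NB.    1..4  lead byte that starts an n-byte sequence
NB.    0     continuation byte (128..191)
NB.   _1     byte that can never appear in UTF8 (248..255)
utf8class =: 3 : 0
b =. a. i. y
lead =. (b < 128) + (2 * (b >: 192) *. b < 224) + (3 * (b >: 224) *. b < 240) + 4 * (b >: 240) *. b < 248
lead - b >: 248
)
```

Applied to the 2 by 6 example it gives

       utf8class 2 6 $ 'ఝ','a','ఝ'
    3 0 0 1 3 0
    0 3 0 0 1 3

so a row that begins with a 0, or that ends with a lead byte followed by too few 0s, is a row where the reshape has cut a character in two.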
Also, being able to distinguish between the 1-, 2-, 3-, and 4-byte UTF8 representations may allow a bit more consistency in the way that the boxed versions of these characters display. It remains to be seen how far I get with this, but being able to show the representation framework of a UTF8 array is a step. :-)

Cheers, bob

> On Jun 17, 2016, at 5:13 PM, bill lam wrote:
>
> But your s contains illegal utf8 characters.
>
>    isutf8=: 1:@(7&u:) ::0:
>
>    isutf8 'ఝ' ,'a','ఝ'
> 1
>    isutf8"1[ 8 6$ 'ఝ' ,'a','ఝ'
> 0 0 0 1 0 0 0 0
>
>    isutf8"1[ 8 7$ 'ఝ' ,'a','ఝ'
> 1 1 1 1 1 1 1 1
>
> Since the 3 wide character string is 7 bytes in utf8
>    a.i.'ఝ' ,'a','ఝ'
> 224 176 157 97 224 176 157
> 8 6 $ .... is not what you would expect. Perhaps you meant
>
>    [s=: 8 6 $ 7 u: 'ఝ' ,'a','ఝ'
> ఝaఝఝaఝ
> ఝaఝఝaఝ
> ఝaఝఝaఝ
> ఝaఝఝaఝ
> ఝaఝఝaఝ
> ఝaఝఝaఝ
> ఝaఝఝaఝ
> ఝaఝఝaఝ
>
> On Jun 18, 2016 7:30 AM, "robert therriault" wrote:
>
>> Thanks for all the suggestions everyone.
>>
>> In the end I took a more explicit approach than I normally would, but it
>> seems to work.
>>
>> I am not sure if this is useful for Henry, but it is one approach.
>>
>>    [s=. 8 6 $ 'ఝ' ,'a','ఝ'
>> ఝa��
>> �ఝa�
>> ��ఝa
>> ఝఝ
>> aఝ��
>> �aఝ�
>> ��aఝ
>> ఝa��
>>    boxutf s
>> ┌───────────┬───────────┬───────────┬───────────┐
>> │224 176 157│97         │224        │176        │
>> ├───────────┼───────────┼───────────┼───────────┤
>> │157        │224 176 157│97         │224        │
>> ├───────────┼───────────┼───────────┼───────────┤
>> │176        │157        │224 176 157│97         │
>> ├───────────┼───────────┼───────────┼───────────┤
>> │224 176 157│224 176 157│           │           │
>> ├───────────┼───────────┼───────────┼───────────┤
>> │97         │224 176 157│224        │176        │
>> ├───────────┼───────────┼───────────┼───────────┤
>> │157        │97         │224 176 157│224        │
>> ├───────────┼───────────┼───────────┼───────────┤
>> │176        │157        │97         │224 176 157│
>> ├───────────┼───────────┼───────────┼───────────┤
>> │224 176 157│97         │224        │176        │
>> └───────────┴───────────┴───────────┴───────────┘
>>    {&a. each boxutf s
>> ┌───┬───┬───┬───┐
>> │ఝ│a  │�  │�  │
>> ├───┼───┼───┼───┤
>> │�  │ఝ│a  │�  │
>> ├───┼───┼───┼───┤
>> │�  │�  │ఝ│a  │
>> ├───┼───┼───┼───┤
>> │ఝ│ఝ│   │   │
>> ├───┼───┼───┼───┤
>> │a  │ఝ│�  │�  │
>> ├───┼───┼───┼───┤
>> │�  │a  │ఝ│�  │
>> ├───┼───┼───┼───┤
>> │�  │�  │a  │ఝ│
>> ├───┼───┼───┼───┤
>> │ఝ│a  │�  │�  │
>> └───┴───┴───┴───┘
>>    boxutf
>> }:@utf@(3&u:)@":
>>    utf
>> 3 : 0"1
>> if. y-:'' do. return. end.
>> try. ((utf@:((1<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (1<.#) {. ]))) y
>> catch. try. ((utf@:((2<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (2<.#) {. ]))) y
>> catch. try. ((utf@:((3<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (3<.#) {. ]))) y
>> catch. try. ((utf@:((4<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (4<.#) {. ]))) y
>> catch. ({. ; utf@}.) y
>> end.
>> end.
>> end.
>> end.
>> )
>>
>> Row by row I am just grabbing up to 4 UTF8 numbers and boxing them.
>> Whenever the numbers are valid I box them and move on with the remaining
>> part of the row.
>>
>> I am sure others will find a more elegant approach, but this seems to work.
>>
>> Cheers, bob
>>
>>> On Jun 16, 2016, at 4:27 PM, bill lam wrote:
>>>
>>> The internal representation of a utf8 array is no different from a
>>> regular character array; utf8 only applies to the external interface.
>>> If you want to manipulate unicode within j, you should use the wide
>>> character data type (131072) as suggested by Don.
>>>
>>> On Jun 17, 2016 2:33 AM, "robert therriault" wrote:
>>>
>>>> You are quite right Don,
>>>>
>>>> I should change the request to displaying unicode in UTF8 I suppose.
>>>> Converting to unicode as you have done also allows manipulation of
>>>> characters within arrays, but I am looking for ways to show the
>>>> results when reshaping breaks the UTF8 representation.
>>>>
>>>> Do you have a way to take a literal array in UTF8 and box the encodings
>>>> for each character?
>>>>
>>>> I have seen your posts in the past and they have helped as I work
>>>> through this process. Thank you.
>>>>
>>>> One of the ways that I am looking at dealing with the width issue is to
>>>> have the character display use a smaller font so that some of the
>>>> unicode display width issues can be resolved.
>>>>
>>>> Cheers, bob
>>>>
>>>>> On Jun 16, 2016, at 11:25 AM, Don Guinn wrote:
>>>>>
>>>>> You are not dealing with unicode. You have UTF8.
>>>>>
>>>>>    ]s=. 7 u: 'ఝ' ,'a','ఝ' NB. s is converted to unicode.
>>>>> ఝaఝ
>>>>>    $s
>>>>> 3
>>>>>    <"0 s
>>>>> +---+-+---+
>>>>> |ఝ|a|ఝ|
>>>>> +---+-+---+
>>>>>
>>>>> But the display is still messed up because the display first converts
>>>>> the unicode to UTF8, then does a byte count to determine how many
>>>>> boxing characters to put around the data. But there is still a problem
>>>>> as many unicode/UTF8 characters beyond ASCII are proportional. Notice
>>>>> how wide the first and last characters are compared to the "a".
>>>>>
>>>>> On Thu, Jun 16, 2016 at 12:08 PM, robert therriault wrote:
>>>>>
>>>>>> I am in the process of extending some of the type and shape
>>>>>> visualizations that I have done in the past [0] into the realm of
>>>>>> unicode.
>>>>>>
>>>>>> If you look through the archives of these message lists you will find
>>>>>> that unicode can be quite confounding, but my question is relatively
>>>>>> simple.
>>>>>>
>>>>>> I would like to take
>>>>>>
>>>>>>    [s=. 2 6 $ 'ఝ' ,'a','ఝ' NB. � results from 224 176 157 being broken across dimensions
>>>>>> ఝa��
>>>>>> �ఝa�
>>>>>>    [encode=. a. i. s NB. shape of 2 6 refers to the encoding numbers not the number of characters displayed
>>>>>> 224 176 157  97 224 176
>>>>>> 157 224 176 157  97 224
>>>>>>
>>>>>> and convert encode to a form where the encoding for each character is
>>>>>> in its own box. Of course, this would be a verb that can work with any
>>>>>> literal array, not just the example given.
>>>>>>
>>>>>>    [r=. 2 4 $ 224 176 157 ; 97 ; 224 ; 176 ; 157 ; 224 176 157 ; 97 ; 224
>>>>>> ┌───────────┬───────────┬───┬───┐
>>>>>> │224 176 157│97         │224│176│
>>>>>> ├───────────┼───────────┼───┼───┤
>>>>>> │157        │224 176 157│97 │224│
>>>>>> └───────────┴───────────┴───┴───┘
>>>>>>
>>>>>> which could be converted back to
>>>>>>
>>>>>>    {&a. each r
>>>>>> ┌───┬───┬─┬─┐
>>>>>> │ఝ│a  │�│�│
>>>>>> ├───┼───┼─┼─┤
>>>>>> │�  │ఝ│a│�│
>>>>>> └───┴───┴─┴─┘
>>>>>>
>>>>>> With this in place it may be possible to have the literal view of
>>>>>> unicode display a little more consistently.
>>>>>>
>>>>>> Any suggestions would be welcome.
>>>>>>
>>>>>> Cheers, bob
>>>>>>
>>>>>> [0] Video of Enhanced display of literals
>>>>>> https://www.youtube.com/watch?v=BzjfJjGb5cs
>>>>>> ----------------------------------------------------------------------
>>>>>> For information about J forums see http://www.jsoftware.com/forums.htm
