Looks cool,
FYI, here is a page that lists double-wide unicode characters:
http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

The table at https://en.wikipedia.org/wiki/UTF-8 provides a systematic way
of counting characters, but converting to unicode and counting is probably
faster.

----- Original Message -----
From: robert therriault <[email protected]>
To: [email protected]
Sent: Monday, July 4, 2016 3:27 PM
Subject: Re: [Jprogramming] Unicode (UTF8) string deconstruction

I have developed my enhanced view of shapes and types in J a bit further to
include unicode and utf8. Video of the viewer is available here
https://youtu.be/eN9H-rMk1No and feedback is welcomed.

Cheers, bob

> On Jun 17, 2016, at 8:54 PM, robert therriault <[email protected]> wrote:
>
> Thanks Bill,
>
> If the utf8 is at most 3 bytes that takes a layer of checking out of my utf
> verb.
>
>    utf_vts_
> 3 : 0"1
> if. y-:'' do. return. end.
> try. ((utf@:((1<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (1<.#) {. ]))) y
> catch. try. ((utf@:((2<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (2<.#) {. ]))) y
> catch. try. ((utf@:((3<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (3<.#) {. ]))) y
> catch. ({. ; utf@}.) y
> end.
> end.
> end.
> )
>
> For the width, I have a couple of ways to deal with that. Initially, I was
> leaning toward shrinking the font for literals and keeping the background
> rectangle the same size to match the other non-literal characters. Now I
> am thinking that I will probably end up having the width of the background
> rectangle reflect the number of bytes. This would mean that the wider
> characters would have lots of room because they would have at least
> double wide backgrounds. Single byte shapes are not as wide as far as I
> can see, but once I have a chance to test I'll know whether I have issues
> with width.
>
> I think that about does it for me. I am heading off to NYC this week on
> vacation. My first time there, so I'll probably be busy seeing the sights.
>
> Cheers, bob
>
>> On Jun 17, 2016, at 8:28 PM, bill lam <[email protected]> wrote:
>>
>> I see. But J can only handle unicode in bmp, ie codepoints below
>> 65536, which are at most 3 byte utf8.
>>    u: 65536
>> |index error
>> |   u:65536
>>
>> Also the display width of a unicode character can vary from 0 to 2.
>>
>> On Fri, 17 Jun 2016, robert therriault wrote:
>>> Yes there are certainly illegal utf8 characters in 8 6$ 'ఝ' ,'a','ఝ',
>>> but what I am attempting is to reveal the illegal characters for what
>>> they are, along the lines of the shape and type display that I had used
>>> incorporating svg. Once I have that information in a format that I can
>>> separate the illegal characters from the legal and allow a viewer to see
>>> the information by hovering over the character, then the reasons for
>>> 8 6$ 'ఝ' ,'a','ఝ' looking the way that it does on the j display become
>>> more apparent.
>>>
>>> Also, being able to distinguish between the 1, 2, 3, and 4 byte utf8
>>> representations may allow a bit more consistency in the way that the
>>> boxed versions of these characters display.
>>>
>>> It remains to be seen how far I get with this, but the ability to show
>>> the representation framework of a utf8 array is a step. :-)
>>>
>>> Cheers, bob
>>>
>>>> On Jun 17, 2016, at 5:13 PM, bill lam <[email protected]> wrote:
>>>>
>>>> But your s contains illegal utf8 characters.
>>>>
>>>>    isutf8=: 1:@(7&u:) ::0:
>>>>
>>>>    isutf8 'ఝ' ,'a','ఝ'
>>>> 1
>>>>    isutf8"1[ 8 6$ 'ఝ' ,'a','ఝ'
>>>> 0 0 0 1 0 0 0 0
>>>>
>>>>    isutf8"1[ 8 7$ 'ఝ' ,'a','ఝ'
>>>> 1 1 1 1 1 1 1 1
>>>>
>>>> Since the 3-character string is 7 bytes in utf8
>>>>    a.i.'ఝ' ,'a','ఝ'
>>>> 224 176 157 97 224 176 157
>>>> 8 6 $ .... is not what you would expect. Perhaps you meant
>>>>
>>>>    [s=: 8 6 $ 7 u: 'ఝ' ,'a','ఝ'
>>>> ఝaఝఝaఝ
>>>> ఝaఝఝaఝ
>>>> ఝaఝఝaఝ
>>>> ఝaఝఝaఝ
>>>> ఝaఝఝaఝ
>>>> ఝaఝఝaఝ
>>>> ఝaఝఝaఝ
>>>> ఝaఝఝaఝ
>>>> On Jun 18, 2016 7:30 AM, "robert therriault" <[email protected]> wrote:
>>>>
>>>>> Thanks for all the suggestions everyone.
>>>>>
>>>>> In the end I took a more explicit approach than I normally would, but it
>>>>> seems to work.
>>>>>
>>>>> I am not sure if this is useful for Henry, but it is one approach.
>>>>>
>>>>>    [s=. 8 6 $ 'ఝ' ,'a','ఝ'
>>>>> ఝa��
>>>>> �ఝa�
>>>>> ��ఝa
>>>>> ఝఝ
>>>>> aఝ��
>>>>> �aఝ�
>>>>> ��aఝ
>>>>> ఝa��
>>>>>    boxutf s
>>>>> ┌───────────┬───────────┬───────────┬───────────┐
>>>>> │224 176 157│97         │224        │176        │
>>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>>> │157        │224 176 157│97         │224        │
>>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>>> │176        │157        │224 176 157│97         │
>>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>>> │224 176 157│224 176 157│           │           │
>>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>>> │97         │224 176 157│224        │176        │
>>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>>> │157        │97         │224 176 157│224        │
>>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>>> │176        │157        │97         │224 176 157│
>>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>>> │224 176 157│97         │224        │176        │
>>>>> └───────────┴───────────┴───────────┴───────────┘
>>>>>    {&a. each boxutf s
>>>>> ┌───┬───┬───┬───┐
>>>>> │ఝ│a  │�  │�  │
>>>>> ├───┼───┼───┼───┤
>>>>> │�  │ఝ│a  │�  │
>>>>> ├───┼───┼───┼───┤
>>>>> │�  │�  │ఝ│a  │
>>>>> ├───┼───┼───┼───┤
>>>>> │ఝ│ఝ│   │   │
>>>>> ├───┼───┼───┼───┤
>>>>> │a  │ఝ│�  │�  │
>>>>> ├───┼───┼───┼───┤
>>>>> │�  │a  │ఝ│�  │
>>>>> ├───┼───┼───┼───┤
>>>>> │�  │�  │a  │ఝ│
>>>>> ├───┼───┼───┼───┤
>>>>> │ఝ│a  │�  │�  │
>>>>> └───┴───┴───┴───┘
>>>>>    boxutf
>>>>> }:@utf@(3&u:)@":
>>>>>    utf
>>>>> 3 : 0"1
>>>>> if. y-:'' do. return. end.
>>>>> try. ((utf@:((1<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (1<.#) {. ]))) y
>>>>> catch. try. ((utf@:((2<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (2<.#) {. ]))) y
>>>>> catch. try. ((utf@:((3<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (3<.#) {. ]))) y
>>>>> catch. try. ((utf@:((4<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (4<.#) {. ]))) y
>>>>> catch. ({. ; utf@}.) y
>>>>> end.
>>>>> end.
>>>>> end.
>>>>> end.
>>>>> )
>>>>>
>>>>> Row by row I am just grabbing up to 4 UTF8 numbers and boxing them.
>>>>> Whenever the numbers are valid I box them and move on with the remaining
>>>>> part of the row.
>>>>>
>>>>> I am sure others will find a more elegant approach, but this seems to
>>>>> work.
>>>>>
>>>>> Cheers, bob
>>>>>
>>>>>> On Jun 16, 2016, at 4:27 PM, bill lam <[email protected]> wrote:
>>>>>>
>>>>>> The internal representation of a utf8 array is no different from a
>>>>>> regular character array; utf8 only applies to the external interface.
>>>>>> If you want to manipulate unicode within j, you should use the wide
>>>>>> character data type (131072) as suggested by Don.
>>>>>> On Jun 17, 2016 2:33 AM, "robert therriault" <[email protected]> wrote:
>>>>>>
>>>>>>> You are quite right Don,
>>>>>>>
>>>>>>> I should change the request to displaying unicode in UTF8 I suppose.
>>>>>>> Converting to unicode as you have done also allows manipulation of
>>>>>>> characters within arrays, but I am looking for ways to show the results
>>>>>>> when reshaping breaks UTF8 representation.
>>>>>>>
>>>>>>> Do you have a way to take a literal array in UTF8 and box the encodings
>>>>>>> for each character?
>>>>>>>
>>>>>>> I have seen your posts in the past and they have helped as I work
>>>>>>> through this process. Thank you.
>>>>>>>
>>>>>>> One of the ways that I am looking at dealing with the width issue is to
>>>>>>> have the character display in a smaller font so that some of the
>>>>>>> unicode display width issues can be resolved.
>>>>>>>
>>>>>>> Cheers, bob
>>>>>>>
>>>>>>>> On Jun 16, 2016, at 11:25 AM, Don Guinn <[email protected]> wrote:
>>>>>>>>
>>>>>>>> You are not dealing with unicode. You have UTF8.
>>>>>>>>
>>>>>>>>    ]s=. 7 u: 'ఝ' ,'a','ఝ' NB. s is converted to unicode.
>>>>>>>> ఝaఝ
>>>>>>>>    $s
>>>>>>>> 3
>>>>>>>>    <"0 s
>>>>>>>> +---+-+---+
>>>>>>>> |ఝ|a|ఝ|
>>>>>>>> +---+-+---+
>>>>>>>>
>>>>>>>> But the display is still messed up because it first converts the
>>>>>>>> unicode to UTF8 and then does a byte count to determine how many boxing
>>>>>>>> characters to put around the data. But there is still a problem as many
>>>>>>>> unicode/UTF8 characters beyond ASCII are proportional. Notice how wide
>>>>>>>> the first and last characters are compared to the "a".
>>>>>>>>
>>>>>>>> On Thu, Jun 16, 2016 at 12:08 PM, robert therriault <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I am in the process of extending some of the type and shape
>>>>>>>>> visualizations that I have done in the past [0] into the realm of
>>>>>>>>> unicode.
>>>>>>>>>
>>>>>>>>> If you look through the archives of these message lists you will find
>>>>>>>>> that unicode can be quite confounding, but my question is relatively
>>>>>>>>> simple.
>>>>>>>>>
>>>>>>>>> I would like to take
>>>>>>>>>
>>>>>>>>>    [s=. 2 6 $ 'ఝ' ,'a','ఝ' NB. � results from 224 176 157 being broken across dimensions
>>>>>>>>> ఝa��
>>>>>>>>> �ఝa�
>>>>>>>>>    [encode=. a. i. s NB. shape of 2 6 refers to the encoding numbers not the number of characters displayed
>>>>>>>>> 224 176 157 97 224 176
>>>>>>>>> 157 224 176 157 97 224
>>>>>>>>>
>>>>>>>>> and convert encode to a form where the encoding for each character is
>>>>>>>>> in its own box. Of course, this would be a verb that can work with any
>>>>>>>>> literal array, not just the example given.
>>>>>>>>>
>>>>>>>>>    [r=. 2 4 $ 224 176 157 ; 97 ; 224 ; 176 ; 157 ; 224 176 157 ; 97 ; 224
>>>>>>>>> ┌───────────┬───────────┬───┬───┐
>>>>>>>>> │224 176 157│97         │224│176│
>>>>>>>>> ├───────────┼───────────┼───┼───┤
>>>>>>>>> │157        │224 176 157│97 │224│
>>>>>>>>> └───────────┴───────────┴───┴───┘
>>>>>>>>>
>>>>>>>>> which could be converted back to
>>>>>>>>>
>>>>>>>>>    {&a. each r
>>>>>>>>> ┌───┬───┬─┬─┐
>>>>>>>>> │ఝ│a  │�│�│
>>>>>>>>> ├───┼───┼─┼─┤
>>>>>>>>> │�  │ఝ│a│�│
>>>>>>>>> └───┴───┴─┴─┘
>>>>>>>>>
>>>>>>>>> With this in place it may be possible to have the literal view of
>>>>>>>>> unicode display a little more consistently.
>>>>>>>>>
>>>>>>>>> Any suggestions would be welcome.
>>>>>>>>>
>>>>>>>>> Cheers, bob
>>>>>>>>>
>>>>>>>>> [0] Video of Enhanced display of literals
>>>>>>>>> https://www.youtube.com/watch?v=BzjfJjGb5cs
>>>>>>>>> ----------------------------------------------------------------------
>>>>>>>>> For information about J forums see http://www.jsoftware.com/forums.htm
>>
>> --
>> regards,
>> ====================================================
>> GPG key 1024D/4434BAB3 2008-08-24
>> gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3
>> gpg --keyserver subkeys.pgp.net --armor --export 4434BAB3
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
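P.S. For anyone who wants to see the lead-byte table approach from the top of this message spelled out, here is a minimal sketch. It is in Python rather than J, purely for illustration, and it is not a translation of bob's utf verb. The names seq_len, is_valid, split_utf8, reshape and count_chars are all made up for this example; reshape only imitates J's cyclic $ on a byte string. Bytes that do not start a complete, valid sequence are kept as one-byte pieces, which is roughly what the boxutf output in the thread shows.

```python
from itertools import cycle, islice

def seq_len(lead):
    """Sequence length implied by a lead byte (UTF-8 table); 0 if not a lead byte."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte (0x80-0xBF) or a byte that never starts a sequence

def is_valid(chunk):
    """True if chunk decodes as UTF-8 (this also rejects overlong forms)."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def split_utf8(row):
    """Split one row of bytes into complete UTF-8 sequences and stray single bytes."""
    pieces, i = [], 0
    while i < len(row):
        n = seq_len(row[i])
        chunk = row[i:i + n]
        if n and len(chunk) == n and is_valid(chunk):
            pieces.append(chunk)
            i += n
        else:
            pieces.append(row[i:i + 1])  # stray or truncated byte, kept alone
            i += 1
    return pieces

def reshape(data, rows, cols):
    """Hypothetical stand-in for J's rows cols $ data (cyclic fill)."""
    it = cycle(data)
    return [bytes(islice(it, cols)) for _ in range(rows)]

def count_chars(data):
    """Count characters in valid UTF-8 by skipping continuation bytes (10xxxxxx)."""
    return sum((b & 0xC0) != 0x80 for b in data)

data = "\u0c1da\u0c1d".encode("utf-8")   # the 7 bytes 224 176 157 97 224 176 157
for row in reshape(data, 2, 6):
    print([list(p) for p in split_utf8(row)])
# [[224, 176, 157], [97], [224], [176]]
# [[157], [224, 176, 157], [97], [224]]
```

The two printed rows agree with the boxed r in bob's original question. count_chars only gives a meaningful answer on valid UTF-8, so on a reshaped array it would need to be applied before the reshape.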

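P.P.S. On display width varying from 0 to 2: a rough Python sketch using only the standard unicodedata module. This is an approximation in the spirit of wcwidth.c, not a port of it. The function names char_width and display_width are made up here, and the rule (combining marks and format characters count 0, East Asian Wide/Fullwidth count 2, everything else 1) is my own simplification. Control characters and special cases such as the soft hyphen are not handled.

```python
import unicodedata

def char_width(ch):
    """Approximate column width of one character: 0, 1 or 2 (simplified rule)."""
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0  # combining marks and format characters take no column
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # East Asian Wide and Fullwidth take two columns
    return 1

def display_width(s):
    """Sum of the approximate widths of the characters in s."""
    return sum(char_width(ch) for ch in s)

for ch in ["a", "\u4e2d", "\u0301", "\u0c1d"]:
    print("U+%04X" % ord(ch), char_width(ch))
```

Under this rule ఝ (U+0C1D) counts as one column even though many fonts draw it wider than "a", so a table like this would not by itself fix the proportional-glyph problem Don pointed out.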