Re: [Jprogramming] Unicode (UTF8) string deconstruction

'Pascal Jasmin' via Programming Mon, 04 Jul 2016 14:03:07 -0700

Looks cool,


FYI, this page lists the double-width unicode characters:

http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c

The table at https://en.wikipedia.org/wiki/UTF-8 provides a systematic way of
counting characters, but converting to unicode and counting is probably faster.
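
A rough sketch of the two counting approaches, in case it helps. This is my
own illustration, not code from the thread, and nchars is a hypothetical name.
It assumes valid utf8 input: every character then contributes exactly one byte
that is not a continuation byte (128 to 191), so counting those bytes counts
characters. The byte values are the ones quoted below for 'ఝ','a','ఝ'.

```j
NB. Sketch only: nchars is a hypothetical helper, not part of J or the thread.
NB. Continuation bytes are 128..191, i.e. floor(byte % 64) is 2.
nchars =: 3 : '+/ 2 ~: <. (a. i. y) % 64'

s =: 224 176 157 97 224 176 157 { a.   NB. the utf8 bytes of 'ఝ','a','ఝ'
nchars s                               NB. count lead bytes without converting
# 7 u: s                               NB. convert to unicode, then count
```

Both expressions give 3 for this string. On a row with broken sequences the
byte count still returns a number, so it is not a validity check.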




----- Original Message -----
From: robert therriault <[email protected]>
To: [email protected]
Sent: Monday, July 4, 2016 3:27 PM
Subject: Re: [Jprogramming] Unicode (UTF8) string deconstruction

I have developed my enhanced view of shapes and types in J a bit further to 
include unicode and utf8. 

Video of the viewer is available here https://youtu.be/eN9H-rMk1No and feedback 
is welcomed.

Cheers, bob

> On Jun 17, 2016, at 8:54 PM, robert therriault <[email protected]> wrote:
> 
> Thanks Bill,
> 
> If the utf8 is at most 3 bytes that takes a layer of checking out of my utf 
> verb.
> 
>    utf_vts_
> 3 : 0"1
> if. y-:'' do. return. end.
> try.  ((utf@:((1<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (1<.#) {. ]))) y
>   catch. try. ((utf@:((2<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (2<.#) {. ]))) y
>     catch. try. ((utf@:((3<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (3<.#) {. ]))) y
>                       catch. ({. ; utf@}.) y
>                       end.
>     end.
>   end.
> )
> 
> For the width, I have a couple of ways to deal with that. Initially, I was
> leaning toward shrinking the font for literals and keeping the background
> rectangle the same size to match the other non-literal characters. Now I am
> thinking that I will probably end up having the width of the background
> rectangle reflect the number of bytes. This would mean that the wider
> characters would have lots of room because they would have at least
> double-wide backgrounds. Single-byte shapes are not as wide as far as I can
> see, but once I have a chance to test I'll know whether I have issues with
> width.
> 
> I think that about does it for me. I am heading off to NYC this week on
> vacation. My first time there, so I'll probably be busy seeing the sights.
> 
> Cheers, bob
> 
>> On Jun 17, 2016, at 8:28 PM, bill lam <[email protected]> wrote:
>> 
>> I see. But J can only handle unicode in the BMP, i.e. codepoints below
>> 65536, which are at most 3 bytes in utf8.
>>  u: 65536
>> |index error
>> |       u:65536
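
To make the 3-byte bound concrete, here is a small sketch that computes the
utf8 length of a code point from the standard utf8 ranges. It is my own
illustration, not code from the thread, and utf8len is a hypothetical name.

```j
NB. Sketch only: utf8len is a hypothetical name, not from the thread.
NB. utf8 uses 1 byte below 128, 2 below 2048, 3 below 65536, else 4.
utf8len =: 3 : '1 + +/ y >: 128 2048 65536'"0

utf8len 97 3101 65535 65536    NB. 3101 is the code point of 'ఝ'
```

The result is 1 3 3 4: everything up to 65535 fits in 3 bytes, and only code
points from 65536 upward need the 4-byte form.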
>> 
>> Also the display width of a unicode character can vary from 0 to 2.
>> 
>> On Fri, 17 Jun 2016, robert therriault wrote:
>>> Yes, there are certainly illegal utf8 characters in 8 6$ 'ఝ' ,'a','ఝ', but
>>> what I am attempting is to reveal the illegal characters for what they are,
>>> along the lines of the shape and type display that I had built with svg.
>>> Once I have that information in a format where I can separate the illegal
>>> characters from the legal ones and let a viewer see the information by
>>> hovering over a character, the reasons for 8 6$ 'ఝ' ,'a','ఝ' looking the
>>> way that it does on the J display become more apparent.
>>> 
>>> Also, being able to distinguish between the 1, 2, 3, and 4 byte utf8
>>> representations may allow a bit more consistency in the way that the boxed
>>> versions of these characters display.
>>> 
>>> It remains to be seen how far I get with this, but the ability to show the
>>> representation framework of a utf8 array is a step. :-)
>>> 
>>> Cheers, bob
>>> 
>>>> On Jun 17, 2016, at 5:13 PM, bill lam <[email protected]> wrote:
>>>> 
>>>> But your s contains illegal utf8 characters.
>>>> 
>>>> isutf8=: 1:@(7&u:) ::0:
>>>> 
>>>> isutf8 'ఝ' ,'a','ఝ'
>>>> 1
>>>> isutf8"1[ 8 6$ 'ఝ' ,'a','ఝ'
>>>> 0 0 0 1 0 0 0 0
>>>> 
>>>> isutf8"1[ 8 7$ 'ఝ' ,'a','ఝ'
>>>> 1 1 1 1 1 1 1 1
>>>> 
>>>> The 3-character string is 7 bytes in utf8:
>>>> a.i.'ఝ' ,'a','ఝ'
>>>> 224 176 157 97 224 176 157
>>>> so 8 6 $ .... is not what you would expect. Perhaps you meant
>>>> 
>>>> [s=: 8 6 $ 7 u: 'ఝ' ,'a','ఝ'
>>>> ఝaఝఝaఝ
>>>> ఝaఝఝaఝ
>>>> ఝaఝఝaఝ
>>>> ఝaఝఝaఝ
>>>> ఝaఝఝaఝ
>>>> ఝaఝఝaఝ
>>>> ఝaఝఝaఝ
>>>> ఝaఝఝaఝ
>>>> On Jun 18, 2016 7:30 AM, "robert therriault" <[email protected]> wrote:
>>>> 
>>>>> Thanks for all the suggestions everyone.
>>>>> 
>>>>> In the end I took a more explicit approach than I normally would, but it
>>>>> seems to work.
>>>>> 
>>>>> I am not sure if this is useful for Henry, but it is one approach.
>>>>> 
>>>>>  [s=.  8 6 $ 'ఝ' ,'a','ఝ'
>>>>> ఝa��
>>>>> �ఝa�
>>>>> ��ఝa
>>>>> ఝఝ
>>>>> aఝ��
>>>>> �aఝ�
>>>>> ��aఝ
>>>>> ఝa��
>>>>> boxutf  s
>>>>> ┌───────────┬───────────┬───────────┬───────────┐
>>>>> │224 176 157│97         │224        │176        │
>>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>>> │157        │224 176 157│97         │224        │
>>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>>> │176        │157        │224 176 157│97         │
>>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>>> │224 176 157│224 176 157│           │           │
>>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>>> │97         │224 176 157│224        │176        │
>>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>>> │157        │97         │224 176 157│224        │
>>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>>> │176        │157        │97         │224 176 157│
>>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>>> │224 176 157│97         │224        │176        │
>>>>> └───────────┴───────────┴───────────┴───────────┘
>>>>> {&a. each boxutf  s
>>>>> ┌───┬───┬───┬───┐
>>>>> │ఝ│a  │�  │�  │
>>>>> ├───┼───┼───┼───┤
>>>>> │�  │ఝ│a  │�  │
>>>>> ├───┼───┼───┼───┤
>>>>> │�  │�  │ఝ│a  │
>>>>> ├───┼───┼───┼───┤
>>>>> │ఝ│ఝ│   │   │
>>>>> ├───┼───┼───┼───┤
>>>>> │a  │ఝ│�  │�  │
>>>>> ├───┼───┼───┼───┤
>>>>> │�  │a  │ఝ│�  │
>>>>> ├───┼───┼───┼───┤
>>>>> │�  │�  │a  │ఝ│
>>>>> ├───┼───┼───┼───┤
>>>>> │ఝ│a  │�  │�  │
>>>>> └───┴───┴───┴───┘
>>>>> boxutf
>>>>> }:@utf@(3&u:)@":
>>>>> utf
>>>>> 3 : 0"1
>>>>> if. y-:'' do. return. end.
>>>>> try.  ((utf@:((1<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (1<.#) {. ]))) y
>>>>> catch. try. ((utf@:((2<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (2<.#) {. ]))) y
>>>>>   catch. try. ((utf@:((3<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (3<.#) {. ]))) y
>>>>>     catch. try.  ((utf@:((4<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (4<.#) {. ]))) y
>>>>>                     catch. ({. ; utf@}.) y
>>>>>                     end.
>>>>>     end.
>>>>>   end.
>>>>> end.
>>>>> )
>>>>> 
>>>>> Row by row I am just grabbing up to 4 UTF8 numbers and boxing them.
>>>>> Whenever the numbers are valid I box them and move on with the remaining
>>>>> part of the row.
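
For comparison, a non-recursive sketch of the same row-by-row idea. This is my
own illustration, not bob's code: seqlen and boxrow are hypothetical names, and
instead of relying on 7 u: failing, it reads the expected length from the lead
byte and checks that the following bytes are continuation bytes. It does not
reject overlong forms or other finer utf8 rules.

```j
NB. Sketch only: seqlen and boxrow are hypothetical names, not from the thread.
NB. Sequence length implied by a lead byte; any other byte counts as 1.
seqlen =: 3 : '1 + +/ y >: 192 224 240'"0

boxrow =: 3 : 0
b =. a. i. y                         NB. byte values of one literal row
r =. 0 $ a:                          NB. boxes collected so far
i =. 0
while. i < # b do.
  n =. seqlen i { b                  NB. bytes the lead byte asks for
  c =. n <. (# b) - i                NB. bytes actually left in the row
  tail =. (i + 1 + i. c - 1) { b     NB. bytes that should be continuations
  ok =. (c = n) *. *./ 2 = <. tail % 64
  if. ok do. k =. n else. k =. 1 end.
  r =. r , < (i + i. k) { b          NB. box a whole sequence or a lone byte
  i =. i + k
end.
r
)

boxrow 224 176 157 97 224 176 { a.
```

For this row, the first row of the 8 6 $ example, the boxes hold 224 176 157,
then 97, then 224, then 176, the same grouping as the first row of the boxutf
display above.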
>>>>> 
>>>>> I am sure others will find a more elegant approach, but this seems to work.
>>>>> 
>>>>> Cheers, bob
>>>>> 
>>>>>> On Jun 16, 2016, at 4:27 PM, bill lam <[email protected]> wrote:
>>>>>> 
>>>>>> The internal representation of a utf8 array is no different from a
>>>>>> regular character array; utf8 only applies to the external interface. If
>>>>>> you want to manipulate unicode within J, you should use the wide
>>>>>> character data type (131072) as suggested by Don.
>>>>>> On Jun 17, 2016 2:33 AM, "robert therriault" <[email protected]> wrote:
>>>>>> 
>>>>>>> You are quite right Don,
>>>>>>> 
>>>>>>> I should change the request to displaying unicode in UTF8 I suppose.
>>>>>>> Converting to unicode as you have done also allows manipulation of
>>>>>>> characters within arrays, but I am looking for ways to show the results
>>>>>>> when reshaping breaks the UTF8 representation.
>>>>>>> 
>>>>>>> Do you have a way to take a literal array in UTF8 and box the encodings
>>>>>>> for each character?
>>>>>>> 
>>>>>>> I have seen your posts in the past and they have helped as I work through
>>>>>>> this process. Thank you.
>>>>>>> 
>>>>>>> One of the ways that I am looking at dealing with the width issue is to
>>>>>>> have the character display use a smaller font so that some of the
>>>>>>> unicode display width issues can be resolved.
>>>>>>> 
>>>>>>> Cheers, bob
>>>>>>> 
>>>>>>>> On Jun 16, 2016, at 11:25 AM, Don Guinn <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> You are not dealing with unicode. You have UTF8.
>>>>>>>> 
>>>>>>>> ]s=.  7 u: 'ఝ' ,'a','ఝ' NB. s is converted to unicode.
>>>>>>>> ఝaఝ
>>>>>>>>  $s
>>>>>>>> 3
>>>>>>>> <"0 s
>>>>>>>> +---+-+---+
>>>>>>>> |ఝ|a|ఝ|
>>>>>>>> +---+-+---+
>>>>>>>> 
>>>>>>>> But the display is still messed up because the display first converts the
>>>>>>>> unicode to UTF8, then does a byte count to determine how many boxing
>>>>>>>> characters to put around the data. There is a further problem, as many
>>>>>>>> unicode/UTF8 characters beyond ASCII are proportional. Notice how wide the
>>>>>>>> first and last characters are compared to the "a".
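
If the display works as Don describes, the mismatch comes down to two
different counts for the same data. A minimal sketch of that difference (my
own illustration, using the utf8 byte values of 'ఝ','a','ఝ'):

```j
NB. Sketch only, using the utf8 byte values of 'ఝ','a','ఝ'.
s =: 7 u: 224 176 157 97 224 176 157 { a.   NB. unicode 'ఝ','a','ఝ'
# s              NB. characters
# 8 u: s         NB. bytes after converting back to utf8
```

The first count is 3 and the second is 7, so a box drawn from the byte count
gets 3 columns for each 'ఝ' but only 1 for the "a".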
>>>>>>>> 
>>>>>>>> On Thu, Jun 16, 2016 at 12:08 PM, robert therriault <[email protected]>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> I am in the process of extending some of the type and shape visualizations
>>>>>>>>> that I have done in the past [0] into the realm of unicode.
>>>>>>>>> 
>>>>>>>>> If you look through the archives of these message lists you will find that
>>>>>>>>> unicode can be quite confounding, but my question is relatively simple.
>>>>>>>>> 
>>>>>>>>> I would like to take
>>>>>>>>> 
>>>>>>>>> [s=.  2 6 $ 'ఝ' ,'a','ఝ'  NB. � results from 224 176 157 being broken across dimensions
>>>>>>>>> ఝa��
>>>>>>>>> �ఝa�
>>>>>>>>> [encode=. a. i. s       NB. shape of 2 6 refers to the encoding numbers, not the number of characters displayed
>>>>>>>>> 224 176 157  97 224 176
>>>>>>>>> 157 224 176 157  97 224
>>>>>>>>> 
>>>>>>>>> and convert encode to a form where the encoding for each character is in
>>>>>>>>> its own box. Of course, this would be a verb that can work with any
>>>>>>>>> literal array, not just the example given.
>>>>>>>>> 
>>>>>>>>> [r=. 2 4 $ 224 176 157 ; 97 ; 224 ; 176 ; 157 ; 224 176 157 ; 97 ; 224
>>>>>>>>> ┌───────────┬───────────┬───┬───┐
>>>>>>>>> │224 176 157│97         │224│176│
>>>>>>>>> ├───────────┼───────────┼───┼───┤
>>>>>>>>> │157        │224 176 157│97 │224│
>>>>>>>>> └───────────┴───────────┴───┴───┘
>>>>>>>>> 
>>>>>>>>> which could be converted back to
>>>>>>>>> 
>>>>>>>>> {&a.  each r
>>>>>>>>> ┌───┬───┬─┬─┐
>>>>>>>>> │ఝ│a  │�│�│
>>>>>>>>> ├───┼───┼─┼─┤
>>>>>>>>> │�  │ఝ│a│�│
>>>>>>>>> └───┴───┴─┴─┘
>>>>>>>>> 
>>>>>>>>> With this in place it may be possible to have the literal view of unicode
>>>>>>>>> display a little more consistently.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Any suggestions would be welcome.
>>>>>>>>> 
>>>>>>>>> Cheers, bob
>>>>>>>>> 
>>>>>>>>> [0] Video of Enhanced display of literals
>>>>>>>>> https://www.youtube.com/watch?v=BzjfJjGb5cs
>>>>>>>>> ----------------------------------------------------------------------
>>>>>>>>> For information about J forums see http://www.jsoftware.com/forums.htm
>> 
>> -- 
>> regards,
>> ====================================================
>> GPG key 1024D/4434BAB3 2008-08-24
>> gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3
>> gpg --keyserver subkeys.pgp.net --armor --export 4434BAB3
