Re: [Jprogramming] Unicode (UTF8) string deconstruction

robert therriault Fri, 17 Jun 2016 20:54:34 -0700

Thanks Bill,

If the utf8 is at most 3 bytes, that takes a layer of checking out of my
utf verb.


    utf_vts_
3 : 0"1
if. y-:'' do. return. end.
try.  ((utf@:((1<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (1<.#) {. ]))) y
   catch. try. ((utf@:((2<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (2<.#) {. ]))) y
     catch. try. ((utf@:((3<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (3<.#) {. ]))) y
                       catch. ({. ; utf@}.) y
                       end.
     end.
   end.
)
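
The nested try./catch. blocks could probably be avoided by reading the
sequence length from the lead byte and validating the candidate group with
the isutf8 test from bill's earlier message (quoted below). This is only a
rough sketch: the names seqlen and boxrow are made up for illustration, and
it assumes that 7 u: signals an error on invalid UTF-8, as the isutf8
results in this thread suggest; that may not hold in every J version.

```j
NB. isutf8 is bill lam's definition from this thread.
isutf8 =: 1:@(7&u:) ::0:

NB. seqlen (hypothetical name): sequence length announced by a lead byte
NB. value; 0 for continuation bytes (128..191) and for values 248..255.
seqlen =: 3 : '(y<128) + (2*(y>:192)*.y<224) + (3*(y>:224)*.y<240) + 4*(y>:240)*.y<248'

NB. boxrow (hypothetical name): box the byte values of one literal row,
NB. one box per valid UTF-8 sequence, one box per stray byte otherwise.
boxrow =: 3 : 0
b =. a. i. y                        NB. byte values of the row
r =. 0 $ a:                         NB. boxes collected so far
i =. 0
while. i < # b do.
  n =. 1 >. seqlen i { b            NB. candidate length, at least 1
  g =. (i + i. n <. (# b) - i) { b  NB. candidate group, clipped at row end
  ok =. (n = # g) *. isutf8 g { a.  NB. complete and accepted by 7 u: ?
  if. -. ok do. g =. , i { b end.   NB. otherwise box the single byte
  r =. r , < g
  i =. i + # g
end.
r
)
```

Unlike boxutf, this takes the literal row directly rather than the byte
values, so it would be applied as boxrow"1 ] 8 6 $ 'ఝ' ,'a','ఝ'. Rows that
yield fewer boxes should be padded with empty boxes when the results are
assembled.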

For the width, I have a couple of ways to deal with that. Initially, I was
leaning toward shrinking the font for literals and keeping the background
rectangle the same size to match the other non-literal characters. Now I
am thinking that I will probably end up having the width of the background
rectangle reflect the number of bytes. This would mean that the wider
characters would have lots of room, because they would have at least
double-wide backgrounds. Single-byte shapes are not as wide as far as I can
see, but once I have a chance to test I'll know whether I have issues with
width.

I think that about does it for me. I am heading off to NYC this week on
vacation. My first time there, so I'll probably be busy seeing the sights.

Cheers, bob

> On Jun 17, 2016, at 8:28 PM, bill lam <[email protected]> wrote:
> 
> I see. But J can only handle unicode in the BMP, i.e. code points below
> 65536, which are at most 3 bytes in utf8.
>   u: 65536
> |index error
> |       u:65536
> 
> Also the display width of a unicode character can vary from 0 to 2.
> 
> On Fri, 17 Jun 2016, robert therriault wrote:
>> Yes, there are certainly illegal utf8 characters in 8 6$ 'ఝ' ,'a','ఝ', but
>> what I am attempting is to reveal the illegal characters for what they
>> are, along the lines of the shape and type display that I had used
>> incorporating svg. Once I have that information in a format where I can
>> separate the illegal characters from the legal and allow a viewer to see
>> the information by hovering over the character, then the reasons for
>> 8 6$ 'ఝ' ,'a','ఝ' looking the way that it does on the J display become
>> more apparent.
>> 
>> Also, being able to distinguish between the 1, 2, 3, and 4 byte utf8
>> representations may allow a bit more consistency in the way that the
>> boxed versions of these characters display.
>> 
>> It remains to be seen how far I get with this, but the ability to show
>> the representation framework of a utf8 array is a step. :-)
>> 
>> Cheers, bob
>> 
>>> On Jun 17, 2016, at 5:13 PM, bill lam <[email protected]> wrote:
>>> 
>>> But your s contains illegal utf8 characters.
>>> 
>>> isutf8=: 1:@(7&u:) ::0:
>>> 
>>>  isutf8 'ఝ' ,'a','ఝ'
>>> 1
>>>  isutf8"1[ 8 6$ 'ఝ' ,'a','ఝ'
>>> 0 0 0 1 0 0 0 0
>>> 
>>>  isutf8"1[ 8 7$ 'ఝ' ,'a','ఝ'
>>> 1 1 1 1 1 1 1 1
>>> 
>>> The 3-character string is 7 bytes in utf8:
>>>  a.i.'ఝ' ,'a','ఝ'
>>> 224 176 157 97 224 176 157
>>> so 8 6 $ .... is not what you would expect. Perhaps you meant
>>> 
>>>  [s=: 8 6 $ 7 u: 'ఝ' ,'a','ఝ'
>>> ఝaఝఝaఝ
>>> ఝaఝఝaఝ
>>> ఝaఝఝaఝ
>>> ఝaఝఝaఝ
>>> ఝaఝఝaఝ
>>> ఝaఝఝaఝ
>>> ఝaఝఝaఝ
>>> ఝaఝఝaఝ
>>> On Jun 18, 2016 7:30 AM, "robert therriault" <[email protected]> wrote:
>>> 
>>>> Thanks for all the suggestions everyone.
>>>> 
>>>> In the end I took a more explicit approach than I normally would, but it
>>>> seems to work.
>>>> 
>>>> I am not sure if this is useful for Henry, but it is one approach.
>>>> 
>>>>   [s=.  8 6 $ 'ఝ' ,'a','ఝ'
>>>> ఝa��
>>>> �ఝa�
>>>> ��ఝa
>>>> ఝఝ
>>>> aఝ��
>>>> �aఝ�
>>>> ��aఝ
>>>> ఝa��
>>>>  boxutf  s
>>>> ┌───────────┬───────────┬───────────┬───────────┐
>>>> │224 176 157│97         │224        │176        │
>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>> │157        │224 176 157│97         │224        │
>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>> │176        │157        │224 176 157│97         │
>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>> │224 176 157│224 176 157│           │           │
>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>> │97         │224 176 157│224        │176        │
>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>> │157        │97         │224 176 157│224        │
>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>> │176        │157        │97         │224 176 157│
>>>> ├───────────┼───────────┼───────────┼───────────┤
>>>> │224 176 157│97         │224        │176        │
>>>> └───────────┴───────────┴───────────┴───────────┘
>>>>  {&a. each boxutf  s
>>>> ┌───┬───┬───┬───┐
>>>> │ఝ│a  │�  │�  │
>>>> ├───┼───┼───┼───┤
>>>> │�  │ఝ│a  │�  │
>>>> ├───┼───┼───┼───┤
>>>> │�  │�  │ఝ│a  │
>>>> ├───┼───┼───┼───┤
>>>> │ఝ│ఝ│   │   │
>>>> ├───┼───┼───┼───┤
>>>> │a  │ఝ│�  │�  │
>>>> ├───┼───┼───┼───┤
>>>> │�  │a  │ఝ│�  │
>>>> ├───┼───┼───┼───┤
>>>> │�  │�  │a  │ఝ│
>>>> ├───┼───┼───┼───┤
>>>> │ఝ│a  │�  │�  │
>>>> └───┴───┴───┴───┘
>>>>  boxutf
>>>> }:@utf@(3&u:)@":
>>>>  utf
>>>> 3 : 0"1
>>>> if. y-:'' do. return. end.
>>>> try.  ((utf@:((1<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (1<.#) {. ]))) y
>>>>  catch. try. ((utf@:((2<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (2<.#) {. ]))) y
>>>>    catch. try. ((utf@:((3<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (3<.#) {. ]))) y
>>>>      catch. try.  ((utf@:((4<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (4<.#) {. ]))) y
>>>>                      catch. ({. ; utf@}.) y
>>>>                      end.
>>>>      end.
>>>>    end.
>>>>  end.
>>>> )
>>>> 
>>>> Row by row, I try the first 1, 2, 3, and then 4 UTF8 numbers. As soon as
>>>> a group is valid I box it and move on with the remaining part of the
>>>> row; if no group is valid, the first number is boxed on its own.
>>>> 
>>>> I am sure others will find a more elegant approach, but this seems to work.
>>>> 
>>>> Cheers, bob
>>>> 
>>>>> On Jun 16, 2016, at 4:27 PM, bill lam <[email protected]> wrote:
>>>>> 
>>>>> The internal representation of a utf8 array is no different from a
>>>>> regular character array; utf8 only applies at the external interface.
>>>>> If you want to manipulate unicode within J, you should use the wide
>>>>> character data type (131072) as suggested by Don.
>>>>> On Jun 17, 2016 2:33 AM, "robert therriault" <[email protected]> wrote:
>>>>> 
>>>>>> You are quite right Don,
>>>>>> 
>>>>>> I should change the request to displaying unicode in UTF8 I suppose.
>>>>>> Converting to unicode as you have done also allows manipulation of
>>>>>> characters within arrays, but I am looking for ways to show the results
>>>>>> when reshaping breaks the UTF8 representation.
>>>>>> 
>>>>>> Do you have a way to take a literal array in UTF8 and box the encodings
>>>>>> for each character?
>>>>>> 
>>>>>> I have seen your posts in the past and they have helped as I work
>>>>>> through this process. Thank you.
>>>>>> 
>>>>>> One of the ways that I am looking at dealing with the width issue is to
>>>>>> have the character display use a smaller font so that some of the
>>>>>> unicode display width issues can be resolved.
>>>>>> 
>>>>>> Cheers, bob
>>>>>> 
>>>>>>> On Jun 16, 2016, at 11:25 AM, Don Guinn <[email protected]> wrote:
>>>>>>> 
>>>>>>> You are not dealing with unicode. You have UTF8.
>>>>>>> 
>>>>>>> ]s=.  7 u: 'ఝ' ,'a','ఝ' NB. s is converted to unicode.
>>>>>>> ఝaఝ
>>>>>>>   $s
>>>>>>> 3
>>>>>>> <"0 s
>>>>>>> +---+-+---+
>>>>>>> |ఝ|a|ఝ|
>>>>>>> +---+-+---+
>>>>>>> 
>>>>>>> But the display is still messed up, because the display first converts
>>>>>>> the unicode to UTF8 and then does a byte count to determine how many
>>>>>>> boxing characters to put around the data. There is a further problem:
>>>>>>> many unicode/UTF8 characters beyond ASCII are proportional in width.
>>>>>>> Notice how wide the first and last characters are compared to the "a".
>>>>>>> 
>>>>>>> On Thu, Jun 16, 2016 at 12:08 PM, robert therriault <[email protected]> wrote:
>>>>>>> 
>>>>>>>> I am in the process of extending some of the type and shape
>>>>>>>> visualizations that I have done in the past [0] into the realm of
>>>>>>>> unicode.
>>>>>>>> 
>>>>>>>> If you look through the archives of these message lists you will find
>>>>>>>> that unicode can be quite confounding, but my question is relatively
>>>>>>>> simple.
>>>>>>>> 
>>>>>>>> I would like to take
>>>>>>>> 
>>>>>>>> [s=.  2 6 $ 'ఝ' ,'a','ఝ'  NB. � results from 224 176 157 being broken across dimensions
>>>>>>>> ఝa��
>>>>>>>> �ఝa�
>>>>>>>> [encode=. a. i. s       NB. shape of 2 6 refers to the encoding numbers, not the number of characters displayed
>>>>>>>> 224 176 157  97 224 176
>>>>>>>> 157 224 176 157  97 224
>>>>>>>> 
>>>>>>>> and convert encode to a form where the encoding for each character is
>>>>>>>> in its own box. Of course, this would be a verb that can work with any
>>>>>>>> literal array, not just the example given.
>>>>>>>> 
>>>>>>>> [r=. 2 4 $ 224 176 157 ; 97 ; 224 ; 176 ; 157 ; 224 176 157 ; 97 ; 224
>>>>>>>> ┌───────────┬───────────┬───┬───┐
>>>>>>>> │224 176 157│97         │224│176│
>>>>>>>> ├───────────┼───────────┼───┼───┤
>>>>>>>> │157        │224 176 157│97 │224│
>>>>>>>> └───────────┴───────────┴───┴───┘
>>>>>>>> 
>>>>>>>> which could be converted back to
>>>>>>>> 
>>>>>>>> {&a.  each r
>>>>>>>> ┌───┬───┬─┬─┐
>>>>>>>> │ఝ│a  │�│�│
>>>>>>>> ├───┼───┼─┼─┤
>>>>>>>> │�  │ఝ│a│�│
>>>>>>>> └───┴───┴─┴─┘
>>>>>>>> 
>>>>>>>> With this in place it may be possible to have the literal view of
>>>>>>>> unicode display a little more consistently.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Any suggestions would be welcome.
>>>>>>>> 
>>>>>>>> Cheers, bob
>>>>>>>> 
>>>>>>>> [0] Video of Enhanced display of literals
>>>>>>>> https://www.youtube.com/watch?v=BzjfJjGb5cs
>>>>>>>> ----------------------------------------------------------------------
>>>>>>>> For information about J forums see http://www.jsoftware.com/forums.htm
> 
> -- 
> regards,
> ====================================================
> GPG key 1024D/4434BAB3 2008-08-24
> gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3
> gpg --keyserver subkeys.pgp.net --armor --export 4434BAB3
