Re: [Jprogramming] Unicode (UTF8) string deconstruction

'Pascal Jasmin' via Programming Thu, 16 Jun 2016 14:43:54 -0700

The problem, I think, is that this is invalid UTF-8:


{. 2 6 $ 'ఝ' ,'a','ఝ'
ఝa��


(8 <@(a.i.u:)("0) 7 u: ]) every   2 # < 'ఝ' ,'a','ఝ'
┌───────────┬──┬───────────┐
│224 176 157│97│224 176 157│
├───────────┼──┼───────────┤
│224 176 157│97│224 176 157│
└───────────┴──┴───────────┘


In


3 u: 8 u:("1) 2 6 $ 'ఝ' ,'a','ఝ'
224 176 157  97 224 176
157 224 176 157  97 224

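the first row cuts off the final byte (157) of its last UTF-8 character, 
leaving an incomplete 224 176. The second row is broken in the same way: it 
starts with an orphaned 157 and ends with a lone 224.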
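
A minimal sketch of one way to box the byte values without going through 7 u:, 
assuming the only goal is to group each lead byte with the continuation bytes 
(values 128 to 191) that follow it. The names iscont and boxrow are made up for 
illustration and do not come from this thread. The grouping does no length 
validation, so a truncated sequence such as 224 176 stays together in one box 
rather than being split into two.

```j
NB. hypothetical helpers for illustration, not from the thread

NB. 1 where a byte value is a UTF-8 continuation byte (128..191)
iscont =: 128&<: *. 191&>:

NB. box the byte values of one non-empty row: a new box starts at every
NB. non-continuation byte, and always at the first byte, so an orphaned
NB. continuation byte at the start of a row is kept rather than dropped
boxrow =: 3 : '(1 (0)} -. iscont y) <;.1 y'

NB. apply to each row of the reshaped literal
boxrow"1 a. i. 2 6 $ 'ఝ' ,'a','ఝ'
```

For the second row this should put 157, 224 176 157, 97 and 224 in separate 
boxes. Rows can produce different numbers of boxes, so the rank 1 application 
pads the shorter rows with empty boxes.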


----- Original Message -----
From: robert therriault <[email protected]>
To: [email protected]
Sent: Thursday, June 16, 2016 2:52 PM
Subject: Re: [Jprogramming] Unicode (UTF8) string deconstruction

Thanks Pascal,

Using the original example

[s=.  2 6 $ 'ఝ' ,'a','ఝ'
ఝa��
�ఝa�

    8 <@(a.i.u:)("0) 7 u: s  NB. Arrays need to be dealt with as rank 1
|rank error
|   8<@(a.i.u:)("0)7     u:s
    8 <@(a.i.u:)("0) 7 u:"1 s  NB. Issues still arise with the partial encodings
|domain error
|   8<@(a.i.u:)("0)7     u:"1 s
    8 <@(a.i.u:)("0) 7 u:"1 {. {: s  NB. Issue with the invalid encoding that J displays as �
|domain error
|   8<@(a.i.u:)("0)7     u:"1{.{:s

I think that the challenge is the partial encodings. The J IDE displays these, 
but the 7 u: gives errors and even using :: for error exceptions I haven't 
found a nice way around the issues.

Cheers, bob

> On Jun 16, 2016, at 11:37 AM, 'Pascal Jasmin' via Programming 
> <[email protected]> wrote:
> 
> 8 <@(a.i.u:)("0) 7 u: 'ఝ' ,'a','ఝ'
> ┌───────────┬──┬───────────┐
> │224 176 157│97│224 176 157│
> └───────────┴──┴───────────┘
> 
> 
> 
> 
> ----- Original Message -----
> From: robert therriault <[email protected]>
> To: [email protected]
> Sent: Thursday, June 16, 2016 2:33 PM
> Subject: Re: [Jprogramming] Unicode (UTF8) string deconstruction
> 
> You are quite right Don,
> 
> I suppose I should change the request to displaying unicode in UTF8. 
> Converting to unicode as you have done also allows manipulation of characters 
> within arrays, but I am looking for ways to show the results when reshaping 
> breaks the UTF8 representation. 
> 
> Do you have a way to take a literal array in UTF8 and box the encodings for 
> each character?
> 
> I have seen your posts in the past and they have helped as I work through 
> this process. Thank you.
> 
> One of the ways that I am looking at dealing with the width issue is to have 
> the character display use a smaller font so that some of the unicode 
> display width issues can be resolved.
> 
> Cheers, bob
> 
>> On Jun 16, 2016, at 11:25 AM, Don Guinn <[email protected]> wrote:
>> 
>> You are not dealing with unicode. You have UTF8.
>> 
>>  ]s=.  7 u: 'ఝ' ,'a','ఝ' NB. s is converted to unicode.
>> ఝaఝ
>>     $s
>> 3
>>  <"0 s
>> +---+-+---+
>> |ఝ|a|ఝ|
>> +---+-+---+
>> 
>> 
>> But the display is still messed up, because it first converts the unicode
>> to UTF8 and then does a byte count to determine how many boxing characters
>> to put around the data. There is a further problem: many unicode/UTF8
>> characters beyond ASCII are displayed with proportional widths. Notice how
>> wide the first and last characters are compared to the "a".
>> 
>> On Thu, Jun 16, 2016 at 12:08 PM, robert therriault <[email protected]>
>> wrote:
>> 
>>> I am in the process of extending some of the type and shape visualizations
>>> that I have done in the past [0] into the realm of unicode.
>>> 
>>> If you look through the archives of these message lists you will find that
>>> unicode can be quite confounding, but my question is relatively simple.
>>> 
>>> I would like to take
>>> 
>>>   [s=.  2 6 $ 'ఝ' ,'a','ఝ'  NB. � results from 224 176 157 being broken across dimensions
>>> ఝa��
>>> �ఝa�
>>>  [encode=. a. i. s       NB. shape of 2 6 refers to the encoding numbers not the number of characters displayed
>>> 224 176 157  97 224 176
>>> 157 224 176 157  97 224
>>> 
>>> and convert encode to a form where the encoding for each character is in
>>> its own box. Of course, this would be a verb that can work with any
>>> literal array, not just the example given.
>>> 
>>> [r=. 2 4 $ 224 176 157 ; 97 ; 224 ; 176 ; 157 ; 224 176 157 ; 97 ; 224
>>> ┌───────────┬───────────┬───┬───┐
>>> │224 176 157│97         │224│176│
>>> ├───────────┼───────────┼───┼───┤
>>> │157        │224 176 157│97 │224│
>>> └───────────┴───────────┴───┴───┘
>>> 
>>> which could be converted back to
>>> 
>>>   {&a.  each r
>>> ┌───┬───┬─┬─┐
>>> │ఝ│a  │�│�│
>>> ├───┼───┼─┼─┤
>>> │�  │ఝ│a│�│
>>> └───┴───┴─┴─┘
>>> 
>>> With this in place it may be possible to have the literal view of unicode
>>> display a little more consistently
>>> 
>>> 
>>> Any suggestions would be welcome.
>>> 
>>> Cheers, bob
>>> 
>>> [0] Video of Enhanced display of literals
>>> https://www.youtube.com/watch?v=BzjfJjGb5cs
>>> ----------------------------------------------------------------------
>>> For information about J forums see http://www.jsoftware.com/forums.htm
