Re: [Jprogramming] Unicode (UTF8) string deconstruction

robert therriault Fri, 17 Jun 2016 16:30:39 -0700

Thanks for all the suggestions everyone.

In the end I took a more explicit approach than I normally would, but it seems 
to work.


I am not sure if this is useful for Henry, but it is one approach.

    [s=.  8 6 $ 'ఝ' ,'a','ఝ'
ఝa��
�ఝa�
��ఝa
ఝఝ
aఝ��
�aఝ�
��aఝ
ఝa��
   boxutf  s
┌───────────┬───────────┬───────────┬───────────┐
│224 176 157│97         │224        │176        │
├───────────┼───────────┼───────────┼───────────┤
│157        │224 176 157│97         │224        │
├───────────┼───────────┼───────────┼───────────┤
│176        │157        │224 176 157│97         │
├───────────┼───────────┼───────────┼───────────┤
│224 176 157│224 176 157│           │           │
├───────────┼───────────┼───────────┼───────────┤
│97         │224 176 157│224        │176        │
├───────────┼───────────┼───────────┼───────────┤
│157        │97         │224 176 157│224        │
├───────────┼───────────┼───────────┼───────────┤
│176        │157        │97         │224 176 157│
├───────────┼───────────┼───────────┼───────────┤
│224 176 157│97         │224        │176        │
└───────────┴───────────┴───────────┴───────────┘
   {&a. each boxutf  s
┌───┬───┬───┬───┐
│ఝ│a  │�  │�  │
├───┼───┼───┼───┤
│�  │ఝ│a  │�  │
├───┼───┼───┼───┤
│�  │�  │ఝ│a  │
├───┼───┼───┼───┤
│ఝ│ఝ│   │   │
├───┼───┼───┼───┤
│a  │ఝ│�  │�  │
├───┼───┼───┼───┤
│�  │a  │ఝ│�  │
├───┼───┼───┼───┤
│�  │�  │a  │ఝ│
├───┼───┼───┼───┤
│ఝ│a  │�  │�  │
└───┴───┴───┴───┘
   boxutf  
}:@utf@(3&u:)@":
   utf
3 : 0"1
if. y-:'' do. return. end.
try.  ((utf@:((1<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (1<.#) {. ]))) y
   catch. try. ((utf@:((2<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (2<.#) {. ]))) y
     catch. try. ((utf@:((3<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (3<.#) {. ]))) y
       catch. try. ((utf@:((4<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (4<.#) {. ]))) y
                       catch. ({. ; utf@}.) y
                       end.
       end.
     end.
   end.
)

Row by row, I try the first 1, 2, 3 and then 4 UTF8 numbers in turn. As soon as 
a group converts as valid UTF8 I box it and move on with the remaining part of 
the row; if none of them converts, the first number is boxed on its own.
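
A shorter way to get a similar split, offered only as a rough sketch and not
as what boxutf does: instead of trying conversions until one succeeds, look at
the lead byte to decide how many bytes the character should have, then check
that the bytes that follow are continuation bytes (128 to 191). The names
seqlen and splitutf are made up for this illustration, and the check is looser
than 7 u: (it does not reject overlong or out-of-range encodings).

```j
NB. seqlen (hypothetical name): expected length of a UTF8 sequence,
NB. judged from the value of its first byte; stray continuation bytes give 1
seqlen =: 3 : '1 + +/ y >: 192 224 240'

NB. splitutf (hypothetical name): box the byte values of one row,
NB. one box per complete character or per stray byte
splitutf =: 3 : 0
r =. 0$a:
while. #y do.
  n =. seqlen {. y
  NB. overtake pads a truncated sequence with 0, which fails the check below
  tail =. }. n {. y
  if. -. *./ (tail >: 128) *. tail < 192 do. n =. 1 end.
  r =. r , < n {. y
  y =. n }. y
end.
r
)
```

Applied as splitutf"1 a. i. s, rows that yield fewer boxes are padded with
empty boxes, much as in the fourth row of the boxutf display above.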

I am sure others will find a more elegant approach, but this seems to work.

Cheers, bob

> On Jun 16, 2016, at 4:27 PM, bill lam <[email protected]> wrote:
> 
> The internal representation of a utf8 array is no different from a regular
> character array; utf8 only applies to the external interface. If you want to
> manipulate unicode within j, you should use the wide character data type
> (131072) as suggested by Don.
> On Jun 17, 2016 2:33 AM, "robert therriault" <[email protected]> wrote:
> 
>> You are quite right Don,
>> 
>> I should change the request to displaying unicode in UTF8 I suppose.
>> Converting to unicode as you have done also allows manipulation of
>> characters within arrays, but I am looking for ways to show the results when
>> reshaping breaks UTF8 representation.
>> 
>> Do you have a way to take a literal array in UTF8 and box the encodings
>> for each character?
>> 
>> I have seen your posts in the past and they have helped as I work through
>> this process. Thank you.
>> 
>> One of the ways that I am looking at dealing with the width issue is to
>> have the character display use a smaller font so that some of the
>> unicode display width issues can be resolved.
>> 
>> Cheers, bob
>> 
>>> On Jun 16, 2016, at 11:25 AM, Don Guinn <[email protected]> wrote:
>>> 
>>> You are not dealing with unicode. You have UTF8.
>>> 
>>>  ]s=.  7 u: 'ఝ' ,'a','ఝ' NB. s is converted to unicode.
>>> 
>>> ఝaఝ
>>> 
>>>     $s
>>> 
>>> 3
>>> 
>>>  <"0 s
>>> 
>>> +---+-+---+
>>> 
>>> |ఝ|a|ఝ|
>>> 
>>> +---+-+---+
>>> 
>>> 
>>> But the display is still messed up because the display first converts the
>>> unicode to UTF8, then does a byte count to determine how many boxing
>>> characters to put around the data. But there is still a problem as many
>>> unicode/UTF8 characters beyond ASCII are proportional. Notice how wide the
>>> first and last characters are compared to the "a".
>>> 
>>> On Thu, Jun 16, 2016 at 12:08 PM, robert therriault <[email protected]> wrote:
>>> 
>>>> I am in the process of extending some of the type and shape visualizations
>>>> that I have done in the past [0] into the realm of unicode.
>>>> 
>>>> If you look through the archives of these message lists you will find that
>>>> unicode can be quite confounding, but my question is relatively simple.
>>>> 
>>>> I would like to take
>>>> 
>>>>   [s=.  2 6 $ 'ఝ' ,'a','ఝ'  NB. � results from 224 176 157 being broken across dimensions
>>>> ఝa��
>>>> �ఝa�
>>>>  [encode=. a. i. s       NB. shape of 2 6 refers to the encoding numbers, not the number of characters displayed
>>>> 224 176 157  97 224 176
>>>> 157 224 176 157  97 224
>>>> 
>>>> and convert encode to a form where the encoding for each character is in
>>>> its own box. Of course, this would be a verb that can work with any
>>>> literal array not just the example given.
>>>> 
>>>> [r=. 2 4 $ 224 176 157 ; 97 ; 224 ; 176 ; 157 ; 224 176 157 ; 97 ; 224
>>>> ┌───────────┬───────────┬───┬───┐
>>>> │224 176 157│97         │224│176│
>>>> ├───────────┼───────────┼───┼───┤
>>>> │157        │224 176 157│97 │224│
>>>> └───────────┴───────────┴───┴───┘
>>>> 
>>>> which could be converted back to
>>>> 
>>>>   {&a.  each r
>>>> ┌───┬───┬─┬─┐
>>>> │ఝ│a  │�│�│
>>>> ├───┼───┼─┼─┤
>>>> │�  │ఝ│a│�│
>>>> └───┴───┴─┴─┘
>>>> 
>>>> With this in place it may be possible to have the literal view of unicode
>>>> display a little more consistently.
>>>> 
>>>> 
>>>> Any suggestions would be welcome.
>>>> 
>>>> Cheers, bob
>>>> 
>>>> [0] Video of Enhanced display of literals
>>>> https://www.youtube.com/watch?v=BzjfJjGb5cs
>>>> ----------------------------------------------------------------------
>>>> For information about J forums see http://www.jsoftware.com/forums.htm