Thanks for all the suggestions everyone.
In the end I took a more explicit approach than I normally would, but it seems
to work.
I am not sure if this is useful for Henry, but it is one approach.
   [s=. 8 6 $ 'ఝ' ,'a','ఝ'
ఝa��
�ఝa�
��ఝa
ఝఝ
aఝ��
�aఝ�
��aఝ
ఝa��
   boxutf s
┌───────────┬───────────┬───────────┬───────────┐
│224 176 157│97 │224 │176 │
├───────────┼───────────┼───────────┼───────────┤
│157 │224 176 157│97 │224 │
├───────────┼───────────┼───────────┼───────────┤
│176 │157 │224 176 157│97 │
├───────────┼───────────┼───────────┼───────────┤
│224 176 157│224 176 157│ │ │
├───────────┼───────────┼───────────┼───────────┤
│97 │224 176 157│224 │176 │
├───────────┼───────────┼───────────┼───────────┤
│157 │97 │224 176 157│224 │
├───────────┼───────────┼───────────┼───────────┤
│176 │157 │97 │224 176 157│
├───────────┼───────────┼───────────┼───────────┤
│224 176 157│97 │224 │176 │
└───────────┴───────────┴───────────┴───────────┘
   {&a. each boxutf s
┌───┬───┬───┬───┐
│ఝ│a │� │� │
├───┼───┼───┼───┤
│� │ఝ│a │� │
├───┼───┼───┼───┤
│� │� │ఝ│a │
├───┼───┼───┼───┤
│ఝ│ఝ│ │ │
├───┼───┼───┼───┤
│a │ఝ│� │� │
├───┼───┼───┼───┤
│� │a │ఝ│� │
├───┼───┼───┼───┤
│� │� │a │ఝ│
├───┼───┼───┼───┤
│ఝ│a │� │� │
└───┴───┴───┴───┘
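   boxutf
}:@utf@(3&u:)@":
   utf
3 : 0"1
if. y-:'' do. return. end.
NB. Try the first 1, 2, 3, then 4 numbers: box the first group that converts
NB. cleanly and recurse on the rest of the row. (This relies on 7 u: failing
NB. on a group that is not valid UTF8.)
try. ((utf@:((1<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (1<.#) {. ]))) y
catch. try. ((utf@:((2<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (2<.#) {. ]))) y
catch. try. ((utf@:((3<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (3<.#) {. ]))) y
catch. try. ((utf@:((4<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (4<.#) {. ]))) y
NB. No group is valid: box the first number alone and carry on.
catch. ({. ; utf@}.) y
end.
end.
end.
end.
)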
Row by row, I try the first 1, 2, 3 and then 4 UTF8 numbers. The first group
that forms a valid character gets boxed, and I move on with the remaining part
of the row. If no group is valid, the first number is boxed on its own.
I am sure others will find a more elegant approach, but this seems to work.
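
For comparison, here is a minimal sketch of the same row-by-row idea that avoids the nested try/catch by reading the expected length from the lead byte. The names utflen and boxchars are hypothetical and not part of the code above. The validity check is deliberately simplified (it does not reject every malformed sequence, overlong forms for example), so treat it as an illustration under those assumptions rather than a replacement.

```j
NB. utflen (hypothetical name): expected UTF-8 sequence length for a lead
NB. byte value: 1 for ASCII, 2..4 for multi-byte leads, 0 for anything else
NB. (continuation bytes 128..191 and the unused values 192 193 245..255).
utflen =: 3 : 0
 (y < 128) + (2 * (y >: 194) *. y < 224) + (3 * (y >: 224) *. y < 240) + 4 * (y >: 240) *. y < 245
)

NB. boxchars (hypothetical name): y is a list of byte values, e.g. a. i. row
boxchars =: 3 : 0
 r =. 0 $ a:
 while. # y do.
  n =. utflen {. y
  NB. ok: a lead byte, enough bytes left, and every byte after the lead is
  NB. a continuation byte (128..191, that is 2 = <. byte % 64)
  ok =. (n > 0) *. (n <: # y) *. *./ 2 = <. 64 %~ }. n {. y
  m =. ok { 1 , n          NB. take the whole sequence if ok, else one byte
  r =. r , < m {. y
  y =. m }. y
 end.
 r
)
```

With these definitions, boxchars 224 176 157 97 224 176 should give four boxes holding 224 176 157, 97, 224 and 176, and boxchars"1 a. i. s applies it to every row of s, with shorter rows padded by empty boxes.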
Cheers, bob
> On Jun 16, 2016, at 4:27 PM, bill lam wrote:
>
> The internal representation of a utf8 array is no different from a regular
> character array; utf8 only applies to the external interface. If you want to
> manipulate unicode within J, you should use the wide character data type
> (131072) as suggested by Don.
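
To make the difference between the two representations concrete, here is a short session sketch. The names s and w are only for illustration, and the values in the comments are what I would expect for this particular string.

```j
   s =. 'ఝ' ,'a','ఝ'      NB. literal: the UTF-8 bytes as typed
   3!:0 s                  NB. 2, the literal (byte) type
   # s                     NB. 7, one item per byte
   w =. 7 u: s             NB. converted to the wide character type
   3!:0 w                  NB. 131072
   # w                     NB. 3, one item per character
```

Reshaping s can split a character across rows because each item is a byte, whereas reshaping w keeps each character whole.
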
> On Jun 17, 2016 2:33 AM, "robert therriault" wrote:
>
>> You are quite right Don,
>>
>> I should change the request to displaying unicode in UTF8 I suppose.
>> Converting to unicode as you have done also allows manipulation of
>> characters within arrays, but I am looking for ways to show the results when
>> reshaping breaks UTF8 representation.
>>
>> Do you have a way to take a literal array in UTF8 and box the encodings
>> for each character?
>>
>> I have seen your posts in the past and they have helped as I work through
>> this process. Thank you.
>>
>> One of the ways that I am looking at dealing with the width issue is to
>> have the character display use a smaller font so that some of the
>> unicode display width issues can be resolved.
>>
>> Cheers, bob
>>
>>> On Jun 16, 2016, at 11:25 AM, Don Guinn wrote:
>>>
>>> You are not dealing with unicode. You have UTF8.
>>>
>>> ]s=. 7 u: 'ఝ' ,'a','ఝ' NB. s is converted to unicode.
>>> ఝaఝ
>>> $s
>>> 3
>>> <"0 s
>>> +---+-+---+
>>> |ఝ|a|ఝ|
>>> +---+-+---+
>>>
>>> But the display is still messed up, because the display first converts the
>>> unicode to UTF8 and then does a byte count to determine how many boxing
>>> characters to put around the data. There is a further problem, as many
>>> unicode/UTF8 characters beyond ASCII are proportional. Notice how wide the
>>> first and last characters are compared to the "a".
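
The byte count Don describes can be checked per character. A small sketch, assuming 8 u: is used to get the UTF8 encoding of each character (the name w is only for illustration):

```j
   w =. 7 u: 'ఝ' ,'a','ఝ'
   # w                     NB. number of characters
   #@(8 u: ,)"0 w          NB. bytes per character once encoded as UTF8
```

For this string I would expect 3 for the first line and 3 1 3 for the second, which lines up with the +---+-+---+ border in the boxed display above.
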
>>>
>>> On Thu, Jun 16, 2016 at 12:08 PM, robert therriault wrote:
>>>
>>>> I am in the process of extending some of the type and shape
>>>> visualizations that I have done in the past [0] into the realm of unicode.
>>>>
>>>> If you look through the archives of these message lists you will find
>>>> that unicode can be quite confounding, but my question is relatively simple.
>>>>
>>>> I would like to take
>>>>
>>>> [s=. 2 6 $ 'ఝ' ,'a','ఝ' NB. � results from 224 176 157 being broken across dimensions
>>>> ఝa��
>>>> �ఝa�
>>>> [encode=. a. i. s NB. shape of 2 6 refers to the encoding numbers, not the number of characters displayed
>>>> 224 176 157 97 224 176
>>>> 157 224 176 157 97 224
>>>>
>>>> and convert encode to a form where the encoding for each character is in
>>>> its own box. Of course, this would be a verb that can work with any
>>>> literal array, not just the example given.
>>>>
>>>> [r=. 2 4 $ 224 176 157 ; 97 ; 224 ; 176 ; 157 ; 224 176 157 ; 97 ; 224
>>>> ┌───────────┬───────────┬───┬───┐
>>>> │224 176 157│97 │224│176│
>>>> ├───────────┼───────────┼───┼───┤
>>>> │157 │224 176 157│97 │224│
>>>> └───────────┴───────────┴───┴───┘
>>>>
>>>> which could be converted back to
>>>>
>>>> {&a. each r
>>>> ┌───┬───┬─┬─┐
>>>> │ఝ│a │�│�│
>>>> ├───┼───┼─┼─┤
>>>> │� │ఝ│a│�│
>>>> └───┴───┴─┴─┘
>>>>
>>>> With this in place it may be possible to have the literal view of
>>>> unicode display a little more consistently.
>>>>
>>>>
>>>> Any suggestions would be welcome.
>>>>
>>>> Cheers, bob
>>>>
>>>> [0] Video of Enhanced display of literals
>>>> https://www.youtube.com/watch?v=BzjfJjGb5cs
>>>> ----------------------------------------------------------------------
>>>> For information about J forums see http://www.jsoftware.com/forums.htm