I see. But J can only handle Unicode in the BMP, i.e. code points below
65536, which take at most 3 bytes in UTF-8.
   u: 65536
|index error
|       u:65536
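
The 3-byte limit follows from the UTF-8 encoding ranges. A minimal sketch, assuming a helper name utf8len that is mine and not a J primitive, which computes the encoded length from the code point alone:

```j
NB. utf8len is a hypothetical helper, not part of J.
NB. UTF-8 needs 1 byte below 128, 2 below 2048, 3 below 65536, and 4 above that.
utf8len =: 3 : '1 + +/ y >: 128 2048 65536'"0

NB. 'a', 'ఝ' (code point 3101), the last BMP code point, the first one beyond it
utf8len 97 3101 65535 65536   NB. gives 1 3 3 4
```

So anything below 65536 fits in 3 bytes, and 4-byte sequences only start at 65536.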

Also, the display width of a Unicode character can vary from 0 to 2 columns.

Fri, 17 Jun 2016, robert therriault wrote:
> Yes, there are certainly illegal UTF-8 characters in 8 6$ 'ఝ' ,'a','ఝ', but what
> I am attempting is to reveal the illegal characters for what they are, along
> the lines of the shape and type display that I had used incorporating SVG.
> Once I have that information in a format where I can separate the illegal
> characters from the legal ones and allow a viewer to see the information by
> hovering over the character, the reasons for 8 6$ 'ఝ' ,'a','ఝ' looking the
> way that it does on the J display become more apparent.
> 
> Also, being able to distinguish between the 1, 2, 3, and 4 byte UTF-8
> representations may allow a bit more consistency in the way that the boxed
> versions of these characters display.
> 
> It remains to be seen how far I get with this, but the ability to show the
> representation framework of a UTF-8 array is a step. :-)
> 
> Cheers, bob
> 
> > On Jun 17, 2016, at 5:13 PM, bill lam <[email protected]> wrote:
> > 
> > But your s contains illegal utf8 characters.
> > 
> > isutf8=: 1:@(7&u:) ::0:
> > 
> >   isutf8 'ఝ' ,'a','ఝ'
> > 1
> >   isutf8"1[ 8 6$ 'ఝ' ,'a','ఝ'
> > 0 0 0 1 0 0 0 0
> > 
> >   isutf8"1[ 8 7$ 'ఝ' ,'a','ఝ'
> > 1 1 1 1 1 1 1 1
> > 
> > Since the string of 3 wide characters takes 7 bytes in UTF-8,
> >  a.i.'ఝ' ,'a','ఝ'
> > 224 176 157 97 224 176 157
> > 8 6 $ .... is not what you would expect. Perhaps you meant
> > 
> >   [s=: 8 6 $ 7 u: 'ఝ' ,'a','ఝ'
> > ఝaఝఝaఝ
> > ఝaఝఝaఝ
> > ఝaఝఝaఝ
> > ఝaఝఝaఝ
> > ఝaఝఝaఝ
> > ఝaఝఝaఝ
> > ఝaఝఝaఝ
> > ఝaఝఝaఝ
> > On Jun 18, 2016 7:30 AM, "robert therriault" <[email protected]> wrote:
> > 
> >> Thanks for all the suggestions everyone.
> >> 
> >> In the end I took a more explicit approach than I normally would, but it
> >> seems to work.
> >> 
> >> I am not sure if this is useful for Henry, but it is one approach.
> >> 
> >>    [s=.  8 6 $ 'ఝ' ,'a','ఝ'
> >> ఝa��
> >> �ఝa�
> >> ��ఝa
> >> ఝఝ
> >> aఝ��
> >> �aఝ�
> >> ��aఝ
> >> ఝa��
> >>   boxutf  s
> >> ┌───────────┬───────────┬───────────┬───────────┐
> >> │224 176 157│97         │224        │176        │
> >> ├───────────┼───────────┼───────────┼───────────┤
> >> │157        │224 176 157│97         │224        │
> >> ├───────────┼───────────┼───────────┼───────────┤
> >> │176        │157        │224 176 157│97         │
> >> ├───────────┼───────────┼───────────┼───────────┤
> >> │224 176 157│224 176 157│           │           │
> >> ├───────────┼───────────┼───────────┼───────────┤
> >> │97         │224 176 157│224        │176        │
> >> ├───────────┼───────────┼───────────┼───────────┤
> >> │157        │97         │224 176 157│224        │
> >> ├───────────┼───────────┼───────────┼───────────┤
> >> │176        │157        │97         │224 176 157│
> >> ├───────────┼───────────┼───────────┼───────────┤
> >> │224 176 157│97         │224        │176        │
> >> └───────────┴───────────┴───────────┴───────────┘
> >>   {&a. each boxutf  s
> >> ┌───┬───┬───┬───┐
> >> │ఝ│a  │�  │�  │
> >> ├───┼───┼───┼───┤
> >> │�  │ఝ│a  │�  │
> >> ├───┼───┼───┼───┤
> >> │�  │�  │ఝ│a  │
> >> ├───┼───┼───┼───┤
> >> │ఝ│ఝ│   │   │
> >> ├───┼───┼───┼───┤
> >> │a  │ఝ│�  │�  │
> >> ├───┼───┼───┼───┤
> >> │�  │a  │ఝ│�  │
> >> ├───┼───┼───┼───┤
> >> │�  │�  │a  │ఝ│
> >> ├───┼───┼───┼───┤
> >> │ఝ│a  │�  │�  │
> >> └───┴───┴───┴───┘
> >>   boxutf
> >> }:@utf@(3&u:)@":
> >>   utf
> >> 3 : 0"1
> >> if. y-:'' do. return. end.
> >> try.  ((utf@:((1<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (1<.#) {. ]))) y
> >>   catch. try. ((utf@:((2<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (2<.#) {. ]))) y
> >>     catch. try. ((utf@:((3<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (3<.#) {. ]))) y
> >>       catch. try.  ((utf@:((4<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (4<.#) {. ]))) y
> >>                       catch. ({. ; utf@}.) y
> >>                       end.
> >>       end.
> >>     end.
> >>   end.
> >> )
> >> 
> >> Row by row I am just grabbing up to 4 UTF8 numbers and boxing them.
> >> Whenever the numbers are valid I box them and move on with the remaining
> >> part of the row.
> >> 
> >> I am sure others will find a more elegant approach, but this seems to work.
> >> 
> >> Cheers, bob
> >> 
> >>> On Jun 16, 2016, at 4:27 PM, bill lam <[email protected]> wrote:
> >>> 
> >>> The internal representation of a UTF-8 array is no different from a regular
> >>> character array; UTF-8 only applies to the external interface. If you want to
> >>> manipulate Unicode within J, you should use the wide character data type
> >>> (131072) as suggested by Don.
> >>> On Jun 17, 2016 2:33 AM, "robert therriault" <[email protected]> wrote:
> >>> 
> >>>> You are quite right Don,
> >>>> 
> >>>> I should change the request to displaying Unicode in UTF-8, I suppose.
> >>>> Converting to Unicode as you have done also allows manipulation of
> >>>> characters within arrays, but I am looking for ways to show the results when
> >>>> reshaping breaks the UTF-8 representation.
> >>>> 
> >>>> Do you have a way to take a literal array in UTF-8 and box the encodings
> >>>> for each character?
> >>>> 
> >>>> I have seen your posts in the past and they have helped as I work through
> >>>> this process. Thank you.
> >>>> 
> >>>> One of the ways that I am looking at dealing with the width issue is to
> >>>> have the characters display in a smaller font so that some of the
> >>>> Unicode display width issues can be resolved.
> >>>> 
> >>>> Cheers, bob
> >>>> 
> >>>>> On Jun 16, 2016, at 11:25 AM, Don Guinn <[email protected]> wrote:
> >>>>> 
> >>>>> You are not dealing with Unicode. You have UTF-8.
> >>>>> 
> >>>>>    ]s=.  7 u: 'ఝ' ,'a','ఝ' NB. s is converted to unicode.
> >>>>> ఝaఝ
> >>>>>    $s
> >>>>> 3
> >>>>>    <"0 s
> >>>>> +---+-+---+
> >>>>> |ఝ|a|ఝ|
> >>>>> +---+-+---+
> >>>>> 
> >>>>> But the display is still messed up because the display first converts the
> >>>>> Unicode to UTF-8, then does a byte count to determine how many boxing
> >>>>> characters to put around the data. There is a further problem, as many
> >>>>> Unicode/UTF-8 characters beyond ASCII are proportional. Notice how wide the
> >>>>> first and last characters are compared to the "a".
> >>>>> 
> >>>>> On Thu, Jun 16, 2016 at 12:08 PM, robert therriault <[email protected]> wrote:
> >>>>> 
> >>>>>> I am in the process of extending some of the type and shape visualizations
> >>>>>> that I have done in the past [0] into the realm of Unicode.
> >>>>>> 
> >>>>>> If you look through the archives of these message lists you will find that
> >>>>>> Unicode can be quite confounding, but my question is relatively simple.
> >>>>>> 
> >>>>>> I would like to take
> >>>>>> 
> >>>>>>  [s=.  2 6 $ 'ఝ' ,'a','ఝ'  NB. � results from 224 176 157 being broken across dimensions
> >>>>>> ఝa��
> >>>>>> �ఝa�
> >>>>>> [encode=. a. i. s       NB. shape of 2 6 refers to the encoding numbers, not the number of characters displayed
> >>>>>> 224 176 157  97 224 176
> >>>>>> 157 224 176 157  97 224
> >>>>>> 
> >>>>>> and convert encode to a form where the encoding for each character is in
> >>>>>> its own box. Of course, this would be a verb that can work with any
> >>>>>> literal array, not just the example given.
> >>>>>> 
> >>>>>> [r=. 2 4 $ 224 176 157 ; 97 ; 224 ; 176 ; 157 ; 224 176 157 ; 97 ; 224
> >>>>>> ┌───────────┬───────────┬───┬───┐
> >>>>>> │224 176 157│97         │224│176│
> >>>>>> ├───────────┼───────────┼───┼───┤
> >>>>>> │157        │224 176 157│97 │224│
> >>>>>> └───────────┴───────────┴───┴───┘
> >>>>>> 
> >>>>>> which could be converted back to
> >>>>>> 
> >>>>>>  {&a.  each r
> >>>>>> ┌───┬───┬─┬─┐
> >>>>>> │ఝ│a  │�│�│
> >>>>>> ├───┼───┼─┼─┤
> >>>>>> │�  │ఝ│a│�│
> >>>>>> └───┴───┴─┴─┘
> >>>>>> 
> >>>>>> With this in place it may be possible to have the literal view of Unicode
> >>>>>> display a little more consistently.
> >>>>>> 
> >>>>>> 
> >>>>>> Any suggestions would be welcome.
> >>>>>> 
> >>>>>> Cheers, bob
> >>>>>> 
> >>>>>> [0] Video of Enhanced display of literals
> >>>>>> https://www.youtube.com/watch?v=BzjfJjGb5cs
> >>>>>> ----------------------------------------------------------------------
> >>>>>> For information about J forums see http://www.jsoftware.com/forums.htm

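The row-by-row grouping bob describes can also be approximated without try./catch. by cutting each row before every byte that is not a UTF-8 continuation byte (128 to 191). This is only a sketch: iscont and boxbytes are names I made up, it does not check validity the way boxutf does, so a truncated sequence such as 224 176 stays together in one box, and it assumes non-empty rows.

```j
NB. iscont and boxbytes are hypothetical names for this sketch.
NB. A continuation byte has the bit pattern 10xxxxxx, i.e. a value in 128..191.
iscont   =: 3 : '(127 < b) *. 192 > b =. a. i. y'

NB. Per row: start a new box at each non-continuation byte. The first
NB. byte always starts a box so that leading orphan bytes are not dropped.
boxbytes =: 3 : '(1 (0}) -. iscont y) <;.1 a. i. y'"1

boxbytes 2 6 $ 'ఝ' ,'a','ఝ'
```

Rows that split into fewer boxes are padded with empty boxes, as in the boxutf output quoted above.
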
-- 
regards,
====================================================
GPG key 1024D/4434BAB3 2008-08-24
gpg --keyserver subkeys.pgp.net --recv-keys 4434BAB3
gpg --keyserver subkeys.pgp.net --armor --export 4434BAB3