Re: [Jprogramming] Unicode (UTF8) string deconstruction

bill lam Fri, 17 Jun 2016 17:14:06 -0700

But your s contains illegal utf8 characters.

isutf8=: 1:@(7&u:) ::0:


   isutf8 'ఝ' ,'a','ఝ'
1
   isutf8"1[ 8 6$ 'ఝ' ,'a','ఝ'
0 0 0 1 0 0 0 0

   isutf8"1[ 8 7$ 'ఝ' ,'a','ఝ'
1 1 1 1 1 1 1 1

Since the 3-wide-character string is 7 bytes in utf8
  a.i.'ఝ' ,'a','ఝ'
224 176 157 97 224 176 157
8 6 $ .... is not what you would expect. Perhaps you meant

   [s=: 8 6 $ 7 u: 'ఝ' ,'a','ఝ'
ఝaఝఝaఝ
ఝaఝఝaఝ
ఝaఝఝaఝ
ఝaఝఝaఝ
ఝaఝఝaఝ
ఝaఝఝaఝ
ఝaఝఝaఝ
ఝaఝఝaఝ
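
For rows that are valid utf8, the per-character boxing asked about further
down the thread can be sketched by cutting the byte values at every byte
that is not a continuation byte (128 to 191). This is only a minimal sketch,
not anyone's posted solution: the names lead and boxbytes are made up for
illustration, nothing is validated, and any orphan continuation bytes at the
start of a broken row are silently dropped by ;.1

   lead=: (128&>) +. (192&<:)        NB. hypothetical name: 1 where a byte starts a character
   boxbytes=: (<;.1~ lead)@(a.&i.)   NB. hypothetical name: box each character's byte values
   (boxbytes 'ఝaఝ') -: 224 176 157 ; (,97) ; 224 176 157
1

On a wide character array produced by 7 u: no such step is needed, since
<"0 already puts one character in each box.
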
On Jun 18, 2016 7:30 AM, "robert therriault" <[email protected]> wrote:

> Thanks for all the suggestions everyone.
>
> In the end I took a more explicit approach than I normally would, but it
> seems to work.
>
> I am not sure if this is useful for Henry, but it is one approach.
>
>     [s=.  8 6 $ 'ఝ' ,'a','ఝ'
> ఝa��
> �ఝa�
> ��ఝa
> ఝఝ
> aఝ��
> �aఝ�
> ��aఝ
> ఝa��
>    boxutf  s
> ┌───────────┬───────────┬───────────┬───────────┐
> │224 176 157│97         │224        │176        │
> ├───────────┼───────────┼───────────┼───────────┤
> │157        │224 176 157│97         │224        │
> ├───────────┼───────────┼───────────┼───────────┤
> │176        │157        │224 176 157│97         │
> ├───────────┼───────────┼───────────┼───────────┤
> │224 176 157│224 176 157│           │           │
> ├───────────┼───────────┼───────────┼───────────┤
> │97         │224 176 157│224        │176        │
> ├───────────┼───────────┼───────────┼───────────┤
> │157        │97         │224 176 157│224        │
> ├───────────┼───────────┼───────────┼───────────┤
> │176        │157        │97         │224 176 157│
> ├───────────┼───────────┼───────────┼───────────┤
> │224 176 157│97         │224        │176        │
> └───────────┴───────────┴───────────┴───────────┘
>    {&a. each boxutf  s
> ┌───┬───┬───┬───┐
> │ఝ│a  │�  │�  │
> ├───┼───┼───┼───┤
> │�  │ఝ│a  │�  │
> ├───┼───┼───┼───┤
> │�  │�  │ఝ│a  │
> ├───┼───┼───┼───┤
> │ఝ│ఝ│   │   │
> ├───┼───┼───┼───┤
> │a  │ఝ│�  │�  │
> ├───┼───┼───┼───┤
> │�  │a  │ఝ│�  │
> ├───┼───┼───┼───┤
> │�  │�  │a  │ఝ│
> ├───┼───┼───┼───┤
> │ఝ│a  │�  │�  │
> └───┴───┴───┴───┘
>    boxutf
> }:@utf@(3&u:)@":
>    utf
> 3 : 0"1
> if. y-:'' do. return. end.
> try.  ((utf@:((1<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (1<.#) {. ]))) y
>    catch. try. ((utf@:((2<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (2<.#) {. ]))) y
>      catch. try. ((utf@:((3<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (3<.#) {. ]))) y
>        catch. try.  ((utf@:((4<.#)}.]));~((3 u: ":)@: (7 u: a.{~ (4<.#) {. ]))) y
>                        catch. ({. ; utf@}.) y
>                        end.
>        end.
>      end.
>    end.
> )
>
> Row by row I am just grabbing up to 4 UTF8 numbers and boxing them.
> Whenever the numbers are valid I box them and move on with the remaining
> part of the row.
>
> I am sure others will find a more elegant approach, but this seems to work.
>
> Cheers, bob
>
> > On Jun 16, 2016, at 4:27 PM, bill lam <[email protected]> wrote:
> >
> > The internal representation of a utf8 array is no different from a regular
> > character array; utf8 only applies to the external interface. If you want to
> > manipulate unicode within j, you should use the wide character data type
> > (131072) as suggested by Don.
> > On Jun 17, 2016 2:33 AM, "robert therriault" <[email protected]> wrote:
> >
> >> You are quite right Don,
> >>
> >> I should change the request to displaying unicode in UTF8 I suppose.
> >> Converting to unicode as you have done also allows manipulation of
> >> characters within arrays, but I am looking for ways to show the results when
> >> reshaping breaks UTF8 representation.
> >>
> >> Do you have a way to take a literal array in UTF8 and box the encodings
> >> for each character?
> >>
> >> I have seen your posts in the past and they have helped as I work through
> >> this process. Thank you.
> >>
> >> One of the ways that I am looking at dealing with the width issue is to
> >> have the character display use a smaller font so that some of the
> >> unicode display width issues can be resolved.
> >>
> >> Cheers, bob
> >>
> >>> On Jun 16, 2016, at 11:25 AM, Don Guinn <[email protected]> wrote:
> >>>
> >>> You are not dealing with unicode. You have UTF8.
> >>>
> >>>  ]s=.  7 u: 'ఝ' ,'a','ఝ' NB. s is converted to unicode.
> >>> ఝaఝ
> >>>     $s
> >>> 3
> >>>  <"0 s
> >>> +---+-+---+
> >>> |ఝ|a|ఝ|
> >>> +---+-+---+
> >>>
> >>>
> >>> But the display is still messed up because the display first converts the
> >>> unicode to UTF8, then does a byte count to determine how many boxing
> >>> characters to put around the data. But there is still a problem as many
> >>> unicode/UTF8 characters beyond ASCII are proportional. Notice how wide the
> >>> first and last characters are compared to the "a".
> >>>
> >>> On Thu, Jun 16, 2016 at 12:08 PM, robert therriault <[email protected]> wrote:
> >>>
> >>>> I am in the process of extending some of the type and shape visualizations
> >>>> that I have done in the past [0] into the realm of unicode.
> >>>>
> >>>> If you look through the archives of these message lists you will find that
> >>>> unicode can be quite confounding, but my question is relatively simple.
> >>>>
> >>>> I would like to take
> >>>>
> >>>>   [s=.  2 6 $ 'ఝ' ,'a','ఝ'  NB. � results from 224 176 157 being broken across dimensions
> >>>> ఝa��
> >>>> �ఝa�
> >>>>  [encode=. a. i. s       NB. shape of 2 6 refers to the encoding numbers, not the number of characters displayed
> >>>> 224 176 157  97 224 176
> >>>> 157 224 176 157  97 224
> >>>>
> >>>> and convert encode to a form where the encoding for each character is in
> >>>> its own box. Of course, this would be a verb that can work with any
> >>>> literal array, not just the example given.
> >>>>
> >>>> [r=. 2 4 $ 224 176 157 ; 97 ; 224 ; 176 ; 157 ; 224 176 157 ; 97 ; 224
> >>>> ┌───────────┬───────────┬───┬───┐
> >>>> │224 176 157│97         │224│176│
> >>>> ├───────────┼───────────┼───┼───┤
> >>>> │157        │224 176 157│97 │224│
> >>>> └───────────┴───────────┴───┴───┘
> >>>>
> >>>> which could be converted back to
> >>>>
> >>>>   {&a.  each r
> >>>> ┌───┬───┬─┬─┐
> >>>> │ఝ│a  │�│�│
> >>>> ├───┼───┼─┼─┤
> >>>> │�  │ఝ│a│�│
> >>>> └───┴───┴─┴─┘
> >>>>
> >>>> With this in place it may be possible to have the literal view of unicode
> >>>> display a little more consistently.
> >>>>
> >>>>
> >>>> Any suggestions would be welcome.
> >>>>
> >>>> Cheers, bob
> >>>>
> >>>> [0] Video of Enhanced display of literals
> >>>> https://www.youtube.com/watch?v=BzjfJjGb5cs
> >>>> ----------------------------------------------------------------------
> >>>> For information about J forums see http://www.jsoftware.com/forums.htm
