Re: [Jprogramming] Unicode (UTF8) string deconstruction

Don Guinn Thu, 16 Jun 2016 12:45:41 -0700

If I have to do much processing of UTF8 I convert to unicode first, do my
processing, then convert back to UTF8. That way primitives and definitions
originally intended for literal will most likely work as expected. I have
started using 3&u: instead of a.&i. for converting to numeric, as it works
for literal, UTF8 and unicode. I just find it easier to deal with a
character always counting as one thing instead of usually one, but often
two or three things.
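
A minimal sketch of that round trip, using the string from this thread. The
names utf8, wide and back are illustrative only, and the reversal merely
stands in for whatever processing is needed:

```j
utf8 =. 'ఝaఝ'           NB. literal holding UTF8 bytes
wide =. 7 u: utf8        NB. unicode: one item per character
(# utf8) , # wide        NB. 7 3  (bytes versus characters)
3 u: utf8                NB. 224 176 157 97 224 176 157  (byte values)
3 u: wide                NB. 3101 97 3101  (code points)
back =. 8 u: |. wide     NB. process per character (here: reverse), then back to UTF8
back -: utf8             NB. 1, since this particular string reads the same reversed
```

Note that 3 u: gives byte values for a literal but code points for unicode,
so the same phrase serves both representations.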


On Thu, Jun 16, 2016 at 12:52 PM, robert therriault <[email protected]>
wrote:

> Thanks Pascal,
>
> Using the original example
>
>  [s=.  2 6 $ 'ఝ' ,'a','ఝ'
> ఝa��
> �ఝa�
>
>     8 <@(a.i.u:)("0) 7 u: s  NB. Arrays need to be dealt with as rank 1
> |rank error
> |   8<@(a.i.u:)("0)7     u:s
>     8 <@(a.i.u:)("0) 7 u:"1 s  NB. Issues still arise with the partial encodings
> |domain error
> |   8<@(a.i.u:)("0)7     u:"1 s
>     8 <@(a.i.u:)("0) 7 u:"1 {. {: s  NB. Issue with the non valid encoding that J displays as �
> |domain error
> |   8<@(a.i.u:)("0)7     u:"1{.{:s
>
> I think that the challenge is the partial encodings. The J IDE displays
> these, but the 7 u: gives errors and even using :: for error exceptions I
> haven't found a nice way around the issues.
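>
> One possible way around the 7 u: errors is to avoid decoding altogether
> and cut the byte values wherever a new character can start. This is only
> a sketch under stated assumptions: boxbytes is a hypothetical name, rows
> are assumed to be non-empty, and a sequence truncated at the end of a row
> (such as 224 176) stays together in one box rather than being split into
> single bytes.
>
> ```j
> boxbytes =: 3 : 0
>   b =. a. i. y                     NB. byte values 0..255 of a literal list
>   cont =. (b >: 128) *. b < 192    NB. 1 at UTF8 continuation bytes (10xxxxxx)
>   start =. 1 (0}) -. cont          NB. a box starts at each non-continuation byte, and at item 0
>   start <;.1 b                     NB. cut b at the marked starts, boxing each piece
> )
>
> boxbytes"1 s                       NB. apply to each row of the 2 6 array s
> ```
>
> For the 2 6 example the first row yields the boxes 224 176 157, 97 and
> 224 176 (padded with an empty box), and the second row yields 157,
> 224 176 157, 97 and 224.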
>
> Cheers, bob
>
> > On Jun 16, 2016, at 11:37 AM, 'Pascal Jasmin' via Programming <
> [email protected]> wrote:
> >
> > 8 <@(a.i.u:)("0) 7 u: 'ఝ' ,'a','ఝ'
> > ┌───────────┬──┬───────────┐
> > │224 176 157│97│224 176 157│
> > └───────────┴──┴───────────┘
> >
> >
> >
> >
> > ----- Original Message -----
> > From: robert therriault <[email protected]>
> > To: [email protected]
> > Sent: Thursday, June 16, 2016 2:33 PM
> > Subject: Re: [Jprogramming] Unicode (UTF8) string deconstruction
> >
> > You are quite right Don,
> >
> > I should change the request to displaying unicode in UTF8 I suppose.
> > Converting to unicode as you have done also allows manipulation of
> > characters within arrays, but I am looking for ways to show the results
> > when reshaping breaks the UTF8 representation.
> >
> > Do you have a way to take a literal array in UTF8 and box the encodings
> > for each character?
> >
> > I have seen your posts in the past and they have helped as I work
> > through this process. Thank you.
> >
> > One of the ways that I am looking at dealing with the width issue is to
> > have the character display render in a smaller font so that some of the
> > unicode display width issues can be resolved.
> >
> > Cheers, bob
> >
> >> On Jun 16, 2016, at 11:25 AM, Don Guinn <[email protected]> wrote:
> >>
> >> You are not dealing with unicode. You have UTF8.
> >>
> >>  ]s=.  7 u: 'ఝ' ,'a','ఝ' NB. s is converted to unicode.
> >>
> >> ఝaఝ
> >>
> >>     $s
> >>
> >> 3
> >>
> >>  <"0 s
> >>
> >> +---+-+---+
> >>
> >> |ఝ|a|ఝ|
> >>
> >> +---+-+---+
> >>
> >>
> >> But the display is still messed up because it first converts the
> >> unicode to UTF8, then does a byte count to determine how many boxing
> >> characters to put around the data. There is a further problem: many
> >> unicode/UTF8 characters beyond ASCII are proportional rather than
> >> fixed-width. Notice how wide the first and last characters are compared
> >> to the "a".
> >>
> >> On Thu, Jun 16, 2016 at 12:08 PM, robert therriault <
> [email protected]>
> >> wrote:
> >>
> >>> I am in the process of extending some of the type and shape
> >>> visualizations that I have done in the past [0] into the realm of
> >>> unicode.
> >>>
> >>> If you look through the archives of these message lists you will find
> >>> that unicode can be quite confounding, but my question is relatively
> >>> simple.
> >>>
> >>> I would like to take
> >>>
> >>>   [s=.  2 6 $ 'ఝ' ,'a','ఝ'  NB. � results from 224 176 157 being broken across dimensions
> >>> ఝa��
> >>> �ఝa�
> >>>  [encode=. a. i. s       NB. shape of 2 6 refers to the encoding numbers, not the number of characters displayed
> >>> 224 176 157  97 224 176
> >>> 157 224 176 157  97 224
> >>>
> >>> and convert encode to a form where the encoding for each character is
> >>> in its own box. Of course, this would be a verb that can work with any
> >>> literal array, not just the example given.
> >>>
> >>> [r=. 2 4 $ 224 176 157 ; 97 ; 224 ; 176 ; 157 ; 224 176 157 ; 97 ; 224
> >>> ┌───────────┬───────────┬───┬───┐
> >>> │224 176 157│97         │224│176│
> >>> ├───────────┼───────────┼───┼───┤
> >>> │157        │224 176 157│97 │224│
> >>> └───────────┴───────────┴───┴───┘
> >>>
> >>> which could be converted back to
> >>>
> >>>   {&a.  each r
> >>> ┌───┬───┬─┬─┐
> >>> │ఝ│a  │�│�│
> >>> ├───┼───┼─┼─┤
> >>> │�  │ఝ│a│�│
> >>> └───┴───┴─┴─┘
> >>>
> >>> With this in place it may be possible to have the literal view of
> >>> unicode display a little more consistently.
> >>>
> >>>
> >>> Any suggestions would be welcome.
> >>>
> >>> Cheers, bob
> >>>
> >>> [0] Video of Enhanced display of literals
> >>> https://www.youtube.com/watch?v=BzjfJjGb5cs
> >>> ----------------------------------------------------------------------
> >>> For information about J forums see http://www.jsoftware.com/forums.htm
> >
