Interesting...

I tried for a shorter version:

   bu=: [:(a.i.8&u:)&.>9&u:

But, of course, this rejects incomplete unicode sequences with a domain error...
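
For comparison, here is a minimal sketch of the tolerant behaviour, written in Python rather than J and not taken from anyone's code in this thread. It groups UTF-8 bytes into one box per code point and, when a sequence is truncated or interrupted, gives the stray byte its own box instead of signalling an error. The name box_utf8 is made up for illustration, and the check looks only at lead/continuation byte structure, so overlong or out-of-range encodings are not detected.

```python
def box_utf8(data: bytes):
    """Group UTF-8 bytes into one list per code point.

    Bytes that do not start a complete, structurally valid sequence
    are emitted one per box rather than raising an error.
    """
    boxes = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Expected sequence length, judged from the lead byte alone.
        if lead < 0x80:
            n = 1                      # ASCII
        elif 0xC0 <= lead < 0xE0:
            n = 2
        elif 0xE0 <= lead < 0xF0:
            n = 3
        elif 0xF0 <= lead < 0xF8:
            n = 4
        else:
            n = 1                      # stray continuation or invalid lead byte
        chunk = data[i:i + n]
        # Complete only if all n bytes are present and every byte after
        # the lead is a continuation byte (0x80..0xBF).
        complete = len(chunk) == n and all(0x80 <= b < 0xC0 for b in chunk[1:])
        if complete:
            boxes.append(list(chunk))
            i += n
        else:
            boxes.append([lead])       # box the offending byte on its own
            i += 1
    return boxes


if __name__ == "__main__":
    print(box_utf8(chr(128512).encode("utf-8")))   # full 4-byte sequence
    print(box_utf8(bytes([240, 159])))             # truncated sequence
```

With the complete four-byte encoding of code point 128512 this yields a single box, and with only the first two bytes it yields two one-byte boxes, which is the same shape as the boxutf results quoted below.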

Thanks,

-- 
Raul


On Tue, Sep 3, 2019 at 11:46 PM 'robert therriault' via Programming
<[email protected]> wrote:
>
> Henry,
>
> How can I turn down such a gracious invitation? :-)
>
> Seriously, I will do my best, having bounced around the unicode and unicode4
> aspects of J while putting jig together. I don't claim to be able to match the
> clear writing that you and Ian have done, but revision is always welcome and
> that is what wikis are made for.
>
> And I do have two verbs that parse the code points which might be useful as 
> well.
>
>     9 u: 128512
> 😀
>
> boxutf=: 3 : 0"1  NB. for literals
> a=.a: [ t=.3 u: y
> while. #t do.
>  select. s=. 127 191 223 239 I. {. t
>   case. (0;1) do. t=.}.t [ a=.a,< {.t
>   case.       do. if. 0={:t1=.s{.t do. a=.a,<"0 t1-.0
>                   elseif. 191 < >./ }.t1 do. a=.a,<"0 s{.t1 [ s=.>:@:(1 i.~ 191 < }.) t1
>                   elseif.                do. a=.a,< t1   end. t=. s }.t
>  end.
> end.
> }.a
> )
>     boxutf ": 9 u: 128512  NB. converted to literal
> ┌───────────────┐
> │240 159 152 128│
> └───────────────┘
>    240 159 152 128 { a.
> 😀
>
>    3 !: 0 [240 159 152 128 { a.
> 2  NB. type literal
>     boxutf 2{. ": 9 u: 128512  NB. incomplete code breaks into nondisplayable characters
> ┌───┬───┐
> │240│159│
> └───┴───┘
>     240 159 { a.
> ��
>
> boxuni=: 3 : 0"1  NB. for unicode and unicode4
> a=.a: [ t=.3 u: y
> while. #t do.
>  select.  55295 57343 I. {. t
>   case. (0;2) do. t=. }. t [ a=. a , < {. t
>   case.       do. if. (56320&> +. 57343&<) {: t1=.2 {. t  do. t=.  }. t [ a=.a , < {. t else. t=.2 }. t [ a=.a , < t1 end.
>  end.
> end.
> }.a
> )
>     boxuni_jig_  9 u: 128512
> ┌──────┐
> │128512│
> └──────┘
>    boxuni_jig_  7 u: 128512
> ┌───────────┐
> │55357 56832│
> └───────────┘
>     7 u: 55357 56832
> 😀
>
>     7 u: 55357   NB. incomplete code breaks into nondisplayable characters.
> ���
>
> The real challenge with unicode is that you can get deep into the weeds
> pretty fast.
>
> I'll try to come up with something in the next couple of days. Anyone's 
> suggestions on the best way to approach this are welcome.
>
> Cheers, bob
>
> > On Sep 3, 2019, at 7:04 PM, Henry Rich <[email protected]> wrote:
> >
> > The introductory page for Unicode
> >
> > https://code.jsoftware.com/wiki/Vocabulary/UnicodeCodePoint
> >
> > does not discuss 4-byte characters, or the concept of surrogate pairs with 
> > 2-byte characters.
> >
> > 4-byte precision is called unicode4 in NuVoc.  If someone would add 
> > discussion of these to the page, they would be a Hero.  I'm just saying.
> >
> > Henry Rich
> >
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
>