Interesting... I tried for a shorter version:
   bu=: [:(a.i.8&u:)&.>9&u:

But, of course, this rejects incomplete unicode sequences with a domain error...

Thanks,

--
Raul

On Tue, Sep 3, 2019 at 11:46 PM 'robert therriault' via Programming <[email protected]> wrote:
>
> Henry,
>
> How can I turn down such a gracious invitation? :-)
>
> Seriously, I will do my best, having bounced around the unicode and unicode4 aspects of J while putting jig together. I don't claim to be able to match the clear writing that you and Ian have done, but revision is always welcome, and that is what wikis are made for.
>
> And I do have two verbs that parse the code points which might be useful as well.
>
>    9 u: 128512
> 😀
>
> boxutf=: 3 : 0"1 NB. for literals
>  a=.a: [ t=.3 u: y
>  while. #t do.
>   select. s=. 127 191 223 239 I. {. t
>    case. (0;1) do. t=.}.t [ a=.a,< {.t
>    case. do. if. 0={:t1=.s{.t do. a=.a,<"0 t1-.0
>              elseif. 191 < >./ }.t1 do. a=.a,<"0 s{.t1 [ s=.>:@:(1 i.~ 191 < }.) t1
>              elseif. do. a=.a,< t1
>              end. t=. s }.t
>   end.
>  end.
>  }.a
> )
>    boxutf ": 9 u: 128512 NB. converted to literal
> ┌───────────────┐
> │240 159 152 128│
> └───────────────┘
>    240 159 152 128 { a.
> 😀
>    3 !: 0 [240 159 152 128 { a.
> 2 NB. type literal
>    boxutf 2{. ": 9 u: 128512 NB. incomplete code breaks into nondisplayable characters
> ┌───┬───┐
> │240│159│
> └───┴───┘
>    240 159 { a.
> ��
>
> boxuni=: 3 : 0"1 NB. for unicode and unicode4
>  a=.a: [ t=.3 u: y
>  while. #t do.
>   select. 55295 57343 I. {. t
>    case. (0;2) do. t=. }. t [ a=. a , < {. t
>    case. do. if. (56320&> +. 57343&<:) {: t1=.2 {. t do. t=. }. t [ a=.a , < {. t else. t=.2 }. t [ a=.a , < t1 end.
>   end.
>  end.
>  }.a
> )
>    boxuni_jig_ 9 u: 128512
> ┌──────┐
> │128512│
> └──────┘
>    boxuni_jig_ 7 u: 128512
> ┌───────────┐
> │55357 56832│
> └───────────┘
>    7 u: 55357 56832
> 😀
>    7 u: 55357 NB. incomplete code breaks into nondisplayable characters.
> ���
>
> The real challenge with unicode is that you can get deep into the weeds pretty fast.
>
> I'll try to come up with something in the next couple of days. Anyone's suggestions on the best way to approach this are welcome.
>
> Cheers, bob
>
> > On Sep 3, 2019, at 7:04 PM, Henry Rich <[email protected]> wrote:
> >
> > The introductory page for Unicode
> >
> > https://code.jsoftware.com/wiki/Vocabulary/UnicodeCodePoint
> >
> > does not discuss 4-byte characters, or the concept of surrogate pairs with 2-byte characters.
> >
> > 4-byte precision is called unicode4 in NuVoc. If someone would add discussion of these to the page, they would be a Hero. I'm just saying.
> >
> > Henry Rich
> >
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
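
For anyone who wants to check the numbers in the quoted examples by hand, here is a minimal sketch in plain J arithmetic. It assumes the standard UTF-16 surrogate formula and the standard 4-byte UTF-8 layout; the names cp, off, hi, lo and bytes are made up for this illustration and are not part of jig or the wiki page.

```j
NB. Illustrative names only (cp, off, hi, lo, bytes); not from jig.
cp  =: 128512                  NB. code point used in the quoted examples
off =: cp - 65536              NB. offset above the 2-byte (BMP) range
hi  =: 55296 + <. off % 1024   NB. high (leading) surrogate
lo  =: 56320 + 1024 | off      NB. low (trailing) surrogate
hi , lo                        NB. compare with boxuni_jig_ 7 u: 128512

NB. Recombine the surrogate pair into the code point.
65536 + (1024 * hi - 55296) + lo - 56320

NB. 4-byte UTF-8: subtract the marker bits, then read the payloads in base 64.
bytes =: 240 159 152 128
64 #. bytes - 240 128 128 128
```

Here hi , lo should come out as 55357 56832, matching the boxed result of 7 u: 128512, and the last two expressions should both give back 128512. The sketch does no validation, so it says nothing about how incomplete sequences ought to be handled.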
