Yeah, one of my challenges is to go back and understand the choices that I made when developing these verbs. The simple ways can decode complete sequences, but I needed something that would decode incomplete sequences consistent with the way that J did it. Once I had that working I moved on, but now I may need to go back and really understand what I was doing in that process.
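To make "incomplete" concrete, here is a minimal sketch, not the boxutf verb quoted further down in this thread, only the byte-classification step it starts from. It assumes the input is a plain list of byte values, and it does not check that the trailing bytes really are continuation bytes. The names bytes, class, complete and short are made up for illustration.

```j
NB. Sketch only: a UTF-8 lead byte announces how long its sequence should be.
bytes =: 240 159 152 128            NB. UTF-8 bytes of U+1F600, as shown in the thread
class =: 127 191 223 239 I. bytes   NB. 0 ASCII, 1 continuation, 2 3 4 lead of that length
class                               NB. 4 1 1 1
complete =: ({. class) <: # bytes   NB. 1: the announced 4 bytes are all present
short =: 2 {. bytes                 NB. truncated copy: 240 159
({. 127 191 223 239 I. short) <: # short   NB. 0: lead byte asks for 4, only 2 present
```

A decoder that has to survive truncated input needs some rule for that last case; the verbs quoted below box each stray byte on its own.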
At one point I had tried sequential machines, but I found that they were a lot slower than the explicit version that I came up with and even more opaque.

Cheers, bob

> On Sep 4, 2019, at 8:16 AM, Raul Miller <[email protected]> wrote:
>
> Interesting...
>
> I tried for a shorter version:
>
>    bu=: [:(a.i.8&u:)&.>9&u:
>
> But, of course, this rejects incomplete unicode sequences with a domain
> error...
>
> Thanks,
>
> --
> Raul
>
>
> On Tue, Sep 3, 2019 at 11:46 PM 'robert therriault' via Programming
> <[email protected]> wrote:
>>
>> Henry,
>>
>> How can I turn down such a gracious invitation. :-)
>>
>> Seriously, I will do my best, having bounced around the unicode and
>> unicode4 aspects of J while putting jig together. I don't claim to be able
>> to match the clear writing that you and Ian have done, but revision is
>> always welcome, and that is what wikis are made for.
>>
>> And I do have two verbs that parse the code points which might be useful as
>> well.
>>
>>    9 u: 128512
>>
>>
>> boxutf=: 3 : 0"1 NB. for literals
>> a=.a: [ t=.3 u: y
>> while. #t do.
>>   select. s=. 127 191 223 239 I. {. t
>>   case. (0;1) do. t=.}.t [ a=.a,< {.t
>>   case. do. if. 0={:t1=.s{.t do. a=.a,<"0 t1-.0
>>     elseif. 191 < >./ }.t1 do. a=.a,<"0 s{.t1 [ s=.>:@:(1 i.~ 191 < }.) t1
>>     elseif. do. a=.a,< t1
>>     end. t=. s }.t
>>   end.
>> end.
>> }.a
>> )
>>    boxutf ": 9 u: 128512 NB. converted to literal
>> ┌───────────────┐
>> │240 159 152 128│
>> └───────────────┘
>>    240 159 152 128 { a.
>>
>>    3 !: 0 [240 159 152 128 { a.
>> 2 NB. type literal
>>    boxutf 2{. ": 9 u: 128512 NB. incomplete code breaks into nondisplayable characters
>> ┌───┬───┐
>> │240│159│
>> └───┴───┘
>>    240 159 { a.
>> ��
>>
>> boxuni=: 3 : 0"1 NB. for unicode and unicode4
>> a=.a: [ t=.3 u: y
>> while. #t do.
>>   select. 55295 57343 I. {. t
>>   case. (0;2) do. t=. }. t [ a=. a , < {. t
>>   case. do.
>>     if. (56320&> +. 57343&<:) {: t1=.2 {. t do. t=. }. t [ a=.a , < {. t
>>     else. t=.2 }. t [ a=.a , < t1 end.
>>   end.
>> end.
>> }.a
>> )
>>    boxuni_jig_ 9 u: 128512
>> ┌──────┐
>> │128512│
>> └──────┘
>>    boxuni_jig_ 7 u: 128512
>> ┌───────────┐
>> │55357 56832│
>> └───────────┘
>>    7 u: 55357 56832
>>
>>    7 u: 55357 NB. incomplete code breaks into nondisplayable characters.
>> ���
>>
>> The real challenge with unicode is that you can get deep into the weeds
>> pretty fast.
>>
>> I'll try to come up with something in the next couple of days. Anyone's
>> suggestions on the best way to approach this are welcome.
>>
>> Cheers, bob
>>
>>> On Sep 3, 2019, at 7:04 PM, Henry Rich <[email protected]> wrote:
>>>
>>> The introductory page for Unicode
>>>
>>> https://code.jsoftware.com/wiki/Vocabulary/UnicodeCodePoint
>>>
>>> does not discuss 4-byte characters, or the concept of surrogate pairs with
>>> 2-byte characters.
>>>
>>> 4-byte precision is called unicode4 in NuVoc. If someone would add
>>> discussion of these to the page, they would be a Hero. I'm just saying.
>>>
>>> Henry Rich

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
