Yes, I was surprised (and pleased) that the text file came through ok. As for throwing errors on malformed utf-8 sequences, here's how that could be implemented:
(1) We introduce an "error row" which uses operation 2 for all character classes (and, for consistency, identifies the next row as itself -- this is mostly so that improper use of the error row would be relatively obvious). (Operation 2 emits a token, but the important thing here is that it throws an error if j=-1.)

(2) For each row which is part of a partially complete utf-8 character, any appearance of any character class which is not a utf-8 suffix character would use operation 3 and would identify the next row as the error row. (Operation 3 emits a token, but the important thing here is that it sets j=-1.)

(3) Each of the different utf-8 prefixes would lead to a different row in the state table. For example, the character class containing character 224 would get a beginning of token row (which gets the error row treatment and) which leads to a row that expects a utf-8 suffix (which gets the error row treatment) followed by a second utf-8 suffix (which follows the pattern set by row 1 of the state table).

Hopefully that makes sense.

Note that the final token in a string could not trigger an error. That's a limitation of the engine and corresponds approximately to how utf-8 must be treated when a low level buffer boundary splits a utf-8 character.

Anyways, the point is that the sequential machine can support the sort of "count a small number of steps" which is needed here. The difficulty is more that the machine stops when it reaches the end of the string. If that's a sensitivity, this could be handled by appending a linefeed character to the end of the string before processing and then removing a final linefeed character from the last token after tokenization.

Again, I hope this makes sense...

Thanks,

--
Raul

On Sat, Nov 21, 2020 at 9:22 AM Don Guinn <[email protected]> wrote:
>
> The text file worked great!
>
> As to the UTF-8 codes. What is important is to avoid splitting the start
> bytes from the continuation bytes.
> Validating the UTF-8 codes is a difficult task. The start byte includes the
> number of continuation bytes to follow. It would be an error if the number
> of continuation bytes didn't agree.
>
> I tried my test on your definitions and it failed. Attached is a text file
> with your definitions and my test following.
>
> Well, I can't view the attachment. I don't know if it's there or not. Just
> in case, here is my test.
>
>
> NB. A noun to show the handling of UTF-8 in ;:
> test=:{{)n
> The symbol for the Euro is ₠
> Other symools like π show up also
> How about ⌹ in APL NB. ⌹
> Common expressions like 'H₂O' for water
> Common expressions like H₂O for water
> }}
>
> NB. How ;: this sj and mj handles it
> ,.<;.2(0;sj;mj);:test
>
> NB. Assigning UTF8 as character
> mj=: 2 (128+i.128)}mj
>
> NB. How UTF-8 is now handled
> ,.<;.2(0;sj;mj);:test
>
> On Fri, Nov 20, 2020 at 2:46 PM Raul Miller <[email protected]> wrote:
>
> > Here's an updated version, which also retains utf-8 character
> > sequences within token boundaries (instead of splitting them up into
> > multiple tokens). I had originally posted this to the jbeta forum, but
> > it's really a programming topic, and probably belongs here.
> >
> > mj=: 256$0                     NB. X other
> > mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
> > mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
> > mj=: 3 (a.i.'N')}mj            NB. N the letter N
> > mj=: 4 (a.i.'B')}mj            NB. B the letter B
> > mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
> > mj=: 6 (a.i.'.')}mj            NB. . the decimal point
> > mj=: 7 (a.i.':')}mj            NB. : the colon
> > mj=: 8 (a.i.'''')}mj           NB. Q quote
> > mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
> > mj=:10 (10)} mj                NB. LF
> > mj=:11 (a.i.'}')}mj            NB. } the right curly brace
> > mj=:12 (192+i.64)}mj           NB. U utf-8 octet prefix
> > mj=:13 (128+i.64)}mj           NB. V utf-8 octet suffix
> >
> > sj=: 0 10#:10*}.".;._2(0 :0)
> > ' X    S    A    N    B    9    .    :    Q    {    LF   }    U    V']0
> >   1.1  0.0  2.1  3.1  2.1  6.1  1.1  1.1  7.1  11.1 10.1 12.1 15.1 16.1 NB. 0 space
> >   1.2  0.3  2.2  3.2  2.2  6.2  1.0  1.0  7.2  11.2 10.2 12.2 15.2 16.2 NB. 1 other
> >   1.2  0.3  2.0  2.0  2.0  2.0  1.0  1.0  7.2  11.2 10.2 12.2 15.2 16.2 NB. 2 alp/num
> >   1.2  0.3  2.0  2.0  4.0  2.0  1.0  1.0  7.2  11.2 10.2 12.2 15.2 16.2 NB. 3 N
> >   1.2  0.3  2.0  2.0  2.0  2.0  5.0  1.0  7.2  11.2 10.2 12.2 15.2 16.2 NB. 4 NB
> >   9.0  9.0  9.0  9.0  9.0  9.0  1.0  1.0  9.0  9.0  10.2 9.0  9.0  9.0  NB. 5 NB.
> >   1.4  0.5  6.0  6.0  6.0  6.0  6.0  1.0  7.4  11.4 10.2 12.4 15.2 16.2 NB. 6 num
> >   7.0  7.0  7.0  7.0  7.0  7.0  7.0  7.0  8.0  7.0  7.0  7.0  7.0  7.0  NB. 7 '
> >   1.2  0.3  2.2  3.2  2.2  6.2  1.2  1.2  7.0  11.2 10.2 12.2 15.2 16.2 NB. 8 ''
> >   9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0  10.2 9.0  9.0  9.0  NB. 9 comment
> >   1.2  0.2  2.2  3.2  2.2  6.2  1.2  1.2  7.2  11.2 10.2 12.2 15.2 16.2 NB. 10 LF
> >   1.2  0.3  2.2  3.2  2.2  6.2  1.0  1.0  7.2  13.0 10.2 1.2  15.2 16.2 NB. 11 {
> >   1.2  0.3  2.2  3.2  2.2  6.2  1.0  1.0  7.2  1.2  10.2 14.0 15.2 16.2 NB. 12 }
> >   1.2  0.3  2.2  3.2  2.2  6.2  1.7  1.7  7.2  1.2  10.2 1.2  15.2 16.2 NB. 13 {{
> >   1.2  0.3  2.2  3.2  2.2  6.2  1.7  1.7  7.2  1.2  10.2 1.2  15.2 16.2 NB. 14 }}
> >   1.2  0.3  2.2  3.2  2.2  6.2  1.0  1.0  7.2  11.2 10.2 12.2 15.0 16.0 NB. 15 partial
> >   1.2  0.3  2.2  3.2  2.2  6.2  1.0  1.0  7.2  11.2 10.2 12.2 15.2 16.0 NB. 16 utf-8
> > )
> >
> > As I noted in the beta forum -- increasing the complexity of the state
> > table adds both rows and columns (size grows proportional to or faster
> > than the square of the number of distinct character types when each
> > character type requires a distinct handling). So it's good to keep
> > this thing as simple as possible.
> >
> > Also, I've not tested this extensively, and it's possible I'll need to
> > make further changes (let me know if you spot any problems).
> >
> > That said... note also that I have *not* implemented the unicode
> > guideline which might suggest that the tokenizer should throw an error
> > on malformed utf-8 sequences. That would require several more rows and
> > columns to achieve the recommended inconvenience.
> > This would also introduce email line wrap, because the state table would
> > become that fat. (I'll attach a copy here as a .txt file, to see if an
> > earlier suggestion -- that the forum would preserve .txt attachments --
> > might be a way of working around that issue. I suspect not, but it's easy
> > enough to test...)
> >
> > This has *not* been implemented in the current jbeta as the ;: monad.
> > I am not sure if it should, since J the language is based on ascii,
> > not unicode -- it's just convenient that unicode supports an ascii
> > subset.
> >
> > Still... we often do have reason to work with utf-8.
> >
> > Thanks,
> >
> > --
> > Raul
> >
> > On Sun, Nov 8, 2020 at 11:01 AM Raul Miller <[email protected]> wrote:
> > >
> > > Oh, oops, I should have spotted that. Thanks.
> > >
> > > Updated state table:
> > >
> > > mj=: 256$0                     NB. X other
> > > mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
> > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
> > > mj=: 3 (a.i.'N')}mj            NB. N the letter N
> > > mj=: 4 (a.i.'B')}mj            NB. B the letter B
> > > mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
> > > mj=: 6 (a.i.'.')}mj            NB. . the decimal point
> > > mj=: 7 (a.i.':')}mj            NB. : the colon
> > > mj=: 8 (a.i.'''')}mj           NB. Q quote
> > > mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
> > > mj=:10 (10)} mj                NB. LF
> > > mj=:11 (a.i.'}')}mj            NB. } the right curly brace
> > >
> > > sj=: 0 10#:10*}.".;._2(0 :0)
> > > ' X    S    A    N    B    9    .    :    Q    {    LF   }']0
> > >   1.1  0.0  2.1  3.1  2.1  6.1  1.1  1.1  7.1  11.1 10.1 12.1 NB. 0 space
> > >   1.2  0.3  2.2  3.2  2.2  6.2  1.0  1.0  7.2  11.2 10.2 12.2 NB. 1 other
> > >   1.2  0.3  2.0  2.0  2.0  2.0  1.0  1.0  7.2  11.2 10.2 12.2 NB. 2 alp/num
> > >   1.2  0.3  2.0  2.0  4.0  2.0  1.0  1.0  7.2  11.2 10.2 12.2 NB. 3 N
> > >   1.2  0.3  2.0  2.0  2.0  2.0  5.0  1.0  7.2  11.2 10.2 12.2 NB. 4 NB
> > >   9.0  9.0  9.0  9.0  9.0  9.0  1.0  1.0  9.0  9.0  10.2 9.0  NB. 5 NB.
> > >   1.4  0.5  6.0  6.0  6.0  6.0  6.0  1.0  7.4  11.4 10.2 12.4 NB. 6 num
> > >   7.0  7.0  7.0  7.0  7.0  7.0  7.0  7.0  8.0  7.0  7.0  7.0  NB. 7 '
> > >   1.2  0.3  2.2  3.2  2.2  6.2  1.2  1.2  7.0  11.2 10.2 12.2 NB. 8 ''
> > >   9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0  9.0  10.2 9.0  NB. 9 comment
> > >   1.2  0.2  2.2  3.2  2.2  6.2  1.2  1.2  7.2  11.2 10.2 12.2 NB. 10 LF
> > >   1.2  0.3  2.2  3.2  2.2  6.2  1.0  1.0  7.2  13.0 10.2 1.2  NB. 11 {
> > >   1.2  0.3  2.2  3.2  2.2  6.2  1.0  1.0  7.2  1.2  10.2 14.0 NB. 12 }
> > >   1.2  0.3  2.2  3.2  2.2  6.2  1.7  1.7  7.2  1.2  10.2 1.2  NB. 13 {{
> > >   1.2  0.3  2.2  3.2  2.2  6.2  1.7  1.7  7.2  1.2  10.2 1.2  NB. 14 }}
> > > )
> > >
> > > (Note that I haven't coerced this state table to integer form --
> > > floats and integers occupy the same space on 64 bit systems, and the
> > > model doesn't really care about representation.)
> > >
> > > Thanks,
> > >
> > > --
> > > Raul
> > >
> > > On Sun, Nov 8, 2020 at 10:53 AM Henry Rich <[email protected]> wrote:
> > > >
> > > > I was talking about
> > > >
> > > >    ;: LF,'.'
> > > > +-+-+
> > > > | |.|
> > > > +-+-+
> > > >
> > > > Henry Rich
> > > >
> > > > On 11/8/2020 8:38 AM, Raul Miller wrote:
> > > > > I tested for that case:
> > > > >
> > > > >    #;:'NB.',LF,LF
> > > > > 3
> > > > >    #(0;sj;mj) sq 'NB.',LF,LF
> > > > > 3
> > > > >    #(0;sj;mj) sq 'NB.',LF,LF,LF
> > > > > 4
> > > > >
> > > > > Thanks,
> > > > >
> > > >
> > > > ----------------------------------------------------------------------
> > > > For information about J forums see http://www.jsoftware.com/forums.htm
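
For anyone who wants to see the "count a small number of steps" idea in isolation, here is a minimal sketch of a table-driven utf-8 check. It is not the ;: state table from this thread and not code anyone posted: the names cls, tr and utf8state are made up for illustration, and it is a plain loop over a next-state table, not the sequential machine itself. It only tracks how many suffix bytes are still expected, so it assumes the usual prefix ranges (192-223, 224-239, 240-247) and makes no attempt to reject overlong encodings or surrogates. State 4 plays the role of the error row, since every class leads back to it.

```j
NB. Illustrative sketch only: cls, tr and utf8state are hypothetical names,
NB. not part of the sj/mj definitions quoted in this thread.

NB. Character class for each of the 256 byte values
cls=: 256$0                 NB. 0 ascii (0-127)
cls=: 1 (128+i.64)}cls      NB. 1 utf-8 suffix (continuation) byte
cls=: 2 (192+i.32)}cls      NB. 2 prefix of a 2-byte character
cls=: 3 (224+i.16)}cls      NB. 3 prefix of a 3-byte character
cls=: 4 (240+i.8)}cls       NB. 4 prefix of a 4-byte character
cls=: 5 (248+i.8)}cls       NB. 5 never valid in utf-8

NB. Next-state table: rows are states, columns are classes.
NB. State 0 is "between characters", states 1 2 3 mean that many
NB. suffix bytes are still expected, state 4 is the error state.
tr=: ".;._2 (0 : 0)
0 4 1 2 3 4
4 0 4 4 4 4
4 1 4 4 4 4
4 2 4 4 4 4
4 4 4 4 4 4
)

NB. Run the table over the bytes of y and return the final state
utf8state=: 3 : 0
 s=. 0
 for_c. (a.i.y){cls do.
  s=. (<s,c){tr
 end.
 s
)

utf8state 'H₂O'               NB. 0: well formed
utf8state (226 130{a.),'O'    NB. 4: a suffix byte is missing
utf8state 226 130{a.          NB. 1: input ended while a suffix was still expected
```

A final state of 1, 2 or 3 is the end-of-string case mentioned at the top of this message: the input stopped in the middle of a character, and nothing after it was available to trigger the error transition.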
