I was not precise in my earlier response. I should have said that detecting the wrong number of UTF-8 continuation bytes would be difficult in the sequential machine, as you would probably need to detect seven possible U starts and seven U continues to check properly. That would make a very large SJ, although only checking for up to three continuation bytes would probably be sufficient.
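To make the counting concrete outside of ;:, here is a minimal sketch as an ordinary explicit verb rather than a state table. The name utf8ok is hypothetical and is not part of the sj/mj definitions in this thread. It only checks that each start byte is followed by the number of continuation bytes it announces (up to three). It does not reject overlong encodings, surrogates, or start bytes above 247.

```j
NB. utf8ok: hypothetical helper, not from the thread.
NB. Result is 1 if every start byte is followed by exactly the number
NB. of continuation bytes it announces, otherwise 0.
utf8ok=: 3 : 0
rem=. 0                              NB. continuation bytes still expected
for_c. a. i. y do.                   NB. c is a byte value, 0..255
  if. (c >: 128) *. c < 192 do.      NB. continuation byte 10xxxxxx
    if. rem = 0 do. 0 return. end.   NB. continuation with no start byte
    rem=. rem - 1
  else.
    if. rem > 0 do. 0 return. end.   NB. sequence was cut short
    rem=. +/ c >: 192 224 240        NB. 0 for ASCII, else 1, 2 or 3
  end.
end.
rem = 0                              NB. 0 if the data ends mid-sequence
)

utf8ok 'H', (226 130 130 { a.), 'O'  NB. complete three-byte sequence
utf8ok 'H', (226 130 { a.), 'O'      NB. one continuation byte missing
```

The first call should give 1 and the second 0. A check like this could be run on the data before or after ;: instead of enlarging the state table, which is one possible way around the size problem.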
And I should have said that your SJ and MJ did not catch the UTF-8 characters in my test. Each byte was still treated as "other". Your approach is better, as it allows treating UTF-8 as "other" while the token contains more than one byte: the entire UTF-8 sequence. I haven't looked at your SJ yet to try to find out why it doesn't catch the UTF-8.

But what does one do if there is an error in the data? ;: returns errors if SJ and MJ are not constructed properly, but there is no way to report an error for bad data. And if there were, what would a programmer do about it? So is it necessary to detect bad UTF-8 sequences? Probably not. And for now, at least, treating all UTF-8 like alp/num would probably be what one would want. Let the display of the data show the errors.

On Sat, Nov 21, 2020 at 9:12 AM Raul Miller <[email protected]> wrote:

> (That said, it's also worth noting that the state table I presented
> here doesn't give an error for unbalanced quotes, either. So if you
> want errors to be thrown, you should probably be updating the state
> table to force an error there, also.)
>
> (And, I should note that I haven't taken a look at what the resulting
> errors look like. So I don't know how informative the resulting error
> messages would be...)
>
> Thanks again,
>
> --
> Raul
>
> On Sat, Nov 21, 2020 at 11:04 AM Raul Miller <[email protected]> wrote:
> >
> > Yes, I was surprised (and pleased) that the text file came through ok.
> >
> > As for throwing errors on malformed utf-8 sequences, here's how that
> > could be implemented:
> >
> > (1) We introduce an "error row" which uses operation 2 for all
> > character classes (and, for consistency, identifies the next row as
> > itself -- this is mostly so that improper use of the error row would
> > be relatively obvious). (Operation 2 emits a token, but the important
> > thing here is that it throws an error if j=-1.)
> >
> > (2) For each row which is part of a partially complete utf-8
> > character, any appearance of any character class which is not a utf-8
> > suffix character would use operation 3 and would identify the next row
> > as the error row. (Operation 3 emits a token, but the important thing
> > here is that it sets j=-1.)
> >
> > (3) Each of the different utf-8 prefixes would lead to a different row
> > in the state table. For example, the character class containing
> > character 224 would get a beginning of token row (which gets the error
> > row treatment and) which leads to a row that expects a utf-8 suffix
> > (which gets the error row treatment) followed by a second utf-8 suffix
> > (which follows the pattern set by row 1 of the state table).
> >
> > Hopefully that makes sense.
> >
> > Note that the final token in a string could not trigger an error.
> > That's a limitation of the engine and corresponds approximately to how
> > utf-8 must be treated when a low level buffer boundary splits a utf-8
> > character.
> >
> > Anyways, the point is that the sequential machine can support the sort
> > of "count a small number of steps" which is needed here. The
> > difficulty is more that the machine stops when it reaches the end of
> > the string. If that's a sensitivity, this could be handled by
> > appending a linefeed character to the end of the string before
> > processing and then removing a final linefeed character from the last
> > token after tokenization.
> >
> > Again, I hope this makes sense...
> >
> > Thanks,
> >
> > --
> > Raul
> >
> > On Sat, Nov 21, 2020 at 9:22 AM Don Guinn <[email protected]> wrote:
> > >
> > > The text file worked great!
> > >
> > > As to the UTF-8 codes. What is important is to avoid splitting the start
> > > bytes from the continuation bytes. Validating the UTF-8 codes is a
> > > difficult task. The start byte includes the number of continuation bytes
> > > to follow.
> > > It would be an error if the number of continuation bytes didn't
> > > agree.
> > >
> > > I tried my test on your definitions and it failed. Attached is a text file
> > > with your definitions and my test following.
> > >
> > > Well, I can't view the attachment. I don't know if it's there or not. Just
> > > in case, here is my test.
> > >
> > > NB. A noun to show the handling of UTF-8 in ;:
> > > test=:{{)n
> > > The symbol for the Euro is ₠
> > > Other symools like π show up also
> > > How about ⌹ in APL NB. ⌹
> > > Common expressions like 'H₂O' for water
> > > Common expressions like H₂O for water
> > > }}
> > >
> > > NB. How ;: this sj and mj handles it
> > > ,.<;.2(0;sj;mj);:test
> > >
> > > NB. Assigning UTF8 as character
> > > mj=: 2 (128+i.128)}mj
> > >
> > > NB. How UTF-8 is now handled
> > > ,.<;.2(0;sj;mj);:test
> > >
> > > On Fri, Nov 20, 2020 at 2:46 PM Raul Miller <[email protected]> wrote:
> > > >
> > > > Here's an updated version, which also retains utf-8 character
> > > > sequences within token boundaries (instead of splitting them up into
> > > > multiple tokens). I had originally posted this to the jbeta forum, but
> > > > it's really a programming topic, and probably belongs here.
> > > >
> > > > mj=: 256$0                     NB. X other
> > > > mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
> > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
> > > > mj=: 3 (a.i.'N')}mj            NB. N the letter N
> > > > mj=: 4 (a.i.'B')}mj            NB. B the letter B
> > > > mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
> > > > mj=: 6 (a.i.'.')}mj            NB. . the decimal point
> > > > mj=: 7 (a.i.':')}mj            NB. : the colon
> > > > mj=: 8 (a.i.'''')}mj           NB. Q quote
> > > > mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
> > > > mj=:10 (10)} mj                NB. LF
> > > > mj=:11 (a.i.'}')}mj            NB. } the right curly brace
> > > > mj=:12 (192+i.64)}mj           NB. U utf-8 octet prefix
> > > > mj=:13 (128+i.64)}mj           NB. V utf-8 octet suffix
> > > >
> > > > sj=: 0 10#:10*}.".;._2(0 :0)
> > > > ' X S A N B 9 . : Q { LF } U V']0
> > > > 1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 15.1 16.1 NB. 0 space
> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 1 other
> > > > 1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 2 alp/num
> > > > 1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 3 N
> > > > 1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 4 NB
> > > > 9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 5 NB.
> > > > 1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 15.2 16.2 NB. 6 num
> > > > 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0 7.0 7.0 7.0 7.0 7.0 NB. 7 '
> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 15.2 16.2 NB. 8 ''
> > > > 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 9 comment
> > > > 1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 15.2 16.2 NB. 10 LF
> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2 1.2 15.2 16.2 NB. 11 {
> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 1.2 10.2 14.0 15.2 16.2 NB. 12 }
> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 13 {{
> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 14 }}
> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.0 NB. 16 utf-8
> > > > )
> > > >
> > > > As I noted in the beta forum -- increasing the complexity of the state
> > > > table adds both rows and columns (size grows proportional to or faster
> > > > than the square of the number of distinct character types when each
> > > > character type requires a distinct handling). So it's good to keep
> > > > this thing as simple as possible.
> > > >
> > > > Also, I've not tested this extensively, and it's possible I'll need to
> > > > make further changes (let me know if you spot any problems).
> > > >
> > > > That said... note also that I have *not* implemented the unicode
> > > > guideline which might suggest that the tokenizer should throw an error
> > > > on malformed utf-8 sequences. That would require several more rows and
> > > > columns to achieve the recommended inconvenience. This would also
> > > > introduce email line wrap, because the state table would become that
> > > > fat. (I'll attach a copy here as a .txt file, to see if an earlier
> > > > suggestion -- that the forum would preserve .txt attachments -- might
> > > > be a way of working around that issue. I suspect not, but it's easy
> > > > enough to test...)
> > > >
> > > > This has *not* been implemented in the current jbeta as the ;: monad.
> > > > I am not sure if it should, since J the language is based on ascii,
> > > > not unicode -- it's just convenient that unicode supports an ascii
> > > > subset.
> > > >
> > > > Still... we often do have reason to work with utf-8.
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Raul
> > > >
> > > > On Sun, Nov 8, 2020 at 11:01 AM Raul Miller <[email protected]> wrote:
> > > > >
> > > > > Oh, oops, I should have spotted that. Thanks.
> > > > >
> > > > > Updated state table:
> > > > >
> > > > > mj=: 256$0                     NB. X other
> > > > > mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
> > > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
> > > > > mj=: 3 (a.i.'N')}mj            NB. N the letter N
> > > > > mj=: 4 (a.i.'B')}mj            NB. B the letter B
> > > > > mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
> > > > > mj=: 6 (a.i.'.')}mj            NB. . the decimal point
> > > > > mj=: 7 (a.i.':')}mj            NB. : the colon
> > > > > mj=: 8 (a.i.'''')}mj           NB. Q quote
> > > > > mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
> > > > > mj=:10 (10)} mj                NB. LF
> > > > > mj=:11 (a.i.'}')}mj            NB. } the right curly brace
> > > > >
> > > > > sj=: 0 10#:10*}.".;._2(0 :0)
> > > > > ' X S A N B 9 . : Q { LF }']0
> > > > > 1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 NB. 0 space
> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 NB. 1 other
> > > > > 1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 2 alp/num
> > > > > 1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 3 N
> > > > > 1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 NB. 4 NB
> > > > > 9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0 9.0 10.2 9.0 NB. 5 NB.
> > > > > 1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 NB. 6 num
> > > > > 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0 7.0 7.0 7.0 NB. 7 '
> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 NB. 8 ''
> > > > > 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 10.2 9.0 NB. 9 comment
> > > > > 1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 NB. 10 LF
> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2 1.2 NB. 11 {
> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 1.2 10.2 14.0 NB. 12 }
> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 NB. 13 {{
> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 NB. 14 }}
> > > > > )
> > > > >
> > > > > (Note that I haven't coerced this state table to integer form --
> > > > > floats and integers occupy the same space on 64 bit systems, and the
> > > > > model doesn't really care about representation.)
> > > > >
> > > > > Thanks,
> > > > >
> > > > > --
> > > > > Raul
> > > > >
> > > > > On Sun, Nov 8, 2020 at 10:53 AM Henry Rich <[email protected]> wrote:
> > > > > >
> > > > > > I was talking about
> > > > > >
> > > > > >    ;: LF,'.'
> > > > > > +-+-+
> > > > > > | |.|
> > > > > > +-+-+
> > > > > >
> > > > > > Henry Rich
> > > > > >
> > > > > > On 11/8/2020 8:38 AM, Raul Miller wrote:
> > > > > > > I tested for that case:
> > > > > > >
> > > > > > >    #;:'NB.',LF,LF
> > > > > > > 3
> > > > > > >    #(0;sj;mj) sq 'NB.',LF,LF
> > > > > > > 3
> > > > > > >    #(0;sj;mj) sq 'NB.',LF,LF,LF
> > > > > > > 4
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > >
> > > > > > --
> > > > > > This email has been checked for viruses by AVG.
> > > > > > https://www.avg.com
> > > > > >
> > > > > > ----------------------------------------------------------------------
> > > > > > For information about J forums see http://www.jsoftware.com/forums.htm
