Yes. I forgot about j=._1 . And I just looked at the UTF-8 description and saw that it was restricted to 3 continuation bytes, for a maximum of 21 bits, in 2003. Previously it allowed up to 5 continuation bytes (6-byte sequences, 31 bits).
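Just to convince myself of the lead-byte structure, here is a throwaway verb of my own (the name contb is made up; this is not part of Raul's tables) that counts how many continuation bytes a lead octet calls for -- its run of leading 1 bits, less one, floored at zero:

   contb=: (3 : '0 >. _1 + +/ *./\ (8#2) #: y')"0   NB. continuation bytes implied by one octet
   contb 65 194 226 240                             NB. 'A', then 2-, 3- and 4-byte lead octets
0 1 2 3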
Hey! This has been a fun exercise whether or not it is put into production.

On Sat, Nov 21, 2020 at 11:48 AM Raul Miller <[email protected]> wrote:
> I think we need to be a little clearer about what you're trying to
> achieve when you say "catch the utf-8 characters".
>
> Your change to mj made all utf-8 *octets* be treated as alphabetic.
> That's... an approach, certainly.
>
> Meanwhile, according to https://en.wikipedia.org/wiki/UTF-8#Encoding
> there are three different utf-8 starts. I don't know what you are
> referring to when you say that there are seven possible U starts.
>
> But there is a way for ;: to report errors in the stream. See attached
> for a demonstration (I made quoted strings be an error.)
>
> Thanks,
>
> --
> Raul
>
> On Sat, Nov 21, 2020 at 1:05 PM Don Guinn <[email protected]> wrote:
> >
> > I was not precise in my earlier response. I should have said that
> > detecting the wrong number of UTF-8 continuation bytes would be
> > difficult in the sequential machine, as you would probably need to
> > detect seven possible U starts and seven U continues to properly
> > check. That would make a very large sj, although only checking for up
> > to three continuation bytes would probably be sufficient.
> >
> > And I should have said that your sj and mj did not catch the UTF-8
> > characters in my test. Each byte was still treated as "other". Your
> > approach is better, as it allows the possibility of treating UTF-8 as
> > "other" while the token would contain more than one byte - the entire
> > UTF-8 sequence. I haven't looked at your sj yet to try to find out
> > why it doesn't catch the UTF-8.
> >
> > But what does one do if there is an error in the data? ;: returns
> > errors if sj and mj are not constructed properly, but there is no way
> > to report an error for bad data. And if there were, what would a
> > programmer do about it? So is it necessary to detect bad UTF-8
> > sequences? Probably not. And for now at least treating all UTF-8 like
> > alp/num would probably be what one would want. Let the display of the
> > data show the errors.
> >
> > On Sat, Nov 21, 2020 at 9:12 AM Raul Miller <[email protected]> wrote:
> > >
> > > (That said, it's also worth noting that the state table I presented
> > > here doesn't give an error for unbalanced quotes, either. So if you
> > > want errors to be thrown, you should probably be updating the state
> > > table to force an error there, also.)
> > >
> > > (And, I should note that I haven't taken a look at what the
> > > resulting errors look like. So I don't know how informative the
> > > resulting error messages would be...)
> > >
> > > Thanks again,
> > >
> > > --
> > > Raul
> > >
> > > On Sat, Nov 21, 2020 at 11:04 AM Raul Miller <[email protected]> wrote:
> > > >
> > > > Yes, I was surprised (and pleased) that the text file came through ok.
> > > >
> > > > As for throwing errors on malformed utf-8 sequences, here's how
> > > > that could be implemented:
> > > >
> > > > (1) We introduce an "error row" which uses operation 2 for all
> > > > character classes (and, for consistency, identifies the next row
> > > > as itself -- this is mostly so that improper use of the error row
> > > > would be relatively obvious). (Operation 2 emits a token, but the
> > > > important thing here is that it throws an error if j=-1.)
> > > > (2) For each row which is part of a partially complete utf-8
> > > > character, any appearance of any character class which is not a
> > > > utf-8 suffix character would use operation 3 and would identify
> > > > the next row as the error row. (Operation 3 emits a token, but the
> > > > important thing here is that it sets j=-1.)
> > > >
> > > > (3) Each of the different utf-8 prefixes would lead to a different
> > > > row in the state table. For example, the character class containing
> > > > character 224 would get a beginning-of-token row (which gets the
> > > > error row treatment) and which leads to a row that expects a utf-8
> > > > suffix (which gets the error row treatment) followed by a second
> > > > utf-8 suffix (which follows the pattern set by row 1 of the state
> > > > table).
> > > >
> > > > Hopefully that makes sense.
> > > >
> > > > Note that the final token in a string could not trigger an error.
> > > > That's a limitation of the engine and corresponds approximately to
> > > > how utf-8 must be treated when a low level buffer boundary splits
> > > > a utf-8 character.
> > > >
> > > > Anyways, the point is that the sequential machine can support the
> > > > sort of "count a small number of steps" which is needed here. The
> > > > difficulty is more that the machine stops when it reaches the end
> > > > of the string. If that's a sensitivity, this could be handled by
> > > > appending a linefeed character to the end of the string before
> > > > processing and then removing a final linefeed character from the
> > > > last token after tokenization.
> > > >
> > > > Again, I hope this makes sense...
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Raul
> > > >
> > > > On Sat, Nov 21, 2020 at 9:22 AM Don Guinn <[email protected]> wrote:
> > > > >
> > > > > The text file worked great!
> > > > >
> > > > > As to the UTF-8 codes: what is important is to avoid splitting
> > > > > the start bytes from the continuation bytes. Validating the
> > > > > UTF-8 codes is a difficult task. The start byte includes the
> > > > > number of continuation bytes to follow. It would be an error if
> > > > > the number of continuation bytes didn't agree.
> > > > >
> > > > > I tried my test on your definitions and it failed. Attached is a
> > > > > text file with your definitions and my test following.
> > > > >
> > > > > Well, I can't view the attachment. I don't know if it's there or
> > > > > not. Just in case, here is my test.
> > > > >
> > > > > NB. A noun to show the handling of UTF-8 in ;:
> > > > > test=:{{)n
> > > > > The symbol for the Euro is ₠
> > > > > Other symbols like π show up also
> > > > > How about ⌹ in APL NB. ⌹
> > > > > Common expressions like 'H₂O' for water
> > > > > Common expressions like H₂O for water
> > > > > }}
> > > > >
> > > > > NB. How ;: with this sj and mj handles it
> > > > > ,.<;.2(0;sj;mj);:test
> > > > >
> > > > > NB. Assigning UTF8 as character
> > > > > mj=: 2 (128+i.128)}mj
> > > > >
> > > > > NB. How UTF-8 is now handled
> > > > > ,.<;.2(0;sj;mj);:test
> > > > >
> > > > > On Fri, Nov 20, 2020 at 2:46 PM Raul Miller <[email protected]> wrote:
> > > > > >
> > > > > > Here's an updated version, which also retains utf-8 character
> > > > > > sequences within token boundaries (instead of splitting them
> > > > > > up into multiple tokens). I had originally posted this to the
> > > > > > jbeta forum, but it's really a programming topic, and probably
> > > > > > belongs here.
> > > > > >
> > > > > > mj=: 256$0                      NB. X other
> > > > > > mj=: 1 (9,a.i.' ')}mj           NB. S space and tab
> > > > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj   NB. A A-Z a-z excluding N B
> > > > > > mj=: 3 (a.i.'N')}mj             NB. N the letter N
> > > > > > mj=: 4 (a.i.'B')}mj             NB. B the letter B
> > > > > > mj=: 5 (a.i.'0123456789_')}mj   NB. 9 digits and _
> > > > > > mj=: 6 (a.i.'.')}mj             NB. . the decimal point
> > > > > > mj=: 7 (a.i.':')}mj             NB. : the colon
> > > > > > mj=: 8 (a.i.'''')}mj            NB. Q quote
> > > > > > mj=: 9 (a.i.'{')}mj             NB. { the left curly brace
> > > > > > mj=:10 (10)} mj                 NB. LF
> > > > > > mj=:11 (a.i.'}')}mj             NB. } the right curly brace
> > > > > > mj=:12 (192+i.64)}mj            NB. U utf-8 octet prefix
> > > > > > mj=:13 (128+i.64)}mj            NB. V utf-8 octet suffix
> > > > > >
> > > > > > sj=: 0 10#:10*}.".;._2(0 :0)
> > > > > > ' X S A N B 9 . : Q { LF } U V']0
> > > > > > 1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 15.1 16.1 NB. 0 space
> > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 1 other
> > > > > > 1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 2 alp/num
> > > > > > 1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 3 N
> > > > > > 1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 4 NB
> > > > > > 9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 5 NB.
> > > > > > 1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 15.2 16.2 NB. 6 num
> > > > > > 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0 7.0 7.0 7.0 7.0 7.0 NB. 7 '
> > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 15.2 16.2 NB. 8 ''
> > > > > > 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 9 comment
> > > > > > 1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 15.2 16.2 NB. 10 LF
> > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2 1.2 15.2 16.2 NB. 11 {
> > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 1.2 10.2 14.0 15.2 16.2 NB. 12 }
> > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 13 {{
> > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 14 }}
> > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
> > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.0 NB. 16 utf-8
> > > > > > )
> > > > > >
> > > > > > As I noted in the beta forum -- increasing the complexity of
> > > > > > the state table adds both rows and columns (size grows
> > > > > > proportional to or faster than the square of the number of
> > > > > > distinct character types when each character type requires a
> > > > > > distinct handling). So it's good to keep this thing as simple
> > > > > > as possible.
> > > > > >
> > > > > > Also, I've not tested this extensively, and it's possible I'll
> > > > > > need to make further changes (let me know if you spot any
> > > > > > problems).
> > > > > >
> > > > > > That said... note also that I have *not* implemented the
> > > > > > unicode guideline which might suggest that the tokenizer
> > > > > > should throw an error on malformed utf-8 sequences. That would
> > > > > > require several more rows and columns to achieve the
> > > > > > recommended inconvenience. This would also introduce email
> > > > > > line wrap, because the state table would become that fat.
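Stepping out of the quote for a moment: reading (1) and (2) above against this table, here is a rough, untested sketch of my own (the row numbers and entries are mine, not anything Raul posted) of how the first two pieces might be bolted onto the sj just quoted:

NB. (1) append an "error row" 17: every class uses function 2 (which
NB. signals an error when j=_1) and names row 17 itself as the next row
sj=: sj , 14 2 $ 17 2
NB. (2) in row 15 (mid utf-8 character), every class except the utf-8
NB. suffix column V jumps to the error row with function 3 (sets j=_1)
sj=: (13 2 $ 17 3) (<15; i.13)} sj

Step (3), giving each prefix length its own chain of rows, would add a few more rows along the same lines. Back to Raul: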
> > > > > >
> > > > > > (I'll attach a copy here as a .txt file, to see if an earlier
> > > > > > suggestion -- that the forum would preserve .txt attachments --
> > > > > > might be a way of working around that issue. I suspect not,
> > > > > > but it's easy enough to test...)
> > > > > >
> > > > > > This has *not* been implemented in the current jbeta as the ;:
> > > > > > monad. I am not sure if it should be, since J the language is
> > > > > > based on ascii, not unicode -- it's just convenient that
> > > > > > unicode supports an ascii subset.
> > > > > >
> > > > > > Still... we often do have reason to work with utf-8.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > --
> > > > > > Raul
> > > > > >
> > > > > > On Sun, Nov 8, 2020 at 11:01 AM Raul Miller <[email protected]> wrote:
> > > > > > >
> > > > > > > Oh, oops, I should have spotted that. Thanks.
> > > > > > >
> > > > > > > Updated state table:
> > > > > > >
> > > > > > > mj=: 256$0                      NB. X other
> > > > > > > mj=: 1 (9,a.i.' ')}mj           NB. S space and tab
> > > > > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj   NB. A A-Z a-z excluding N B
> > > > > > > mj=: 3 (a.i.'N')}mj             NB. N the letter N
> > > > > > > mj=: 4 (a.i.'B')}mj             NB. B the letter B
> > > > > > > mj=: 5 (a.i.'0123456789_')}mj   NB. 9 digits and _
> > > > > > > mj=: 6 (a.i.'.')}mj             NB. . the decimal point
> > > > > > > mj=: 7 (a.i.':')}mj             NB. : the colon
> > > > > > > mj=: 8 (a.i.'''')}mj            NB. Q quote
> > > > > > > mj=: 9 (a.i.'{')}mj             NB. { the left curly brace
> > > > > > > mj=:10 (10)} mj                 NB. LF
> > > > > > > mj=:11 (a.i.'}')}mj             NB. } the right curly brace
> > > > > > >
> > > > > > > sj=: 0 10#:10*}.".;._2(0 :0)
> > > > > > > ' X S A N B 9 . : Q { LF }']0
> > > > > > > 1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 NB. 0 space
> > > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 NB. 1 other
> > > > > > > 1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 2 alp/num
> > > > > > > 1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 3 N
> > > > > > > 1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 NB. 4 NB
> > > > > > > 9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0 9.0 10.2 9.0 NB. 5 NB.
> > > > > > > 1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 NB. 6 num
> > > > > > > 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0 7.0 7.0 7.0 NB. 7 '
> > > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 NB. 8 ''
> > > > > > > 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 10.2 9.0 NB. 9 comment
> > > > > > > 1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 NB. 10 LF
> > > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2 1.2 NB. 11 {
> > > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 1.2 10.2 14.0 NB. 12 }
> > > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 NB. 13 {{
> > > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 NB. 14 }}
> > > > > > > )
> > > > > > >
> > > > > > > (Note that I haven't coerced this state table to integer form --
> > > > > > > floats and integers occupy the same space on 64 bit systems,
> > > > > > > and the model doesn't really care about representation.)
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > --
> > > > > > > Raul
> > > > > > >
> > > > > > > On Sun, Nov 8, 2020 at 10:53 AM Henry Rich <[email protected]> wrote:
> > > > > > > >
> > > > > > > > I was talking about
> > > > > > > >
> > > > > > > >    ;: LF,'.'
> > > > > > > > +-+-+
> > > > > > > > | |.|
> > > > > > > > +-+-+
> > > > > > > >
> > > > > > > > Henry Rich
> > > > > > > >
> > > > > > > > On 11/8/2020 8:38 AM, Raul Miller wrote:
> > > > > > > > > I tested for that case:
> > > > > > > > >
> > > > > > > > >    #;:'NB.',LF,LF
> > > > > > > > > 3
> > > > > > > > >    #(0;sj;mj) sq 'NB.',LF,LF
> > > > > > > > > 3
> > > > > > > > >    #(0;sj;mj) sq 'NB.',LF,LF,LF
> > > > > > > > > 4
> > > > > > > > >
> > > > > > > > > Thanks,
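P.S. As a quick sanity check of my own on the Nov 20 mj (assuming the string is held as utf-8 octets in the session), the Euro-currency sign's three octets do land in the U and V classes, which is what keeps the whole character inside one token:

   a. i. '₠'
226 130 160
   mj {~ a. i. '₠'
12 13 13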
