I was not precise in my earlier response. I should have said that detecting
the wrong number of UTF-8 continuation bytes would be difficult in the
sequential machine, as you would probably need seven possible U start
classes and seven U continuation states to check properly. That would make
a very large SJ, although checking for only up to three continuation bytes
would probably be sufficient.
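
To make the counting concrete, here is a rough sketch of the check done
outside the sequential machine, in an ordinary explicit verb. The names
need and utf8ok are my own and are not part of anything posted in this
thread. It only counts continuation bytes (at most three per start byte).
It does not reject overlong encodings, surrogates, or start bytes 245-247,
so treat it as an illustration rather than a validator.

NB. Continuation bytes expected after each byte value 0..255.
NB. _1 marks a byte that cannot start a character.
need =: (128#0),(64#_1),(32#1),(16#2),(8#3),(8#_1)

utf8ok =: 3 : 0
 k =. 0                 NB. continuation bytes still owed
 for_c. a. i. y do.
  if. k > 0 do.
   NB. a continuation byte must lie in 128..191
   if. (c < 128) +. c > 191 do. 0 return. end.
   k =. k - 1
  else.
   k =. c { need
   if. k < 0 do. 0 return. end.
  end.
 end.
 k = 0                  NB. fail if the data ends mid-character
)

With these definitions, utf8ok 226 130 130{a. (the three bytes of the
subscript 2 in my test) should give 1, and utf8ok 226 130{a. should give 0
because one continuation byte is missing.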

And I should have said that your SJ and MJ did not catch the UTF-8
characters in my test. Each byte was still treated as "other". Your
approach is better, since it allows UTF-8 to be treated as "other" while
keeping the entire multi-byte sequence together in one token. I haven't
looked at your SJ yet to find out why it doesn't catch the UTF-8.

But what does one do if there is an error in the data? ;: returns errors if
SJ and MJ are not constructed properly, but there is no way to report an
error for bad data. And if there were, what would a programmer do about it?
So is it necessary to detect bad UTF-8 sequences? Probably not. For now, at
least, treating all UTF-8 bytes like alp/num would probably be what one
would want. Let the display of the data show the errors.

On Sat, Nov 21, 2020 at 9:12 AM Raul Miller <[email protected]> wrote:

> (That said, it's also worth noting that the state table I presented
> here doesn't give an error for unbalanced quotes, either. So if you
> want errors to be thrown, you should probably be updating the state
> table to force an error there, also.)
>
> (And, I should note that I haven't taken a look at what the resulting
> errors look like. So I don't know how informative the resulting error
> messages would be...)
>
> Thanks again,
>
> --
> Raul
>
> On Sat, Nov 21, 2020 at 11:04 AM Raul Miller <[email protected]>
> wrote:
> >
> > Yes, I was surprised (and pleased) that the text file came through ok.
> >
> > As for throwing errors on malformed utf-8 sequences, here's how that
> > could be implemented:
> >
> > (1) We introduce an "error row" which uses operation 2 for all
> > character classes (and, for consistency, identifies the next row as
> > itself -- this is mostly so that improper use of the error row would
> > be relatively obvious). (Operation 2 emits a token, but the important
> > thing here is that it throws an error if j=-1.)
> >
> > (2) For each row which is part of a partially complete utf-8
> > character, any appearance of any character class which is not a utf-8
> > suffix character would use operation 3 and would identify the next row
> > as the error row. (Operation 3 emits a token, but the important thing
> > here is that it sets j=-1.)
> >
> > (3) Each of the different utf-8 prefixes would lead to a different row
> > in the state table. For example, the character class containing
> > character 224 would get a beginning of token row (which gets the error
> > row treatment and) which leads to a row that expects a utf-8 suffix
> > (which gets the error row treatment) followed by a second utf-8 suffix
> > (which follows the pattern set by row 1 of the state table).
> >
> > Hopefully that makes sense.
> >
> > Note that the final token in a string could not trigger an error.
> > That's a limitation of the engine and corresponds approximately to how
> > utf-8 must be treated when a low level buffer boundary splits a utf-8
> > character.
> >
> > Anyways, the point is that the sequential machine can support the sort
> > of "count a small number of steps" which is needed here. The
> > difficulty is more that the machine stops when it reaches the end of
> > the string.  If that's a sensitivity, this could be handled by
> > appending a linefeed character to the end of the string before
> > processing and then removing a final linefeed character from the last
> > token after tokenization.
> >
> > Again, I hope this makes sense...
> >
> > Thanks,
> >
> > --
> > Raul
> >
> >
> >
> >
> > On Sat, Nov 21, 2020 at 9:22 AM Don Guinn <[email protected]> wrote:
> > >
> > > The text file worked great!
> > >
> > > As to the UTF-8 codes. What is important is to avoid splitting the
> > > start bytes from the continuation bytes. Validating the UTF-8 codes
> > > is a difficult task. The start byte includes the number of
> > > continuation bytes to follow. It would be an error if the number of
> > > continuation bytes didn't agree.
> > >
> > > I tried my test on your definitions and it failed. Attached is a text
> > > file with your definitions and my test following.
> > >
> > > Well, I can't view the attachment. I don't know if it's there or not.
> > > Just in case, here is my test.
> > >
> > >
> > > NB. A noun to show the handling of UTF-8 in ;:
> > > test=:{{)n
> > > The symbol for the Euro is ₠
> > > Other symools like π show up also
> > > How about ⌹ in APL NB. ⌹
> > > Common expressions like 'H₂O' for water
> > > Common expressions like H₂O for water
> > > }}
> > >
> > > NB. How ;: this sj and mj handles it
> > > ,.<;.2(0;sj;mj);:test
> > >
> > > NB. Assigning UTF8 as character
> > > mj=: 2 (128+i.128)}mj
> > >
> > > NB. How UTF-8 is now handled
> > > ,.<;.2(0;sj;mj);:test
> > >
> > > On Fri, Nov 20, 2020 at 2:46 PM Raul Miller <[email protected]> wrote:
> > >
> > > > Here's an updated version, which also retains utf-8 character
> > > > sequences within token boundaries (instead of splitting them up into
> > > > multiple tokens). I had originally posted this to the jbeta forum,
> > > > but it's really a programming topic, and probably belongs here.
> > > >
> > > > mj=: 256$0                     NB. X other
> > > > mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
> > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
> > > > mj=: 3 (a.i.'N')}mj            NB. N the letter N
> > > > mj=: 4 (a.i.'B')}mj            NB. B the letter B
> > > > mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
> > > > mj=: 6 (a.i.'.')}mj            NB. . the decimal point
> > > > mj=: 7 (a.i.':')}mj            NB. : the colon
> > > > mj=: 8 (a.i.'''')}mj           NB. Q quote
> > > > mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
> > > > mj=:10 (10)} mj                NB. LF
> > > > mj=:11 (a.i.'}')}mj            NB. } the right curly brace
> > > > mj=:12 (192+i.64)}mj           NB. U utf-8 octet prefix
> > > > mj=:13 (128+i.64)}mj           NB. V utf-8 octet suffix
> > > >
> > > > sj=: 0 10#:10*}.".;._2(0 :0)
> > > > ' X   S   A   N   B   9   .   :   Q    {    LF   }   U    V']0
> > > >  1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 15.1 16.1 NB. 0 space
> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 1 other
> > > >  1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 2 alp/num
> > > >  1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 3 N
> > > >  1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 4 NB
> > > >  9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0  9.0 10.2  9.0  9.0  9.0 NB. 5 NB.
> > > >  1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 15.2 16.2 NB. 6 num
> > > >  7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0  7.0  7.0  7.0  7.0  7.0 NB. 7 '
> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 15.2 16.2 NB. 8 ''
> > > >  9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0  9.0 10.2  9.0  9.0  9.0 NB. 9 comment
> > > >  1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 15.2 16.2 NB. 10 LF
> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2  1.2 15.2 16.2 NB. 11 {
> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2  1.2 10.2 14.0 15.2 16.2 NB. 12 }
> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 15.2 16.2 NB. 13 {{
> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 15.2 16.2 NB. 14 }}
> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.0 NB. 16 utf-8
> > > > )
> > > >
> > > > As I noted in the beta forum -- increasing the complexity of the
> > > > state table adds both rows and columns (size grows proportional to
> > > > or faster than the square of the number of distinct character types
> > > > when each character type requires a distinct handling). So it's good
> > > > to keep this thing as simple as possible.
> > > >
> > > > Also, I've not tested this extensively, and it's possible I'll need
> > > > to make further changes (let me know if you spot any problems).
> > > >
> > > > That said... note also that I have *not* implemented the unicode
> > > > guideline which might suggest that the tokenizer should throw an
> > > > error on malformed utf-8 sequences. That would require several more
> > > > rows and columns to achieve the recommended inconvenience. This would
> > > > also introduce email line wrap, because the state table would become
> > > > that fat. (I'll attach a copy here as a .txt file, to see if an
> > > > earlier suggestion -- that the forum would preserve .txt attachments
> > > > -- might be a way of working around that issue. I suspect not, but
> > > > it's easy enough to test...)
> > > >
> > > > This has *not* been implemented in the current jbeta as the ;: monad.
> > > > I am not sure if it should, since J the language is based on ascii,
> > > > not unicode -- it's just convenient that unicode supports an ascii
> > > > subset.
> > > >
> > > > Still... we often do have reason to work with utf-8.
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Raul
> > > >
> > > > > On Sun, Nov 8, 2020 at 11:01 AM Raul Miller <[email protected]> wrote:
> > > > >
> > > > > Oh, oops, I should have spotted that. Thanks.
> > > > >
> > > > > Updated state table:
> > > > >
> > > > > mj=: 256$0                     NB. X other
> > > > > mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
> > > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
> > > > > mj=: 3 (a.i.'N')}mj            NB. N the letter N
> > > > > mj=: 4 (a.i.'B')}mj            NB. B the letter B
> > > > > mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
> > > > > mj=: 6 (a.i.'.')}mj            NB. . the decimal point
> > > > > mj=: 7 (a.i.':')}mj            NB. : the colon
> > > > > mj=: 8 (a.i.'''')}mj           NB. Q quote
> > > > > mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
> > > > > mj=:10 (10)} mj                NB. LF
> > > > > mj=:11 (a.i.'}')}mj            NB. } the right curly brace
> > > > >
> > > > > sj=: 0 10#:10*}.".;._2(0 :0)
> > > > > ' X   S   A   N   B   9   .   :   Q    {    LF   }']0
> > > > >  1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 NB. 0 space
> > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 NB. 1 other
> > > > >  1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 2 alp/num
> > > > >  1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 3 N
> > > > >  1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 NB. 4 NB
> > > > >  9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0  9.0 10.2  9.0 NB. 5 NB.
> > > > >  1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 NB. 6 num
> > > > >  7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0  7.0  7.0  7.0 NB. 7 '
> > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 NB. 8 ''
> > > > >  9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0  9.0 10.2  9.0 NB. 9 comment
> > > > >  1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 NB. 10 LF
> > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2  1.2 NB. 11 {
> > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2  1.2 10.2 14.0 NB. 12 }
> > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 NB. 13 {{
> > > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 NB. 14 }}
> > > > > )
> > > > >
> > > > > > (Note that I haven't coerced this state table to integer form --
> > > > > > floats and integers occupy the same space on 64 bit systems, and
> > > > > > the model doesn't really care about representation.)
> > > > >
> > > > > Thanks,
> > > > >
> > > > > --
> > > > > Raul
> > > > >
> > > > > > On Sun, Nov 8, 2020 at 10:53 AM Henry Rich <[email protected]> wrote:
> > > > > >
> > > > > > I was talking about
> > > > > >
> > > > > >     ;: LF,'.'
> > > > > > +-+-+
> > > > > > | |.|
> > > > > > +-+-+
> > > > > >
> > > > > > Henry Rich
> > > > > >
> > > > > > On 11/8/2020 8:38 AM, Raul Miller wrote:
> > > > > > > I tested for that case:
> > > > > > >
> > > > > > >     #;:'NB.',LF,LF
> > > > > > > 3
> > > > > > >    #(0;sj;mj) sq 'NB.',LF,LF
> > > > > > > 3
> > > > > > >     #(0;sj;mj) sq 'NB.',LF,LF,LF
> > > > > > > 4
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > This email has been checked for viruses by AVG.
> > > > > > https://www.avg.com
> > > > > >
> > > > > >
> > > > > > > ----------------------------------------------------------------------
> > > > > > > For information about J forums see http://www.jsoftware.com/forums.htm
> > > >
> ----------------------------------------------------------------------
> > > > For information about J forums see
> http://www.jsoftware.com/forums.htm
> > > >
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm

Reply via email to