(That said, it's also worth noting that the state table I presented
here doesn't give an error for unbalanced quotes either. So if you
want errors to be thrown, you should probably update the state table
to force an error there as well.)

(And, I should note that I haven't taken a look at what the resulting
errors look like. So I don't know how informative the resulting error
messages would be...)

Thanks again,

-- 
Raul

On Sat, Nov 21, 2020 at 11:04 AM Raul Miller wrote:
>
> Yes, I was surprised (and pleased) that the text file came through ok.
>
> As for throwing errors on malformed utf-8 sequences, here's how that
> could be implemented:
>
> (1) We introduce an "error row" which uses operation 2 for all
> character classes (and, for consistency, identifies the next row as
> itself -- this is mostly so that improper use of the error row would
> be relatively obvious). (Operation 2 emits a token, but the important
> thing here is that it throws an error if j=-1.)
>
> (2) For each row which is part of a partially complete utf-8
> character, any appearance of any character class which is not a utf-8
> suffix character would use operation 3 and would identify the next row
> as the error row. (Operation 3 emits a token, but the important thing
> here is that it sets j=-1.)
>
> (3) Each of the different utf-8 prefixes would lead to a different row
> in the state table. For example, the character class containing
> character 224 would get a beginning of token row (which gets the error
> row treatment and) which leads to a row that expects a utf-8 suffix
> (which gets the error row treatment) followed by a second utf-8 suffix
> (which follows the pattern set by row 1 of the state table).
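>
> As a minimal sketch of how the "next row" and "operation" in these
> steps relate to the table entries quoted further down (an
> illustration only, not one of the original definitions): each entry
> such as 15.2 is split by the same 0 10#:10* phrase that builds sj,
> giving the next row followed by the operation.
>
>    0 10 #: 10 * 1.2 15.2
>  1 2
> 15 2
>
> So 15.2 reads as "go to row 15, using operation 2".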
>
> Hopefully that makes sense.
>
> Note that the final token in a string could not trigger an error.
> That's a limitation of the engine and corresponds approximately to how
> utf-8 must be treated when a low level buffer boundary splits a utf-8
> character.
>
> Anyways, the point is that the sequential machine can support the sort
> of "count a small number of steps" which is needed here. The
> difficulty is more that the machine stops when it reaches the end of
> the string.  If that's a sensitivity, this could be handled by
> appending a linefeed character to the end of the string before
> processing and then removing a final linefeed character from the last
> token after tokenization.
>
> Again, I hope this makes sense...
>
> Thanks,
>
> --
> Raul
>
>
>
>
> On Sat, Nov 21, 2020 at 9:22 AM Don Guinn wrote:
> >
> > The text file worked great!
> >
> > As for the UTF-8 codes, what is important is to avoid splitting the start
> > bytes from the continuation bytes. Validating the UTF-8 codes is a
> > difficult task. The start byte encodes the number of continuation bytes to
> > follow. It would be an error if the actual number of continuation bytes
> > didn't agree.
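> >
> > A rough sketch of that rule in J, for illustration only: ncont is a
> > hypothetical helper (not part of the definitions in this thread) that
> > maps byte values to the number of continuation bytes each one announces.
> >
> > NB. 0 plain ASCII; _1 continuation byte; 1 2 3 start bytes;
> > NB. _2 bytes 248-255, which never occur in UTF-8
> > ncont=: 3 : '0 _1 1 2 3 _2 {~ +/ 128 192 224 240 248 <:/ y'
> >
> >    ncont 226 130 160   NB. the three bytes of U+20A0
> > 2 _1 _1
> >
> > A start byte announcing 2 followed by exactly two _1 entries agrees;
> > any other count would be the error described above. (Bytes 192 and 193
> > are also never valid, which this sketch does not check.)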
> >
> > I tried my test on your definitions and it failed. Attached is a text file
> > with your definitions and my test following.
> >
> > Well, I can't view the attachment. I don't know if it's there or not. Just
> > in case, here is my test.
> >
> >
> > NB. A noun to show the handling of UTF-8 in ;:
> > test=:{{)n
> > The symbol for the Euro is ₠
> > Other symbols like π show up also
> > How about ⌹ in APL NB. ⌹
> > Common expressions like 'H₂O' for water
> > Common expressions like H₂O for water
> > }}
> >
> > NB. How ;: with this sj and mj handles it
> > ,.<;.2(0;sj;mj);:test
> >
> > NB. Assigning UTF8 as character
> > mj=: 2 (128+i.128)}mj
> >
> > NB. How UTF-8 is now handled
> > ,.<;.2(0;sj;mj);:test
> >
> > On Fri, Nov 20, 2020 at 2:46 PM Raul Miller wrote:
> >
> > > Here's an updated version, which also retains utf-8 character
> > > sequences within token boundaries (instead of splitting them up into
> > > multiple tokens). I had originally posted this to the jbeta forum, but
> > > it's really a programming topic, and probably belongs here.
> > >
> > > mj=: 256$0                     NB. X other
> > > mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
> > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
> > > mj=: 3 (a.i.'N')}mj            NB. N the letter N
> > > mj=: 4 (a.i.'B')}mj            NB. B the letter B
> > > mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
> > > mj=: 6 (a.i.'.')}mj            NB. . the decimal point
> > > mj=: 7 (a.i.':')}mj            NB. : the colon
> > > mj=: 8 (a.i.'''')}mj           NB. Q quote
> > > mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
> > > mj=:10 (10)} mj                NB. LF
> > > mj=:11 (a.i.'}')}mj            NB. } the right curly brace
> > > mj=:12 (192+i.64)}mj           NB. U utf-8 octet prefix
> > > mj=:13 (128+i.64)}mj           NB. V utf-8 octet suffix
> > >
> > > sj=: 0 10#:10*}.".;._2(0 :0)
> > > ' X   S   A   N   B   9   .   :   Q    {    LF   }   U    V']0
> > >  1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 15.1 16.1 NB. 0 space
> > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 1 other
> > >  1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 2 alp/num
> > >  1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 3 N
> > >  1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 4 NB
> > >  9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0  9.0 10.2  9.0  9.0  9.0 NB. 5 NB.
> > >  1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 15.2 16.2 NB. 6 num
> > >  7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0  7.0  7.0  7.0  7.0  7.0 NB. 7 '
> > >  1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 15.2 16.2 NB. 8 ''
> > >  9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0  9.0 10.2  9.0  9.0  9.0 NB. 9 comment
> > >  1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 15.2 16.2 NB. 10 LF
> > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2  1.2 15.2 16.2 NB. 11 {
> > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2  1.2 10.2 14.0 15.2 16.2 NB. 12 }
> > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 15.2 16.2 NB. 13 {{
> > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 15.2 16.2 NB. 14 }}
> > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
> > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.0 NB. 16 utf-8
> > > )
> > >
> > > As I noted in the beta forum -- increasing the complexity of the state
> > > table adds both rows and columns (size grows proportional to or faster
> > > than the square of the number of distinct character types when each
> > > character type requires a distinct handling). So it's good to keep
> > > this thing as simple as possible.
> > >
> > > Also, I've not tested this extensively, and it's possible I'll need to
> > > make further changes (let me know if you spot any problems).
> > >
> > > That said... note also that I have *not* implemented the unicode
> > > guideline which might suggest that the tokenizer should throw an error
> > > on malformed utf-8 sequences. That would require several more rows and
> > > columns to achieve the recommended inconvenience. This would also
> > > introduce email line wrap, because the state table would become that
> > > fat. (I'll attach a copy here as a .txt file, to see if an earlier
> > > suggestion -- that the forum would preserve .txt attachments -- might
> > > be a way of working around that issue. I suspect not, but it's easy
> > > enough to test...)
> > >
> > > This has *not* been implemented in the current jbeta as the ;: monad.
> > > I am not sure that it should be, since J the language is based on
> > > ascii, not unicode -- it's just convenient that unicode supports an
> > > ascii subset.
> > >
> > > Still... we often do have reason to work with utf-8.
> > >
> > > Thanks,
> > >
> > > --
> > > Raul
> > >
> > > On Sun, Nov 8, 2020 at 11:01 AM Raul Miller wrote:
> > > >
> > > > Oh, oops, I should have spotted that. Thanks.
> > > >
> > > > Updated state table:
> > > >
> > > > mj=: 256$0                     NB. X other
> > > > mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
> > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
> > > > mj=: 3 (a.i.'N')}mj            NB. N the letter N
> > > > mj=: 4 (a.i.'B')}mj            NB. B the letter B
> > > > mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
> > > > mj=: 6 (a.i.'.')}mj            NB. . the decimal point
> > > > mj=: 7 (a.i.':')}mj            NB. : the colon
> > > > mj=: 8 (a.i.'''')}mj           NB. Q quote
> > > > mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
> > > > mj=:10 (10)} mj                NB. LF
> > > > mj=:11 (a.i.'}')}mj            NB. } the right curly brace
> > > >
> > > > sj=: 0 10#:10*}.".;._2(0 :0)
> > > > ' X   S   A   N   B   9   .   :   Q    {    LF   }']0
> > > >  1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 NB. 0 space
> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 NB. 1 other
> > > >  1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 2 alp/num
> > > >  1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 3 N
> > > >  1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 NB. 4 NB
> > > >  9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0  9.0 10.2  9.0 NB. 5 NB.
> > > >  1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 NB. 6 num
> > > >  7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0  7.0  7.0  7.0 NB. 7 '
> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 NB. 8 ''
> > > >  9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0  9.0 10.2  9.0 NB. 9 comment
> > > >  1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 NB. 10 LF
> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2  1.2 NB. 11 {
> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2  1.2 10.2 14.0 NB. 12 }
> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 NB. 13 {{
> > > >  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 NB. 14 }}
> > > > )
> > > >
> > > > (Note that I haven't coerced this state table to integer form --
> > > > floats and integers occupy the same space on 64 bit systems, and the
> > > > model doesn't really care about representation.)
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Raul
> > > >
> > > > On Sun, Nov 8, 2020 at 10:53 AM Henry Rich wrote:
> > > > >
> > > > > I was talking about
> > > > >
> > > > >     ;: LF,'.'
> > > > > +-+-+
> > > > > | |.|
> > > > > +-+-+
> > > > >
> > > > > Henry Rich
> > > > >
> > > > > On 11/8/2020 8:38 AM, Raul Miller wrote:
> > > > > > I tested for that case:
> > > > > >
> > > > > >     #;:'NB.',LF,LF
> > > > > > 3
> > > > > >    #(0;sj;mj) sq 'NB.',LF,LF
> > > > > > 3
> > > > > >     #(0;sj;mj) sq 'NB.',LF,LF,LF
> > > > > > 4
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > >
> > > > >
> > > > > ----------------------------------------------------------------------
> > > > > For information about J forums see http://www.jsoftware.com/forums.htm