Here's a version of that implementation with the comment line in
sj fixed and with support for utf-8 tokenization. (Note that this will
also tokenize malformed utf-8 -- hypothetically speaking, it may be
desirable to reject invalid utf-8 content. If you really want that, I
can show you how to achieve it.)

I have only minimally tested this -- if you notice any malfunctions,
let me know.

Beware potential email line wrap...

mj=: 256$0                     NB. X other
mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
mj=: 3 (a.i.'N')}mj            NB. N the letter N
mj=: 4 (a.i.'B')}mj            NB. B the letter B
mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
mj=: 6 (a.i.'.')}mj            NB. . the decimal point
mj=: 7 (a.i.':')}mj            NB. : the colon
mj=: 8 (a.i.'''')}mj           NB. Q quote
mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
mj=:10 (10)} mj                NB. LF
mj=:11 (a.i.'}')}mj            NB. } the right curly brace
mj=:12 (192+i.64)}mj           NB. U utf-8 octet prefix
mj=:13 (128+i.64)}mj           NB. V utf-8 octet suffix
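
As a quick sanity check on the last two lines, here is a minimal sketch
using a scratch table m (a hypothetical name, not part of the code
above) that carries only the two utf-8 classes. It looks up the two
octets of π (207 128) followed by the letter a:

```j
m=: 256$0                        NB. scratch table (hypothetical): every octet starts as class 0
m=: 12 (192+i.64)}m              NB. U: octets 192..255, which can start a multi-octet sequence
m=: 13 (128+i.64)}m              NB. V: octets 128..191, which continue a sequence
(a. i. 207 128 97 { a.) { m      NB. classes for the octets of π, then 'a': 12 13 0
```

In the real mj the letter a gets class 2 rather than 0; the point is
only that a prefix octet and a suffix octet land in different columns
of sj.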

sj=: 0 10#:10*}.".;._2(0 :0)
' X   S   A   N   B   9   .   :   Q    {    LF   }   U    V']0
 1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 15.1 16.1 NB. 0 space
 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 1 other
 1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 2 alp/num
 1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 3 N
 1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 4 NB
 9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0  9.0 10.2  9.0  9.0  9.0 NB. 5 NB.
 1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 15.2 16.2 NB. 6 num
 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0  7.0  7.0  7.0  7.0  7.0 NB. 7 '
 1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 15.2 16.2 NB. 8 ''
 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0  9.0 10.2  9.0  9.0  9.0 NB. 9 comment
 1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 15.2 16.2 NB. 10 LF
 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2  1.2 15.2 16.2 NB. 11 {
 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2  1.2 10.2 14.0 15.2 16.2 NB. 12 }
 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 15.2 16.2 NB. 13 {{
 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 15.2 16.2 NB. 14 }}
 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.0 NB. 16 utf-8
)
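
In case the 0 10#:10* part looks cryptic: each table entry s.c packs a
next state s and an action code c, and that phrase unpacks them into
pairs. A small sketch, using three entries taken from the table above:

```j
NB. 15.1 -> state 15, code 1;  1.2 -> state 1, code 2;  0.3 -> state 0, code 3
0 10 #: 10 * 15.1 1.2 0.3
```

This should give a 3 by 2 table whose rows are 15 1, then 1 2, then
0 3. Going by the ;: documentation as I understand it, code 1 marks
the start of a word, code 2 emits the word collected so far and starts
a new one, and code 3 emits the word without starting another.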

Notice also that adding to the tokenizer increases both the number of
rows and the number of columns in sj (here, two new states, 15 and 16,
and two new character classes, U and V). So it's highly desirable to
keep these things simple.

Good luck,

--
Raul




On Fri, Nov 20, 2020 at 10:51 AM Don Guinn wrote:
>
> You are correct. They are not. I took their definitions from one of the
> beta forum messages concerning the implementation of Direct Definition. I
> am showing them below.
>
>
> mj=: 256$0                     NB. X other
> mj=: 1 (9,a.i.' ')}mj          NB. S space and tab
> mj=: 2 (,(a.i.'Aa')+/i.26)}mj  NB. A A-Z a-z excluding N B
> mj=: 3 (a.i.'N')}mj            NB. N the letter N
> mj=: 4 (a.i.'B')}mj            NB. B the letter B
> mj=: 5 (a.i.'0123456789_')}mj  NB. 9 digits and _
> mj=: 6 (a.i.'.')}mj            NB. . the decimal point
> mj=: 7 (a.i.':')}mj            NB. : the colon
> mj=: 8 (a.i.'''')}mj           NB. Q quote
> mj=: 9 (a.i.'{')}mj            NB. { the left curly brace
> mj=:10 (10)}mj                 NB. LF and CR
> mj=:11 (a.i.'}')}mj            NB. } the right curly brace
>
> NB. mj=: 2 (128+i.128)        }mj
>
> sj=: 0 10#:10*}.".;._2(0 :0)
> ' X   S   A   N   B   9   .   :   Q    {    }   LF ']0
>  1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 NB. 0 space
>  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 NB. 1 other
>  1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 2 alp/num
>  1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 3 N
>  1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 NB. 4 NB
>  9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0  9.0 10.2  9.0 NB. 5 NB.
>  1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 NB. 6 num
>  7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0  7.0  7.0  7.0 NB. 7 '
>  1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 NB. 8 ''
>  9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0  9.0 10.2  9.0 NB. 9 comment
>  1.2 0.2 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 NB. 10 LF
>  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2  1.2 NB. 11 {
>  1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2  1.2 10.2 14.0 NB. 12 }
>  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 NB. 13 {{
>  1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2  1.2 10.2  1.2 NB. 14 }}
> )
>
> On Fri, Nov 20, 2020 at 8:32 AM Henry Rich wrote:
>
> > What are sj/mj?  They are not part of the JE, right?  I think you are
> > saying that (x ;: y) can be used to parse well-formed UTF-8 with a small
> > change to the input translation.
> >
> > Henry Rich
> >
> > On 11/20/2020 10:17 AM, Don Guinn wrote:
> > > Sequential machine does not do well when dealing with UTF-8. It works well
> > > within comments (NB.) and literals ('⌹'), but outside those cases it makes
> > > a mess.
> > >
> > > Given some of the changes to ;: in the beta it seems that it would be
> > > desirable to have UTF-8 handled outside of comments and literals as handled
> > > in them. There is a simple change that can be made to mj that accomplishes
> > > that. Simply assigning the value 2 for letters for the range 128+i.128
> > > accomplishes that making UTF-8 like letters a-z and A-Z.
> > >
> > > I don't know where J will be going with UTF-8 and other unicode handling,
> > > but this seems to me to help in the handling of UTF-8 in the sequential
> > > machine.
> > >
> > > Example shown below:
> > >
> > > NB. Definitions for sj and mj not shown but as
> > > NB. the current beta.
> > > NB. A noun to show the handling of UTF-8 in ;:
> > > test=:{{)n
> > > The symbol for the Euro is ₠
> > > Other symbols like π show up also
> > > How about ⌹ in APL
> > > Common expressions like 'H₂O' for water
> > > Common expressions like H₂O for water
> > > }}
> > >
> > > NB. How ;: in beta handles it
> > > ,.<;.2(0;sj;mj);:test
> > >
> > > +-----------------------------------------------+
> > >
> > > |+---+------+---+---+----+--+-+-+-+-+ |
> > >
> > > ||The|symbol|for|the|Euro|is|â|‚| | | |
> > >
> > > |+---+------+---+---+----+--+-+-+-+-+ |
> > >
> > > +-----------------------------------------------+
> > >
> > > |+-----+-------+----+-+-+----+--+----+-+ |
> > >
> > > ||Other|symools|like|Ï|€|show|up|also| | |
> > >
> > > |+-----+-------+----+-+-+----+--+----+-+ |
> > >
> > > +-----------------------------------------------+
> > >
> > > |+---+-----+-+-+-+--+---+-+ |
> > >
> > > ||How|about|â|Œ|¹|in|APL| | |
> > >
> > > |+---+-----+-+-+-+--+---+-+ |
> > >
> > > +-----------------------------------------------+
> > >
> > > |+------+-----------+----+-----+---+-----+-+ |
> > >
> > > ||Common|expressions|like|'H₂O'|for|water| | |
> > >
> > > |+------+-----------+----+-----+---+-----+-+ |
> > >
> > > +-----------------------------------------------+
> > >
> > > |+------+-----------+----+-+-+-+-+-+---+-----+-+|
> > >
> > > ||Common|expressions|like|H|â|‚|‚|O|for|water| ||
> > >
> > > |+------+-----------+----+-+-+-+-+-+---+-----+-+|
> > >
> > > +-----------------------------------------------+
> > >
> > > NB. Assigning UTF8 as character
> > > mj=: 2 (128+i.128)}mj
> > >
> > > NB. How UTF-8 is now handled
> > > ,.<;.2(0;sj;mj);:test
> > >
> > > +-------------------------------------------+
> > >
> > > |+---+------+---+---+----+--+-+-+ |
> > >
> > > ||The|symbol|for|the|Euro|is|₠| | |
> > >
> > > |+---+------+---+---+----+--+-+-+ |
> > >
> > > +-------------------------------------------+
> > >
> > > |+-----+-------+----+-+----+--+----+-+ |
> > >
> > > ||Other|symools|like|π|show|up|also| | |
> > >
> > > |+-----+-------+----+-+----+--+----+-+ |
> > >
> > > +-------------------------------------------+
> > >
> > > |+---+-----+-+--+---+-+ |
> > >
> > > ||How|about|⌹|in|APL| | |
> > >
> > > |+---+-----+-+--+---+-+ |
> > >
> > > +-------------------------------------------+
> > >
> > > |+------+-----------+----+-----+---+-----+-+|
> > >
> > > ||Common|expressions|like|'H₂O'|for|water| ||
> > >
> > > |+------+-----------+----+-----+---+-----+-+|
> > >
> > > +-------------------------------------------+
> > >
> > > |+------+-----------+----+---+---+-----+-+ |
> > >
> > > ||Common|expressions|like|H₂O|for|water| | |
> > >
> > > |+------+-----------+----+---+---+-----+-+ |
> > >
> > > +-------------------------------------------+
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm