Here's an updated version that also keeps UTF-8 character sequences
inside token boundaries (instead of splitting them into multiple
tokens). I originally posted this to the jbeta forum, but it's really
a programming topic and probably belongs here.
mj=: 256$0 NB. X other
mj=: 1 (9,a.i.' ')}mj NB. S space and tab
mj=: 2 (,(a.i.'Aa')+/i.26)}mj NB. A A-Z a-z excluding N B
mj=: 3 (a.i.'N')}mj NB. N the letter N
mj=: 4 (a.i.'B')}mj NB. B the letter B
mj=: 5 (a.i.'0123456789_')}mj NB. 9 digits and _
mj=: 6 (a.i.'.')}mj NB. . the decimal point
mj=: 7 (a.i.':')}mj NB. : the colon
mj=: 8 (a.i.'''')}mj NB. Q quote
mj=: 9 (a.i.'{')}mj NB. { the left curly brace
mj=:10 (10)} mj NB. LF
mj=:11 (a.i.'}')}mj NB. } the right curly brace
mj=:12 (192+i.64)}mj NB. U utf-8 octet prefix
mj=:13 (128+i.64)}mj NB. V utf-8 octet suffix
sj=: 0 10#:10*}.".;._2(0 :0)
' X S A N B 9 . : Q { LF } U V']0
1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 15.1 16.1 NB. 0 space
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 1 other
1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 2 alp/num
1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 3 N
1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 4 NB
9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 5 NB.
1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 15.2 16.2 NB. 6 num
7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0 7.0 7.0 7.0 7.0 7.0 NB. 7 '
1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 15.2 16.2 NB. 8 ''
9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 9 comment
1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 15.2 16.2 NB. 10 LF
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2 1.2 15.2 16.2 NB. 11 {
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 1.2 10.2 14.0 15.2 16.2 NB. 12 }
1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 13 {{
1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 14 }}
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.0 NB. 16 utf-8
)
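
To illustrate what the two new character classes do, here is a minimal sketch that builds only the U and V entries and classifies the bytes of a short string. The names cls and bytes are made up for this example and are not part of the tokenizer itself.

```j
NB. Sketch only: cls holds just the two UTF-8 classes from mj above.
cls=: 256$0
cls=: 12 (192+i.64)}cls      NB. U: bytes 192..255 (start of a multi-byte sequence)
cls=: 13 (128+i.64)}cls      NB. V: bytes 128..191 (continuation bytes)
bytes=: 195 169 { a.         NB. the two UTF-8 bytes of e-acute (U+00E9)
cls {~ a. i. 'a',bytes,'b'   NB. 0 12 13 0
```

With the full mj the two ASCII letters would map to class 2 rather than 0. The point is only that the lead byte lands in column U and the trailing byte in column V, which is what lets states 15 and 16 keep them in one token.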
As I noted in the beta forum, increasing the complexity of the state
table adds both rows and columns: when each character type needs its
own handling, the table size grows in proportion to, or faster than,
the square of the number of distinct character types. So it's good to
keep this thing as simple as possible.
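
As a rough illustration of that growth, the arithmetic below compares the cell counts of the two tables in this thread: the earlier one has 15 states and 12 character classes, and this one has 17 states and 14 classes. The names old and new are just labels for this sketch.

```j
old=: 15 12        NB. rows (states) and columns (classes) before the UTF-8 change
new=: 17 14        NB. after adding classes U and V and states 15 and 16
(*/old),(*/new)    NB. 180 238
```

Each cell holds a next-state number and an operation code, so two extra classes and two extra states cost 58 more cells here.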
Also, I've not tested this extensively, and it's possible I'll need to
make further changes (let me know if you spot any problems).
That said... note also that I have *not* implemented the unicode
guideline which might suggest that the tokenizer should throw an error
on malformed utf-8 sequences. That would require several more rows and
columns to achieve the recommended inconvenience. This would also
introduce email line wrap, because the state table would become that
fat. (I'll attach a copy here as a .txt file, to see if an earlier
suggestion -- that the forum would preserve .txt attachments -- might
be a way of working around that issue. I suspect not, but it's easy
enough to test...)
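
For anyone who does want a check for malformed sequences without widening the state table, one option is a separate pre-pass over the bytes. The sketch below is only that: the verb name wellformed is hypothetical, it is not the state-table approach described above, and it checks structure only (each lead byte followed by the right number of continuation bytes). It does not reject overlong encodings or surrogates.

```j
NB. Sketch only: structural UTF-8 check done outside the tokenizer.
wellformed=: 3 : 0
 n=. a. i. y
 isc=. (n >: 128) *. n < 192                NB. continuation bytes 128..191
 k=. (n >: 192) + (n >: 224) + n >: 240     NB. continuation bytes expected after each lead
 want=. (_1 |.!.0 k >: 1) +. (_2 |.!.0 k >: 2) +. _3 |.!.0 k >: 3
 (-. +./ n >: 248) *. (want -: isc) *. (+/ k) = +/ isc
)

wellformed 'a',(195 169{a.),'b'   NB. lead byte plus one continuation byte
wellformed 'a',(169{a.),'b'       NB. stray continuation byte
wellformed 'a',(195{a.)           NB. truncated sequence
```

The result is 1 when every continuation byte is accounted for by a preceding lead byte and no sequence is cut short, and 0 otherwise.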
This has *not* been implemented in the current jbeta as the ;: monad.
I am not sure it should be, since J the language is based on ASCII,
not Unicode -- it's just convenient that Unicode supports an ASCII
subset.
Still... we often do have reason to work with utf-8.
Thanks,
--
Raul
On Sun, Nov 8, 2020 at 11:01 AM Raul Miller wrote:
>
> Oh, oops, I should have spotted that. Thanks.
>
> Updated state table:
>
> mj=: 256$0 NB. X other
> mj=: 1 (9,a.i.' ')}mj NB. S space and tab
> mj=: 2 (,(a.i.'Aa')+/i.26)}mj NB. A A-Z a-z excluding N B
> mj=: 3 (a.i.'N')}mj NB. N the letter N
> mj=: 4 (a.i.'B')}mj NB. B the letter B
> mj=: 5 (a.i.'0123456789_')}mj NB. 9 digits and _
> mj=: 6 (a.i.'.')}mj NB. . the decimal point
> mj=: 7 (a.i.':')}mj NB. : the colon
> mj=: 8 (a.i.'''')}mj NB. Q quote
> mj=: 9 (a.i.'{')}mj NB. { the left curly brace
> mj=:10 (10)} mj NB. LF
> mj=:11 (a.i.'}')}mj NB. } the right curly brace
>
> sj=: 0 10#:10*}.".;._2(0 :0)
> ' X S A N B 9 . : Q { LF }']0
> 1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 NB. 0 space
> 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 NB. 1 other
> 1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 2 alp/num
> 1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 3 N
> 1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 NB. 4 NB
> 9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0 9.0 10.2 9.0 NB. 5 NB.
> 1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 NB. 6 num
> 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0 7.0 7.0 7.0 NB. 7 '
> 1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 NB. 8 ''
> 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 10.2 9.0 NB. 9 comment
> 1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 NB. 10 LF
> 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2 1.2 NB. 11 {
> 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 1.2 10.2 14.0 NB. 12 }
> 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 NB. 13 {{
> 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 NB. 14 }}
> )
>
> (Note that I haven't coerced this state table to integer form --
> floats and integers occupy the same space on 64 bit systems, and the
> model doesn't really care about representation.)
>
> Thanks,
>
> --
> Raul
>
> On Sun, Nov 8, 2020 at 10:53 AM Henry Rich wrote:
> >
> > I was talking about
> >
> > ;: LF,'.'
> > +-+-+
> > | |.|
> > +-+-+
> >
> > Henry Rich
> >
> > On 11/8/2020 8:38 AM, Raul Miller wrote:
> > > I tested for that case:
> > >
> > > #;:'NB.',LF,LF
> > > 3
> > > #(0;sj;mj) sq 'NB.',LF,LF
> > > 3
> > > #(0;sj;mj) sq 'NB.',LF,LF,LF
> > > 4
> > >
> > > Thanks,
> > >
> >
> >
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
mj=: 256$0 NB. X other
mj=: 1 (9,a.i.' ')}mj NB. S space and tab
mj=: 2 (,(a.i.'Aa')+/i.26)}mj NB. A A-Z a-z excluding N B
mj=: 3 (a.i.'N')}mj NB. N the letter N
mj=: 4 (a.i.'B')}mj NB. B the letter B
mj=: 5 (a.i.'0123456789_')}mj NB. 9 digits and _
mj=: 6 (a.i.'.')}mj NB. . the decimal point
mj=: 7 (a.i.':')}mj NB. : the colon
mj=: 8 (a.i.'''')}mj NB. Q quote
mj=: 9 (a.i.'{')}mj NB. { the left curly brace
mj=:10 (10)} mj NB. LF
mj=:11 (a.i.'}')}mj NB. } the right curly brace
mj=:12 (192+i.64)}mj NB. U utf-8 octet prefix
mj=:13 (128+i.64)}mj NB. V utf-8 octet suffix
sj=: 0 10#:10*}.".;._2(0 :0)
' X S A N B 9 . : Q { LF } U V']0
1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 15.1 16.1 NB. 0 space
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 1 other
1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 2 alp/num
1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 3 N
1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 4 NB
9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 5 NB.
1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 15.2 16.2 NB. 6 num
7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0 7.0 7.0 7.0 7.0 7.0 NB. 7 '
1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 15.2 16.2 NB. 8 ''
9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 9 comment
1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 15.2 16.2 NB. 10 LF
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2 1.2 15.2 16.2 NB. 11 {
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 1.2 10.2 14.0 15.2 16.2 NB. 12 }
1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 13 {{
1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 14 }}
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 16 utf-8
)