I think we need to be a little clearer about what you're trying to
achieve when you say "catch the utf-8 characters".
Your change to mj made all utf-8 *octets* be treated as alphabetic.
That's... an approach, certainly.
Meanwhile, according to https://en.wikipedia.org/wiki/UTF-8#Encoding
there are three different kinds of multi-byte utf-8 start (for two-,
three- and four-byte sequences). I don't know what you are
referring to when you say that there are seven possible U starts.
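
For concreteness, a minimal sketch of those three starts as byte ranges,
following the layout in the Wikipedia table. The names lead2, lead3, lead4
and cont are illustrative only and do not appear in the attachment. Some
values inside these ranges (192, 193 and 245 through 247) never occur in
valid utf-8, which this sketch ignores.

```j
NB. illustrative names, not part of the attached definitions
lead2=: 192+i.32    NB. 110xxxxx: one continuation byte follows
lead3=: 224+i.16    NB. 1110xxxx: two continuation bytes follow
lead4=: 240+i.8     NB. 11110xxx: three continuation bytes follow
cont=:  128+i.64    NB. 10xxxxxx: continuation byte
(lead2,lead3,lead4) -: 192+i.56   NB. 1: the three ranges cover 192..247 with no gaps
```

A state table that counts continuation bytes would presumably give each of
the three ranges its own character class in mj, instead of the single U class.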
But there is a way for ;: to report errors in the stream. See attached
for a demonstration (I made quoted strings be an error.)
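
As for what a programmer would do with such an error: one option is to trap
it. A minimal sketch, assuming the error raised inside ;: can be caught like
any other J error (trytok is a hypothetical name, not something in the
attachment):

```j
NB. trytok is a hypothetical helper, not part of the attachment.
NB. x is the (f;s;m) argument for ;: and y is the text to tokenize.
trytok=: 4 : 0
try. x ;: y
catch. <'rejected: ' , 13!:12 ''   NB. 13!:12 '' gives the text of the most recent error
end.
)
```

With the attached definitions loaded, this would be used as
(0;SJ;mj) trytok example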
Thanks,
--
Raul
On Sat, Nov 21, 2020 at 1:05 PM Don Guinn <[email protected]> wrote:
>
> I was not precise in my earlier response. I should have said that detecting
> the wrong number of UTF-8 continuation bytes would be difficult in the
> sequential machine as you would probably need to detect seven possible U
> starts and seven U continues to properly check. That would make a very
> large JS, although only checking for up to three continuation bytes would
> probably be sufficient.
>
> And I should have said that your SJ and MJ did not catch the UTF-8
> characters in my test. Each byte was still treated as "other". Your
> approach is better as it allows the possibility of treating UTF-8 as
> "other", but would contain more than one byte - the entire UTF-8 sequence.
> I haven't looked at your SJ yet to try to find out why it doesn't catch the
> UTF-8.
>
> But what does one do if there is an error in the data? ;: returns errors if
> SJ and MJ are not constructed properly, but there is no way to report an
> error for bad data. And if there were, what would a programmer do about it?
> So is it necessary to detect bad UTF-8 sequences? Probably not. And for now
> at least treating all UTF-8 like alp/num would probably be what one would
> want. Let the display of the data show the errors.
>
> On Sat, Nov 21, 2020 at 9:12 AM Raul Miller <[email protected]> wrote:
>
> > (That said, it's also worth noting that the state table I presented
> > here doesn't give an error for unbalanced quotes, either. So if you
> > want errors to be thrown, you should probably be updating the state
> > table to force an error there, also.)
> >
> > (And, I should note that I haven't taken a look at what the resulting
> > errors look like. So I don't know how informative the resulting error
> > messages would be...)
> >
> > Thanks again,
> >
> > --
> > Raul
> >
> > On Sat, Nov 21, 2020 at 11:04 AM Raul Miller <[email protected]> wrote:
> > >
> > > Yes, I was surprised (and pleased) that the text file came through ok.
> > >
> > > As for throwing errors on malformed utf-8 sequences, here's how that
> > > could be implemented:
> > >
> > > (1) We introduce an "error row" which uses operation 2 for all
> > > character classes (and, for consistency, identifies the next row as
> > > itself -- this is mostly so that improper use of the error row would
> > > be relatively obvious). (Operation 2 emits a token, but the important
> > > thing here is that it throws an error if j=-1.)
> > >
> > > (2) For each row which is part of a partially complete utf-8
> > > character, any appearance of any character class which is not a utf-8
> > > suffix character would use operation 3 and would identify the next row
> > > as the error row. (Operation 3 emits a token, but the important thing
> > > here is that it sets j=-1.)
> > >
> > > (3) Each of the different utf-8 prefixes would lead to a different row
> > > in the state table. For example, the character class containing
> > > character 224 would get a beginning of token row (which gets the error
> > > row treatment and) which leads to a row that expects a utf-8 suffix
> > > (which gets the error row treatment) followed by a second utf-8 suffix
> > > (which follows the pattern set by row 1 of the state table).
> > >
> > > Hopefully that makes sense.
> > >
> > > Note that the final token in a string could not trigger an error.
> > > That's a limitation of the engine and corresponds approximately to how
> > > utf-8 must be treated when a low level buffer boundary splits a utf-8
> > > character.
> > >
> > > > Anyway, the point is that the sequential machine can support the sort
> > > > of "count a small number of steps" logic which is needed here. The
> > > > difficulty is more that the machine stops when it reaches the end of
> > > > the string. If that's a concern, this could be handled by
> > > > appending a linefeed character to the end of the string before
> > > > processing and then removing a final linefeed character from the last
> > > > token after tokenization.
> > >
> > > Again, I hope this makes sense...
> > >
> > > Thanks,
> > >
> > > --
> > > Raul
> > >
> > >
> > >
> > >
> > > On Sat, Nov 21, 2020 at 9:22 AM Don Guinn <[email protected]> wrote:
> > > >
> > > > The text file worked great!
> > > >
> > > > As to the UTF-8 codes. What is important is to avoid splitting the start
> > > > bytes from the continuation bytes. Validating the UTF-8 codes is a
> > > > difficult task. The start byte includes the number of continuation bytes to
> > > > follow. It would be an error if the number of continuation bytes didn't
> > > > agree.
> > > >
> > > > I tried my test on your definitions and it failed. Attached is a text file
> > > > with your definitions and my test following.
> > > >
> > > > Well, I can't view the attachment. I don't know if it's there or not. Just
> > > > in case, here is my test.
> > > >
> > > >
> > > > NB. A noun to show the handling of UTF-8 in ;:
> > > > test=:{{)n
> > > > The symbol for the Euro is ₠
> > > > Other symbols like π show up also
> > > > How about ⌹ in APL NB. ⌹
> > > > Common expressions like 'H₂O' for water
> > > > Common expressions like H₂O for water
> > > > }}
> > > >
> > > > NB. How ;: this sj and mj handles it
> > > > ,.<;.2(0;sj;mj);:test
> > > >
> > > > NB. Assigning UTF8 as character
> > > > mj=: 2 (128+i.128)}mj
> > > >
> > > > NB. How UTF-8 is now handled
> > > > ,.<;.2(0;sj;mj);:test
> > > >
> > > > On Fri, Nov 20, 2020 at 2:46 PM Raul Miller <[email protected]> wrote:
> > > >
> > > > > Here's an updated version, which also retains utf-8 character
> > > > > sequences within token boundaries (instead of splitting them up into
> > > > > multiple tokens). I had originally posted this to the jbeta forum, but
> > > > > it's really a programming topic, and probably belongs here.
> > > > >
> > > > > mj=: 256$0 NB. X other
> > > > > mj=: 1 (9,a.i.' ')}mj NB. S space and tab
> > > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj NB. A A-Z a-z excluding N B
> > > > > mj=: 3 (a.i.'N')}mj NB. N the letter N
> > > > > mj=: 4 (a.i.'B')}mj NB. B the letter B
> > > > > mj=: 5 (a.i.'0123456789_')}mj NB. 9 digits and _
> > > > > mj=: 6 (a.i.'.')}mj NB. . the decimal point
> > > > > mj=: 7 (a.i.':')}mj NB. : the colon
> > > > > mj=: 8 (a.i.'''')}mj NB. Q quote
> > > > > mj=: 9 (a.i.'{')}mj NB. { the left curly brace
> > > > > mj=:10 (10)} mj NB. LF
> > > > > mj=:11 (a.i.'}')}mj NB. } the right curly brace
> > > > > mj=:12 (192+i.64)}mj NB. U utf-8 octet prefix
> > > > > mj=:13 (128+i.64)}mj NB. V utf-8 octet suffix
> > > > >
> > > > > sj=: 0 10#:10*}.".;._2(0 :0)
> > > > > ' X S A N B 9 . : Q { LF } U V']0
> > > > > 1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 15.1 16.1 NB. 0 space
> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 1 other
> > > > > 1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 2 alp/num
> > > > > 1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 3 N
> > > > > 1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 4 NB
> > > > > 9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 5 NB.
> > > > > 1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 15.2 16.2 NB. 6 num
> > > > > 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0 7.0 7.0 7.0 7.0 7.0 NB. 7 '
> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 15.2 16.2 NB. 8 ''
> > > > > 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 9 comment
> > > > > 1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 15.2 16.2 NB. 10 LF
> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2 1.2 15.2 16.2 NB. 11 {
> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 1.2 10.2 14.0 15.2 16.2 NB. 12 }
> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 13 {{
> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 14 }}
> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.0 NB. 16 utf-8
> > > > > )
> > > > >
> > > > > As I noted in the beta forum -- increasing the complexity of the state
> > > > > table adds both rows and columns (size grows proportional to or faster
> > > > > than the square of the number of distinct character types when each
> > > > > character type requires a distinct handling). So it's good to keep
> > > > > this thing as simple as possible.
> > > > >
> > > > > Also, I've not tested this extensively, and it's possible I'll need to
> > > > > make further changes (let me know if you spot any problems).
> > > > >
> > > > > That said... note also that I have *not* implemented the unicode
> > > > > guideline which might suggest that the tokenizer should throw an error
> > > > > on malformed utf-8 sequences. That would require several more rows and
> > > > > columns to achieve the recommended inconvenience. This would also
> > > > > introduce email line wrap, because the state table would become that
> > > > > fat. (I'll attach a copy here as a .txt file, to see if an earlier
> > > > > suggestion -- that the forum would preserve .txt attachments -- might
> > > > > be a way of working around that issue. I suspect not, but it's easy
> > > > > enough to test...)
> > > > >
> > > > > This has *not* been implemented in the current jbeta as the ;: monad.
> > > > > I am not sure if it should, since J the language is based on ascii,
> > > > > not unicode -- it's just convenient that unicode supports an ascii
> > > > > subset.
> > > > >
> > > > > Still... we often do have reason to work with utf-8.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > --
> > > > > Raul
> > > > >
> > > > > On Sun, Nov 8, 2020 at 11:01 AM Raul Miller <[email protected]> wrote:
> > > > > >
> > > > > > Oh, oops, I should have spotted that. Thanks.
> > > > > >
> > > > > > Updated state table:
> > > > > >
> > > > > > mj=: 256$0 NB. X other
> > > > > > mj=: 1 (9,a.i.' ')}mj NB. S space and tab
> > > > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj NB. A A-Z a-z excluding N B
> > > > > > mj=: 3 (a.i.'N')}mj NB. N the letter N
> > > > > > mj=: 4 (a.i.'B')}mj NB. B the letter B
> > > > > > mj=: 5 (a.i.'0123456789_')}mj NB. 9 digits and _
> > > > > > mj=: 6 (a.i.'.')}mj NB. . the decimal point
> > > > > > mj=: 7 (a.i.':')}mj NB. : the colon
> > > > > > mj=: 8 (a.i.'''')}mj NB. Q quote
> > > > > > mj=: 9 (a.i.'{')}mj NB. { the left curly brace
> > > > > > mj=:10 (10)} mj NB. LF
> > > > > > mj=:11 (a.i.'}')}mj NB. } the right curly brace
> > > > > >
> > > > > > sj=: 0 10#:10*}.".;._2(0 :0)
> > > > > > ' X S A N B 9 . : Q { LF }']0
> > > > > > 1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 NB. 0 space
> > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 NB. 1 other
> > > > > > 1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 2 alp/num
> > > > > > 1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 3 N
> > > > > > 1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 NB. 4 NB
> > > > > > 9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0 9.0 10.2 9.0 NB. 5 NB.
> > > > > > 1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 NB. 6 num
> > > > > > 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0 7.0 7.0 7.0 NB. 7 '
> > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 NB. 8 ''
> > > > > > 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 10.2 9.0 NB. 9 comment
> > > > > > 1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 NB. 10 LF
> > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2 1.2 NB. 11 {
> > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 1.2 10.2 14.0 NB. 12 }
> > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 NB. 13 {{
> > > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 NB. 14 }}
> > > > > > )
> > > > > >
> > > > > > (Note that I haven't coerced this state table to integer form --
> > > > > > floats and integers occupy the same space on 64 bit systems, and the
> > > > > > model doesn't really care about representation.)
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > --
> > > > > > Raul
> > > > > >
> > > > > > On Sun, Nov 8, 2020 at 10:53 AM Henry Rich <[email protected]> wrote:
> > > > > > >
> > > > > > > I was talking about
> > > > > > >
> > > > > > > ;: LF,'.'
> > > > > > > +-+-+
> > > > > > > | |.|
> > > > > > > +-+-+
> > > > > > >
> > > > > > > Henry Rich
> > > > > > >
> > > > > > > On 11/8/2020 8:38 AM, Raul Miller wrote:
> > > > > > > > I tested for that case:
> > > > > > > >
> > > > > > > > #;:'NB.',LF,LF
> > > > > > > > 3
> > > > > > > > #(0;sj;mj) sq 'NB.',LF,LF
> > > > > > > > 3
> > > > > > > > #(0;sj;mj) sq 'NB.',LF,LF,LF
> > > > > > > > 4
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > ----------------------------------------------------------------------
> > > > > > > For information about J forums see
> > http://www.jsoftware.com/forums.htm
> > > > >
> > ----------------------------------------------------------------------
> > > > > For information about J forums see
> > http://www.jsoftware.com/forums.htm
> > > > >
> > > > ----------------------------------------------------------------------
> > > > For information about J forums see http://www.jsoftware.com/forums.htm
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> >
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
mj=: 256$0 NB. X other
mj=: 1 (9,a.i.' ')}mj NB. S space and tab
mj=: 2 (,(a.i.'Aa')+/i.26)}mj NB. A A-Z a-z excluding N B
mj=: 3 (a.i.'N')}mj NB. N the letter N
mj=: 4 (a.i.'B')}mj NB. B the letter B
mj=: 5 (a.i.'0123456789_')}mj NB. 9 digits and _
mj=: 6 (a.i.'.')}mj NB. . the decimal point
mj=: 7 (a.i.':')}mj NB. : the colon
mj=: 8 (a.i.'''')}mj NB. Q quote
mj=: 9 (a.i.'{')}mj NB. { the left curly brace
mj=:10 (10)} mj NB. LF
mj=:11 (a.i.'}')}mj NB. } the right curly brace
mj=:12 (192+i.64)}mj NB. U utf-8 octet prefix
mj=:13 (128+i.64)}mj NB. V utf-8 octet suffix
sj=: 0 10#:10*}.".;._2(0 :0)
' X S A N B 9 . : Q { LF } U V']0
1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 15.1 16.1 NB. 0 space
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 1 other
1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 2 alp/num
1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 3 N
1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 4 NB
9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 5 NB.
1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 15.2 16.2 NB. 6 num
7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0 7.0 7.0 7.0 7.0 7.0 NB. 7 '
1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 15.2 16.2 NB. 8 ''
9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 9 comment
1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 15.2 16.2 NB. 10 LF
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2 1.2 15.2 16.2 NB. 11 {
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 1.2 10.2 14.0 15.2 16.2 NB. 12 }
1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 13 {{
1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 14 }}
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 16 utf-8
)
example=: 'asdf''jkl'''
echo (0;sj;mj) ;: example
SJ=:3 (<"1]7,.(i.14),.1)} sj NB. row 7 (inside a quoted string): operation 3 in every column
echo (0;SJ;mj) ;: 'this is a test'
echo (0;SJ;mj) ;: example
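
Steps (1) and (2) of the error-row recipe quoted earlier in this thread
might be sketched as follows. This is only an illustration under stated
assumptions: witherr is a hypothetical name, the table is assumed to be
shaped like sj above (17 rows, 14 character classes, V in the last column,
row 15 being the "partial" row), and step (3), with separate rows for each
kind of utf-8 prefix, is not covered.

```j
NB. witherr is a hypothetical name; y is a state table shaped like sj above
witherr=: 3 : 0
err=. #y                          NB. index that the new error row will get
t=. y , 14 2 $ err,2              NB. (1) error row: next row is err itself, operation 2
(13 2 $ err,3) (<15;i.13)} t      NB. (2) row 15, every class except V: operation 3, then err
)
```

It would be applied as SJe=: witherr sj, leaving sj itself unchanged.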