Okay. I changed your two UTF-8 rows of sj (states 15 and 16) to:
1.1 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
1.0 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.0 NB. 16 utf-8
Notice that this treats each UTF-8 sequence as "other" instead of alp/num,
but the box still contains the entire UTF-8 sequence. With your
definition, the subscript 2 in H₂O shows up in my test as a separate box,
as an "other". With my change to mj (assigning bytes 128-255 to the
alp/num class), the subscript 2 is treated as alp/num and H₂O stays in
one box.
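
For anyone reading along, here is a small sketch of how the entries in those rows are read. It assumes the same decoding expression used in the sj definition quoted below (0 10#:10*...); the name decoded is only for illustration.

```j
NB. Each table entry is written as state.op; the quoted sj definition
NB. multiplies by 10 and splits off the last digit.
decoded=: 0 10 #: 10 * 15.2 16.0
NB. decoded is a 2-row table of (next state, operation):
NB.   15 2   next state 15, operation 2 (emit a token)
NB.   16 0   next state 16, operation 0 (no action)
```

So the 15.0 and 16.0 entries in the U and V columns keep extending the current token, while the .2 entries end it.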
NB. A noun to show the handling of UTF-8 in ;:
test=:{{)n
The symbol for the Euro is ₠
Other symols like π show up also
How about ⌹ in APL NB. ⌹
Common expressions like 'H₂O' for water
Common expressions like H₂O for water
}}
NB. How ;: in beta handles it
,.<;.2(0;sj;mj);:test
+-------------------------------------------+
|+---+------+---+---+----+--+-+-+           |
||The|symbol|for|the|Euro|is|₠| |           |
|+---+------+---+---+----+--+-+-+           |
+-------------------------------------------+
|+-----+------+----+-+----+--+----+-+       |
||Other|symols|like|π|show|up|also| |       |
|+-----+------+----+-+----+--+----+-+       |
+-------------------------------------------+
|+---+-----+-+--+---+-----+-+               |
||How|about|⌹|in|APL|NB. ⌹| |               |
|+---+-----+-+--+---+-----+-+               |
+-------------------------------------------+
|+------+-----------+----+-----+---+-----+-+|
||Common|expressions|like|'H₂O'|for|water| ||
|+------+-----------+----+-----+---+-----+-+|
+-------------------------------------------+
|+------+-----------+----+-+-+-+---+-----+-+|
||Common|expressions|like|H|₂|O|for|water| ||
|+------+-----------+----+-+-+-+---+-----+-+|
+-------------------------------------------+
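
The split of H₂O into three boxes follows from the bytes involved. A minimal sketch, assuming the text is UTF-8 encoded and using the U (192+i.64) and V (128+i.64) byte ranges from the mj quoted below; the names bytes, isU, isV and bits are only for illustration.

```j
bytes=: a. i. 'H₂O'           NB. 72 226 130 130 79
isU=: bytes e. 192 + i. 64    NB. 0 1 0 0 0  start byte of the sequence
isV=: bytes e. 128 + i. 64    NB. 0 0 1 1 0  continuation bytes
bits=: (8#2) #: 226 130       NB. 1 1 1 0 0 0 1 0 and 1 0 0 0 0 0 1 0
```

The leading 1 1 1 0 of the start byte says that two continuation bytes follow, which is the count a validating state table would have to track.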
NB. Assigning UTF8 as character
mj=: 2 (128+i.128)}mj
NB. How UTF-8 is now handled
,.<;.2(0;sj;mj);:test
+-------------------------------------------+
|+---+------+---+---+----+--+-+-+           |
||The|symbol|for|the|Euro|is|₠| |           |
|+---+------+---+---+----+--+-+-+           |
+-------------------------------------------+
|+-----+------+----+-+----+--+----+-+       |
||Other|symols|like|π|show|up|also| |       |
|+-----+------+----+-+----+--+----+-+       |
+-------------------------------------------+
|+---+-----+-+--+---+-----+-+               |
||How|about|⌹|in|APL|NB. ⌹| |               |
|+---+-----+-+--+---+-----+-+               |
+-------------------------------------------+
|+------+-----------+----+-----+---+-----+-+|
||Common|expressions|like|'H₂O'|for|water| ||
|+------+-----------+----+-----+---+-----+-+|
+-------------------------------------------+
|+------+-----------+----+---+---+-----+-+  |
||Common|expressions|like|H₂O|for|water| |  |
|+------+-----------+----+---+---+-----+-+  |
+-------------------------------------------+
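
To make the effect of the mj change concrete, here is a small sketch of the character classes seen by the machine before and after. It uses an abbreviated copy of the quoted mj with only the classes needed for this example; the names mj0, mj1, before and after are hypothetical.

```j
NB. abbreviated copy of the quoted mj: only the classes needed here
mj0=: 256$0                         NB. X other
mj0=: 2 (,(a.i.'Aa')+/i.26)} mj0    NB. A letters
mj0=: 12 (192+i.64)} mj0            NB. U utf-8 octet prefix
mj0=: 13 (128+i.64)} mj0            NB. V utf-8 octet suffix
before=: mj0 {~ a. i. 'H₂O'         NB. 2 12 13 13 2
mj1=: 2 (128+i.128)} mj0            NB. the change made above
after=: mj1 {~ a. i. 'H₂O'          NB. 2 2 2 2 2
```

With every byte in class 2, the alp/num row (2.0 in the A column) keeps extending the same word, so no token boundary falls inside H₂O.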
On Sat, Nov 21, 2020 at 11:05 AM Don Guinn <[email protected]> wrote:
> I was not precise in my earlier response. I should have said that
> detecting the wrong number of UTF-8 continuation bytes would be difficult
> in the sequential machine, as you would probably need to detect seven
> possible U starts and seven U continues to check properly. That would make
> a very large SJ, although only checking for up to three continuation bytes
> would probably be sufficient.
>
> And I should have said that your SJ and MJ did not catch the UTF-8
> characters in my test. Each byte was still treated as "other". Your
> approach is better as it allows the possibility of treating UTF-8 as
> "other", but would contain more than one byte - the entire UTF-8 sequence.
> I haven't looked at your SJ yet to try to find out why it doesn't catch the
> UTF-8.
>
> But what does one do if there is an error in the data? ;: returns errors
> if SJ and MJ are not constructed properly, but there is no way to report an
> error for bad data. And if there were, what would a programmer do about it?
> So is it necessary to detect bad UTF-8 sequences? Probably not. And for now
> at least treating all UTF-8 like alp/num would probably be what one would
> want. Let the display of the data show the errors.
>
> On Sat, Nov 21, 2020 at 9:12 AM Raul Miller <[email protected]> wrote:
>
>> (That said, it's also worth noting that the state table I presented
>> here doesn't give an error for unbalanced quotes, either. So if you
>> want errors to be thrown, you should probably be updating the state
>> table to force an error there, also.)
>>
>> (And, I should note that I haven't taken a look at what the resulting
>> errors look like. So I don't know how informative the resulting error
>> messages would be...)
>>
>> Thanks again,
>>
>> --
>> Raul
>>
>> On Sat, Nov 21, 2020 at 11:04 AM Raul Miller <[email protected]> wrote:
>> >
>> > Yes, I was surprised (and pleased) that the text file came through ok.
>> >
>> > As for throwing errors on malformed utf-8 sequences, here's how that
>> > could be implemented:
>> >
>> > (1) We introduce an "error row" which uses operation 2 for all
>> > character classes (and, for consistency, identifies the next row as
>> > itself -- this is mostly so that improper use of the error row would
>> > be relatively obvious). (Operation 2 emits a token, but the important
>> > thing here is that it throws an error if j=-1.)
>> >
>> > (2) For each row which is part of a partially complete utf-8
>> > character, any appearance of any character class which is not a utf-8
>> > suffix character would use operation 3 and would identify the next row
>> > as the error row. (Operation 3 emits a token, but the important thing
>> > here is that it sets j=-1.)
>> >
>> > (3) Each of the different utf-8 prefixes would lead to a different row
>> > in the state table. For example, the character class containing
>> > character 224 would get a beginning of token row (which gets the error
>> > row treatment and) which leads to a row that expects a utf-8 suffix
>> > (which gets the error row treatment) followed by a second utf-8 suffix
>> > (which follows the pattern set by row 1 of the state table).
>> >
>> > Hopefully that makes sense.
>> >
>> > Note that the final token in a string could not trigger an error.
>> > That's a limitation of the engine and corresponds approximately to how
>> > utf-8 must be treated when a low level buffer boundary splits a utf-8
>> > character.
>> >
>> > Anyways, the point is that the sequential machine can support the sort
>> > of "count a small number of steps" which is needed here. The
>> > difficulty is more that the machine stops when it reaches the end of
>> > the string. If that's a sensitivity, this could be handled by
>> > appending a linefeed character to the end of the string before
>> > processing and then removing a final linefeed character from the last
>> > token after tokenization.
>> >
>> > Again, I hope this makes sense...
>> >
>> > Thanks,
>> >
>> > --
>> > Raul
>> >
>> >
>> >
>> >
>> > On Sat, Nov 21, 2020 at 9:22 AM Don Guinn <[email protected]> wrote:
>> > >
>> > > The text file worked great!
>> > >
>> > > As to the UTF-8 codes. What is important is to avoid splitting the start
>> > > bytes from the continuation bytes. Validating the UTF-8 codes is a
>> > > difficult task. The start byte includes the number of continuation bytes to
>> > > follow. It would be an error if the number of continuation bytes didn't
>> > > agree.
>> > >
>> > > I tried my test on your definitions and it failed. Attached is a text file
>> > > with your definitions and my test following.
>> > >
>> > > Well, I can't view the attachment. I don't know if it's there or not. Just
>> > > in case, here is my test.
>> > >
>> > >
>> > > NB. A noun to show the handling of UTF-8 in ;:
>> > > test=:{{)n
>> > > The symbol for the Euro is ₠
>> > > Other symools like π show up also
>> > > How about ⌹ in APL NB. ⌹
>> > > Common expressions like 'H₂O' for water
>> > > Common expressions like H₂O for water
>> > > }}
>> > >
>> > > NB. How ;: this sj and mj handles it
>> > > ,.<;.2(0;sj;mj);:test
>> > >
>> > > NB. Assigning UTF8 as character
>> > > mj=: 2 (128+i.128)}mj
>> > >
>> > > NB. How UTF-8 is now handled
>> > > ,.<;.2(0;sj;mj);:test
>> > >
>> > > On Fri, Nov 20, 2020 at 2:46 PM Raul Miller <[email protected]> wrote:
>> > >
>> > > > Here's an updated version, which also retains utf-8 character
>> > > > sequences within token boundaries (instead of splitting them up into
>> > > > multiple tokens). I had originally posted this to the jbeta forum, but
>> > > > it's really a programming topic, and probably belongs here.
>> > > >
>> > > > mj=: 256$0 NB. X other
>> > > > mj=: 1 (9,a.i.' ')}mj NB. S space and tab
>> > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj NB. A A-Z a-z excluding N B
>> > > > mj=: 3 (a.i.'N')}mj NB. N the letter N
>> > > > mj=: 4 (a.i.'B')}mj NB. B the letter B
>> > > > mj=: 5 (a.i.'0123456789_')}mj NB. 9 digits and _
>> > > > mj=: 6 (a.i.'.')}mj NB. . the decimal point
>> > > > mj=: 7 (a.i.':')}mj NB. : the colon
>> > > > mj=: 8 (a.i.'''')}mj NB. Q quote
>> > > > mj=: 9 (a.i.'{')}mj NB. { the left curly brace
>> > > > mj=:10 (10)} mj NB. LF
>> > > > mj=:11 (a.i.'}')}mj NB. } the right curly brace
>> > > > mj=:12 (192+i.64)}mj NB. U utf-8 octet prefix
>> > > > mj=:13 (128+i.64)}mj NB. V utf-8 octet suffix
>> > > >
>> > > > sj=: 0 10#:10*}.".;._2(0 :0)
>> > > > ' X S A N B 9 . : Q { LF } U V']0
>> > > > 1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 15.1 16.1 NB. 0 space
>> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 1 other
>> > > > 1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 2 alp/num
>> > > > 1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 3 N
>> > > > 1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 15.2 16.2 NB. 4 NB
>> > > > 9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 5 NB.
>> > > > 1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 15.2 16.2 NB. 6 num
>> > > > 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0 7.0 7.0 7.0 7.0 7.0 NB. 7 '
>> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 15.2 16.2 NB. 8 ''
>> > > > 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 10.2 9.0 9.0 9.0 NB. 9 comment
>> > > > 1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 15.2 16.2 NB. 10 LF
>> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2 1.2 15.2 16.2 NB. 11 {
>> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 1.2 10.2 14.0 15.2 16.2 NB. 12 }
>> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 13 {{
>> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 15.2 16.2 NB. 14 }}
>> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.0 16.0 NB. 15 partial
>> > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 15.2 16.0 NB. 16 utf-8
>> > > > )
>> > > >
>> > > > As I noted in the beta forum -- increasing the complexity of the state
>> > > > table adds both rows and columns (size grows proportional to or faster
>> > > > than the square of the number of distinct character types when each
>> > > > character type requires a distinct handling). So it's good to keep
>> > > > this thing as simple as possible.
>> > > >
>> > > > Also, I've not tested this extensively, and it's possible I'll need to
>> > > > make further changes (let me know if you spot any problems).
>> > > >
>> > > > That said... note also that I have *not* implemented the unicode
>> > > > guideline which might suggest that the tokenizer should throw an error
>> > > > on malformed utf-8 sequences. That would require several more rows and
>> > > > columns to achieve the recommended inconvenience. This would also
>> > > > introduce email line wrap, because the state table would become that
>> > > > fat. (I'll attach a copy here as a .txt file, to see if an earlier
>> > > > suggestion -- that the forum would preserve .txt attachments -- might
>> > > > be a way of working around that issue. I suspect not, but it's easy
>> > > > enough to test...)
>> > > >
>> > > > This has *not* been implemented in the current jbeta as the ;: monad.
>> > > > I am not sure if it should, since J the language is based on ascii,
>> > > > not unicode -- it's just convenient that unicode supports an ascii
>> > > > subset.
>> > > >
>> > > > Still... we often do have reason to work with utf-8.
>> > > >
>> > > > Thanks,
>> > > >
>> > > > --
>> > > > Raul
>> > > >
>> > > > On Sun, Nov 8, 2020 at 11:01 AM Raul Miller <[email protected]> wrote:
>> > > > >
>> > > > > Oh, oops, I should have spotted that. Thanks.
>> > > > >
>> > > > > Updated state table:
>> > > > >
>> > > > > mj=: 256$0 NB. X other
>> > > > > mj=: 1 (9,a.i.' ')}mj NB. S space and tab
>> > > > > mj=: 2 (,(a.i.'Aa')+/i.26)}mj NB. A A-Z a-z excluding N B
>> > > > > mj=: 3 (a.i.'N')}mj NB. N the letter N
>> > > > > mj=: 4 (a.i.'B')}mj NB. B the letter B
>> > > > > mj=: 5 (a.i.'0123456789_')}mj NB. 9 digits and _
>> > > > > mj=: 6 (a.i.'.')}mj NB. . the decimal point
>> > > > > mj=: 7 (a.i.':')}mj NB. : the colon
>> > > > > mj=: 8 (a.i.'''')}mj NB. Q quote
>> > > > > mj=: 9 (a.i.'{')}mj NB. { the left curly brace
>> > > > > mj=:10 (10)} mj NB. LF
>> > > > > mj=:11 (a.i.'}')}mj NB. } the right curly brace
>> > > > >
>> > > > > sj=: 0 10#:10*}.".;._2(0 :0)
>> > > > > ' X S A N B 9 . : Q { LF }']0
>> > > > > 1.1 0.0 2.1 3.1 2.1 6.1 1.1 1.1 7.1 11.1 10.1 12.1 NB. 0 space
>> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 11.2 10.2 12.2 NB. 1 other
>> > > > > 1.2 0.3 2.0 2.0 2.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 2 alp/num
>> > > > > 1.2 0.3 2.0 2.0 4.0 2.0 1.0 1.0 7.2 11.2 10.2 12.2 NB. 3 N
>> > > > > 1.2 0.3 2.0 2.0 2.0 2.0 5.0 1.0 7.2 11.2 10.2 12.2 NB. 4 NB
>> > > > > 9.0 9.0 9.0 9.0 9.0 9.0 1.0 1.0 9.0 9.0 10.2 9.0 NB. 5 NB.
>> > > > > 1.4 0.5 6.0 6.0 6.0 6.0 6.0 1.0 7.4 11.4 10.2 12.4 NB. 6 num
>> > > > > 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 8.0 7.0 7.0 7.0 NB. 7 '
>> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.2 1.2 7.0 11.2 10.2 12.2 NB. 8 ''
>> > > > > 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 9.0 10.2 9.0 NB. 9 comment
>> > > > > 1.2 0.2 2.2 3.2 2.2 6.2 1.2 1.2 7.2 11.2 10.2 12.2 NB. 10 LF
>> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 13.0 10.2 1.2 NB. 11 {
>> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.0 1.0 7.2 1.2 10.2 14.0 NB. 12 }
>> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 NB. 13 {{
>> > > > > 1.2 0.3 2.2 3.2 2.2 6.2 1.7 1.7 7.2 1.2 10.2 1.2 NB. 14 }}
>> > > > > )
>> > > > >
>> > > > > (Note that I haven't coerced this state table to integer form --
>> > > > > floats and integers occupy the same space on 64 bit systems, and the
>> > > > > model doesn't really care about representation.)
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > --
>> > > > > Raul
>> > > > >
>> > > > > On Sun, Nov 8, 2020 at 10:53 AM Henry Rich <[email protected]> wrote:
>> > > > > >
>> > > > > > I was talking about
>> > > > > >
>> > > > > > ;: LF,'.'
>> > > > > > +-+-+
>> > > > > > | |.|
>> > > > > > +-+-+
>> > > > > >
>> > > > > > Henry Rich
>> > > > > >
>> > > > > > On 11/8/2020 8:38 AM, Raul Miller wrote:
>> > > > > > > I tested for that case:
>> > > > > > >
>> > > > > > > #;:'NB.',LF,LF
>> > > > > > > 3
>> > > > > > > #(0;sj;mj) sq 'NB.',LF,LF
>> > > > > > > 3
>> > > > > > > #(0;sj;mj) sq 'NB.',LF,LF,LF
>> > > > > > > 4
>> > > > > > >
>> > > > > > > Thanks,
>> > > > > > >
>> > > > > >
>> > > > > >
>> > > > > > ----------------------------------------------------------------------
>> > > > > > For information about J forums see http://www.jsoftware.com/forums.htm