Re: [Jprogramming] sequential machine and empty word output

Danil Osipchuk Mon, 24 Apr 2023 12:56:48 -0700

CSV is just an example, albeit an important one that everyone can relate
to. But the issue is much broader in scope. To put it simply, SM is
currently not flexible enough: it bites one's hand almost every time one
tries to apply it.
The most pressing issue is with the domain of the emitted word:
  1) You can't emit an empty word, although it is a perfectly valid and
important case in many scenarios. In particular, a valid SM that could
parse an empty string currently doesn't exist at all. That makes very
little sense, especially in a language where so much effort has been made
to cover empty lists thoroughly elsewhere.
  2) A related issue: often you are not interested in the separators but
in what is between them, so you would rather skip them. You do still want
to present the structure of the input, though, which takes the form of
frets between empty fields.


In the case of CSV the issue is primarily with commas. Quotes are only
tangentially related here, and it was not my intention to produce a
full-fledged implementation conforming to the RFCs with all their
subtleties (at least not right away).
But as an example CSV is hard to resist, since it highlights the problem
perfectly.

Given a text of comma-separated, possibly empty fields, how do you split
them?

Of course, one will suggest something like this (note the negative n, _2):

   csv =: [: (<;. _2) ,&','

   csv '123,32,'
┌───┬──┬┐
│123│32││
└───┴──┴┘

   csv ',,'
┌┬┬┐
││││
└┴┴┘

What if some commas are inside quoted strings? Now you can't use bulk
approaches like cut, because you need to pass state around. You can parse
the text in a loop inside an explicit definition, but with a large enough
input that quickly becomes infeasible.
SM seems to be the right tool, but it doesn't do empty partitions, and
skipping separators is cumbersome since it requires additional processing.
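
To make the quoted-comma problem concrete, here is a small sketch in Python
rather than J (the helper name naive_split is made up for this
illustration). Like the cut-based csv above, it looks at every comma in
isolation and carries no state, so a comma inside quotes is treated as a
separator:

```python
def naive_split(text):
    """Split on every comma, with no notion of being inside quotes."""
    # Mirrors the cut-based csv verb: append a trailing fret, then cut on frets.
    fields = (text + ",").split(",")
    return fields[:-1]  # drop the empty piece after the appended fret


print(naive_split("123,32,"))
print(naive_split(",,"))
print(naive_split('Full record, 3, "Hello, world"'))
```

The first two calls give three fields each, as in the J session. The last
call gives four fields, because the quoted text is cut in two at its inner
comma.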

Generally, this is when you reach for SM: complex parsing and big inputs
(I have to deal with multi-gigabyte memory-mapped files), and you would
want to cover as much as possible with SM.

It may look a bit strange that the 3 new opcodes are written as j=i+1, but
this just follows how SM is documented: through a description of what goes
on under the hood. During the actual processing step for the next item,
i = j.
Essentially, the new opcodes would do the same thing that a negative n does
for the cut family: extend the functionality to skip frets when
partitioning.
And the example that parses a (simplified) CSV remains notably simple: a
2x3 state table.
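
As a rough cross-check of that table, the proposed behaviour can be
modelled in a few lines of Python. This is only a sketch, not the w.c
implementation: the names run_sm, TABLE, CLASSES and csv are invented here,
only opcodes 0, 1, 8 and 9 are modelled, each character class is assumed to
hold a single character, and the end-of-input step assumes a negative d
(the pending word from j to the end is emitted).

```python
def run_sm(text, table, classes, i=0, j=0, row=0):
    """Simplified model of the patched machine, boxed-word output only."""
    words = []
    while i < len(text):
        ch = text[i]
        # Characters not listed in `classes` fall into the last column.
        col = classes.index(ch) if ch in classes else len(classes)
        next_row, op = table[row][col]
        if op == 1:        # existing opcode: j=.i
            j = i
        elif op == 8:      # proposed: j=.i+1, nothing emitted
            j = i + 1
        elif op == 9:      # proposed: emit text[j:i] (may be empty), then j=.i+1
            words.append(text[j:i])
            j = i + 1
        elif op != 0:
            raise ValueError(f"opcode {op} is not modelled in this sketch")
        row = next_row
        i += 1
    if j >= 0:             # assumed end-of-input rule for negative d
        words.append(text[j:i])
    return words


# State table from the quoted message: rows are states, columns are
# character classes (comma, double quote, other); entries are
# (next state, opcode).
TABLE = [
    [(0, 9), (1, 1), (0, 0)],  # 0: waiting for terminating comma
    [(1, 0), (0, 0), (1, 0)],  # 1: inside quotes
]
CLASSES = [",", '"']


def csv(text):
    return run_sm(text, TABLE, CLASSES)


print(csv(",,"))
print(csv(",1st field, Is empty"))
print(csv('Full record, 3, "Hello, world"'))
```

Under these assumptions the three calls print ['', '', ''],
['', '1st field', ' Is empty'] and ['Full record', ' 3', '"Hello, world"'],
the same fields as the boxed results in the quoted session.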

Hopefully this clarifies a bit.

regards,
  Danil

On Mon, 24 Apr 2023 at 19:08, Raul Miller <[email protected]> wrote:

> Parsing csv seems like the motivation here.
>
> If so, it would also be good to have a more complete test suite.
>
> In particular, csv double quote handling --
>
> https://stackoverflow.com/questions/66096193/having-multiple-double-quotes-inside-quoted-string-csv-file
> for example -- means that your opcode 9 deserves some careful thought.
>
> (Traditionally, cleanup of the double quotes would happen in a
> separate step, after ;: had completed breaking out the words. Here, I
> think you mean something slightly different for opcode 9. Instead of
> ev (which would include all text from the end of the previous word),
> you might have been thinking of some different concept which would
> skip over one of the double quote characters?)
>
> (I haven't emulated a machine implementation here, there's enough
> detail involved that I would much rather look at a working demo and
> how it handles test cases.)
>
> Thanks,
>
> --
> Raul
>
> On Mon, Apr 24, 2023 at 5:34 AM Danil Osipchuk <[email protected]>
> wrote:
> >
> > I wonder if I'm the only one bothered by semicolon's assertion of
> > strictly i>j.
> >
> > Generally, empty words can be used as markers to impose some additional
> > regularity on the output, to make it easier to process later.
> >
> > An obvious example would be parsing a csv file with 3 fields per record
> > where any can be empty:
> > ,,
> > ,1st field, Is empty
> > Full record, 3, "Hello, world"
> >
> > It is natural to parse it into empty strings where appropriate, but i>j
> > gets in the way.
> >
> >
> > Allowing i>:j and adding 3 new opcodes (for the sake of completeness)
> > like the ones below seems to increase SM's usefulness considerably in a
> > mostly backwards-compatible way. What do others think?
> >
> > 8    j=.i+1
> > 9    j=.i+1  [ ew(i,j,r,c)
> > 10   j=.i+1  [ ev(i,j,r,c)
> >
> > NB. Rows: 0: Waiting for terminating comma, 1: Inside of quotes
> > NB. Columns: 0: comma, 1: double quotes, 2: other
> >
> >    <"1 (2 3 2$ 0 9 1 1 0 0   1 0 0 0 1 0)
> > +---+---+---+
> > |0 9|1 1|0 0|
> > +---+---+---+
> > |1 0|0 0|1 0|
> > +---+---+---+
> >    csv =: (0;(2 3 2$ 0 9 1 1 0 0   1 0 0 0 1 0   );(',';'"');0 0 0 _1 ) & ;:
> >    csv ',,'
> > ++++
> > ||||
> > ++++
> >    csv ',1st field, Is empty'
> > ++---------+---------+
> > ||1st field| Is empty|
> > ++---------+---------+
> >    csv 'Full record, 3, "Hello, world"'
> > +-----------+--+--------------+
> > |Full record| 3|"Hello, world"|
> > +-----------+--+--------------+
> >
> > ====
> >
> > dlab:~/Sources/jsource-master/jsrc$ diff w.c w.c.orig
> > 251c251
> > < #define CHKJ(j)             ASSERT(BETWEENC((j),0,i),EVINDEX);
> > ---
> > > #define CHKJ(j)             ASSERT(BETWEENO((j),0,i),EVINDEX);
> > 272,274d271
> > <   case 8:         j=i+1; break;          \
> > <   case 9:         if(0<=vi){EMIT(T,vj,vi,vr,vc); vi=vr=-1;} EMIT(T,j,i,r,c);        j=i+1; break;  \
> > <   case 10:        if(r!=vr){if(0<=vi)EMIT(T,vj,vi,vr,vc); vj=j; vr=r; vc=c;} vi=i;  j=i+1; break;  \
> > 339c336
> > <  v=sv; DQ(p*q, k=*v++; e=*v++; ASSERT((UI)k<(UI)p&&(UI)e<=(UI)10,EVINDEX););
> > ---
> > >  v=sv; DQ(p*q, k=*v++; e=*v++; ASSERT((UI)k<(UI)p&&(UI)e<=(UI)7,EVINDEX););
> > 346c343
> > <   if(2<=n){ijrd[1]=j=*v++; ASSERT(BETWEENC(j, -1, i),EVINDEX);}
> > ---
> > >   if(2<=n){ijrd[1]=j=*v++; ASSERT(BETWEENO(j, -1, i),EVINDEX);}
> >
> >
> > regards,
> >  Danil
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
