Just to gild the lily, one of our NYCJUG members implemented CSV parsing
using J's finite-state machine primitives:
http://www.jsoftware.com/jwiki/NYCJUG/2013-06-11?action=AttachFile&do=view&target=Parsing+CSV+Files+with+a+Finite+State+Machine.pdf.
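
Short of a full state machine, a quote-parity mask is enough to keep a comma inside double quotes from splitting a field. The sketch below is not the method from the PDF above; it is a minimal illustration using only primitives, and the name splitq is made up here. It assumes one line at a time, leaves the quote characters in the fields, and does not handle line breaks inside quoted fields.

```j
NB. splitq: split one CSV line on the commas that lie outside double quotes.
splitq =: 3 : 0
  s =. ',' , y                  NB. prefix a comma so every field begins at a fret
  inq =. ~:/\ '"' = s           NB. running parity of quotes: 1 while inside a quoted field
  frets =. (',' = s) *. -. inq  NB. keep only the commas that are outside quotes
  frets <;._1 s                 NB. cut at the marked commas and drop them
)

fields =. splitq '1,2,"embedded, comma",3.4'
# fields                        NB. 4 fields; the quoted one stays in one piece
```

Applying splitq to each boxed line (splitq each, or splitq&.>) would take the place of the plain comma cut in the one-liner quoted below.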


On Tue, Dec 10, 2013 at 9:35 AM, Joe Bogner <[email protected]> wrote:

> Just to expand on Devon's post, I often use a combination of cut and each
> to split up a string
>
> This will do the same  (with a few more steps behind the scenes)
>
>    ',' cut each LF cut ('1,2,"embedded comma",3.4',CR,LF,'5,6,"no comma",7.8',CR,LF) -. CR
>
> as
>
>    <;._1&>',',&.><;._2 CR-.~('1,2,"embedded comma",3.4',CR,LF,'5,6,"no comma",7.8',CR,LF)
>
> Jon, in case it helps to break it down:
>
> [Split on comma] [each] [Split on LF] [Remove CR] ('1,2,"embedded comma",3.4',CR,LF,'5,6,"no comma",7.8',CR,LF)
>
>
> Step 1 - Remove the extra CR
>
> CR -.~ removes the carriage returns from the string. They are unnecessary
> because we are splitting on LF
>
> You can accomplish the same by doing this:
>
> ('1,2,"embedded comma",3.4',CR,LF,'5,6,"no comma",7.8',CR,LF) -. CR
>
> As Brian mentioned, the tilde just reverses the arguments.
>
> CR -.~ ('1,2,"embedded comma",3.4',CR,LF,'5,6,"no comma",7.8',CR,LF)
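
To make the argument swap concrete, here is a small check that uses only primitives: x u~ y is evaluated as y u x, so both spellings of the CR removal give the same string. The sample text and the names s, a and b are made up for this sketch.

```j
s =. 'one',CR,LF,'two',CR,LF    NB. sample text with CRLF line endings
a =. s -. CR                    NB. "less": remove every CR from s
b =. CR -.~ s                   NB. the same call with the arguments swapped by ~
a -: b                          NB. 1, the two results match
# a                             NB. 8 characters remain: 'one',LF,'two',LF
```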
>
> Step 2 - Split on the last character, which is now LF
>
> http://www.jsoftware.com/jwiki/Vocabulary/semidot
>
> <;._2 will split on the last character of the string and drop it
>
> <;._2 ('A',LF,'B',LF,'C',LF)
> ┌─┬─┬─┐
> │A│B│C│
> └─┴─┴─┘
>
> If you check out the definition of 'cut' you will see it has this same
> operation
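
One difference is worth a quick sketch. As far as I understand the standard library definition, cut appends the delimiter, applies <;._2 and then removes empty boxes, so it is not identical to a bare <;._1 or <;._2 when a CSV line has empty fields. The sample line and the names below are made up; it is worth checking the behavior on your own J version.

```j
line =. 'a,,c'                  NB. hypothetical line with an empty middle field
viacut =. ',' cut line          NB. stdlib cut: the empty field is removed
viaprim =. <;._1 ',' , line     NB. primitive cut on a prefixed comma: the empty field is kept
# viacut                        NB. 2
# viaprim                       NB. 3
```

For CSV data where column positions matter, the primitive form is the safer choice because it keeps the empty fields.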
>
> Step 3 - Split on comma for each item
>
> In Step 2 - we created a boxed array of strings for each LF. We now need to
> operate on each box and split based on comma
>
> The 'each' adverb will do this, which is what Devon has as "&.>"
>
> [Split on comma] is <;._1&>',' ,
>
> You can see it in action here:
>
>    <;._1&>',' , each ('a,b';'c,d')
> ┌─┬─┐
> │a│b│
> ├─┼─┤
> │c│d│
> └─┴─┘
>
> The trick here is to use the cut conjunction (;.) to split on commas. With
> _1 or _2, cut uses the first or the last item of the array, respectively, as
> the delimiter. A line of a CSV file won't have a comma at the beginning or
> the end, so we first add a comma at the beginning of each boxed string so
> that cut has a delimiter to split on
>
> That is what ',',&.> is doing. It's adding a comma at the beginning of each
> item
>
>  ',' ,&.> ('a,b';'c,d')
> ┌────┬────┐
> │,a,b│,c,d│
> └────┴────┘
>
> ',' , each ('a,b';'c,d')
>
> ┌────┬────┐
> │,a,b│,c,d│
> └────┴────┘
>
>
> Now that each boxed string starts with a comma, we can cut on the first
> character and drop it
>
> <;._1 &> ',' , each ('a,b';'c,d')
>
>
> Back to the beginning:
>
>    <;._1 &> ',' , each <;._2 CR -.~ ('1,2,"embedded comma",3.4',CR,LF,'5,6,"no comma",7.8',CR,LF)
>
> Split on comma - for each item - in a LF split string, after removing CR
>
> ┌─┬─┬────────────────┬───┐
> │1│2│"embedded comma"│3.4│
> ├─┼─┼────────────────┼───┤
> │5│6│"no comma"      │7.8│
> └─┴─┴────────────────┴───┘
>
>
> Hope that helps. I learned more by going through it and wanted to share
>
> On Sat, Dec 7, 2013 at 5:44 PM, Devon McCormick <[email protected]> wrote:
>
> > Yes - sorry for typing it in w/o testing it.  Note that the point at which
> > the error was picked up is indicated by extra spaces in the returned line:
> >    mat=.<.; _1&>',',&.><;._2 CR-.~freads jpath'~temp/test.csv'
> > |domain error
> > |   mat=.<.;    _1&>',',&.><;._2 CR-.~freads jpath'~temp/test.csv'
> >
> > A good way to debug a line like this is to look at successively longer
> > pieces, starting w/the rightmost one, e.g. (on my system):
> >    jpath '~temp/test.csv'
> > c:/users/devonmcc/j64-701-user/temp/test.csv
> >
> > Do I have this file?
> >    fexist jpath '~temp/test.csv'
> > 0
> >
> > So, I don't have this file - I only used it to mimic the example you sent.
> > I can create this file locally so I can continue looking at longer pieces:
> >    ('1,2,"embedded, comma",3.4',CR,LF,'5,6,"no comma",7.8') fwrite 'test.csv'
> > 45
> >    fexist 'test.csv'
> > 1
> >
> > BTW - "fexist" is defined
> >    fexist=: 1:@(1!:4) ::0:@(([: < 8 u: >) ::]&>)@(<^:(L. = 0:))
> > in case you don't have it.
> >
> > Continuing with longer fragments shows us what the data looks like at each
> > step:
> >    NB. mat=. <.;_1&>',',&.><;._2 CR-.~freads 'test.csv'
> >    freads 'test.csv'
> > 1,2,"embedded, comma",3.4
> > 5,6,"no comma",7.8
> >
> >    CR-.~freads 'test.csv'
> > 1,2,"embedded, comma",3.4
> > 5,6,"no comma",7.8
> >
> >    <;._2 CR-.~freads 'test.csv'
> > +-------------------------+------------------+
> > |1,2,"embedded, comma",3.4|5,6,"no comma",7.8|
> > +-------------------------+------------------+
> >    ',',&.><;._2 CR-.~freads 'test.csv'
> > +--------------------------+-------------------+
> > |,1,2,"embedded, comma",3.4|,5,6,"no comma",7.8|
> > +--------------------------+-------------------+
> >    <.;_1&>',',&.><;._2 CR-.~freads 'test.csv'
> > |domain error
> > |   <.;    _1&>',',&.><;._2 CR-.~freads'test.csv'
> >
> > Fixing the error:
> >    <;._1&>',',&.><;._2 CR-.~freads 'test.csv'
> > +-+-+----------+-------+---+
> > |1|2|"embedded | comma"|3.4|
> > +-+-+----------+-------+---+
> > |5|6|"no comma"|7.8    |   |
> > +-+-+----------+-------+---+
> >
> >
> >
> >
> >
> >
> > On Sat, Dec 7, 2013 at 10:27 AM, Brian Schott <[email protected]> wrote:
> >
> > > > It looks like there is a typo in the command with `mat`: .; should be ;.
> > > > (the dot and the semicolon are transposed).
> > > > `mat` is not a verb but a noun, btw.
> > > > I think the tilde here is the dyadic use, not the monadic one, and swaps
> > > > the arguments of -. in this case.
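
To illustrate the noun/verb distinction: mat=. ... evaluates the whole line once and stores the resulting table, whereas a verb would take the text as its argument. The sketch below wraps the same pipeline in an explicit verb. The name splitcsv and the sample text are made up for illustration, the input is assumed to end with LF, and the plain comma split still ignores quoting.

```j
NB. splitcsv: turn CSV text (lines ending in LF or CRLF) into a table of boxed fields.
splitcsv =: 3 : 0
  lines =. <;._2 CR -.~ y       NB. remove CRs, then cut on the final character (LF)
  <;._1 &> ',' ,&.> lines       NB. prefix a comma to each line, cut on it, assemble rows
)

txt =. '1,2,3',CR,LF,'4,5,6',CR,LF   NB. sample input, standing in for the result of freads
mat =. splitcsv txt                   NB. mat is a noun: the table that splitcsv returned
$ mat                                 NB. 2 3
```

Here splitcsv is the verb and can be reused on any text, while mat is just the value it produced for this one input.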
> > >
> > > On Sat, Dec 7, 2013 at 9:08 AM, Jon Hough <[email protected]> wrote:
> > >
> > > > I'd like to thank everyone for replying.
> > > > I suppose I should think about using J7.
> > > >
> > > > I did try Devon's example:
> > > > "You can read CSV files in J pretty simply without using any predefined
> > > >  verbs like this:
> > > >
> > > >  mat=. <.;_1&>',',&.><;._2 CR-.~freads jpath '~temp/test.csv'
> > > >
> > > > and I got the error:
> > > > |domain error
> > > > |   mat=.<.;    _1&>',',&.><;._2 CR-.~freads jpath'~temp/test.csv'
> > > >
> > > > As an aside, I don't really understand what the "mat" function is doing.
> > > > I'm still reading "J for C Programmers" so my understanding is a little
> > > > shaky, but mat seems to be monadic, with the argument as the file to read.
> > > > I'm not sure if this is an example of a tacit verb, because the argument
> > > > ('~temp/test.csv') seems to be hardcoded into the verb.
> > > >
> > > > I assume:
> > > > freads jpath '~temp/test.csv'
> > > > reads the file (http://www.jsoftware.com/user/script_files.htm).
> > > > I do not really understand this: ~freads (I do not understand this use of
> > > > the monadic tilde).
> > > > I am trying to read this verb from right to left, but am not getting very
> > > > far, even using the J dictionary and reference card for support.
> > > > I would really appreciate any help at all in deciphering this.
> > > >
> > > > Thanks and regards,
> > > > Jon
> > > >
> > > >
> > > --
> > > (B=) <-----my sig
> > > Brian Schott
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm
> > >
> >
> >
> >
> > --
> > Devon McCormick, CFA
> >
>



-- 
Devon McCormick, CFA
