I just tried the finite-state machine parser from the URL below. It doesn't correctly parse multi-line CSV files. Can someone suggest changes so that it will?
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Devon McCormick
Sent: Tuesday, December 10, 2013 10:34
To: J-programming forum
Subject: Re: [Jprogramming] Beginner Understanding CSV file reading/writing

Just to gild the lily, one of our NYCJUG members implemented CSV parsing using J's finite-state machine primitives: http://www.jsoftware.com/jwiki/NYCJUG/2013-06-11?action=AttachFile&do=view&target=Parsing+CSV+Files+with+a+Finite+State+Machine.pdf.

On Tue, Dec 10, 2013 at 9:35 AM, Joe Bogner <[email protected]> wrote:
> Just to expand on Devon's post, I often use a combination of cut and
> each to split up a string
>
> This will do the same (with a few more steps behind the scenes)
>
>    ',' cut each LF cut ('1,2,"embedded comma",3.4',CR,LF,'5,6,"no comma",7.8',CR,LF) -. CR
>
> as
>
>    <;._1&>',',&.><;._2 CR-.~('1,2,"embedded comma",3.4',CR,LF,'5,6,"no comma",7.8',CR,LF)
>
> Jon, in case it helps to break it down:
>
> [Split on comma] [each] [Split on LF] [Remove CR] ('1,2,"embedded comma",3.4',CR,LF,'5,6,"no comma",7.8',CR,LF)
>
> Step 1 - Remove the extra CR
>
> CR-.~ removes extra carriage returns from the string. They are
> unnecessary since we are splitting on LF
>
> You can accomplish the same by doing this:
>
>    ('1,2,"embedded comma",3.4',CR,LF,'5,6,"no comma",7.8',CR,LF) -. CR
>
> As Brian mentioned, the tilde just reverses the arguments.
>
>    CR -.~ ('1,2,"embedded comma",3.4',CR,LF,'5,6,"no comma",7.8',CR,LF)
>
> Step 2 - Split on the last character, which is now LF
>
> http://www.jsoftware.com/jwiki/Vocabulary/semidot
>
> <;._2 will split on the last character of the string and drop it
>
>    <;._2 ('A',LF,'B',LF,'C',LF)
> ┌─┬─┬─┐
> │A│B│C│
> └─┴─┴─┘
>
> If you check out the definition of 'cut' you will see it has this same
> operation
>
> Step 3 - Split on comma for each item
>
> In Step 2 - we created a boxed array of strings for each LF.
> We now need to operate on each box and split based on comma
>
> The 'each' adverb will do this, which is what Devon has as "&.>"
>
> [Split on comma] is <;._1&>',' ,
>
> You can see it in action here:
>
>    <;._1&>',' , each ('a,b';'c,d')
> ┌─┬─┐
> │a│b│
> ├─┼─┤
> │c│d│
> └─┴─┘
>
> The trick here is to use the cut conjunction to split on commas. The
> cut conjunction uses either the first or the last item in the array
> as the delimiter. A CSV file won't have the comma at the beginning or
> the end, so we need to first add a comma at the beginning of each boxed
> string so we can tell cut to split on it
>
> That is what ',' ,&.> is doing. It's adding a comma at the beginning of
> each item
>
>    ',' ,&.> ('a,b';'c,d')
> ┌────┬────┐
> │,a,b│,c,d│
> └────┴────┘
>
>    ',' , each ('a,b';'c,d')
> ┌────┬────┐
> │,a,b│,c,d│
> └────┴────┘
>
> Now that each boxed string starts with a comma, we can cut on the
> first character and drop it
>
>    <;._1 &> ',' , each ('a,b';'c,d')
>
> Back to the beginning:
>
>    <;._1 &> ',' , each <;._2 ('1,2,"embedded comma",3.4',CR,LF,'5,6,"no comma",7.8',CR,LF)
>
> Split on comma - for each item - in a LF split string
>
> ┌─┬─┬────────────────┬────┐
> │1│2│"embedded comma"│3.4 │
> ├─┼─┼────────────────┼────┤
> │5│6│"no comma"      │7.8 │
> └─┴─┴────────────────┴────┘
>
> Hope that helps. I learned more by going through it and wanted to share
>
> On Sat, Dec 7, 2013 at 5:44 PM, Devon McCormick <[email protected]> wrote:
>
> > Yes - sorry for typing it in w/o testing it. Note that the point at which
> > the error was picked up is indicated by extra spaces in the returned line:
> >    mat=.<.; _1&>',',&.><;._2 CR-.~freads jpath'~temp/test.csv'
> > |domain error
> > |   mat=.<.; _1&>',',&.><;._2 CR-.~freads jpath'~temp/test.csv'
> >
> > A good way to debug a line like this is to look at successively
> > longer pieces, starting w/the rightmost one, e.g.
> > (on my system):
> >    jpath '~temp/test.csv'
> > c:/users/devonmcc/j64-701-user/temp/test.csv
> >
> > Do I have this file?
> >    fexist jpath '~temp/test.csv'
> > 0
> >
> > So, I don't have this file - I only used it to mimic the example you sent.
> > If I create this file locally so I can continue looking at longer pieces:
> >    ('1,2,"embedded, comma",3.4',CR,LF,'5,6,"no comma",7.8') fwrite 'test.csv'
> > 45
> >    fexist 'test.csv'
> > 1
> >
> > BTW - "fexist" is defined
> >    fexist=: 1:@(1!:4) ::0:@(([: < 8 u: >) ::]&>)@(<^:(L. = 0:))
> > in case you don't have it.
> >
> > Continuing with longer fragments shows us what the data looks like at each
> > step:
> >    NB. mat=. <.;_1&>',',&.><;._2 CR-.~freads 'test.csv'
> >    freads 'test.csv'
> > 1,2,"embedded, comma",3.4
> > 5,6,"no comma",7.8
> >
> >    CR-.~freads 'test.csv'
> > 1,2,"embedded, comma",3.4
> > 5,6,"no comma",7.8
> >
> >    <;._2 CR-.~freads 'test.csv'
> > +-------------------------+------------------+
> > |1,2,"embedded, comma",3.4|5,6,"no comma",7.8|
> > +-------------------------+------------------+
> >    ',',&.><;._2 CR-.~freads 'test.csv'
> > +--------------------------+-------------------+
> > |,1,2,"embedded, comma",3.4|,5,6,"no comma",7.8|
> > +--------------------------+-------------------+
> >    <.;_1&>',',&.><;._2 CR-.~freads 'test.csv'
> > |domain error
> > |   <.; _1&>',',&.><;._2 CR-.~freads'test.csv'
> >
> > Fixing the error:
> >    <;._1&>',',&.><;._2 CR-.~freads 'test.csv'
> > +-+-+----------+-------+---+
> > |1|2|"embedded | comma"|3.4|
> > +-+-+----------+-------+---+
> > |5|6|"no comma"|7.8    |   |
> > +-+-+----------+-------+---+
> >
> > On Sat, Dec 7, 2013 at 10:27 AM, Brian Schott <[email protected]> wrote:
> >
> > > It looks like there is a typo in the command with `mat`: `.;` should be `;.`.
> > > `mat` is not a verb but a noun, btw.
> > > I think tilde is a dyadic tilde, not monadic, and swaps the arguments of -.
> > > in this case.
> > >
> > > On Sat, Dec 7, 2013 at 9:08 AM, Jon Hough <[email protected]> wrote:
> > >
> > > > I'd like to thank everyone for replying.
> > > > I suppose I should think about using J7.
> > > >
> > > > I did try Devon's example:
> > > > "You can read CSV files in J pretty simply without using any predefined
> > > > verbs like this:
> > > >
> > > >    mat=. <.;_1&>',',&.><;._2 CR-.~freads jpath '~temp/test.csv'
> > > >
> > > > and I got the error:
> > > > |domain error
> > > > |   mat=.<.; _1&>',',&.><;._2 CR-.~freads jpath'~temp/test.csv'
> > > >
> > > > As an aside, I don't really understand what the "mat" function is doing.
> > > > I'm still reading "J for C Programmers" so my understanding is a little
> > > > shaky, but mat seems to be monadic, with the argument as the file to read.
> > > > I'm not sure if this is an example of a tacit verb, because the argument
> > > > ('~temp/test.csv') seems to be hardcoded into the verb.
> > > >
> > > > I assume:
> > > >    freads jpath '~temp/test.csv'
> > > > reads the file. (http://www.jsoftware.com/user/script_files.htm)
> > > > I do not really understand this: ~freads (I do not understand this use of
> > > > the monadic tilde)
> > > > I am trying to read this verb from right to left, but am not getting very
> > > > far, even using the J dictionary and reference card for support.
> > > > I would really appreciate any help at all in deciphering this.
> > > >
> > > > Thanks and regards,
> > > > Jon
> > >
> > > --
> > > (B=) <-----my sig
> > > Brian Schott
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm
> >
> > --
> > Devon McCormick, CFA

--
Devon McCormick, CFA
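
For reference, the cut-based approach quoted above splits on every comma and every LF, which is why "embedded, comma" lands in two boxes and why a quoted field containing a line break would end up in two rows. Below is a minimal sketch of a quote-aware split. It is not the NYCJUG finite-state machine parser; it uses a different technique (a running parity scan over the double quotes), and the names inq, seps, fields, rows and parsecsv are hypothetical, made up for this illustration. It assumes fields are quoted with double quotes only, leaves the quotes in the fields, and removes every CR, including any inside quoted fields.

```j
NB. Sketch only: hypothetical names, not the NYCJUG FSM parser.

NB. 1 for every character inside a quoted field (running parity of quotes)
inq =: ~:/\ @: ('"'&=)

NB. x seps y: 1 where character x occurs in y outside quotes
seps =: 4 : '(x = y) *. -. inq y'

NB. split one record on unquoted commas; a comma is prepended so that
NB. dyadic ;._1 has a fret in front of the first field
fields =: 3 : '(1 , '','' seps y) <;._1 '','' , y'

NB. split text into records on unquoted LFs (text must end with LF)
rows =: 3 : '(LF seps y) <;._2 y'

NB. drop CRs, make sure the text ends with LF, then split rows and fields
parsecsv =: 3 : 'fields each rows (y -. CR) , LF #~ LF ~: {: y'
```

With these definitions, fields '1,2,"embedded, comma",3.4' should give four boxes rather than five, and rows keeps a quoted field with an embedded LF inside one record. Applying > to the result of parsecsv gives a table like the ones above, padded with empty boxes when rows have different numbers of fields.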
