You may also want to look at this: http://www.jsoftware.com/jwiki/NYCJUG/2012-12-11#Example_of_Free-Form_Text_Wrangling.
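
The plain ;._1 split in the quoted messages below breaks "embedded, comma" into two boxes, because it cuts on every comma. A minimal sketch of a quote-aware split follows. The verb name splitcsv and the locals inq and sep are made up for illustration and are not part of the J library. It assumes double quotes are balanced, treats a comma as a separator only when it falls outside quotes, and leaves the quote characters in the field.

```j
NB. splitcsv: hypothetical helper, not a standard-library verb.
NB. Boxes the comma-separated fields of one line, ignoring commas inside "...".
splitcsv=: 3 : 0
inq=. ~:/\ '"' = y          NB. running parity of quote marks: 1 while inside a quoted field
sep=. (',' = y) *. -. inq   NB. 1 at commas that lie outside quotes
(1 , sep) <;._1 ',' , y     NB. prepend a comma as the first fret, then cut where the mask is 1
)

   # splitcsv '1,2,"embedded, comma",3.4'
4
```

Applied per line, something like splitcsv&> <;._2 CR -.~ freads 'test.csv' should slot in where <;._1&>',',&.> is used below. The boolean left argument to ;._1 is what lets the cut skip the quoted comma.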

On Tue, Dec 10, 2013 at 11:34 AM, Devon McCormick wrote:

> Just to gild the lily, one of our NYCJUG members implemented CSV parsing
> using J's finite-state machine primitives:
> http://www.jsoftware.com/jwiki/NYCJUG/2013-06-11?action=AttachFile&do=view&target=Parsing+CSV+Files+with+a+Finite+State+Machine.pdf
>
> On Tue, Dec 10, 2013 at 9:35 AM, Joe Bogner wrote:
>
>> Just to expand on Devon's post, I often use a combination of cut and each
>> to split up a string
>>
>> This will do the same (with a few more steps behind the scenes)
>>
>>    ',' cut each LF cut ('1,2,"embedded comma",3.4',CR, LF,'5,6,"no comma",7.8',CR, LF) -. CR
>>
>> as
>>
>>    <;._1&>',',&.><;._2 CR-.~('1,2,"embedded comma",3.4',CR,LF,'5,6,"no comma",7.8',CR,LF)
>>
>> Jon, in case it helps to break it down:
>>
>> [Split on comma] [each] [Split on LF] [Remove CR] ('1,2,"embedded comma",3.4',CR,LF,'5,6,"no comma",7.8',CR,LF)
>>
>>
>> Step 1 - Remove the extra CR
>>
>> CR-.~ removes extra carriage returns from the string. They are unnecessary
>> since we are splitting on LF
>>
>> You can accomplish the same by doing this:
>>
>>    ('1,2,"embedded comma",3.4',CR,LF,'5,6,"no comma",7.8',CR,LF) -. CR
>>
>> As Brian mentioned, the tilde just reverses the arguments.
>>
>>    CR -.~ ('1,2,"embedded comma",3.4',CR,LF,'5,6,"no comma",7.8',CR,LF)
>>
>> Step 2 - Split on the last character, which is now LF
>>
>> http://www.jsoftware.com/jwiki/Vocabulary/semidot
>>
>> <;._2 will split on the last character of the string and drop it
>>
>>    <;._2 ('A',LF,'B',LF,'C',LF)
>> ┌─┬─┬─┐
>> │A│B│C│
>> └─┴─┴─┘
>>
>> If you check out the definition of 'cut' you will see it has this same
>> operation
>>
>> Step 3 - Split on comma for each item
>>
>> In Step 2 - we created a boxed array of strings for each LF.
>> We now need to operate on each box and split based on comma
>>
>> The 'each' adverb will do this, which is what Devon has as "&.>"
>>
>> [Split on comma] is <;._1&>',' ,
>>
>> You can see it in action here:
>>
>>    <;._1&>',' , each ('a,b';'c,d')
>> ┌─┬─┐
>> │a│b│
>> ├─┼─┤
>> │c│d│
>> └─┴─┘
>>
>> The trick here is to use the cut conjunction to split on commas. The cut
>> conjunction uses either the first or the last item in the array to split.
>> A CSV file won't have the comma at the beginning or the end, so we need to
>> first add a comma at the beginning of each boxed array so we can tell cut
>> to split on it
>>
>> That is what ',' ,&.> is doing. It's adding a comma at the beginning of each
>> item
>>
>>    ',' ,&.> ('a,b';'c,d')
>> ┌────┬────┐
>> │,a,b│,c,d│
>> └────┴────┘
>>
>>    ',' , each ('a,b';'c,d')
>>
>> ┌────┬────┐
>> │,a,b│,c,d│
>> └────┴────┘
>>
>>
>> Now that each boxed string starts with a comma, we can cut on the first
>> character and drop it
>>
>>    <;._1 &> ',' , each ('a,b';'c,d')
>>
>>
>> Back to the beginning:
>>
>>    <;._1 &> ',' , each <;._2 ('1,2,"embedded comma",3.4',CR,LF,'5,6,"no comma",7.8',CR,LF)
>>
>> Split on comma - for each item - in a LF split string
>>
>> ┌─┬─┬────────────────┬────┐
>> │1│2│"embedded comma"│3.4 │
>> ├─┼─┼────────────────┼────┤
>> │5│6│"no comma"      │7.8 │
>> └─┴─┴────────────────┴────┘
>>
>>
>> Hope that helps. I learned more by going through it and wanted to share
>>
>> On Sat, Dec 7, 2013 at 5:44 PM, Devon McCormick wrote:
>>
>> > Yes - sorry for typing it in w/o testing it. Note that the point at which
>> > the error was picked up is indicated by extra spaces in the returned line:
>> >    mat=.<.; _1&>',',&.><;._2 CR-.~freads jpath'~temp/test.csv'
>> > |domain error
>> > |   mat=.<.; _1&>',',&.><;._2 CR-.~freads jpath'~temp/test.csv'
>> >
>> > A good way to debug a line like this is to look at successively longer
>> > pieces, starting w/the rightmost one, e.g.
>> > (on my system):
>> >    jpath '~temp/test.csv'
>> > c:/users/devonmcc/j64-701-user/temp/test.csv
>> >
>> > Do I have this file?
>> >    fexist jpath '~temp/test.csv'
>> > 0
>> >
>> > So, I don't have this file - I only used it to mimic the example you sent.
>> > If I create this file locally so I can continue looking at longer pieces:
>> >    ('1,2,"embedded, comma",3.4',CR,LF,'5,6,"no comma",7.8') fwrite 'test.csv'
>> > 45
>> >    fexist 'test.csv'
>> > 1
>> >
>> > BTW - "fexist" is defined
>> >    fexist=: 1:@(1!:4) ::0:@(([: < 8 u: >) ::]&>)@(<^:(L. = 0:))
>> > in case you don't have it.
>> >
>> > Continuing with longer fragments shows us what the data looks like at each
>> > step:
>> >    NB. mat=. <.;_1&>',',&.><;._2 CR-.~freads 'test.csv'
>> >    freads 'test.csv'
>> > 1,2,"embedded, comma",3.4
>> > 5,6,"no comma",7.8
>> >
>> >    CR-.~freads 'test.csv'
>> > 1,2,"embedded, comma",3.4
>> > 5,6,"no comma",7.8
>> >
>> >    <;._2 CR-.~freads 'test.csv'
>> > +-------------------------+------------------+
>> > |1,2,"embedded, comma",3.4|5,6,"no comma",7.8|
>> > +-------------------------+------------------+
>> >    ',',&.><;._2 CR-.~freads 'test.csv'
>> > +--------------------------+-------------------+
>> > |,1,2,"embedded, comma",3.4|,5,6,"no comma",7.8|
>> > +--------------------------+-------------------+
>> >    <.;_1&>',',&.><;._2 CR-.~freads 'test.csv'
>> > |domain error
>> > |   <.; _1&>',',&.><;._2 CR-.~freads'test.csv'
>> >
>> > Fixing the error:
>> >    <;._1&>',',&.><;._2 CR-.~freads 'test.csv'
>> > +-+-+----------+-------+---+
>> > |1|2|"embedded | comma"|3.4|
>> > +-+-+----------+-------+---+
>> > |5|6|"no comma"|7.8    |   |
>> > +-+-+----------+-------+---+
>> >
>> > On Sat, Dec 7, 2013 at 10:27 AM, Brian Schott wrote:
>> >
>> > > It looks like there is a typo in command with `mat`: .; should be ;. .
>> > > `mat` is not a verb but a noun, btw.
>> > > I think tilde is a dyadic tilde, not monadic and swaps the arguments of -.
>> > > in this case.
>> > >
>> > > On Sat, Dec 7, 2013 at 9:08 AM, Jon Hough wrote:
>> > >
>> > > > I'd like to thank everyone for replying.
>> > > > I suppose I should think about using J7.
>> > > >
>> > > > I did try Devon's example:
>> > > > "You can read CSV files in J pretty simply without using any predefined
>> > > > verbs like this:
>> > > >
>> > > >    mat=. <.;_1&>',',&.><;._2 CR-.~freads jpath '~temp/test.csv'
>> > > >
>> > > > and I got the error:
>> > > > |domain error
>> > > > |   mat=.<.; _1&>',',&.><;._2 CR-.~freads jpath'~temp/test.csv'
>> > > >
>> > > > As an aside, I don't really understand what the "mat" function is doing.
>> > > > I'm still reading "J for C Programmers" so my understanding is a little
>> > > > shaky, but mat seems to be monadic, with the argument as the file to read.
>> > > > I'm not sure if this is an example of a tacit verb, because the argument
>> > > > ('~temp/test.csv') seems to be hardcoded into the verb.
>> > > >
>> > > > I assume:
>> > > >    freads jpath '~temp/test.csv'
>> > > > reads the file. (http://www.jsoftware.com/user/script_files.htm)
>> > > > I do not really understand this: ~freads (I do not understand this use of
>> > > > the monadic tilde)
>> > > > I am trying to read this verb from right to left, but am not getting very
>> > > > far, even using the J dictionary and reference card for support.
>> > > > I would really appreciate any help at all in deciphering this.
>> > > >
>> > > > Thanks and regards,
>> > > > Jon
>> > > >
>> > > --
>> > > (B=) <-----my sig
>> > > Brian Schott
>> > > ----------------------------------------------------------------------
>> > > For information about J forums see http://www.jsoftware.com/forums.htm
>> >
>> > --
>> > Devon McCormick, CFA
>
> --
> Devon McCormick, CFA

--
Devon McCormick, CFA
