On Sat, Jan 4, 2014 at 5:04 PM, David Rosenberg <david.dav...@gmail.com> wrote:
> Hi Ole,
>
> Sorry -- wasn't sure if I should respond to the parallel list or to you
> personally.
Always the list.

> The function load_parallel_results seems like a nice way to load generic
> results data into R. In practice, however, the real power of R comes when
> the data fits into a data.frame or [my strong preference] a data.table.
> There would be nice ways to read this data in and stack it up into one big
> data.frame, I'm just not sure of a quick way to make reproducible sample
> data. Here's what I would do:
>
> INPUT:
>
> myvar1/1/myvar2/A/stdout
> Hello\t1
> Bye\t2
> Wow\t3
>
> myvar1/2/myvar2/A/stdout
> Interesting\t9
>
> myvar1/1/myvar2/B/stdout
> NewYork\t3
>
> In R, we would get a data.frame like this:
>   myvar1 myvar2          V1 V2
> 1      1      A       Hello  1
> 2      1      A         Bye  2
> 3      1      A         Wow  3
> 4      2      A Interesting  9
> 5      1      B     NewYork  3

Your idea requires the user to make sure the output is \t separated.

> There would be nice code to do this, if it's of interest. In my typical
> use-case, I would probably precede the load by making sure that all the
> stderr's are either empty, or don't have anything interesting in them, and
> then just load the stdouts.

I definitely see the use for this, so let us see if we can adapt that
into a generally useful function.

Your solution basically takes the output from:

  parallel --tag cmd ::: args ::: args

and puts it into a table by splitting on \t, except that you will be
reading from pre-computed result files.

I still believe there are situations where the output from the jobs does
not fit a simple table template, so we somehow need to be able to deal
with that, too. Maybe we could have an option that indicates the
splitting char. The default would be none = don't split:

> load_parallel_results(file, split="\t")
  myvar1 myvar2          V1 V2
1      1      A       Hello  1
2      1      A         Bye  2
3      1      A         Wow  3
4      2      A Interesting  9
5      1      B     NewYork  3

> load_parallel_results(file)
  myvar1 myvar2                       stdout stderr
1      1      A "Hello\t1\nBye\t2\nWow\t3\n"     ""
2      2      A          "Interesting\t9\n"     ""
3      1      B              "NewYork\t3\n"     ""

I am somewhat concerned that we simply ignore stderr in the first
situation. I am also somewhat concerned that the current function loads
all stdout/stderr files - even if they are never used. It would be better
if that could be done lazily - see
http://stackoverflow.com/questions/20923089/r-store-functions-in-a-data-frame

I believe I would prefer returning a data structure from which you can
select the relevant records based on the arguments. When you have the
records you want, you can ask to have the stdout/stderr read in and
possibly expanded into rows. This would scale to much bigger
stdout/stderr and many more jobs.

Maybe the trivial solution is to simply return a table of the args plus
the filenames of stdout/stderr, and then have a function that reads those
files into the table - a function you can run either immediately or after
you have selected the relevant rows. A rough sketch follows below.

/Ole
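P.S. Here is a rough sketch in base R of the two-step idea above. The
function names (parallel_results_index, read_results) are just
placeholders, not an existing API, and the code assumes the directory
layout from David's example (name/value/name/value/stdout):

# Step 1: index the result dir without reading any stdout/stderr.
parallel_results_index <- function(resdir) {
  rel <- list.files(resdir, pattern = "^stdout$", recursive = TRUE)
  # "myvar1/1/myvar2/A/stdout" -> c("myvar1", "1", "myvar2", "A")
  parts <- strsplit(dirname(rel), "/")
  args <- do.call(rbind, lapply(parts, function(p) {
    vals <- p[seq(2, length(p), by = 2)]          # every 2nd element is a value
    names(vals) <- p[seq(1, length(p), by = 2)]   # the rest are variable names
    as.data.frame(as.list(vals), stringsAsFactors = FALSE)
  }))
  args$stdout <- file.path(resdir, rel)
  args$stderr <- file.path(resdir, dirname(rel), "stderr")
  args
}

# Step 2: read the files for the (possibly subsetted) rows.  With no
# 'split' you get one row per job with the raw stdout/stderr text; with
# split="\t" each stdout is expanded into rows (assumes all jobs produce
# the same number of fields).
read_results <- function(index, split = NULL) {
  if (is.null(split)) {
    index$stdout <- vapply(index$stdout,
      function(f) paste(readLines(f), collapse = "\n"), character(1))
    index$stderr <- vapply(index$stderr,
      function(f) paste(readLines(f), collapse = "\n"), character(1))
    return(index)
  }
  argcols <- setdiff(names(index), c("stdout", "stderr"))
  do.call(rbind, lapply(seq_len(nrow(index)), function(i) {
    fields <- read.table(index$stdout[i], sep = split,
                         stringsAsFactors = FALSE)
    cbind(index[i, argcols, drop = FALSE], fields)
  }))
}

# Usage: index first, select rows, then read only what is needed:
#   idx <- parallel_results_index("res")
#   read_results(subset(idx, myvar2 == "A"), split = "\t")
#   read_results(idx)    # no split: raw stdout/stderr per job

This keeps the cheap part (listing the files and parsing the argument
values out of the paths) separate from the expensive part (reading the
files), so if you select rows first, only the files you actually need are
ever read.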