On Sat, Jan 4, 2014 at 5:04 PM, David Rosenberg <david.dav...@gmail.com> wrote:
> Hi Ole,
>
> Sorry -- wasn't sure if I should respond to the parallel list or to you
> personally.
Always the list.

> The function load_parallel_results seems like a nice way to load generic
> results data into R. In practice, however, the real power of R comes when
> the data fits into a data.frame or [my strong preference] a data.table.
> There would be nice ways to read this data in and stack it up into one big
> data.frame, I'm just not sure of a quick way to make reproducible sample
> data. Here's what I would do:
>
> INPUT:
>
> myvar1/1/myvar2/A/stdout
> Hello\t1
> Bye\t2
> Wow\t3
>
> myvar1/2/myvar2/A/stdout
> Interesting\t9
>
> myvar1/1/myvar2/B/stdout
> NewYork\t3
>
> In R, we would get a data.frame like this:
>   myvar1 myvar2          V1 V2
> 1      1      A       Hello  1
> 2      1      A         Bye  2
> 3      1      A         Wow  3
> 4      2      A Interesting  9
> 5      1      B     NewYork  3

Your idea requires the user to make sure the output is \t separated.

> There would be nice code to do this, if it's of interest. In my typical
> use-case, I would probably precede the load by making sure that all the
> stderr's are either empty, or don't have anything interesting in them, and
> then just load the stdouts.

I definitely see the use for this, so let us see if we can adapt that
into a generally useful function.

Your solution basically takes the output from:

  parallel --tag cmd ::: args ::: args

and puts it into a table by splitting on \t, except that you will be
reading from pre-computed result files.

I still believe there are situations where the output from the jobs does
not fit a simple table template, so we somehow need to be able to deal
with that, too. Maybe we could have an option that indicates the
splitting char. The default would be none = don't split:

> load_parallel_results(file, split="\t")
  myvar1 myvar2          V1 V2
1      1      A       Hello  1
2      1      A         Bye  2
3      1      A         Wow  3
4      2      A Interesting  9
5      1      B     NewYork  3

> load_parallel_results(file)
  myvar1 myvar2                       stdout stderr
1      1      A "Hello\t1\nBye\t2\nWow\t3\n"     ""
2      2      A          "Interesting\t9\n"     ""
3      1      B              "NewYork\t3\n"     ""

I am somewhat concerned that we simply ignore stderr in the first
situation. I am also somewhat concerned that the current function loads
all stdout/stderr files - even if they are never used. It would be better
if that could be done lazily - see
http://stackoverflow.com/questions/20923089/r-store-functions-in-a-data-frame

I believe I would prefer returning a data structure from which you can
select the relevant records based on the arguments. When you have the
records you want, you can ask to have the stdout/stderr read in and
possibly expanded into rows. This would scale to much bigger
stdout/stderr and many more jobs.

Maybe the trivial solution is to simply return a table of the args plus
the filenames of stdout/stderr, and then have a function that reads those
files into the table - a function you can run either immediately or after
you have selected the relevant rows. A rough sketch follows below.

/Ole
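P.S. Here is a rough sketch in base R of the two-step idea above. The
function names (parallel_results_index, read_results) are just
placeholders, not an existing API, and the code assumes the directory
layout from David's example (name/value/name/value/stdout):

# Step 1: index the result dir without reading any stdout/stderr.
parallel_results_index <- function(resdir) {
  rel <- list.files(resdir, pattern = "^stdout$", recursive = TRUE)
  # "myvar1/1/myvar2/A/stdout" -> c("myvar1", "1", "myvar2", "A")
  parts <- strsplit(dirname(rel), "/")
  args <- do.call(rbind, lapply(parts, function(p) {
    vals <- p[seq(2, length(p), by = 2)]          # every 2nd element is a value
    names(vals) <- p[seq(1, length(p), by = 2)]   # the rest are variable names
    as.data.frame(as.list(vals), stringsAsFactors = FALSE)
  }))
  args$stdout <- file.path(resdir, rel)
  args$stderr <- file.path(resdir, dirname(rel), "stderr")
  args
}

# Step 2: read the files for the (possibly subsetted) rows.  With no
# 'split' you get one row per job with the raw stdout/stderr text; with
# split="\t" each stdout is expanded into rows (assumes all jobs produce
# the same number of fields).
read_results <- function(index, split = NULL) {
  if (is.null(split)) {
    index$stdout <- vapply(index$stdout,
      function(f) paste(readLines(f), collapse = "\n"), character(1))
    index$stderr <- vapply(index$stderr,
      function(f) paste(readLines(f), collapse = "\n"), character(1))
    return(index)
  }
  argcols <- setdiff(names(index), c("stdout", "stderr"))
  do.call(rbind, lapply(seq_len(nrow(index)), function(i) {
    fields <- read.table(index$stdout[i], sep = split,
                         stringsAsFactors = FALSE)
    cbind(index[i, argcols, drop = FALSE], fields)
  }))
}

# Usage: index first, select rows, then read only what is needed:
#   idx <- parallel_results_index("res")
#   read_results(subset(idx, myvar2 == "A"), split = "\t")
#   read_results(idx)    # no split: raw stdout/stderr per job

This keeps the cheap part (listing the files and parsing the argument
values out of the paths) separate from the expensive part (reading the
files), so if you select rows first, only the files you actually need are
ever read.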