I created the fork for a very particular group of users (http://www.bakerlab.org/), so the features I implemented are pretty narrow. They run a serial protein folding code called Rosetta. They use about 2-3 million CPU hours a month running it on the HPC system I manage. Everyone in the lab uses GNU parallel or my fork. They also run a BOINC project, use US government supercomputers, and submit work to the Open Science Grid.
I planned to bring my fork to your attention at some point and get your feedback. I'd hoped to make more of GNU parallel's many features work with my modifications over time, but I'm primarily a systems engineer, so I don't have a lot of time to dedicate to it. I still have that ambition, and since you're interested, I may be able to find more time for the effort.

I originally used SQLite on GPFS, then MySQL, then PostgreSQL. SQLite on GPFS worked OK, but only up to ~100 tasks per minute. MySQL topped out at about 16,000 tasks per minute, but there seem to be issues with the locking code that cause performance to collapse beyond that (down to tens of tasks per minute). Postgres maxes out at 40,000 tasks per minute, but performance remains steady with additional clients.

I want to thank you for taking the time to write such versatile code. Once I understood the code, it was very easy to make the changes. I think I only had to change 2 lines of your existing code to make my modifications work. I think I've spent about 13 work days on it, and most of those were spent testing and tuning databases.

I think most of your suggestions would be easy to implement, but I'll think more about them over the coming days. It's an interesting idea to use the database as a post-processing tool. I'll ask around and see what people think.

Thanks again,
Stephen

On Thu, Jul 30, 2015 at 5:26 AM, Ole Tange <ta...@gnu.org> wrote:
> I just discovered a fork of GNU Parallel:
> https://github.com/stephen-fralich/parallel-sql/
>
> It saves into PostgreSQL.
>
> If GNU Parallel should have an --sql option, it should be more general
> than that. It would be obvious to use a DBURL to specify which driver,
> username, password, and database to use.
>
> The most obvious would be having a table containing the columns from
> --joblog and the arguments. For some uses it would also make sense to
> have the stderr+stdout.
>
> So I am thinking of:
>
>   --sql mysql://user:pass@host/db/table
>
> If the table does not exist: create it.
>
> But should there be an option to not store stderr+stdout? And if so:
> should that be the default? If saving is forced, then you can always
> just >/dev/null the output from the job.
>
> I can definitely see uses of being able to run 1000000 simulations
> with 10 different variables and then be able to easily get the output
> of the jobs where variable A is odd and > variable B (or similar).
>
> What should happen if the user uses variable names that are the same
> as the header of --joblog (e.g. Seq or stdout)?
>
> It would also be handy if you could change the status of a job to
> 'not-run' (which could be represented with exit status -2), so you
> could change this while GNU Parallel was running or add new jobs.
>
> You could then have workers that took jobs out of a database table:
>
>   forever parallel --sql mysql://user:pass@host/db/table
>
> And a master node that submitted jobs to the table:
>
>   parallel --dry-run --sql mysql://user:pass@host/db/table the_job ::: the args
>
> --dry-run with --sql should set the status to 'not-run'.
>
> But that would also require some sort of timeout handling: worker-2
> started job-seq-4 3 seconds ago and should not be considered timed
> out, so no other worker should take that job.
>
> GNU Parallel will not depend on the DBD packages being installed, but
> will only use them when the user asks for that driver. So in package
> speak it should probably 'suggest' the DBD packages.
>
> Ideas? Suggestions? Observations?
>
>
> /Ole
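
For illustration, here is a minimal sketch of the kind of table Ole describes above, assuming one column per --joblog field plus one per input variable. The table name "jobs", the columns V1/V2/Stdout/Stderr, and the PostgreSQL-style types are made-up examples for this sketch, not an actual GNU Parallel schema:

    -- Illustrative sketch only: columns mirror the --joblog header
    -- (Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal,
    -- Command); the table and variable column names are hypothetical.
    CREATE TABLE jobs (
        Seq        INTEGER PRIMARY KEY,
        Host       TEXT,
        Starttime  DOUBLE PRECISION,
        JobRuntime DOUBLE PRECISION,
        Send       BIGINT,
        Receive    BIGINT,
        Exitval    INTEGER,   -- -2 could mean 'not-run', as suggested above
        _Signal    INTEGER,   -- underscored because SIGNAL is reserved in MySQL
        Command    TEXT,
        V1         INTEGER,   -- first input variable (INTEGER just for this example)
        V2         INTEGER,   -- second input variable
        Stdout     TEXT,      -- only present if stdout/stderr are stored
        Stderr     TEXT
    );

    -- Requeue job 4 so an idle worker picks it up again:
    UPDATE jobs SET Exitval = -2 WHERE Seq = 4;

    -- Post-processing: output of jobs where variable A is odd and > variable B:
    SELECT Stdout FROM jobs WHERE V1 % 2 = 1 AND V1 > V2;

A worker loop along the lines of the "forever parallel --sql ..." command above could then repeatedly claim rows whose Exitval is -2, while the master only inserts rows via --dry-run.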