I created the fork for a very particular group of users (http://www.bakerlab.org/), so the features I implemented are pretty narrow. They run a serial protein folding code called Rosetta. They use about 2-3 million CPU hours a month running it on the HPC system I manage. Everyone in the lab uses GNU parallel or my fork. They also run a BOINC project, use US government supercomputers, and submit work to the Open Science Grid.
I planned to bring my fork to your attention at some point and get your feedback. I'd hoped to make more of GNU parallel's many features work with my modifications over time, but I'm primarily a systems engineer, so I don't have a lot of time to dedicate to it. I still have that ambition, and since you're interested, I may be able to find more time for the effort.

I originally used SQLite on GPFS, then MySQL, then PostgreSQL. SQLite on GPFS worked OK, but only up to ~100 tasks per minute. MySQL topped out at about 16,000 tasks per minute, but there seem to be issues with the locking code that cause performance to collapse beyond that (down to tens of tasks per minute). Postgres maxes out at 40,000 tasks per minute, but performance remains steady with additional clients.

I want to thank you for taking the time to write such versatile code. Once I understood the code, it was very easy to make the changes. I think I only had to change 2 lines of your existing code to make my modifications work. I think I've spent about 13 work days on it, and most of those were spent testing and tuning databases.

I think most of your suggestions would be easy to implement, but I'll think more about them over the coming days. It's an interesting idea to use the database as a post-processing tool. I'll ask around and see what people think.

Thanks again,
Stephen

On Thu, Jul 30, 2015 at 5:26 AM, Ole Tange <ta...@gnu.org> wrote:
> I just discovered a fork of GNU Parallel:
> https://github.com/stephen-fralich/parallel-sql/
>
> It saves into PostgreSQL.
>
> If GNU Parallel should have an --sql option, it should be more general
> than that. It would be obvious to use a DBURL to specify which driver,
> username, password, and database to use.
>
> The most obvious would be having a table containing the columns from
> --joblog and the arguments. For some uses it would also make sense to
> have the stderr+stdout.
>
> So I am thinking of:
>
>   --sql mysql://user:pass@host/db/table
>
> If the table does not exist: create it.
>
> But should there be an option to not store stderr+stdout? And if so:
> should that be the default? If saving is forced, then you can always
> just >/dev/null the output from the job.
>
> I can definitely see uses of being able to run 1000000 simulations
> with 10 different variables and then be able to easily get the output
> of the jobs where variable A is odd and > variable B (or similar).
>
> What should happen if the user uses variable names that are the same
> as the header of --joblog (e.g. Seq or stdout)?
>
> It would also be handy if you could change the status of a job to
> 'not-run' (which could be represented with exit status -2), so you
> could change this while GNU Parallel was running or add new jobs.
>
> You could then have workers that took jobs out of a database table:
>
>   forever parallel --sql mysql://user:pass@host/db/table
>
> And a master node that submitted jobs to the table:
>
>   parallel --dry-run --sql mysql://user:pass@host/db/table the_job ::: the args
>
> --dry-run with --sql should set the status to 'not-run'.
>
> But that would also require some sort of timeout handling: worker-2
> started job-seq-4 3 seconds ago and should not be considered timed
> out, so no other worker should take that job.
>
> GNU Parallel will not depend on the DBD packages being installed, but
> will only use them when the user asks for that driver. So in package
> speak it should probably 'suggest' the DBD packages.
>
> Ideas? Suggestions? Observations?
>
>
> /Ole
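
For illustration, here is a minimal sketch of the kind of table Ole describes above, assuming one column per --joblog field plus one per input variable. The table name "jobs", the columns V1/V2/Stdout/Stderr, and the PostgreSQL-style types are made-up examples for this sketch, not an actual GNU Parallel schema:

    -- Illustrative sketch only: columns mirror the --joblog header
    -- (Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal,
    -- Command); the table and variable column names are hypothetical.
    CREATE TABLE jobs (
        Seq        INTEGER PRIMARY KEY,
        Host       TEXT,
        Starttime  DOUBLE PRECISION,
        JobRuntime DOUBLE PRECISION,
        Send       BIGINT,
        Receive    BIGINT,
        Exitval    INTEGER,   -- -2 could mean 'not-run', as suggested above
        _Signal    INTEGER,   -- underscored because SIGNAL is reserved in MySQL
        Command    TEXT,
        V1         INTEGER,   -- first input variable (INTEGER just for this example)
        V2         INTEGER,   -- second input variable
        Stdout     TEXT,      -- only present if stdout/stderr are stored
        Stderr     TEXT
    );

    -- Requeue job 4 so an idle worker picks it up again:
    UPDATE jobs SET Exitval = -2 WHERE Seq = 4;

    -- Post-processing: output of jobs where variable A is odd and > variable B:
    SELECT Stdout FROM jobs WHERE V1 % 2 = 1 AND V1 > V2;

A worker loop along the lines of the "forever parallel --sql ..." command above could then repeatedly claim rows whose Exitval is -2, while the master only inserts rows via --dry-run.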