On Wednesday, 27 December 2017 17:14:51 UTC, Roc wrote:
>
> The primary focus of these workflow tools usually seems to be the
> dependency / pipeline aspect, and they appear not quite robust (at least
> the ones I tried) with respect to interacting with guarded HPC systems
> (e.g., requiring SSH multi-hop and not allowing long-running processes on
> login nodes). They are probably fine if we can run them directly on the
> login nodes, or can keep a daemon running on the login nodes. However, for
> the use case of setting up jobs on personal workstation, and submitting
> them to multiple HPC systems opportunistically, they all might require the
> development of some new plugins. I felt that Ansible may be a suitable
> framework for such plugins because it's quite reliable for various SSH
> scenarios and does not use a remote agent.
>

This is getting slightly off-topic for Ansible, but if you want to submit to multiple clusters at different sites, you are going to run into scheduling problems, differing scheduler configurations, data locality issues if the data sets are big, and probably a number of other issues. These problems are reasonably well solved in the bioinformatics and HEP spaces. I've found that most pipeline tools work best with homogeneous systems, and their behavior of failing fast and early is actually a good thing, since you don't want tasks in your pipeline to continue if you have bad data.

The biggest challenge you'll probably come across is how to track and check job IDs, how to decide whether a job/task has run successfully, how to define where the data is, how to offer data validation steps, and how to do all of this in a disconnected way, since the user on the laptop might disconnect from the network. If you're doing things opportunistically, you'll want to stage data in before jobs are set up, and then you run into a garbage collection problem for jobs that failed or never ran. You would probably need to come up with a scheme for doing this before implementing any plugin(s) for Ansible.

This all assumes you have already made sure the executables are the same version on the different clusters, since versions matter in some simulations.

JFTR, the luigi system has an SSH plugin that lets you scp things in and out of a cluster.
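
To make the bookkeeping problem concrete, here is a minimal sketch of a local job ledger in Python. It is only an illustration of the kind of scheme I mean, not anything Ansible or luigi provides: the names `JobRecord`, `Ledger`, the state strings and the JSON layout are all made up for this example. The idea is that the workstation keeps the record of what was staged and submitted where, so it can reconnect later and still know which job IDs to poll and which staged data to clean up.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import Dict, List

# Hypothetical job states for this sketch; real schedulers report their own.
STAGED, SUBMITTED, DONE, FAILED = "staged", "submitted", "done", "failed"


@dataclass
class JobRecord:
    name: str                # local, user-chosen task name
    cluster: str             # which site the job was sent to
    staged_paths: List[str]  # remote paths of data staged in for this job
    state: str = STAGED
    job_id: str = ""         # scheduler job ID, empty until submitted


class Ledger:
    """Job records kept in a local JSON file so they survive disconnects."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        self.jobs: Dict[str, JobRecord] = {}
        if self.path.exists():
            raw = json.loads(self.path.read_text())
            self.jobs = {k: JobRecord(**v) for k, v in raw.items()}

    def save(self) -> None:
        data = {k: asdict(v) for k, v in self.jobs.items()}
        self.path.write_text(json.dumps(data, indent=2))

    def add(self, record: JobRecord) -> None:
        self.jobs[record.name] = record
        self.save()

    def update(self, name: str, state: str, job_id: str = None) -> None:
        rec = self.jobs[name]
        rec.state = state
        if job_id is not None:
            rec.job_id = job_id
        self.save()

    def garbage(self) -> Dict[str, List[str]]:
        """Staged paths, grouped by cluster, for jobs that failed or were
        staged but never submitted. Treating never-submitted jobs as
        garbage is a policy assumption of this sketch."""
        out: Dict[str, List[str]] = {}
        for rec in self.jobs.values():
            if rec.state in (FAILED, STAGED):
                out.setdefault(rec.cluster, []).extend(rec.staged_paths)
        return out
```

A plugin would call `add()` after staging data, `update()` after submitting and after each status poll, and `garbage()` to decide what to delete on each cluster. How the status is actually obtained from each scheduler, and how outputs are validated, is left out here on purpose, since that is exactly the part that differs between sites.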
