On Wednesday, 27 December 2017 17:14:51 UTC, Roc wrote:
>
> The primary focus of these workflow tools usually seems to be the 
> dependency / pipeline aspect, and they appear not quite robust (at least 
> the ones I tried) with respect to interacting with guarded HPC systems 
> (e.g., requiring SSH multi-hop and not allowing long-running processes on 
> login nodes). They are probably fine if we can run them directly on the 
> login nodes, or can keep a daemon running on the login nodes. However, for 
> the use case of setting up jobs on personal workstation, and submitting 
> them to multiple HPC systems opportunistically, they all might require the 
> development of some new plugins. I felt that Ansible may be a suitable 
> framework for such plugins because it's quite reliable for various SSH 
> scenarios and does not use a remote agent.
>
>
>
This is getting slightly off-topic for Ansible, but if you want to submit 
to multiple clusters at different sites, you are going to run into 
scheduling problems and differing scheduler configurations, data locality 
issues if the data sets are big, and probably a number of other issues. 
These problems are reasonably well solved in the bioinformatics and HEP 
(high-energy physics) spaces. I've found that most pipeline tools work best 
with homogeneous systems, and that failing fast and early is actually a 
good thing, as you don't want tasks in your pipeline to continue if you 
have bad data. 
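
To make the fail-fast point concrete, here is a minimal sketch in Python. 
It is not taken from any particular pipeline tool; the names BadData and 
run_pipeline, and the (name, transform, validator) shape of a stage, are 
assumptions made purely for illustration.

```python
from typing import Callable, Iterable, Tuple


class BadData(Exception):
    """Raised when a stage's output fails its validation check."""


# Hypothetical stage shape: (name, transform, validator).
Stage = Tuple[str, Callable[[list], list], Callable[[list], bool]]


def run_pipeline(data: list, stages: Iterable[Stage]) -> list:
    """Run stages in order and stop at the first one that yields bad data."""
    for name, transform, is_valid in stages:
        data = transform(data)
        if not is_valid(data):
            # Fail fast: later stages never see the invalid data.
            raise BadData(f"stage {name!r} produced invalid data; stopping")
    return data
```

Because the exception is raised as soon as a validator returns False, no 
later stage consumes output that is already known to be bad.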

The biggest challenges you'll probably come across are how to track and 
check job IDs, how to decide whether a job/task has run successfully, how 
to define where the data is, how to offer data validation steps, and how to 
do all of this in a disconnected way, since the user on the laptop might 
disconnect from the network. If you're doing things opportunistically, 
you'll want to stage data in before jobs are set up, and then you run into 
a garbage collection problem for jobs that failed or didn't run. You would 
probably need to come up with a scheme for doing this before implementing 
any plugin(s) for Ansible. This all assumes that you have already made sure 
the executables are the same version on the different clusters, as versions 
matter in some simulations.
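
One possible shape for such a scheme, sketched in Python under stated 
assumptions: a small ledger kept on the workstation records each job ID, 
the cluster it went to, and where its input data was staged, so tracking 
survives a disconnect and failed jobs can be found for cleanup later. The 
names JobRecord and JobLedger, the state strings, and the JSON file layout 
are all hypothetical; a real scheduler has its own state vocabulary, and 
querying it (over SSH, for instance) is left out here.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Dict, List

# Hypothetical states; real schedulers report their own set of states.
PENDING, DONE, FAILED = "pending", "done", "failed"


@dataclass
class JobRecord:
    cluster: str      # site the job was submitted to
    job_id: str       # ID returned by that site's scheduler
    staged_path: str  # where the input data was staged on the cluster
    state: str = PENDING


class JobLedger:
    """Keep job records in a local JSON file so they survive a disconnect."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.jobs: Dict[str, JobRecord] = {}
        if os.path.exists(path):
            with open(path) as fh:
                raw = json.load(fh)
            self.jobs = {key: JobRecord(**rec) for key, rec in raw.items()}

    def _save(self) -> None:
        # Write to a temporary file and rename it into place, so an
        # interrupted write does not leave a half-written ledger behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as fh:
            json.dump({k: asdict(v) for k, v in self.jobs.items()}, fh)
        os.replace(tmp, self.path)

    def record(self, job: JobRecord) -> None:
        self.jobs[f"{job.cluster}:{job.job_id}"] = job
        self._save()

    def update(self, cluster: str, job_id: str, state: str) -> None:
        self.jobs[f"{cluster}:{job_id}"].state = state
        self._save()

    def garbage(self) -> List[str]:
        """Staged paths of failed jobs, i.e. candidates for cleanup."""
        return [j.staged_path for j in self.jobs.values() if j.state == FAILED]
```

Keying records by cluster and job ID together matters because job IDs are 
only unique within one scheduler, not across sites.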

JFTR, the Luigi system has an SSH plugin that lets you scp things in and 
out of a cluster.
