Hi Software-Carpentry Discuss,

At the Computational Brain Anatomy (CoBrA) Lab at the Douglas Institute in
Montreal, the Kimel Family Translational Imaging-Genetics Lab at CAMH in
Toronto, and in neuroscience in general, we have a great need to stitch
together many small command-line data-processing tools (minc-toolkit,
etc.) and run them against very large datasets. At some points in the
pipeline these tools can run against all the input subjects in parallel,
but at other points we need the previous steps to be completed so we can
aggregate across subjects.
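
To make the pattern concrete, here is a minimal sketch in plain bash: the
per-subject step runs in the background and `wait` acts as the barrier
before aggregation. The `correct` tool is from our example below;
`aggregate` is a hypothetical stand-in for any cross-subject step.

```sh
# per-subject correction: each run is independent, so background them all
for subject in input/subjects/*; do
    correct "$subject" "output/nuc/$(basename "$subject")" &
done
wait  # barrier: aggregation needs every subject finished

# hypothetical cross-subject step that needs all outputs in place
aggregate output/nuc/* output/group_average.mnc
```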

In searching for a tool to manage this workflow, we found several
candidates (nipype, ruffus, taverna, pydpiper, joblib). But each of these
tools either requires programming the file input/output management
yourself or writing new classes for the pipeline tool. This doesn't fit
well with our user base of non-programmers who have a general
understanding of scripting. We want to make it as easy as possible for
them to transform a serial bash script into something that can run in
parallel on a supercomputer.
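
For the embarrassingly parallel steps, something as lightweight as GNU
parallel gets close to the ease of use we want (here `{/}` is GNU
parallel's basename substitution), but it only covers single-machine
parallelism and doesn't express the dependencies between stages:

```sh
# run the correction step over all inputs, one job per core
parallel 'correct {} output/nuc/{/}' ::: input/atlases/* input/subjects/*
```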

Having found no suitable tool, we are considering developing our own,
dubbed "Pipeliner - The stupid pipeline maker", which will live at
https://github.com/CobraLab/pipeliner

We have posted a "functional" prototype of what Pipeliner would do; see
https://github.com/CobraLab/pipeliner/issues/1

Below is an example of serial bash code we'd like to be able to parallelize:
```sh
# correct all images before we begin
for image in input/atlases/* input/subjects/*; do
    correct "$image" "output/nuc/$(basename "$image")"
done

# register all atlases to each subject
for atlas in input/atlases/*; do
    for subject in input/subjects/*; do
        register "$atlas" "$subject" \
            "output/registrations/$(basename "$atlas")/$(basename "$subject")/reg.xfm"
    done
done

# create an average transformation for each subject
for subject in input/subjects/*; do
    subjectname=$(basename "$subject")
    xfmaverage output/registrations/*/"$subjectname"/reg.xfm \
        "output/averagexfm/$subjectname.xfm"
done
```
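
For comparison, here is a hand-parallelized version of the same script,
just to show the structure we'd want a tool to generate for us: each
stage's commands run in the background and `wait` is the barrier between
stages. This works on one big machine but doesn't scale out to a cluster
scheduler.

```sh
# correct all images concurrently, then wait before registering
for image in input/atlases/* input/subjects/*; do
    correct "$image" "output/nuc/$(basename "$image")" &
done
wait

# every atlas-to-subject registration is independent of the others
for atlas in input/atlases/*; do
    for subject in input/subjects/*; do
        register "$atlas" "$subject" \
            "output/registrations/$(basename "$atlas")/$(basename "$subject")/reg.xfm" &
    done
done
wait

# averaging a subject needs all of that subject's registrations finished
for subject in input/subjects/*; do
    subjectname=$(basename "$subject")
    xfmaverage output/registrations/*/"$subjectname"/reg.xfm \
        "output/averagexfm/$subjectname.xfm" &
done
wait
```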

This tool would generate an internal representation of a set of commands
and then use a number of output plugins to produce bash scripts,
GridEngine jobs, Slurm jobs, or other outputs.
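
As a sketch of what one output plugin might emit (illustrative, not a
committed design), the first stage of the example above could become a
Slurm array job, with each later stage submitted using
--dependency=afterok on the previous stage's job ID so the barriers are
preserved. The job name is hypothetical and the array bound would be
filled in by the generator:

```sh
#!/bin/bash
#SBATCH --job-name=pipeliner-stage1-correct
#SBATCH --array=0-99   # one task per input image; count is illustrative

# each array task corrects one image
files=(input/atlases/* input/subjects/*)
image="${files[$SLURM_ARRAY_TASK_ID]}"
correct "$image" "output/nuc/$(basename "$image")"
```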

Does anyone have experience creating workflows like this, or know of an
existing tool we could use instead of rolling our own? We welcome
comments, suggestions, pointers to projects that have already done this,
and collaborators to help build this tool. Thanks everyone for your help!


-- 
Gabriel A. Devenyi B.Eng. Ph.D.
e: [email protected]