I can give some insight from the physics world as far as this goes.

First off, I think the dataflow puck is moving to platforms like Apache Beam.
The main reason people in science don't just use Beam is that they don't
control the clusters they execute on, which is almost always the case for
science projects using grid resources.

This model is often desirable for scientific grid-based processing systems,
which are decentralized and involve staging data in and out, since the
execution environment is inherently sandboxed. These are often integrated with
other grid frameworks (e.g. DIRAC, http://diracgrid.org/, or PegasusWMS) which
have their own data catalog built in to aid with data movement and staging, or
sometimes they'll use another system for that management (iRODS,
https://irods.org/). In many of those cases you deal with logical file handles
as the inputs/outputs, but the file management system also owns the data. I've
done some work in this space as well, in that I've written a file replica
management system (https://github.com/slaclab/datacat), but in that model the
system is just a global metadata database about file replicas and doesn't
"own" the file/data. A common requirement of these systems is also support for
data versioning and processing provenance.
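
To make the "logical file handle" idea a bit more concrete, here's a toy
Python sketch of what a replica catalog boils down to: a mapping from logical
path + version to metadata and a list of physical replicas, where the catalog
never owns or moves the bytes. The class and path names are made up for
illustration; this is not datacat's or iRODS's actual API.

# Toy replica catalog: just metadata about logical files and where their
# replicas live; it never owns or moves the data. Names are illustrative,
# not the real datacat/iRODS API.

class Replica(object):
    def __init__(self, site, physical_path):
        self.site = site                    # e.g. "SLAC", "IN2P3"
        self.physical_path = physical_path  # where the bytes actually live


class LogicalFile(object):
    def __init__(self, logical_path, version, metadata=None):
        self.logical_path = logical_path  # what workflows refer to
        self.version = version            # data versioning requirement
        self.metadata = metadata or {}    # provenance, checksums, run ids, ...
        self.replicas = []


class ReplicaCatalog(object):
    def __init__(self):
        self._files = {}

    def register(self, logical_path, version, metadata=None):
        self._files[(logical_path, version)] = LogicalFile(
            logical_path, version, metadata)

    def add_replica(self, logical_path, version, site, physical_path):
        self._files[(logical_path, version)].replicas.append(
            Replica(site, physical_path))

    def locate(self, logical_path, version):
        # A task asks for a logical path; the catalog answers with replicas.
        return self._files[(logical_path, version)].replicas


catalog = ReplicaCatalog()
catalog.register("/exp/run42/events-001.root", 1,
                 metadata={"producer": "reco-pipeline", "run": 42})
catalog.add_replica("/exp/run42/events-001.root", 1,
                    "SLAC", "/gpfs/exp/run42/events-001.root")
catalog.add_replica("/exp/run42/events-001.root", 1,
                    "NERSC", "/scratch/exp/run42/events-001.root")

for replica in catalog.locate("/exp/run42/events-001.root", 1):
    print(replica.site, replica.physical_path)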

The nice thing about dataflow is obviously the declarative data products. The
messy thing is dealing with the data movement, especially when it's not
mandatory (e.g. a single datacenter and/or shared network disk). I'm partial
towards the procedural nature of data movement using downstream processes and
a DAG, but part of that is because I've had to deal with finicky File Transfer
Nodes at different computing facilities to take full advantage of the
bandwidth available for file transfers. In most of the physics world (e.g.
CERN), they also use dCache (https://www.dcache.org/) or xrootd
(http://xrootd.org/) to aid in data movement, though some of the frameworks
also support this natively (like DIRAC, as mentioned).
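
For what it's worth, the procedural style I have in mind looks roughly like
this in Airflow terms: staging data in and out are just ordinary nodes in the
DAG, upstream and downstream of the compute step. This is only a sketch using
Airflow's BashOperator; the transfer.sh / process_run.sh scripts and the
gsiftp URLs are placeholders for whatever FTN / dCache / xrootd tooling a
site actually uses.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {"owner": "brian", "start_date": datetime(2017, 1, 1)}

dag = DAG("stage_process_stage", default_args=default_args,
          schedule_interval=None)

# Pull inputs from the remote facility onto local scratch.
stage_in = BashOperator(
    task_id="stage_in",
    bash_command="transfer.sh --in gsiftp://remote.site/data/run42 /scratch/run42",
    dag=dag)

# The compute step only ever sees local paths.
process = BashOperator(
    task_id="process",
    bash_command="process_run.sh /scratch/run42 /scratch/run42-out",
    dag=dag)

# Push products back out (and, in real life, register them in the catalog).
stage_out = BashOperator(
    task_id="stage_out",
    bash_command="transfer.sh --out /scratch/run42-out gsiftp://remote.site/products/run42",
    dag=dag)

stage_in.set_downstream(process)
process.set_downstream(stage_out)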

I will say that a frustrating thing about many of these tools and frameworks
I've mentioned is that they're often hard to use à la carte.

Brian


On Jan 23, 2017, at 8:05 AM, Bolke de Bruin <[email protected]> wrote:

Hi All,

I came across a write-up of some of the downsides of current workflow
management systems like Airflow and Luigi
(http://bionics.it/posts/workflows-dataflow-not-task-deps), where they argue
that dependencies should be between the inputs and outputs of tasks
(inlets/outlets) rather than between the tasks themselves.

They extended Luigi (https://github.com/pharmbio/sciluigi) to do this and even 
published a scientific paper on it: 
http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0179-6 .

I kind of like the idea. Has anyone played with it? Any thoughts? I might want
to try it in Airflow.

Bolke
