I can give some insight from the physics world on this. First off, I think the dataflow puck is moving to platforms like Apache Beam. The main reason people in science don't just use Beam is that they don't control the clusters they execute on, which is almost always the case for science projects using grid resources.
This model is often desirable for scientific grid-based processing systems, which are inherently decentralized and involve staging data in and out, since the execution environment is sandboxed. These systems are often integrated with other grid frameworks (e.g. DIRAC, http://diracgrid.org/, or PegasusWMS) that have their own data catalog built in to help with data movement and staging, or they use another system for that management (iRODS, https://irods.org/). In many of those cases you deal with logical file handles as the inputs/outputs, but the file management systems also own the data.

I've done some work in this space as well, in that I've written a file replica management system (https://github.com/slaclab/datacat), but in this model the system is just a global metadata database about file replicas and doesn't "own" the file/data. A common additional requirement for these systems is support for data versions and processing provenance.

The nice thing about dataflow is obviously the declarative data products. The messy thing is dealing with data movement, especially when it isn't mandatory (e.g. a single datacenter and/or a shared network disk). I'm partial to the procedural nature of data movement using downstream processes and a DAG, but part of that is because I've had to deal with finicky File Transfer Nodes at different computing facilities to take full advantage of the bandwidth available for file transfers. In most of the physics world (e.g. CERN), they also use dCache (https://www.dcache.org/) or xrootd (http://xrootd.org/) to aid in data movement, though some of the frameworks support this natively (like DIRAC, as mentioned).

I will say that a frustrating thing about many of the tools and frameworks I've mentioned is that they are often hard to use à la carte.

Brian

On Jan 23, 2017, at 8:05 AM, Bolke de Bruin <[email protected]> wrote:

Hi All,

I came across a write-up of some of the downsides of current workflow management systems like Airflow and Luigi (http://bionics.it/posts/workflows-dataflow-not-task-deps), where they argue that dependencies should be between the inputs and outputs of tasks rather than between the tasks themselves (inlets/outlets). They extended Luigi (https://github.com/pharmbio/sciluigi) to do this and even published a scientific paper on it: http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0179-6

I kind of like the idea. Has anyone played with it? Any thoughts? I might want to try it in Airflow.

Bolke
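
To make the inlets/outlets idea from the quoted write-up concrete, here is a rough sketch in the style of sciluigi's published examples. The task and file names are invented, and the API details are reproduced from memory of those examples rather than verified against the library, so treat this as illustrative only.

# A rough sketch, in the style of sciluigi's examples, of declaring dependencies
# between task outputs and inputs rather than between the tasks themselves.
# Task names and file paths here are made up for illustration.
import sciluigi as sl


class RawData(sl.ExternalTask):
    # An "outlet": a named output pointing at an existing file.
    def out_rawdata(self):
        return sl.TargetInfo(self, 'data/rawdata.txt')


class ProcessData(sl.Task):
    # An "inlet": filled in by the workflow, not by naming an upstream task.
    in_rawdata = None

    def out_processed(self):
        return sl.TargetInfo(self, self.in_rawdata().path + '.processed')

    def run(self):
        with self.in_rawdata().open() as infile, self.out_processed().open('w') as outfile:
            for line in infile:
                outfile.write(line.upper())


class MyWorkflow(sl.WorkflowTask):
    def workflow(self):
        rawdata = self.new_task('rawdata', RawData)
        processed = self.new_task('processed', ProcessData)
        # The dependency is declared by connecting an output to an input.
        processed.in_rawdata = rawdata.out_rawdata
        return processed


if __name__ == '__main__':
    sl.run_local(main_task_cls=MyWorkflow)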
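
For contrast, a minimal, hypothetical Airflow DAG sketching the procedural staging pattern described above: an explicit transfer task (here shelling out to xrootd's xrdcp) wired upstream of the processing step. The storage URL, scratch paths, and analysis command are placeholders, and here the dependency is declared between the tasks themselves rather than between their data products.

# A hypothetical Airflow DAG: stage a file in with an explicit transfer task,
# then process it. Paths, URLs, and the analysis command are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('stage_and_process',
          start_date=datetime(2017, 1, 1),
          schedule_interval=None)

# Stage the input file in from a remote storage element.
stage_in = BashOperator(
    task_id='stage_in',
    bash_command='xrdcp root://storage.example.org//store/run123.root /scratch/run123.root',
    dag=dag)

# Process the staged file; the dependency is on the transfer task, not on a declared data product.
process = BashOperator(
    task_id='process',
    bash_command='my_analysis /scratch/run123.root --out /scratch/run123.hist',
    dag=dag)

stage_in >> process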
