This is a very interesting discussion. Laura, very cool extension in
fileflow! I haven't delved into the actual code much yet, but the docs
give a great overview.

Some thoughts:

For "small" data, Airflow already supports a dataflow setup through the
XCom mechanism. It is considerably more cumbersome to set up than "vanilla"
Airflow, but with a little syntactical sugar we could easily allow users to
solve what I believe are the two largest blocks to coding in this style:
  1. Easily specify data dependencies from upstream tasks -- in other words,
automate xcom_pull so that data objects are immediately available as function
parameters or in the Operator context/template. (I believe xcom_push is
already automated in a simple fashion whenever an Operator returns a value.)
  2. Tie task failure to the contents of that task's XComs: the task might
appear to succeed in the traditional Airflow sense but would nonetheless fail
if, for example, it pushed an empty XCom. (A sketch of both ideas follows
below.)
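
To make (1) and (2) concrete, here's a rough sketch of what that sugar could
look like, using nothing beyond PythonOperator, provide_context, xcom_pull
and AirflowException. The data_task helper, its upstream argument, and the
toy fetch_rows/clean_rows callables are names I'm inventing for illustration,
not anything that exists today:

    # Sketch only -- data_task/upstream/fetch_rows/clean_rows are invented
    # names; the Airflow pieces (DAG, PythonOperator, xcom_pull,
    # AirflowException) are real.
    from datetime import datetime

    from airflow import DAG
    from airflow.exceptions import AirflowException
    from airflow.operators.python_operator import PythonOperator

    dag = DAG('dataflow_sugar_example', start_date=datetime(2017, 1, 1),
              schedule_interval=None)

    def data_task(dag, task_id, func, upstream=None):
        """Wrap a plain callable so upstream XCom values arrive as kwargs."""
        upstream = upstream or []

        def _callable(**context):
            ti = context['ti']
            # (1) automate xcom_pull: one kwarg per upstream task's result
            inputs = {t: ti.xcom_pull(task_ids=t) for t in upstream}
            result = func(**inputs)
            # (2) tie task failure to the task's data, not just to exceptions
            if not result:
                raise AirflowException('%s produced an empty result' % task_id)
            return result  # return values are xcom_pushed automatically

        return PythonOperator(task_id=task_id, python_callable=_callable,
                              provide_context=True, dag=dag)

    def fetch_rows():
        return [{'id': 1}, {'id': 2}]

    def clean_rows(extract):  # kwarg name matches the upstream task_id
        return [row for row in extract if row['id'] > 0]

    extract = data_task(dag, 'extract', fetch_rows)
    transform = data_task(dag, 'transform', clean_rows, upstream=['extract'])
    transform.set_upstream(extract)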

For "large" data, the XCom system breaks down and something like Laura's
fileflow is needed, simply because the Airflow DB wasn't designed as a blob
store.

If we would like to make this a first-class mode for Airflow, here are some
starting points:
  1. The two modifications I described above, essentially lowering the
barrier to using XComs to convey data between Operators.
  2. Borrowing from fileflow, providing a configurable backend for XComs --
GCS or S3 (or another database entirely). For example, a GCS-backed XCom
would use the built-in XCom mechanism to pass a gcs:// URI between tasks,
and automatically serialize or unpack the blob at that location on demand
(see the sketch after this list).
  3. One difference from the fileflow setup: we want to avoid using the local
filesystem, since there is no guarantee that tasks are run on the same
machine.
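
To make point 2 a bit more concrete, here's a minimal sketch of the
GCS-backed idea. The gcs_xcom_push / gcs_xcom_pull helpers and the bucket
name are made up for illustration, and I'm assuming the google-cloud-storage
client library; only the short gs:// URI ever touches the Airflow DB, while
the payload itself lives in the bucket:

    # Sketch only -- the helper names and bucket are invented; ti is a
    # TaskInstance (e.g. context['ti'] inside a PythonOperator callable).
    import json

    from google.cloud import storage

    def gcs_xcom_push(ti, value, bucket_name='my-xcom-bucket'):
        """Serialize value into GCS and xcom_push only its gs:// URI."""
        path = 'xcom/%s/%s/%s.json' % (ti.dag_id, ti.task_id,
                                       ti.execution_date.isoformat())
        blob = storage.Client().bucket(bucket_name).blob(path)
        blob.upload_from_string(json.dumps(value))
        ti.xcom_push(key='gcs_uri', value='gs://%s/%s' % (bucket_name, path))

    def gcs_xcom_pull(ti, task_id):
        """Resolve an upstream task's gs:// URI and deserialize the blob."""
        uri = ti.xcom_pull(task_ids=task_id, key='gcs_uri')
        bucket_name, path = uri[len('gs://'):].split('/', 1)
        blob = storage.Client().bucket(bucket_name).blob(path)
        return json.loads(blob.download_as_string())

A first-class version would presumably hide these helpers behind the
existing xcom_push/xcom_pull interface (or behind sugar like the data_task
sketch above) so that Operators never see the URI at all.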

--J


On Mon, Jan 23, 2017 at 2:05 PM Van Klaveren, Brian N. <
[email protected]> wrote:

> I can give some insight from the physics world as far as this goes.
>
> First off, I think the dataflow puck is moving to platforms like Apache
> Beam. The main reason people (in science) don't just use Beam is that they
> don't control the clusters they execute on. This is almost always true for
> science projects using grid resources.
>
> This model is often desirable for scientific grid-based processing systems,
> which are inherently decentralized and involve staging data in and out,
> since the execution environment is sandboxed. These are often integrated
> with other Grid frameworks (e.g. DIRAC http://diracgrid.org/, PegasusWMS)
> which have their own integrated data catalogs to aid with data movement and
> staging, or they'll sometimes use another system for that management
> (iRODS, https://irods.org/). In many of those cases, you deal with logical
> file handles as the inputs/outputs, but the file management systems also
> own the data. I've done some work in this space as well, in that I've
> written a file replica management system (github.com/slaclab/datacat), but
> in this model the system is just a global metadata database about file
> replicas and doesn't "own" the file/data. A common requirement of these
> systems is also the need to support data versions and processing
> provenance.
>
> The nice thing about the dataflow is obviously the declarative data
> products. The messy thing is dealing with the data movement, especially if
> it's not mandatory (e.g. single datacenter and/or shared network disk). I'm
> partial towards the procedural nature of data movement using downstream
> processes and a DAG, but part of that is because I've had to deal with
> finicky File Transfer Nodes at different computing facilities to take full
> advantage of bandwidth available for file transfers. In most of the physics
> world (e.g. CERN), they also use dCache (https://www.dcache.org/) or
> xrootd (http://xrootd.org/) to aid in data movement, though some of the
> frameworks also support this natively (like DIRAC, as mentioned).
>
> I will say that a frustrating thing about many of the tools and frameworks
> I've mentioned is that they are often hard to use à la carte.
>
> Brian
>
>
> On Jan 23, 2017, at 8:05 AM, Bolke de Bruin <[email protected]> wrote:
>
> Hi All,
>
> I came across a write-up of some of the downsides of current workflow
> management systems like Airflow and Luigi (
> http://bionics.it/posts/workflows-dataflow-not-task-deps), where they argue
> that dependencies should be declared between the inputs and outputs of
> tasks rather than between the tasks themselves (inlets/outlets).
>
> They extended Luigi (https://github.com/pharmbio/sciluigi) to do this and
> even published a scientific paper on it:
> http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0179-6 .
>
> I kind of like the idea. Has anyone played with it? Any thoughts? I might
> want to try it in Airflow.
>
> Bolke
>
>
