We’ve just started using Airflow as a platform to replace some older internally 
built systems, but one of the things we also looked at was a _newer_ internally 
built system which essentially worked in the way described below.

In fact, it came as a surprise when I started looking around at open source 
systems like Luigi and Airflow that what gets passed between tasks isn’t 
actually data, but primarily pass/fail events.

I do like Airflow and the way it does things, but there’s a bit of overhead 
(for the developer and the architecture) in having, in our case, to persist 
the data to a DB between every step, where for some of the inner steps of the 
pipeline passing a data set directly would have been logically simpler.
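
To make that concrete, here is a rough sketch of the pattern we end up with 
(names and paths are made up, and a local file stands in for the DB we 
actually persist to): every step writes its output somewhere durable, the 
next step reads it back in, and Airflow itself only ever sees the task 
ordering.

    # Sketch: intermediate data is persisted between every step; only the
    # task ordering, not the data dependency, is visible to Airflow.
    import json
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def extract():
        rows = [{"id": 1, "value": 42}]             # produce a data set
        with open("/tmp/extract_output.json", "w") as f:
            json.dump(rows, f)                      # persist it for the next step

    def transform():
        with open("/tmp/extract_output.json") as f:
            rows = json.load(f)                     # reload the previous step's output
        print(len(rows))                            # real transformation would go here

    dag = DAG("example_pipeline", start_date=datetime(2017, 1, 1),
              schedule_interval=None)
    extract_task = PythonOperator(task_id="extract", python_callable=extract, dag=dag)
    transform_task = PythonOperator(task_id="transform", python_callable=transform, dag=dag)
    transform_task.set_upstream(extract_task)       # "transform runs after extract"

Passing the data set straight from extract to transform would obviously be 
simpler, which is the overhead I mean.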

I like the fileflow idea. It reminds me of pub/sub messaging, actually - where 
you just say “I’m publishing X” or “I’m subscribing to X” and then the platform 
handles the rest.
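
As a toy illustration of that analogy (purely hypothetical, not fileflow’s 
actual API): a task says what it publishes or subscribes to by name, and the 
platform alone decides where the data actually lives.

    # Toy pub/sub-style sketch - the names here are hypothetical.
    class DataBroker(object):
        """Stands in for 'the platform'; only it knows where data is stored."""

        def __init__(self):
            self._store = {}

        def publish(self, name, data):
            self._store[name] = data        # could just as well be S3, a DB, ...

        def subscribe(self, name):
            return self._store[name]        # subscribers never see the location

    broker = DataBroker()
    broker.publish("cleaned_events", [{"id": 1}])   # "I'm publishing X"
    rows = broker.subscribe("cleaned_events")       # "I'm subscribing to X"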

Glenn


On 23/01/2017, 17:48, "Laura Lorenz" <[email protected]> wrote:

    We were struggling with the same problem and came up with fileflow
    <http://github.com/industrydive/fileflow>, which is what we wrote to deal
    with passing data down a DAG in Airflow. We co-opt Airflow's task
    dependency system to represent the data dependencies and let fileflow
    handle knowing where the data is stored and how to get at it from
    downstream. We've considered rolling something into fileflow that lets you
    specify the data dependencies more naturally (e.g.
    task.data_dependency(other_task)), since right now the code you have to
    write to manage the data dependencies is split between the co-opted
    Airflow task dependency system (.set_upstream() and the like) and operator
    args, but we haven't even started to think about how to hook into the
    task dependency system as a plugin.
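
    For example, the duplication currently looks roughly like this, as a
    fragment of a DAG file (the operator kwargs below are schematic rather
    than fileflow's exact signature):

        # Schematic: the same data dependency is stated twice, once as an
        # operator argument and once via Airflow's task dependency system.
        clean = DivePythonOperator(
            task_id="clean",
            python_callable=clean_fn,
            data_dependencies={"raw": "download"},  # which upstream output to read
            dag=dag,
        )
        clean.set_upstream(download)                # the same relationship, again

    Something like clean.data_dependency(download) could declare both facts
    at once.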

    We definitely felt this was missing, especially since we were coming from
    the POV not of orchestrating workflows that trigger external tools (like
    an external Spark or Hadoop job), but of pipelining together arbitrary
    Python scripts that run on our Airflow workers themselves and pass their
    outputs to each other; closer to what you would write a Makefile for, but
    we wanted to get all the nice Airflow scheduling, queue management, and
    workflow profiling for free :)

    Laura

    On Mon, Jan 23, 2017 at 11:05 AM, Bolke de Bruin <[email protected]> wrote:

    > Hi All,
    >
    > I came across a write-up of some of the downsides in current workflow
    > management systems like Airflow and Luigi
    > (http://bionics.it/posts/workflows-dataflow-not-task-deps) where they
    > argue dependencies should be between inputs and outputs of tasks rather
    > than between tasks (inlets/outlets).
    >
    > They extended Luigi (https://github.com/pharmbio/sciluigi) to do this and
    > even published a scientific paper on it:
    > http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0179-6
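    >
    > Roughly, the style from their README looks something like this (sketched
    > from memory, so the details may be off):
    >
    >     import sciluigi
    >
    >     class FooWriter(sciluigi.Task):
    >         def out_foo(self):
    >             return sciluigi.TargetInfo(self, 'foo.txt')
    >         def run(self):
    >             with self.out_foo().open('w') as f:
    >                 f.write('foo\n')
    >
    >     class FooReplacer(sciluigi.Task):
    >         in_foo = None                          # an input, wired up below
    >         def out_replaced(self):
    >             return sciluigi.TargetInfo(self, 'foo.bar.txt')
    >         def run(self):
    >             with self.in_foo().open() as i, self.out_replaced().open('w') as o:
    >                 o.write(i.read().replace('foo', 'bar'))
    >
    >     class MyWorkflow(sciluigi.WorkflowTask):
    >         def workflow(self):
    >             writer = self.new_task('writer', FooWriter)
    >             replacer = self.new_task('replacer', FooReplacer)
    >             # the dependency is between an output and an input,
    >             # not between the two tasks themselves:
    >             replacer.in_foo = writer.out_foo
    >             return replacer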
    >
    > I kind of like the idea. Has anyone played with it? Any thoughts? I might
    > want to try it in Airflow.
    >
    > Bolke



