Oh, that’s interesting! I think the way Airflow uses tasks doesn’t entirely fit 
the Flow model. In Luigi, for example, it is normal to derive from Task, and in 
your tasks you can just add the inlets (data dependencies) you require for your 
particular DAG. In Airflow we rely more heavily on templating and on more 
generic Operators.

However, thinking about it, we could extend XCom with your fileflow and let 
operators pull the data into the templating engine automatically, instead of 
requiring an explicit xcom_pull. Co-opting the dependency engine sounds very 
practical. As far as the dependency engine is concerned, I think the difference 
between control flow and data flow is non-existent.
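
Concretely, something like this (a rough sketch only; InletTemplatedOperator 
and the "inlets" template variable are hypothetical, everything else is stock 
Airflow API):

import logging

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class InletTemplatedOperator(BaseOperator):
    """Hypothetical: expose upstream XCom values to templates as 'inlets'."""

    @apply_defaults
    def __init__(self, command, *args, **kwargs):
        super(InletTemplatedOperator, self).__init__(*args, **kwargs)
        # Deliberately not in template_fields: we render it ourselves
        # once the inlets are known.
        self.command = command

    def execute(self, context):
        ti = context['ti']
        # One XCom return value per direct upstream task, keyed by task_id.
        inlets = {
            task_id: ti.xcom_pull(task_ids=task_id)
            for task_id in self.upstream_task_ids
        }
        rendered = self.dag.get_template_env().from_string(
            self.command).render(inlets=inlets, **context)
        logging.info("would run: %s", rendered)
        return rendered

A template could then say {{ inlets['extract'] }}, and the dependency engine 
would already guarantee that value exists by the time the task runs.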

Bolke

> On 23 Jan 2017, at 18:48, Laura Lorenz <[email protected]> wrote:
> 
> We were struggling with the same problem and wrote fileflow
> <http://github.com/industrydive/fileflow> to deal with passing data down
> a DAG in Airflow. We co-opt Airflow's task dependency system to represent
> the data dependencies and let fileflow keep track of where the data is
> stored and how to get at it from downstream. We've considered adding
> something to fileflow that lets you specify the data dependencies more
> naturally (e.g. task.data_dependency(other_task)), because right now the
> code you have to write to manage the data dependencies is split between
> the co-opted Airflow task dependency system (.set_upstream() and the
> like) and operator args, but we haven't even started to think about how
> to hook into the task dependency system as a plugin.
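> 
> To make the split concrete, a DAG today looks roughly like this
> (illustrative names only, not fileflow's actual API): the upstream
> task_id appears once for the control dependency and again as an
> operator arg for the data dependency, and nothing keeps the two in sync.
> 
> from datetime import datetime
> 
> from airflow import DAG
> from airflow.operators.python_operator import PythonOperator
> 
> dag = DAG('fileflow_style', start_date=datetime(2017, 1, 1))
> 
> def produce(**context):
>     # fileflow decides where the data lives; here we just record a path.
>     context['ti'].xcom_push(key='output_path', value='/tmp/numbers.csv')
> 
> def consume(upstream_task_id, **context):
>     # The data dependency arrives as an operator arg ...
>     path = context['ti'].xcom_pull(task_ids=upstream_task_id,
>                                    key='output_path')
>     print('reading from %s' % path)
> 
> producer = PythonOperator(task_id='produce', python_callable=produce,
>                           provide_context=True, dag=dag)
> consumer = PythonOperator(task_id='consume', python_callable=consume,
>                           op_kwargs={'upstream_task_id': 'produce'},
>                           provide_context=True, dag=dag)
> 
> # ... while the control dependency is declared separately. The
> # data_dependency() call we have in mind would collapse both into one.
> consumer.set_upstream(producer)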
> 
> We definitely felt this was missing, especially since our point of view
> was not orchestrating workflows that trigger external tools (like an
> external Spark or Hadoop job), but pipelining arbitrary Python scripts
> that run on our Airflow workers themselves and pass their outputs to
> each other. It's closer to what you would write a Makefile for, but we
> wanted all the nice Airflow scheduling, queue management, and workflow
> profiling for free :)
> 
> Laura
> 
> On Mon, Jan 23, 2017 at 11:05 AM, Bolke de Bruin <[email protected]> wrote:
> 
>> Hi All,
>> 
>> I came across a write-up of some of the downsides of current workflow
>> management systems like Airflow and Luigi
>> (http://bionics.it/posts/workflows-dataflow-not-task-deps), where they
>> argue that dependencies should be declared between the inputs and
>> outputs of tasks (inlets/outlets) rather than between the tasks
>> themselves.
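>> 
>> For context, the wiring in sciluigi looks roughly like this
>> (reconstructed from memory of their README, so details may be off).
>> The dependency is the assignment of an outlet to an inlet, not a
>> requires() between tasks:
>> 
>> import sciluigi as sl
>> 
>> class RawData(sl.ExternalTask):
>>     # Outlets are methods returning TargetInfo objects.
>>     def out_rawdata(self):
>>         return sl.TargetInfo(self, 'data/rawdata.txt')
>> 
>> class Lowercase(sl.Task):
>>     # Inlets are plain attributes, wired up by the workflow below.
>>     in_data = None
>> 
>>     def out_lower(self):
>>         return sl.TargetInfo(self, self.in_data().path + '.lower')
>> 
>>     def run(self):
>>         with self.in_data().open() as inf, \
>>                 self.out_lower().open('w') as outf:
>>             for line in inf:
>>                 outf.write(line.lower())
>> 
>> class MyWorkflow(sl.WorkflowTask):
>>     def workflow(self):
>>         rawdata = self.new_task('rawdata', RawData)
>>         lower = self.new_task('lower', Lowercase)
>>         # The data dependency itself: outlet -> inlet.
>>         lower.in_data = rawdata.out_rawdata
>>         return lower
>> 
>> if __name__ == '__main__':
>>     sl.run_local(main_task_cls=MyWorkflow)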
>> 
>> They extended Luigi (https://github.com/pharmbio/sciluigi) to do this and
>> even published a scientific paper on it:
>> http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0179-6 .
>> 
>> I kind of like the idea. Has anyone played with it? Any thoughts? I
>> might want to try it in Airflow.
>> 
>> Bolke
