We’ve just started using Airflow as a platform to replace some older internally built systems, but one of the things we also looked at was a _newer_ internally built system which basically did what is described below.
In fact it came as a surprise, when I started looking around at open source systems like Luigi and Airflow, that it wasn’t actually data being passed between tasks, but primarily passed/failed events. I do like Airflow and the way it does things, but there’s a bit of overhead (for the developer and the architecture) in having, in our case, to persist the data to a DB between every step, where in some of the inner steps of our pipeline passing a data set along would have been logically simpler.

I like the fileflow idea. It reminds me of pub/sub messaging, actually: you just say “I’m publishing X”, “I’m subscribing to X”, and the platform handles the rest.

Glenn

On 23/01/2017, 17:48, "Laura Lorenz" <[email protected]> wrote:

We were struggling with the same problem and came up with fileflow <http://github.com/industrydive/fileflow>, which is what we wrote to deal with passing data down a DAG in Airflow. We co-opt Airflow's task dependency system to represent the data dependencies and let fileflow handle knowing where the data is stored and how to get at it from downstream.

We've considered rolling something into fileflow that allows you to specify the data dependencies more naturally (i.e. task.data_dependency(other_task)), since right now the code you have to write to manage the data dependencies is split between the co-opted Airflow task dependency system (.set_upstream() and the like) and operator args, but we haven't even started to think about how to hook into the task dependency system as a plugin.

We definitely felt this was missing, especially since we were coming from the POV not of orchestrating workflows that trigger external tools (like an external Spark or Hadoop job), but of pipelining arbitrary Python scripts together, running on our Airflow workers themselves, that pass their outputs to each other; closer to what you would write a Makefile for, but we wanted all the nice Airflow scheduling, queue management, and workflow profiling for free :)

Laura

On Mon, Jan 23, 2017 at 11:05 AM, Bolke de Bruin <[email protected]> wrote:

> Hi All,
>
> I came across a write-up of some of the downsides of current workflow
> management systems like Airflow and Luigi
> (http://bionics.it/posts/workflows-dataflow-not-task-deps), where they
> argue that dependencies should be between the inputs and outputs of tasks
> rather than between the tasks themselves (inlets/outlets).
>
> They extended Luigi (https://github.com/pharmbio/sciluigi) to do this and
> even published a scientific paper on it:
> http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0179-6
>
> I kind of like the idea. Has anyone played with it? Any thoughts? I might
> want to try it in Airflow.
>
> Bolke
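
Purely as an illustration of the pattern Glenn and Laura are describing (and not fileflow's actual API), here is a minimal sketch using plain Airflow PythonOperators, roughly Airflow 1.x style: each task "publishes" its output to a path keyed by its task_id and execution date, and the downstream task "subscribes" by reading from its upstream task's path, so the ordinary task dependency doubles as the data dependency. The storage root, helper function, and DAG names are all hypothetical.

import json
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

DATA_ROOT = "/tmp/pipeline_data"  # hypothetical storage root

def output_path(task_id, ds):
    # Key intermediate data by the producing task and the execution date.
    return os.path.join(DATA_ROOT, ds, task_id + ".json")

def extract(ds, **kwargs):
    # "Publish" this task's output to its well-known location.
    records = [{"id": 1, "value": 42}]
    path = output_path("extract", ds)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(records, f)

def transform(ds, **kwargs):
    # "Subscribe" to the upstream task's output by reading its location,
    # then publish this task's own output the same way.
    with open(output_path("extract", ds)) as f:
        records = json.load(f)
    doubled = [dict(r, value=r["value"] * 2) for r in records]
    path = output_path("transform", ds)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(doubled, f)

dag = DAG(
    "file_passing_example",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
)

t1 = PythonOperator(task_id="extract", python_callable=extract,
                    provide_context=True, dag=dag)
t2 = PythonOperator(task_id="transform", python_callable=transform,
                    provide_context=True, dag=dag)

# The ordinary Airflow task dependency doubles as the data dependency.
t1.set_downstream(t2)

In this sketch the data dependency is still implicit (transform hard-codes which upstream path it reads), which is exactly the gap a task.data_dependency(other_task) style API, or sciluigi's inlets/outlets, would make explicit.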
