Just commented on the blog post:

----------------------------

I agree that a workflow engine should expose a way to document the data objects it reads from and writes to, so that it can be aware of the full graph of tasks and data objects and how they all relate. This metadata allows for clarity around data lineage and potentially deeper integration with external systems.

Now there's the question of whether the state of a workflow should be inferred from the presence or absence of related targets. For this specific question I'd argue that the workflow engine needs to manage its own state internally. Here are a few reasons why:

* many maintenance tasks don't have a physical output, forcing the creation of dummy objects to represent state
* external systems make no guarantees as to how quickly you can check for the existence of an object, so computing which tasks can run may put a burden on those systems, poking at thousands of data targets (related: the snakebite lib was developed in part to reduce Luigi's burden on HDFS)
* how do you handle the "currently running" state? a dummy/temporary output? managing this specific state internally?
* how do you handle a state like Airflow's "skipped" (related to branching)? by creating a dummy target?
* if you need to re-run parts of the pipeline (say a specific task and everything downstream for a specific date range), you'll need to go and alter/delete a potentially intricate list of targets. This means the workflow engine needs to be able to delete files in external systems as a way to re-run tasks. Note that you may not always want to take these targets offline for the duration of the backfill.
* if some tasks use staging or temporary tables, cleaning those up to regain space would re-trigger the task, so you'd have to trick the system into doing what you want (overwriting with an empty target?), or change your unit of work by creating larger tasks that include the temporary-table step, which may not be the unit of work you want

From my perspective, to run a workflow engine at scale you need to manage its state internally, because you need strong guarantees as to reading and altering that state. I agree that ideally the workflow engine should know about input and output data objects (this is not the case currently in Airflow), and it would be really nice to be able to diff & sync its internal state against the external one (presence of targets), but that may be challenging.
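To make the contrast concrete, here is a minimal sketch of the two approaches: state inferred from the presence of a target versus state recorded in a store the engine owns. This is illustrative Python only, not actual Luigi or Airflow internals, and all names are made up.

import os

# State inferred from the outside world: "is this task done?" means
# "does its output exist?", which requires a physical output and a call
# to the external system for every scheduling decision.
class TargetBackedTask:
    def __init__(self, output_path):
        self.output_path = output_path

    def complete(self):
        return os.path.exists(self.output_path)


# State managed internally: the engine records states it controls, so it
# can represent "running" or "skipped" directly, and re-running a date
# range is a metadata update rather than deleting files in external systems.
class StateBackedTask:
    STATES = {"none", "queued", "running", "success", "failed", "skipped"}

    def __init__(self, task_id, state_store):
        self.task_id = task_id
        self.state_store = state_store  # a dict here, a database in practice

    def set_state(self, execution_date, state):
        assert state in self.STATES
        self.state_store[(self.task_id, execution_date)] = state

    def get_state(self, execution_date):
        return self.state_store.get((self.task_id, execution_date), "none")

With the second approach, clearing a date range for a backfill is a metadata update, and states with no natural physical output don't need dummy targets.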
Max

On Mon, Jan 23, 2017 at 8:05 AM, Bolke de Bruin <[email protected]> wrote:

> Hi All,
>
> I came by a write up of some of the downsides in current workflow
> management systems like Airflow and Luigi
> (http://bionics.it/posts/workflows-dataflow-not-task-deps) where they
> argue dependencies should be between inputs and outputs of tasks rather
> than between tasks (inlets/outlets).
>
> They extended Luigi (https://github.com/pharmbio/sciluigi) to do this and
> even published a scientific paper on it:
> http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0179-6 .
>
> I kind of like the idea, has anyone played with it, any thoughts? I might
> want to try it in Airflow.
>
> Bolke
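For illustration, a rough sketch of what wiring dependencies through data objects (the inlets/outlets idea from the post) might look like. This is hypothetical Python, not sciluigi's actual API; the class and slot names are made up.

# Hypothetical wiring of tasks through shared data objects rather than
# task-to-task dependencies (not sciluigi's actual API; names are made up).
class DataObject:
    def __init__(self, uri):
        self.uri = uri
        self.producer = None   # the task that writes this object
        self.consumers = []    # the tasks that read this object


class Task:
    def __init__(self, name):
        self.name = name
        self.inputs = {}
        self.outputs = {}

    def writes(self, slot, data_object):
        data_object.producer = self
        self.outputs[slot] = data_object

    def reads(self, slot, data_object):
        data_object.consumers.append(self)
        self.inputs[slot] = data_object


# The engine can derive both the dependency graph and the data lineage
# from the same declarations.
raw = DataObject("s3://bucket/raw/2017-01-23.json")
clean = DataObject("s3://bucket/clean/2017-01-23.parquet")

extract = Task("extract")
transform = Task("transform")

extract.writes("raw", raw)
transform.reads("raw", raw)
transform.writes("clean", clean)

upstream_of_transform = {obj.producer for obj in transform.inputs.values()}
assert extract in upstream_of_transform

Because each data object knows its producer and consumers, the task graph and the data lineage fall out of the same declarations, which is the appeal of declaring dependencies on inputs and outputs rather than on tasks.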
