Just commented on the blog post:

----------------------------
I agree that workflow engines should expose a way to document the data
objects each task reads from and writes to, so that the engine can be aware
of the full graph of tasks and data objects and how they all relate. This
metadata allows for clarity around data lineage and potentially deeper
integration with external systems.
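
To illustrate, here is a minimal sketch of what declaring those data objects
on a task could look like; the DataObject class and the inlets/outlets
parameters are made-up names for this example, not an existing Airflow API.

    # Hypothetical sketch: declaring the data objects a task reads and writes
    # so the engine can build one graph of tasks and data. `DataObject`,
    # `inlets` and `outlets` are illustrative names, not Airflow API.

    class DataObject:
        def __init__(self, uri):
            self.uri = uri  # e.g. "hive://warehouse.daily_agg/ds=2017-01-23"

    class Task:
        def __init__(self, task_id, inlets=None, outlets=None):
            self.task_id = task_id
            self.inlets = inlets or []    # data objects this task reads
            self.outlets = outlets or []  # data objects this task writes

    raw_events = DataObject("s3://bucket/raw/events/2017-01-23/")
    daily_agg = DataObject("hive://warehouse.daily_agg/ds=2017-01-23")

    aggregate = Task("aggregate_events",
                     inlets=[raw_events], outlets=[daily_agg])

    # With this metadata the engine can answer lineage questions such as
    # "which task produces daily_agg?" without executing anything.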
Now there's the question of whether the state of a workflow should be
inferred from the presence or absence of related targets. For this specific
question I'd argue that the workflow engine needs to manage its own state
internally. Here are a few reasons why (a rough sketch contrasting the two
approaches follows the list):

* many maintenance tasks don't have a physical output, forcing the creation
  of dummy objects representing state
* external systems make no guarantees as to how quickly you can check for
  the existence of an object, so computing which tasks can run may put a
  burden on those systems, poking at thousands of data targets (related:
  the snakebite lib was developed in part to help with Luigi's burden on
  HDFS)
* how do you handle the "currently running" state? a dummy/temporary
  output? manage this specific state internally?
* how do you handle a state like Airflow's "skipped" (related to
  branching)? by creating a dummy target?
* if you need to re-run parts of the pipeline (say a specific task and
  everything downstream of it for a specific date range), you'll need to go
  and alter/delete a potentially intricate list of targets. This means the
  workflow engine needs to be able to delete files in external systems as a
  way to re-run tasks. Note that you may not always want to take these
  targets offline for the duration of the backfill.
* if some tasks use staging or temporary tables, cleaning those up to
  regain space would re-trigger the task, so you'd have to trick the system
  into achieving what you want (overwriting with an empty target?), perhaps
  changing your unit of work by creating larger tasks that include the
  temporary-table step, but that may not be the unit of work that you want
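
To make the contrast concrete, here is a rough sketch, under made-up names
(this is neither Luigi nor Airflow code), of readiness computed by probing
external targets versus readiness read from internally managed state:

    import sqlite3

    class ExternalTarget:
        """Stand-in for an output in an external system (HDFS, S3, a table)."""
        def __init__(self, path):
            self.path = path

        def exists(self):
            # In a real system this is a remote call -- one per target, so a
            # scheduling pass over thousands of targets gets expensive.
            return False

    def runnable_by_probing(tasks):
        # Presence-based state: a task is runnable if any of its outputs is
        # missing, which requires probing every target.
        return [task_id for task_id, outputs in tasks.items()
                if not all(target.exists() for target in outputs)]

    def runnable_from_internal_state(db):
        # Internally managed state: a single query against the engine's own
        # metadata store, which also has room for states with no physical
        # output ("running", "skipped", ...).
        cur = db.execute(
            "SELECT task_id FROM task_instance "
            "WHERE state NOT IN ('success', 'running', 'skipped')"
        )
        return [row[0] for row in cur.fetchall()]

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE task_instance (task_id TEXT, state TEXT)")
    db.executemany("INSERT INTO task_instance VALUES (?, ?)",
                   [("extract", "success"), ("transform", "running"),
                    ("load", "queued")])
    print(runnable_from_internal_state(db))  # -> ['load']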
From my perspective, to run a workflow engine at scale you need to manage
its state internally, because you need strong guarantees around reading and
altering that state. I agree that ideally the workflow engine should know
about input and output data objects (this is not the case currently in
Airflow), and it would be a really nice thing to be able to diff & sync its
internal state with the external one (presence of targets), but that may be
challenging.
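
As a thought experiment, here is a minimal sketch of what such a diff could
look like; the function and state names are hypothetical, not an existing
Airflow or Luigi API.

    # Hypothetical sketch: diff the engine's internal state against the
    # external world (presence of output targets). All names are illustrative.

    def diff_states(internal_state, target_exists):
        """
        internal_state: {task_id: state} as recorded in the engine's database
        target_exists:  {task_id: bool} from probing each task's output targets
        Returns the task_ids where the two views disagree.
        """
        mismatches = {}
        for task_id, state in internal_state.items():
            exists = target_exists.get(task_id, False)
            if state == "success" and not exists:
                mismatches[task_id] = "marked success but output is missing"
            elif state != "success" and exists:
                mismatches[task_id] = "output present but not marked success"
        return mismatches

    print(diff_states(
        {"extract": "success", "transform": "failed"},
        {"extract": False, "transform": True},
    ))
    # {'extract': 'marked success but output is missing',
    #  'transform': 'output present but not marked success'}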

Max

On Mon, Jan 23, 2017 at 8:05 AM, Bolke de Bruin <[email protected]> wrote:

> Hi All,
>
> I came by a write-up of some of the downsides in current workflow
> management systems like Airflow and Luigi
> (http://bionics.it/posts/workflows-dataflow-not-task-deps) where they
> argue dependencies should be between inputs and outputs of tasks rather
> than between tasks (inlets/outlets).
>
> They extended Luigi (https://github.com/pharmbio/sciluigi) to do this and
> even published a scientific paper on it:
> http://jcheminf.springeropen.com/articles/10.1186/s13321-016-0179-6 .
>
> I kind of like the idea, has anyone played with it, any thoughts? I might
> want to try it in Airflow.
>
> Bolke
