[
https://issues.apache.org/jira/browse/OOZIE-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089728#comment-16089728
]
Daniel Becker commented on OOZIE-2988:
--------------------------------------
I have spent some time investigating Airflow. Here are my thoughts on it:
The most important feature of Airflow is of course that the configuration files
are actually (python) code. I liked this as I, too, am more comfortable with
code than configuration files.
+*Workflows*+
The Airflow DAG is approximately what workflow is in Oozie. Actions (which are
called tasks in Airflow) are created in the python code using _operators_,
which are python object constructors. These have a set of parameters in common
(they all derive from a common base class), but they also have their specific
parameters. Tasks can also inherit default configuration from the DAG they are
contained in.
*Remark:* I definitely think that handling default configurations is something
that we should do, and Oozie workflows can already do it, for example with the
<global> sections (since workflow schema 0.5).
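To make the default-inheritance idea concrete, here is a minimal sketch in plain Python. It does not use Airflow itself; {{ToyDag}} and {{ToyOperator}} are hypothetical names I made up for illustration, and the merge rule (task-specific values override DAG-level defaults) is my reading of the behaviour described above.

```python
class ToyDag:
    """Hypothetical stand-in for a DAG that carries default task parameters."""

    def __init__(self, dag_id, default_args=None):
        self.dag_id = dag_id
        self.default_args = dict(default_args or {})
        self.tasks = {}


class ToyOperator:
    """Hypothetical stand-in for an operator: explicit parameters win over defaults."""

    def __init__(self, task_id, dag, **params):
        self.task_id = task_id
        # Start from the DAG-level defaults, then apply task-specific values.
        self.params = {**dag.default_args, **params}
        dag.tasks[task_id] = self


dag = ToyDag("example", default_args={"owner": "oozie", "retries": 1})
extract = ToyOperator("extract", dag)        # inherits both defaults
load = ToyOperator("load", dag, retries=3)   # overrides one default
```

Here {{extract}} ends up with the DAG's {{owner}} and {{retries}}, while {{load}} keeps the inherited {{owner}} but uses its own {{retries}} value.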
Airflow DAGs are checked for acyclicity and exceptions are raised if the
conditions are violated. When executing a workflow, Airflow evaluates the
python DAG definition from time to time to check / schedule / run individual
tasks (so the definition must be idempotent).
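For reference, an acyclicity check of this kind could be sketched as a depth-first search. This is not Airflow's implementation, only one common way to do it; the function names and the edge representation (a dict mapping each node to its downstream nodes) are assumptions of mine.

```python
def find_cycle(edges):
    """Return one cycle as a list of node names, or None if the graph is acyclic.

    `edges` maps each node to the list of nodes directly downstream of it.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    colour = {}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for nxt in edges.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path, so we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in list(edges):
        if colour.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


def check_acyclic(edges):
    """Raise ValueError if the graph described by `edges` contains a cycle."""
    cycle = find_cycle(edges)
    if cycle:
        raise ValueError("cycle detected: " + " -> ".join(cycle))
```

A similar check on the Oozie side would let an API reject an invalid workflow before it is submitted.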
+*Transitions*+
Specifying the transitions between the actions is what I think is one of the
best features of Airflow, as it is done in a simpler and more intuitive way
than in Oozie. After the actions are created, you can simply specify one to be
upstream of another. For example, if you have task1 and task2 and you'd like
task1 to run before task2, you can do:
{code:python}
task2.set_upstream(task1)
{code}
or
{code:python}
task1.set_downstream(task2)
{code}
or even
{code:python}
task1 >> task2
{code}
*Remark:* This way, forks and joins are handled implicitly, something we should
consider.
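To show why forks and joins fall out implicitly, here is a toy model of the three forms above in plain Python. {{ToyTask}} is a hypothetical class, not Airflow's operator base class; it only records upstream and downstream ids.

```python
class ToyTask:
    """Hypothetical task node that only tracks its neighbours."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def set_downstream(self, other):
        self.downstream.add(other.task_id)
        other.upstream.add(self.task_id)

    def set_upstream(self, other):
        other.set_downstream(self)

    def __rshift__(self, other):
        self.set_downstream(other)
        return other  # returning the right-hand side allows a >> b >> c


start, left, right, end = (ToyTask(n) for n in ("start", "left", "right", "end"))
start >> left >> end
start >> right >> end
```

No fork or join node is declared anywhere: {{start}} simply has two downstream tasks and {{end}} has two upstream tasks.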
+*Decisions*+
Decisions are handled by the BranchPythonOperator which has a python callable
as a parameter. The callable returns the id of the task that should be
chosen. This task must already be downstream of the branching task. When the
callable is evaluated, the status of every task downstream of the branching is
set to _skipped_, with the exception of the chosen branch. There is a catch, though:
{code:java}
branch_task1 -------- task2 -------- task3
            \-----------------------/
{code}
is not correct. Here, task3 is downstream of both branch_task1 and task2. When
the branching at branch_task1 is evaluated and the path through task2 is taken,
task3's status is also set to _skipped_, as it is downstream of branch_task1 and
its branch has not been taken. After task2 executes, task3 will NOT be run.
The solution to this problem is to use a dummy task like this:
{code:java}
branch_task1 ------------ task2 ------------ task3
            \--------- dummy_task ----------/
{code}
Airflow has a DummyOperator class for this.
*Remarks:* I think this behaviour is unintuitive and we should not use it.
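The catch can be reproduced with a very simplified model. The sketch below assumes that only the tasks _directly_ downstream of the branching task are marked as skipped (which is the reading under which the dummy task helps); it ignores how skips propagate further and is not Airflow's scheduler logic. The function name and the graph representation are made up for illustration.

```python
def skipped_by_branch(downstream, branch, chosen):
    """Return the tasks marked as skipped when `branch` picks `chosen`.

    `downstream` maps each task id to the ids directly downstream of it.
    Simplified model: every direct child of the branch except the chosen one.
    """
    children = downstream[branch]
    if chosen not in children:
        raise ValueError(chosen + " is not directly downstream of " + branch)
    return {task for task in children if task != chosen}


# First diagram: task3 hangs directly off the branching task.
direct = {
    "branch_task1": ["task2", "task3"],
    "task2": ["task3"],
    "task3": [],
}

# Second diagram: a dummy task sits between the branching task and task3.
with_dummy = {
    "branch_task1": ["task2", "dummy_task"],
    "task2": ["task3"],
    "dummy_task": ["task3"],
    "task3": [],
}

skipped_direct = skipped_by_branch(direct, "branch_task1", "task2")
skipped_with_dummy = skipped_by_branch(with_dummy, "branch_task1", "task2")
```

In this model the first layout marks task3 as skipped, while the second one only marks dummy_task.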
+*Dependencies of nodes*+
By default, tasks downstream to a node wait until that node finishes
successfully, and if it fails, they are not started. This can be customised.
*Remarks:* Oozie already has its model of action dependency (with _ok_ and
_error_ transitions), I think we should use the existing practice.
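As an illustration of what such customisation amounts to, here is a small sketch of a start-condition check. The rule names follow Airflow's trigger rule values as far as I remember them, so treat them as an assumption; the function itself is hypothetical and not part of either Airflow or Oozie.

```python
FINISHED = ("success", "failed", "skipped")


def may_start(upstream_states, rule="all_success"):
    """Decide whether a task may start, given the states of its upstream tasks."""
    states = list(upstream_states)
    if rule == "all_success":
        # Default described above: every upstream task finished successfully.
        return all(state == "success" for state in states)
    if rule == "all_done":
        # Start once every upstream task has finished, whatever the outcome.
        return all(state in FINISHED for state in states)
    if rule == "one_failed":
        # Start as soon as at least one upstream task has failed.
        return any(state == "failed" for state in states)
    raise ValueError("unknown rule: " + rule)
```

The {{one_failed}} case is roughly what an Oozie _error_ transition to a handler action expresses.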
+*Coordinators and bundles*+
Scheduling is an integral part of DAGs; schedules are not handled separately
from the workflow definition.
*Remark:* It could be a good decision, but in Oozie we already handle workflows
and coordinators separately (which is also a practice we can argue for), and I
think it would be too big a change to do it differently in our API.
Bundles are not present in Airflow either, but we can use nested sub-DAGs to
treat one or more workflows as a unit in a larger workflow. Oozie supports
subworkflows too.
+*Other things*+
Airflow has support for packaging dependencies together with the DAGs, although
these dependencies have to be written in python.
Airflow supports Jinja templates: [http://jinja.pocoo.org/docs/dev/].
> Investigate Apache Airflow
> --------------------------
>
> Key: OOZIE-2988
> URL: https://issues.apache.org/jira/browse/OOZIE-2988
> Project: Oozie
> Issue Type: Sub-task
> Components: client
> Reporter: Andras Piros
> Assignee: Daniel Becker
> Original Estimate: 16h
> Time Spent: 14h
> Remaining Estimate: 2h
>
> Investigate the current version of [*Apache
> Airflow*|https://airflow.incubator.apache.org/concepts.html] regarding
> the following:
> * how to define a DAG
> ** nodes
> ** edges
> ** transitions
> ** decision and fork / join nodes
> ** checks on DAGness, and on completeness
> * how are connections between DAGs handled
> ** state of one running / already finished DAG is respected by another DAG
> * creation / running process
> ** what to define w/ Python code
> ** and what using templates
> ** inheritance in Python
> ** third party libs in Python code while creating a DAG
> ** how are DAGs stored and submitted / run
> ** what patterns are used under the hood (builder, delegate, ...)
> ** how is logic considered (Oozie EL functions)
> * how to schedule a DAG (Oozie Coordinators)
> * how to bundle these together (Oozie Bundles)