[ 
https://issues.apache.org/jira/browse/OOZIE-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089728#comment-16089728
 ] 

Daniel Becker commented on OOZIE-2988:
--------------------------------------

I have spent some time investigating Airflow. Here are my thoughts on it:

The most important feature of Airflow is of course that the configuration files 
are actually (python) code. I liked this as I, too, am more comfortable with 
code than configuration files.

+*Workflows*+
The Airflow DAG is approximately what a workflow is in Oozie. Actions (which are 
called tasks in Airflow) are created in the python code using _operators_, 
which are python object constructors. These have a set of parameters in common 
(they all derive from a common base class), but they also have their specific 
parameters. Tasks can also inherit default configuration from the DAG they are 
contained in.

*Remark:* I definitely think that handling default configurations is something 
that we should do, and Oozie workflows can already do it, for example with the 
<global> sections (since workflow schema 0.5).
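For illustration, here is a minimal sketch of what such a DAG definition looks like (the DAG id, task ids, dates and BashOperator commands are my own made-up example):

```python
# Sketch of an Airflow DAG definition. default_args plays a role similar
# to Oozie's <global> section: every task inherits these values unless it
# overrides them.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'oozie-investigation',   # inherited by every task below
    'retries': 2,
    'start_date': datetime(2017, 7, 1),
}

dag = DAG('example_workflow', default_args=default_args,
          schedule_interval='@daily')

# Operator constructors create the tasks; common parameters (task_id,
# retries, dag, ...) come from the shared base class, while bash_command
# is specific to BashOperator.
extract = BashOperator(task_id='extract', bash_command='echo extract',
                       dag=dag)
load = BashOperator(task_id='load', bash_command='echo load',
                    retries=5,        # overrides the inherited default
                    dag=dag)
```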

Airflow DAGs are checked for acyclicity, and an exception is raised if a cycle 
is found. When executing a workflow, Airflow re-evaluates the python DAG 
definition periodically to check / schedule / run individual tasks (so the 
definition must be idempotent).

+*Transitions*+
Specifying the transitions between the actions is, I think, one of the best 
features of Airflow, as it is done in a simpler and more intuitive way than in 
Oozie. After the actions are created, you can simply declare one task to be 
upstream of another. For example, if you have task1 and task2 and you'd like 
task1 to run before task2, you can do:

{code:java}
task2.set_upstream(task1)
{code}
or
{code:java}
task1.set_downstream(task2)
{code}
or even

{code:java}
task1 >> task2
{code}

*Remark:* This way, forks and joins are handled implicitly, something we should 
consider.
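The implicit forks and joins fall out of the dependency bookkeeping; the idea can be seen in a small toy model (my own sketch of the concept, not Airflow's actual implementation):

```python
# Toy sketch of how ">>" can build a dependency graph. A fork is simply
# a task with several downstream neighbours, a join a task with several
# upstream neighbours; no explicit fork/join node is needed.
class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def set_downstream(self, other):
        self.downstream.add(other)
        other.upstream.add(self)

    def __rshift__(self, other):
        # "task1 >> task2" means task1 runs before task2
        self.set_downstream(other)
        return other            # returned so that ">>" chains

start, branch_a, branch_b, join = (Task(i) for i in
                                   ('start', 'a', 'b', 'join'))
start >> branch_a >> join   # fork: start now has two downstream tasks
start >> branch_b >> join   # join: join now has two upstream tasks

assert {t.task_id for t in start.downstream} == {'a', 'b'}
assert {t.task_id for t in join.upstream} == {'a', 'b'}
```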

+*Decisions*+
Decisions are handled by the BranchPythonOperator, which takes a python callable 
as a parameter. The callable returns the id of the task that should be chosen; 
this task must already be downstream of the branching. When the callable is 
evaluated, the status of every task downstream of the branching is set to 
_skipped_, with the exception of the chosen branch. There is a catch though:

{code:java}
branch_task1 -------- task2 -------- task3
            \-----------------------/
{code}

is not correct. Here, task3 is downstream of both branch_task1 and task2; when 
the branching at branch_task1 is evaluated and the path through task2 is taken, 
task3's status is also set to _skipped_, because it is directly downstream of 
branch_task1 and its branch has not been chosen. After task2 executes, task3 
will NOT be run.
The solution to this problem is to use a dummy task like this:


{code:java}
branch_task1 -------- task2 -------- task3
            \------ dummy_task -----/
{code}

Airflow has a DummyOperator class for this.
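Put together, the corrected shape could look like this (my own sketch; the DAG id, task ids and the always-task2 callable are made up for illustration):

```python
# Sketch of a branching with the extra dummy task. The callable returns
# the task_id of the branch to follow; every other task immediately
# downstream of branch_task1 is set to skipped.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator

dag = DAG('branching_example', start_date=datetime(2017, 7, 1))

def choose_branch():
    # runs at execution time; in reality some condition would be
    # evaluated here instead of always picking task2
    return 'task2'

branch_task1 = BranchPythonOperator(task_id='branch_task1',
                                    python_callable=choose_branch, dag=dag)
task2 = DummyOperator(task_id='task2', dag=dag)
dummy_task = DummyOperator(task_id='dummy_task', dag=dag)
task3 = DummyOperator(task_id='task3', dag=dag)

# Both branches end in task3; the dummy task keeps task3 from being a
# direct downstream of branch_task1 (and thus from being skipped).
branch_task1 >> task2 >> task3
branch_task1 >> dummy_task >> task3
```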
*Remark:* I think this behaviour is unintuitive and we should not use it.

+*Dependencies of nodes*+
By default, tasks downstream of a node wait until that node finishes 
successfully; if it fails, they are not started. This behaviour can be 
customised.
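One customisation point I found is the operator's trigger_rule parameter; a minimal sketch (the DAG id and task are my own example):

```python
# Sketch of overriding the default dependency rule via trigger_rule.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('trigger_rule_example', start_date=datetime(2017, 7, 1))

# 'all_done' makes the task run once every upstream task has finished,
# whether it succeeded or failed -- roughly an error-handling path.
# The default is 'all_success'; 'one_failed' is another option.
cleanup = BashOperator(task_id='cleanup', bash_command='echo cleanup',
                       trigger_rule='all_done', dag=dag)
```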

*Remark:* Oozie already has its own model of action dependencies (with _ok_ and 
_error_ transitions); I think we should stick to the existing practice.

+*Coordinators and bundles*+
Scheduling is an integral part of a DAG; it is not handled separately from the 
workflow definition.
*Remark:* It could be a good decision, but in Oozie we already handle workflows 
and coordinators separately (which is also a practice we can argue for), and I 
think it would be too big a change to do it differently in our API.

Bundles are not present in Airflow either, but nested sub-DAGs can be used to 
treat one or more workflows as a unit in a larger workflow. Oozie supports 
subworkflows too.
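A sketch of the sub-DAG mechanism (my own example; as far as I can tell, the child DAG's id has to be <parent_dag_id>.<task_id>):

```python
# Sketch of nesting one DAG inside another with SubDagOperator.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.subdag_operator import SubDagOperator

start_date = datetime(2017, 7, 1)

parent = DAG('parent_workflow', start_date=start_date)

# the nested DAG; its id must be '<parent_dag_id>.<task_id>'
child = DAG('parent_workflow.sub_unit', start_date=start_date)
DummyOperator(task_id='inner_task', dag=child)

# the whole child DAG appears as a single task in the parent
sub_unit = SubDagOperator(task_id='sub_unit', subdag=child, dag=parent)
```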

+*Other things*+
Airflow has support for packaging dependencies together with the DAGs, although 
these dependencies have to be written in python.
Airflow supports Jinja templates: [http://jinja.pocoo.org/docs/dev/].
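Operator parameters that are declared as template fields are rendered with Jinja before execution; for example (my own sketch, using Airflow's built-in {{ ds }} macro for the execution date):

```python
# Sketch of Jinja templating in an operator parameter.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('templating_example', start_date=datetime(2017, 7, 1))

# bash_command is a templated field; '{{ ds }}' is replaced with the
# execution date (YYYY-MM-DD) when the task runs
templated = BashOperator(task_id='print_date',
                         bash_command='echo "run date: {{ ds }}"',
                         dag=dag)
```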

> Inestigate Apache Airflow
> -------------------------
>
>                 Key: OOZIE-2988
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2988
>             Project: Oozie
>          Issue Type: Sub-task
>          Components: client
>            Reporter: Andras Piros
>            Assignee: Daniel Becker
>   Original Estimate: 16h
>          Time Spent: 14h
>  Remaining Estimate: 2h
>
> Investigate current version of [*Apache 
> Airflow*|https://airflow.incubator.apache.org/concepts.html] regarding 
> following:
> * how to define a DAG
> ** nodes
> ** edges
> ** transitions
> ** decision and fork / join nodes
> ** checks on DAGness, and on completeness
> * how are connections between DAGs handled
> ** state of one running / already finished DAG is respected by another DAG
> * creation / running process
> ** what to define w/ Python code
> ** and what using templates
> ** inheritance in Python
> ** third party libs in Python code while creating a DAG
> ** how are DAGs stored and submitted / run
> ** what patterns are used under the hood (builder, delegate, ...)
> ** how is logic considered (Oozie EL functions)
> * how to schedule a DAG (Oozie Coordinators)
> * how to bundle these together (Oozie Bundles)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)