Jeremiah Lowin created AIRFLOW-862:
--------------------------------------
Summary: Add DaskExecutor
Key: AIRFLOW-862
URL: https://issues.apache.org/jira/browse/AIRFLOW-862
Project: Apache Airflow
Issue Type: New Feature
Components: executor
Reporter: Jeremiah Lowin
Assignee: Jeremiah Lowin
The Dask Distributed sub-project makes it very easy to create pure-Python
clusters of Dask workers, ranging from a personal laptop to thousands of
networked cores. The workers can execute arbitrary functions submitted to the
Dask scheduler node. A full Dask application would involve multiple tasks with
data dependencies (similar in philosophy to an Airflow DAG), but Dask will
happily run single functions as well.
The DaskExecutor is configured by supplying the IP address of the Dask
Scheduler. It submits Airflow commands to the cluster for execution (note: the
cluster should have access to any Airflow dependencies, including Airflow
itself!) and checks the resulting futures to see if the tasks completed
successfully.
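
The submit-and-check flow described above could look roughly like the sketch
below. It is not the actual DaskExecutor code: run_command, submit_commands,
collect_states and run_on_dask are hypothetical names, the commands are assumed
to be argument lists such as an "airflow run ..." invocation, and the scheduler
address passed to run_on_dask is a placeholder. The helpers only rely on
submit(), done() and exception(), which both distributed.Client futures and
concurrent.futures provide.

```python
import subprocess


def run_command(command):
    """Run one command in a subprocess; raises CalledProcessError on failure."""
    subprocess.check_call(command)
    return 0


def submit_commands(client, commands):
    """Submit every command to the cluster and return {task_key: future}.

    `client` is anything with a submit(fn, *args) method, e.g. a
    distributed.Client connected to the Dask scheduler.
    """
    return {key: client.submit(run_command, cmd) for key, cmd in commands.items()}


def collect_states(futures):
    """Return {task_key: 'success' | 'failed'} for futures that have finished."""
    states = {}
    for key, future in futures.items():
        if future.done():
            failed = future.exception() is not None
            states[key] = "failed" if failed else "success"
    return states


def run_on_dask(scheduler_address, commands):
    """Hypothetical driver: needs dask.distributed installed on this machine."""
    from distributed import Client, wait

    client = Client(scheduler_address)  # e.g. "127.0.0.1:8786" (assumed address)
    try:
        futures = submit_commands(client, commands)
        wait(list(futures.values()))  # block until every task has finished
        return collect_states(futures)
    finally:
        client.close()
```

A real executor would poll the futures periodically instead of blocking on
wait(), so that task states can be reported back as each one completes.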
Some advantages of using Dask for parallel execution over LocalExecutor or
CeleryExecutor are:
- simple scaling, from local machines to remote clusters
- pure Python implementation (minimal dependencies and no need to run
additional databases)
- built-in, live-updating web UI for monitoring the cluster
** Note: This does NOT replace the Airflow scheduler or DAG engine with the
analogous Dask versions; it just uses the Dask cluster to run Airflow tasks.