Hi folks,

We are the team behind project vineyard (https://github.com/v6d-io/v6d), an immutable
in-memory data manager. We are working on an integration with airflow. To fully leverage
the capability of vineyard to improve the end-to-end performance of airflow workflows, we
propose a set of changes to the DAG, executor, and scheduler that would allow third-party
providers to inject hooks into the airflow framework.

As we are not experts on airflow, we want to brief our proposal here for early suggestions
and feedback before starting to work on pull requests. We have prepared slides that outline
the ideas:
https://docs.google.com/presentation/d/1r4gg4RX0PMrhRjpSpoRhe5Ck4FaFm8Xt9DSoZipsNa0/edit?usp=sharing

Briefly, vineyard is a data-sharing engine that aims to provide efficient sharing of
intermediate data for big-data analytical workflows. To let vineyard be leveraged by
existing airflow jobs (without non-trivial refactoring of the DAG definition code), our
proposal includes:

1. XCom interface: allow operators to push more (optional) metadata to xcom; see the
   sketch below.
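
A custom XCom backend can already redirect values into vineyard, but an operator has no
clean way to attach extra metadata (such as object locations) alongside the value. Here is
a minimal sketch of such a backend, assuming the vineyard python client
(`vineyard.connect()` / `put` / `persist` / `get`); the `VineyardXCom` name is ours, and
the comment marks exactly the metadata channel we are asking for:

    import vineyard
    from airflow.models.xcom import BaseXCom

    class VineyardXCom(BaseXCom):
        # store XCom values in vineyard; only the object id is kept in
        # the airflow metadata database

        @staticmethod
        def serialize_value(value):
            client = vineyard.connect()      # connects via VINEYARD_IPC_SOCKET
            object_id = client.put(value)    # share the value through vineyard
            client.persist(object_id)        # make it visible across hosts
            # missing today: a way for the operator to record optional
            # metadata (e.g., the object's location) next to the value
            return BaseXCom.serialize_value(repr(object_id))

        @staticmethod
        def deserialize_value(result):
            object_id = vineyard.ObjectID(BaseXCom.deserialize_value(result))
            return vineyard.connect().get(object_id)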

2. DAG-level hooks for operators: operators in airflow can define their own `pre_execute`
   and `post_execute` hooks, but at the DAG level it is hard to inject a hook that applies
   to all operators for preprocessing jobs, e.g., preparing the required inputs before
   executing an operator. We propose to support DAG-level pre/post hooks in airflow, as
   sketched below.
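
What the user-facing side could look like: the `pre_execute` / `post_execute` arguments on
`DAG` below do not exist today and are exactly the additions we propose (the operator-level
hooks of the same names already exist on `BaseOperator`):

    from airflow.models import DAG

    def prepare_inputs(context):
        # proposed DAG-level pre hook: e.g., migrate the vineyard objects
        # referenced by upstream xcom entries to the local instance
        pass

    def record_outputs(context, result):
        # proposed DAG-level post hook: e.g., persist the produced outputs
        pass

    # the two hook arguments are the proposed additions: they would apply
    # to every operator in the DAG, in addition to the operator's own hooks
    dag = DAG(
        dag_id="example_with_dag_level_hooks",
        pre_execute=prepare_inputs,
        post_execute=record_outputs,
    )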

3. Scheduler: airflow supports specifying a custom XCom backend via
   `AIRFLOW__CORE__XCOM_BACKEND`. It would be great if the scheduler (scheduler_job.py)
   could be made extensible in the same way, as sketched below.
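
For illustration, a hypothetical configuration knob that mirrors the XCom one, together
with the kind of subclass a provider would register (neither the option nor the
subclassing contract exists today; `_enqueue_task_instances_with_queued_state` is a
private method of the current scheduler, used here only to indicate where we would plug
in):

    # hypothetical, mirroring AIRFLOW__CORE__XCOM_BACKEND:
    #   AIRFLOW__CORE__SCHEDULER_JOB=mypackage.VineyardSchedulerJob

    from airflow.jobs.scheduler_job import SchedulerJob

    class VineyardSchedulerJob(SchedulerJob):
        def _enqueue_task_instances_with_queued_state(self, task_instances):
            # annotate each task instance with data-locality hints (e.g.,
            # which worker already holds its vineyard inputs) before they
            # are handed over to the executor
            super()._enqueue_task_instances_with_queued_state(task_instances)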

4. Executor: if the airflow scheduler could attach some “data locality information” to
   tasks when enqueuing them, there would be a performance gain whenever the underlying
   backend can access local objects faster than remote ones (e.g., a ray backend or a
   vineyard backend). The executor should take such locality information into account
   when popping tasks to run from the queue, as sketched below.
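
A sketch of the executor side, assuming the locality hints from point 3 are attached to
queued tasks; `order_queued_tasks_by_locality` is a hypothetical helper, and `trigger_tasks`
is the base-class method that currently pops tasks purely by priority:

    from airflow.executors.base_executor import BaseExecutor

    class LocalityAwareExecutor(BaseExecutor):
        def order_queued_tasks_by_locality(self, open_slots):
            # hypothetical: return up to `open_slots` keys from
            # self.queued_tasks, preferring tasks whose vineyard inputs
            # are already local to an idle worker
            ...

        def trigger_tasks(self, open_slots):
            for key in self.order_queued_tasks_by_locality(open_slots):
                command, _, queue, ti = self.queued_tasks.pop(key)
                self.running.add(key)
                self.execute_async(key=key, command=command, queue=queue,
                                   executor_config=ti.executor_config)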

If we have gotten anything wrong about how airflow works internally, please correct us.
We are seeking comments from the airflow developers on both the overall design and the
detailed proposed changes!

We would also like to see airflow moving towards better Kubernetes integration and
support. Airflow already has the ability to launch pods (KubernetesExecutor and
CeleryKubernetesExecutor), and we believe that by leveraging the DAG and lineage
information, airflow could gain even more in modern cloud environments orchestrated
by Kubernetes!

Thanks in advance!

Best regards,
Tao
