Hi folks,

We are the team behind project vineyard (https://github.com/v6d-io/v6d), an in-memory manager for immutable data. We are working on integrating vineyard with airflow. To fully leverage the capability of vineyard to improve the end-to-end performance of airflow workflows, we propose a set of changes to the DAG, executor, and scheduler that would allow third-party providers to inject hooks into the airflow framework.
As we are not experts on airflow, before starting to work on pull requests we want to brief our proposal here for early suggestions and feedback. We have prepared slides to outline the ideas: https://docs.google.com/presentation/d/1r4gg4RX0PMrhRjpSpoRhe5Ck4FaFm8Xt9DSoZipsNa0/edit?usp=sharing

Briefly, vineyard is a data-sharing engine that aims to provide efficient sharing of intermediate data in big-data analytical workflows. So that vineyard can be leveraged by existing airflow jobs (without non-trivial refactoring of the DAG definition code), our proposal includes:

1. XCom interface: allow operators to push more (optional) metadata to XCom.

2. DAG-level hooks for operators: operators in airflow can define their own `pre_execute` and `post_execute` hooks, but at the DAG level it is hard to inject a hook for all operators for preprocessing jobs, e.g., preparing the required inputs before an operator executes. We propose supporting DAG-level pre/post hooks in airflow.

3. Scheduler: airflow supports specifying a custom XCom backend via `AIRFLOW__CORE__XCOM_BACKEND`. It would be great if the scheduler (scheduler_job.py) could be made extensible in the same way.

4. Executor: if the airflow scheduler could attach some “data locality information” when enqueuing tasks, there would be a performance gain whenever the underlying backend can access local objects faster than remote ones (e.g., a ray backend or a vineyard backend). The executor should take such locality information into account when popping tasks to run from the queue.

Rough sketches of these ideas are attached after the signature. If we have got anything wrong about how airflow works internally, please correct us. We are seeking comments from the airflow developers about the overall design as well as the detailed proposed changes!

We would also like to see airflow moving towards better Kubernetes integration and support. Airflow already has the ability to launch pods (KubernetesExecutor and CeleryKubernetesExecutor), and we believe that by leveraging the DAG and lineage information, airflow could gain even more in modern cloud environments orchestrated by Kubernetes!

Thanks in advance!

Best regards,
Tao
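P.S. To make the proposal more concrete, here are a few rough sketches in Python. Since we are not airflow experts, please take the extension points and names below as assumptions for illustration rather than working patches.

For (1) and (3): a custom XCom backend that offloads values to vineyard can already be written by subclassing `BaseXCom`; what is missing is a standard way to attach extra (optional) metadata, such as object placement, to each XCom entry, plus a similarly configurable extension point for the scheduler. A minimal sketch, assuming a local vineyard instance listening at `/var/run/vineyard.sock`:

import vineyard
from airflow.models.xcom import BaseXCom

class VineyardXComBackend(BaseXCom):
    # Store large XCom payloads in vineyard; only a small reference
    # (and, under our proposal, optional metadata) hits the metadata DB.

    @staticmethod
    def serialize_value(value):
        client = vineyard.connect('/var/run/vineyard.sock')
        object_id = client.put(value)  # share the payload via vineyard
        return BaseXCom.serialize_value({'vineyard_object_id': repr(object_id)})

    @staticmethod
    def deserialize_value(result):
        ref = BaseXCom.deserialize_value(result)
        client = vineyard.connect('/var/run/vineyard.sock')
        return client.get(vineyard.ObjectID(ref['vineyard_object_id']))

Such a backend would be enabled by pointing `AIRFLOW__CORE__XCOM_BACKEND` at the class, exactly as the built-in mechanism works today.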
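For (2): today, DAG-level hooks can only be emulated by patching every operator in the DAG by hand, roughly as below; we would like a first-class `pre_execute`/`post_execute` pair on the DAG itself. The `prepare_inputs` callback is hypothetical, e.g., it could ensure the required vineyard objects are on the local node before a task runs:

import functools

def apply_dag_level_pre_execute(dag, hook):
    # Wrap the pre_execute of every operator in the DAG so that `hook`
    # runs first; this is the behaviour we propose to make first-class.
    for task in dag.tasks:
        original = task.pre_execute

        @functools.wraps(original)
        def wrapped(context, _original=original):
            hook(context)
            return _original(context)

        task.pre_execute = wrapped

def prepare_inputs(context):
    # Hypothetical DAG-level preprocessing, e.g., pull the upstream
    # vineyard objects referenced in XCom to the local instance.
    pass

# usage, given an existing `dag` object:
# apply_dag_level_pre_execute(dag, prepare_inputs)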
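For (4): the executor-side change we imagine looks roughly like the following. The locality-aware ordering hook does not exist in airflow today (that is exactly the proposed change), and `inputs_are_local` is a hypothetical predicate, e.g., backed by vineyard's metadata service:

from airflow.executors.base_executor import BaseExecutor

def inputs_are_local(ti):
    # Hypothetical placement lookup: do the task instance's input
    # objects already live on a node that has a free slot to run it?
    return False

class LocalityAwareExecutor(BaseExecutor):
    # In current airflow, BaseExecutor keeps queued tasks in
    # self.queued_tasks (key -> (command, priority, queue, ti)) and pops
    # them purely by priority weight; the method below sketches the
    # locality-aware ordering we propose instead.

    def queued_tasks_by_locality(self):
        def score(item):
            _key, (_command, priority, _queue, ti) = item
            # Prefer tasks whose inputs are already local, breaking ties
            # by the usual priority weight (higher priority first).
            return (0 if inputs_are_local(ti) else 1, -priority)
        return sorted(self.queued_tasks.items(), key=score)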
