Hi Mocheng,

Please allow me to share a question first: in your proposal, the API still accepts an Airflow DAG as the payload (just serialized or compressed), right?
If that's the case, I may not be fully convinced: the objectives in your proposal are automation and programmatic DAG submission. These can already be achieved efficiently through CI/CD practices plus a centralized place to manage your DAGs (e.g. a Git repo hosting the DAG files). As you are already aware, allowing this via API adds additional security concerns, and I doubt whether that trade-off "breaks even". Kindly let me know if I have missed anything or misunderstood your proposal.

Thanks.
Regards,
XD
----------------------------------------------------------------
(This is not a contribution)

On Wed, Aug 10, 2022 at 1:46 AM Mocheng Guo <gmca...@gmail.com> wrote:
> Hi Everyone,
>
> I have an enhancement proposal for the REST API service. This is based on
> the observation that Airflow users want to be able to access Airflow more
> easily as a platform service.
>
> The motivation comes from the following use cases:
> 1. Users like data scientists want to iterate over data quickly with
> interactive feedback in minutes, e.g. managing data pipelines inside a
> Jupyter Notebook while executing them in a remote Airflow cluster.
> 2. Services targeting specific audiences can generate DAGs from inputs
> like user commands or external triggers, and they want to be able to
> submit DAGs programmatically without manual intervention.
>
> I believe such use cases would help promote Airflow's usability and win
> it more popularity. The existing DAG repo model brings considerable
> overhead for such scenarios: a shared repo requires offline processes and
> can be slow to roll out.
>
> The proposal aims to provide an alternative where a DAG can be
> transmitted online. Here are some key points:
> 1. A DAG is packaged individually so that it can be distributed over the
> network. For example, a DAG may be a serialized binary or a zip file.
> 2. The Airflow REST API is the ideal place to talk with the external
> world. 
The API would provide a generic interface to accept DAG artifacts
> and should be extensible to support different artifact formats if needed.
> 3. DAG persistence needs to be implemented, since these DAGs are not part
> of the DAG repository.
> 4. DAGs submitted via the API behave the same as those defined in the
> repo, i.e. users write DAGs in the same syntax, and their scheduling,
> execution, and web server UI should behave the same way.
>
> Since DAGs are written as code, running arbitrary code inside Airflow may
> pose high security risks. Here are a few proposals to mitigate those
> risks:
> 1. Accept DAGs only from trusted parties. Airflow already supports
> pluggable authentication modules where strong authentication such as
> Kerberos can be used.
> 2. Execute DAG code as the API identity, i.e. a DAG created through the
> API service will have run_as_user set to the API identity.
> 3. To enforce data access control on DAGs, the API identity should also
> be used to access the data warehouse.
>
> We shared a demo based on a prototype implementation at the summit, and
> some details are described in this ppt
> <https://drive.google.com/file/d/1luDGvWRA-hwn2NjPoobis2SL4_UNYfcM/view>.
> We would love to get feedback and comments from the community about this
> initiative.
>
> thanks
> Mocheng
>
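For concreteness, the client side of the flow sketched in the quoted proposal (package a DAG individually, e.g. as a zip file, then hand it to a REST endpoint) might look like the following. This is a minimal sketch: the endpoint path, headers, and the `package_dag` helper are assumptions for illustration, not an existing Airflow API.

```python
# Hypothetical client-side sketch of the proposed "submit a packaged DAG
# over the REST API" flow. Nothing here is a real Airflow endpoint; the
# packaging step (point 1 of the proposal) is the only part that runs.
import io
import zipfile


def package_dag(dag_file_name: str, dag_source: str) -> bytes:
    """Bundle a single DAG file into a zip archive, one of the
    artifact formats the proposal suggests for network transfer."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(dag_file_name, dag_source)
    return buf.getvalue()


# An ordinary DAG in the usual syntax (point 4: same authoring experience).
DAG_SOURCE = '''
from airflow import DAG
from airflow.operators.empty import EmptyOperator
import pendulum

with DAG(dag_id="demo_dag",
         start_date=pendulum.datetime(2022, 8, 1)) as dag:
    EmptyOperator(task_id="noop")
'''

artifact = package_dag("demo_dag.py", DAG_SOURCE)

# The submission itself would then be something like (hypothetical):
#   POST /api/v1/dags/artifacts
#   Authorization: <token from a strong auth module, e.g. Kerberos>
#   Content-Type: application/zip
# Per the security points, the server would persist the artifact and set
# run_as_user to the authenticated caller's identity.
```

The design choice worth noting is that the artifact is opaque bytes to the API layer, which is what makes the interface "extensible to support different artifact formats" as point 2 requires.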