[DISCUSSION] Enhanced Multi-Tenant Dataset Management in Airflow: Potential First Steps

Eduardo Nicastro Thu, 11 Jan 2024 07:31:09 -0800

Hello all,

I'm reaching out to propose a topic for discussion that has recently
emerged in our GitHub discussion threads (#36723
<https://github.com/apache/airflow/discussions/36723>). It revolves around
enhancing the management of datasets in a multi-tenant Airflow architecture.


Use case/motivation
In our multi-instance setup, synchronizing dataset dependencies across
instances poses significant challenges. With the advent of dataset
listeners, a new door has opened for cross-instance dataset awareness. I
propose we explore creating endpoints to export dataset updates, so that
DAGs consuming a Dataset can be triggered across tenants.

Context
Below I give some context about our current situation and the solution we
have in place, then propose a new workflow that would be more efficient. To
implement this new workflow, we would need a way to export Dataset updates,
as mentioned above.

Current Workflow
In our organization, we're dealing with multiple Airflow tenants, let's say
Tenant 1 and Tenant 2, as examples. To synchronize Dataset A across these
tenants, we currently have a complex setup:

   1. Containers run on a schedule to export metadata to CosmosDB (these
   will be replaced by the listener).
   2. Additional scheduled containers pull data from CosmosDB and write it
   to a shared file system, enabling generated DAGs to read it and mirror a
   dataset across tenants.


Proposed Workflow
Here's a breakdown of our proposed workflow:

   1. Cross-Tenant Dataset Interaction: We have DAGs in Tenant 1 producing
   Dataset A. We need a mechanism to trigger all DAGs consuming Dataset A in
   Tenant 2. This interaction is crucial for our data pipeline's efficiency
   and synchronicity.
   2. Dataset Listener Implementation: Our approach involves implementing a
   Dataset listener that programmatically creates Dataset A in all tenants
   where it's not present (like Tenant 2) and exports Dataset updates when
   they happen. This would trigger an update on all DAGs consuming from that
   Dataset.
   3. Standardized Dataset Names: We plan to use standardized dataset
   names, which makes sense since a dataset's URI is its identifier, so
   uniqueness across tenants is a logical requirement.
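
To make step 2 more concrete, here is a minimal sketch of the export logic
such a listener could call when a dataset changes. It is only an
illustration under stated assumptions: the Tenant class, the helper names,
the payload shape, and the "/hypothetical/dataset-updates" endpoint path are
all made up for this example and are not an existing Airflow API. In a real
deployment this function would presumably be invoked from a dataset
listener hook, with authentication and retries added.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List
from urllib import request


@dataclass(frozen=True)
class Tenant:
    """One Airflow instance (hypothetical helper type for this sketch)."""
    name: str
    base_url: str  # assumption: e.g. "https://airflow-tenant-2.example.com"


def build_update_payload(dataset_uri: str, source_tenant: str) -> Dict:
    """Build the body of a dataset-update notification (assumed shape)."""
    return {"dataset_uri": dataset_uri, "extra": {"source_tenant": source_tenant}}


def http_post(url: str, payload: Dict) -> None:
    """Send the payload as JSON with a plain HTTP POST (no auth, no retries)."""
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=10) as resp:
        resp.read()


def export_dataset_update(
    dataset_uri: str,
    source_tenant: str,
    tenants: Iterable[Tenant],
    post: Callable[[str, Dict], None] = http_post,
    endpoint: str = "/hypothetical/dataset-updates",  # not a real Airflow route
) -> List[str]:
    """Notify every tenant except the producer; return the names notified."""
    payload = build_update_payload(dataset_uri, source_tenant)
    notified = []
    for tenant in tenants:
        if tenant.name == source_tenant:
            continue  # the producing tenant already knows about the update
        post(tenant.base_url.rstrip("/") + endpoint, payload)
        notified.append(tenant.name)
    return notified
```

The sender is passed in as the post argument so the fan-out logic can be
exercised without network access, for example by passing a function that
records its calls. Because the dataset URI is the only identifier sent, the
sketch relies on the standardized names from step 3.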

[image: image.png]

Why This Matters:

   - It offers a streamlined, automated way to manage datasets across
   different Airflow instances.
   - It aligns with a need for efficient, interconnected workflows in a
   multi-tenant environment.


I invite the community to discuss:

   - Are there alternative methods within Airflow's current framework that
   could achieve similar goals?
   - Any insights or experiences that could inform our approach?

Your feedback and suggestions are invaluable, and I look forward to a
collaborative discussion.

Best Regards,
Eduardo Nicastro
