Hi Elad,

Good point. While this AIP does not solve that challenge directly, fsspec
does implement a GitFS. This means that if we extend universal-pathlib,
which ObjectStoragePath relies upon, GitFS support becomes available right
away. GitFS also understands versions and branches, which comes in handy
for AIP-63.

So making DAG parsing and processing independent of the local fs gives us
more flexibility towards the future.
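
To make that concrete, here is a minimal sketch (not the code in the PR)
of loading a module through the pathlib.Path API only. PathSourceLoader
and load_module_from_path are hypothetical names used for illustration.
The sketch assumes the loader only needs read_text(), so any Path-like
object that provides it should in principle work in place of a local Path.

```python
import importlib.util
import sys
import tempfile
from importlib.abc import Loader
from pathlib import Path


class PathSourceLoader(Loader):
    """Hypothetical loader that reads source via the Path API only."""

    def __init__(self, path):
        self.path = path

    def create_module(self, spec):
        # Returning None asks importlib to create the module object itself.
        return None

    def exec_module(self, module):
        # No open() on an OS path: only Path.read_text() is used.
        source = self.path.read_text()
        code = compile(source, str(self.path), "exec")
        exec(code, module.__dict__)


def load_module_from_path(name, path):
    """Import the file at `path` as module `name` (hypothetical helper)."""
    loader = PathSourceLoader(path)
    spec = importlib.util.spec_from_loader(name, loader, origin=str(path))
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    spec.loader.exec_module(module)
    return module


# Usage with a local file standing in for a remote location.
with tempfile.TemporaryDirectory() as tmp:
    dag_file = Path(tmp) / "example_dag.py"
    dag_file.write_text("DAG_ID = 'example'\n")
    mod = load_module_from_path("example_dag", dag_file)
    print(mod.DAG_ID)
```

Caching, and imports between modules inside the DAG folder, are left out
of this sketch.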

Bolke


On Sun, 26 May 2024 at 12:57, Elad Kalif <[email protected]> wrote:

> Thank you Bolke!
> Interesting read.
>
> I have a question about what pain we are trying to solve here. Most use
> cases I have encountered were about the need to sync DAGs from a branch in
> GitHub (or equivalent) to the Airflow DAG folder.
> Correct me if I am wrong but this AIP does not handle this. A sync
> component will still be required to sync from git to S3/GCS/Other storage
> and this AIP solves only the part that Airflow machines will be able to
> fetch the files from storage. Is that correct?
>
> On Sun, May 26, 2024 at 10:55 AM Bolke de Bruin <[email protected]> wrote:
>
> > Hi All,
> >
> > I would like to discuss a new AIP aimed at enhancing the DAG loading
> > mechanism to support reading DAGs from ephemeral storage solutions. This
> > proposal is intended to supersede AIP-5 Remote DAG Fetcher and provide a
> > more flexible and scalable approach and to prepare for AIP-63.
> >
> > https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-71+Generalizing+DAG+Loader+and+Processor+for+Ephemeral+Storage
> >
> > *Abstract*
> > This proposal aims to generalize the DAG loader and processor to use
> > pathlib.Path for file operations instead of assuming direct OS filesystem
> > access. It includes implementing a custom module loader that supports
> > loading from ObjectStoragePath locations and other Path-like
> > abstractions, with caching capabilities provided by fsspec. Furthermore,
> > while this AIP does not directly implement DAG versioning, it creates a
> > foundational layer that can be extended to support DAG versioning as
> > outlined in AIP-63.
> >
> > A work in progress PR can be found here:
> > https://github.com/apache/airflow/pull/39647
> >
> > *Key points for discussion*
> >
> > Previous proposals, like AIP-5, suggested using a Fetcher mechanism, kind
> > of like an in-process git-sync. This proposal is about making that
> > redundant by fully supporting object storage locations, leveraging
> > ObjectStoragePath and fsspec caching mechanisms.
> >
> > Earlier feedback on AIP-5 was that having an additional Fetcher process
> > was out of scope for the project. With the transient integration of
> > pathlib.Path and ObjectStoragePath I think this argument does not hold
> > anymore, and the demand is there. In addition, the added flexibility
> > makes AIP-63 easier to implement (what that looks like remains to be
> > seen).
> >
> > Airflow scans DAGs often. This very likely requires a caching mechanism
> > for both the DAGs and their modules. Fsspec does implement caching, and
> > it is planned to leverage this.
> >
> > Non-DAG, non-module assets that are part of the DAG folder are out of
> > scope. For example, if for some reason you include a GIF, it will not
> > automatically be available without changes to your code.
> >
> > I kindly request your thoughts :-).
> >
> > Bolke
> >
> > --
> > Bolke de Bruin
> > [email protected]
> >
>


-- 

Bolke de Bruin
[email protected]
