I thought it would be prudent to start by refactoring the DagBag and implementing the DagFetcher abstraction, with just the default FileSystemDagFetcher, in a way that can be merged without breaking anything. Maybe the existing JIRA can cover that first step, and then we create separate JIRAs for each DagFetcher implementation?
I took a first stab at the initial step with this PR <https://github.com/apache/incubator-airflow/pull/3138>; maybe it helps make things more concrete. Because the file-system-related code was moved out of DagBag into FileSystemDagFetcher, the diff is bigger than the actual changes. It is basically:

- A DagFetcher abstract base class.
- A FileSystemDagFetcher, whose code is what we already had scattered around the DagBag class.
- A *get_dag_fetcher* factory method that instantiates the right fetcher based on the scheme (the URI prefix) of the dags_folder setting.
- DagBag instances initialize and hold their own dag_fetcher (but always use the default for the example_dags).

Of course, this PR should not break anything :) Hopefully this is not too far from the desired goal and we can go from here.
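For concreteness, here is a minimal sketch of how the pieces could fit together. The class and factory names match the description above; the method name fetch_dags and the exact wiring are illustrative, so please check the PR for the real code:

```python
import os
from abc import ABCMeta, abstractmethod

import six

try:  # Python 2/3 compatibility
    from urllib.parse import urlparse
except ImportError:
    from urlparse import urlparse


@six.add_metaclass(ABCMeta)
class DagFetcher(object):
    """Knows how to materialize DAG definition files for a dags_folder URI."""

    def __init__(self, dags_folder):
        self.dags_folder = dags_folder

    @abstractmethod
    def fetch_dags(self):
        """Yield local filepaths of DAG definition files."""


class FileSystemDagFetcher(DagFetcher):
    """Default fetcher: the os.walk logic that used to live inside DagBag."""

    def fetch_dags(self):
        for root, _, files in os.walk(self.dags_folder, followlinks=True):
            for name in files:
                if name.endswith('.py'):
                    yield os.path.join(root, name)


def get_dag_fetcher(dags_folder):
    """Factory: instantiate the right fetcher from the dags_folder scheme."""
    scheme = urlparse(dags_folder).scheme
    fetchers = {
        '': FileSystemDagFetcher,      # bare paths keep today's behaviour
        'file': FileSystemDagFetcher,
        # 'hdfs': HDFSDagFetcher,      # future implementations, one JIRA each
        # 's3': S3DagFetcher,
        # 'gcs': GCSDagFetcher,
    }
    return fetchers[scheme](dags_folder)
```

A DagBag would then call get_dag_fetcher(dags_folder) once in its constructor, hold the result, and delegate the directory-walking part of collect_dags to it, while the example_dags always go through the plain FileSystemDagFetcher.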
Thanks in advance for any help,
Diogo

On 17 March 2018 at 00:02, Tao Feng <[email protected]> wrote:

> +1.
>
> I think having a design doc is good. And it would be great if you could
> create a couple of small JIRAs for the related tasks. I am interested in
> helping out if possible.
>
> Thanks,
> -Tao
>
> On Fri, Mar 16, 2018 at 3:52 AM, Diogo Franco <[email protected]> wrote:
>
> > Created this JIRA <https://issues.apache.org/jira/browse/AIRFLOW-2221>.
> >
> > I'm happy to take a shot at this with an initial implementation for
> > review, but if it is preferred to start with a design doc or something,
> > let me know.
> >
> > Thank you for the guidance, cheers,
> > Diogo
> >
> > On 16 March 2018 at 00:08, Maxime Beauchemin <[email protected]> wrote:
> >
> > > I'm happy to commit to providing guidance and reviewing the code if
> > > someone wants to work on this feature.
> > >
> > > Max
> > >
> > > On Thu, Mar 15, 2018 at 4:42 PM, Kevin Pamplona <[email protected]> wrote:
> > >
> > > > I'd also definitely be interested in this, as we have an async cron
> > > > job that syncs with a remote S3 location. I'd also be happy to help
> > > > tackle some of this work if there's a ticket involved.
> > > >
> > > > On Thu, Mar 15, 2018 at 4:38 PM, Joy Gao <[email protected]> wrote:
> > > >
> > > > > Hi guys,
> > > > >
> > > > > A related topic has been discussed recently via a separate email
> > > > > thread (see 'How to add hooks for strong deployment consistency?
> > > > > <https://lists.apache.org/thread.html/%3CCAB=[email protected]%3E>').
> > > > >
> > > > > The idea brought up by Maxime is to modify DagBag and implement a
> > > > > DagFetcher abstraction, where the default is "FileSystemDagFetcher",
> > > > > but it opens the door for "GitRepoDagFetcher", "ArtifactoryDagFetcher",
> > > > > "TarballInS3DagFetcher", or in this case, "HDFSDagFetcher",
> > > > > "S3DagFetcher", and "GCSDagFetcher".
> > > > >
> > > > > We are all in favor of this, but as far as I'm aware no one has
> > > > > owned this yet. So if you (or anyone) wants to work on this, please
> > > > > create a JIRA and call it out :)
> > > > >
> > > > > Cheers,
> > > > > Joy
> > > > >
> > > > > On Thu, Mar 15, 2018 at 3:54 PM, Chris Fei <[email protected]> wrote:
> > > > >
> > > > > > Hi Diogo,
> > > > > >
> > > > > > This would be valuable for me as well; I'd love first-class
> > > > > > support for hdfs://..., s3://..., gcs://..., etc. as a value for
> > > > > > dags_folder. As a workaround, I deploy a maintenance DAG that
> > > > > > periodically downloads other DAGs from GCS into my DAG folder.
> > > > > > Not perfect, but gets the job done.
> > > > > >
> > > > > > Chris
> > > > > >
> > > > > > On Thu, Mar 15, 2018, at 6:32 PM, Diogo Franco wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I think that the ability to fill up the DagBag from remote
> > > > > > > locations would be useful (in my use case, having the dags
> > > > > > > folder in HDFS would greatly simplify the release process).
> > > > > > >
> > > > > > > Was there any discussion on this previously? I looked around
> > > > > > > briefly but couldn't find it.
> > > > > > >
> > > > > > > Maybe the method *DagBag.collect_dags* in *airflow/models.py*
> > > > > > > could delegate the walking part to specific methods based on
> > > > > > > the *dags_folder* prefix, in a sort of plugin architecture.
> > > > > > > This would allow the dags_folder to be defined like
> > > > > > > hdfs://namenode/user/airflow/dags, or s3://...
> > > > > > >
> > > > > > > If this makes sense, I'd love to work on it.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Diogo Franco
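P.S. For anyone who needs a stop-gap in the meantime, the maintenance DAG Chris describes above could look roughly like the sketch below. The bucket name, local path, and schedule are placeholders, and it assumes gsutil is available wherever the task runs:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='sync_dags_from_gcs',
    start_date=datetime(2018, 3, 1),
    schedule_interval=timedelta(minutes=5),
    catchup=False,
)

sync_dags = BashOperator(
    task_id='gsutil_rsync_dags',
    # -m: parallel transfers; -r: recurse; -d: also delete local files
    # that have been removed from the bucket, so the folders stay in sync
    bash_command='gsutil -m rsync -r -d gs://my-bucket/dags /path/to/dags_folder',
    dag=dag,
)
```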
