The historical reason is that people would check scripts into the repo that performed actual compute or had other undesired side effects at module scope (scripts with no `if __name__ == '__main__':` guard), and Airflow would just run those scripts while scanning for DAGs. So we added this mitigation patch, which confirms that there's something Airflow-related in the .py file before importing it. Not elegant, and confusing at times, but it has probably also prevented some issues over the years.
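That mitigation amounts to a cheap substring check on the raw bytes of each file before importing it. A simplified, stand-alone sketch of the heuristic (illustrative only; the real check lives in `DagBag.process_file`):

```python
def might_contain_dag(content: bytes) -> bool:
    """Cheap pre-filter: only parse .py files whose raw contents
    mention both b"DAG" and b"airflow" (mirrors Airflow's heuristic)."""
    return all(s in content for s in (b"DAG", b"airflow"))

# A plain script with module-level side effects is skipped entirely:
print(might_contain_dag(b"import requests\nrequests.post('http://...')"))  # False
# A file mentioning both magic strings passes the filter and gets imported:
print(might_contain_dag(b"from airflow import DAG\n"))  # True
```

Note the trade-off the thread discusses: the check is purely textual, so a file that builds DAGs indirectly (e.g. via a factory in another module) can fail it unless the strings appear somewhere, even in a comment.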
The solution here is to have a more explicit way of adding DAGs to the DagBag (instead of the folder-crawling approach). The DagFetcher proposal offers solutions around that, having a central "manifest" file that provides explicit pointers to all DAGs in the environment.

Max

On Sat, Nov 24, 2018 at 5:04 PM Beau Barker <beauinmelbou...@gmail.com> wrote:

> In my opinion this searching for DAGs is not ideal.
>
> We should be explicitly specifying the DAGs to load somewhere.
>
> > On 25 Nov 2018, at 10:41 am, Kevin Yang <yrql...@gmail.com> wrote:
> >
> > I believe that is mostly because we want to skip parsing/loading .py files
> > that don't contain DAG defs to save time, as the scheduler is going to
> > parse/load the .py files over and over again and some files can take
> > quite long to load.
> >
> > Cheers,
> > Kevin Y
> >
> > On Fri, Nov 23, 2018 at 12:44 AM soma dhavala <soma.dhav...@gmail.com> wrote:
> >
> >> Happy to report that the "fix" worked. Thanks Alex.
> >>
> >> BTW, wondering why it was there in the first place? How does it help:
> >> saves time, early termination, what?
> >>
> >>> On Nov 23, 2018, at 8:18 AM, Alex Guziel <alex.guz...@airbnb.com> wrote:
> >>>
> >>> Yup.
> >>>
> >>> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <soma.dhav...@gmail.com> wrote:
> >>>
> >>>> On Nov 23, 2018, at 3:28 AM, Alex Guziel <alex.guz...@airbnb.com> wrote:
> >>>>
> >>>> It's because of this:
> >>>>
> >>>> "When searching for DAGs, Airflow will only consider files where the
> >>>> strings "airflow" and "DAG" both appear in the contents of the .py file."
> >>>
> >>> Had not noticed it. From airflow/models.py, in process_file (both in
> >>> 1.9 and 1.10):
> >>>
> >>>     if not all([s in content for s in (b'DAG', b'airflow')]):
> >>>
> >>> is looking for those strings and, if they are not found, it returns
> >>> without loading the DAGs.
> >>>
> >>> So having "airflow" and "DAG" dummy strings placed somewhere will make
> >>> it work?
> >>>
> >>>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <soma.dhav...@gmail.com> wrote:
> >>>>
> >>>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guz...@airbnb.com> wrote:
> >>>>>
> >>>>> I think this is what is going on. The DAGs are picked up via local
> >>>>> variables, i.e. if you do
> >>>>>
> >>>>>     dag = DAG(...)
> >>>>>     dag = DAG(...)
> >>>>
> >>>> from my_module import create_dag
> >>>>
> >>>> for file in yaml_files:
> >>>>     dag = create_dag(file)
> >>>>     globals()[dag.dag_id] = dag
> >>>>
> >>>> You notice that create_dag is in a different module. If it is in the
> >>>> same scope (file), it will be fine.
> >>>>
> >>>>> Only the second DAG will be picked up.
> >>>>>
> >>>>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <soma.dhav...@gmail.com> wrote:
> >>>>>
> >>>>> Hey Airflow devs:
> >>>>>
> >>>>> In our organization, we build a Machine Learning WorkBench with
> >>>>> Airflow as an orchestrator of the ML workflows, and have wrapped
> >>>>> Airflow Python operators to customize the behaviour. These workflows
> >>>>> are specified in YAML.
> >>>>>
> >>>>> We drop a DAG loader (written in Python) in the default location where
> >>>>> Airflow expects the DAG files. This DAG loader reads the specified
> >>>>> YAML files and converts them into Airflow DAG objects. Essentially, we
> >>>>> are programmatically creating the DAG objects. In order to support
> >>>>> multiple parsers (YAML, JSON, etc.), we separated the DAG creation
> >>>>> from loading. But when a DAG is created (in a separate module) and
> >>>>> made available to the DAG loaders, Airflow does not pick it up. As an
> >>>>> example, consider that I created a DAG and pickled it, and will simply
> >>>>> unpickle the DAG and give it to Airflow.
> >>>>>
> >>>>> However, in the current avatar of Airflow, the very creation of the
> >>>>> DAG has to happen in the loader itself. As far as I am concerned,
> >>>>> Airflow should not care where and how the DAG object is created, so
> >>>>> long as it is a valid DAG object. The workaround for us is to mix
> >>>>> parser and loader in the same file and drop it in the Airflow default
> >>>>> dags folder. During dag_bag creation, this file is loaded up with the
> >>>>> import_modules utility and shows up in the UI. While this is a
> >>>>> solution, it is not clean.
> >>>>>
> >>>>> What do devs think about a solution to this problem? Will saving the
> >>>>> DAG to the db and reading it from the db work? Or do some core changes
> >>>>> need to happen in dag_bag creation? Can dag_bag take a bunch of
> >>>>> "created" DAGs?
> >>>>>
> >>>>> thanks,
> >>>>> -soma
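Putting the thread's two findings together, the working loader pattern can be sketched as a single self-contained file: bind each generated DAG to a unique module-level name via globals() (reassigning one local in a loop leaves only the last DAG discoverable), and make sure the file's text contains "airflow" and "DAG" so the content heuristic does not skip it. Stub classes stand in here for airflow.models.DAG and the external factory module; the pipeline names are illustrative.

```python
# This loader file mentions "airflow" and "DAG" in its text, so the
# substring heuristic lets it through to the real import step.

class DAG:
    """Stub standing in for airflow.models.DAG (illustration only)."""
    def __init__(self, dag_id):
        self.dag_id = dag_id

def create_dag(spec):
    """Stands in for a factory imported from a different module,
    e.g. `from my_module import create_dag` in the thread."""
    return DAG(dag_id=spec["name"])

for spec in ({"name": "pipeline_a"}, {"name": "pipeline_b"}):
    dag = create_dag(spec)
    # Register under a unique module-level name; Airflow collects DAGs
    # from the module's globals, so reusing the local `dag` binding
    # alone would expose only the last DAG created.
    globals()[dag.dag_id] = dag

print(sorted(name for name in globals() if name.startswith("pipeline")))
# → ['pipeline_a', 'pipeline_b']
```

This is exactly why the factory living in another module is fine: what matters is that the loader file itself holds the module-level references (and passes the string check), not where the DAG objects are constructed.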