The historical reason is that people would check scripts into the repo
that did real computation or had other undesired side effects at module
scope (scripts with no "if __name__ == '__main__':" guard), and Airflow
would just execute those scripts while searching for DAGs. So we added
this mitigation patch to confirm that there's something Airflow-related
in the .py file before loading it. Not elegant, and confusing at times,
but it has probably also prevented some issues over the years.
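
To illustrate (hypothetical file, not from any real repo): a script
like this does real work every time the scheduler merely imports it,
while the guarded version is safe to crawl:

    # expensive_script.py -- module-scope statements run on *import*,
    # i.e. on every scheduler parse of the file
    import time

    time.sleep(60)            # runs just by importing the module
    print("side effect!")     # so does this

    # the same work behind a guard only runs when executed directly
    if __name__ == '__main__':
        time.sleep(60)
        print("side effect!")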

The solution here is to have a more explicit way of adding DAGs to the
DagBag (instead of the folder-crawling approach). The DagFetcher proposal
offers a solution around that: a central "manifest" file that provides
explicit pointers to all DAGs in the environment.
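
The exact format is still up for discussion in the proposal, but
conceptually something like this (purely a sketch; the file name and
manifest format here are made up):

    # load_from_manifest.py -- import only the files a manifest lists,
    # instead of crawling the dags folder
    import importlib.util

    with open('dag_manifest.txt') as f:        # one .py path per line
        for path in f.read().split():
            spec = importlib.util.spec_from_file_location(path, path)
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)    # DAGs end up in module.__dict__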

Max

On Sat, Nov 24, 2018 at 5:04 PM Beau Barker <beauinmelbou...@gmail.com>
wrote:

> In my opinion this searching for DAGs is not ideal.
>
> We should be explicitly specifying the DAGs to load somewhere.
>
>
> > On 25 Nov 2018, at 10:41 am, Kevin Yang <yrql...@gmail.com> wrote:
> >
> > I believe that is mostly because we want to skip parsing/loading .py
> > files that don't contain DAG defs to save time, as the scheduler is
> > going to parse/load the .py files over and over again and some files
> > can take quite a long time to load.
> >
> > Cheers,
> > Kevin Y
> >
> > On Fri, Nov 23, 2018 at 12:44 AM soma dhavala <soma.dhav...@gmail.com>
> > wrote:
> >
> >> happy to report that the “fix” worked. thanks Alex.
> >>
> >> btw, wondering why it was there in the first place? how does it
> >> help: saving time, early termination, what?
> >>
> >>
> >>> On Nov 23, 2018, at 8:18 AM, Alex Guziel <alex.guz...@airbnb.com> wrote:
> >>>
> >>> Yup.
> >>>
> >>> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <soma.dhav...@gmail.com> wrote:
> >>>
> >>>
> >>>> On Nov 23, 2018, at 3:28 AM, Alex Guziel <alex.guz...@airbnb.com> wrote:
> >>>>
> >>>> It’s because of this
> >>>>
> >>>> “When searching for DAGs, Airflow will only consider files where
> >>>> the string “airflow” and “DAG” both appear in the contents of the
> >>>> .py file.”
> >>>>
> >>>
> >>> Have not noticed it.  From airflow/models.py, in process_file (both
> >>> in 1.9 and 1.10):
> >>>
> >>>     ...
> >>>     if not all([s in content for s in (b'DAG', b'airflow')]):
> >>>     ...
> >>>
> >>> is looking for those strings, and if they are not found, it returns
> >>> without loading the DAGs.
> >>>
> >>>
> >>> So having “airflow” and “DAG” dummy strings placed somewhere will
> >>> make it work?
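> >>> e.g. just a comment like
> >>>
> >>>     # uses airflow to build DAG objects
> >>>
> >>> anywhere in the file, since the check above is a plain substring
> >>> match on the raw contents?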
> >>>
> >>>
> >>>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <soma.dhav...@gmail.com> wrote:
> >>>>
> >>>>
> >>>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guz...@airbnb.com> wrote:
> >>>>>
> >>>> I think this is what is going on. The DAGs are picked up from
> >>>> module-level variables, i.e. if you do
> >>>>
> >>>>     dag = DAG(...)
> >>>>     dag = DAG(...)
> >>>>
> >>>> from my_module import create_dag
> >>>>
> >>>> for file in yaml_files:
> >>>>     dag = create_dag(file)
> >>>>     globals()[dag.dag_id] = dag
> >>>>
> >>>> Notice that create_dag is in a different module. If it is in the
> >>>> same scope (file), it will be fine.
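> >>>>
> >>>> For context, a minimal sketch of what such a factory might look
> >>>> like (the YAML fields here are hypothetical):
> >>>>
> >>>>     # my_module.py
> >>>>     from datetime import datetime
> >>>>     import yaml
> >>>>     from airflow import DAG
> >>>>
> >>>>     def create_dag(path):
> >>>>         spec = yaml.safe_load(open(path))
> >>>>         return DAG(dag_id=spec['dag_id'],
> >>>>                    schedule_interval=spec.get('schedule'),
> >>>>                    start_date=datetime(2018, 1, 1))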
> >>>>
> >>>>>
> >>>>
> >>>>> Only the second DAG will be picked up.
> >>>>>
> >>>>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <soma.dhav...@gmail.com> wrote:
> >>>>> Hey Airflow Devs:
> >>>>> In our organization, we build a Machine Learning WorkBench with
> >>>>> Airflow as an orchestrator of the ML workflows, and have wrapped
> >>>>> Airflow Python operators to customize their behaviour. These
> >>>>> workflows are specified in YAML.
> >>>>>
> >>>>> We drop a DAG loader (written in Python) in the default location
> >>>>> where Airflow expects the DAG files.  This DAG loader reads the
> >>>>> specified YAML files and converts them into Airflow DAG objects.
> >>>>> Essentially, we are programmatically creating the DAG objects. In
> >>>>> order to support multiple parsers (YAML, JSON, etc.), we separated
> >>>>> the DAG creation from the loading. But when a DAG is created (in a
> >>>>> separate module) and made available to the DAG loaders, Airflow
> >>>>> does not pick it up. As an example, consider that I created a DAG,
> >>>>> pickled it, and will simply unpickle the DAG and give it to
> >>>>> Airflow.
> >>>>>
> >>>>> However, in the current avatar of Airflow, the very creation of
> >>>>> the DAG has to happen in the loader itself. As far as I am
> >>>>> concerned, Airflow should not care where and how the DAG object
> >>>>> is created, so long as it is a valid DAG object. The workaround
> >>>>> for us is to mix the parser and the loader in the same file and
> >>>>> drop it in the Airflow default dags folder. During dag_bag
> >>>>> creation, this file is loaded with the import_modules utility and
> >>>>> shows up in the UI. While this is a solution, it is not clean.
> >>>>>
> >>>>> What do devs think about a solution to this problem? Will saving
> >>>>> the DAG to the db and reading it from the db work? Or do some
> >>>>> core changes need to happen in the dag_bag creation? Can dag_bag
> >>>>> take a bunch of "created" DAGs?
> >>>>>
> >>>>> thanks,
> >>>>> -soma
> >>>>
> >>>
> >>
> >>
>
