Great inputs, James. I was premature in saying we need microservices. Any solution should depend on the problem(s) being solved and the promise(s) being made.
thanks,
-soma

> On Nov 28, 2018, at 11:24 PM, James Meickle <jmeic...@quantopian.com.INVALID> wrote:
>
> I would be very interested in helping draft a rearchitecting AIP. Of course, that's a vague statement. I am interested in several specific areas of Airflow functionality that would be hard to modify without some refactoring taking place first:
>
> 1) Improving Airflow's data model so it's easier to have functional data pipelines (such as addressing information propagation and artifacts via a non-XCom mechanism)
>
> 2) Having point-in-timeness for DAGs: a concept of which revision of a DAG was in use at which date, represented in-Airflow.
>
> 3) Better idioms and loading capabilities for DAG factories (either config-driven, or non-Python creation of DAGs, like with boundary-layer).
>
> 4) Flexible execution dates: in finance we operate day over day, and have valid use cases for "t-1", "t+0", and "t+1" dates. The current execution date behavior is incredibly confusing for literally every developer we've brought onto Airflow (they understand it eventually but do make mistakes at first).
>
> 5) Scheduler-integrated sensors
>
> 6) Making Airflow more operator-friendly with better alerting, health checks, notifications, deploy-time configuration, etc.
>
> 7) Improving testability of various components (both within the Airflow repo, as well as making it easier to test DAGs and plugins)
>
> 8) Deprecating "newbie trap" or excess-complexity features (like skips), by fixing their internal implementation or by providing alternatives that address their use cases in more sound ways.
>
> To my mind, I would need Airflow to be more modular to accomplish several of those. Even if these aims don't happen in Airflow contrib (as some are quite contentious and have been discussed on this list before), it would currently be nearly impossible to maintain an in-house branch that attempted to implement them.
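[A sketch of the execution-date confusion in point 4 above: in Airflow's current model, a run stamped with execution_date T fires only after the interval starting at T has closed, so the stamp trails the trigger time by one schedule interval. The function below is illustrative, not an Airflow API.]

```python
from datetime import datetime, timedelta

schedule_interval = timedelta(days=1)

def run_stamp_for(trigger_time, interval=schedule_interval):
    """Return the execution_date a daily run triggered at trigger_time carries."""
    # The run that fires "today" covers "yesterday's" data interval,
    # so its execution_date is one interval behind the trigger time.
    period_start = datetime(trigger_time.year, trigger_time.month, trigger_time.day)
    return period_start - interval

# A daily run that triggers just after midnight on Nov 29 is stamped Nov 28,
# which is the "t-1" surprise new developers hit.
print(run_stamp_for(datetime(2018, 11, 29, 0, 5)).date())  # 2018-11-28
```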
> That being said, saying that it requires microservices is IMO incorrect. Airflow already scales quite well, so while it needs more modularization, we probably would see no benefit from immediately breaking those modules into independent services.
>
> On Wed, Nov 28, 2018 at 11:38 AM Ash Berlin-Taylor <a...@apache.org> wrote:
>
>> I have similar feelings around the "core" of Airflow and would _love_ to somehow find time to spend a month really getting to grips with the scheduler and the dagbag and see what comes to light with fresh eyes and the benefits of hindsight.
>>
>> Finding that time is going to be.... A Challenge though.
>>
>> (Oh, except no to microservices. Airflow is hard enough to operate right now without splitting things into even more daemons)
>>
>> -ash
>>
>>> On 26 Nov 2018, at 03:06, soma dhavala <soma.dhav...@gmail.com> wrote:
>>>
>>>> On Nov 26, 2018, at 7:50 AM, Maxime Beauchemin <maximebeauche...@gmail.com> wrote:
>>>>
>>>> The historical reason is that people would check in scripts in the repo that had actual compute or other forms of undesired effects in module scope (scripts with no "if __name__ == '__main__':") and Airflow would just run the script while seeking for DAGs. So we added this mitigation patch that would confirm that there's something Airflow-related in the .py file. Not elegant, and confusing at times, but it also probably prevented some issues over the years.
>>>>
>>>> The solution here is to have a more explicit way of adding DAGs to the DagBag (instead of the folder-crawling approach). The DagFetcher proposal offers solutions around that, having a central "manifest" file that provides explicit pointers to all DAGs in the environment.
>>>
>>> Some rebasing needs to happen. When I looked at the 1.8 code base almost a year ago, it felt more complex than necessary.
>>> What airflow is trying to promise from an architectural standpoint — that was not clear to me. It is trying to do too many things, scattered in too many places, is the feeling I got. As a result, I stopped peeping, and just trust that it works — which it does, btw. I tend to think that airflow outgrew its original intents. A sort of micro-services architecture has to be brought in. I may sound critical, but no offense. I truly appreciate the contributions.
>>>
>>>> Max
>>>>
>>>> On Sat, Nov 24, 2018 at 5:04 PM Beau Barker <beauinmelbou...@gmail.com> wrote:
>>>>
>>>>> In my opinion this searching for dags is not ideal.
>>>>>
>>>>> We should be explicitly specifying the dags to load somewhere.
>>>>>
>>>>>> On 25 Nov 2018, at 10:41 am, Kevin Yang <yrql...@gmail.com> wrote:
>>>>>>
>>>>>> I believe that is mostly because we want to skip parsing/loading .py files that don't contain DAG defs, to save time, as the scheduler is going to parse/load the .py files over and over again and some files can take quite long to load.
>>>>>>
>>>>>> Cheers,
>>>>>> Kevin Y
>>>>>>
>>>>>> On Fri, Nov 23, 2018 at 12:44 AM soma dhavala <soma.dhav...@gmail.com> wrote:
>>>>>>
>>>>>>> happy to report that the "fix" worked. thanks Alex.
>>>>>>>
>>>>>>> btw, wondering why it was there in the first place? how does it help — saves time, early termination — what?
>>>>>>>
>>>>>>>> On Nov 23, 2018, at 8:18 AM, Alex Guziel <alex.guz...@airbnb.com> wrote:
>>>>>>>>
>>>>>>>> Yup.
>>>>>>>> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala <soma.dhav...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> On Nov 23, 2018, at 3:28 AM, Alex Guziel <alex.guz...@airbnb.com> wrote:
>>>>>>>>>
>>>>>>>>> It's because of this:
>>>>>>>>>
>>>>>>>>> "When searching for DAGs, Airflow will only consider files where the strings "airflow" and "DAG" both appear in the contents of the .py file."
>>>>>>>>
>>>>>>>> I had not noticed it. From airflow/models.py, in process_file (both in 1.9 and 1.10):
>>>>>>>>
>>>>>>>>     if not all([s in content for s in (b'DAG', b'airflow')]):
>>>>>>>>
>>>>>>>> is looking for those strings, and if they are not found, it returns without loading the DAGs.
>>>>>>>>
>>>>>>>> So will having "airflow" and "DAG" dummy strings placed somewhere make it work?
>>>>>>>>
>>>>>>>>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala <soma.dhav...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel <alex.guz...@airbnb.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I think this is what is going on. The DAGs are picked up by local variables, i.e. if you do
>>>>>>>>>>
>>>>>>>>>>     dag = Dag(...)
>>>>>>>>>>     dag = Dag(...)
>>>>>>>>>
>>>>>>>>> from my_module import create_dag
>>>>>>>>>
>>>>>>>>> for file in yaml_files:
>>>>>>>>>     dag = create_dag(file)
>>>>>>>>>     globals()[dag.dag_id] = dag
>>>>>>>>>
>>>>>>>>> You notice that create_dag is in a different module. If it is in the same scope (file), it will be fine.
>>>>>>>>>
>>>>>>>>>> Only the second dag will be picked up.
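[The discovery heuristic discussed above can be sketched as follows. simulate_process_file is an illustration of the quoted `if not all(...)` check's effect, not Airflow's actual loader; the factory snippets are hypothetical file contents.]

```python
# Airflow skips a .py file unless the bytes b'DAG' and b'airflow'
# both appear somewhere in it.
def simulate_process_file(content: bytes) -> bool:
    """Return True if Airflow would bother importing this file."""
    return all(s in content for s in (b'DAG', b'airflow'))

# A factory file that only imports create_dag from elsewhere fails the
# check: it never mentions the literal strings "DAG" or "airflow".
factory_v1 = b"""
from my_module import create_dag
for f in yaml_files:
    globals()[f] = create_dag(f)
"""
assert not simulate_process_file(factory_v1)

# The "dummy strings" workaround from the thread: mention both strings,
# e.g. in a comment, so the heuristic lets the file through.
factory_v2 = factory_v1 + b"\n# airflow DAG  <- satisfies the discovery heuristic\n"
assert simulate_process_file(factory_v2)
```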
>>>>>>>>>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala <soma.dhav...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hey AirFlow Devs:
>>>>>>>>>> In our organization, we build a Machine Learning WorkBench with AirFlow as an orchestrator of the ML workflows, and have wrapped AirFlow python operators to customize the behaviour. These workflows are specified in YAML.
>>>>>>>>>>
>>>>>>>>>> We drop a DAG loader (written in python) in the default location where airflow expects the DAG files. This DAG loader reads the specified YAML files and converts them into airflow DAG objects. Essentially, we are programmatically creating the DAG objects. In order to support multiple parsers (yaml, json, etc.), we separated the DAG creation from the loading. But when a DAG is created (in a separate module) and made available to the DAG loaders, airflow does not pick it up. As an example, consider that I created a DAG, pickled it, and will simply unpickle the DAG and give it to airflow.
>>>>>>>>>>
>>>>>>>>>> However, in the current avatar of airflow, the very creation of the DAG has to happen in the loader itself. As far as I am concerned, airflow should not care where and how the DAG object is created, so long as it is a valid DAG object. The workaround for us is to mix parser and loader in the same file and drop it in the airflow default dags folder. During dag_bag creation, this file is loaded up with the import_modules utility and shows up in the UI. While this is a solution, it is not clean.
>>>>>>>>>>
>>>>>>>>>> What do DEVs think about a solution to this problem? Will saving the DAG to the db and reading it from the db work?
>>>>>>>>>> Or do some core changes need to happen in the dag_bag creation? Can dag_bag take a bunch of "created" DAGs?
>>>>>>>>>>
>>>>>>>>>> thanks,
>>>>>>>>>> -soma
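[The parser/loader workaround described in the thread, collapsed into one file so the folder crawl finds it, can be sketched as below. A stand-in Dag class keeps the sketch self-contained; in a real deployment these would be airflow.models.DAG objects, and the specs would come from YAML files on disk rather than being inlined.]

```python
# airflow DAG  <- literal strings so the discovery heuristic accepts this file

class Dag:
    """Stand-in for airflow's DAG class, for illustration only."""
    def __init__(self, dag_id):
        self.dag_id = dag_id

def create_dag(spec: dict) -> Dag:
    """Parser half: turn one YAML-derived spec into a DAG object."""
    return Dag(dag_id=spec["name"])

# In the real setup these dicts would be yaml.safe_load()-ed from files.
specs = [{"name": "train_model"}, {"name": "score_model"}]

# Loader half: register each DAG as a module-level variable, since the
# dagbag discovers DAGs by inspecting module globals.
for spec in specs:
    dag = create_dag(spec)
    globals()[dag.dag_id] = dag
```

Keeping create_dag in this same file (rather than importing it from another module) is what makes the globals()-based registration visible to the crawl, which is exactly the "mix parser and loader" workaround the thread settles on.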