Re: programmatically creating and airflow quirks

2018-11-28 Thread soma dhavala
Great inputs, James. I was premature in saying we need microservices. Any 
solution should depend on the problem(s) being solved and the promise(s) being 
made.

thanks,
-soma

> On Nov 28, 2018, at 11:24 PM, James Meickle wrote:
> 
> I would be very interested in helping draft a rearchitecting AIP. Of
> course, that's a vague statement. I am interested in several specific areas
> of Airflow functionality that would be hard to modify without some
> refactoring taking place first:
> 
> 1) Improving Airflow's data model so it's easier to have functional data
> pipelines (such as addressing information propagation and artifacts via a
> non-xcom mechanism)
> 
> 2) Having point-in-timeness for DAGs: a concept of which revision of a DAG
> was in use at which date, represented in-Airflow.
> 
> 3) Better idioms and loading capabilities for DAG factories (either
> config-driven, or non-Python creation of DAGs, like with boundary-layer).
> 
> 4) Flexible execution dates: in finance we operate day over day, and have
> valid use cases for "t-1", "t+0", and "t+1" dates. The current execution
> date semantics are incredibly confusing for literally every developer we've
> brought onto Airflow (they understand them eventually but do make mistakes
> at first; see the sketch after this list).
> 
> 5) Scheduler-integrated sensors
> 
> 6) Making Airflow more operator-friendly with better alerting, health
> checks, notifications, deploy-time configuration, etc.
> 
> 7) Improving testability of various components (both within the Airflow
> repo, as well as making it easier to test DAGs and plugins)
> 
> 8) Deprecating "newbie trap" or excess complexity features (like skips), by
> fixing their internal implementation or by providing alternatives that
> address their use cases in more sound ways.
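> 
> To make (4) concrete, a minimal sketch of the confusion, with a
> hypothetical dag_id and dates:
> 
>     from datetime import datetime
>     from airflow import DAG
> 
>     dag = DAG(
>         dag_id="daily_example",
>         start_date=datetime(2018, 11, 27),
>         schedule_interval="@daily",
>     )
>     # The run labelled execution_date=2018-11-27 is only triggered once
>     # that interval *closes*, i.e. shortly after midnight on 2018-11-28.
>     # "Today's" run therefore carries "yesterday's" date: the t-1 vs t+0
>     # mismatch that trips up new developers.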
> 
> To my mind, I would need Airflow to be more modular to accomplish several
> of those. Even if these aims don't happen in Airflow contrib (as some are
> quite contentious and have been discussed on this list before), it would
> currently be nearly impossible to maintain an in-house branch that
> attempted to implement them.
> 
> That being said, saying that it requires microservices is IMO incorrect.
> Airflow already scales quite well, so while it needs more modularization,
> we probably would see no benefit from immediately breaking those modules
> into independent services.
> 
> On Wed, Nov 28, 2018 at 11:38 AM Ash Berlin-Taylor  wrote:
> 
>> I have similar feelings around the "core" of Airflow and would _love_ to
>> somehow find time to spend a month really getting to grips with the
>> scheduler and the dagbag and see what comes to light with fresh eyes and
>> the benefits of hindsight.
>> 
>> Finding that time is going to be A Challenge though.
>> 
>> (Oh, except no to microservices. Airflow is hard enough to operate right
>> now without splitting things into even more daemons)
>> 
>> -ash
>>> On 26 Nov 2018, at 03:06, soma dhavala  wrote:
>>> 
>>> 
>>> 
>>>> On Nov 26, 2018, at 7:50 AM, Maxime Beauchemin <maximebeauche...@gmail.com> wrote:
>>>> 
>>>> The historical reason is that people would check in scripts in the repo
>>>> that had actual compute or other forms of undesired effects at module
>>>> scope (scripts with no "if __name__ == '__main__':"), and Airflow would
>>>> just run such a script while looking for DAGs. So we added this
>>>> mitigation patch that would confirm that there's something
>>>> Airflow-related in the .py file. Not elegant, and confusing at times,
>>>> but it also probably prevented some issues over the years.
>>>> 
>>>> The solution here is to have a more explicit way of adding DAGs to the
>>>> DagBag (instead of the folder-crawling approach). The DagFetcher
>>>> proposal offers solutions around that: a central "manifest" file that
>>>> provides explicit pointers to all DAGs in the environment.
>>> 
>>> Some rebasing needs to happen. When I looked at the 1.8 code base almost
>>> a year ago, it felt more complex than necessary. What airflow is trying
>>> to promise from an architectural standpoint — that was not clear to me.
>>> It is trying to do too many things, scattered in too many places, is the
>>> feeling I got. As a result, I stopped peeking inside, and just trust that
>>> it works — which it does, btw. I tend to think that airflow outgrew its
>>> original intents. A sort of micro-services architecture has to be brought
>>> in. I may sound critical, but no offense. I truly appreciate the
>>> contributions.

Re: programmatically creating and airflow quirks

2018-11-25 Thread soma dhavala



> On Nov 26, 2018, at 7:50 AM, Maxime Beauchemin wrote:
> 
> The historical reason is that people would check in scripts in the repo
> that had actual compute or other forms of undesired effects at module scope
> (scripts with no "if __name__ == '__main__':"), and Airflow would just run
> such a script while looking for DAGs. So we added this mitigation patch that
> would confirm that there's something Airflow-related in the .py file. Not
> elegant, and confusing at times, but it also probably prevented some issues
> over the years.
> 
> The solution here is to have a more explicit way of adding DAGs to the
> DagBag (instead of the folder-crawling approach). The DagFetcher proposal
> offers solutions around that: a central "manifest" file that provides
> explicit pointers to all DAGs in the environment.
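> 
> A minimal sketch of that failure mode (the file and function names are
> hypothetical):
> 
>     # dags/report_script.py, checked in without a __main__ guard
>     def rebuild_reports():
>         print("recomputing everything...")
> 
>     rebuild_reports()  # runs at import time, so the scheduler executes
>                        # it on every scan of the dags folder
> 
>     # The guarded version is harmless to import:
>     # if __name__ == '__main__':
>     #     rebuild_reports()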

Some rebasing needs to happen. When I looked at the 1.8 code base almost a year 
ago, it felt more complex than necessary. What airflow is trying to promise 
from an architectural standpoint — that was not clear to me. It is trying to do 
too many things, scattered in too many places, is the feeling I got. As a 
result, I stopped peeking inside, and just trust that it works — which it does, 
btw. I tend to think that airflow outgrew its original intents. A sort of 
micro-services architecture has to be brought in. I may sound critical, but no 
offense. I truly appreciate the contributions.

> 
> Max
> 
> On Sat, Nov 24, 2018 at 5:04 PM Beau Barker wrote:
> 
>> In my opinion this searching for dags is not ideal.
>> 
>> We should be explicitly specifying the dags to load somewhere.
>> 
>> 
>>> On 25 Nov 2018, at 10:41 am, Kevin Yang  wrote:
>>> 
>>> I believe that is mostly because we want to skip parsing/loading .py
>>> files that don't contain DAG defs, to save time: the scheduler is going
>>> to parse/load the .py files over and over again, and some files can take
>>> quite long to load.
>>> 
>>> Cheers,
>>> Kevin Y
>>> 
>>> On Fri, Nov 23, 2018 at 12:44 AM soma dhavala wrote:
>>> 
>>>> happy to report that the “fix” worked. thanks Alex.
>>>> 
>>>> btw, wondering why it was there in the first place? how does it help —
>>>> saving time, early termination — what?
>>>> 
>>>> 
>>>>> On Nov 23, 2018, at 8:18 AM, Alex Guziel wrote:
>>>>> 
>>>>> Yup.
>>>>> 
>>>>> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala wrote:
>>>>> 
>>>>> 
>>>>>> On Nov 23, 2018, at 3:28 AM, Alex Guziel wrote:
>>>>>> 
>>>>>> It’s because of this
>>>>>> 
>>>>>> “When searching for DAGs, Airflow will only consider files where the
>>>>>> string “airflow” and “DAG” both appear in the contents of the .py file.”
>>>>>> 
>>>>> 
>>>>> Have not noticed it. From airflow/models.py, in process_file — (both
>>>>> in 1.9 and 1.10)
>>>>> ..
>>>>> if not all([s in content for s in (b'DAG', b'airflow')]):
>>>>> ..
>>>>> is looking for those strings; if they are not found, it returns
>>>>> without loading the DAGs.
>>>>> 
>>>>> 
>>>>> So having “airflow” and “DAG” dummy strings placed somewhere will make
>>>>> it work?
>>>>> 
>>>>> 
>>>>>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel wrote:
>>>>>>> 
>>>>>>> I think this is what is going on. The dags are picked up via local
>>>>>>> variables, i.e. if you do
>>>>>>> dag = Dag(...)
>>>>>>> dag = Dag(…)
>>>>>> 
>>>>>> from my_module import create_dag
>>>>>> 
>>>>>> for file in yaml_files:
>>>>>>dag = create_dag(file)
>>>>>>globals()[dag.dag_id] = dag
>>>>>> 
>>>>>> Notice that create_dag is in a different module. If it were in the
>>>>>> same scope (file), it would be fine.
>>>>>> 
>>>>>>> 

Re: Remove airflow from pypi

2018-11-24 Thread soma dhavala



> On Nov 25, 2018, at 5:02 AM, George Leslie-Waksman  wrote:
> 
> It's probably a good idea to put something at "airflow", even if it
> just fails to install and tells people to install apache-airflow
> instead.
> 
> If not, there's a risk someone squats the name airflow and puts up
> something malicious.
> 

+ 1
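
Even a trivial placeholder would do; a minimal sketch (hypothetical, not an
actual published package):

    # setup.py for a placeholder "airflow" distribution that refuses to
    # install and points people at the right package name
    from setuptools import setup

    raise RuntimeError(
        "The 'airflow' name is reserved; run `pip install apache-airflow` instead."
    )

    setup(name="airflow")  # never reached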

> --George
> On Fri, Nov 23, 2018 at 11:44 AM Driesprong, Fokko wrote:
>> 
>> Thanks Dan for picking this up quickly.
>> 
>> On Fri, 23 Nov 2018 at 18:31, Kaxil Naik wrote:
>> 
>>> Thanks Dan
>>> 
>>> On Fri, Nov 23, 2018 at 3:44 PM Dan Davydov wrote:
>>> 
>>>> This could potentially break builds for some users, but I feel the pros
>>>> mentioned outweigh this, so I went ahead and deleted it.
>>>> 
>>>> On Fri, Nov 23, 2018 at 10:18 AM Bolke de Bruin wrote:
>>>> 
>>>>> Agree! This is even a security issue.
>>>>> 
>>>>> Sent from my iPhone
>>>>> 
>>>>>> On 23 Nov 2018, at 15:29, Driesprong, Fokko wrote:
>>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> I think we should remove airflow (not apache-airflow) from PyPI. I
>>>>>> still get questions from people who accidentally install Airflow
>>>>>> 1.8.0. I see this is maintained by mistercrunch, artwr, aeon. Anyone
>>>>>> any objections?
>>>>>> 
>>>>>> Cheers, Fokko
>>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> *Kaxil Naik*
>>> *Big Data Consultant *@ *Data Reply UK*
>>> *Certified *Google Cloud Data Engineer | *Certified* Apache Spark & Neo4j
>>> Developer
>>> *Phone: *+44 (0) 74820 88992
>>> *LinkedIn*: https://www.linkedin.com/in/kaxil
>>> 



Re: programmatically creating and airflow quirks

2018-11-23 Thread soma dhavala
happy to report that the “fix” worked. thanks Alex.

btw, wondering why it was there in the first place? how does it help — saving 
time, early termination — what?
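
For the record, a minimal sketch of the kind of loader that now passes the
check (the file name and spec location are hypothetical):

    # dags/yaml_dag_loader.py
    # The literal strings "airflow" and "DAG" in this comment are enough
    # to satisfy the content check in process_file().
    import glob

    from my_module import create_dag  # factory returning a DAG object

    for path in glob.glob("/path/to/specs/*.yaml"):
        dag = create_dag(path)
        globals()[dag.dag_id] = dag  # module-level binding, so the DagBag finds it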


> On Nov 23, 2018, at 8:18 AM, Alex Guziel  wrote:
> 
> Yup. 
> 
> On Thu, Nov 22, 2018 at 3:16 PM soma dhavala wrote:
> 
> 
>> On Nov 23, 2018, at 3:28 AM, Alex Guziel wrote:
>> 
>> It’s because of this 
>> 
>> “When searching for DAGs, Airflow will only consider files where the string 
>> “airflow” and “DAG” both appear in the contents of the .py file.”
>> 
> 
> Have not noticed it. From airflow/models.py, in process_file — (both in 1.9 
> and 1.10)
> ..
> if not all([s in content for s in (b'DAG', b'airflow')]):
> ..
> is looking for those strings; if they are not found, it returns without 
> loading the DAGs.
> 
> 
> So having “airflow” and “DAG” dummy strings placed somewhere will make it 
> work?
> 
> 
>> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala wrote:
>> 
>> 
>>> On Nov 22, 2018, at 3:37 PM, Alex Guziel wrote:
>>> 
>>> I think this is what is going on. The dags are picked up via local 
>>> variables, i.e. if you do
>>> dag = Dag(...)
>>> dag = Dag(…)
>> 
>> from my_module import create_dag
>> 
>> for file in yaml_files:
>>  dag = create_dag(file)
>>  globals()[dag.dag_id] = dag
>> 
>> Notice that create_dag is in a different module. If it were in the same 
>> scope (file), it would be fine.
>> 
>>> 
>> 
>>> Only the second dag will be picked up.
>>> 
>>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala wrote:
>>> Hey AirFlow Devs:
>>> In our organization, we build a Machine Learning WorkBench with AirFlow as
>>> an orchestrator of ML workflows, and have wrapped AirFlow Python
>>> operators to customize the behaviour. These workflows are specified in
>>> YAML.
>>> 
>>> We drop a DAG loader (written in Python) in the default location airflow
>>> expects the DAG files. This DAG loader reads the specified YAML files and
>>> converts them into airflow DAG objects. Essentially, we are
>>> programmatically creating the DAG objects. In order to support multiple
>>> parsers (yaml, json etc.), we separated the DAG creation from loading. But
>>> when a DAG is created (in a separate module) and made available to the DAG
>>> loaders, airflow does not pick it up. As an example, consider that I
>>> created a DAG, pickled it, and will simply unpickle the DAG and give it to
>>> airflow.
>>> 
>>> However, in the current avatar of airflow, the very creation of the DAG
>>> has to happen in the loader itself. As far as I am concerned, airflow
>>> should not care where and how the DAG object is created, so long as it is
>>> a valid DAG object. The workaround for us is to mix parser and loader in
>>> the same file and drop it in the airflow default dags folder. During
>>> dag_bag creation, this file is loaded up with the import_modules utility
>>> and shows up in the UI. While this is a solution, it is not clean.
>>> 
>>> What do DEVs think about a solution to this problem? Will saving the DAG
>>> to the db and reading it from the db work? Or do some core changes need to
>>> happen in the dag_bag creation? Can dag_bag take a bunch of "created" DAGs?
>>> 
>>> thanks,
>>> -soma
>> 
> 



Re: programmatically creating and airflow quirks

2018-11-22 Thread soma dhavala


> On Nov 23, 2018, at 3:28 AM, Alex Guziel  wrote:
> 
> It’s because of this 
> 
> “When searching for DAGs, Airflow will only consider files where the string 
> “airflow” and “DAG” both appear in the contents of the .py file.”
> 

Have not noticed it. From airflow/models.py, in process_file — (both in 1.9 
and 1.10)
..
if not all([s in content for s in (b'DAG', b'airflow')]):
..
is looking for those strings; if they are not found, it returns without 
loading the DAGs.


So having “airflow” and “DAG” dummy strings placed somewhere will make it work?


> On Thu, Nov 22, 2018 at 2:27 AM soma dhavala wrote:
> 
> 
>> On Nov 22, 2018, at 3:37 PM, Alex Guziel wrote:
>> 
>> I think this is what is going on. The dags are picked up via local 
>> variables, i.e. if you do
>> dag = Dag(...)
>> dag = Dag(…)
> 
> from my_module import create_dag
> 
> for file in yaml_files:
>   dag = create_dag(file)
>   globals()[dag.dag_id] = dag
> 
> Notice that create_dag is in a different module. If it were in the same 
> scope (file), it would be fine.
> 
>> 
> 
>> Only the second dag will be picked up.
>> 
>> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala wrote:
>> Hey AirFlow Devs:
>> In our organization, we build a Machine Learning WorkBench with AirFlow as
>> an orchestrator of ML workflows, and have wrapped AirFlow Python
>> operators to customize the behaviour. These workflows are specified in
>> YAML.
>> 
>> We drop a DAG loader (written in Python) in the default location airflow
>> expects the DAG files. This DAG loader reads the specified YAML files and
>> converts them into airflow DAG objects. Essentially, we are
>> programmatically creating the DAG objects. In order to support multiple
>> parsers (yaml, json etc.), we separated the DAG creation from loading. But
>> when a DAG is created (in a separate module) and made available to the DAG
>> loaders, airflow does not pick it up. As an example, consider that I
>> created a DAG, pickled it, and will simply unpickle the DAG and give it to
>> airflow.
>> 
>> However, in the current avatar of airflow, the very creation of the DAG
>> has to happen in the loader itself. As far as I am concerned, airflow
>> should not care where and how the DAG object is created, so long as it is
>> a valid DAG object. The workaround for us is to mix parser and loader in
>> the same file and drop it in the airflow default dags folder. During
>> dag_bag creation, this file is loaded up with the import_modules utility
>> and shows up in the UI. While this is a solution, it is not clean.
>> 
>> What do DEVs think about a solution to this problem? Will saving the DAG
>> to the db and reading it from the db work? Or do some core changes need to
>> happen in the dag_bag creation? Can dag_bag take a bunch of "created" DAGs?
>> 
>> thanks,
>> -soma
> 



Re: programmatically creating and airflow quirks

2018-11-22 Thread soma dhavala


> On Nov 22, 2018, at 3:37 PM, Alex Guziel  wrote:
> 
> I think this is what is going on. The dags are picked up via local 
> variables, i.e. if you do
> dag = Dag(...)
> dag = Dag(…)

from my_module import create_dag

for file in yaml_files:
    dag = create_dag(file)
    # bind under a unique name, so each DAG survives the loop
    globals()[dag.dag_id] = dag

Notice that create_dag is in a different module. If it were in the same scope 
(file), it would be fine.
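
For concreteness, a minimal sketch of what such a factory could look like
(my_module.py; the YAML schema here is made up):

    # my_module.py, a hypothetical YAML-to-DAG factory
    from datetime import datetime

    import yaml

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    def create_dag(path):
        with open(path) as f:
            spec = yaml.safe_load(f)
        dag = DAG(
            dag_id=spec["dag_id"],
            start_date=datetime(2018, 1, 1),
            schedule_interval=spec.get("schedule", "@daily"),
        )
        # one no-op task per entry; real operators would go here
        for task_id in spec.get("tasks", []):
            DummyOperator(task_id=task_id, dag=dag)
        return dag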

> 

> Only the second dag will be picked up.
> 
> On Thu, Nov 22, 2018 at 2:04 AM Soma S Dhavala wrote:
> Hey AirFlow Devs:
> In our organization, we build a Machine Learning WorkBench with AirFlow as
> an orchestrator of ML workflows, and have wrapped AirFlow Python
> operators to customize the behaviour. These workflows are specified in
> YAML.
> 
> We drop a DAG loader (written in Python) in the default location airflow
> expects the DAG files. This DAG loader reads the specified YAML files and
> converts them into airflow DAG objects. Essentially, we are
> programmatically creating the DAG objects. In order to support multiple
> parsers (yaml, json etc.), we separated the DAG creation from loading. But
> when a DAG is created (in a separate module) and made available to the DAG
> loaders, airflow does not pick it up. As an example, consider that I
> created a DAG, pickled it, and will simply unpickle the DAG and give it to
> airflow.
> 
> However, in the current avatar of airflow, the very creation of the DAG
> has to happen in the loader itself. As far as I am concerned, airflow
> should not care where and how the DAG object is created, so long as it is
> a valid DAG object. The workaround for us is to mix parser and loader in
> the same file and drop it in the airflow default dags folder. During
> dag_bag creation, this file is loaded up with the import_modules utility
> and shows up in the UI. While this is a solution, it is not clean.
> 
> What do DEVs think about a solution to this problem? Will saving the DAG
> to the db and reading it from the db work? Or do some core changes need to
> happen in the dag_bag creation? Can dag_bag take a bunch of "created" DAGs?
> 
> thanks,
> -soma