Yup, the triggerer HA is called out in the AIP, along with space left to
make it sharded in future (I'll design this into its API):
"The triggerer is designed to be run in a highly-available (HA) manner;
it should coexist with other instances of itself, running the same
triggers, and deduplicating events when they fire off. In future, it
should also be designed to run in a sharded/partitioned architecture,
where the set of triggers is divided across replica sets of triggerers."

I think a trigger_run_timeout is a very sensible thing to ship and I
will add it to the AIP; I will propose we set a default of 2-3 seconds
so we catch the most egregious cases (ideally it would be sub-500ms if
you wanted to catch everything).

Andrew

On Thu, Apr 29, 2021 at 7:24 AM Kaxil Naik <kaxiln...@gmail.com> wrote:

> Similar to trigger_timeout (which is now in the AIP), we could have a
> trigger_run_timeout to take care of that case.
>
> Regards,
> Kaxil
>
> On Thu, Apr 29, 2021 at 1:51 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
>> On point one: the triggerer is HA, so you can run multiple
>> instances. The concern about blocking is valid, but a blocking
>> trigger wouldn't block _all_ triggers.
>>
>> Or I thought that was the plan - the AIP doesn't mention it, but
>> Andrew said so _somewhere_.
>>
>> We should explicitly mention that this is the case.
>>
>> -ash
>>
>> On 29 April 2021 09:45:47 BST, Jarek Potiuk <ja...@potiuk.com> wrote:
>>>
>>> Hey, anyone from the BIG users - any comments here? Any thoughts
>>> about the operational side of this? I am not sure if my worries are
>>> too "excessive", so I would love to hear your thoughts. I think it
>>> would be great to unblock Andrew so that he can carry on with the
>>> AIP (which I think is a great feature, BTW).
>>>
>>> I think this boils down to two questions:
>>>
>>> 1) Would you be worried by having a single thread running all such
>>> triggers for the whole installation, where users could potentially
>>> develop their own "blocking" triggers by mistake and block the
>>> others?
>>> 2) What kind of monitoring/detection/prevention would you like to
>>> see to get it "under control"? :)
>>>
>>> J,
>>>
>>> On Tue, Apr 27, 2021 at 9:04 PM Andrew Godwin
>>> <andrew.god...@astronomer.io.invalid> wrote:
>>>
>>>> Oh totally - I was partially wearing my SRE hat when writing up
>>>> parts of this (hence why HA is built in from the start), but I'm
>>>> more used to running web workloads than Airflow, so any input from
>>>> people who've run large Airflow installs is welcome.
>>>>
>>>>> That will not work, I am afraid :). If we open it up for users to
>>>>> use, we have to deal with the consequences. We have to be
>>>>> prepared for people doing all kinds of weird things - at least
>>>>> this is what I've learned during the last 2 years of working on
>>>>> Airflow.
>>>>
>>>> It's maybe one of the key lessons I've had from 15 years of
>>>> writing open source - people will always use your hidden APIs, no
>>>> matter what!
>>>>
>>>> I think in this case it's a situation where we just have to make
>>>> it progressively safer as you go up the stack: if you're just
>>>> using Operators, you are fine; if you are authoring Operators,
>>>> there are some new things to think about if you want to go
>>>> deferrable; and if you're authoring Triggers, then you need a bit
>>>> of async experience. Plus, people who are just writing standard
>>>> non-deferrable operators have nothing to worry about, as this is
>>>> purely an additive feature, which I like; letting people opt in to
>>>> complexity is always nice.
>>>>
>>>> Hopefully we can get enough safety guards in that all three of
>>>> these levels will be reasonably accessible without encouraging
>>>> people to go straight to Triggers; it would likely be an
>>>> improvement over Smart Sensors, which as far as I can tell lack a
>>>> lot of them.
>>>>
>>>> Andrew
>>>>
>>>> On Tue, Apr 27, 2021 at 12:29 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>>
>>>>> Hey Andrew,
>>>>>
>>>>> Just don't get me wrong :).
>>>>> I love the idea/AIP proposal - I just want to make sure that from
>>>>> day one we think about the operational aspects of it. I think the
>>>>> easier we make it for the operations people, the less we will
>>>>> have to deal with their problems on the devlist/in GitHub issues.
>>>>>
>>>>> I would love to hear what others think about it - maybe some
>>>>> folks from the devlist who operate Airflow at scale could chime
>>>>> in here; we have some people from Airbnb/Twitter etc. Maybe you
>>>>> could state your expectations for the operational side if you had
>>>>> 1000s of Triggers that could potentially interfere with each
>>>>> other :)?
>>>>>
>>>>> On Mon, Apr 26, 2021 at 9:21 PM Andrew Godwin
>>>>> <andrew.god...@astronomer.io.invalid> wrote:
>>>>>
>>>>>> 1) In general, I'm envisioning Triggers as a thing that is
>>>>>> generally abstracted away from DAG authors - instead, they would
>>>>>> come in a provider package (or core Airflow), and so we would be
>>>>>> expecting the same higher level of quality and testing.
>>>>>
>>>>> I see the intention, but we would have to have a pretty
>>>>> comprehensive set of triggers from day one - similar to the
>>>>> different types of "rich intervals" we have in
>>>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-39+Richer+scheduler_interval.
>>>>> Maybe a list (groups) of the triggers we would like to have as
>>>>> "built-ins" would be really helpful?
>>>>>
>>>>> But even then, I think the "extendability" of this solution is
>>>>> what I like about it. More often than not, users will find out
>>>>> that they need something custom. I think if we describe the
>>>>> interface that a Trigger should implement and make it a
>>>>> "first-class" citizen - similarly to all other concepts in
>>>>> Airflow - then it is expected that users will override them;
>>>>> Operators, Sensors, even the new "Richer schedules" are "meant"
>>>>> to be implemented by the users.
>>>>> If they are not, we should not make them public, but rather have
>>>>> (and accept) only a fixed set of those - and for that we could
>>>>> just implement an Enum of available Triggers. By providing an
>>>>> interface, we invite our users to implement their own custom
>>>>> Triggers, and I think we need to deal with the consequences. I
>>>>> think we have to think about what happens when users start
>>>>> writing their own triggers, and what "toolbox" we give the people
>>>>> operating Airflow to deal with that.
>>>>>
>>>>>> a) I do think it should all be fixed to asyncio, mostly because
>>>>>> the library support is there, and you don't suddenly want to
>>>>>> make people have to run multiple "flavours" of triggerer process
>>>>>> based on how many triggers they're using and what runtime loops
>>>>>> they demand. If trio were more popular, I'd pick that, as it's
>>>>>> safer and (in my opinion) better designed, but we are
>>>>>> unfortunately nowhere near that; in addition, this is not
>>>>>> something that is impossible to add at a later stage.
>>>>>
>>>>> Asyncio would also be my choice by far, so we do not differ here :)
>>>>>
>>>>> From what I understand, you propose a single event loop to run
>>>>> all the deferred tasks? Am I right? My point is that multiple
>>>>> event loops, or thread-based async running, are also part of
>>>>> asyncio - no problem with that. Asyncio by default uses a single
>>>>> event loop, but there is no problem using more. This would be
>>>>> quite similar to the "queues" we currently have with Celery
>>>>> workers. I am not saying we should, but I can see a case where
>>>>> this might be advisable (for example, to isolate groups of
>>>>> deferred tasks so that they do not interfere with/delay the other
>>>>> group). What do you think?
>>>>>
>>>>>> b) There's definitely some hooks we can use to detect
>>>>>> long-running triggers, and I'll see if I can grab some of them
>>>>>> and implement them.
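The multiple-event-loop ("queues"-style) partitioning Jarek floats above could look roughly like this - a minimal sketch, not AIP code; the trigger names, delays, and partition layout are invented for illustration. Each partition runs in its own thread with its own event loop, so a slow or blocking group cannot delay the others:

```python
import asyncio
import threading

async def fake_trigger(name: str, delay: float) -> str:
    """Stand-in for a deferred-task trigger: waits, then 'fires'."""
    await asyncio.sleep(delay)
    return f"{name} fired"

def run_partition(triggers, results: list) -> None:
    """Run one partition's triggers to completion on its own event loop."""
    async def gather_all():
        return await asyncio.gather(*(fake_trigger(n, d) for n, d in triggers))
    # asyncio.run creates a fresh event loop private to this thread, so
    # partitions are isolated from each other, much like Celery queues.
    results.extend(asyncio.run(gather_all()))

results_a: list = []
results_b: list = []
t1 = threading.Thread(target=run_partition,
                      args=([("date_1", 0.01), ("date_2", 0.02)], results_a))
t2 = threading.Thread(target=run_partition,
                      args=([("file_1", 0.01)], results_b))
t1.start(); t2.start(); t1.join(); t2.join()
print(results_a, results_b)
```

Whether the extra operational complexity is worth it is exactly the trade-off under discussion; a single shared loop remains the simpler default.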
>>>>>> At the very least, there's the SIGALRM watchdog method, and I
>>>>>> believe it's possible to put nice timeouts around asyncio tasks,
>>>>>> which we could use to enforce a max runtime specified in the
>>>>>> airflow.cfg file.
>>>>>
>>>>> I agree it will indeed be difficult to prevent people from making
>>>>> mistakes, but we should really think about how we can help the
>>>>> operations people detect and diagnose them. And I think it should
>>>>> be part of the specification:
>>>>>
>>>>> 1) What kind of metrics we log for the triggers - I think we
>>>>> should gather and publish (using Airflow's metrics system) some
>>>>> useful metrics for the operations people, for all triggers, from
>>>>> day one (number of executions, execution length, queuing/delay
>>>>> time vs. expectation for a trigger such as a date trigger, etc.).
>>>>> 2) You mentioned it - maybe we should have Debug Mode turned on
>>>>> by default:
>>>>> https://docs.python.org/3/library/asyncio-dev.html#debug-mode.
>>>>> It has some useful features, mainly automated logging of
>>>>> too-long-running async methods. Not sure what consequences it
>>>>> has, though.
>>>>> 3) A max execution time would be even nicer, indeed.
>>>>> 4) Maybe there should be some exposure in the UI/CLI/API of
>>>>> what's going on with triggers?
>>>>> 5) Maybe there should be a way to cancel/disable triggers that
>>>>> are misbehaving - using the UI/CLI/API - until the code gets
>>>>> fixed for those?
>>>>>
>>>>> I think we simply need to document these aspects in an
>>>>> "operations" chapter of your proposal. I am happy to propose some
>>>>> draft changes to the AIP if you would like, after we discuss it
>>>>> here.
>>>>>
>>>>>> 2) This is why there's no actual _state_ that is persisted -
>>>>>> instead, you pass the method you want to call next and its
>>>>>> keyword arguments. Obviously we'll need to be quite clear about
>>>>>> this in the docs, but I feel this is better than persisting
>>>>>> _some_ state.
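The "timeouts around asyncio tasks" idea from point b) can be sketched with asyncio.wait_for. The config value below is hypothetical, and note the caveat: this only catches triggers that cooperatively await (yield to the loop); a trigger that blocks outright, e.g. with time.sleep(), never returns control, which is why the SIGALRM-style watchdog is also on the table:

```python
import asyncio

# Hypothetical max-runtime setting; in a real deployment this would come
# from airflow.cfg rather than a module constant.
TRIGGER_RUN_TIMEOUT = 0.1

async def slow_but_cooperative_trigger() -> None:
    # This trigger awaits, so the event loop stays responsive and is
    # able to cancel it when the timeout expires.
    await asyncio.sleep(10)

async def main() -> str:
    try:
        await asyncio.wait_for(slow_but_cooperative_trigger(),
                               timeout=TRIGGER_RUN_TIMEOUT)
        return "completed"
    except asyncio.TimeoutError:
        # wait_for cancels the inner task, then raises TimeoutError here
        return "cancelled: trigger exceeded max runtime"

print(asyncio.run(main()))
```

Asyncio's debug mode (item 2 above) attacks the other half of the problem: it logs callbacks that *hold* the loop longer than `loop.slow_callback_duration`, which wait_for cannot see.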
>>>>>> Again, though, this is an implementation detail that would
>>>>>> likely be hidden inside an Operator or Sensor from the average
>>>>>> DAG user; I'm not proposing we expose this to things like
>>>>>> PythonOperator or TaskFlow yet, for the reasons you describe. I
>>>>>> wish I had a better API to present for Operator authors, but
>>>>>> given the fundamental fact that the Operator/Task is going to
>>>>>> run on different machines for different phases of its life, I
>>>>>> think having explicit "give me this when you revive me"
>>>>>> arguments is the best tradeoff we can go for.
>>>>>
>>>>> Agree. We cannot do much about it. I think we should just be very
>>>>> clear when we document the life cycle. Maybe we should even
>>>>> update the semantics of pre/post so that pre_execute() and
>>>>> post_execute() are executed for EVERY execute - including the
>>>>> deferred one (and also execute_complete()). This way (in your
>>>>> example) the task execution would look like: pre_execute(),
>>>>> execute() (-> throws TaskDeferred), post_execute(), and then on
>>>>> another worker pre_execute(), execute_complete(), post_execute().
>>>>> I think that would make sense. What do you think?
>>>>>
>>>>>> 3) I do think we should limit the size of the payload, as well
>>>>>> as the kwargs that pass between deferred phases of the task
>>>>>> instance - something pretty meaty, like 500KB, would seem
>>>>>> reasonable to me. I've also run into the problem in the past
>>>>>> that if you design a messaging system without a limit, people
>>>>>> _will_ push the size of the things sent up to
>>>>>> eyebrow-raisingly-large sizes.
>>>>>
>>>>> Yeah. I think XCom should be our benchmark. Currently we have
>>>>> MAX_XCOM_SIZE = 49344. Should we use the same?
>>>>>
>>>>> And that leads me to another question - actually a very important
>>>>> one. I think (correct me, please, if I am wrong) it is missing
>>>>> from the current specs: where are the kwargs going to be stored
>>>>> while the task is deferred?
>>>>> Are they only stored in the memory of the triggerer, or in the
>>>>> DB? And what are the consequences? What happens when the tasks
>>>>> are triggered and the triggerer is restarted? How do we recover?
>>>>> Does it mean that all the tasks that are deferred will have to be
>>>>> restarted? How? Could you please elaborate a bit on that (I think
>>>>> it also needs a chapter in the specification).
>>>>>
>>>>>> Overall, I think it's important to stress that the average DAG
>>>>>> author should not even know that Triggers really exist; instead,
>>>>>> they should just be able to switch Sensors or Operators (e.g.
>>>>>> DateTimeSensor -> AsyncDateTimeSensor in my prototype) and get
>>>>>> the benefits of deferred operators with no extra thought
>>>>>> required.
>>>>>
>>>>> That will not work, I am afraid :). If we open it up for users to
>>>>> use, we have to deal with the consequences. We have to be
>>>>> prepared for people doing all kinds of weird things - at least
>>>>> this is what I've learned during the last 2 years of working on
>>>>> Airflow. See the comment above. If we do not want users to
>>>>> implement custom versions of triggers, we should use Enums and a
>>>>> closed set of those. If we create an API/an interface, people
>>>>> will use it and create their own, whether we want them to or not.
>>>>>
>>>>> J.
>>>
>>> --
>>> +48 660 796 129
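To make the life cycle discussed in this thread concrete, here is a toy sketch of the defer-and-resume round trip. The TaskDeferred fields, the sensor class, and the driver function are assumptions for illustration, not the final AIP API; the key point is that only the named resume method and its serialized kwargs survive the gap between workers:

```python
class TaskDeferred(Exception):
    """Raised by execute() to hand the wait off to a trigger.

    Carries everything the scheduler needs to revive the task later:
    which trigger to run, which method to call on revival, and the
    kwargs to call it with. (Illustrative shape, not the final API.)
    """
    def __init__(self, trigger: str, method_name: str, kwargs: dict):
        super().__init__(method_name)
        self.trigger = trigger
        self.method_name = method_name
        self.kwargs = kwargs

class DateTimeSensorSketch:
    def execute(self, context: dict):
        # Phase one: instead of holding a worker slot while waiting,
        # defer - naming the resume method and its keyword arguments.
        raise TaskDeferred(trigger="DateTimeTrigger(moment=...)",
                           method_name="execute_complete",
                           kwargs={"moment": "2021-05-01T00:00:00"})

    def execute_complete(self, context: dict, moment: str) -> str:
        # Phase two, possibly on a different worker and always on a
        # fresh instance: nothing from phase one's memory is available
        # except the kwargs that were persisted with the deferral.
        return f"resumed at {moment}"

def run_task() -> str:
    """Toy driver standing in for the scheduler/triggerer machinery."""
    op = DateTimeSensorSketch()
    try:
        op.execute({})
    except TaskDeferred as deferral:
        # The real system would persist method_name + kwargs, wait for
        # the trigger to fire, then revive the task roughly like this:
        fresh = DateTimeSensorSketch()
        return getattr(fresh, deferral.method_name)({}, **deferral.kwargs)

print(run_task())
```

This also shows why Jarek's persistence question matters: if `deferral.kwargs` lived only in the triggerer's memory, a restart would lose the information needed to revive the task.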