Similar to trigger_timeout (which is now in the AIP), we could have trigger_run_timeout to take care of that case.

Regards,
Kaxil
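To make the suggestion concrete, here is a minimal sketch (not code from the AIP) of how such a trigger_run_timeout could be enforced in the triggerer, assuming the Trigger.run() async-generator shape the AIP proposes; run_with_timeout and its error policy are hypothetical:

    import asyncio

    async def run_with_timeout(trigger, trigger_run_timeout: float):
        """Wrap each step of a trigger's async generator in a timeout."""
        gen = trigger.run()
        try:
            while True:
                # wait_for cancels the pending step if it takes too long,
                # so one slow trigger cannot hold its slot indefinitely
                event = await asyncio.wait_for(gen.__anext__(), trigger_run_timeout)
                yield event
        except StopAsyncIteration:
            pass  # the trigger finished normally
        except asyncio.TimeoutError:
            raise RuntimeError("Trigger exceeded trigger_run_timeout")

Note this only catches triggers that are slow across await points; code that blocks the event loop without ever awaiting cannot be cancelled this way, which is why the SIGALRM watchdog discussed later in the thread is also relevant.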
On Thu, Apr 29, 2021 at 1:51 PM Ash Berlin-Taylor <a...@apache.org> wrote:

> On point one: The triggerer is HA, so you can run multiple, so the concern about blocking is valid, but it wouldn't block _all_ triggers.
>
> Or I thought that was the plan, but the AIP doesn't mention that, but Andrew said that _somewhere_.
>
> We should explicitly mention this is the case.
>
> -ash
>
> On 29 April 2021 09:45:47 BST, Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> Hey, anyone from the BIG users, any comments here? Any thoughts about the operational side of this? I am not sure if my worries are too "excessive", so I would love to hear your thoughts here. I think it would be great to unblock Andrew so that he can carry on with the AIP (which I think is a great feature BTW).
>>
>> I think this boils down to two questions:
>>
>> 1) Would you be worried by having a single thread running all such triggers for the whole installation, where users could potentially develop their own "blocking" triggers by mistake and block others?
>> 2) What kind of monitoring/detection/prevention would you like to see to get it "under control"? :)
>>
>> J,
>>
>> On Tue, Apr 27, 2021 at 9:04 PM Andrew Godwin <andrew.god...@astronomer.io.invalid> wrote:
>>
>>> Oh totally, I was partially wearing my SRE hat when writing up parts of this (hence why HA is built-in from the start), but I'm more used to running web workloads than Airflow, so any input from people who've run large Airflow installs is welcome.
>>>
>>>> That will not work, I am afraid :). If we open it up for users to use, we have to deal with the consequences. We have to be prepared for people doing all kinds of weird things - at least this is what I've learned during the last 2 years of working on Airflow.
>>>
>>> It's maybe one of the key lessons I've had from 15 years writing open source - people will always use your hidden APIs no matter what!
>>>
>>> I think in this case, it's a situation where we just have to make it progressively safer as you go up the stack - if you're just using Operators you are fine, if you are authoring Operators there are some new things to think about if you want to go deferrable, and if you're authoring Triggers then you need a bit of async experience. Plus, people who are just writing standard non-deferrable operators have nothing to worry about, as this is purely an additive feature, which I like; letting people opt in to complexity is always nice.
>>>
>>> Hopefully we can get enough safety guards in that all three of these levels will be reasonably accessible without encouraging people to go straight to Triggers; it would likely be an improvement over Smart Sensors, which as far as I can tell lack a lot of them.
>>>
>>> Andrew
>>>
>>> On Tue, Apr 27, 2021 at 12:29 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>>>
>>>> Hey Andrew,
>>>>
>>>> Just don't get me wrong :). I love the idea/AIP proposal. I just want to make sure that from day one we think about the operational aspects of it. I think the easier we make it for the operations people, the less we will have to deal with their problems on the devlist and in GitHub issues.
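To make question 1 above concrete, a standalone illustration (not Airflow code) of how a single trigger that blocks synchronously by mistake starves every other trigger sharing the event loop:

    import asyncio
    import time

    async def well_behaved():
        for _ in range(5):
            await asyncio.sleep(1)  # yields control back to the event loop
            print("well-behaved trigger tick")

    async def blocking_by_mistake():
        time.sleep(5)  # never awaits: the WHOLE loop freezes here
        print("blocking trigger done")

    async def main():
        await asyncio.gather(well_behaved(), blocking_by_mistake())

    asyncio.run(main())

No "well-behaved trigger tick" is printed until the blocking call returns, even though both coroutines were scheduled concurrently.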
>>>> I would love to hear what others think about it - maybe some folks from the devlist who operate Airflow at scale could chime in here - we have some people from AirBnB/Twitter etc. Maybe you could state your expectations when it comes to the operational side if you had 1000s of Triggers that potentially interfere with each other :) ?
>>>>
>>>> On Mon, Apr 26, 2021 at 9:21 PM Andrew Godwin <andrew.god...@astronomer.io.invalid> wrote:
>>>>
>>>>> 1) In general, I'm envisioning Triggers as things that are generally abstracted away from DAG authors - instead, they would come in a provider package (or core Airflow), and so we would be expecting the same higher level of quality and testing.
>>>>
>>>> I see the intention, but we would have to have a pretty comprehensive set of triggers from day one - similar to the different types of "rich intervals" we have in https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-39+Richer+scheduler_interval. Maybe a list (groups) of the triggers we would like to have as "built-ins" would be really helpful?
>>>>
>>>> But even then, I think the "extensibility" of this solution is what I like about it. More often than not, users will find out that they need something custom. I think if we describe the interface that the Trigger should implement and make it a "first-class" citizen - similarly to all other concepts in Airflow - it is expected that users will override them: Operators, Sensors, even the new "richer schedules" are "meant" to be implemented by users. If they are not, we should not make the interface public but rather have (and accept) only a fixed set of those - and for that we could just implement an Enum of available Triggers. By providing an interface, we invite our users to implement their own custom Triggers, and I think we need to deal with the consequences. I think we have to think about what happens when users start writing their own triggers and what "toolbox" we give the people operating Airflow to deal with that.
>>>>
>>>>> a) I do think it should all be fixed to asyncio, mostly because the library support is there and you don't suddenly want to make people have to run multiple "flavours" of triggerer process based on how many triggers they're using and what runtime loops they demand. If trio were more popular, I'd pick that, as it's safer and (in my opinion) better-designed, but we are unfortunately nowhere near that, and in addition it would not be impossible to add at a later stage.
>>>>
>>>> Asyncio would also be my choice by far, so we do not differ here :)
>>>>
>>>> From what I understand, you propose a single event loop to run all the deferred tasks? Am I right? My point is that multiple event loops, or thread-based async running, are also part of asyncio. No problem with that. Asyncio by default uses a single event loop, but there is no problem using more. This would be quite similar to the "queues" we currently have with celery workers. I am not saying we should, but I can see cases where this might be advisable (for example, to isolate groups of the deferred tasks so that they do not interfere with or delay the other groups). What do you think?
>>>>
>>>>> b) There are definitely some hooks we can use to detect long-running triggers, and I'll see if I can grab some of them and implement them. At the very least, there's the SIGALRM watchdog method, and I believe it's possible to put nice timeouts around asyncio tasks, which we could use to enforce a max runtime specified in the airflow.cfg file.
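A rough sketch of the SIGALRM watchdog idea, for illustration only (the names and the one-second heartbeat are assumptions, not the AIP's design): the event loop re-arms an alarm on every heartbeat, so the alarm only fires if something has blocked the loop for longer than the threshold:

    import asyncio
    import signal

    BLOCK_THRESHOLD = 5  # seconds of event-loop starvation we tolerate

    def on_alarm(signum, frame):
        # reached only if the heartbeat below failed to re-arm in time,
        # i.e. a trigger blocked the loop for more than BLOCK_THRESHOLD
        print(f"Event loop blocked for >{BLOCK_THRESHOLD}s - a trigger is misbehaving")

    async def watchdog():
        signal.signal(signal.SIGALRM, on_alarm)
        while True:
            signal.alarm(BLOCK_THRESHOLD)  # (re)arm; fires unless reset in time
            await asyncio.sleep(1)         # heartbeat: only runs if the loop is healthy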
>>>> I agree it will indeed be difficult to prevent people from making mistakes, but we should really think about how we help the operations people detect and diagnose them. And I think it should be part of the specification:
>>>>
>>>> 1) What kind of metrics we log for the triggers - I think we should gather and publish (using Airflow's metrics system) some useful metrics for the operations people (number of executions, execution length, queuing/delay time vs. expectation for triggers like the date trigger, etc.) for all triggers from day one.
>>>> 2) You mentioned it - maybe we should have Debug mode turned on by default: https://docs.python.org/3/library/asyncio-dev.html#debug-mode. It has some useful features, mainly automated logging of async methods that run for too long. I am not sure what consequences it has, though.
>>>> 3) A max execution time would be even nicer indeed.
>>>> 4) Maybe there should be some exposure in the UI/CLI/API of what's going on with triggers?
>>>> 5) Maybe there should be a way to cancel/disable triggers that are misbehaving - using the UI/CLI/API - until the code gets fixed for those?
>>>>
>>>> I think we simply need to document those aspects in an "operations" chapter of your proposal. I am happy to propose some draft changes in the AIP if you would like, after we discuss it here.
>>>>
>>>>> 2) This is why there's no actual _state_ that is persisted - instead, you pass the method you want to call next and its keyword arguments. Obviously we'll need to be quite clear about this in the docs, but I feel this is better than persisting _some_ state. Again, though, this is an implementation detail that would likely be hidden inside an Operator or Sensor from the average DAG user; I'm not proposing we expose this to things like PythonOperator or TaskFlow yet, for the reasons you describe. I wish I had a better API to present for Operator authors, but with the fundamental fact that the Operator/Task is going to run on different machines for different phases of its life, I think having explicit "give me this when you revive me" arguments is the best tradeoff we can go for.
>>>>
>>>> Agree. We cannot do much about it. I think we should just be very clear when we document the life cycle. Maybe we should even update the semantics of pre/post so that "pre_execute()" and "post_execute()" are executed for EVERY execute - including the deferred one (also execute_complete()). This way (in your example) the task execution would look like: pre_execute(), execute() (-> throws TaskDeferred), post_execute(), and then on another worker pre_execute(), execute_complete(), post_execute(). I think that would make sense. What do you think?
>>>>
>>>>> 3) I do think we should limit the size of the payload, as well as the kwargs that pass between deferred phases of the task instance - something pretty meaty, like 500KB, would seem reasonable to me. I've also run into the problem in the past that if you design a messaging system without a limit, people _will_ push the size of the things sent up to eyebrow-raisingly large sizes.
>>>>
>>>> Yeah. I think XCom should be our benchmark. Currently we have MAX_XCOM_SIZE = 49344. Should we use the same?
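For illustration only: what a size guard on the serialized deferral kwargs might look like if the XCom limit were reused as the benchmark (the helper name and the JSON serialization are assumptions, not part of the AIP):

    import json

    MAX_TRIGGER_PAYLOAD_SIZE = 49344  # bytes, mirroring MAX_XCOM_SIZE

    def validate_deferral_kwargs(kwargs: dict) -> bytes:
        """Reject deferral payloads that exceed the configured limit."""
        blob = json.dumps(kwargs).encode("utf-8")
        if len(blob) > MAX_TRIGGER_PAYLOAD_SIZE:
            raise ValueError(
                f"Deferral kwargs are {len(blob)} bytes; "
                f"limit is {MAX_TRIGGER_PAYLOAD_SIZE} bytes"
            )
        return blob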
>>>> And that leads me to another question, actually a very important one. I think (correct me please if I am wrong) it is missing in the current specs. Where are the kwargs going to be stored while the task is deferred? Are they only stored in the memory of the triggerer? Or in the DB? And what are the consequences? What happens when the tasks are triggered and the triggerer is restarted? How do we recover? Does it mean that all the tasks that are deferred will have to be restarted? How? Could you please elaborate a bit on that (and I think it also needs a chapter in the specification).
>>>>
>>>>> Overall, I think it's important to stress that the average DAG author should not even know that Triggers really exist; instead, they should just be able to switch Sensors or Operators (e.g. DateTimeSensor -> AsyncDateTimeSensor in my prototype) and get the benefits of deferred operators with no extra thought required.
>>>>
>>>> That will not work, I am afraid :). If we open it up for users to use, we have to deal with the consequences. We have to be prepared for people doing all kinds of weird things - at least this is what I've learned during the last 2 years of working on Airflow. See the comment above. If we do not want users to implement custom versions of triggers, we should use Enums and a closed set of those. If we create an API, an interface - people will use it and create their own, no matter whether we want it or not.
>>>>
>>>> J.
>>
>> --
>> +48 660 796 129
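As a closing illustration of the "interface" point in the thread: a sketch of the kind of user-written Trigger this discussion anticipates, assuming the AIP's proposed shape (a serialize() method so the triggerer can re-create the trigger after a restart, plus an async run() that yields an event when the condition fires); the class name, module path, and event payload here are illustrative:

    import asyncio
    from datetime import datetime, timezone

    class MyDateTimeTrigger:  # would subclass the AIP's base Trigger class
        def __init__(self, moment: datetime):
            self.moment = moment

        def serialize(self):
            # classpath + kwargs: everything needed to rebuild the trigger
            return ("my_plugin.triggers.MyDateTimeTrigger", {"moment": self.moment})

        async def run(self):
            # poll without blocking the shared event loop, then emit once
            while datetime.now(timezone.utc) < self.moment:
                await asyncio.sleep(1)
            yield {"moment": self.moment}  # stand-in for the AIP's event type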