jscheffl opened a new pull request, #44536:
URL: https://github.com/apache/airflow/pull/44536

   In https://github.com/apache/airflow/pull/44311#discussion_r1862825630
@kaxil, @potiuk and I had a bit of discussion. As promised there, this PR
implements a way to make the retries in the Edge worker configurable.
   
   But it also opens a box of debates:
   
   - Do we want to add a new config? (Some people start screaming?)
       - (My position: Sensible default and make it configurable)
   - Should we really retry 10 times?
       - (My position: 10 attempts was the former default in the internal API.
From a small prod outage I can say this is at least good enough that tasks do
not fail during short webserver outages or connection interrupts. We see daily
flakiness on our WAN. As the zombie threshold is 300s by default, retrying for
more than 5min might not be needed. But we should also be fast on small
glitches... so the exponential back-off is good. I tested with the parameters
below and I think for waiting 5min max it is reasonable to try 10 times in
between before failing.)
   - Oh, why specific to Edge? (I saw multiple occurrences of retries in
different places in the repo. With the move to TaskSDK I think we should also
consider making this more common - at least Edge API retries from far away
should match the calls that are made to TaskSDK!)
       - (My position: I'd favor making such a setting common to the Edge API
and at least the TaskSDK calls)
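
   To make the "10 attempts within roughly 5min" argument concrete, here is a
minimal sketch of a capped exponential back-off. It is not the code in this PR:
the function names and the values `base=1.0`, `factor=2.0` and `max_wait=90.0`
are assumptions picked only so that the total wait lands near the 300s zombie
threshold mentioned above.

```python
from __future__ import annotations

import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(attempts: int, base: float, factor: float, max_wait: float) -> list[float]:
    """Waits between consecutive attempts (attempts - 1 gaps), each capped at max_wait."""
    return [min(base * factor**i, max_wait) for i in range(attempts - 1)]


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 10,      # hypothetical default, mirrors the "10 attempts" above
    base: float = 1.0,       # hypothetical: first wait in seconds
    factor: float = 2.0,     # hypothetical: growth per retry
    max_wait: float = 90.0,  # hypothetical: cap for a single wait
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on ConnectionError with capped exponential back-off."""
    delays = backoff_delays(attempts, base, factor, max_wait)
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionError:
            # Out of attempts: surface the last error to the caller.
            if attempt == attempts - 1:
                raise
            sleep(delays[attempt])
    raise ValueError("attempts must be at least 1")


if __name__ == "__main__":
    schedule = backoff_delays(10, 1.0, 2.0, 90.0)
    print(schedule)       # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 90.0, 90.0]
    print(sum(schedule))  # 307.0
```

   With these assumed values a short glitch is retried after 1s, 2s, 4s, ...
while a longer outage is given up on after about 307s of waiting in total,
i.e. just past the default zombie threshold.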
   
   @ashb I would also like to hear your opinion.
   
   And if needed - but I assume it can be settled within this PR - we could
also call for an open [DISCUSS] thread or loop in other stakeholders. Let me
know.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.