jscheffl opened a new pull request, #44536: URL: https://github.com/apache/airflow/pull/44536
In https://github.com/apache/airflow/pull/44311#discussion_r1862825630 @kaxil, @potiuk and me had a bit of discussion. As promised to come back with this, this PR implements (as promised) a way to make the retries in Edge worker configurable. But it is also opening the box of debates because: - Do we want to add a new config? (Some people start screaming?) - (My position: Sensible default and make it configurable) - Should we really retry 10 times? - (My position: 10 attempts was the former default in internal API, in a small prod outage I can say at least this is good such that tasks do not fail in small webserver outages or connection interrupts. We see daily flakiness on our WAN. As Zombie threshold is at 300s per default retrying more than 5min might not be needed. But also we should faster on small glitches... so the exponential back-off is good. I tested with the parameters below and I think for waiting 5min max it is reasonable to test 10 times in between before fail. - Oh, why specific in Edge? (I saw multiple occasions in retries in different places in the repo. But also moving to TaskSDK I think we also should consider making this more common - at least Edge API retries from far away should be matching to the calls that are made to TaskSDK!) - (My position: I'd favor to make such setting common in Edge API and at least TaskSDK calls) @ashb would also call for your opinion. And if needed - but I assume it can be made within this PR - we could also call for an open [DISCUSS] or loop-in other stakeholders. Le me know. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
