Nishieee commented on issue #67178:
URL: https://github.com/apache/airflow/issues/67178#issuecomment-4500117201

   > I looked into this bug and found out why the EMR tasks are failing 
randomly. The problem is that our AWS waiter fails immediately when it gets a 
throttling error from the API. To fix this, I will make the waiter ignore these 
temporary API throttling errors and keep retrying. It will still fail right away 
for real errors like wrong permissions, so that is safe. I am working on this 
fix right now and will open a pull request soon to solve it.
   
   I think the fix is heading in the right direction, but one thing is worth 
thinking about: the waiter_max_attempts budget is still consumed on every 
iteration of the loop, including the throttled ones. So on a long-running EMR 
job with sustained throttling, you could in theory exhaust max_attempts not 
because the job is actually stuck but because too many polls got throttled. 
That is probably rare in practice given typical max_attempts values, but it is 
the same failure mode the original issue reports, just shifted later in time. 
It might be worth tracking throttle retries separately, or at least logging 
when they are eating into the attempts budget.
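
   To make the "track throttle retries separately" idea concrete, here is a 
rough sketch of a polling loop with two independent budgets. It is not the 
Airflow or botocore waiter code: `ThrottledError`, `WaiterExhausted`, 
`wait_for_state`, and `max_throttle_retries` are all hypothetical names made up 
for illustration, and `poll` stands in for whatever call describes the job.

```python
import time
from typing import Callable, Tuple


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error returned by the API."""


class WaiterExhausted(Exception):
    """Raised when either the attempts budget or the throttle budget runs out."""


def wait_for_state(
    poll: Callable[[], str],
    is_done: Callable[[str], bool],
    max_attempts: int,
    max_throttle_retries: int,
    delay: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> Tuple[str, int, int]:
    """Poll until is_done(state) is true.

    Only polls that actually return a state count against max_attempts.
    Throttled polls count against a separate max_throttle_retries budget.
    Returns (final_state, attempts_used, throttled_polls).
    """
    attempts = 0
    throttled = 0
    while attempts < max_attempts:
        try:
            state = poll()
        except ThrottledError:
            throttled += 1
            if throttled > max_throttle_retries:
                raise WaiterExhausted(
                    f"throttled {throttled} times, limit is {max_throttle_retries}"
                )
            # Make it visible that throttling, not the job, is costing time.
            log(f"poll throttled ({throttled}/{max_throttle_retries}), retrying")
            sleep(delay)
            continue  # do not consume an attempt for a throttled poll
        attempts += 1
        if is_done(state):
            return state, attempts, throttled
        sleep(delay)
    raise WaiterExhausted(
        f"state not reached after {attempts} attempts ({throttled} throttled polls)"
    )
```

With this shape, a job that needs three real polls still succeeds with 
`max_attempts=3` even if several polls in between are throttled, and sustained 
throttling fails with its own message instead of looking like a stuck job. 
Other errors (e.g. wrong permissions) are not caught and still propagate 
immediately.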

