ferruzzi opened a new pull request, #28368:
URL: https://github.com/apache/airflow/pull/28368

   **TL;DR:**
   The Sagemaker System Test is flaky and fails occasionally because Reasons.  
This change greatly reduces the frequency of the failures.  
   
   **Long version:**
   The Sagemaker Tuning Job fails from time to time, partially because the test 
is being run on a small dataset for the sake of time and resources.  The tuning 
task will only fail if every retry of the tuning job fails.  Previously 
increasing MaxNumberOfTrainingJobs from 1 to 2 reduced failure rate from 
roughly 50% of the time down to roughly failing once every 7 runs.  After 
increasing it from 2 to 10 I have run this text 60 times locally and it failed 
only one of those.  10 is the largest value allowed by the API.  Since we are 
automatically running the tests internally now on AWS infrastructure, we can 
afford to throw a little more compute power behind it to get the tests passing 
more reliably.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to