FatalLin opened a new pull request #694:
URL: https://github.com/apache/submarine/pull/694


   ### What is this PR for?
   As mentioned in the Jira ticket, Submarine currently retries retryable jobs 
endlessly, even when a job has no chance of succeeding. This is an obvious 
waste of resources, so I added an MLJob property, BackoffLimit, to prevent 
this situation. At the same time, I changed MLJobSpec from an interface into 
an abstract class so that it can share the property with TFJobSpec and 
PytorchJobSpec.
   I also fixed a bug so that the correct experiment status is returned in the 
failure case.
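
The idea of sharing one retry cap through an abstract base class can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the field type, the getter/setter names, and the `shouldRetry` helper are assumptions made for the example, and in the real system the limit is presumably enforced by the job operator on the cluster rather than by Java code like this.

```java
// Hypothetical, simplified stand-ins for the classes named in this PR.
abstract class MLJobSpec {
  // Shared retry cap; null means "no limit configured" (assumed semantics).
  private Integer backoffLimit;

  public Integer getBackoffLimit() {
    return backoffLimit;
  }

  public void setBackoffLimit(Integer backoffLimit) {
    this.backoffLimit = backoffLimit;
  }
}

// Both concrete specs inherit the property instead of redeclaring it.
class TFJobSpec extends MLJobSpec { }

class PytorchJobSpec extends MLJobSpec { }

class BackoffLimitSketch {
  // Illustrative helper: another retry is allowed only while the number of
  // retries already made is below the configured limit.
  static boolean shouldRetry(MLJobSpec spec, int retriesSoFar) {
    Integer limit = spec.getBackoffLimit();
    return limit == null || retriesSoFar < limit;
  }

  public static void main(String[] args) {
    MLJobSpec spec = new TFJobSpec();
    spec.setBackoffLimit(3);

    // Simulate a job that fails on every attempt (for example, an OOM).
    int retries = 0;
    while (shouldRetry(spec, retries)) {
      retries++;
    }
    System.out.println("status=Failed after " + retries + " retries");
  }
}
```

With an interface, each spec class would have to declare and store the property itself; the abstract class keeps the field and its accessors in one place.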
   
   ### What type of PR is it?
   Improvement
   
   ### Todos
   N/A
   
   ### What is the Jira issue?
   https://issues.apache.org/jira/browse/SUBMARINE-952
   
   ### How should this be tested?
   Modify the test case 
(https://github.com/apache/submarine/blob/master/submarine-test/test-e2e/src/test/java/org/apache/submarine/integration/experimentIT.java#L90)
 from {1024, 1024} to {512, 512}.
   The experiment will then hit OOMFailure, and its status will change to 
failed after 3 retries.
   ### Screenshots (if appropriate)
   <img width="1380" alt="Screenshot 2021-08-01 5.03.43 PM" 
src="https://user-images.githubusercontent.com/5687317/128044592-e2cee95c-2ee9-4702-88ff-d41950e003ec.png">
   <img width="1394" alt="Screenshot 2021-08-03 11.10.39 PM" 
src="https://user-images.githubusercontent.com/5687317/128044618-454afdb5-c1b8-4395-a75e-f470a7c41625.png">
   
   ### Questions:
   * Do the license files need updating? No
   * Are there breaking changes for older versions? No
   * Does this need new documentation? No
   

