kevin85421 opened a new pull request #699:
URL: https://github.com/apache/submarine/pull/699


   ### What is this PR for?
   The two pull requests listed below aim to resolve Out-Of-Memory (OOM) errors. However, it is hard for users to predict actual memory usage. Using the memory request and memory limit mechanism to allow memory overcommitment is therefore helpful for users.
   
   * https://github.com/apache/submarine/pull/621
   * https://github.com/apache/submarine/pull/510
   
   In this PR, I set the memory limit to twice the memory request to enable memory overcommitment. With this patch, OOM errors can be reduced effectively.
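
   As a rough illustration of the rule above (not the code in this PR), the limit can be derived from the request as sketched below. The class `MemoryOvercommit`, the method `limitFor`, and the `"512M"` quantity format are hypothetical names and assumptions made for this example only.

   ```java
   // Hypothetical helper for illustration; not taken from the Submarine code base.
   final class MemoryOvercommit {
       // Assumed overcommit factor: the limit is twice the request, as described above.
       static final int LIMIT_FACTOR = 2;

       // Takes a quantity such as "512M" (digits followed by a unit suffix)
       // and returns the same quantity scaled by LIMIT_FACTOR, keeping the suffix.
       static String limitFor(String request) {
           int i = 0;
           while (i < request.length() && Character.isDigit(request.charAt(i))) {
               i++;
           }
           if (i == 0) {
               throw new IllegalArgumentException("No numeric amount in: " + request);
           }
           long amount = Long.parseLong(request.substring(0, i));
           String unit = request.substring(i);
           // multiplyExact throws ArithmeticException on overflow instead of wrapping.
           return Math.multiplyExact(amount, LIMIT_FACTOR) + unit;
       }
   }
   ```

   Under these assumptions, a worker that requests `512M` would get a limit of `1024M`, so it can use up to twice its request before it is killed.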
   
   This [article](https://cloud.google.com/blog/products/containers-kubernetes/kubernetes-best-practices-resource-requests-and-limits) is a good resource to better understand this PR.
   
   ### What type of PR is it?
   [Feature]
   
   ### Todos
   
   
   ### What is the Jira issue?
   https://issues.apache.org/jira/browse/SUBMARINE-948
   
   ### How should this be tested?
   **Test1**
   * Create a distributed TensorFlow MNIST job, and set the memory quota of a worker to 512 MB. To do so, change [experimentIT.java:90](https://github.com/apache/submarine/blob/master/submarine-test/test-e2e/src/test/java/org/apache/submarine/integration/experimentIT.java#L90) to
     ```java
     experimentPage.fillTfSpec(2, new String[]{"Ps", "Worker"}, new int[]{1, 1}, new int[]{1, 1}, new int[]{512, 512});
     ```
   * Without this PR, the MNIST job is killed due to an Out-Of-Memory error. With this PR, the MNIST job is not killed.
   
   **Test2**
   ```
   kubectl describe pod ${your_experiment_pod}
   ```
   * Check that the memory limit reported for the container is twice its memory request.
   <img width="422" alt="Screenshot 2021-08-06 2:40:42 PM" src="https://user-images.githubusercontent.com/20109646/128474314-bcfc0067-a841-4bdb-8ce2-4014849ffd57.png">
   
   ### Screenshots (if appropriate)
   
   ### Questions:
   * Do the license files need updating? No
   * Are there breaking changes for older versions? No
   * Does this need new documentation? No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


