ZitingShen opened a new pull request #1530: URL: https://github.com/apache/samza/pull/1530
Symptom: Some Samza jobs crash with a YARN PMEM error even though the reported physical-memory-mb metric is much lower than the container size.

<img width="761" alt="Screen Shot 2021-09-09 at 3 40 44 PM" src="https://user-images.githubusercontent.com/9065044/132929912-9f8495e1-5c6e-4c1c-909a-fafba08cd5d5.png">

Cause: The current physical-memory-mb metric only counts the RSS memory of the Java process that runs the application. It ignores all of that process's child processes, including those that load TensorFlow models and use a lot of memory.

<img width="1105" alt="Screen Shot 2021-09-09 at 3 39 37 PM" src="https://user-images.githubusercontent.com/9065044/132929844-a171bd16-0e9c-4250-8f3f-9a5958933f78.png">

Changes: Get all child processes of the Java process that runs the application, and report the sum of their RSS memory and the Java process's own RSS memory as the physical-memory-mb of the container.

Test: unit tests
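
For illustration, the idea of summing RSS over the Java process and its child processes could look roughly like the sketch below. This is not the code from this pull request. The class name `ProcessTreeRss`, its method names, and the choice to read `pid`, `ppid`, and `rss` from `ps` are all assumptions made for the example, as is the choice to count grandchildren as well as direct children. It also assumes Java 9+ for `ProcessHandle` and a `ps` that reports RSS in KiB.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Samza: sums RSS over a process tree.
final class ProcessTreeRss {

  // Sums the RSS (KiB) of rootPid and all of its descendants.
  // Each input line is expected to hold three columns: "pid ppid rss".
  static long totalRssKb(long rootPid, List<String> psLines) {
    Map<Long, Long> rssByPid = new HashMap<>();
    Map<Long, List<Long>> childrenByParent = new HashMap<>();
    for (String line : psLines) {
      String[] cols = line.trim().split("\\s+");
      if (cols.length != 3) {
        continue; // skip blank or malformed rows
      }
      try {
        long pid = Long.parseLong(cols[0]);
        long ppid = Long.parseLong(cols[1]);
        long rssKb = Long.parseLong(cols[2]);
        rssByPid.put(pid, rssKb);
        childrenByParent.computeIfAbsent(ppid, k -> new ArrayList<>()).add(pid);
      } catch (NumberFormatException e) {
        // skip rows with non-numeric columns
      }
    }

    // Walk the tree from the root; the visited set guards against
    // self-parented entries that would otherwise loop forever.
    long totalKb = 0;
    Set<Long> visited = new HashSet<>();
    Deque<Long> pending = new ArrayDeque<>();
    pending.push(rootPid);
    while (!pending.isEmpty()) {
      long pid = pending.pop();
      if (!visited.add(pid)) {
        continue;
      }
      totalKb += rssByPid.getOrDefault(pid, 0L);
      for (Long child : childrenByParent.getOrDefault(pid, Collections.emptyList())) {
        pending.push(child);
      }
    }
    return totalKb;
  }

  // Runs `ps` and returns its rows. A separate -o per column with an empty
  // header ("pid=") suppresses the header line.
  static List<String> readPsLines() throws IOException, InterruptedException {
    Process ps = new ProcessBuilder(
        "ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "rss=")
        .redirectErrorStream(true)
        .start();
    List<String> lines = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(ps.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        lines.add(line);
      }
    }
    ps.waitFor();
    return lines;
  }

  public static void main(String[] args) throws Exception {
    long selfPid = ProcessHandle.current().pid(); // Java 9+
    long totalKb = totalRssKb(selfPid, readPsLines());
    System.out.println("process tree RSS (MB): " + totalKb / 1024);
  }
}
```

Keeping the tree walk separate from the `ps` invocation means the summation can be unit tested with fixed input rows, without spawning real child processes.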
