EBernhardson has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/395923 )
Change subject: Enable more accurate smaps based rss checking
......................................................................
Enable more accurate smaps based rss checking
Training xgboost models in the hadoop cluster is running into issues
where yarn regularly kills some, but not all, containers. Based on a
review of yarn's code, this appears to be because we are using the
default RSS calculation, which is documented as less accurate.
Specifically, it includes pages that the kernel is free to evict, and
it double (triple, etc.) counts read-only memory shared by many
processes.
A custom implementation of the smaps-based algorithm was injected into
a background task while training mlr models. It showed that the more
accurate algorithm reports constant memory usage. Enabling this will
allow us to stop over-allocating memory to account for this
discrepancy, and will require 250 GB less memory for the 9 hour
training process.
Bug: T182276
Change-Id: I0f8223db4d4abc26eb9d04ff106b7e49602f504e
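For readers unfamiliar with the smaps-based calculation, the formula quoted in the property description in the diff below, Min(Shared_Dirty, Pss) + Private_Clean + Private_Dirty, can be sketched roughly as follows. This is an illustration only, not yarn's implementation and not the custom implementation mentioned above. The function name smaps_rss_kb, the sample data, and the choice to apply the formula per mapping and to treat "r--s" and "r-xs" permissions as the read-only shared mappings to skip are all assumptions.

```python
import re

# Mapping header line, e.g. "00400000-0040b000 r-xp 00000000 08:01 123 /bin/cat"
_HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+(\S{4})\s")
# Per-mapping field line, e.g. "Private_Dirty:        12 kB"
_FIELD = re.compile(r"^(\w+):\s+(\d+)\s+kB")

# Assumption: these permission strings identify read-only shared mappings.
_READ_ONLY_SHARED = {"r--s", "r-xs"}


def smaps_rss_kb(smaps_text):
    """Estimate RSS in kB from the text of a /proc/<pid>/smaps file.

    Assumption: the formula is applied to each mapping and the results
    are summed, skipping read-only shared mappings entirely.
    """
    mappings = []
    for line in smaps_text.splitlines():
        header = _HEADER.match(line)
        if header:
            # Start a new mapping record keyed by its permission string.
            mappings.append({"perms": header.group(1)})
            continue
        field = _FIELD.match(line)
        if field and mappings:
            mappings[-1][field.group(1)] = int(field.group(2))

    total = 0
    for m in mappings:
        if m["perms"] in _READ_ONLY_SHARED:
            continue
        total += (
            min(m.get("Shared_Dirty", 0), m.get("Pss", 0))
            + m.get("Private_Clean", 0)
            + m.get("Private_Dirty", 0)
        )
    return total


# Made-up sample: one shared, clean code mapping and one private heap mapping.
SAMPLE = """\
00400000-0040b000 r-xp 00000000 08:01 123 /bin/cat
Rss:                  40 kB
Pss:                  10 kB
Shared_Clean:         40 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
7f0000000000-7f0000100000 rw-p 00000000 00:00 0
Rss:                 100 kB
Pss:                 100 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:        20 kB
Private_Dirty:        80 kB
"""

if __name__ == "__main__":
    print(smaps_rss_kb(SAMPLE))
```

In this made-up sample the plain Rss fields add up to 140 kB, while the sketch reports 100 kB, because the 40 kB of clean shared pages in the first mapping are not counted.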
---
M templates/hadoop/yarn-site.xml.erb
1 file changed, 6 insertions(+), 0 deletions(-)
git pull ssh://gerrit.wikimedia.org:29418/operations/puppet/cdh refs/changes/23/395923/1
diff --git a/templates/hadoop/yarn-site.xml.erb b/templates/hadoop/yarn-site.xml.erb
index 5657577..913a028 100644
--- a/templates/hadoop/yarn-site.xml.erb
+++ b/templates/hadoop/yarn-site.xml.erb
@@ -169,6 +169,12 @@
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
+ <property>
+ <description>RSS usage of a process computed via /proc/pid/stat is not very accurate as it includes shared pages of a process. /proc/pid/smaps provides useful information like Private_Dirty, Private_Clean, Shared_Dirty, Shared_Clean which can be used for computing more accurate RSS. When this flag is enabled, RSS is computed as Min(Shared_Dirty, Pss) + Private_Clean + Private_Dirty. It excludes read-only shared mappings in RSS computation.</description>
+ <name>yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled</name>
+ <value>true</value>
+ </property>
+
<% if @datanode_mounts -%>
<property>
<description>List of directories to store localized files in.</description>
--
To view, visit https://gerrit.wikimedia.org/r/395923
Gerrit-MessageType: newchange
Gerrit-Change-Id: I0f8223db4d4abc26eb9d04ff106b7e49602f504e
Gerrit-PatchSet: 1
Gerrit-Project: operations/puppet/cdh
Gerrit-Branch: master
Gerrit-Owner: EBernhardson <[email protected]>