[ https://issues.apache.org/jira/browse/TEZ-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13976142#comment-13976142 ]
Rajesh Balamohan commented on TEZ-1074:
---------------------------------------

[~bikassaha] My initial patch sent only the updated counters, and it brought the CPU usage down from 76-80% to 50-55%. Even so, a significant amount of processing (readFields) was still happening because of the FileSystem counters. A couple of counters are also incremented for every record in places like DefaultSorter (e.g. mapOutputByteCounter, mapOutputRecordCounter). These updated counters therefore always ended up in the heartbeat payload, which kept the DAGAM CPU at 50-55%. Sending only the updated counters did not fix the issue completely.

Tez does not use the counters shared in the heartbeat payload to make any real-time decisions. Hence the suggestion to increase the default value of this knob from 100 ms to 2000 ms (for a cluster of approximately 200 nodes). A PB implementation would definitely help reduce the usage even further.

> DAGAppMaster takes lots of CPU when running a reasonably large job in the cluster
> ---------------------------------------------------------------------------------
>
>                 Key: TEZ-1074
>                 URL: https://issues.apache.org/jira/browse/TEZ-1074
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>         Attachments: Screen Shot 2014-04-19 at 7.26.36 PM.png, TEZ-1074-v2.patch
>
>
> - Ran a job which used 200 containers.
> - DAGAppMaster was running at 70% CPU most of the time during the job.
> - Profiling revealed that a lot of time was spent in TezEvent.readFields --> ... --> TaskStatusUpdateEvent.readFields().
> - Default "tez.task.am.heartbeat.interval-ms.max=100" ms. With 200 containers, potentially 2000 events (these events have TezCounters) per second would be processed by DAGAppMaster.
> With a large job, CPU usage of DAGAppMaster can increase significantly.
> One option to reduce CPU usage could be to send only the modified TezCounters in TezStatusUpdateEvent.
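
The event-rate estimate in the issue description (200 containers at a 100 ms heartbeat interval giving roughly 2000 events per second) can be checked with a short calculation. The sketch below is only an illustration: the class and method names are hypothetical and not part of Tez, and it assumes every container sends exactly one status event per heartbeat interval.

```java
// Hypothetical helper, not part of Tez: estimates how many heartbeat
// events per second an application master would have to process.
final class HeartbeatLoad {

    // Assumes each container sends exactly one event per heartbeat interval.
    static double eventsPerSecond(int containers, long heartbeatIntervalMs) {
        if (containers < 0 || heartbeatIntervalMs <= 0) {
            throw new IllegalArgumentException("containers must be >= 0 and interval must be > 0");
        }
        // Heartbeats per second for one container, times the number of containers.
        return containers * (1000.0 / heartbeatIntervalMs);
    }

    public static void main(String[] args) {
        // Default interval from the issue: 100 ms with 200 containers.
        System.out.println(eventsPerSecond(200, 100));   // prints 2000.0
        // Suggested interval from the comment: 2000 ms with 200 containers.
        System.out.println(eventsPerSecond(200, 2000));  // prints 100.0
    }
}
```

Under these assumptions, raising the interval from 100 ms to 2000 ms cuts the number of events carrying counters by a factor of 20, independent of how many counters each event contains.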