[jira] [Commented] (TEZ-1074) DAGAppMaster takes lots of CPU when running a reasonably large job in the cluster

Rajesh Balamohan (JIRA) Wed, 23 Apr 2014 13:31:22 -0700

    [ https://issues.apache.org/jira/browse/TEZ-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13978847#comment-13978847 ]


Rajesh Balamohan commented on TEZ-1074:
---------------------------------------

If we send only the updated counters, we need to iterate through and merge all 
of the counters back in the STATUS_UPDATER (an expensive operation).  Otherwise 
the AM would end up with a half-baked, incomplete set of counters.  Since we 
send counters every second, I have removed the counter update portion in the 
latest patch.
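
To illustrate the trade-off described above, here is a minimal sketch in Java. 
It is not the Tez implementation: the class CounterMergeSketch, its methods, 
and the use of a plain Map in place of TezCounters are all hypothetical, and 
it assumes a delta carries the latest absolute value of each changed counter.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a task's counters: counter name -> latest value.
// This is NOT the TezCounters API; a plain map keeps the sketch self-contained.
public class CounterMergeSketch {

    // Delta approach: the task sends only the counters that changed, so the
    // receiver has to walk every entry and fold it into its accumulated view.
    public static void mergeDelta(Map<String, Long> accumulated,
                                  Map<String, Long> delta) {
        for (Map.Entry<String, Long> e : delta.entrySet()) {
            // Assumption: the delta holds absolute values, so we overwrite.
            accumulated.put(e.getKey(), e.getValue());
        }
    }

    // Full-snapshot approach: the task sends every counter, so the receiver
    // can swap in the new snapshot without any per-counter merge.
    public static Map<String, Long> replaceWithSnapshot(Map<String, Long> snapshot) {
        return snapshot;
    }

    public static void main(String[] args) {
        Map<String, Long> accumulated = new HashMap<>();
        accumulated.put("RECORDS_IN", 100L);
        accumulated.put("BYTES_READ", 4096L);

        // Only RECORDS_IN changed since the last update.
        Map<String, Long> delta = new HashMap<>();
        delta.put("RECORDS_IN", 150L);

        mergeDelta(accumulated, delta);
        System.out.println(accumulated);
    }
}
```

If the receiver skipped mergeDelta and simply kept the delta, BYTES_READ would 
be lost, which is the incomplete counter set mentioned above.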

> DAGAppMaster takes lots of CPU when running a reasonably large job in the 
> cluster
> ---------------------------------------------------------------------------------
>
>                 Key: TEZ-1074
>                 URL: https://issues.apache.org/jira/browse/TEZ-1074
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>         Attachments: Screen Shot 2014-04-19 at 7.26.36 PM.png, 
> TEZ-1074-v1.patch, TEZ-1074-v2.patch, TEZ-1074-v7.patch, TEZ-1074-v8.patch
>
>
> - Ran a job that used 200 containers.
> - DAGAppMaster was running at 70% CPU for most of the job.
> - Profiling revealed that a lot of time was spent in TezEvent.readFields --> 
> ... --> TaskStatusUpdateEvent.readFields().
> - The default is "tez.task.am.heartbeat.interval-ms.max=100" (ms).  With 200 
> containers, potentially 2000 events per second (each carrying TezCounters) 
> would be processed by DAGAppMaster.
> With a large job, the CPU usage of DAGAppMaster can grow significantly.  
> One option to reduce CPU usage could be to send only the modified TezCounters 
> in TaskStatusUpdateEvent.
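
The 2000 events per second figure in the description follows from simple 
arithmetic, sketched below in Java. The class and method names are 
hypothetical, and the sketch assumes every container sends exactly one status 
update event per heartbeat interval.

```java
// Hypothetical helper; not part of Tez.
public class HeartbeatRateSketch {

    // Events per second arriving at the AM, assuming each container sends
    // one event every heartbeatIntervalMs milliseconds.
    public static long eventsPerSecond(int containers, long heartbeatIntervalMs) {
        return containers * 1000L / heartbeatIntervalMs;
    }

    public static void main(String[] args) {
        // 200 containers at a 100 ms heartbeat interval.
        System.out.println(eventsPerSecond(200, 100));
    }
}
```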



--
This message was sent by Atlassian JIRA
(v6.2#6252)
