[ 
https://issues.apache.org/jira/browse/HADOOP-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12464417
 ] 

dhruba borthakur commented on HADOOP-885:
-----------------------------------------

Regarding gettimeofday: I agree with you. In fact, most call sites probably 
do not even need 1-second precision; something like 5-second precision may be 
enough. A code inspection should confirm this.
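
To make the idea concrete, here is a minimal sketch of a coarse clock: a 
background thread samples System.currentTimeMillis() at a fixed interval and 
request-handling code reads the cached value instead. The class name 
CoarseClock and its methods are hypothetical and are not part of Hadoop; this 
only illustrates the technique, assuming callers can tolerate timestamps that 
are stale by about one refresh interval.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not Hadoop code): caches the wall clock at a coarse granularity. */
final class CoarseClock implements AutoCloseable {
    // Last sampled wall-clock time; volatile so reader threads see each update.
    private volatile long cachedMillis = System.currentTimeMillis();
    private final ScheduledExecutorService ticker;

    CoarseClock(long refreshMillis) {
        ticker = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "coarse-clock");
            t.setDaemon(true); // do not keep the JVM alive just for the clock
            return t;
        });
        // One System.currentTimeMillis() call per tick, independent of request rate.
        ticker.scheduleAtFixedRate(
            () -> cachedMillis = System.currentTimeMillis(),
            refreshMillis, refreshMillis, TimeUnit.MILLISECONDS);
    }

    /** Returns a timestamp that may be stale by roughly one refresh interval. */
    long now() {
        return cachedMillis;
    }

    @Override
    public void close() {
        ticker.shutdownNow();
    }
}
```

With something like new CoarseClock(1000), code that only needs 1-second 
precision could call now() on every request at the cost of a volatile read, 
while the clock itself is sampled once per second.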

Regarding keeping RPC connections open: I had a 900 node cluster on which no 
jobs were running. In this case, there should have been only 900 connections to 
the namenode (one from each datanode). Also, the config value 
ipc.client.maxidletime was set to 120 seconds, which means that a connection 
should be closed only after 120 seconds of inactivity. I had observed (on the 
dfshealth.jsp page) that heartbeats from all datanodes were being processed 
within 3 seconds or so, so no connection should have been idle for anywhere 
near that long. There is another config parameter, ipc.client.idlethreshold, 
that specifies the threshold number of connections that triggers connection 
reaping. The default is 4000 connections, well above the 900 in use here.

So, it is still a mystery to me why connections were getting reaped so 
aggressively.
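
To spell out why this is surprising, here is a small model of the reaping 
rule as I described it above: connections are inspected only once their count 
exceeds the threshold, and only those idle longer than the maximum idle time 
are closed. The class IdleReapPolicy and its method are hypothetical and are 
not the actual Server.java logic; the two constructor arguments simply stand 
in for ipc.client.idlethreshold and ipc.client.maxidletime.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the reaping rule described above; not Hadoop's Server code. */
final class IdleReapPolicy {
    private final int idleThreshold;   // stands in for ipc.client.idlethreshold
    private final long maxIdleMillis;  // stands in for ipc.client.maxidletime

    IdleReapPolicy(int idleThreshold, long maxIdleMillis) {
        this.idleThreshold = idleThreshold;
        this.maxIdleMillis = maxIdleMillis;
    }

    /** Returns the indices of connections to close, given each one's last-contact time. */
    List<Integer> selectForReaping(long[] lastContactMillis, long nowMillis) {
        List<Integer> victims = new ArrayList<>();
        // At or below the threshold, no connection is inspected at all.
        if (lastContactMillis.length <= idleThreshold) {
            return victims;
        }
        for (int i = 0; i < lastContactMillis.length; i++) {
            // Above the threshold, close only connections idle longer than the limit.
            if (nowMillis - lastContactMillis[i] > maxIdleMillis) {
                victims.add(i);
            }
        }
        return victims;
    }
}
```

Under this model, 900 connections that were each heard from about 3 seconds 
ago, with a threshold of 4000 and a 120 second idle limit, yield an empty 
list. If the real code behaves like this, the closes seen in the strace must 
be coming from somewhere else.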


> Reduce CPU usage on namenode: gettimeofday
> ------------------------------------------
>
>                 Key: HADOOP-885
>                 URL: https://issues.apache.org/jira/browse/HADOOP-885
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.10.1
>            Reporter: dhruba borthakur
>         Assigned To: dhruba borthakur
>
> On a 900 node idle cluster, the namenode spends about 20% of CPU. Most of 
> this CPU is spent processing pure heartbeats. No jobs are running on this 
> cluster and all nodes are alive and behaving well.
> Of the total namenode CPU usage, about 12% is in user mode and about 70% is 
> in kernel mode! The question that naturally arises is: why is heartbeat 
> processing taking so much time in kernel mode?
> An strace of the namenode reveals that a 20 second period has about 52000 
> syscalls, with the following breakdown:
>   gettimeofday : 18000 calls
>   accept       :  2655 calls
>   close        :  2655 calls
>   shutdown     :  2655 calls
>   fcntl        :  7965 calls
>   read         :  7965 calls
>   futex        :  5295 calls
>   poll         :  4894 calls
> A code inspection reveals that the code makes multiple (about 5) calls to 
> System.currentTimeMillis() while processing a single request in the RPC.java 
> and Server.java classes. This suggests there is room for optimization.
