[
https://issues.apache.org/jira/browse/HADOOP-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463999
]
Yoram Arnon commented on HADOOP-885:
------------------------------------
If the required time resolution is one second, a thread that calls
gettimeofday, updates a global variable, and then sleeps for a second would
remove virtually all of those system calls.
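
A minimal sketch of that idea in Java, assuming one-second resolution really
is acceptable to every caller. The class name CoarseClock and its methods are
hypothetical and are not part of Hadoop:

```java
/** Hypothetical helper (not a Hadoop class): caches the wall clock at ~1s resolution. */
final class CoarseClock {
    // Written only by the updater thread; volatile so readers always see the latest value.
    private static volatile long nowMillis = System.currentTimeMillis();
    private static boolean started = false;

    private CoarseClock() {}

    /** Starts the background updater once; later calls are no-ops. */
    static synchronized void start() {
        if (started) {
            return;
        }
        started = true;
        Thread updater = new Thread(() -> {
            while (true) {
                // The only real clock read: once per second instead of several per request.
                nowMillis = System.currentTimeMillis();
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    return; // stop refreshing if the thread is interrupted
                }
            }
        }, "coarse-clock");
        updater.setDaemon(true); // must not keep the JVM alive on shutdown
        updater.start();
    }

    /** Returns a timestamp that may be up to about one second stale; a plain memory read. */
    static long now() {
        return nowMillis;
    }
}
```

Request-handling code would call CoarseClock.now() wherever it currently calls
System.currentTimeMillis(), after CoarseClock.start() has run once at startup.
The trade-off is that every timestamp can lag the real clock by up to a second.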
As for the RPC: it was agreed that the namenode would be hard pressed to
maintain the number of connections, so connections are cached for a very short
time to enable short sessions with clients and are then torn down. In the case
of heartbeats, the caching period is much shorter than the interval between
heartbeats, so each heartbeat ends up opening a new connection.
Reviving an old discussion: UDP would improve performance *a lot*. It would
require a mechanism for delivering large messages that don't fit in a single
packet, but in the overwhelming majority of cases messages are small and the
performance improvement would be significant. Seven overhead packets and a lot
of kernel processing just to send a single packet for a single reply is a high
price to pay.
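
For illustration only, a single-datagram request/reply exchange could look
roughly like the following. This is a sketch under assumptions, not a proposed
patch: the class UdpHeartbeatSketch, the "ack" reply, and the 512-byte buffer
are all made up here, and it has no retransmission and no handling of messages
larger than one packet, both of which a real design would need.

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch (not Hadoop code): one datagram out, one datagram back. */
final class UdpHeartbeatSketch {
    private static final int MAX_PACKET = 512; // assumed upper bound for small messages

    /** Client side: sends one heartbeat and waits for a single-packet reply. */
    static String exchange(InetAddress host, int port, String payload, int timeoutMillis)
            throws Exception {
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setSoTimeout(timeoutMillis); // UDP is lossy: never block forever
            byte[] out = payload.getBytes(StandardCharsets.UTF_8);
            socket.send(new DatagramPacket(out, out.length, host, port));
            byte[] buf = new byte[MAX_PACKET];
            DatagramPacket reply = new DatagramPacket(buf, buf.length);
            socket.receive(reply);
            return new String(reply.getData(), 0, reply.getLength(), StandardCharsets.UTF_8);
        }
    }

    /** Server side: answers exactly one heartbeat on an already bound socket. */
    static void serveOnce(DatagramSocket server) throws Exception {
        byte[] buf = new byte[MAX_PACKET];
        DatagramPacket request = new DatagramPacket(buf, buf.length);
        server.receive(request);
        byte[] ack = "ack".getBytes(StandardCharsets.UTF_8);
        // Reply straight to the sender's address; no connection setup or teardown.
        server.send(new DatagramPacket(ack, ack.length, request.getAddress(), request.getPort()));
    }

    /** Runs both sides over loopback and returns the reply the client received. */
    static String loopbackDemo() {
        try (DatagramSocket server = new DatagramSocket(0, InetAddress.getLoopbackAddress())) {
            Thread responder = new Thread(() -> {
                try {
                    serveOnce(server);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            responder.start();
            String reply = exchange(InetAddress.getLoopbackAddress(), server.getLocalPort(),
                    "heartbeat node-1", 2000);
            responder.join();
            return reply;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

The whole exchange is two packets and needs no accept, shutdown, or close on
the server for each client.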
> Reduce CPU usage on namenode: gettimeofday
> ------------------------------------------
>
> Key: HADOOP-885
> URL: https://issues.apache.org/jira/browse/HADOOP-885
> Project: Hadoop
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.10.1
> Reporter: dhruba borthakur
> Assigned To: dhruba borthakur
>
> On a 900 node idle cluster, the namenode spends about 20% of CPU. Most of
> this CPU is spent processing pure heartbeats. No jobs are running on this
> cluster and all nodes are alive and acting well.
> Of the total namenode CPU usage, about 12% is in usermode and about 70% is in
> kernel mode! The question that naturally arises is why is heartbeat processing
> taking so much time in kernel mode?
> An strace of the namenode reveals that a 20-second period has about 52000
> syscalls, with the following breakdown:
> gettimeofday : 18000 calls
> accept : 2655 calls
> close : 2655 calls
> shutdown : 2655 calls
> fcntl : 7965 calls
> read : 7965 calls
> futex : 5295 calls
> poll : 4894 calls
> A code inspection reveals that the code is doing multiple (about 5) calls to
> System.currentTimeMillis() in processing a single request in the RPC.java and
> Server.java classes. This might mean that there is a possibility of
> optimization.
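
To put the strace figures quoted above in proportion, the counts can be
tallied as below. The numbers are copied from the issue description; the class
and method names are hypothetical and exist only for this calculation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: tallies the syscall counts quoted in the issue description. */
final class SyscallBreakdown {
    static final int WINDOW_SECONDS = 20; // length of the strace sample

    /** Per-syscall counts, in the order they appear in the report. */
    static Map<String, Integer> counts() {
        Map<String, Integer> c = new LinkedHashMap<>();
        c.put("gettimeofday", 18000);
        c.put("accept", 2655);
        c.put("close", 2655);
        c.put("shutdown", 2655);
        c.put("fcntl", 7965);
        c.put("read", 7965);
        c.put("futex", 5295);
        c.put("poll", 4894);
        return c;
    }

    /** Sum of all listed syscalls in the sample window. */
    static int total() {
        int sum = 0;
        for (int n : counts().values()) {
            sum += n;
        }
        return sum;
    }

    /** Average calls per second for one syscall over the sample window. */
    static double perSecond(String name) {
        return counts().get(name) / (double) WINDOW_SECONDS;
    }

    /** Share of the listed syscalls taken by one syscall, in percent. */
    static double sharePercent(String name) {
        return 100.0 * counts().get(name) / total();
    }
}
```

The listed counts sum to 52084, which matches the "about 52000" figure.
gettimeofday alone is roughly a third of them, or 900 calls per second.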