[
https://issues.apache.org/jira/browse/HADOOP-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sanjay Radia updated HADOOP-2448:
---------------------------------
Fix Version/s: (was: 0.16.0)
> Improve Block report processing and name node restarts (Master Jira)
> --------------------------------------------------------------------
>
> Key: HADOOP-2448
> URL: https://issues.apache.org/jira/browse/HADOOP-2448
> Project: Hadoop
> Issue Type: Improvement
> Components: dfs
> Reporter: Sanjay Radia
> Assignee: Sanjay Radia
>
> It has been reported that for large clusters (2K datanodes), a restarted
> namenode can often take hours to leave safe mode.
> - admins have reported that if the data nodes are started, say 100 at a time,
> it significantly improves the startup time of the name node
> - setting the initial heap (as opposed to the max heap) to be larger also
> helps; this avoids the GCs that run before more memory is added to the heap.
> Observations of the Name node via JConsole and instrumentation:
> - if 80% of memory is used for maintaining the names and blocks data
> structures, then processing block reports can generate a lot of GC, causing
> block reports to take a long time to process. This causes the datanodes that
> sent the block reports to time out and resend them, making the situation
> worse.
> Hence to improve the situation the following are proposed:
> 1. Have random backoffs (of say 60sec for a 1K cluster) of the initial block
> report sent by a DN. This would match the randomization of the normal hourly
> block reports. (Jira HADOOP-2326)
> 2. Have the NN tell the DN how long to back off (i.e. rather than a single
> configuration parameter for the backoff). This would allow the system to
> adjust automatically to cluster size - smaller clusters will start up faster
> than larger clusters. (Jira HADOOP-2444)
> 3. Change the block reports to be an array of longs rather than an array of
> block report objects - this would reduce the amount of memory used to process
> a block report. This would help the initial startup and also the block report
> process during normal operation outside of safe mode. (Jira HADOOP-2110)
> 4. Queue and acknowledge the receipt of the block reports and have a separate
> set of threads process the block report queue. (Jira HADOOP-2111)
> 4 Jiras have been filed as noted.
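
Proposals 1 and 2 above can be sketched roughly as follows. This is only an
illustration, not code from any of the linked patches: the class name, the
method names, and the linear scaling rule are hypothetical, and the constant
simply reuses the "60sec for a 1K cluster" figure from the description.

```java
import java.util.Random;

// Hypothetical sketch of a randomized initial block report delay.
class BlockReportBackoff {
    // Assumed scaling: a 60 second backoff window per 1000 datanodes.
    static final long WINDOW_MS_PER_1K_NODES = 60_000L;

    // Namenode side (proposal 2): pick a backoff window that grows with
    // the number of datanodes, so small clusters start up faster.
    static long backoffWindowMs(int datanodeCount) {
        return WINDOW_MS_PER_1K_NODES * datanodeCount / 1000;
    }

    // Datanode side (proposal 1): wait a uniformly random time in
    // [0, windowMs) before sending the first block report.
    static long initialReportDelayMs(long windowMs, Random rng) {
        if (windowMs <= 0) {
            return 0L;
        }
        return (long) (rng.nextDouble() * windowMs);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        for (int nodes : new int[] {100, 1000, 2000}) {
            long window = backoffWindowMs(nodes);
            long delay = initialReportDelayMs(window, rng);
            System.out.println(nodes + " datanodes: window=" + window
                    + " ms, sample delay=" + delay + " ms");
        }
    }
}
```

With this assumed rule a 2K-node cluster spreads its initial reports over about
two minutes, while a 100-node cluster waits at most six seconds.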
> Based on experiments, we may not want to proceed with option 4. While option
> 4 did help block report processing when tried on its own, it turned out that
> in combination with option 1 it did not help much. Furthermore, cleaning up
> RPC to remove the client-side timeout (see Jira HADOOP-2188) would make this
> fix obsolete.
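
To illustrate why proposal 3 saves memory, here is a minimal sketch of packing
a block report into one long[] instead of an array of per-block objects. The
class name and the two-longs-per-block layout (block id, then length in bytes)
are assumptions made for this example; the layout chosen in HADOOP-2110 may
differ.

```java
// Hypothetical sketch: a block report stored as a flat array of longs.
class BlockReportCodec {
    // Assumed layout: [id0, len0, id1, len1, ...].
    static final int LONGS_PER_BLOCK = 2;

    // Pack parallel arrays of block ids and lengths into one long[].
    static long[] encode(long[] blockIds, long[] lengths) {
        if (blockIds.length != lengths.length) {
            throw new IllegalArgumentException("ids and lengths differ in size");
        }
        long[] report = new long[blockIds.length * LONGS_PER_BLOCK];
        for (int i = 0; i < blockIds.length; i++) {
            report[i * LONGS_PER_BLOCK] = blockIds[i];
            report[i * LONGS_PER_BLOCK + 1] = lengths[i];
        }
        return report;
    }

    // Number of blocks described by a packed report.
    static int blockCount(long[] report) {
        return report.length / LONGS_PER_BLOCK;
    }

    // Read fields of the i-th block without allocating an object per block.
    static long blockId(long[] report, int i) {
        return report[i * LONGS_PER_BLOCK];
    }

    static long numBytes(long[] report, int i) {
        return report[i * LONGS_PER_BLOCK + 1];
    }

    public static void main(String[] args) {
        long[] report = encode(new long[] {7L, 9L}, new long[] {1024L, 2048L});
        for (int i = 0; i < blockCount(report); i++) {
            System.out.println("block " + blockId(report, i)
                    + " has " + numBytes(report, i) + " bytes");
        }
    }
}
```

The receiver walks a single array rather than deserializing one object per
block, which is where the reduced allocation and GC pressure would come from.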