[
https://issues.apache.org/jira/browse/HADOOP-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576031#action_12576031
]
Konstantin Shvachko commented on HADOOP-2606:
---------------------------------------------
I spent some time investigating this issue.
ReplicationMonitor is definitely a problem when we have a lot of under-replicated blocks. Here is how it works.
ReplicationMonitor wakes up every 3 seconds and selects 32% of the data-nodes.
For each selected data-node the monitor scans the list of under-replicated blocks (called neededReplications) and selects two blocks from that list that the current node can replicate.
With 2000 nodes and 500,000 under-replicated blocks, each iteration of the monitor (the one that runs every 3 seconds) performs about 640 searches (32% of 2000 nodes) over the list of 500,000 blocks.
Each search is a sequential scan of the list until 2 blocks are found.
This can take a lot of time on average, and it is especially expensive when a data-node does not contain replicas of any of the blocks in the list.
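To make the cost concrete, here is a minimal model of the current per-data-node search (simplified types and names of my own, not the actual FSNamesystem code): a block is just a Long id, and a data-node is represented by the set of block ids it holds.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified model of the current search, not the actual FSNamesystem code.
class CurrentScanSketch {
  // For one chosen data-node, walk the whole neededReplications list until
  // two blocks that the node holds are found.
  static List<Long> chooseBlocksFor(Set<Long> nodeBlocks,
                                    List<Long> neededReplications) {
    List<Long> chosen = new ArrayList<Long>(2);
    for (Long b : neededReplications) {      // sequential scan of the full list
      if (nodeBlocks.contains(b)) {
        chosen.add(b);
        if (chosen.size() == 2) {
          break;                             // stop once two blocks are found
        }
      }
    }
    // Worst case: the node holds none of the listed blocks and every one of
    // the 500,000 entries is examined for nothing.
    return chosen;
  }
}

At 2000 nodes this routine runs ~640 times per 3-second iteration, each time over the same 500,000-entry list.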
Rather than optimizing this algorithm, I propose to change it so that instead of choosing data-nodes and then looking for related blocks, the ReplicationMonitor selects under-replicated blocks and assigns each of them for replication to one of the data-nodes it belongs to.
We should of course avoid the case where too many blocks (more than 4?) are assigned for replication to the same node.
So if all nodes a block belongs to have already been scheduled for a lot of replications, the block should be skipped.
The number of blocks to scan during one sweep should depend on the number of live data-nodes; I'd say double that number.
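A rough sketch of the block-centric sweep I have in mind (hypothetical names and types, not a patch): blocks are Long ids, data-nodes are String ids, and holders maps each block to the nodes that have a replica of it.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the proposed block-centric sweep; names are illustrative only.
class ProposedScanSketch {
  static final int MAX_REPL_PER_NODE = 4;   // avoid piling more than 4 blocks on one node

  static Map<String, Integer> sweep(List<Long> neededReplications,
                                    Map<Long, List<String>> holders,
                                    int liveDatanodes) {
    int blocksToScan = 2 * liveDatanodes;            // scan length tied to cluster size
    Map<String, Integer> scheduled = new HashMap<String, Integer>();
    int scanned = 0;
    for (Long block : neededReplications) {
      if (scanned++ >= blocksToScan) {
        break;                                       // bounded sweep, no full-list scans
      }
      List<String> nodes = holders.get(block);
      if (nodes == null) {
        continue;                                    // no known replica, skip for now
      }
      // Assign the block to the first holder that still has capacity.
      for (String node : nodes) {
        int n = scheduled.containsKey(node) ? scheduled.get(node) : 0;
        if (n < MAX_REPL_PER_NODE) {
          scheduled.put(node, n + 1);                // schedule a replication from this node
          break;
        }
      }
      // If every holder is already saturated the block is simply skipped
      // this sweep and reconsidered on the next one.
    }
    return scheduled;
  }
}

The point is that one sweep touches at most 2 * liveDatanodes entries of neededReplications instead of ~640 full scans, and the per-node cap keeps any single data-node from being flooded with transfers.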
> Namenode unstable when replicating 500k blocks at once
> ------------------------------------------------------
>
> Key: HADOOP-2606
> URL: https://issues.apache.org/jira/browse/HADOOP-2606
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.14.3
> Reporter: Koji Noguchi
> Assignee: Konstantin Shvachko
> Fix For: 0.17.0
>
>
> We tried to decommission about 40 nodes at once, each containing 12k blocks.
> (about 500k total)
> (This also happened when we first tried to decommission 2 million blocks)
> Clients started experiencing "java.lang.RuntimeException:
> java.net.SocketTimeoutException: timed out waiting for rpc
> response" and namenode was in 100% cpu state.
> It was spending most of its time on one thread,
> "[EMAIL PROTECTED]" daemon prio=10 tid=0x0000002e10702800 nid=0x6718
> runnable [0x0000000041a42000..0x0000000041a42a30]
> java.lang.Thread.State: RUNNABLE
> at
> org.apache.hadoop.dfs.FSNamesystem.containingNodeList(FSNamesystem.java:2766)
> at
> org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2870)
> - locked <0x0000002aa3cef720> (a
> org.apache.hadoop.dfs.UnderReplicatedBlocks)
> - locked <0x0000002aa3c42e28> (a org.apache.hadoop.dfs.FSNamesystem)
> at
> org.apache.hadoop.dfs.FSNamesystem.computeDatanodeWork(FSNamesystem.java:1928)
> at
> org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:1868)
> at java.lang.Thread.run(Thread.java:619)
> We confirmed that the Namenode was not in a full GC state when these problems happened.
> Also, dfsadmin -metasave was showing "Blocks waiting for replication" was
> decreasing very slowly.
> I believe this is not specific to decommissioning, and the same problem would happen if we lost one rack.