[ https://issues.apache.org/jira/browse/HADOOP-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580005#action_12580005 ]
dhruba borthakur commented on HADOOP-2606:
------------------------------------------
1. This patch exits the ReplicationMonitor thread when it receives an
InterruptedException. This is nice because it helps unit tests that restart the
namenode. Maybe we can make the same change for all the other FSNamesystem daemons,
e.g. DecommissionedMonitor, ResolutionMonitor, etc.
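For example, each daemon's run() loop could break out on interrupt along these
lines (sketch only; the class and field names below are approximate, not the
actual FSNamesystem code):

  class DecommissionedMonitor implements Runnable {
    public void run() {
      while (fsRunning) {
        try {
          decommissionedDatanodeCheck();
          Thread.sleep(decommissionRecheckInterval);
        } catch (InterruptedException ie) {
          // exit the daemon instead of swallowing the interrupt,
          // so a namenode restart in a unit test does not leak this thread
          LOG.info("DecommissionedMonitor thread received InterruptedException, exiting.");
          break;
        }
      }
    }
  }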
2. A typo "arleady reached replication limit". Should be "already ....".
3. If a block in neededReplications does not belong to any file, we silently
remove it from neededReplications. This is a "cannot happen" case, so we should
log a message when it occurs.
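For instance (sketch only; the identifiers below, e.g. blocksMap.getINode() and
stateChangeLog, are my approximation rather than the exact code in the patch):

  INodeFile fileINode = blocksMap.getINode(block);
  if (fileINode == null) {
    // block no longer belongs to any file: should not happen, so say so in the log
    NameNode.stateChangeLog.warn("BLOCK* Removing block " + block
        + " from neededReplications as it does not belong to any file.");
    neededReplications.remove(block, priority);
    continue;  // skip this block and move on to the next one
  }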
4. This patch prefers nodes being decommissioned as the source of replication
requests. When a node reaches the decommissioned state, the administrator is
likely to shut that node down, so there is a good chance the node is still
serving a replication request at that point; that replication request will time
out because the machine was shut down. This is probably acceptable.
5. FSNamesystem.chooseSourceDatanode() should always return a node if possible.
In the current code, this is not guaranteed because r.nextBoolean() may return
false for many invocations at a stretch. It might be a good idea to do the
following at the end of chooseSourceDatanode:
if (srcNode == null) {
  srcNode = first datanode in the list that has not reached its replication limit;
}
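In Java terms, something along these lines might work (sketch only; the iterator
and the limit check below are approximations, and the exclusions the real method
applies, e.g. decommissioned or corrupt replicas, are omitted):

  if (srcNode == null) {
    // deterministic fallback: take the first containing node that still has capacity
    for (Iterator<DatanodeDescriptor> it = blocksMap.nodeIterator(block); it.hasNext();) {
      DatanodeDescriptor node = it.next();
      if (node.getNumberOfBlocksToBeReplicated() < maxReplicationStreams) {
        srcNode = node;
        break;
      }
    }
  }
  return srcNode;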
6. There used to be an important log message that described a replication
request:
" pending Transfer .... ask node ... ".
This has changed to
" computeReplicationWork .. ask node.."
It may be better not to include the method name in log messages. Otherwise,
when the method name changes in the future, the log message changes too, which
makes it harder for people accustomed to the earlier messages to debug the
system.
7. Typo in NNThroughputBenchmark.isInPorgress(). It should be isInProgress().
> Namenode unstable when replicating 500k blocks at once
> ------------------------------------------------------
>
> Key: HADOOP-2606
> URL: https://issues.apache.org/jira/browse/HADOOP-2606
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.14.3
> Reporter: Koji Noguchi
> Assignee: Konstantin Shvachko
> Fix For: 0.17.0
>
> Attachments: ReplicatorNew.patch, ReplicatorNew1.patch,
> ReplicatorTestOld.patch
>
>
> We tried to decommission about 40 nodes at once, each containing 12k blocks.
> (about 500k total)
> (This also happened when we first tried to decommission 2 million blocks)
> Clients started experiencing "java.lang.RuntimeException:
> java.net.SocketTimeoutException: timed out waiting for rpc response" and the
> namenode was pegged at 100% CPU.
> It was spending most of its time on one thread,
> "[EMAIL PROTECTED]" daemon prio=10 tid=0x0000002e10702800 nid=0x6718
> runnable [0x0000000041a42000..0x0000000041a42a30]
> java.lang.Thread.State: RUNNABLE
> at
> org.apache.hadoop.dfs.FSNamesystem.containingNodeList(FSNamesystem.java:2766)
> at
> org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2870)
> - locked <0x0000002aa3cef720> (a
> org.apache.hadoop.dfs.UnderReplicatedBlocks)
> - locked <0x0000002aa3c42e28> (a org.apache.hadoop.dfs.FSNamesystem)
> at
> org.apache.hadoop.dfs.FSNamesystem.computeDatanodeWork(FSNamesystem.java:1928)
> at
> org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:1868)
> at java.lang.Thread.run(Thread.java:619)
> We confirmed that the Namenode was not in a full GC state when this problem
> happened.
> Also, dfsadmin -metasave showed that "Blocks waiting for replication" was
> decreasing very slowly.
> I believe this is not specific to decommissioning; the same problem would
> happen if we lost one rack.