[
https://issues.apache.org/jira/browse/HADOOP-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Konstantin Shvachko updated HADOOP-2606:
----------------------------------------
Attachment: ReplicatorNew.patch
This patch implements the approach mentioned above.
Namely, the replication monitor scans the list of under-replicated blocks and
schedules them for replication to and from appropriate data-nodes. This is in
contrast to the current approach, in which we choose a node and then scan the list
to select a small number of blocks that the chosen node can replicate.
The new algorithm tries to schedule more replications on nodes with an ongoing
decommission. It also does not schedule any replications on nodes that are
already decommissioned; this check was not present in the previous
algorithm.
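As a rough sketch of the block-centric selection rule described above (all class and method names here are illustrative, not the actual identifiers in ReplicatorNew.patch):

```java
import java.util.List;

// Hypothetical sketch: for one under-replicated block, choose a source
// replica. Decommissioned nodes are never chosen, nodes with an ongoing
// decommission are preferred (to drain them faster), and the per-node
// replication-stream limit is respected.
class ReplicationSketch {
    enum State { NORMAL, DECOMMISSION_IN_PROGRESS, DECOMMISSIONED }

    static class DataNode {
        final String name;
        final State state;
        int activeStreams;  // replications currently in flight on this node
        DataNode(String name, State state) { this.name = name; this.state = state; }
    }

    // Mirrors the default of the dfs.max-repl-streams parameter.
    static final int MAX_REPL_STREAMS = 2;

    static DataNode pickSource(List<DataNode> replicas) {
        DataNode candidate = null;
        for (DataNode d : replicas) {
            if (d.state == State.DECOMMISSIONED) continue;         // never a source
            if (d.activeStreams >= MAX_REPL_STREAMS) continue;     // stream limit
            if (d.state == State.DECOMMISSION_IN_PROGRESS) return d; // preferred
            if (candidate == null) candidate = d;
        }
        return candidate;  // null if no eligible replica exists
    }
}
```

Because the monitor iterates over blocks rather than over nodes, each block is handled once per pass instead of being rescanned for every candidate node.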
The patch also presents a benchmark and a test.
The benchmark directly calls the replication scheduler until all blocks are
replicated and measures how many blocks per second on average it can schedule.
The test runs the benchmark with default parameters.
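The shape of such a benchmark can be sketched as follows (the scheduler here is a stub; names and structure are assumptions, not the patch's actual benchmark):

```java
// Illustrative micro-benchmark: repeatedly invoke a (stub) scheduler pass
// until every block has been scheduled, then report average blocks per
// second. Nothing here is taken from ReplicatorNew.patch.
class SchedulerBenchmark {
    // Stub standing in for one pass of the replication scheduler; returns
    // how many blocks were scheduled in this pass.
    static int scheduleOnePass(int remaining, int batchSize) {
        return Math.min(remaining, batchSize);
    }

    // Drives the scheduler until all blocks are scheduled and returns the
    // average scheduling throughput in blocks per second.
    static double run(int totalBlocks, int batchSize) {
        long start = System.nanoTime();
        int remaining = totalBlocks;
        while (remaining > 0) {
            remaining -= scheduleOnePass(remaining, batchSize);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return totalBlocks / seconds;
    }
}
```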
I ran the test for the old version and for the new one.
On my machine the new replicator processes about 9700 blocks per second while
the old one does only 640, making the new one about *15 times faster*.
This of course does not mean that blocks will be replicated 15 times faster on
a real cluster. It just means that the replication monitor will consume much less
CPU and will let other name-node operations run faster.
For those who want to accelerate replication: you need to adjust an
undocumented configuration parameter, "dfs.max-repl-streams", which defines
the maximum number of replications a data-node is allowed to handle at one time.
The default is 2.
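For example, raising the limit would look something like this in the cluster configuration (the property name is as given above; the value 4 and the file placement are illustrative, assuming the usual site-configuration override mechanism):

```xml
<!-- In the site configuration file, e.g. hadoop-site.xml -->
<property>
  <name>dfs.max-repl-streams</name>
  <!-- Max concurrent replications per data-node; default is 2 -->
  <value>4</value>
</property>
```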
TestReplication is expected to fail with the new algorithm. The problem is that
data-nodes do not report to the name-node crc exceptions encountered during
replication. Previously another data-node (if one existed) would be chosen as
the source for the block, and the replication would eventually succeed. But now the
same source node is deterministically chosen every time. I think data-nodes
should report crc exceptions the same way clients do. I'll file a bug for
discussion.
> Namenode unstable when replicating 500k blocks at once
> ------------------------------------------------------
>
> Key: HADOOP-2606
> URL: https://issues.apache.org/jira/browse/HADOOP-2606
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.14.3
> Reporter: Koji Noguchi
> Assignee: Konstantin Shvachko
> Fix For: 0.17.0
>
> Attachments: ReplicatorNew.patch, ReplicatorTestOld.patch
>
>
> We tried to decommission about 40 nodes at once, each containing 12k blocks.
> (about 500k total)
> (This also happened when we first tried to decommission 2 million blocks)
> Clients started experiencing "java.lang.RuntimeException:
> java.net.SocketTimeoutException: timed out waiting for rpc
> response" and the namenode was at 100% CPU.
> It was spending most of its time on one thread:
> "[EMAIL PROTECTED]" daemon prio=10 tid=0x0000002e10702800 nid=0x6718 runnable [0x0000000041a42000..0x0000000041a42a30]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.hadoop.dfs.FSNamesystem.containingNodeList(FSNamesystem.java:2766)
>         at org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2870)
>         - locked <0x0000002aa3cef720> (a org.apache.hadoop.dfs.UnderReplicatedBlocks)
>         - locked <0x0000002aa3c42e28> (a org.apache.hadoop.dfs.FSNamesystem)
>         at org.apache.hadoop.dfs.FSNamesystem.computeDatanodeWork(FSNamesystem.java:1928)
>         at org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:1868)
>         at java.lang.Thread.run(Thread.java:619)
> We confirmed that the Namenode was not in a full-GC state when this problem
> happened.
> Also, dfsadmin -metasave was showing that "Blocks waiting for replication" was
> decreasing very slowly.
> I believe this is not specific to decommission and the same problem would happen
> if we lose one rack.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.