[
https://issues.apache.org/jira/browse/HADOOP-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Konstantin Shvachko updated HADOOP-2606:
----------------------------------------
Attachment: ReplicatorNew.patch
This patch implements the approach mentioned above.
Namely, the replication monitor scans the list of under-replicated blocks and
schedules them for replication to and from appropriate data-nodes. This is in
contrast to the current approach, in which we choose a node and then scan the list
to select a small number of blocks that the chosen node can replicate.
The new algorithm tries to schedule more replications on nodes with an ongoing
decommission. It also does not schedule any replications on nodes that are
already decommissioned; this check was not present in the previous
algorithm.
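As a rough sketch of the block-centric selection rule described above (all class and method names here are illustrative, not the actual identifiers in ReplicatorNew.patch):

```java
import java.util.List;

// Hypothetical sketch: for one under-replicated block, choose a source
// replica. Decommissioned nodes are never chosen, nodes with an ongoing
// decommission are preferred (to drain them faster), and the per-node
// replication-stream limit is respected.
class ReplicationSketch {
    enum State { NORMAL, DECOMMISSION_IN_PROGRESS, DECOMMISSIONED }

    static class DataNode {
        final String name;
        final State state;
        int activeStreams;  // replications currently in flight on this node
        DataNode(String name, State state) { this.name = name; this.state = state; }
    }

    // Mirrors the default of the dfs.max-repl-streams parameter.
    static final int MAX_REPL_STREAMS = 2;

    static DataNode pickSource(List<DataNode> replicas) {
        DataNode candidate = null;
        for (DataNode d : replicas) {
            if (d.state == State.DECOMMISSIONED) continue;         // never a source
            if (d.activeStreams >= MAX_REPL_STREAMS) continue;     // stream limit
            if (d.state == State.DECOMMISSION_IN_PROGRESS) return d; // preferred
            if (candidate == null) candidate = d;
        }
        return candidate;  // null if no eligible replica exists
    }
}
```

Because the monitor iterates over blocks rather than over nodes, each block is handled once per pass instead of being rescanned for every candidate node.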
The patch also presents a benchmark and a test.
The benchmark directly calls the replication scheduler until all blocks are
replicated and measures how many blocks per second on average it can schedule.
The test runs the benchmark with default parameters.
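The shape of such a benchmark can be sketched as follows (the scheduler here is a stub; names and structure are assumptions, not the patch's actual benchmark):

```java
// Illustrative micro-benchmark: repeatedly invoke a (stub) scheduler pass
// until every block has been scheduled, then report average blocks per
// second. Nothing here is taken from ReplicatorNew.patch.
class SchedulerBenchmark {
    // Stub standing in for one pass of the replication scheduler; returns
    // how many blocks were scheduled in this pass.
    static int scheduleOnePass(int remaining, int batchSize) {
        return Math.min(remaining, batchSize);
    }

    // Drives the scheduler until all blocks are scheduled and returns the
    // average scheduling throughput in blocks per second.
    static double run(int totalBlocks, int batchSize) {
        long start = System.nanoTime();
        int remaining = totalBlocks;
        while (remaining > 0) {
            remaining -= scheduleOnePass(remaining, batchSize);
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        return totalBlocks / seconds;
    }
}
```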
I ran the test for the old version and for the new one.
On my machine the new replicator processes about 9700 blocks per second while
the old one does only 640, making the new one about *15 times faster*.
This of course does not mean that blocks will be replicated 15 times faster on
a real cluster. It just means that the replication monitor will consume much less
CPU and will let other name-node operations run faster.
For those who want to accelerate replication: you need to adjust an
undocumented configuration parameter, "dfs.max-repl-streams", which defines
the maximum number of replications a data-node is allowed to handle at one time.
The default is 2.
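For example, raising the limit would look something like this in the cluster configuration (the property name is as given above; the value 4 and the file placement are illustrative, assuming the usual site-configuration override mechanism):

```xml
<!-- In the site configuration file, e.g. hadoop-site.xml -->
<property>
  <name>dfs.max-repl-streams</name>
  <!-- Max concurrent replications per data-node; default is 2 -->
  <value>4</value>
</property>
```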
TestReplication is expected to fail with the new algorithm. The problem is that
data-nodes do not report to the name-node crc exceptions encountered during
replication. Previously another data-node (if one existed) would be chosen as
the source for the block, and the replication would eventually succeed. But now the
same source node is deterministically chosen every time. I think data-nodes
should report crc exceptions the same way clients do. I'll file a bug for
discussion.
> Namenode unstable when replicating 500k blocks at once
> ------------------------------------------------------
>
> Key: HADOOP-2606
> URL: https://issues.apache.org/jira/browse/HADOOP-2606
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.14.3
> Reporter: Koji Noguchi
> Assignee: Konstantin Shvachko
> Fix For: 0.17.0
>
> Attachments: ReplicatorNew.patch, ReplicatorTestOld.patch
>
>
> We tried to decommission about 40 nodes at once, each containing 12k blocks.
> (about 500k total)
> (This also happened when we first tried to decommission 2 million blocks)
> Clients started experiencing "java.lang.RuntimeException:
> java.net.SocketTimeoutException: timed out waiting for rpc
> response" and the namenode was at 100% CPU.
> It was spending most of its time on one thread:
> "[EMAIL PROTECTED]" daemon prio=10 tid=0x0000002e10702800 nid=0x6718 runnable [0x0000000041a42000..0x0000000041a42a30]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.hadoop.dfs.FSNamesystem.containingNodeList(FSNamesystem.java:2766)
>         at org.apache.hadoop.dfs.FSNamesystem.pendingTransfers(FSNamesystem.java:2870)
>         - locked <0x0000002aa3cef720> (a org.apache.hadoop.dfs.UnderReplicatedBlocks)
>         - locked <0x0000002aa3c42e28> (a org.apache.hadoop.dfs.FSNamesystem)
>         at org.apache.hadoop.dfs.FSNamesystem.computeDatanodeWork(FSNamesystem.java:1928)
>         at org.apache.hadoop.dfs.FSNamesystem$ReplicationMonitor.run(FSNamesystem.java:1868)
>         at java.lang.Thread.run(Thread.java:619)
> We confirmed that the Namenode was not in a full-GC state when this problem
> happened.
> Also, dfsadmin -metasave was showing that "Blocks waiting for replication" was
> decreasing very slowly.
> I believe this is not specific to decommission and the same problem would happen
> if we lose one rack.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.