[jira] [Commented] (HDFS-10301) Blocks removed by thousands due to falsely detected zombie storages
[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252612#comment-15252612 ] Colin Patrick McCabe commented on HDFS-10301:

I have posted a new patch as HDFS-10301.002.patch. The idea here is that we know the number of storage reports we expect to see in the block report. We should not remove any storages as zombies unless we have seen that number of storages and marked each of them with the ID of the latest block report. I feel that this approach is better than the one used in the 001 patch, since it correctly handles the "interleaved" case. It is very difficult to prove that we can never get interleaved storage reports for a DataNode, because of issues like queuing inside the RPC system, packets getting reordered or delayed by the network, and queuing inside the deferred work mechanism added by HDFS-9198. So we should handle this case correctly. (A toy sketch of this rule follows the quoted issue summary below.)

> Blocks removed by thousands due to falsely detected zombie storages
> --------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Priority: Critical
>         Attachments: HDFS-10301.002.patch, HDFS-10301.01.patch, zombieStorageLogs.rtf
>
> When the NameNode is busy, a DataNode can time out sending a block report. Then it sends the block report again. The NameNode, while processing these two reports at the same time, can interleave processing of storages from the different reports. This screws up the blockReportId field, which makes the NameNode think that some storages are zombies. Replicas from zombie storages are immediately removed, causing missing blocks.
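A minimal, self-contained sketch of that rule, purely for illustration; the class, field, and method names here are hypothetical and are not taken from the actual patch:

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: a storage may be declared "zombie" only after we have seen as
// many storage reports as the block report said it contains, each tagged with the
// latest report id.  Until then, prune nothing.
class ZombiePruneSketch {
  static class Storage {
    final String id;
    long lastReportId;
    Storage(String id) { this.id = id; }
  }

  static List<Storage> findZombies(List<Storage> storages,
                                   long latestReportId,
                                   int expectedStorageCount) {
    long seen = storages.stream()
        .filter(s -> s.lastReportId == latestReportId)
        .count();
    if (seen < expectedStorageCount) {
      return Collections.emptyList();      // report not fully processed yet: remove nothing
    }
    List<Storage> zombies = new ArrayList<>();
    for (Storage s : storages) {
      if (s.lastReportId != latestReportId) {
        zombies.add(s);                     // never re-reported by the DN in the newest report
      }
    }
    return zombies;
  }
}
{code}

The point of the guard is that an interleaved, older report can no longer demote a storage: either the newest report has fully landed (and every live storage carries its id), or nothing is removed at all.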
[jira] [Commented] (HDFS-10301) Blocks removed by thousands due to falsely detected zombie storages
[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252475#comment-15252475 ] Colin Patrick McCabe commented on HDFS-10301:

Hmm. This is a challenging one. [~walter.k.su], I think I agree that the queue added in HDFS-9198 might be part of the problem here. In CDH, we haven't yet backported the deferred queuing implemented in HDFS-9198, which might explain why we never saw this. Since we don't have a queue, and since NN RPCs are almost always handled in the order they arrive, CDH5 shouldn't see this "reordering" of resent storage reports.

Independently of this bug, I do think it's concerning that the DN keeps piling on retransmissions of FBRs even before the old ones have been processed and acknowledged. This kind of behavior can obviously lead to congestion collapse if congestion is what caused the original FBRs to go unprocessed or unacknowledged in the first place.

{code}
void enqueue(List<Runnable> actions) throws InterruptedException {
  synchronized (queue) {
    for (Runnable action : actions) {
      if (!queue.offer(action)) {
        if (!isAlive() && namesystem.isRunning()) {
          ExitUtil.terminate(1, getName() + " is not running");
        }
        long now = Time.monotonicNow();
        if (now - lastFull > 4000) {
          lastFull = now;
          LOG.info("Block report queue is full");
        }
        queue.put(action);
      }
    }
  }
}
{code}

This is going to be problematic when contention gets high, because threads will spend a long time waiting to enter the {{synchronized (queue)}} section, and that wait is not logged or reflected back to the admin in any way. Unfortunately, the operation that you want here, the ability to atomically add a bunch of items to the queue, simply is not provided by {{BlockingQueue}}. The solution also seems somewhat brittle, since reordering could still happen because of network issues in a multi-RPC block report. (A toy demonstration of the monitor contention follows below.)

Thinking about this a little more, it seems like the root of the problem is that in the single-RPC case we're throwing away the information about how many storages were in the original report. We need to find a way to include that information in there...
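For what it's worth, here is a small, self-contained toy (not NameNode code, and not the patch) that demonstrates the contention concern: once the queue is full, {{put()}} blocks while the object monitor is still held, so every other producer silently stalls on {{synchronized (queue)}}:

{code}
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueContentionSketch {
  static final BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(2);

  // Same shape as the patch's enqueue: the batch is added under the queue's monitor.
  static void enqueueAll(List<Runnable> actions) throws InterruptedException {
    synchronized (queue) {
      for (Runnable action : actions) {
        if (!queue.offer(action)) {
          queue.put(action);      // blocks while still holding the monitor
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Runnable noop = () -> { };
    enqueueAll(Arrays.asList(noop, noop));          // fills the 2-slot queue
    new Thread(() -> {
      try { enqueueAll(Arrays.asList(noop)); }      // producer: blocks inside put()
      catch (InterruptedException ignored) { }
    }).start();
    Thread.sleep(200);                              // let the producer grab the monitor
    long start = System.nanoTime();
    new Thread(() -> {
      try { Thread.sleep(1000); queue.take(); }     // consumer frees a slot one second later
      catch (InterruptedException ignored) { }
    }).start();
    synchronized (queue) {                          // any other enqueuer waits here, unlogged
      System.out.printf("waited %d ms for the queue monitor%n",
          (System.nanoTime() - start) / 1_000_000);
    }
  }
}
{code}

On a typical run the last thread reports roughly a one-second wait, which is exactly the kind of invisible stall described above; in a busy NameNode the consumer could be delayed far longer.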
[jira] [Commented] (HDFS-10301) Blocks removed by thousands due to falsely detected zombie storages
[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15252377#comment-15252377 ] Konstantin Shvachko commented on HDFS-10301:

Hey Walter, your patch looks good by itself, but it does not address the bug in the zombie storage recognition. It took me some time to review your patch; it would have been easier if you had explained your approach. So your patch reorders block reports for different storages in such a way that storages from the same report are placed as a contiguous segment in the block report queue, so that processing of different BRs is not interleaved. This addresses Daryn's comment rather than solving the reported bug, as, BTW, Daryn correctly stated. If you want to go forward with reordering of BRs you should probably do it in another issue. I personally am not a supporter, because
# it introduces an unnecessary restriction on the order of execution of block reports, and
# it adds even more complexity to the BR processing logic.

I see the main problem here as follows: block reports used to be idempotent per storage, but HDFS-7960 made processing of a subsequent storage dependent on the state produced while processing the previous ones. I think idempotence is good, and we should keep it. I think we can mitigate the problem by one of the following:
# Changing the criteria of zombie storage recognition. Why should it depend on block report IDs?
# Eliminating the notion of zombie storage altogether. E.g., the NN can ask the DN to run {{DirectoryScanner}} if the NN thinks the DN's state is outdated.
# Moving {{curBlockReportId}} from {{DatanodeDescriptor}} to {{StorageInfo}}, which would eliminate the global state shared between storages. (A toy sketch of this option follows below.)

Also, if we cannot come up with a quick solution, then we should probably roll back HDFS-7960 for now and revisit it later, because this is a critical bug affecting all of our latest releases. And that is a lot of clusters and PBs out there.
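For illustration only, a toy sketch of option 3 above, with the report id tracked per storage instead of per DataNode; the class and field names here are hypothetical, not the actual HDFS types:

{code}
// Illustrative only: if each storage carries its own "current report id", processing a
// storage from a retransmitted report cannot disturb the bookkeeping of a storage from
// the earlier report that has not been reached yet; there is no shared per-DN field to
// clobber, and per-storage processing stays idempotent.
class PerStorageReportIdSketch {
  static class StorageInfo {
    final String storageId;
    long curBlockReportId;                   // option 3: state lives here, not on the DN descriptor
    StorageInfo(String storageId) { this.storageId = storageId; }
  }

  // Processing one storage of one report touches only that storage's own state.
  static void processStorageReport(StorageInfo storage, long blockReportId) {
    storage.curBlockReportId = blockReportId;
    // ... apply this storage's block list under the namesystem lock ...
  }
}
{code}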
[jira] [Commented] (HDFS-10301) Blocks removed by thousands due to falsely detected zombie storages
[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15251275#comment-15251275 ] Colin Patrick McCabe commented on HDFS-10301:

Thanks for the bug report. This is a tricky one. One small correction: HDFS-7960 was not introduced as part of DataNode hotswap. It was originally introduced to solve issues caused by HDFS-7575, although it fixed issues with hotswap as well.

It seems like we should be able to remove existing DataNode storage report RPCs with the old ID from the queue when we receive one with a new block report ID. (A toy sketch of this follows below.) This would also avoid a possible congestion collapse scenario caused by repeated retransmissions after the timeout.
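A minimal sketch of that idea, purely illustrative and not tied to the actual queue implementation; the types and field names are hypothetical:

{code}
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: when a new full block report arrives from a DataNode, evict any
// still-queued (not yet processed) reports from the same node that carry an older id,
// so the NameNode never spends time on obsolete retransmissions.
class StaleReportEvictionSketch {
  static class QueuedReport {
    final String datanodeUuid;
    final long reportId;
    QueuedReport(String datanodeUuid, long reportId) {
      this.datanodeUuid = datanodeUuid;
      this.reportId = reportId;
    }
  }

  static void admit(Deque<QueuedReport> queue, QueuedReport incoming) {
    queue.removeIf(r -> r.datanodeUuid.equals(incoming.datanodeUuid)
                        && r.reportId != incoming.reportId);   // drop the stale ones
    queue.addLast(incoming);
  }

  public static void main(String[] args) {
    Deque<QueuedReport> queue = new ArrayDeque<>();
    admit(queue, new QueuedReport("dn-1", 1L));
    admit(queue, new QueuedReport("dn-1", 2L));   // retransmission with a new id
    System.out.println(queue.size());             // 1: the old report was evicted
  }
}
{code}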
[jira] [Commented] (HDFS-10301) Blocks removed by thousands due to falsely detected zombie storages
[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247390#comment-15247390 ] Hadoop QA commented on HDFS-10301:

| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 11s {color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 3 new or modified test files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 29s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 38s {color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 41s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 23s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 50s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 13s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 51s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 5s {color} | {color:green} trunk passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 43s {color} | {color:green} trunk passed with JDK v1.7.0_95 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 46s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 38s {color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 6m 15s {color} | {color:red} hadoop-hdfs-project_hadoop-hdfs-jdk1.8.0_77 with JDK v1.8.0_77 generated 1 new + 32 unchanged - 1 fixed = 33 total (was 33) {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 38s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 39s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 6m 54s {color} | {color:red} hadoop-hdfs-project_hadoop-hdfs-jdk1.7.0_95 with JDK v1.7.0_95 generated 1 new + 34 unchanged - 1 fixed = 35 total (was 35) {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 39s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 22s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 49s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 11s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 2m 15s {color} | {color:red} hadoop-hdfs-project/hadoop-hdfs generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 4s {color} | {color:green} the patch passed with JDK v1.8.0_77 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 46s {color} | {color:green} the patch passed with JDK v1.7.0_95 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 55m 36s {color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.8.0_77. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 52m 52s {color} | {color:red} hadoop-hdfs in the patch failed with JDK v1.7.0_95. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 21s {color} | {color:green} Patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 133m 26s {color} | {color:black} {color} |

|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs-project/hadoop-hdfs |
| | Synchronization performed on java.util.concurrent.ArrayBlockingQueue in
[jira] [Commented] (HDFS-10301) Blocks removed by thousands due to falsely detected zombie storages
[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15247129#comment-15247129 ] Walter Su commented on HDFS-10301:

Oh, I see. In this case the reports are not split into multiple RPCs. And because the for-loop is outside the lock, the two for-loops can interleave. (A toy illustration of the interleaving follows below.)
{code}
for (int r = 0; r < reports.length; r++) {
{code}
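To make the interleaving concrete, here is a small, self-contained toy (not the actual NameNode code) in which two handler threads each walk a six-storage report under a lock that is taken per storage rather than per report; the shared per-DataNode field can be overwritten between iterations:

{code}
public class InterleavingSketch {
  static final Object fsnLock = new Object();
  static volatile long lastSeenReportId;           // per-DN state shared by both threads

  static void processReport(long reportId, int numStorages) {
    for (int r = 0; r < numStorages; r++) {        // the for-loop is OUTSIDE the lock
      synchronized (fsnLock) {                     // lock is held per storage, not per report
        lastSeenReportId = reportId;
        System.out.printf("report %d, storage %d, lastSeenReportId=%d%n",
            reportId, r, lastSeenReportId);
      }
      Thread.yield();                              // the other handler can jump in here
    }
  }

  public static void main(String[] args) throws InterruptedException {
    Thread br1 = new Thread(() -> processReport(1, 6));
    Thread br2 = new Thread(() -> processReport(2, 6));
    br1.start();
    br2.start();
    br1.join();
    br2.join();   // on most runs the printed storages of reports 1 and 2 interleave
  }
}
{code}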
[jira] [Commented] (HDFS-10301) Blocks removed by thousands due to falsely detected zombie storages
[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246996#comment-15246996 ] Walter Su commented on HDFS-10301:

1. The IPC reader is single-threaded by default. If it is multi-threaded, the order in which RPC requests are put into {{callQueue}} is unspecified.
2. The IPC {{callQueue}} is FIFO.
3. IPC handlers are multi-threaded. If two handlers are both waiting on the FSN lock, the entry order depends on the fairness of the lock.
bq. When constructed as fair, threads contend for entry using an *approximately* arrival-order policy. When the currently held lock is released either the longest-waiting single writer thread will be assigned the write lock...
(quoted from https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReentrantReadWriteLock.html)

I think that if the DN doesn't get an ack from the NN, it shouldn't assume anything about the arrival/processing order (especially when it re-establishes a connection). Well, I'm still curious how the interleaving happened. Any thoughts?
[jira] [Commented] (HDFS-10301) Blocks removed by thousands due to falsely detected zombie storages
[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15246571#comment-15246571 ] Konstantin Shvachko commented on HDFS-10301:

Hey Daryn, I'm not sure how HDFS-9198 eliminates this from occurring. DataNodes still wait for the NN to process each BR, so they can time out and send the same block report multiple times. On the NN side, BR processing is multi-threaded, so it can still interleave processing of storages from different reports. Could you please clarify what I am missing?
[jira] [Commented] (HDFS-10301) Blocks removed by thousands due to falsely detected zombie storages
[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15245867#comment-15245867 ] Daryn Sharp commented on HDFS-10301:

Enabling HDFS-9198 will process BRs in FIFO order. It doesn't solve this implementation bug, but it virtually eliminates it from occurring.
[jira] [Commented] (HDFS-10301) Blocks removed by thousands due to falsely detected zombie storages
[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244890#comment-15244890 ] Konstantin Shvachko commented on HDFS-10301:

My DN has the following six storages:
{code}
DS-019298c0-aab9-45b4-8b62-95d6809380ff:NORMAL:kkk.sss.22.105
DS-0ea95238-d9ba-4f62-ae18-fdb9333465ce:NORMAL:kkk.sss.22.105
DS-191fc04b-90be-42c9-b6fb-fdd1517bf4c7:NORMAL:kkk.sss.22.105
DS-4a2e91c7-cdf0-408b-83a6-286c3534d673:NORMAL:kkk.sss.22.105
DS-5b2941f7-2b52-45a8-b135-dcbe488cc65b:NORMAL:kkk.sss.22.105
DS-6849f605-fd83-462d-97c3-cb6949383f7e:NORMAL:kkk.sss.22.105
{code}
Here are the logs for its block reports. All throw the same exception, but I pasted it only once.
{code}
2016-04-12 22:31:58,931 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x283d25423fb64d, containing 6 storage report(s), of which we sent 0. The reports had 81565 total blocks and used 0 RPC(s). This took 19 msec to generate and 60078 msecs for RPC and NN processing. Got back no commands.
2016-04-12 22:31:58,931 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.net.SocketTimeoutException: Call From dn-hcl1264.my.cluster.com/kkk.sss.22.105 to namenode-ha1.my.cluster.com:9000 failed on socket timeout exception: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/kkk.sss.22.105:10101 remote=namenode-ha1.my.cluster.com/10.150.1.56:9000]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:750)
    at org.apache.hadoop.ipc.Client.call(Client.java:1473)
    at org.apache.hadoop.ipc.Client.call(Client.java:1400)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy12.blockReport(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:178)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:494)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:732)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:872)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/kkk.sss.22.105:10101 remote=namenode-ha1.my.cluster.com/10.150.1.56:9000]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)
2016-04-12 22:32:59,179 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x283d334a100bde, containing 6 storage report(s), of which we sent 0. The reports had 81565 total blocks and used 0 RPC(s). This took 17 msec to generate and 60066 msecs for RPC and NN processing. Got back no commands.
2016-04-12 22:33:59,311 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x283d414ae386b2, containing 6 storage report(s), of which we sent 0. The reports had 81565 total blocks and used 0 RPC(s). This took 16 msec to generate and 60055 msecs for RPC and NN processing. Got back no commands.
2016-04-12 22:34:59,409 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x283d4f4a605732, containing 6 storage report(s), of which we sent 0. The reports had 81565 total blocks and used 0
[jira] [Commented] (HDFS-10301) Blocks removed by thousands due to falsely detected zombie storages
[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244876#comment-15244876 ] Konstantin Shvachko commented on HDFS-10301:

More details:
# My DataNode has 6 storages. It sends a block report and times out, then it sends the same block report five more times with different blockReportIds.
# The NameNode starts executing all six reports around the same time and interleaves them, that is, it processes the first storage of BR2 before it processes the last storage of BR1. (Color-coded logs are coming.)
# While processing storages from BR2, the NameNode changes the lastBlockReportId field to the id of BR2. This messes up the processing of the BR1 storages that have not been processed yet: they are considered zombies, and all replicas are removed from those storages along with the storages themselves. (A toy reconstruction of steps 2-3 follows below.)
# Each such storage is then reconstructed by the NameNode when it receives a heartbeat from the DataNode, but it is marked as "stale", and its replicas are not reconstructed until the next block report, which in my case is a few hours later.
# I noticed missing blocks because several DataNodes exhibited the same behavior and all replicas of the same block were lost.
# The replicas eventually reappeared (several hours later), because the DataNodes do not physically remove the replicas and report them in the next block report.

The behavior was introduced by HDFS-7960 as part of the hot-swap feature. I did not do a hot-swap, and did not fail over the NameNode.
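For readers following along, here is a small, self-contained toy reconstruction of steps 2-3 (not HDFS code; the class and field names are hypothetical): because the report id is kept per DataNode, as soon as any storage of the retransmitted report BR2 is processed, the not-yet-processed storages of BR1 stop matching and look like zombies.

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ZombieMisdetectionSketch {
  static class Storage {
    final String id;
    long lastBlockReportId;
    Storage(String id) { this.id = id; }
  }

  static long curBlockReportId;                       // per-DataNode, shared by all storages

  static void processOneStorage(Storage s, long reportId) {
    curBlockReportId = reportId;                      // BR2 clobbers the id recorded for BR1
    s.lastBlockReportId = reportId;
  }

  static List<String> zombies(List<Storage> storages) {
    List<String> z = new ArrayList<>();
    for (Storage s : storages) {
      if (s.lastBlockReportId != curBlockReportId) {  // BR1-only storages fail this test
        z.add(s.id);
      }
    }
    return z;
  }

  public static void main(String[] args) {
    List<Storage> dnStorages = Arrays.asList(new Storage("s1"), new Storage("s2"));
    processOneStorage(dnStorages.get(0), 1L);   // BR1 reaches storage s1 ...
    processOneStorage(dnStorages.get(0), 2L);   // ... then BR2 interleaves and re-reports s1
    System.out.println(zombies(dnStorages));    // prints [s2]: falsely detected as a zombie
  }
}
{code}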