[jira] [Updated] (HDFS-7134) Replication count for a block should not update till the blocks have settled on Datanodes

2017-05-22 Thread gurmukh singh (JIRA)

[ https://issues.apache.org/jira/browse/HDFS-7134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gurmukh singh updated HDFS-7134:

Component/s: datanode

> Replication count for a block should not update till the blocks have settled
> on Datanodes
> -----------------------------------------------------------------------------
>
> Key: HDFS-7134
> URL: https://issues.apache.org/jira/browse/HDFS-7134
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 1.2.1, 2.6.0, 2.7.3
> Environment: Linux nn1.cluster1.com 2.6.32-431.20.3.el6.x86_64 #1 SMP 
> Thu Jun 19 21:14:45 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> [hadoop@nn1 conf]$ cat /etc/redhat-release
> CentOS release 6.5 (Final)
>Reporter: gurmukh singh
>Priority: Critical
>  Labels: HDFS
>
> The replica count reported for a block should not change until the replicas
> have actually settled on the datanodes.
> Test Case:
> Hadoop Cluster with 1 namenode and 3 datanodes.
> nn1.cluster1.com(192.168.1.70)
> dn1.cluster1.com(192.168.1.72)
> dn2.cluster1.com(192.168.1.73)
> dn3.cluster1.com(192.168.1.74)
> Cluster is up and running fine, with replication set to "1" via the
> "dfs.replication" parameter on all nodes:
> <property>
>   <name>dfs.replication</name>
>   <value>1</value>
> </property>
> To reduce the wait time, I reduced the heartbeat and recheck parameters.
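> For reference, a minimal sketch of that change in hdfs-site.xml (property
> names as in Hadoop 1.x; the values below are illustrative rather than the
> exact ones used in this cluster):
> <property>
>   <name>dfs.heartbeat.interval</name>
>   <value>1</value>    <!-- seconds; default is 3 -->
> </property>
> <property>
>   <name>heartbeat.recheck.interval</name>
>   <value>15000</value>    <!-- milliseconds; default is 300000 -->
> </property>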
> On datanode2 (192.168.1.73):
> [hadoop@dn2 ~]$ hadoop fs -Ddfs.replication=2 -put from_dn2 /
> [hadoop@dn2 ~]$ hadoop fs -ls /from_dn2
> Found 1 items
> -rw-r--r--   2 hadoop supergroup 17 2014-09-23 13:33 /from_dn2
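> The -D generic option overrides dfs.replication for this single command, so
> the file is created with replication 2. This can be confirmed per file
> (assuming the -stat %r format option is available in the release used):
> [hadoop@dn2 ~]$ hadoop fs -stat %r /from_dn2
> 2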
> On Namenode
> ===
> As expected, since the copy was done from datanode2, one replica is placed locally.
> [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations
> FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 
> 13:53:16 IST 2014
> /from_dn2 17 bytes, 1 block(s):  OK
> 0. blk_8132629811771280764_1175 len=17 repl=2 [192.168.1.74:50010, 
> 192.168.1.73:50010]
> The blocks can be seen on the datanodes' disks as well, under the "current"
> directory.
> Now, shut down datanode2 (192.168.1.73); as expected, the block is
> replicated to another datanode to maintain a replication factor of 2.
> [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations
> FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 
> 13:54:21 IST 2014
> /from_dn2 17 bytes, 1 block(s):  OK
> 0. blk_8132629811771280764_1175 len=17 repl=2 [192.168.1.74:50010, 
> 192.168.1.72:50010]
> But now, if I bring datanode2 back, the namenode sees that this block is in
> 3 places and fires an invalidate command for datanode1 (192.168.1.72), yet
> the replica count on the namenode is bumped to 3 immediately.
> [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations
> FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 
> 13:56:12 IST 2014
> /from_dn2 17 bytes, 1 block(s):  OK
> 0. blk_8132629811771280764_1175 len=17 repl=3 [192.168.1.74:50010, 
> 192.168.1.72:50010, 192.168.1.73:50010]
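> A throwaway shell loop makes it easy to watch whether the reported count
> ever settles back to 2:
> [hadoop@nn1 conf]$ while true; do hadoop fsck /from_dn2 -files -blocks -locations | grep 'repl='; sleep 30; done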
> On datanode1, the invalidate command is acted on immediately and the block
> is deleted:
> =============
> 2014-09-23 13:54:17,483 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Receiving blk_8132629811771280764_1175 src: /192.168.1.74:38099 dest: 
> /192.168.1.72:50010
> 2014-09-23 13:54:17,502 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Received blk_8132629811771280764_1175 src: /192.168.1.74:38099 dest: 
> /192.168.1.72:50010 size 17
> 2014-09-23 13:55:28,720 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Scheduling blk_8132629811771280764_1175 file 
> /space/disk1/current/blk_8132629811771280764 for deletion
> 2014-09-23 13:55:28,721 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Deleted blk_8132629811771280764_1175 at file 
> /space/disk1/current/blk_8132629811771280764
> The namenode still shows 3 replicas even though one has been deleted, even
> after more than 30 minutes.
> [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations
> FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 
> 14:21:27 IST 2014
> /from_dn2 17 bytes, 1 block(s):  OK
> 0. blk_8132629811771280764_1175 len=17 repl=3 [192.168.1.74:50010, 
> 192.168.1.72:50010, 192.168.1.73:50010]
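> The namenode's view can at least be inspected with metasave, which dumps
> block state (including blocks pending deletion) to a file under the
> namenode's log directory; on releases from 2.7 onwards a fresh block report
> can also be forced (the port below assumes the default datanode IPC port):
> [hadoop@nn1 conf]$ hadoop dfsadmin -metasave block-state.txt
> [hadoop@nn1 conf]$ hdfs dfsadmin -triggerBlockReport dn1.cluster1.com:50020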
> This could be dangerous: if the remaining replicas are removed or the other
> 2 datanodes fail, data could be lost while the namenode still reports 3
> replicas.
> On Datanode 1
> =============
> Before datanode1 is brought back:
> [hadoop@dn1 conf]$ ls -l /space/disk*/current
> /space/disk1/current:
> total 28
> -rw-rw-r-- 1 hadoop hadoop   13 Sep 21 09:09 blk_2278001646987517832
> -rw-rw-r-- 1 hadoop hadoop   11 Sep 21 09:09 blk_2278001646987517832_1171.meta
> -rw-rw-r-- 1 hadoop hadoop   17 Sep 23 13:54 blk_8132629811771280764
> -rw-rw-r-- 1 hadoop hadoop   11 Sep 23 13:54 blk_8132629811771280764_1175.meta
> -rw-rw-r-- 1 hadoop hadoop 5299 Sep 21 10:04 dncp_block_verification.log.curr
> -rw-rw-r-- 1 hadoop hadoop  157 Sep 23 13:51 VERSION
> After starting the datanode daemon:
> [hadoop@dn1 conf]$ ls -l

[jira] [Updated] (HDFS-7134) Replication count for a block should not update till the blocks have settled on Datanodes

2017-05-22 Thread gurmukh singh (JIRA)


gurmukh singh updated HDFS-7134:

Labels: HDFS  (was: )


[jira] [Updated] (HDFS-7134) Replication count for a block should not update till the blocks have settled on Datanodes

2017-05-22 Thread gurmukh singh (JIRA)


gurmukh singh updated HDFS-7134:

Component/s: hdfs


[jira] [Updated] (HDFS-7134) Replication count for a block should not update till the blocks have settled on Datanodes

2017-05-22 Thread gurmukh singh (JIRA)


gurmukh singh updated HDFS-7134:

Affects Version/s: 2.7.3


[jira] [Updated] (HDFS-7134) Replication count for a block should not update till the blocks have settled on Datanodes

2016-05-30 Thread gurmukh singh (JIRA)


gurmukh singh updated HDFS-7134:

Priority: Critical  (was: Major)


[jira] [Updated] (HDFS-7134) Replication count for a block should not update till the blocks have settled on Datanodes

2015-06-11 Thread gurmukh singh (JIRA)


gurmukh singh updated HDFS-7134:

Affects Version/s: 2.6.0
