[jira] [Commented] (HDFS-15067) Optimize heartbeat for large cluster

2022-07-26 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571727#comment-17571727
 ] 

Surendra Singh Lilhore commented on HDFS-15067:
---

Thanks, [~prasad-acit]. We can merge this; let's ask other people to review it.

> Optimize heartbeat for large cluster
> 
>
> Key: HDFS-15067
> URL: https://issues.apache.org/jira/browse/HDFS-15067
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15067.01.patch, HDFS-15067.02.patch, 
> HDFS-15067.03.patch, image-2020-01-09-18-00-49-556.png
>
>
> In a large cluster the Namenode spends significant time processing heartbeats. 
> For example, in a 10K-node cluster the Namenode processes 10K heartbeat RPCs 
> every 3 seconds. This impacts client response time. The heartbeat can be 
> optimized: a DN can start skipping every other heartbeat if no 
> work (write/replication/delete) has been allocated for a long time, i.e. send a 
> heartbeat every 6 seconds instead of every 3. Once the DN starts getting work 
> from the NN again, it resumes sending heartbeats normally.
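
A minimal sketch of the idea as described (class name, threshold, and intervals 
are assumed for illustration; this is not the attached patch):
{code:java}
// Hypothetical DN-side policy: after a long idle stretch with no work from
// the NN, stretch the heartbeat gap from 3s to 6s; resume 3s on any work.
public class AdaptiveHeartbeatPolicy {
  private static final long BASE_INTERVAL_MS = 3_000;   // dfs.heartbeat.interval
  private static final long IDLE_INTERVAL_MS = 6_000;   // skip every other heartbeat
  private static final long IDLE_THRESHOLD_MS = 60_000; // "no work for a long time"

  private volatile long lastWorkReceivedMs = System.currentTimeMillis();

  /** Call when the NN returns any command (write/replication/delete). */
  public void onWorkReceived() {
    lastWorkReceivedMs = System.currentTimeMillis();
  }

  /** Gap to sleep before sending the next heartbeat. */
  public long nextHeartbeatIntervalMs() {
    long idleFor = System.currentTimeMillis() - lastWorkReceivedMs;
    return idleFor >= IDLE_THRESHOLD_MS ? IDLE_INTERVAL_MS : BASE_INTERVAL_MS;
  }
}
{code}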






[jira] [Commented] (HDFS-16456) EC: Decommission a rack with only one dn will fail when the rack number is equal to the replication factor

2022-02-20 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17495321#comment-17495321
 ] 

Surendra Singh Lilhore commented on HDFS-16456:
---

Thanks [~caozhiqiang], I appreciate your effort.

I am not in favor of changing the network topology for this issue. Instead, we 
can try to find a target on another rack after getting 
NotEnoughReplicasException in the logic below.
{code:java}
    if (totalReplicaExpected < numOfRacks ||
        totalReplicaExpected % numOfRacks == 0) {
      writer = chooseOnce(numOfReplicas, writer, excludedNodes, blocksize,
          maxNodesPerRack, results, avoidStaleNodes, storageTypes);
      return writer;
    } {code}
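
Roughly, the suggestion is a fallback like the sketch below (illustrative only; 
the real chooseTargetInOrder() carries more state, and the 
chooseEvenlyFromRemainingRacks() parameter list here is assumed):
{code:java}
try {
  // Rack-constrained attempt, limited by maxNodesPerRack.
  writer = chooseOnce(numOfReplicas, writer, excludedNodes, blocksize,
      maxNodesPerRack, results, avoidStaleNodes, storageTypes);
  return writer;
} catch (NotEnoughReplicasException e) {
  // Instead of changing the topology, retry on the remaining racks.
  chooseEvenlyFromRemainingRacks(writer, excludedNodes, blocksize,
      maxNodesPerRack, results, avoidStaleNodes, storageTypes,
      totalReplicaExpected, e);
  return writer;
}
{code}
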
[~tasanuma], [~weichiu] Please give your opinion.

> EC: Decommission a rack with only one dn will fail when the rack number is 
> equal to the replication factor
> 
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, namenode
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Priority: Critical
> Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch
>
>
> In the scenario below, decommission will fail with reason
> TOO_MANY_NODES_ON_RACK:
>  # Enable an EC policy, such as RS-6-3-1024k.
>  # The number of racks in the cluster is equal to or less than the replication 
> number (9).
>  # A rack has only one DN; decommission this DN.
> The root cause is in the 
> BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function: it 
> computes a limit parameter, maxNodesPerRack, for choosing targets. In this 
> scenario maxNodesPerRack is 1, which means only one datanode can be chosen 
> per rack.
> {code:java}
>   protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>     ...
>     // If more replicas than racks, evenly spread the replicas.
>     // This calculation rounds up.
>     int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
>     return new int[] {numOfReplicas, maxNodesPerRack};
>   } {code}
> Here int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; is 
> evaluated with totalNumOfReplicas=9 and numOfRacks=9, giving 
> (9 - 1) / 9 + 1 = 1.
> When we decommission a DN that is the only node in its rack, chooseOnce() in 
> BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() will throw 
> NotEnoughReplicasException, but the exception is not caught, so it fails to 
> fall back to the chooseEvenlyFromRemainingRacks() function.
> During decommission, after choosing targets, the verifyBlockPlacement() 
> function returns a total rack count that still contains the invalid rack, so 
> BlockPlacementStatusDefault::isPlacementPolicySatisfied() returns false, 
> which also causes the decommission to fail.
> {code:java}
>   public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>       int numberOfReplicas) {
>     if (locs == null)
>       locs = DatanodeDescriptor.EMPTY_ARRAY;
>     if (!clusterMap.hasClusterEverBeenMultiRack()) {
>       // only one rack
>       return new BlockPlacementStatusDefault(1, 1, 1);
>     }
>     // Count locations on different racks.
>     Set<String> racks = new HashSet<>();
>     for (DatanodeInfo dn : locs) {
>       racks.add(dn.getNetworkLocation());
>     }
>     return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>         clusterMap.getNumOfRacks());
>   } {code}
> {code:java}
>   public boolean isPlacementPolicySatisfied() {
>     return requiredRacks <= currentRacks || currentRacks >= totalRacks;
>   }{code}
> According to the above description, we should make the modifications below 
> to fix it:
>  # In startDecommission() or stopDecommission(), we should also update 
> numOfRacks in the NetworkTopology class. Otherwise target choosing may fail 
> because maxNodesPerRack is too small; and even if target choosing succeeds, 
> isPlacementPolicySatisfied will still return false and cause the 
> decommission to fail.
>  # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first 
> chooseOnce() call should also be put in a try..catch, or it will not fall 
> back to chooseEvenlyFromRemainingRacks() when the exception is thrown.
>  # In chooseEvenlyFromRemainingRacks(), the numResultsOflastChoose = 
> results.size(); statement should be moved to after chooseOnce(), or it will 
> throw lastException and make target choosing fail.
>  
>  






[jira] [Comment Edited] (HDFS-16456) EC: Decommission a rack with only one dn will fail when the rack number is equal to the replication factor

2022-02-16 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493237#comment-17493237
 ] 

Surendra Singh Lilhore edited comment on HDFS-16456 at 2/16/22, 1:30 PM:
-

[~caozhiqiang], thanks for the patch.
{noformat}
   hbManager.startDecommission(node);
+  // Update cluster's numOfRacks
+  blockManager.getDatanodeManager().getNetworkTopology().remove(node); 
{noformat}
I don't think this is the right way to remove a node from the topology. After 
starting decommissioning we shouldn't remove the node; it is still part of the 
cluster.


was (Author: surendrasingh):
[~caozhiqiang], thanks for patch.
{noformat}
   hbManager.startDecommission(node);
+  // Update cluster's numOfRacks
+  blockManager.getDatanodeManager().getNetworkTopology().remove(node); 
{noformat}
I don't thing this is right way to remove node from topology. After starting 
decommissioning we shouldn't remove node, it is still part of cluster.

> EC: Decommission a rack with only one dn will fail when the rack number is 
> equal to the replication factor
> 
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, namenode
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Priority: Critical
> Attachments: HDFS-16456.001.patch
>
>
> In the scenario below, decommission will fail with reason
> TOO_MANY_NODES_ON_RACK:
>  # Enable an EC policy, such as RS-6-3-1024k.
>  # The number of racks in the cluster is equal to the replication number (9).
>  # A rack has only one DN; decommission this DN.
> The root cause is in the 
> BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function: it 
> computes a limit parameter, maxNodesPerRack, for choosing targets. In this 
> scenario maxNodesPerRack is 1, which means only one datanode can be chosen 
> per rack.
> {code:java}
>   protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>     ...
>     // If more replicas than racks, evenly spread the replicas.
>     // This calculation rounds up.
>     int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
>     return new int[] {numOfReplicas, maxNodesPerRack};
>   } {code}
> Here int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; is 
> evaluated with totalNumOfReplicas=9 and numOfRacks=9, giving 
> (9 - 1) / 9 + 1 = 1.
> When we decommission a DN that is the only node in its rack, chooseOnce() in 
> BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() will throw 
> NotEnoughReplicasException, but the exception is not caught, so it fails to 
> fall back to the chooseEvenlyFromRemainingRacks() function.
> During decommission, after choosing targets, the verifyBlockPlacement() 
> function returns a total rack count that still contains the invalid rack, so 
> BlockPlacementStatusDefault::isPlacementPolicySatisfied() returns false, 
> which also causes the decommission to fail.
> {code:java}
>   public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>       int numberOfReplicas) {
>     if (locs == null)
>       locs = DatanodeDescriptor.EMPTY_ARRAY;
>     if (!clusterMap.hasClusterEverBeenMultiRack()) {
>       // only one rack
>       return new BlockPlacementStatusDefault(1, 1, 1);
>     }
>     // Count locations on different racks.
>     Set<String> racks = new HashSet<>();
>     for (DatanodeInfo dn : locs) {
>       racks.add(dn.getNetworkLocation());
>     }
>     return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>         clusterMap.getNumOfRacks());
>   } {code}
> {code:java}
>   public boolean isPlacementPolicySatisfied() {
>     return requiredRacks <= currentRacks || currentRacks >= totalRacks;
>   }{code}
> According to the above description, we should make the modifications below 
> to fix it:
>  # In startDecommission() or stopDecommission(), we should also update 
> numOfRacks in the NetworkTopology class. Otherwise target choosing may fail 
> because maxNodesPerRack is too small; and even if target choosing succeeds, 
> isPlacementPolicySatisfied will still return false and cause the 
> decommission to fail.
>  # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first 
> chooseOnce() call should also be put in a try..catch, or it will not fall 
> back to chooseEvenlyFromRemainingRacks() when the exception is thrown.
>  # In chooseEvenlyFromRemainingRacks(), the numResultsOflastChoose = 
> results.size(); statement should be moved to after chooseOnce(), or it will 
> throw lastException and make target choosing fail.
>  
>  




[jira] [Commented] (HDFS-16456) EC: Decommission a rack with only one dn will fail when the rack number is equal to the replication factor

2022-02-16 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493237#comment-17493237
 ] 

Surendra Singh Lilhore commented on HDFS-16456:
---

[~caozhiqiang], thanks for the patch.
{noformat}
   hbManager.startDecommission(node);
+  // Update cluster's numOfRacks
+  blockManager.getDatanodeManager().getNetworkTopology().remove(node); 
{noformat}
I don't think this is the right way to remove a node from the topology. After 
starting decommissioning we shouldn't remove the node; it is still part of the 
cluster.

> EC: Decommission a rack with only one dn will fail when the rack number is 
> equal to the replication factor
> 
>
> Key: HDFS-16456
> URL: https://issues.apache.org/jira/browse/HDFS-16456
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, namenode
>Affects Versions: 3.4.0
>Reporter: caozhiqiang
>Priority: Critical
> Attachments: HDFS-16456.001.patch
>
>
> In the scenario below, decommission will fail with reason
> TOO_MANY_NODES_ON_RACK:
>  # Enable an EC policy, such as RS-6-3-1024k.
>  # The number of racks in the cluster is equal to the replication number (9).
>  # A rack has only one DN; decommission this DN.
> The root cause is in the 
> BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function: it 
> computes a limit parameter, maxNodesPerRack, for choosing targets. In this 
> scenario maxNodesPerRack is 1, which means only one datanode can be chosen 
> per rack.
> {code:java}
>   protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>     ...
>     // If more replicas than racks, evenly spread the replicas.
>     // This calculation rounds up.
>     int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
>     return new int[] {numOfReplicas, maxNodesPerRack};
>   } {code}
> Here int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; is 
> evaluated with totalNumOfReplicas=9 and numOfRacks=9, giving 
> (9 - 1) / 9 + 1 = 1.
> When we decommission a DN that is the only node in its rack, chooseOnce() in 
> BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() will throw 
> NotEnoughReplicasException, but the exception is not caught, so it fails to 
> fall back to the chooseEvenlyFromRemainingRacks() function.
> During decommission, after choosing targets, the verifyBlockPlacement() 
> function returns a total rack count that still contains the invalid rack, so 
> BlockPlacementStatusDefault::isPlacementPolicySatisfied() returns false, 
> which also causes the decommission to fail.
> {code:java}
>   public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs,
>       int numberOfReplicas) {
>     if (locs == null)
>       locs = DatanodeDescriptor.EMPTY_ARRAY;
>     if (!clusterMap.hasClusterEverBeenMultiRack()) {
>       // only one rack
>       return new BlockPlacementStatusDefault(1, 1, 1);
>     }
>     // Count locations on different racks.
>     Set<String> racks = new HashSet<>();
>     for (DatanodeInfo dn : locs) {
>       racks.add(dn.getNetworkLocation());
>     }
>     return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas,
>         clusterMap.getNumOfRacks());
>   } {code}
> {code:java}
>   public boolean isPlacementPolicySatisfied() {
>     return requiredRacks <= currentRacks || currentRacks >= totalRacks;
>   }{code}
> According to the above description, we should make the modifications below 
> to fix it:
>  # In startDecommission() or stopDecommission(), we should also update 
> numOfRacks in the NetworkTopology class. Otherwise target choosing may fail 
> because maxNodesPerRack is too small; and even if target choosing succeeds, 
> isPlacementPolicySatisfied will still return false and cause the 
> decommission to fail.
>  # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first 
> chooseOnce() call should also be put in a try..catch, or it will not fall 
> back to chooseEvenlyFromRemainingRacks() when the exception is thrown.
>  # In chooseEvenlyFromRemainingRacks(), the numResultsOflastChoose = 
> results.size(); statement should be moved to after chooseOnce(), or it will 
> throw lastException and make target choosing fail.
>  
>  






[jira] [Comment Edited] (HDFS-15863) RBF: Validation message to be corrected in FairnessPolicyController

2021-03-28 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17310156#comment-17310156
 ] 

Surendra Singh Lilhore edited comment on HDFS-15863 at 3/28/21, 10:18 AM:
--

+1 for v5.

 

Triggered build: https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/559/


was (Author: surendrasingh):
+1 for v5.

> RBF: Validation message to be corrected in FairnessPolicyController
> ---
>
> Key: HDFS-15863
> URL: https://issues.apache.org/jira/browse/HDFS-15863
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.4.0
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Minor
> Attachments: HDFS-15863.001.patch, HDFS-15863.002.patch, 
> HDFS-15863.003.patch, HDFS-15863.004.patch, HDFS-15863.005.patch
>
>
> org.apache.hadoop.hdfs.server.federation.fairness.StaticRouterRpcFairnessPolicyController#validateCount
> When dfs.federation.router.handler.count is less than the total of the 
> dedicated handlers for all NSs, the error message shows 0 and negative 
> values instead of the actual configured values.
> Current message: "Available handlers -5 lower than min 0 for nsId nn1"
> This can be changed to: "Configured handlers 
> ${DFS_ROUTER_HANDLER_COUNT_KEY}=10 lower than min 15 for nsId nn1", where 10 
> is the handler count and 15 is the sum of the dedicated handler counts.
> Related to: HDFS-14090
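
A sketch of the suggested message construction (assumed class and method shape; 
the config key name comes from the description above, everything else is 
illustrative):
{code:java}
public class HandlerCountValidation {
  static final String DFS_ROUTER_HANDLER_COUNT_KEY =
      "dfs.federation.router.handler.count";

  // handlerCount: configured total; dedicatedSum: sum of per-NS dedicated handlers.
  static void validateCount(String nsId, int handlerCount, int dedicatedSum) {
    if (handlerCount < dedicatedSum) {
      // Report the configured values, not the 0 / negative leftovers.
      throw new IllegalArgumentException(String.format(
          "Configured handlers %s=%d lower than min %d for nsId %s",
          DFS_ROUTER_HANDLER_COUNT_KEY, handlerCount, dedicatedSum, nsId));
    }
  }
}
{code}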






[jira] [Commented] (HDFS-15863) RBF: Validation message to be corrected in FairnessPolicyController

2021-03-28 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17310156#comment-17310156
 ] 

Surendra Singh Lilhore commented on HDFS-15863:
---

+1 for v5.

> RBF: Validation message to be corrected in FairnessPolicyController
> ---
>
> Key: HDFS-15863
> URL: https://issues.apache.org/jira/browse/HDFS-15863
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.4.0
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Minor
> Attachments: HDFS-15863.001.patch, HDFS-15863.002.patch, 
> HDFS-15863.003.patch, HDFS-15863.004.patch, HDFS-15863.005.patch
>
>
> org.apache.hadoop.hdfs.server.federation.fairness.StaticRouterRpcFairnessPolicyController#validateCount
> When dfs.federation.router.handler.count is less than the total of the 
> dedicated handlers for all NSs, the error message shows 0 and negative 
> values instead of the actual configured values.
> Current message: "Available handlers -5 lower than min 0 for nsId nn1"
> This can be changed to: "Configured handlers 
> ${DFS_ROUTER_HANDLER_COUNT_KEY}=10 lower than min 15 for nsId nn1", where 10 
> is the handler count and 15 is the sum of the dedicated handler counts.
> Related to: HDFS-14090






[jira] [Commented] (HDFS-15812) after deleting data of hbase table hdfs size is not decreasing

2021-02-11 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283149#comment-17283149
 ] 

Surendra Singh Lilhore commented on HDFS-15812:
---

[~satycse06], can you please check the namenode log to see what happened to the 
hbase-related files after deleting the table?

> after deleting data of hbase table hdfs size is not decreasing
> --
>
> Key: HDFS-15812
> URL: https://issues.apache.org/jira/browse/HDFS-15812
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.0.2-alpha
> Environment: HDP 3.1.4.0-315
> Hbase 2.0.2.3.1.4.0-315
>Reporter: Satya Gaurav
>Priority: Major
>
> I am deleting data from an HBase table; it is deleted from the HBase table, 
> but the size of the HDFS directory is not reducing. I even ran a major 
> compaction, but after that the HDFS size still didn't reduce. Any solution 
> for this issue?






[jira] [Commented] (HDFS-15812) after deleting data of hbase table hdfs size is not decreasing

2021-02-03 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278552#comment-17278552
 ] 

Surendra Singh Lilhore commented on HDFS-15812:
---

[~satycse06], this doc may help you understand; you need to check the 
HBase-side deletion policy:

[https://docs.cloudera.com/cdp-private-cloud-base/7.1.3/managing-hbase/topics/hbase-deletion.html]

I don't see any problem on the HDFS side.

> after deleting data of hbase table hdfs size is not decreasing
> --
>
> Key: HDFS-15812
> URL: https://issues.apache.org/jira/browse/HDFS-15812
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.0.2-alpha
> Environment: HDP 3.1.4.0-315
> Hbase 2.0.2.3.1.4.0-315
>Reporter: Satya Gaurav
>Priority: Major
>
> I am deleting data from an HBase table; it is deleted from the HBase table, 
> but the size of the HDFS directory is not reducing. I even ran a major 
> compaction, but after that the HDFS size still didn't reduce. Any solution 
> for this issue?






[jira] [Commented] (HDFS-15812) after deleting data of hbase table hdfs size is not decreasing

2021-02-02 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277683#comment-17277683
 ] 

Surendra Singh Lilhore commented on HDFS-15812:
---

Please send your query to 
[u...@hadoop.apache.org|mailto:u...@hadoop.apache.org].

> after deleting data of hbase table hdfs size is not decreasing
> --
>
> Key: HDFS-15812
> URL: https://issues.apache.org/jira/browse/HDFS-15812
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.0.2-alpha
> Environment: HDP 3.1.4.0-315
> Hbase 2.0.2.3.1.4.0-315
>Reporter: Satya Gaurav
>Priority: Major
>
> I am deleting data from an HBase table; it is deleted from the HBase table, 
> but the size of the HDFS directory is not reducing. I even ran a major 
> compaction, but after that the HDFS size still didn't reduce. Any solution 
> for this issue?






[jira] [Commented] (HDFS-15812) after deleting data of hbase table hdfs size is not decreasing

2021-02-02 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277682#comment-17277682
 ] 

Surendra Singh Lilhore commented on HDFS-15812:
---

[~satycse06], it will take time to delete the data from HDFS if it is moved to 
trash.

> after deleting data of hbase table hdfs size is not decreasing
> --
>
> Key: HDFS-15812
> URL: https://issues.apache.org/jira/browse/HDFS-15812
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.0.2-alpha
> Environment: HDP 3.1.4.0-315
> Hbase 2.0.2.3.1.4.0-315
>Reporter: Satya Gaurav
>Priority: Major
>
> I am deleting data from an HBase table; it is deleted from the HBase table, 
> but the size of the HDFS directory is not reducing. I even ran a major 
> compaction, but after that the HDFS size still didn't reduce. Any solution 
> for this issue?






[jira] [Commented] (HDFS-13522) Support observer node from Router-Based Federation

2020-09-08 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17191970#comment-17191970
 ] 

Surendra Singh Lilhore commented on HDFS-13522:
---

Hi [~hemanthboyina], in an initial review I found two things which need to be 
taken care of:
 # Load balancing between multiple observers.
 # WebHDFS calls; I think you may get an NPE for a WebHDFS call.

I will review this patch in detail.

> Support observer node from Router-Based Federation
> --
>
> Key: HDFS-13522
> URL: https://issues.apache.org/jira/browse/HDFS-13522
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: federation, namenode
>Reporter: Erik Krogen
>Assignee: Chao Sun
>Priority: Major
> Attachments: HDFS-13522.001.patch, HDFS-13522_WIP.patch, RBF_ 
> Observer support.pdf, Router+Observer RPC clogging.png, 
> ShortTerm-Routers+Observer.png
>
>
> Changes will need to occur to the router to support the new observer node.
> One such change will be to make the router understand the observer state, 
> e.g. {{FederationNamenodeServiceState}}.






[jira] [Resolved] (HDFS-15476) Make AsyncStream class' executor_ member private

2020-07-19 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore resolved HDFS-15476.
---
Resolution: Fixed

Thanks for the contribution, [~Suraj Naik].

> Make AsyncStream class' executor_ member private
> 
>
> Key: HDFS-15476
> URL: https://issues.apache.org/jira/browse/HDFS-15476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: build, libhdfs++
>Reporter: Suraj Naik
>Assignee: Suraj Naik
>Priority: Minor
> Fix For: 3.4.0
>
>
> As part of [HDFS-15385|https://issues.apache.org/jira/browse/HDFS-15385] the 
> boost library was upgraded.
> The AsyncStream class has a getter function which returns the executor. 
> Keeping the executor member public makes the getter function's role 
> pointless. 






[jira] [Commented] (HDFS-15476) Make AsyncStream class' executor_ member private

2020-07-19 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17160606#comment-17160606
 ] 

Surendra Singh Lilhore commented on HDFS-15476:
---

Added [~Suraj Naik] to the HDFS contributor list.

> Make AsyncStream class' executor_ member private
> 
>
> Key: HDFS-15476
> URL: https://issues.apache.org/jira/browse/HDFS-15476
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: build, libhdfs++
>Reporter: Suraj Naik
>Priority: Minor
> Fix For: 3.4.0
>
>
> As part of [HDFS-15385|https://issues.apache.org/jira/browse/HDFS-15385] the 
> boost library was upgraded.
> The AsyncStream class has a getter function which returns the executor. 
> Keeping the executor member public makes the getter function's role 
> pointless. 






[jira] [Commented] (HDFS-15067) Optimize heartbeat for large cluster

2020-07-13 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156643#comment-17156643
 ] 

Surendra Singh Lilhore commented on HDFS-15067:
---

Thanks [~ayushtkn] & [~umamaheswararao] for the review.
{quote}Let's say a DN does not have any work for some time and you started 
skipping heartbeats. When you are skipping, NN assigns some replication work to 
this node, they will just stay in NN side DatanodeDescriptor. Since there are 
no heartbeats received, that DN will not consume that work from NN right? So, 
assigned replication can be delayed? Am i missing something?
{quote}
Yes, at most a 30s delay (the stale interval).
{quote}We also report xceiver counts (and lot of other metrics) in heartbeats 
which will be used which choosing good nodes etc. I am wondering, whether we 
miss any approximation(far from original approximation)?
{quote}
Currently only a block write request (write xceiver) is treated as work 
received, which makes the DN resume the normal heartbeat. Can we also consider 
a read request as work and resume the normal heartbeat?
{quote}I saw in your proposal that, at least one heartbeat in stale interval. I 
feel one hb may be risk as it can be delayed or failed due to nw fluctuations. 
So, it may be risk that you will declare that node as stale wrongly?
{quote}
Yeah, this is a problem. Any suggestion for this? Can we send two consecutive 
heartbeats within the stale interval to solve it?
{quote}Does this proved some benefit in your cluster? I mean in response time 
etc.
{quote}
Yes, we got a good benefit in a 20K-node cluster. For one, the cluster 
activation time (Active NN out of safemode with 20K nodes) was reduced by 50%.
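
A sketch of the guard implied by that exchange (assumed names and values): cap 
the heartbeat gap well below the NN stale interval, so more than one heartbeat 
fits inside it and a single delayed RPC cannot mark the DN stale.
{code:java}
// idleGapMs: the stretched gap while skipping (e.g. 6s);
// staleIntervalMs: dfs.namenode.stale.datanode.interval (30s by default).
static long nextHeartbeatGapMs(long idleGapMs, long staleIntervalMs) {
  int heartbeatsPerStaleInterval = 2;  // the "two consecutive heartbeats" idea
  return Math.min(idleGapMs, staleIntervalMs / heartbeatsPerStaleInterval);
}
{code}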

> Optimize heartbeat for large cluster
> 
>
> Key: HDFS-15067
> URL: https://issues.apache.org/jira/browse/HDFS-15067
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15067.01.patch, HDFS-15067.02.patch, 
> HDFS-15067.03.patch, image-2020-01-09-18-00-49-556.png
>
>
> In a large cluster the Namenode spends significant time processing heartbeats. 
> For example, in a 10K-node cluster the Namenode processes 10K heartbeat RPCs 
> every 3 seconds. This impacts client response time. The heartbeat can be 
> optimized: a DN can start skipping every other heartbeat if no 
> work (write/replication/delete) has been allocated for a long time, i.e. send a 
> heartbeat every 6 seconds instead of every 3. Once the DN starts getting work 
> from the NN again, it resumes sending heartbeats normally.






[jira] [Commented] (HDFS-15067) Optimize heartbeat for large cluster

2020-06-24 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143903#comment-17143903
 ] 

Surendra Singh Lilhore commented on HDFS-15067:
---

Attached v3 patch.

Please review.

> Optimize heartbeat for large cluster
> 
>
> Key: HDFS-15067
> URL: https://issues.apache.org/jira/browse/HDFS-15067
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15067.01.patch, HDFS-15067.02.patch, 
> HDFS-15067.03.patch, image-2020-01-09-18-00-49-556.png
>
>
> In a large cluster the Namenode spends significant time processing heartbeats. 
> For example, in a 10K-node cluster the Namenode processes 10K heartbeat RPCs 
> every 3 seconds. This impacts client response time. The heartbeat can be 
> optimized: a DN can start skipping every other heartbeat if no 
> work (write/replication/delete) has been allocated for a long time, i.e. send a 
> heartbeat every 6 seconds instead of every 3. Once the DN starts getting work 
> from the NN again, it resumes sending heartbeats normally.






[jira] [Updated] (HDFS-15067) Optimize heartbeat for large cluster

2020-06-24 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15067:
--
Attachment: HDFS-15067.03.patch

> Optimize heartbeat for large cluster
> 
>
> Key: HDFS-15067
> URL: https://issues.apache.org/jira/browse/HDFS-15067
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15067.01.patch, HDFS-15067.02.patch, 
> HDFS-15067.03.patch, image-2020-01-09-18-00-49-556.png
>
>
> In a large cluster the Namenode spends significant time processing heartbeats. 
> For example, in a 10K-node cluster the Namenode processes 10K heartbeat RPCs 
> every 3 seconds. This impacts client response time. The heartbeat can be 
> optimized: a DN can start skipping every other heartbeat if no 
> work (write/replication/delete) has been allocated for a long time, i.e. send a 
> heartbeat every 6 seconds instead of every 3. Once the DN starts getting work 
> from the NN again, it resumes sending heartbeats normally.






[jira] [Commented] (HDFS-15375) Reconstruction Work should not happen for Corrupt Block

2020-06-02 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124099#comment-17124099
 ] 

Surendra Singh Lilhore commented on HDFS-15375:
---

Triggered one build to check the impact of this patch. 

> Reconstruction Work should not happen for Corrupt Block
> ---
>
> Key: HDFS-15375
> URL: https://issues.apache.org/jira/browse/HDFS-15375
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-15375-testrepro.patch, HDFS-15375.001.patch
>
>
> In BlockManager#updateNeededReconstructions, while updating 
> neededReconstruction we are adding pending-reconstruction blocks to the live 
> replicas:
> {code:java}
>   int pendingNum = pendingReconstruction.getNumReplicas(block);
>   int curExpectedReplicas = getExpectedRedundancyNum(block);
>   if (!hasEnoughEffectiveReplicas(block, repl, pendingNum)) {
>     neededReconstruction.update(block, repl.liveReplicas() + 
>         pendingNum,{code}
> But if two replicas were in pending reconstruction (due to corruption), and 
> the third replica is corrupted, the block should be in 
> QUEUE_WITH_CORRUPT_BLOCKS; because of the above logic it was instead added to 
> QUEUE_LOW_REDUNDANCY, which makes the RedundancyMonitor reconstruct a 
> corrupted block, which is wrong.






[jira] [Commented] (HDFS-15375) Reconstruction Work should not happen for Corrupt Block

2020-06-02 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124096#comment-17124096
 ] 

Surendra Singh Lilhore commented on HDFS-15375:
---

{quote}-                 neededReconstruction.update(block, repl.liveReplicas() 
+ pendingNum,{quote}
We can't remove {{pendingNum}} from here; it would create extra replication 
tasks if this count didn't include pendingNum. In your case all the blocks are 
corrupted, meaning the live replica count will be zero; you can add some logic 
based on a live-replica-zero check.
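
A sketch of that check (assumed shape and names; not the committed patch):
{code:java}
/**
 * Keep pendingNum in the effective replica count, but never schedule
 * reconstruction when no live replica exists at all.
 */
static boolean shouldQueueForReconstruction(int liveReplicas, int pendingNum,
    int corruptReplicas, int expectedReplicas) {
  if (liveReplicas == 0 && corruptReplicas > 0) {
    // All replicas corrupt: the block belongs in QUEUE_WITH_CORRUPT_BLOCKS,
    // not QUEUE_LOW_REDUNDANCY; reconstruction would only copy corrupt data.
    return false;
  }
  // Counting pending reconstructions avoids scheduling duplicate tasks.
  return liveReplicas + pendingNum < expectedReplicas;
}
{code}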

> Reconstruction Work should not happen for Corrupt Block
> ---
>
> Key: HDFS-15375
> URL: https://issues.apache.org/jira/browse/HDFS-15375
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-15375-testrepro.patch, HDFS-15375.001.patch
>
>
> In BlockManager#updateNeededReconstructions, while updating 
> neededReconstruction we are adding pending-reconstruction blocks to the live 
> replicas:
> {code:java}
>   int pendingNum = pendingReconstruction.getNumReplicas(block);
>   int curExpectedReplicas = getExpectedRedundancyNum(block);
>   if (!hasEnoughEffectiveReplicas(block, repl, pendingNum)) {
>     neededReconstruction.update(block, repl.liveReplicas() + 
>         pendingNum,{code}
> But if two replicas were in pending reconstruction (due to corruption), and 
> the third replica is corrupted, the block should be in 
> QUEUE_WITH_CORRUPT_BLOCKS; because of the above logic it was instead added to 
> QUEUE_LOW_REDUNDANCY, which makes the RedundancyMonitor reconstruct a 
> corrupted block, which is wrong.






[jira] [Commented] (HDFS-15375) Reconstruction Work should not happen for Corrupt Block

2020-05-29 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119717#comment-17119717
 ] 

Surendra Singh Lilhore commented on HDFS-15375:
---

[~hemanthboyina], thanks for the patch.

One doubt: without this fix, how much time will it take to come out of 
QUEUE_LOW_REDUNDANCY if the third replica is also corrupted?

> Reconstruction Work should not happen for Corrupt Block
> ---
>
> Key: HDFS-15375
> URL: https://issues.apache.org/jira/browse/HDFS-15375
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-15375-testrepro.patch, HDFS-15375.001.patch
>
>
> In BlockManager#updateNeededReconstructions, while updating 
> neededReconstruction we are adding pending-reconstruction blocks to the live 
> replicas:
> {code:java}
>   int pendingNum = pendingReconstruction.getNumReplicas(block);
>   int curExpectedReplicas = getExpectedRedundancyNum(block);
>   if (!hasEnoughEffectiveReplicas(block, repl, pendingNum)) {
>     neededReconstruction.update(block, repl.liveReplicas() + 
>         pendingNum,{code}
> But if two replicas were in pending reconstruction (due to corruption), and 
> the third replica is corrupted, the block should be in 
> QUEUE_WITH_CORRUPT_BLOCKS; because of the above logic it was instead added to 
> QUEUE_LOW_REDUNDANCY, which makes the RedundancyMonitor reconstruct a 
> corrupted block, which is wrong.






[jira] [Commented] (HDFS-14762) "Path(Path/String parent, String child)" will fail when "child" contains ":"

2020-05-18 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110069#comment-17110069
 ] 

Surendra Singh Lilhore commented on HDFS-14762:
---

Hi [~ayushtkn],
{quote}File Name used IPv6? What is the relation of name & IPv6?
{quote}
We are trying HDFS with IPv6. The Datanode creates a block pool directory, and 
the block pool name contains the IP of the Namenode. If the NN is started with 
IPv6, this name contains ":" and the same problem occurs.
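
A small demonstration of the failure mode (the block pool ID below is an 
illustrative IPv6-derived value, not a real one):
{code:java}
import org.apache.hadoop.fs.Path;

public class ColonPathDemo {
  public static void main(String[] args) {
    // The two-argument Path constructor rejects a child containing ':' with
    // java.lang.IllegalArgumentException: java.net.URISyntaxException:
    // Relative path in absolute URI.
    new Path(new Path("/data/current"), "BP-1-2001:db8::1-1589791234567");
  }
}
{code}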

> "Path(Path/String parent, String child)" will fail when "child" contains ":"
> 
>
> Key: HDFS-14762
> URL: https://issues.apache.org/jira/browse/HDFS-14762
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shixiong Zhu
>Priority: Major
> Attachments: HDFS-14762.001.patch, HDFS-14762.002.patch, 
> HDFS-14762.003.patch, HDFS-14762.004.patch
>
>
> When the "child" parameter contains ":", "Path(Path/String parent, String 
> child)" will throw the following exception:
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: ...
> {code}
> Not sure if this is a legit bug. But the following places will hit this error 
> when seeing a Path with a file name containing ":":
> https://github.com/apache/hadoop/blob/f9029c4070e8eb046b403f5cb6d0a132c5d58448/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java#L101
> https://github.com/apache/hadoop/blob/f9029c4070e8eb046b403f5cb6d0a132c5d58448/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Globber.java#L270






[jira] [Reopened] (HDFS-14762) "Path(Path/String parent, String child)" will fail when "child" contains ":"

2020-05-18 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore reopened HDFS-14762:
---

We should handle this scenario; it is a valid scenario.

We faced the same problem when some file names used IPv6 addresses.

> "Path(Path/String parent, String child)" will fail when "child" contains ":"
> 
>
> Key: HDFS-14762
> URL: https://issues.apache.org/jira/browse/HDFS-14762
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shixiong Zhu
>Priority: Major
> Attachments: HDFS-14762.001.patch, HDFS-14762.002.patch, 
> HDFS-14762.003.patch, HDFS-14762.004.patch
>
>
> When the "child" parameter contains ":", "Path(Path/String parent, String 
> child)" will throw the following exception:
> {code}
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: ...
> {code}
> Not sure if this is a legit bug. But the following places will hit this error 
> when seeing a Path with a file name containing ":":
> https://github.com/apache/hadoop/blob/f9029c4070e8eb046b403f5cb6d0a132c5d58448/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java#L101
> https://github.com/apache/hadoop/blob/f9029c4070e8eb046b403f5cb6d0a132c5d58448/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Globber.java#L270






[jira] [Commented] (HDFS-14452) Make Op#valueOf() Public

2020-05-15 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107990#comment-17107990
 ] 

Surendra Singh Lilhore commented on HDFS-14452:
---

+1 LGTM

> Make Op#valueOf() Public
> 
>
> Key: HDFS-14452
> URL: https://issues.apache.org/jira/browse/HDFS-14452
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ipc
>Affects Versions: 3.2.0
>Reporter: David Mollitor
>Assignee: hemanthboyina
>Priority: Minor
>  Labels: noob
> Attachments: HDFS-14452.patch
>
>
> Change the signature of {{private static Op valueOf(byte code)}} to be public. 
> Right now, the only easy way to look up an Op is to pass in a {{DataInput}} 
> object, which is not all that flexible or efficient for other custom 
> implementations that want to store the Op code a different way.
> https://github.com/apache/hadoop/blob/8c95cb9d6bef369fef6a8364f0c0764eba90e44a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/Op.java#L53
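
Hypothetical usage once valueOf(byte) is public (the wire code 80 for 
Op.WRITE_BLOCK is taken from the linked source):
{code:java}
import org.apache.hadoop.hdfs.protocol.datatransfer.Op;

byte stored = 80;              // e.g. a custom store keeps the raw op code
Op op = Op.valueOf(stored);    // today this needs Op.read(DataInput) instead
{code}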






[jira] [Updated] (HDFS-15316) Deletion failure should not remove directory from snapshottables

2020-05-13 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15316:
--
Fix Version/s: 3.4.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Thanks [~hemanthboyina] for the contribution.

Committed to trunk.

Thanks [~ayushtkn] for the review.

> Deletion failure should not remove directory from snapshottables
> 
>
> Key: HDFS-15316
> URL: https://issues.apache.org/jira/browse/HDFS-15316
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: HDFS-15316.001.patch, HDFS-15316.002.patch
>
>
> If deleting a directory doesn't succeed, we still remove the directory from 
> snapshottables.
> This makes the system inconsistent: we can still create snapshots, but 
> snapshot diff throws "Directory is not snapshottable".






[jira] [Commented] (HDFS-15316) Deletion failure should not remove directory from snapshottables

2020-05-05 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099570#comment-17099570
 ] 

Surendra Singh Lilhore commented on HDFS-15316:
---

Thanks [~hemanthboyina] for the patch.

It is a very rare scenario, but good to handle.

+1

Will commit tomorrow if there are no more comments.

> Deletion failure should not remove directory from snapshottables
> 
>
> Key: HDFS-15316
> URL: https://issues.apache.org/jira/browse/HDFS-15316
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-15316.001.patch, HDFS-15316.002.patch
>
>
> If deleting a directory doesn't succeed, we still remove the directory from 
> snapshottables.
> This makes the system inconsistent: we can still create snapshots, but 
> snapshot diff throws "Directory is not snapshottable".






[jira] [Updated] (HDFS-15210) EC: File write hung when DN is shut down by admin command.

2020-04-29 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15210:
--
Fix Version/s: 3.4.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Thanks [~ayushtkn] for the review.

Committed to trunk.

> EC: File write hung when DN is shut down by admin command.
> 
>
> Key: HDFS-15210
> URL: https://issues.apache.org/jira/browse/HDFS-15210
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Fix For: 3.4.0
>
> Attachments: HDFS-15210.001.patch, HDFS-15210.002.patch, 
> HDFS-15210.003.patch, dump.txt
>
>
> EC Blocks : blk_-9223372036854291632_10668910, 
> blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, 
> blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910
>  
> The DNs hosting two of the blocks restarted: blk_-9223372036854291630_10668910 
> & blk_-9223372036854291632_10668910
> {code:java}
> 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8
> 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code}
>  
> The restarted streams are stuck with the stack trace below:
> {code}
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) 
> at 
> org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110)
>  at 
> org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276)
>  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at 
> org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46)
> {code}






[jira] [Commented] (HDFS-15210) EC: File write hung when DN is shut down by admin command.

2020-04-23 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091207#comment-17091207
 ] 

Surendra Singh Lilhore commented on HDFS-15210:
---

Attached v3 patch.

> EC: File write hung when DN is shut down by admin command.
> 
>
> Key: HDFS-15210
> URL: https://issues.apache.org/jira/browse/HDFS-15210
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15210.001.patch, HDFS-15210.002.patch, 
> HDFS-15210.003.patch, dump.txt
>
>
> EC Blocks : blk_-9223372036854291632_10668910, 
> blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, 
> blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910
>  
> The DNs hosting two of the blocks restarted: blk_-9223372036854291630_10668910 
> & blk_-9223372036854291632_10668910
> {code:java}
> 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8
> 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code}
>  
> The restarted streams are stuck with the stack trace below:
> {code}
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) 
> at 
> org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110)
>  at 
> org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276)
>  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at 
> org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46)
> {code}






[jira] [Updated] (HDFS-15210) EC: File write hung when DN is shut down by admin command.

2020-04-23 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15210:
--
Attachment: HDFS-15210.003.patch

> EC: File write hung when DN is shut down by admin command.
> 
>
> Key: HDFS-15210
> URL: https://issues.apache.org/jira/browse/HDFS-15210
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15210.001.patch, HDFS-15210.002.patch, 
> HDFS-15210.003.patch, dump.txt
>
>
> EC Blocks : blk_-9223372036854291632_10668910, 
> blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, 
> blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910
>  
> The DNs hosting two of the blocks restarted: blk_-9223372036854291630_10668910 
> & blk_-9223372036854291632_10668910
> {code:java}
> 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8
> 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code}
>  
> The restarted streams are stuck with the stack trace below:
> {code}
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) 
> at 
> org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110)
>  at 
> org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276)
>  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at 
> org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46)
> {code}






[jira] [Updated] (HDFS-15218) RBF: MountTableRefresherService failed to refresh other router MountTableEntries in secure mode.

2020-04-18 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15218:
--
Fix Version/s: (was: 3.40)
   3.4.0

> RBF: MountTableRefresherService failed to refresh other router 
> MountTableEntries in secure mode.
> 
>
> Key: HDFS-15218
> URL: https://issues.apache.org/jira/browse/HDFS-15218
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Fix For: 3.3.0, 3.4.0
>
> Attachments: HDFS-15218.001.patch
>
>
> {code:java}
> 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed 
> to refresh mount table entries cache at router X:25020 | 
> MountTableRefresherThread.java:69
> java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort 
> XXX/XXX:0. Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284)
> at 
> org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65)
>  {code}






[jira] [Updated] (HDFS-15218) RBF: MountTableRefresherService failed to refresh other router MountTableEntries in secure mode.

2020-04-18 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15218:
--
Fix Version/s: 3.3.0
   3.40
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Thanks [~brahmareddy] & [~elgoiri] for the review.

Committed to trunk and branch-3.3.

> RBF: MountTableRefresherService failed to refresh other router 
> MountTableEntries in secure mode.
> 
>
> Key: HDFS-15218
> URL: https://issues.apache.org/jira/browse/HDFS-15218
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Fix For: 3.40, 3.3.0
>
> Attachments: HDFS-15218.001.patch
>
>
> {code:java}
> 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed 
> to refresh mount table entries cache at router X:25020 | 
> MountTableRefresherThread.java:69
> java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort 
> XXX/XXX:0. Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284)
> at 
> org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65)
>  {code}






[jira] [Updated] (HDFS-15218) RBF: MountTableRefresherService failed to refresh other router MountTableEntries in secure mode.

2020-04-18 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15218:
--
Summary: RBF: MountTableRefresherService failed to refresh other router 
MountTableEntries in secure mode.  (was: RBF: MountTableRefresherService failed 
to refresh other router mount table in secure mode.)

> RBF: MountTableRefresherService failed to refresh other router 
> MountTableEntries in secure mode.
> 
>
> Key: HDFS-15218
> URL: https://issues.apache.org/jira/browse/HDFS-15218
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15218.001.patch
>
>
> {code:java}
> 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed 
> to refresh mount table entries cache at router X:25020 | 
> MountTableRefresherThread.java:69
> java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort 
> XXX/XXX:0. Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284)
> at 
> org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65)
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15218) RBF: MountTableRefresherService failed to refresh other router mount table in secure mode.

2020-04-18 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15218:
--
Summary: RBF: MountTableRefresherService failed to refresh other router 
mount table in secure mode.  (was: RBF: MountTableRefresherService fails in 
secure cluster.)

> RBF: MountTableRefresherService failed to refresh other router mount table in 
> secure mode.
> --
>
> Key: HDFS-15218
> URL: https://issues.apache.org/jira/browse/HDFS-15218
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15218.001.patch
>
>
> {code:java}
> 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed 
> to refresh mount table entries cache at router X:25020 | 
> MountTableRefresherThread.java:69
> java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort 
> XXX/XXX:0. Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284)
> at 
> org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65)
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15218) RBF: MountTableRefresherService fails in secure cluster.

2020-04-16 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085080#comment-17085080
 ] 

Surendra Singh Lilhore edited comment on HDFS-15218 at 4/16/20, 4:46 PM:
-

[~brahmareddy], shall we go ahead with the commit? It is important for 3.3.0.


was (Author: surendrasingh):
[~brahmareddy], shall we go ahead with the commit? It is important for 3.3.0.

 

 

> RBF: MountTableRefresherService fails in secure cluster.
> ---
>
> Key: HDFS-15218
> URL: https://issues.apache.org/jira/browse/HDFS-15218
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15218.001.patch
>
>
> {code:java}
> 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed 
> to refresh mount table entries cache at router X:25020 | 
> MountTableRefresherThread.java:69
> java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort 
> XXX/XXX:0. Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284)
> at 
> org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65)
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15218) RBF: MountTableRefresherService fails in secure cluster.

2020-04-16 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085080#comment-17085080
 ] 

Surendra Singh Lilhore commented on HDFS-15218:
---

[~brahmareddy], shall we go ahead with the commit? It is important for 3.3.0.

 

 

> RBF: MountTableRefresherService fails in secure cluster.
> ---
>
> Key: HDFS-15218
> URL: https://issues.apache.org/jira/browse/HDFS-15218
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15218.001.patch
>
>
> {code:java}
> 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed 
> to refresh mount table entries cache at router X:25020 | 
> MountTableRefresherThread.java:69
> java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort 
> XXX/XXX:0. Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284)
> at 
> org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65)
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15210) EC : File write hung when DN is shut down by admin command.

2020-04-10 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080583#comment-17080583
 ] 

Surendra Singh Lilhore commented on HDFS-15210:
---

Attached v2 patch
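
For context on the hang, a hypothetical sketch (not the attached patch): the restarted 
streamer blocks indefinitely in MultipleBlockingQueue#take() because nothing repopulates the 
queue once the pipeline has failed. A bounded wait, sketched below with a plain BlockingQueue 
standing in for the real structure, would surface the failure instead:
{code:java}
import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.hdfs.protocol.LocatedBlock;

public final class BoundedFollowingBlock {
  /** Poll with a timeout so a dead pipeline becomes an error, not a hang. */
  static LocatedBlock waitForFollowingBlock(BlockingQueue<LocatedBlock> queue,
      long timeoutMs) throws IOException, InterruptedException {
    LocatedBlock lb = queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
    if (lb == null) {
      throw new IOException("No updated block received after DN restart");
    }
    return lb;
  }
}
{code}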

> EC : File write hung when DN is shut down by admin command.
> 
>
> Key: HDFS-15210
> URL: https://issues.apache.org/jira/browse/HDFS-15210
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15210.001.patch, HDFS-15210.002.patch, dump.txt
>
>
> EC Blocks : blk_-9223372036854291632_10668910, 
> blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, 
> blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910
>  
> Two block DN restarted : blk_-9223372036854291630_10668910 & 
> blk_-9223372036854291632_10668910
> {code:java}
> 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8
> 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code}
>  
> Restarted streams are stuck in below stacktrace :
> {code}
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) 
> at 
> org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110)
>  at 
> org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276)
>  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at 
> org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15210) EC : File write hung when DN is shut down by admin command.

2020-04-10 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15210:
--
Attachment: HDFS-15210.002.patch

> EC : File write hung when DN is shut down by admin command.
> 
>
> Key: HDFS-15210
> URL: https://issues.apache.org/jira/browse/HDFS-15210
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15210.001.patch, HDFS-15210.002.patch, dump.txt
>
>
> EC Blocks : blk_-9223372036854291632_10668910, 
> blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, 
> blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910
>  
> Two block DN restarted : blk_-9223372036854291630_10668910 & 
> blk_-9223372036854291632_10668910
> {code:java}
> 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8
> 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code}
>  
> Restarted streams are stuck in below stacktrace :
> {code}
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) 
> at 
> org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110)
>  at 
> org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276)
>  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at 
> org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15198) RBF: In Secure Mode, Router can't refresh other router's mountTableEntries

2020-04-09 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079557#comment-17079557
 ] 

Surendra Singh Lilhore commented on HDFS-15198:
---

{quote}Should we merge the code change in HDFS-15218 and the unit test here?
{quote}
I am ok with this.

> RBF: In Secure Mode, Router can't refresh other router's mountTableEntries
> --
>
> Key: HDFS-15198
> URL: https://issues.apache.org/jira/browse/HDFS-15198
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Major
> Attachments: HDFS-15198.001.patch, HDFS-15198.002.patch, 
> HDFS-15198.003.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> In HDFS-13443, the mount table cache is updated immediately. The specified 
> router updates its own mount table cache immediately, then updates the others 
> via the RPC protocol refreshMountTableEntries. But in secure mode, it can't 
> refresh the other routers' caches. In the specified router's log, the error 
> looks like this:
> {code}
> 2020-02-27 22:59:07,212 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server : 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> 2020-02-27 22:59:07,213 ERROR 
> org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread: 
> Failed to refresh mount table entries cache at router $host:8111
> java.io.IOException: DestHost:destPort host:8111 , LocalHost:localPort 
> $host/$ip:0. Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:288)
> at 
> org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65)
> 2020-02-27 22:59:07,214 INFO 
> org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver: Added 
> new mount point /test_11 to resolver
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15198) RBF: In Secure Mode, Router can't refresh other router's mountTableEntries

2020-04-09 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079015#comment-17079015
 ] 

Surendra Singh Lilhore commented on HDFS-15198:
---

Please refer to HDFS-15218.

> RBF: In Secure Mode, Router can't refresh other router's mountTableEntries
> --
>
> Key: HDFS-15198
> URL: https://issues.apache.org/jira/browse/HDFS-15198
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Major
> Attachments: HDFS-15198.001.patch, HDFS-15198.002.patch, 
> HDFS-15198.003.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> In HDFS-13443, the mount table cache is updated immediately. The specified 
> router updates its own mount table cache immediately, then updates the others 
> via the RPC protocol refreshMountTableEntries. But in secure mode, it can't 
> refresh the other routers' caches. In the specified router's log, the error 
> looks like this:
> {code}
> 2020-02-27 22:59:07,212 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server : 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> 2020-02-27 22:59:07,213 ERROR 
> org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread: 
> Failed to refresh mount table entries cache at router $host:8111
> java.io.IOException: DestHost:destPort host:8111 , LocalHost:localPort 
> $host/$ip:0. Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:288)
> at 
> org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65)
> 2020-02-27 22:59:07,214 INFO 
> org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver: Added 
> new mount point /test_11 to resolver
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11298) Add storage policy info in FileStatus

2020-04-05 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-11298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-11298:
--
Resolution: Won't Fix
Status: Resolved  (was: Patch Available)
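
For reference, what the improvement asked for, as a hedged sketch: a storagePolicy field on 
the generic FileStatus is hypothetical (it was not added), while the HDFS-specific route 
below uses real APIs and shows the extra lookup the proposal wanted to avoid:
{code:java}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsFileStatus;

public class StoragePolicyLookup {
  /** Today the policy id comes from HdfsFileStatus, i.e. an HDFS-only path. */
  static byte policyOf(DistributedFileSystem dfs, Path p) throws Exception {
    HdfsFileStatus st = dfs.getClient().getFileInfo(p.toUri().getPath());
    return st.getStoragePolicy();
  }
}
{code}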

> Add storage policy info in FileStatus
> -
>
> Key: HDFS-11298
> URL: https://issues.apache.org/jira/browse/HDFS-11298
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.7.2
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-11298.001.patch
>
>
> It's good to add a storagePolicy field in FileStatus, so we don't need to call 
> the getStoragePolicy() API to get the policy.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15200) Delete Corrupt Replica Immediately Irrespective of Replicas On Stale Storage

2020-03-17 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060804#comment-17060804
 ] 

Surendra Singh Lilhore commented on HDFS-15200:
---

{quote}The default true was suggested by Akira Ajisaka above
{quote}
I agree with this; it should be true by default.
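
To make the proposed skip concrete, a hypothetical sketch (the isCorrupt flag is 
illustrative; this is not the committed patch) of bypassing the stale-storage postponement 
for corrupt replicas:
{code:java}
// Sketch: postpone invalidation on stale storage only for non-corrupt
// replicas; a corrupt replica can never be corrected, so delete it now.
if (!isCorrupt && nr.replicasOnStaleNodes() > 0) {
  blockLog.debug("BLOCK* invalidateBlocks: postponing invalidation of {}", b);
  postponeBlock(b.getCorrupted());
  return false;
}
// Corrupt replicas fall through and are queued for deletion immediately.
{code}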

> Delete Corrupt Replica Immediately Irrespective of Replicas On Stale Storage 
> -
>
> Key: HDFS-15200
> URL: https://issues.apache.org/jira/browse/HDFS-15200
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
> Attachments: HDFS-15200-01.patch, HDFS-15200-02.patch, 
> HDFS-15200-03.patch, HDFS-15200-04.patch, HDFS-15200-05.patch
>
>
> Presently, {{invalidateBlock(..)}}, before adding a replica into invalidates, 
> checks whether any block replica is on stale storage; if any replica is on 
> stale storage, it postpones deletion of the replica.
> Here:
> {code:java}
>// Check how many copies we have of the block
> if (nr.replicasOnStaleNodes() > 0) {
>   blockLog.debug("BLOCK* invalidateBlocks: postponing " +
>   "invalidation of {} on {} because {} replica(s) are located on " +
>   "nodes with potentially out-of-date block reports", b, dn,
>   nr.replicasOnStaleNodes());
>   postponeBlock(b.getCorrupted());
>   return false;
> {code}
>  
> In the case of a corrupt replica, we can skip this logic and delete the corrupt 
> replica immediately, as a corrupt replica can't be corrected.
> One outcome of this behavior at present is the namenodes showing different block 
> states post failover:
> If a replica is marked corrupt, the active NN will mark it as corrupt, mark it 
> for deletion, and remove it from corruptReplicas and the 
> excessRedundancyMap.
> If failover happens before the deletion of the replica,
> the standby NameNode will mark all the storages as stale.
> It will then start processing IBRs. Now, since the replicas would be on stale 
> storage, it will skip the deletion and the removal from corruptReplicas.
> Hence both namenodes will show different numbers and different corrupt 
> replicas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15227) FSCK -upgradedomains is failing when more than 2 million blocks are present in HDFS and writes of some blocks are in progress

2020-03-17 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060656#comment-17060656
 ] 

Surendra Singh Lilhore edited comment on HDFS-15227 at 3/17/20, 6:17 AM:
-

Thanks [~ayushtkn] for the patch.

+1


was (Author: surendrasingh):
Thanks [~ayushtkn] for the patch.

+1. Just add a comment in the patch for the null-check scenario.

> FSCK -upgradedomains is failing when more than 2 million blocks are present 
> in HDFS and writes of some blocks are in progress
> ---
>
> Key: HDFS-15227
> URL: https://issues.apache.org/jira/browse/HDFS-15227
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: krishna reddy
>Assignee: Ayush Saxena
>Priority: Major
> Attachments: HDFS-15227-01.patch, TestToRepro.patch
>
>
> FSCK -upgradedomains is failing when more than 2 million blocks are present 
> in HDFS and writes of some blocks are in progress:
> "hdfs fsck / -files -blocks -upgradedomains"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15227) FSCK -upgradedomains is failing when more than 2 million blocks are present in HDFS and writes of some blocks are in progress

2020-03-17 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060656#comment-17060656
 ] 

Surendra Singh Lilhore commented on HDFS-15227:
---

Thanks [~ayushtkn] for the patch.

+1. Just add a comment in the patch for the null-check scenario.
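
For the record, a hypothetical illustration of the null-check scenario (the surrounding 
names are illustrative): a block whose write is still in progress may not have resolvable 
storage information yet, so the upgrade-domain aggregation must skip it rather than 
dereference null:
{code:java}
// Sketch: tolerate missing storage/datanode info for in-progress writes.
for (DatanodeStorageInfo storage : blockManager.getStorages(block)) {
  if (storage == null || storage.getDatanodeDescriptor() == null) {
    continue; // write in progress; no location to report yet
  }
  upgradeDomains.add(storage.getDatanodeDescriptor().getUpgradeDomain());
}
{code}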

> FSCK -upgradedomains is failing when more than 2 million blocks are present 
> in HDFS and writes of some blocks are in progress
> ---
>
> Key: HDFS-15227
> URL: https://issues.apache.org/jira/browse/HDFS-15227
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: krishna reddy
>Assignee: Ayush Saxena
>Priority: Major
> Attachments: HDFS-15227-01.patch, TestToRepro.patch
>
>
> FSCK -upgradedomains is failing when more than 2 million blocks are present 
> in HDFS and writes of some blocks are in progress:
> "hdfs fsck / -files -blocks -upgradedomains"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15211) EC: File write hangs during close in case of Exception during updatePipeline

2020-03-15 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15211:
--
Fix Version/s: 3.2.2
   3.1.4
   3.3.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Thanks [~ayushtkn] for the contribution.
Committed to trunk, branch-3.2, branch-3.1.
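
A hypothetical illustration of the hang this fixes (not a quote of the patch): close() keeps 
waiting on streamers that are still flagged healthy but have already died during 
updatePipeline, so the flush below never returns unless the failure count is re-checked 
against the parity budget:
{code:java}
// Sketch: 'streamers' and 'lastSeqno' are illustrative context variables.
for (StripedDataStreamer s : streamers) {
  if (s.isHealthy()) {
    s.waitForAckedSeqno(lastSeqno); // hangs for a silently failed streamer
  }
}
{code}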

> EC: File write hangs during close in case of Exception during updatePipeline
> 
>
> Key: HDFS-15211
> URL: https://issues.apache.org/jira/browse/HDFS-15211
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.1, 3.3.0, 3.2.1
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-15211-01.patch, HDFS-15211-02.patch, 
> HDFS-15211-03.patch, HDFS-15211-04.patch, HDFS-15211-05.patch, 
> TestToRepro-01.patch, Thread-Dump, Thread-Dump-02
>
>
> EC file write hangs during file close if there is an exception due to the 
> closure of a slow stream and the number of failed data streamers grows beyond 
> the number of parity blocks.
> During the close, the stream will try to flush all the healthy streamers, 
> but the streamers won't have any result due to the exception, and the 
> streamers will stay stuck.
> Hence the close will also get stuck.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15211) EC: File write hangs during close in case of Exception during updatePipeline

2020-03-15 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059707#comment-17059707
 ] 

Surendra Singh Lilhore commented on HDFS-15211:
---

+1

> EC: File write hangs during close in case of Exception during updatePipeline
> 
>
> Key: HDFS-15211
> URL: https://issues.apache.org/jira/browse/HDFS-15211
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.1, 3.3.0, 3.2.1
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
> Attachments: HDFS-15211-01.patch, HDFS-15211-02.patch, 
> HDFS-15211-03.patch, HDFS-15211-04.patch, HDFS-15211-05.patch, 
> TestToRepro-01.patch, Thread-Dump, Thread-Dump-02
>
>
> EC file write hangs during file close if there is an exception due to the 
> closure of a slow stream and the number of failed data streamers grows beyond 
> the number of parity blocks.
> During the close, the stream will try to flush all the healthy streamers, 
> but the streamers won't have any result due to the exception, and the 
> streamers will stay stuck.
> Hence the close will also get stuck.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15220) FSCK calls are redirecting to Active NN

2020-03-12 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058423#comment-17058423
 ] 

Surendra Singh Lilhore commented on HDFS-15220:
---

[~weichiu], how will fsck do the msync? It is an HTTP call.
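
For context, fsck is served by the NameNode's /fsck servlet rather than a ClientProtocol 
RPC, so there is no RPC on this path to piggyback an msync on. A minimal sketch of the 
client side (host/port hypothetical; the real DFSck resolves the NN web address from the 
configuration):
{code:java}
import java.io.InputStream;
import java.net.URL;

public class FsckHttpSketch {
  public static void main(String[] args) throws Exception {
    // fsck is an HTTP GET against the /fsck servlet, not an RPC.
    URL url = new URL("http://namenode:9870/fsck?ugi=hdfs&path=%2F");
    try (InputStream in = url.openConnection().getInputStream()) {
      in.transferTo(System.out); // stream the fsck report
    }
  }
}
{code}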

> FSCK calls are redirecting to Active NN
> ---
>
> Key: HDFS-15220
> URL: https://issues.apache.org/jira/browse/HDFS-15220
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: krishna reddy
>Assignee: Ravuri Sushma sree
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Any fsck run except -delete and -move should go to the ONN, as it is a read 
> operation.
> In the image below, the spikes indicate when fsck / -storagepolicies was run.
>  !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15220) FSCK calls are redirecting to Active NN

2020-03-12 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058089#comment-17058089
 ] 

Surendra Singh Lilhore commented on HDFS-15220:
---

The fsck call should not be sent to the observer; it should always be sent to the 
active NameNode. Users use this command to check the current state of the server.

> FSCK calls are redirecting to Active NN
> ---
>
> Key: HDFS-15220
> URL: https://issues.apache.org/jira/browse/HDFS-15220
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: krishna reddy
>Assignee: Ravuri Sushma sree
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Any fsck run except -delete and -move should go to the ONN, as it is a read operation.
>  !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15211) EC: File write hangs during close in case of Exception during updatePipeline

2020-03-12 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057941#comment-17057941
 ] 

Surendra Singh Lilhore commented on HDFS-15211:
---

Thanks [~ayushtkn].

Minor comment: please remove the unrelated changes.
{code:java}
   // failures when sending the last packet. We actually do not 
need to
-  // bump GS for this kind of failure. Thus counting the total 
number
-  // of failures may be good enough.
+  // bump GS for this kind of failure. Thus counting the total
+  //  number of failures may be good enough.{code}

Other changes are good.

> EC: File write hangs during close in case of Exception during updatePipeline
> 
>
> Key: HDFS-15211
> URL: https://issues.apache.org/jira/browse/HDFS-15211
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.1, 3.3.0, 3.2.1
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
> Attachments: HDFS-15211-01.patch, HDFS-15211-02.patch, 
> HDFS-15211-03.patch, TestToRepro-01.patch, Thread-Dump, Thread-Dump-02
>
>
> EC file write hangs during file close if there is an exception due to the 
> closure of a slow stream and the number of failed data streamers grows beyond 
> the number of parity blocks.
> During the close, the stream will try to flush all the healthy streamers, 
> but the streamers won't have any result due to the exception, and the 
> streamers will stay stuck.
> Hence the close will also get stuck.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14442) Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId

2020-03-12 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057935#comment-17057935
 ] 

Surendra Singh Lilhore edited comment on HDFS-14442 at 3/12/20, 1:48 PM:
-

Rebased the test code before committing.

Thanks [~Sushma_28] for the contribution.
Thanks [~xkrogen] & [~ayushtkn] for the review.


was (Author: surendrasingh):
Thanks [~Sushma_28] for the contribution.
Thanks [~xkrogen] & [~ayushtkn] for the review.

> Disagreement between HAUtil.getAddressOfActive and 
> RpcInvocationHandler.getConnectionId
> ---
>
> Key: HDFS-14442
> URL: https://issues.apache.org/jira/browse/HDFS-14442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Erik Krogen
>Assignee: Ravuri Sushma sree
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
> Attachments: HDFS-14442.001.patch, HDFS-14442.002.patch, 
> HDFS-14442.003.patch, HDFS-14442.004.patch
>
>
> While working on HDFS-14245, we noticed a discrepancy in some proxy-handling 
> code.
> The description of {{RpcInvocationHandler.getConnectionId()}} states:
> {code}
>   /**
>* Returns the connection id associated with the InvocationHandler instance.
>* @return ConnectionId
>*/
>   ConnectionId getConnectionId();
> {code}
> It does not make any claims about whether this connection ID will be an 
> active proxy or not. Yet in {{HAUtil}} we have:
> {code}
>   /**
>* Get the internet address of the currently-active NN. This should rarely 
> be
>* used, since callers of this method who connect directly to the NN using 
> the
>* resulting InetSocketAddress will not be able to connect to the active NN 
> if
>* a failover were to occur after this method has been called.
>* 
>* @param fs the file system to get the active address of.
>* @return the internet address of the currently-active NN.
>* @throws IOException if an error occurs while resolving the active NN.
>*/
>   public static InetSocketAddress getAddressOfActive(FileSystem fs)
>   throws IOException {
> if (!(fs instanceof DistributedFileSystem)) {
>   throw new IllegalArgumentException("FileSystem " + fs + " is not a 
> DFS.");
> }
> // force client address resolution.
> fs.exists(new Path("/"));
> DistributedFileSystem dfs = (DistributedFileSystem) fs;
> DFSClient dfsClient = dfs.getClient();
> return RPC.getServerAddress(dfsClient.getNamenode());
>   }
> {code}
> Where the call {{RPC.getServerAddress()}} eventually terminates into 
> {{RpcInvocationHandler#getConnectionId()}}, via {{RPC.getServerAddress()}} -> 
> {{RPC.getConnectionIdForProxy()}} -> 
> {{RpcInvocationHandler#getConnectionId()}}. {{HAUtil}} appears to be making 
> an incorrect assumption that {{RpcInvocationHandler}} will necessarily return 
> an _active_ connection ID. {{ObserverReadProxyProvider}} demonstrates a 
> counter-example to this, since the current connection ID may be pointing at, 
> for example, an Observer NameNode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14442) Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId

2020-03-12 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-14442:
--
Fix Version/s: 3.2.2
   3.3.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Thanks [~Sushma_28] for the contribution.
Thanks [~xkrogen] & [~ayushtkn] for the review.
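
A short illustration of the disagreement being fixed (hedged against the pre-patch code 
quoted below): the resolved address tracks the proxy's current connection, which under 
ObserverReadProxyProvider may be an observer rather than the active NN, despite the 
method's name:
{code:java}
import java.net.InetSocketAddress;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.HAUtil;

public class ActiveAddressCheck {
  static void show(FileSystem fs) throws Exception {
    // Pre-patch this could be an observer's address, not the active NN's.
    InetSocketAddress addr = HAUtil.getAddressOfActive(fs);
    System.out.println("Resolved NN address: " + addr);
  }
}
{code}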

> Disagreement between HAUtil.getAddressOfActive and 
> RpcInvocationHandler.getConnectionId
> ---
>
> Key: HDFS-14442
> URL: https://issues.apache.org/jira/browse/HDFS-14442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Erik Krogen
>Assignee: Ravuri Sushma sree
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
> Attachments: HDFS-14442.001.patch, HDFS-14442.002.patch, 
> HDFS-14442.003.patch, HDFS-14442.004.patch
>
>
> While working on HDFS-14245, we noticed a discrepancy in some proxy-handling 
> code.
> The description of {{RpcInvocationHandler.getConnectionId()}} states:
> {code}
>   /**
>* Returns the connection id associated with the InvocationHandler instance.
>* @return ConnectionId
>*/
>   ConnectionId getConnectionId();
> {code}
> It does not make any claims about whether this connection ID will be an 
> active proxy or not. Yet in {{HAUtil}} we have:
> {code}
>   /**
>* Get the internet address of the currently-active NN. This should rarely 
> be
>* used, since callers of this method who connect directly to the NN using 
> the
>* resulting InetSocketAddress will not be able to connect to the active NN 
> if
>* a failover were to occur after this method has been called.
>* 
>* @param fs the file system to get the active address of.
>* @return the internet address of the currently-active NN.
>* @throws IOException if an error occurs while resolving the active NN.
>*/
>   public static InetSocketAddress getAddressOfActive(FileSystem fs)
>   throws IOException {
> if (!(fs instanceof DistributedFileSystem)) {
>   throw new IllegalArgumentException("FileSystem " + fs + " is not a 
> DFS.");
> }
> // force client address resolution.
> fs.exists(new Path("/"));
> DistributedFileSystem dfs = (DistributedFileSystem) fs;
> DFSClient dfsClient = dfs.getClient();
> return RPC.getServerAddress(dfsClient.getNamenode());
>   }
> {code}
> Where the call {{RPC.getServerAddress()}} eventually terminates into 
> {{RpcInvocationHandler#getConnectionId()}}, via {{RPC.getServerAddress()}} -> 
> {{RPC.getConnectionIdForProxy()}} -> 
> {{RpcInvocationHandler#getConnectionId()}}. {{HAUtil}} appears to be making 
> an incorrect assumption that {{RpcInvocationHandler}} will necessarily return 
> an _active_ connection ID. {{ObserverReadProxyProvider}} demonstrates a 
> counter-example to this, since the current connection ID may be pointing at, 
> for example, an Observer NameNode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15198) RBF: In Secure Mode, Router can't refresh other router's mountTableEntries

2020-03-10 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056291#comment-17056291
 ] 

Surendra Singh Lilhore commented on HDFS-15198:
---

Thanks [~zhengchenyu] for the patch.

We can't change the RouterClient.
{code:java}
-this.ugi = UserGroupInformation.getCurrentUser();
+if (UserGroupInformation.isSecurityEnabled()) {
+  this.ugi = UserGroupInformation.getLoginUser();
+} else {
+  this.ugi = UserGroupInformation.getCurrentUser();
+} {code}

It is also used in RouterAdmin, and there it should be currentUser() only. 
Please refer to DFSAdmin.java.
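
To spell out the concern, a minimal sketch of the admin CLI path (the endpoint is 
hypothetical; RouterAdmin resolves it from dfs.federation.router.admin-address): the client 
is built for the invoking user, exactly as DFSAdmin does, so switching RouterClient to the 
login user would break proxied admin commands:
{code:java}
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.server.federation.router.RouterClient;

public class AdminClientSketch {
  static RouterClient connect(Configuration conf) throws Exception {
    // Hypothetical endpoint; the CLI resolves this from the configuration.
    InetSocketAddress routerSocket = new InetSocketAddress("router-host", 8111);
    return new RouterClient(routerSocket, conf); // runs as the current user
  }
}
{code}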

> RBF: In Secure Mode, Router can't refresh other router's mountTableEntries
> --
>
> Key: HDFS-15198
> URL: https://issues.apache.org/jira/browse/HDFS-15198
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Reporter: zhengchenyu
>Assignee: zhengchenyu
>Priority: Major
> Attachments: HDFS-15198.001.patch, HDFS-15198.002.patch, 
> HDFS-15198.003.patch
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> In HDFS-13443, the mount table cache is updated immediately. The specified 
> router updates its own mount table cache immediately, then updates the others 
> via the RPC protocol refreshMountTableEntries. But in secure mode, it can't 
> refresh the other routers' caches. In the specified router's log, the error 
> looks like this:
> {code}
> 2020-02-27 22:59:07,212 WARN org.apache.hadoop.ipc.Client: Exception 
> encountered while connecting to the server : 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> 2020-02-27 22:59:07,213 ERROR 
> org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread: 
> Failed to refresh mount table entries cache at router $host:8111
> java.io.IOException: DestHost:destPort host:8111 , LocalHost:localPort 
> $host/$ip:0. Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:288)
> at 
> org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65)
> 2020-02-27 22:59:07,214 INFO 
> org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver: Added 
> new mount point /test_11 to resolver
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15218) RBF: MountTableRefresherService fails in secure cluster.

2020-03-10 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056285#comment-17056285
 ] 

Surendra Singh Lilhore commented on HDFS-15218:
---

[~elgoiri], yes, it is the same.

> RBF: MountTableRefresherService fails in secure cluster.
> ---
>
> Key: HDFS-15218
> URL: https://issues.apache.org/jira/browse/HDFS-15218
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15218.001.patch
>
>
> {code:java}
> 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed 
> to refresh mount table entries cache at router X:25020 | 
> MountTableRefresherThread.java:69
> java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort 
> XXX/XXX:0. Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284)
> at 
> org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65)
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15218) RBF: MountTableRefresherService fails in secure cluster.

2020-03-10 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15218:
--
Status: Patch Available  (was: Open)

> RBF: MountTableRefresherService fails in secure cluster.
> ---
>
> Key: HDFS-15218
> URL: https://issues.apache.org/jira/browse/HDFS-15218
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15218.001.patch
>
>
> {code:java}
> 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed 
> to refresh mount table entries cache at router X:25020 | 
> MountTableRefresherThread.java:69
> java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort 
> XXX/XXX:0. Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284)
> at 
> org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65)
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15218) RBF: MountTableRefresherService fails in secure cluster.

2020-03-10 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15218:
--
Attachment: HDFS-15218.001.patch

> RBF: MountTableRefresherService fails in secure cluster.
> ---
>
> Key: HDFS-15218
> URL: https://issues.apache.org/jira/browse/HDFS-15218
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15218.001.patch
>
>
> {code:java}
> 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed 
> to refresh mount table entries cache at router X:25020 | 
> MountTableRefresherThread.java:69
> java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort 
> XXX/XXX:0. Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
> at 
> org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284)
> at 
> org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65)
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15210) EC : File write hung when DN is shut down by admin command.

2020-03-10 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15210:
--
Status: Patch Available  (was: Open)

> EC : File write hung when DN is shut down by admin command.
> 
>
> Key: HDFS-15210
> URL: https://issues.apache.org/jira/browse/HDFS-15210
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15210.001.patch, dump.txt
>
>
> EC Blocks : blk_-9223372036854291632_10668910, 
> blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, 
> blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910
>  
> Two block DN restarted : blk_-9223372036854291630_10668910 & 
> blk_-9223372036854291632_10668910
> {code:java}
> 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8
> 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code}
>  
> Restarted streams are stuck in below stacktrace :
> {code}
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) 
> at 
> org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110)
>  at 
> org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276)
>  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at 
> org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15210) EC : File write hung when DN is shut down by admin command.

2020-03-10 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15210:
--
Attachment: HDFS-15210.001.patch

> EC : File write hung when DN is shut down by admin command.
> 
>
> Key: HDFS-15210
> URL: https://issues.apache.org/jira/browse/HDFS-15210
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15210.001.patch, dump.txt
>
>
> EC Blocks : blk_-9223372036854291632_10668910, 
> blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, 
> blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910
>  
> Two block DN restarted : blk_-9223372036854291630_10668910 & 
> blk_-9223372036854291632_10668910
> {code:java}
> 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8
> 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code}
>  
> Restarted streams are stuck in below stacktrace :
> {code}
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) 
> at 
> org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110)
>  at 
> org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276)
>  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at 
> org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.

2020-03-10 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15135:
--
Fix Version/s: 3.2.2
   3.3.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Thanks [~Sushma_28] for the contribution.

> EC : ArrayIndexOutOfBoundsException in 
> BlockRecoveryWorker#RecoveryTaskStriped.
> ---
>
> Key: HDFS-15135
> URL: https://issues.apache.org/jira/browse/HDFS-15135
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Reporter: Surendra Singh Lilhore
>Assignee: Ravuri Sushma sree
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
> Attachments: HDFS-15135-branch-3.2.001.patch, 
> HDFS-15135-branch-3.2.002.patch, HDFS-15135.001.patch, HDFS-15135.002.patch, 
> HDFS-15135.003.patch, HDFS-15135.004.patch, HDFS-15135.005.patch
>
>
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 8
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464)
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602)
>at java.lang.Thread.run(Thread.java:745) {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15218) RBF : MountTableRefresherService fails in secure cluster.

2020-03-10 Thread Surendra Singh Lilhore (Jira)
Surendra Singh Lilhore created HDFS-15218:
-

 Summary: RBF : MountTableRefresherService fails in secure cluster.
 Key: HDFS-15218
 URL: https://issues.apache.org/jira/browse/HDFS-15218
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: rbf
Affects Versions: 3.1.1
Reporter: Surendra Singh Lilhore
Assignee: Surendra Singh Lilhore


{code:java}
2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed to 
refresh mount table entries cache at router X:25020 | 
MountTableRefresherThread.java:69
java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort 
XXX/XXX:0. Failed on local exception: java.io.IOException: 
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]
at 
org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284)
at 
org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65)
 {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.

2020-03-10 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055688#comment-17055688
 ] 

Surendra Singh Lilhore commented on HDFS-15135:
---

+1, will merge today.
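
As a hypothetical sketch of the guard (illustrative names; not necessarily the committed 
patch), the derived internal-block index should be validated before it is used to address 
the striped-block arrays:
{code:java}
// Sketch: skip a replica whose derived index falls outside data + parity
// instead of indexing the arrays and hitting ArrayIndexOutOfBoundsException.
int index = (int) (replicaBlockId - stripedGroupId); // illustrative derivation
if (index < 0 || index >= dataBlkNum + parityBlkNum) {
  LOG.warn("Ignoring replica with invalid block index {}", index);
} else {
  blockLengths[index] = replicaNumBytes;
}
{code}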

> EC : ArrayIndexOutOfBoundsException in 
> BlockRecoveryWorker#RecoveryTaskStriped.
> ---
>
> Key: HDFS-15135
> URL: https://issues.apache.org/jira/browse/HDFS-15135
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Reporter: Surendra Singh Lilhore
>Assignee: Ravuri Sushma sree
>Priority: Major
> Attachments: HDFS-15135-branch-3.2.001.patch, 
> HDFS-15135-branch-3.2.002.patch, HDFS-15135.001.patch, HDFS-15135.002.patch, 
> HDFS-15135.003.patch, HDFS-15135.004.patch, HDFS-15135.005.patch
>
>
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 8
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464)
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602)
>at java.lang.Thread.run(Thread.java:745) {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.

2020-03-08 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054328#comment-17054328
 ] 

Surendra Singh Lilhore commented on HDFS-15135:
---

Please fix the checkstyle issues.

> EC : ArrayIndexOutOfBoundsException in 
> BlockRecoveryWorker#RecoveryTaskStriped.
> ---
>
> Key: HDFS-15135
> URL: https://issues.apache.org/jira/browse/HDFS-15135
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Reporter: Surendra Singh Lilhore
>Assignee: Ravuri Sushma sree
>Priority: Major
> Attachments: HDFS-15135-branch-3.2.001.patch, HDFS-15135.001.patch, 
> HDFS-15135.002.patch, HDFS-15135.003.patch, HDFS-15135.004.patch, 
> HDFS-15135.005.patch
>
>
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 8
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464)
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602)
>at java.lang.Thread.run(Thread.java:745) {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15210) EC : File write hung when DN is shut down by admin command.

2020-03-06 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15210:
--
Attachment: dump.txt

> EC : File write hung when DN is shut down by admin command.
> 
>
> Key: HDFS-15210
> URL: https://issues.apache.org/jira/browse/HDFS-15210
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: dump.txt
>
>
> EC Blocks : blk_-9223372036854291632_10668910, 
> blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, 
> blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910
>  
> Two block DN restarted : blk_-9223372036854291630_10668910 & 
> blk_-9223372036854291632_10668910
> {code:java}
> 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8
> 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code}
>  
> Restarted streams are stuck in below stacktrace :
> {code}
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) 
> at 
> org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110)
>  at 
> org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276)
>  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at 
> org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15210) EC : File write hanged when DN is shutdown by admin command.

2020-03-06 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15210:
--
Description: 
EC Blocks : blk_-9223372036854291632_10668910, 
blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, 
blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910

 

DNs holding two of the blocks restarted: blk_-9223372036854291630_10668910 & 
blk_-9223372036854291632_10668910
{code:java}
2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
OOB_RESTART downstreamAckTimeNanos: 0 flag: 8
2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code}
 

Restarted streams are stuck in the stacktrace below:
{code}
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at 
org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110)
 at 
org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140)
 at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540)
 at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276)
 at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at 
org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46)
{code}

  was:
EC Blocks : blk_-9223372036854291632_10668910, 
blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, 
blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910

 

DNs holding two of the blocks restarted: blk_-9223372036854291630_10668910 & 
blk_-9223372036854291632_10668910
{code:java}
2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
OOB_RESTART downstreamAckTimeNanos: 0 flag: 8
2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code}
 

Restarted streams are stuck in the stacktrace below:
{noformat}
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at 
org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110)
 at 
org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140)
 at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540)
 at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276)
 at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at 
org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46){noformat}


> EC : File write hanged when DN is shutdown by admin command.
> 
>
> Key: HDFS-15210
> URL: https://issues.apache.org/jira/browse/HDFS-15210
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
>
> EC Blocks : blk_-9223372036854291632_10668910, 
> blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, 
> blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910
>  
> DNs holding two of the blocks restarted: blk_-9223372036854291630_10668910 & 
> blk_-9223372036854291632_10668910
> {code:java}
> 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8
> 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code}
>  
> Restarted streams are stuck in the stacktrace below:
> {code}
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) 
> at 
> org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110)
>  at 
> org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276)
>  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at 
> org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15210) EC : File write hand when DN is shutdown by admin command.

2020-03-06 Thread Surendra Singh Lilhore (Jira)
Surendra Singh Lilhore created HDFS-15210:
-

 Summary: EC : File write hand when DN is shutdown by admin command.
 Key: HDFS-15210
 URL: https://issues.apache.org/jira/browse/HDFS-15210
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ec
Affects Versions: 3.1.1
Reporter: Surendra Singh Lilhore
Assignee: Surendra Singh Lilhore


EC Blocks : blk_-9223372036854291632_10668910, 
blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, 
blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910

 

DNs holding two of the blocks restarted: blk_-9223372036854291630_10668910 & 
blk_-9223372036854291632_10668910
{code:java}
2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
OOB_RESTART downstreamAckTimeNanos: 0 flag: 8
2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code}
 

Restarted streams are stuck in the stacktrace below:
{noformat}
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at 
org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110)
 at 
org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140)
 at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540)
 at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276)
 at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at 
org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46){noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15210) EC : File write hanged when DN is shutdown by admin command.

2020-03-06 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15210:
--
Summary: EC : File write hanged when DN is shutdown by admin command.  
(was: EC : File write hand when DN is shutdown by admin command.)

> EC : File write hanged when DN is shutdown by admin command.
> 
>
> Key: HDFS-15210
> URL: https://issues.apache.org/jira/browse/HDFS-15210
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
>
> EC Blocks : blk_-9223372036854291632_10668910, 
> blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, 
> blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910
>  
> DNs holding two of the blocks restarted: blk_-9223372036854291630_10668910 & 
> blk_-9223372036854291632_10668910
> {code:java}
> 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8
> 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: 
> OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code}
>  
> Restarted streams are stuck in the stacktrace below:
> {noformat}
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) 
> at 
> org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110)
>  at 
> org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540)
>  at 
> org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276)
>  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at 
> org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46){noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14442) Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId

2020-03-04 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051819#comment-17051819
 ] 

Surendra Singh Lilhore commented on HDFS-14442:
---

+1
{quote}v003 patch LGTM, I will commit once I get a chance to verify the tests 
locally.
{quote}
[~xkrogen], any comments?

> Disagreement between HAUtil.getAddressOfActive and 
> RpcInvocationHandler.getConnectionId
> ---
>
> Key: HDFS-14442
> URL: https://issues.apache.org/jira/browse/HDFS-14442
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.3.0
>Reporter: Erik Krogen
>Assignee: Ravuri Sushma sree
>Priority: Major
> Attachments: HDFS-14442.001.patch, HDFS-14442.002.patch, 
> HDFS-14442.003.patch, HDFS-14442.004.patch
>
>
> While working on HDFS-14245, we noticed a discrepancy in some proxy-handling 
> code.
> The description of {{RpcInvocationHandler.getConnectionId()}} states:
> {code}
>   /**
>* Returns the connection id associated with the InvocationHandler instance.
>* @return ConnectionId
>*/
>   ConnectionId getConnectionId();
> {code}
> It does not make any claims about whether this connection ID will be an 
> active proxy or not. Yet in {{HAUtil}} we have:
> {code}
>   /**
>* Get the internet address of the currently-active NN. This should rarely 
> be
>* used, since callers of this method who connect directly to the NN using 
> the
>* resulting InetSocketAddress will not be able to connect to the active NN 
> if
>* a failover were to occur after this method has been called.
>* 
>* @param fs the file system to get the active address of.
>* @return the internet address of the currently-active NN.
>* @throws IOException if an error occurs while resolving the active NN.
>*/
>   public static InetSocketAddress getAddressOfActive(FileSystem fs)
>   throws IOException {
> if (!(fs instanceof DistributedFileSystem)) {
>   throw new IllegalArgumentException("FileSystem " + fs + " is not a 
> DFS.");
> }
> // force client address resolution.
> fs.exists(new Path("/"));
> DistributedFileSystem dfs = (DistributedFileSystem) fs;
> DFSClient dfsClient = dfs.getClient();
> return RPC.getServerAddress(dfsClient.getNamenode());
>   }
> {code}
> Where the call {{RPC.getServerAddress()}} eventually terminates into 
> {{RpcInvocationHandler#getConnectionId()}}, via {{RPC.getServerAddress()}} -> 
> {{RPC.getConnectionIdForProxy()}} -> 
> {{RpcInvocationHandler#getConnectionId()}}. {{HAUtil}} appears to be making 
> an incorrect assumption that {{RpcInvocationHandler}} will necessarily return 
> an _active_ connection ID. {{ObserverReadProxyProvider}} demonstrates a 
> counter-example to this, since the current connection ID may be pointing at, 
> for example, an Observer NameNode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14977) Quota Usage and Content Summary are not the same in Truncate with Snapshot

2020-03-03 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050947#comment-17050947
 ] 

Surendra Singh Lilhore commented on HDFS-14977:
---

+1

> Quota Usage and Content Summary are not the same in Truncate with Snapshot 
> ---
>
> Key: HDFS-14977
> URL: https://issues.apache.org/jira/browse/HDFS-14977
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-14977.001.patch, HDFS-14977.002.patch, 
> HDFS-14977.003.patch
>
>
> steps: hdfs dfs -mkdir /dir
>            hdfs dfs -put file /dir          (file size = 10 bytes)
>            hdfs dfsadmin -allowSnapshot /dir
>            hdfs dfs -createSnapshot /dir s1
> space consumed with QuotaUsage and Content Summary is 30 bytes
>            hdfs dfs -truncate -w 5 /dir/file
> space consumed with QuotaUsage and Content Summary is 45 bytes
>            hdfs dfs -deleteSnapshot /dir s1
> space consumed with QuotaUsage is 45 bytes and Content Summary is 15 bytes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15200) Delete Corrupt Replica Immediately Irrespective of Replicas On Stale Storage

2020-03-02 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049202#comment-17049202
 ] 

Surendra Singh Lilhore commented on HDFS-15200:
---

I feel we can delete the corrupt replica because there is no chance of it getting 
corrected. The replica on stale storage will be reported live in the next BR, 
hopefully :).

[~arp], [~aajisaka], [~weichiu], any thoughts on this?
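
To make the suggestion concrete, here is a minimal self-contained sketch of the 
decision rule (a pure-function model with hypothetical names, not the actual 
{{BlockManager.invalidateBlock(..)}} code; the committed patch may differ):
{code:java}
// Sketch: "corrupt" wins over "replicas on stale storage", so a corrupt
// replica is invalidated immediately instead of being postponed.
final class InvalidateDecisionSketch {
  /** @return true if deletion of the replica should be postponed. */
  static boolean shouldPostpone(int replicasOnStaleNodes, boolean replicaIsCorrupt) {
    if (replicaIsCorrupt) {
      return false; // a corrupt replica can never be corrected; delete it now
    }
    return replicasOnStaleNodes > 0; // keep the existing stale-storage check
  }

  public static void main(String[] args) {
    System.out.println(shouldPostpone(2, true));  // false: corrupt, delete now
    System.out.println(shouldPostpone(2, false)); // true: postpone for stale BRs
  }
}
{code}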

> Delete Corrupt Replica Immediately Irrespective of Replicas On Stale Storage 
> -
>
> Key: HDFS-15200
> URL: https://issues.apache.org/jira/browse/HDFS-15200
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
>
> Presently {{invalidateBlock(..)}}, before adding a replica into invalidates, 
> checks whether any block replica is on stale storage; if any replica is on 
> stale storage, it postpones deletion of the replica.
> Here:
> {code:java}
>// Check how many copies we have of the block
> if (nr.replicasOnStaleNodes() > 0) {
>   blockLog.debug("BLOCK* invalidateBlocks: postponing " +
>   "invalidation of {} on {} because {} replica(s) are located on " +
>   "nodes with potentially out-of-date block reports", b, dn,
>   nr.replicasOnStaleNodes());
>   postponeBlock(b.getCorrupted());
>   return false;
> {code}
>  
> In the case of a corrupt replica, we can skip this logic and delete the corrupt 
> replica immediately, as a corrupt replica can't get corrected.
> One outcome of this behavior presently is the namenodes showing different block 
> states post failover:
> If a replica is marked corrupt, the Active NN will mark it as corrupt, mark it 
> for deletion, and remove it from corruptReplicas and the excessRedundancyMap.
> If failover happens before the deletion of the replica, the standby Namenode 
> will mark all the storages as stale.
> It will then start processing IBRs; since the replicas would be on stale 
> storage, it will skip the deletion and the removal from corruptReplicas.
> Hence both namenodes will show different numbers and different corrupt 
> replicas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15159) Prevent adding same DN multiple times in PendingReconstructionBlocks

2020-03-01 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048640#comment-17048640
 ] 

Surendra Singh Lilhore commented on HDFS-15159:
---

[~hemanthboyina], thanks for the patch.

Better to add a test here. You can mock the DN commands and assert the scheduled 
replica targets, along the lines of the sketch below.
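
A minimal self-contained illustration of the assertion idea (hypothetical types, 
not the real {{PendingReconstructionBlocks}} API):
{code:java}
import java.util.*;

// Sketch: schedule the same target DN twice for a block and assert that it is
// recorded only once -- the behavior this Jira wants to guarantee.
final class PendingTargetsSketch {
  private final Map<String, Set<String>> targetsByBlock = new HashMap<>();

  /** Adds a target DN for a block, ignoring duplicates (the desired behavior). */
  void addTarget(String blockId, String datanode) {
    targetsByBlock.computeIfAbsent(blockId, k -> new LinkedHashSet<>()).add(datanode);
  }

  List<String> getTargets(String blockId) {
    return new ArrayList<>(targetsByBlock.getOrDefault(blockId, Collections.emptySet()));
  }

  public static void main(String[] args) {
    PendingTargetsSketch pending = new PendingTargetsSketch();
    pending.addTarget("blk_1", "dn1");
    pending.addTarget("blk_1", "dn1"); // duplicate add, e.g. a re-sent DN command
    if (pending.getTargets("blk_1").size() != 1) {
      throw new AssertionError("same DN must not be added twice");
    }
    System.out.println("targets = " + pending.getTargets("blk_1"));
  }
}
{code}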

> Prevent adding same DN multiple times in PendingReconstructionBlocks
> 
>
> Key: HDFS-15159
> URL: https://issues.apache.org/jira/browse/HDFS-15159
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-15159.001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14977) Quota Usage and Content Summary are not the same in Truncate with Snapshot

2020-03-01 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048637#comment-17048637
 ] 

Surendra Singh Lilhore edited comment on HDFS-14977 at 3/1/20 5:23 PM:
---

Thanks [~hemanthboyina] for the patch.

Changes look good. Some comments on the test code.

Simplify the variables as below; remove the string variables.
{code:java}
Path root = new Path("/");
Path dirPath = new Path(root, "dir");
assertTrue(fs.mkdirs(dirPath));
Path filePath = new Path(dirPath, "file");
{code}
[~elgoiri], can we remove the {{csSpaceConsumed, qoSpaceConsumed}} variables and 
call the functions directly in the assert, like below?
{code:java}
assertEquals(fs.getContentSummary(root).getSpaceConsumed(),
    fs.getQuotaUsage(root).getSpaceConsumed());{code}


was (Author: surendrasingh):
Thanks [~hemanthboyina] for the patch.

Changes look good. Some comments on the test code.

Simplify the variables as below; remove the string variables.
{code:java}
Path root = new Path("/");
Path dirPath = new Path(root, "dir");
assertTrue(fs.mkdirs(dirPath));
Path filePath = new Path(dirPath, "file");
{code}
[~elgoiri], can we remove the \{{ csSpaceConsumed, qoSpaceConsumed}} variables and 
call the functions directly in the assert, like below?
{code:java}
assertEquals(fs.getContentSummary(root).getSpaceConsumed(),
    fs.getQuotaUsage(root).getSpaceConsumed());{code}

> Quota Usage and Content Summary are not the same in Truncate with Snapshot 
> ---
>
> Key: HDFS-14977
> URL: https://issues.apache.org/jira/browse/HDFS-14977
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-14977.001.patch, HDFS-14977.002.patch
>
>
> steps: hdfs dfs -mkdir /dir
>            hdfs dfs -put file /dir          (file size = 10 bytes)
>            hdfs dfsadmin -allowSnapshot /dir
>            hdfs dfs -createSnapshot /dir s1
> space consumed with QuotaUsage and Content Summary is 30 bytes
>            hdfs dfs -truncate -w 5 /dir/file
> space consumed with QuotaUsage and Content Summary is 45 bytes
>            hdfs dfs -deleteSnapshot /dir s1
> space consumed with QuotaUsage is 45 bytes and Content Summary is 15 bytes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14977) Quota Usage and Content Summary are not the same in Truncate with Snapshot

2020-03-01 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048637#comment-17048637
 ] 

Surendra Singh Lilhore commented on HDFS-14977:
---

Thanks [~hemanthboyina] for the patch.

Changes look good. Some comments on the test code.

Simplify the variables as below; remove the string variables.
{code:java}
Path root = new Path("/");
Path dirPath = new Path(root, "dir");
assertTrue(fs.mkdirs(dirPath));
Path filePath = new Path(dirPath, "file");
{code}
[~elgoiri], can we remove the \{{ csSpaceConsumed, qoSpaceConsumed}} variables and 
call the functions directly in the assert, like below?
{code:java}
assertEquals(fs.getContentSummary(root).getSpaceConsumed(),
    fs.getQuotaUsage(root).getSpaceConsumed());{code}
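
For reference, the numbers in the description work out as 10 * 3 = 30 bytes 
before the truncate, (5 + 10) * 3 = 45 bytes while the snapshot holds the old 
tail, and 5 * 3 = 15 bytes once the snapshot is gone. The whole scenario can be 
scripted in a test roughly like this (a sketch assuming {{fs}} is the 
{{DistributedFileSystem}} of a running {{MiniDFSCluster}}; the committed test 
may differ):
{code:java}
Path root = new Path("/");
Path dirPath = new Path(root, "dir");
assertTrue(fs.mkdirs(dirPath));
Path filePath = new Path(dirPath, "file");
DFSTestUtil.createFile(fs, filePath, 10, (short) 3, 0L); // 10 bytes, replication 3
fs.allowSnapshot(dirPath);
fs.createSnapshot(dirPath, "s1"); // consumed: 10 * 3 = 30 bytes
fs.truncate(filePath, 5);         // consumed: (5 + 10) * 3 = 45 bytes
                                  // (may need to wait for the truncate to finish)
fs.deleteSnapshot(dirPath, "s1"); // expected: 5 * 3 = 15 bytes
// The bug: QuotaUsage stays at 45 bytes while ContentSummary drops to 15.
assertEquals(fs.getContentSummary(root).getSpaceConsumed(),
    fs.getQuotaUsage(root).getSpaceConsumed());
{code}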

> Quota Usage and Content Summary are not the same in Truncate with Snapshot 
> ---
>
> Key: HDFS-14977
> URL: https://issues.apache.org/jira/browse/HDFS-14977
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-14977.001.patch, HDFS-14977.002.patch
>
>
> steps: hdfs dfs -mkdir /dir
>            hdfs dfs -put file /dir          (file size = 10 bytes)
>            hdfs dfsadmin -allowSnapshot /dir
>            hdfs dfs -createSnapshot /dir s1
> space consumed with QuotaUsage and Content Summary is 30 bytes
>            hdfs dfs -truncate -w 5 /dir/file
> space consumed with QuotaUsage and Content Summary is 45 bytes
>            hdfs dfs -deleteSnapshot /dir s1
> space consumed with QuotaUsage is 45 bytes and Content Summary is 15 bytes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15199) NPE in BlockSender

2020-02-28 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15199:
--
Fix Version/s: 3.2.2
   3.1.4
   3.3.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed to trunk, branch-3.2, branch-3.1.

> NPE in BlockSender
> --
>
> Key: HDFS-15199
> URL: https://issues.apache.org/jira/browse/HDFS-15199
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-15199-01.patch
>
>
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:662)
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:819)
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:766)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:607)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:152)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:104)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290)
>   at java.lang.Thread.run(Thread.java:748)
> 2020-02-28 11:49:13,357 [stripedRead-0] INFO  datanode.DataNode 
> (StripedBlockReader.java:call(182)) - Premature EOF reading from 
> org.apache.hadoop.net.SocketInputStream@8a99d11
> 2020-02-28 11:49:13,362 [ResponseProcessor for block 
> BP-1162371257-10.19.127.112-1582870703783:blk_-9223372036854775774_1004] WARN 
>  hdfs.DataStreamer (DataStreamer.java:run(1217)) - Exception for 
> BP-1162371257-10.19.127.112-1582870703783:blk
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15199) NPE in BlockSender

2020-02-28 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17047607#comment-17047607
 ] 

Surendra Singh Lilhore commented on HDFS-15199:
---

Thanks [~ayushtkn] for the contribution.

> NPE in BlockSender
> --
>
> Key: HDFS-15199
> URL: https://issues.apache.org/jira/browse/HDFS-15199
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-15199-01.patch
>
>
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:662)
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:819)
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:766)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:607)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:152)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:104)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290)
>   at java.lang.Thread.run(Thread.java:748)
> 2020-02-28 11:49:13,357 [stripedRead-0] INFO  datanode.DataNode 
> (StripedBlockReader.java:call(182)) - Premature EOF reading from 
> org.apache.hadoop.net.SocketInputStream@8a99d11
> 2020-02-28 11:49:13,362 [ResponseProcessor for block 
> BP-1162371257-10.19.127.112-1582870703783:blk_-9223372036854775774_1004] WARN 
>  hdfs.DataStreamer (DataStreamer.java:run(1217)) - Exception for 
> BP-1162371257-10.19.127.112-1582870703783:blk
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15199) NPE in BlockSender

2020-02-28 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17047574#comment-17047574
 ] 

Surendra Singh Lilhore commented on HDFS-15199:
---

+1

> NPE in BlockSender
> --
>
> Key: HDFS-15199
> URL: https://issues.apache.org/jira/browse/HDFS-15199
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Attachments: HDFS-15199-01.patch
>
>
> {noformat}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:662)
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:819)
>   at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:766)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:607)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:152)
>   at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:104)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290)
>   at java.lang.Thread.run(Thread.java:748)
> 2020-02-28 11:49:13,357 [stripedRead-0] INFO  datanode.DataNode 
> (StripedBlockReader.java:call(182)) - Premature EOF reading from 
> org.apache.hadoop.net.SocketInputStream@8a99d11
> 2020-02-28 11:49:13,362 [ResponseProcessor for block 
> BP-1162371257-10.19.127.112-1582870703783:blk_-9223372036854775774_1004] WARN 
>  hdfs.DataStreamer (DataStreamer.java:run(1217)) - Exception for 
> BP-1162371257-10.19.127.112-1582870703783:blk
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15167) Block Report Interval shouldn't be reset apart from first Block Report

2020-02-27 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15167:
--
Fix Version/s: 3.3.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed to trunk.

Thanks [~elgoiri] for the review and [~ayushtkn] for the contribution.

> Block Report Interval shouldn't be reset apart from first Block Report
> --
>
> Key: HDFS-15167
> URL: https://issues.apache.org/jira/browse/HDFS-15167
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-15167-01.patch, HDFS-15167-02.patch, 
> HDFS-15167-03.patch, HDFS-15167-04.patch, HDFS-15167-05.patch, 
> HDFS-15167-06.patch, HDFS-15167-07.patch, HDFS-15167-08.patch
>
>
> Presently the BlockReport interval is reset even when the BR is manually 
> triggered or the BR is triggered for a diskError, which isn't required. As per 
> the code comment, it is intended for the first BR only:
> {code:java}
>   // If we have sent the first set of block reports, then wait a random
>   // time before we start the periodic block reports.
>   if (resetBlockReportTime) {
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15167) Block Report Interval shouldn't be reset apart from first Block Report

2020-02-26 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046242#comment-17046242
 ] 

Surendra Singh Lilhore commented on HDFS-15167:
---

+1

> Block Report Interval shouldn't be reset apart from first Block Report
> --
>
> Key: HDFS-15167
> URL: https://issues.apache.org/jira/browse/HDFS-15167
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Attachments: HDFS-15167-01.patch, HDFS-15167-02.patch, 
> HDFS-15167-03.patch, HDFS-15167-04.patch, HDFS-15167-05.patch, 
> HDFS-15167-06.patch, HDFS-15167-07.patch, HDFS-15167-08.patch
>
>
> Presently the BlockReport interval is reset even when the BR is manually 
> triggered or the BR is triggered for a diskError, which isn't required. As per 
> the code comment, it is intended for the first BR only:
> {code:java}
>   // If we have sent the first set of block reports, then wait a random
>   // time before we start the periodic block reports.
>   if (resetBlockReportTime) {
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15167) Block Report Interval shouldn't be reset apart from first Block Report

2020-02-15 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037730#comment-17037730
 ] 

Surendra Singh Lilhore commented on HDFS-15167:
---

Thanks [~ayushtkn] for the patch.

One doubt: do we need to use {{resetBlockReportTime}} in 
{{scheduleBlockReport()}}? A sketch of the intended behavior follows.
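
To pin down the semantics being discussed, here is a self-contained model of the 
scheduling rule (a pure-function sketch, not the real 
{{BPServiceActor.Scheduler}} code):
{code:java}
import java.util.concurrent.ThreadLocalRandom;

// Sketch: only the very first block report gets a randomized delay; manually
// triggered or diskError-triggered BRs must not perturb the periodic schedule.
final class BlockReportScheduleSketch {
  static long nextBlockReportDelayMs(long intervalMs, boolean resetBlockReportTime) {
    if (resetBlockReportTime) {
      // first BR: spread reports out so all DNs don't report at the same time
      return ThreadLocalRandom.current().nextLong(intervalMs);
    }
    return intervalMs; // later BRs: keep the fixed periodic interval
  }

  public static void main(String[] args) {
    long sixHours = 21_600_000L;
    System.out.println(nextBlockReportDelayMs(sixHours, true));  // random in [0, 6h)
    System.out.println(nextBlockReportDelayMs(sixHours, false)); // exactly 6h
  }
}
{code}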

> Block Report Interval shouldn't be reset apart from first Block Report
> --
>
> Key: HDFS-15167
> URL: https://issues.apache.org/jira/browse/HDFS-15167
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Attachments: HDFS-15167-01.patch, HDFS-15167-02.patch, 
> HDFS-15167-03.patch, HDFS-15167-04.patch, HDFS-15167-05.patch
>
>
> Presently the BlockReport interval is reset even when the BR is manually 
> triggered or the BR is triggered for a diskError, which isn't required. As per 
> the code comment, it is intended for the first BR only:
> {code:java}
>   // If we have sent the first set of block reports, then wait a random
>   // time before we start the periodic block reports.
>   if (resetBlockReportTime) {
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.

2020-02-15 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037723#comment-17037723
 ] 

Surendra Singh Lilhore edited comment on HDFS-15135 at 2/16/20 7:20 AM:


+1

Committed to trunk.
[~Sushma_28], please attach the patch for branch-3.2; the test code needs to be 
rebased.


was (Author: surendrasingh):
Committed to trunk.
[~Sushma_28], please attach the patch for branch-3.2; the test code needs to be 
rebased.

> EC : ArrayIndexOutOfBoundsException in 
> BlockRecoveryWorker#RecoveryTaskStriped.
> ---
>
> Key: HDFS-15135
> URL: https://issues.apache.org/jira/browse/HDFS-15135
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Reporter: Surendra Singh Lilhore
>Assignee: Ravuri Sushma sree
>Priority: Major
> Attachments: HDFS-15135.001.patch, HDFS-15135.002.patch, 
> HDFS-15135.003.patch, HDFS-15135.004.patch, HDFS-15135.005.patch
>
>
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 8
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464)
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602)
>at java.lang.Thread.run(Thread.java:745) {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.

2020-02-15 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037723#comment-17037723
 ] 

Surendra Singh Lilhore commented on HDFS-15135:
---

Committed to trunk.
[~Sushma_28], please attach the patch for branch-3.2; the test code needs to be 
rebased.

> EC : ArrayIndexOutOfBoundsException in 
> BlockRecoveryWorker#RecoveryTaskStriped.
> ---
>
> Key: HDFS-15135
> URL: https://issues.apache.org/jira/browse/HDFS-15135
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Reporter: Surendra Singh Lilhore
>Assignee: Ravuri Sushma sree
>Priority: Major
> Attachments: HDFS-15135.001.patch, HDFS-15135.002.patch, 
> HDFS-15135.003.patch, HDFS-15135.004.patch, HDFS-15135.005.patch
>
>
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 8
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464)
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602)
>at java.lang.Thread.run(Thread.java:745) {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.

2020-02-13 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036390#comment-17036390
 ] 

Surendra Singh Lilhore commented on HDFS-15135:
---

New build triggered.

> EC : ArrayIndexOutOfBoundsException in 
> BlockRecoveryWorker#RecoveryTaskStriped.
> ---
>
> Key: HDFS-15135
> URL: https://issues.apache.org/jira/browse/HDFS-15135
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Reporter: Surendra Singh Lilhore
>Assignee: Ravuri Sushma sree
>Priority: Major
> Attachments: HDFS-15135.001.patch, HDFS-15135.002.patch, 
> HDFS-15135.003.patch, HDFS-15135.004.patch, HDFS-15135.005.patch
>
>
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 8
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464)
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602)
>at java.lang.Thread.run(Thread.java:745) {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15086) Block scheduled counter never gets decremented if the block got deleted before replication.

2020-02-13 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15086:
--
Fix Version/s: 3.2.2
   3.1.4
   3.3.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed to branch-3.2 & branch-3.1

> Block scheduled counter never gets decremented if the block got deleted before 
> replication.
> ---
>
> Key: HDFS-15086
> URL: https://issues.apache.org/jira/browse/HDFS-15086
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: hemanthboyina
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-15086.001.patch, HDFS-15086.002.patch, 
> HDFS-15086.003.patch, HDFS-15086.004.patch, HDFS-15086.005.patch
>
>
> If a block is scheduled for replication and the same file gets deleted, then 
> this block will be reported as a bad block by the DN. 
> For this failed replication work, the scheduled block counter never gets 
> decremented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15086) Block scheduled counter never gets decremented if the block got deleted before replication.

2020-02-13 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036144#comment-17036144
 ] 

Surendra Singh Lilhore commented on HDFS-15086:
---

Committed to trunk.

> Block scheduled counter never gets decremented if the block got deleted before 
> replication.
> ---
>
> Key: HDFS-15086
> URL: https://issues.apache.org/jira/browse/HDFS-15086
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-15086.001.patch, HDFS-15086.002.patch, 
> HDFS-15086.003.patch, HDFS-15086.004.patch, HDFS-15086.005.patch
>
>
> If a block is scheduled for replication and the same file gets deleted, then 
> this block will be reported as a bad block by the DN. 
> For this failed replication work, the scheduled block counter never gets 
> decremented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.

2020-02-13 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036127#comment-17036127
 ] 

Surendra Singh Lilhore commented on HDFS-15135:
---

Please handle the checkstyle issues.

> EC : ArrayIndexOutOfBoundsException in 
> BlockRecoveryWorker#RecoveryTaskStriped.
> ---
>
> Key: HDFS-15135
> URL: https://issues.apache.org/jira/browse/HDFS-15135
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Reporter: Surendra Singh Lilhore
>Assignee: Ravuri Sushma sree
>Priority: Major
> Attachments: HDFS-15135.001.patch, HDFS-15135.002.patch, 
> HDFS-15135.003.patch, HDFS-15135.004.patch, HDFS-15135.005.patch
>
>
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 8
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464)
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602)
>at java.lang.Thread.run(Thread.java:745) {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15086) Block scheduled counter never gets decremented if the block got deleted before replication.

2020-02-12 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035925#comment-17035925
 ] 

Surendra Singh Lilhore commented on HDFS-15086:
---

+1

> Block scheduled counter never gets decremented if the block got deleted before 
> replication.
> ---
>
> Key: HDFS-15086
> URL: https://issues.apache.org/jira/browse/HDFS-15086
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-15086.001.patch, HDFS-15086.002.patch, 
> HDFS-15086.003.patch, HDFS-15086.004.patch, HDFS-15086.005.patch
>
>
> If a block is scheduled for replication and the same file gets deleted, then 
> this block will be reported as a bad block by the DN. 
> For this failed replication work, the scheduled block counter never gets 
> decremented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15086) Block scheduled counter never gets decremented if the block got deleted before replication.

2020-02-10 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033617#comment-17033617
 ] 

Surendra Singh Lilhore commented on HDFS-15086:
---

Triggered a new build.

> Block scheduled counter never gets decremented if the block got deleted before 
> replication.
> ---
>
> Key: HDFS-15086
> URL: https://issues.apache.org/jira/browse/HDFS-15086
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-15086.001.patch, HDFS-15086.002.patch, 
> HDFS-15086.003.patch
>
>
> If a block is scheduled for replication and the same file gets deleted, then 
> this block will be reported as a bad block by the DN. 
> For this failed replication work, the scheduled block counter never gets 
> decremented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15086) Block scheduled counter never gets decremented if the block got deleted before replication.

2020-02-07 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032351#comment-17032351
 ] 

Surendra Singh Lilhore commented on HDFS-15086:
---

One more thing: can you create a new Jira for this? It is not related to this 
Jira.
{code:java}
+  List<DatanodeStorageInfo> targets =
+  pendingReconstruction.getTargets(rw.getBlock());
+  if (targets != null) {
+for (DatanodeStorageInfo dn : targets) {
+  if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
+excludedNodes.add(dn.getDatanodeDescriptor());
+  }
+}
+  } {code}

> Block scheduled counter never gets decremented if the block got deleted before 
> replication.
> ---
>
> Key: HDFS-15086
> URL: https://issues.apache.org/jira/browse/HDFS-15086
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-15086.001.patch, HDFS-15086.002.patch
>
>
> If a block is scheduled for replication and the same file gets deleted, then 
> this block will be reported as a bad block by the DN. 
> For this failed replication work, the scheduled block counter never gets 
> decremented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.

2020-02-07 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore reassigned HDFS-15135:
-

Assignee: Surendra Singh Lilhore  (was: Ravuri Sushma sree)

> EC : ArrayIndexOutOfBoundsException in 
> BlockRecoveryWorker#RecoveryTaskStriped.
> ---
>
> Key: HDFS-15135
> URL: https://issues.apache.org/jira/browse/HDFS-15135
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15135.001.patch, HDFS-15135.002.patch
>
>
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 8
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464)
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602)
>at java.lang.Thread.run(Thread.java:745) {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.

2020-02-07 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore reassigned HDFS-15135:
-

Assignee: Ravuri Sushma sree  (was: Surendra Singh Lilhore)

> EC : ArrayIndexOutOfBoundsException in 
> BlockRecoveryWorker#RecoveryTaskStriped.
> ---
>
> Key: HDFS-15135
> URL: https://issues.apache.org/jira/browse/HDFS-15135
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Reporter: Surendra Singh Lilhore
>Assignee: Ravuri Sushma sree
>Priority: Major
> Attachments: HDFS-15135.001.patch, HDFS-15135.002.patch
>
>
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 8
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464)
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602)
>at java.lang.Thread.run(Thread.java:745) {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.

2020-02-07 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032275#comment-17032275
 ] 

Surendra Singh Lilhore edited comment on HDFS-15135 at 2/7/20 10:13 AM:


Thanks [~Sushma_28] for the patch.

Changes look good.

Some comments related to the test case:
 # Move your UT into the {{TestBlockRecovery}} class.
 # No need to add a LOG in the test case; just add a comment instead.
 # Handle the whitespace and check-style issues.


was (Author: surendrasingh):
Thanks [~Sushma_28] for the patch.

Changes look good.

Some comments related to the test case:
 # Move your UT into the {{TestBlockRecovery}} class.
 # No need to add a LOG in the test case; just add a comment instead.

> EC : ArrayIndexOutOfBoundsException in 
> BlockRecoveryWorker#RecoveryTaskStriped.
> ---
>
> Key: HDFS-15135
> URL: https://issues.apache.org/jira/browse/HDFS-15135
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Reporter: Surendra Singh Lilhore
>Assignee: Ravuri Sushma sree
>Priority: Major
> Attachments: HDFS-15135.001.patch, HDFS-15135.002.patch
>
>
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 8
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464)
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602)
>at java.lang.Thread.run(Thread.java:745) {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.

2020-02-07 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032275#comment-17032275
 ] 

Surendra Singh Lilhore commented on HDFS-15135:
---

Thanks [~Sushma_28] for the patch.

Changes look good.

Some comments related to the test case:
 # Move your UT into the {{TestBlockRecovery}} class.
 # No need to add a LOG in the test case; just add a comment instead.

> EC : ArrayIndexOutOfBoundsException in 
> BlockRecoveryWorker#RecoveryTaskStriped.
> ---
>
> Key: HDFS-15135
> URL: https://issues.apache.org/jira/browse/HDFS-15135
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Reporter: Surendra Singh Lilhore
>Assignee: Ravuri Sushma sree
>Priority: Major
> Attachments: HDFS-15135.001.patch, HDFS-15135.002.patch
>
>
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 8
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464)
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602)
>at java.lang.Thread.run(Thread.java:745) {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15086) Block scheduled counter never gets decremented if the block got deleted before replication.

2020-02-07 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032230#comment-17032230
 ] 

Surendra Singh Lilhore commented on HDFS-15086:
---

Thanks [~hemanthboyina] for the patch.

Changes look good. Some comments:
 # Please add a comment somewhere about the changes, e.g. in {{DatanodeManager}} 
and {{BlockManager.computeReconstructionWorkForBlocks()}}.
 # In the UT, get the filesystem object inside the try block; 
{{cluster.getFileSystem()}} throws IOException.
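
For context on the fix under review, here is a tiny self-contained model of the 
bookkeeping (hypothetical names; the real counters live on the datanode 
descriptors in the NN):
{code:java}
// Sketch: when replication of an already-deleted block fails, the "blocks
// scheduled" counter must be decremented, otherwise it leaks forever.
final class ScheduledCounterSketch {
  private int blocksScheduled;

  void scheduleReplication() { blocksScheduled++; }

  /** Called when the DN reports the scheduled block as bad (file deleted). */
  void onReplicationAborted() {
    if (blocksScheduled > 0) {
      blocksScheduled--; // the missing decrement this Jira adds
    }
  }

  public static void main(String[] args) {
    ScheduledCounterSketch dn = new ScheduledCounterSketch();
    dn.scheduleReplication();  // NN schedules replication work on the DN
    dn.onReplicationAborted(); // file deleted before the DN could replicate
    System.out.println("blocksScheduled = " + dn.blocksScheduled); // 0, not 1
  }
}
{code}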

> Block scheduled counter never gets decremented if the block got deleted before 
> replication.
> ---
>
> Key: HDFS-15086
> URL: https://issues.apache.org/jira/browse/HDFS-15086
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-15086.001.patch, HDFS-15086.002.patch
>
>
> If a block is scheduled for replication and the same file gets deleted, then 
> this block will be reported as a bad block by the DN. 
> For this failed replication work, the scheduled block counter never gets 
> decremented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15086) Block scheduled counter never gets decremented if the block got deleted before replication.

2020-02-06 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031825#comment-17031825
 ] 

Surendra Singh Lilhore commented on HDFS-15086:
---

Thanks [~hemanthboyina],

I will review it tomorrow.

> Block scheduled counter never gets decremented if the block got deleted before 
> replication.
> ---
>
> Key: HDFS-15086
> URL: https://issues.apache.org/jira/browse/HDFS-15086
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-15086.001.patch, HDFS-15086.002.patch
>
>
> If a block is scheduled for replication and the same file gets deleted, then 
> this block will be reported as a bad block by the DN. 
> For this failed replication work, the scheduled block counter never gets 
> decremented.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.

2020-02-04 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030392#comment-17030392
 ] 

Surendra Singh Lilhore commented on HDFS-15135:
---

[~Sushma_28], please try to add a UT for lease recovery.

> EC : ArrayIndexOutOfBoundsException in 
> BlockRecoveryWorker#RecoveryTaskStriped.
> ---
>
> Key: HDFS-15135
> URL: https://issues.apache.org/jira/browse/HDFS-15135
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Reporter: Surendra Singh Lilhore
>Assignee: Ravuri Sushma sree
>Priority: Major
> Attachments: HDFS-15135.001.patch
>
>
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 8
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464)
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602)
>at java.lang.Thread.run(Thread.java:745) {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15133) Use rocksdb to store NameNode inode and blockInfo

2020-01-21 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020765#comment-17020765
 ] 

Surendra Singh Lilhore commented on HDFS-15133:
---

bq. The RDBStore and TypedTable can be responsible for the kv store manager, so 
we can start all the work by moving the RDBStore-related code to hadoop-common, 
so that Ozone and HDFS or YARN and other components can use this wonderful 
feature without any more effort.

[~maobaolong], good idea (y).  
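
As an illustration of the abstraction being proposed, a shared typed key-value 
table could look roughly like this (a hypothetical interface sketch; Ozone's 
actual RDBStore/TypedTable API differs in detail):
{code:java}
import java.io.IOException;

// Sketch: a typed key-value table that NameNode metadata (inodes, block info)
// could live in, with a checkpoint playing the role of an fsimage.
public interface TypedTableSketch<K, V> extends AutoCloseable {
  void put(K key, V value) throws IOException;

  /** @return the stored value, or null if the key is absent. */
  V get(K key) throws IOException;

  void delete(K key) throws IOException;

  /** Point-in-time snapshot of the table, replacing the fsimage checkpoint. */
  void checkpoint(java.nio.file.Path checkpointDir) throws IOException;
}
{code}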

> Use rocksdb to store NameNode inode and blockInfo
> -
>
> Key: HDFS-15133
> URL: https://issues.apache.org/jira/browse/HDFS-15133
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.0
>Reporter: maobaolong
>Priority: Major
>
> Maybe we don't need to checkpoint to an fsimage file; a RocksDB checkpoint can 
> achieve the same result.
> This is the Ozone and Alluxio way to manage the metadata of the master node.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13532) RBF: Adding security

2020-01-21 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-13532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-13532:
--
Fix Version/s: 3.3.0

> RBF: Adding security
> 
>
> Key: HDFS-13532
> URL: https://issues.apache.org/jira/browse/HDFS-13532
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Íñigo Goiri
>Assignee: CR Hota
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: RBF _ Security delegation token thoughts.pdf, RBF _ 
> Security delegation token thoughts_updated.pdf, RBF _ Security delegation 
> token thoughts_updated_2.pdf, RBF-DelegationToken-Approach1b.pdf, RBF_ 
> Security delegation token thoughts_updated_3.pdf, Security_for_Router-based 
> Federation_design_doc.pdf
>
>
> HDFS Router based federation should support security. This includes 
> authentication and delegation tokens.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.

2020-01-21 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020058#comment-17020058
 ] 

Surendra Singh Lilhore commented on HDFS-15135:
---

{code:java}
  // notify Namenode the new size and locations
  final DatanodeID[] newLocs = new DatanodeID[totalBlkNum];
  final String[] newStorages = new String[totalBlkNum];
  for (int i = 0; i < blockIndices.length; i++) {
    // blockIndices[i] is used directly as the index into the
    // totalBlkNum-sized arrays
    newLocs[blockIndices[i]] = DatanodeID.EMPTY_DATANODE_ID;
    newStorages[blockIndices[i]] = "";
  } {code}

"blockIndices[i]" is used as the array index here; a value outside the bounds 
of the totalBlkNum-sized arrays produces the ArrayIndexOutOfBoundsException 
above.

> EC : ArrayIndexOutOfBoundsException in 
> BlockRecoveryWorker#RecoveryTaskStriped.
> ---
>
> Key: HDFS-15135
> URL: https://issues.apache.org/jira/browse/HDFS-15135
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Surendra Singh Lilhore
>Priority: Major
>
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 8
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464)
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602)
>at java.lang.Thread.run(Thread.java:745) {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.

2020-01-21 Thread Surendra Singh Lilhore (Jira)
Surendra Singh Lilhore created HDFS-15135:
-

 Summary: EC : ArrayIndexOutOfBoundsException in 
BlockRecoveryWorker#RecoveryTaskStriped.
 Key: HDFS-15135
 URL: https://issues.apache.org/jira/browse/HDFS-15135
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Surendra Singh Lilhore


{noformat}
java.lang.ArrayIndexOutOfBoundsException: 8 at 
org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464)
 at 
org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602)
 at java.lang.Thread.run(Thread.java:745) {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.

2020-01-21 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15135:
--
Description: 
{noformat}
java.lang.ArrayIndexOutOfBoundsException: 8
   at 
org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464)
   at 
org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602)
   at java.lang.Thread.run(Thread.java:745) {noformat}

  was:
{noformat}
java.lang.ArrayIndexOutOfBoundsException: 8 at 
org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464)
 at 
org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602)
 at java.lang.Thread.run(Thread.java:745) {noformat}


> EC : ArrayIndexOutOfBoundsException in 
> BlockRecoveryWorker#RecoveryTaskStriped.
> ---
>
> Key: HDFS-15135
> URL: https://issues.apache.org/jira/browse/HDFS-15135
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Surendra Singh Lilhore
>Priority: Major
>
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException: 8
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464)
>at 
> org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602)
>at java.lang.Thread.run(Thread.java:745) {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15092) TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed

2020-01-20 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019847#comment-17019847
 ] 

Surendra Singh Lilhore commented on HDFS-15092:
---

Changes LGTM. I have triggered the Jenkins build.

> TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
> -
>
> Key: HDFS-15092
> URL: https://issues.apache.org/jira/browse/HDFS-15092
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: test
>Affects Versions: 3.3.0
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Minor
> Attachments: HDFS-15092.001.patch, HDFS-15092.002.patch
>
>
> TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
> {quote}
> java.lang.AssertionError: 
> Expected :5
> Actual   :4
>  
>   at org.junit.Assert.fail(Assert.java:88)
>   at org.junit.Assert.failNotEquals(Assert.java:834)
>   at org.junit.Assert.assertEquals(Assert.java:645)
>   at org.junit.Assert.assertEquals(Assert.java:631)
>   at 
> org.apache.hadoop.hdfs.server.namenode.TestRedudantBlocks.testProcessOverReplicatedAndRedudantBlock(TestRedudantBlocks.java:138)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {quote}
> Maybe we should increase the sleep time
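
An alternative to simply increasing a fixed sleep is to poll with a timeout. A 
minimal sketch using Hadoop's GenericTestUtils (the countReplicas() helper is 
a hypothetical stand-in for the test's real replica check):
{code:java}
import org.apache.hadoop.test.GenericTestUtils;

class FlakyWaitSketch {
  // Hypothetical stand-in for the test's real "count live replicas" check.
  static int countReplicas() {
    return 5;
  }

  static void awaitExpectedReplicas() throws Exception {
    // Instead of a fixed Thread.sleep(), re-check until the expected state
    // is reached or the timeout expires.
    GenericTestUtils.waitFor(() -> countReplicas() == 5,
        500,      // re-check every 500 ms
        30000);   // give up after 30 s
  }
}
{code}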



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15067) Optimize heartbeat for large cluster

2020-01-11 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013555#comment-17013555
 ] 

Surendra Singh Lilhore commented on HDFS-15067:
---

Thanks [~ayushtkn].
{quote}I am of the opinion that rather than having two logics, we should have 
one. The default value can act as a fallback: if you don't configure it, or 
you configure it wrong, it falls back to some fixed value x.
{quote}
Will check this; I will try to use some fixed number.
{quote}In layman's terms, this condition checks whether the known active has 
turned standby. In this case, ideally we should reset the heartbeats for all 
the bps so that the new active can be identified; otherwise the bps tracking 
the standby will stay at the max DN interval, delaying identification of the 
new active.
{quote}
Agree with you; this needs to be handled. I will update the next patch with 
the remaining UTs and documentation.
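
To make the intent concrete, here is a minimal self-contained sketch of the 
skip logic under discussion (the class and method names are hypothetical, not 
the patch code): the interval backs off while the DN gets no work and resets 
to the base interval when work arrives or the known active NN changes.
{code:java}
/** Hypothetical sketch of per-NN heartbeat backoff; not the patch code. */
class HeartbeatSkipState {
  private final long baseIntervalMs;
  private final long maxIntervalMs;
  private long currentIntervalMs;

  HeartbeatSkipState(long baseIntervalMs, long maxIntervalMs) {
    this.baseIntervalMs = baseIntervalMs;
    this.maxIntervalMs = maxIntervalMs;
    this.currentIntervalMs = baseIntervalMs;
  }

  /** Heartbeat response carried no work: back off, capped at the max. */
  void onIdleResponse() {
    currentIntervalMs = Math.min(currentIntervalMs * 2, maxIntervalMs);
  }

  /** Work received, or the known active NN changed: back to the base. */
  void reset() {
    currentIntervalMs = baseIntervalMs;
  }

  long nextIntervalMs() {
    return currentIntervalMs;
  }
}
{code}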

> Optimize heartbeat for large cluster
> 
>
> Key: HDFS-15067
> URL: https://issues.apache.org/jira/browse/HDFS-15067
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15067.01.patch, HDFS-15067.02.patch, 
> image-2020-01-09-18-00-49-556.png
>
>
> In a large cluster, the Namenode spends noticeable time processing heartbeats. 
> For example, in a 10K-node cluster the Namenode processes 10K heartbeat RPCs 
> every 3 sec, which impacts client response time. This heartbeat traffic can be 
> optimized: a DN can start skipping heartbeats if no 
> work (write/replication/delete) has been allocated to it for a long time, and 
> instead send a heartbeat every 6 sec. Once the DN starts getting work from the 
> NN, it resumes sending heartbeats at the normal interval.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15067) Optimize heartbeat for large cluster

2020-01-10 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013221#comment-17013221
 ] 

Surendra Singh Lilhore commented on HDFS-15067:
---

Thanks [~ayushtkn] for the review.
{quote}I guess the standby/observer namenode will not be sending any response 
to the datanode, so the heartbeat interval for the standby shall always be the 
max configured.

Just an opinion: the standby and observer will in any case reach the max skip 
interval; maybe we can shoot them directly to the max value after the first 
heartbeat rather than going up exponentially.
{quote}
Do you think it will give any real benefit? The Standby/Observer is not doing 
anything anyway, and sending the extra heartbeats from an independent thread 
costs nothing.
{quote}I think in case of failover, we should reset the counter to the start.
{quote}
Handled.
{quote}In case of a Connection Exception, or any connection issues.
{quote}
Handled.
{quote}For the default, the value is 3; in case of an invalid value it shoots 
to {{StaleInterval - 1 HeartBeat}}. Both seem quite extreme, the first at the 
lower end and the latter at the higher end. I think we can keep it as a 
percentage of the stale interval, maybe 40% or 50%.
{quote}
The admin should touch this configuration only if he knows the NN and DN 
communication pattern. Configuring the wrong value in a big cluster is not 
acceptable, and if he has configured it, he should correct it when he sees the 
system behaving abnormally.

I don't think configuring this as a percentage is a good idea. Heartbeats are 
a major thing and should be counted in plain numbers. For example, if a doctor 
gives you some pills and asks you to take 10% of them daily, you have to 
calculate how many pills to take, but the doctor doesn't know what result your 
calculation produced or whether you are taking the correct number of pills.

Based on the configured heartbeat interval, the admin can easily work out the 
maximum number of heartbeats that can safely be skipped, even in the worst 
case, while the system keeps running normally. The admin should try to skip as 
few heartbeats as possible, since skipping delays other operations. I feel 3 
heartbeats is ideal for a 3 sec heartbeat interval.
{quote}nit: in case the specified value is invalid, there should be a warn log 
stating that the specified value is more than the stale interval and that the 
default of .. is being used.
{quote}
Handled.
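
As a sketch of the validation being discussed (the property name and the 
default of 3 are invented for illustration; they are not from the patch):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class SkipCountValidator {
  private static final Logger LOG =
      LoggerFactory.getLogger(SkipCountValidator.class);
  // Hypothetical property name, for illustration only.
  static final String KEY = "dfs.datanode.heartbeat.max.skip.count";
  static final int DEFAULT_SKIP_COUNT = 3;

  /** Assumes both intervals are positive and stale > heartbeat. */
  static int validate(Configuration conf, long heartbeatIntervalSec,
      long staleIntervalSec) {
    int configured = conf.getInt(KEY, DEFAULT_SKIP_COUNT);
    // The DN must still heartbeat at least once inside the stale interval.
    long maxAllowed = staleIntervalSec / heartbeatIntervalSec - 1;
    if (configured < 0 || configured > maxAllowed) {
      LOG.warn("Specified value {} for {} exceeds what the stale interval"
          + " allows ({}); using default of {}", configured, KEY, maxAllowed,
          DEFAULT_SKIP_COUNT);
      return DEFAULT_SKIP_COUNT;
    }
    return configured;
  }
}
{code}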

> Optimize heartbeat for large cluster
> 
>
> Key: HDFS-15067
> URL: https://issues.apache.org/jira/browse/HDFS-15067
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15067.01.patch, HDFS-15067.02.patch, 
> image-2020-01-09-18-00-49-556.png
>
>
> In a large cluster, the Namenode spends noticeable time processing heartbeats. 
> For example, in a 10K-node cluster the Namenode processes 10K heartbeat RPCs 
> every 3 sec, which impacts client response time. This heartbeat traffic can be 
> optimized: a DN can start skipping heartbeats if no 
> work (write/replication/delete) has been allocated to it for a long time, and 
> instead send a heartbeat every 6 sec. Once the DN starts getting work from the 
> NN, it resumes sending heartbeats at the normal interval.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15067) Optimize heartbeat for large cluster

2020-01-10 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15067:
--
Attachment: HDFS-15067.02.patch

> Optimize heartbeat for large cluster
> 
>
> Key: HDFS-15067
> URL: https://issues.apache.org/jira/browse/HDFS-15067
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15067.01.patch, HDFS-15067.02.patch, 
> image-2020-01-09-18-00-49-556.png
>
>
> In a large cluster, the Namenode spends noticeable time processing heartbeats. 
> For example, in a 10K-node cluster the Namenode processes 10K heartbeat RPCs 
> every 3 sec, which impacts client response time. This heartbeat traffic can be 
> optimized: a DN can start skipping heartbeats if no 
> work (write/replication/delete) has been allocated to it for a long time, and 
> instead send a heartbeat every 6 sec. Once the DN starts getting work from the 
> NN, it resumes sending heartbeats at the normal interval.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15067) Optimize heartbeat for large cluster

2020-01-10 Thread Surendra Singh Lilhore (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Surendra Singh Lilhore updated HDFS-15067:
--
Issue Type: New Feature  (was: Improvement)

> Optimize heartbeat for large cluster
> 
>
> Key: HDFS-15067
> URL: https://issues.apache.org/jira/browse/HDFS-15067
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Surendra Singh Lilhore
>Priority: Major
> Attachments: HDFS-15067.01.patch, image-2020-01-09-18-00-49-556.png
>
>
> In a large cluster, the Namenode spends noticeable time processing heartbeats. 
> For example, in a 10K-node cluster the Namenode processes 10K heartbeat RPCs 
> every 3 sec, which impacts client response time. This heartbeat traffic can be 
> optimized: a DN can start skipping heartbeats if no 
> work (write/replication/delete) has been allocated to it for a long time, and 
> instead send a heartbeat every 6 sec. Once the DN starts getting work from the 
> NN, it resumes sending heartbeats at the normal interval.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


