[jira] [Commented] (HDFS-15067) Optimize heartbeat for large cluster
[ https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571727#comment-17571727 ] Surendra Singh Lilhore commented on HDFS-15067: --- Thanks [~prasad-acit]. We can merge this; let's ask other people to review it. > Optimize heartbeat for large cluster > > > Key: HDFS-15067 > URL: https://issues.apache.org/jira/browse/HDFS-15067 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15067.01.patch, HDFS-15067.02.patch, > HDFS-15067.03.patch, image-2020-01-09-18-00-49-556.png > > > In a large cluster the Namenode spends noticeable time processing heartbeats. > For example, in a 10K-node cluster the Namenode processes 10K heartbeat RPCs > every 3 seconds. This impacts client response time. The heartbeat can be > optimized: a DN can start skipping heartbeats if no > work (write/replication/delete) has been allocated to it for a long time, and > send heartbeats every 6 seconds instead. Once the DN starts getting work from the NN, it > can resume sending heartbeats normally. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
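The skip-heartbeat scheme described in the issue can be sketched roughly as follows. This is an illustrative sketch only; the class and constant names (IdleAwareHeartbeat, IDLE_THRESHOLD_MS, etc.) are hypothetical and not the actual DataNode code, and the idle threshold is an assumed value.

```java
// Hypothetical sketch of an idle-aware heartbeat interval: the DN doubles
// its heartbeat period after receiving no work for a while, and drops back
// to the normal period as soon as the NN assigns work again.
public class IdleAwareHeartbeat {
    static final long BASE_INTERVAL_MS = 3000;   // normal heartbeat period (3s)
    static final long IDLE_INTERVAL_MS = 6000;   // relaxed period when idle (6s)
    static final long IDLE_THRESHOLD_MS = 60000; // assumed "no work for a long time"

    private long lastWorkReceivedMs;

    IdleAwareHeartbeat(long nowMs) {
        this.lastWorkReceivedMs = nowMs;
    }

    /** Called when the NN allocates write/replication/delete work. */
    void onWorkReceived(long nowMs) {
        lastWorkReceivedMs = nowMs;
    }

    /** Delay before the next heartbeat, based on how long the DN has been idle. */
    long nextIntervalMs(long nowMs) {
        return (nowMs - lastWorkReceivedMs >= IDLE_THRESHOLD_MS)
            ? IDLE_INTERVAL_MS : BASE_INTERVAL_MS;
    }
}
```

As the later review comments note, the NN-side stale interval bounds the extra delay this introduces for assigned work, since the DN still heartbeats at least once per stale interval.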
[jira] [Commented] (HDFS-16456) EC: Decommission a rack with only one DN will fail when the rack number is equal to the replication number
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17495321#comment-17495321 ] Surendra Singh Lilhore commented on HDFS-16456: --- Thanks [~caozhiqiang], I appreciate your effort. I am not in favor of changing the network topology for this issue. We can try to find a target from another rack after getting NotEnoughReplicasException in the logic below. {code:java} if (totalReplicaExpected < numOfRacks || totalReplicaExpected % numOfRacks == 0) { writer = chooseOnce(numOfReplicas, writer, excludedNodes, blocksize, maxNodesPerRack, results, avoidStaleNodes, storageTypes); return writer; } {code} [~tasanuma], [~weichiu], please give your opinion. > EC: Decommission a rack with only one DN will fail when the rack number is > equal to the replication number > > > Key: HDFS-16456 > URL: https://issues.apache.org/jira/browse/HDFS-16456 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, namenode >Affects Versions: 3.4.0 >Reporter: caozhiqiang >Priority: Critical > Attachments: HDFS-16456.001.patch, HDFS-16456.002.patch > > > In the scenario below, decommission will fail with the TOO_MANY_NODES_ON_RACK reason: > # Enable an EC policy, such as RS-6-3-1024k. > # The rack number in the cluster is equal to or less than the replication > number (9). > # A rack has only one DN, and that DN is decommissioned. > The root cause is in the > BlockPlacementPolicyRackFaultTolerant::getMaxNodesPerRack() function, which > computes a limit parameter maxNodesPerRack for choosing targets. In this > scenario, maxNodesPerRack is 1, which means only one datanode can be chosen > per rack. > {code:java} > protected int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) { >... > // If more replicas than racks, evenly spread the replicas. > // This calculation rounds up. > int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; > return new int[] {numOfReplicas, maxNodesPerRack}; > } {code} > Here int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1; is > evaluated with totalNumOfReplicas=9 and numOfRacks=9. > When we decommission a DN that is the only node in its rack, chooseOnce() in > BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder() will throw > NotEnoughReplicasException, but the exception is not caught, so it fails to > fall back to the chooseEvenlyFromRemainingRacks() function. > During decommission, after choosing targets, the verifyBlockPlacement() > function returns a total rack number that includes the invalid rack, so > BlockPlacementStatusDefault::isPlacementPolicySatisfied() returns false, > which also causes the decommission to fail. > {code:java} > public BlockPlacementStatus verifyBlockPlacement(DatanodeInfo[] locs, > int numberOfReplicas) { > if (locs == null) > locs = DatanodeDescriptor.EMPTY_ARRAY; > if (!clusterMap.hasClusterEverBeenMultiRack()) { > // only one rack > return new BlockPlacementStatusDefault(1, 1, 1); > } > // Count locations on different racks. > Set<String> racks = new HashSet<>(); > for (DatanodeInfo dn : locs) { > racks.add(dn.getNetworkLocation()); > } > return new BlockPlacementStatusDefault(racks.size(), numberOfReplicas, > clusterMap.getNumOfRacks()); > } {code} > {code:java} > public boolean isPlacementPolicySatisfied() { > return requiredRacks <= currentRacks || currentRacks >= totalRacks; > }{code} > According to the above description, we should make the modifications below to fix it: > # In startDecommission() or stopDecommission(), we should also update the > numOfRacks in the NetworkTopology class. Otherwise choosing targets may fail > because maxNodesPerRack is too small, and even if choosing targets succeeds, > isPlacementPolicySatisfied will still return false and cause the decommission > to fail. > # In BlockPlacementPolicyRackFaultTolerant::chooseTargetInOrder(), the first > chooseOnce() call should also be wrapped in try...catch, or it will not fall > back to chooseEvenlyFromRemainingRacks() when an exception is thrown. > # In chooseEvenlyFromRemainingRacks(), the numResultsOflastChoose = > results.size(); statement should be moved to after chooseOnce(), or it will > throw lastException and make target choosing fail.
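The rounding-up calculation quoted above explains why this scenario hits the limit: with 9 total replicas spread over 9 racks, maxNodesPerRack comes out as 1. A minimal illustration of just that arithmetic (the helper below is a standalone sketch, not the actual BlockPlacementPolicyRackFaultTolerant code):

```java
// Reproduces the maxNodesPerRack formula from getMaxNodesPerRack():
// integer division that rounds up, i.e. ceil(totalNumOfReplicas / numOfRacks).
public class MaxNodesPerRack {
    static int maxNodesPerRack(int totalNumOfReplicas, int numOfRacks) {
        // (9 - 1) / 9 + 1 == 1: each rack may hold at most one target,
        // so losing the only DN on a rack leaves no valid placement.
        return (totalNumOfReplicas - 1) / numOfRacks + 1;
    }
}
```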
[jira] [Comment Edited] (HDFS-16456) EC: Decommission a rack with only one DN will fail when the rack number is equal to the replication number
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493237#comment-17493237 ] Surendra Singh Lilhore edited comment on HDFS-16456 at 2/16/22, 1:30 PM: - [~caozhiqiang], thanks for the patch. {noformat} hbManager.startDecommission(node); + // Update cluster's numOfRacks + blockManager.getDatanodeManager().getNetworkTopology().remove(node); {noformat} I don't think this is the right way to remove a node from the topology. After starting decommissioning we shouldn't remove the node; it is still part of the cluster. was (Author: surendrasingh): [~caozhiqiang], thanks for patch. {noformat} hbManager.startDecommission(node); + // Update cluster's numOfRacks + blockManager.getDatanodeManager().getNetworkTopology().remove(node); {noformat} I don't thing this is right way to remove node from topology. After starting decommissioning we shouldn't remove node, it is still part of cluster.
[jira] [Commented] (HDFS-16456) EC: Decommission a rack with only one DN will fail when the rack number is equal to the replication number
[ https://issues.apache.org/jira/browse/HDFS-16456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17493237#comment-17493237 ] Surendra Singh Lilhore commented on HDFS-16456: --- [~caozhiqiang], thanks for the patch. {noformat} hbManager.startDecommission(node); + // Update cluster's numOfRacks + blockManager.getDatanodeManager().getNetworkTopology().remove(node); {noformat} I don't think this is the right way to remove a node from the topology. After starting decommissioning we shouldn't remove the node; it is still part of the cluster.
[jira] [Comment Edited] (HDFS-15863) RBF: Validation message to be corrected in FairnessPolicyController
[ https://issues.apache.org/jira/browse/HDFS-15863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17310156#comment-17310156 ] Surendra Singh Lilhore edited comment on HDFS-15863 at 3/28/21, 10:18 AM: -- +1 for v5. Triggered build : https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/559/ was (Author: surendrasingh): +1 for v5. > RBF: Validation message to be corrected in FairnessPolicyController > --- > > Key: HDFS-15863 > URL: https://issues.apache.org/jira/browse/HDFS-15863 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.4.0 >Reporter: Renukaprasad C >Assignee: Renukaprasad C >Priority: Minor > Attachments: HDFS-15863.001.patch, HDFS-15863.002.patch, > HDFS-15863.003.patch, HDFS-15863.004.patch, HDFS-15863.005.patch > > > org.apache.hadoop.hdfs.server.federation.fairness.StaticRouterRpcFairnessPolicyController#validateCount > When dfs.federation.router.handler.count is less than the total dedicated > handlers for all NS, the error message shows 0 and negative values instead of > the actual configured values. > The current message is: "Available handlers -5 lower than min 0 for nsId nn1" > This can be changed to: "Configured handlers > ${DFS_ROUTER_HANDLER_COUNT_KEY}=10 lower than min 15 for nsId nn1", where 10 > is the handler count and 15 is the sum of the dedicated handler counts. > Related to: HDFS-14090
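The message change proposed in the issue can be sketched as below. This is a hypothetical standalone helper, not the actual StaticRouterRpcFairnessPolicyController code; the method name and return convention are assumptions for illustration.

```java
// Sketch of the improved validation message: report the configured handler
// count and the required minimum, rather than the 0/negative leftovers.
public class HandlerCountValidator {
    static final String DFS_ROUTER_HANDLER_COUNT_KEY =
        "dfs.federation.router.handler.count";

    /** Returns an error message when the config is inconsistent, else null. */
    static String validate(int configuredHandlers, int dedicatedSum, String nsId) {
        if (configuredHandlers >= dedicatedSum) {
            return null; // enough handlers for all dedicated allocations
        }
        return "Configured handlers " + DFS_ROUTER_HANDLER_COUNT_KEY + "="
            + configuredHandlers + " lower than min " + dedicatedSum
            + " for nsId " + nsId;
    }
}
```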
[jira] [Commented] (HDFS-15863) RBF: Validation message to be corrected in FairnessPolicyController
[ https://issues.apache.org/jira/browse/HDFS-15863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17310156#comment-17310156 ] Surendra Singh Lilhore commented on HDFS-15863: --- +1 for v5.
[jira] [Commented] (HDFS-15812) after deleting data of hbase table hdfs size is not decreasing
[ https://issues.apache.org/jira/browse/HDFS-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283149#comment-17283149 ] Surendra Singh Lilhore commented on HDFS-15812: --- [~satycse06], can you please check the namenode log to see what happened to the HBase-related files after deleting the table? > after deleting data of hbase table hdfs size is not decreasing > -- > > Key: HDFS-15812 > URL: https://issues.apache.org/jira/browse/HDFS-15812 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.0.2-alpha > Environment: HDP 3.1.4.0-315 > Hbase 2.0.2.3.1.4.0-315 >Reporter: Satya Gaurav >Priority: Major > > I am deleting the data from the hbase table; the data is deleted from the > hbase table, but the size of the hdfs directory is not reducing. I even ran a > major compaction, but after that the hdfs size still didn't reduce. Any > solution for this issue?
[jira] [Commented] (HDFS-15812) after deleting data of hbase table hdfs size is not decreasing
[ https://issues.apache.org/jira/browse/HDFS-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278552#comment-17278552 ] Surendra Singh Lilhore commented on HDFS-15812: --- [~satycse06], this doc may help you understand. You need to check the HBase-side deletion policy. [https://docs.cloudera.com/cdp-private-cloud-base/7.1.3/managing-hbase/topics/hbase-deletion.html] I don't see any problem on the HDFS side.
[jira] [Commented] (HDFS-15812) after deleting data of hbase table hdfs size is not decreasing
[ https://issues.apache.org/jira/browse/HDFS-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277683#comment-17277683 ] Surendra Singh Lilhore commented on HDFS-15812: --- Please send your query to [u...@hadoop.apache.org.|mailto:u...@hadoop.apache.org]
[jira] [Commented] (HDFS-15812) after deleting data of hbase table hdfs size is not decreasing
[ https://issues.apache.org/jira/browse/HDFS-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277682#comment-17277682 ] Surendra Singh Lilhore commented on HDFS-15812: --- [~satycse06], it will take time to delete data from HDFS if it is moved to the trash.
[jira] [Commented] (HDFS-13522) Support observer node from Router-Based Federation
[ https://issues.apache.org/jira/browse/HDFS-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191970#comment-17191970 ] Surendra Singh Lilhore commented on HDFS-13522: --- Hi [~hemanthboyina], In an initial review I found two things that need to be taken care of: # Load balancing between multiple observers. # The WebHDFS call; I think you may get an NPE for the WebHDFS call. I will review this patch in detail. > Support observer node from Router-Based Federation > -- > > Key: HDFS-13522 > URL: https://issues.apache.org/jira/browse/HDFS-13522 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: federation, namenode >Reporter: Erik Krogen >Assignee: Chao Sun >Priority: Major > Attachments: HDFS-13522.001.patch, HDFS-13522_WIP.patch, RBF_ > Observer support.pdf, Router+Observer RPC clogging.png, > ShortTerm-Routers+Observer.png > > > Changes will need to occur to the router to support the new observer node. > One such change will be to make the router understand the observer state, > e.g. {{FederationNamenodeServiceState}}.
[jira] [Resolved] (HDFS-15476) Make AsyncStream class' executor_ member private
[ https://issues.apache.org/jira/browse/HDFS-15476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore resolved HDFS-15476. --- Resolution: Fixed Thanks for the contribution, [~Suraj Naik]. > Make AsyncStream class' executor_ member private > > > Key: HDFS-15476 > URL: https://issues.apache.org/jira/browse/HDFS-15476 > Project: Hadoop HDFS > Issue Type: Improvement > Components: build, libhdfs++ >Reporter: Suraj Naik >Assignee: Suraj Naik >Priority: Minor > Fix For: 3.4.0 > > > As part of [HDFS-15385|https://issues.apache.org/jira/browse/HDFS-15385] the > boost library was upgraded. > The AsyncStream class has a getter function which returns the executor. > Keeping the executor member public makes the getter function's role > pointless.
[jira] [Commented] (HDFS-15476) Make AsyncStream class' executor_ member private
[ https://issues.apache.org/jira/browse/HDFS-15476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17160606#comment-17160606 ] Surendra Singh Lilhore commented on HDFS-15476: --- Added [~Suraj Naik] to the HDFS contributor list.
[jira] [Commented] (HDFS-15067) Optimize heartbeat for large cluster
[ https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17156643#comment-17156643 ] Surendra Singh Lilhore commented on HDFS-15067: --- Thanks [~ayushtkn] & [~umamaheswararao] for the review. {quote}Let's say a DN does not have any work for some time and you started skipping heartbeats. When you are skipping, NN assigns some replication work to this node, they will just stay in NN side DatanodeDescriptor. Since there are no heartbeats received, that DN will not consume that work from NN right? So, assigned replication can be delayed? Am i missing something? {quote} Yes, a delay of at most 30s (the stale interval). {quote}We also report xceiver counts (and lot of other metrics) in heartbeats which will be used which choosing good nodes etc. I am wondering, whether we miss any approximation(far from original approximation)? {quote} Currently only a block write request (write xceiver) is treated as received work, after which the DN resumes sending normal heartbeats. Can we also consider a read request as work and resume normal heartbeats? {quote}I saw in your proposal that, at least one heartbeat in stale interval. I feel one hb may be risk as it can be delayed or failed due to nw fluctuations. So, it may be risk that you will declare that node as stale wrongly? {quote} Yeah, this is a problem. Any suggestion for this? Can we send two consecutive heartbeats to solve it? {quote}Does this proved some benefit in your cluster? I mean in response time etc. {quote} Yes, we got a good benefit in a 20K-node cluster. For one, the cluster activation time (active NN out of safemode with 20K nodes) was reduced by 50%.
[jira] [Commented] (HDFS-15067) Optimize heartbeat for large cluster
[ https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143903#comment-17143903 ] Surendra Singh Lilhore commented on HDFS-15067: --- Attached the v3 patch. Please review.
[jira] [Updated] (HDFS-15067) Optimize heartbeat for large cluster
[ https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15067: -- Attachment: HDFS-15067.03.patch
[jira] [Commented] (HDFS-15375) Reconstruction Work should not happen for Corrupt Block
[ https://issues.apache.org/jira/browse/HDFS-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124099#comment-17124099 ] Surendra Singh Lilhore commented on HDFS-15375: --- Triggered a build to check the impact of this patch. > Reconstruction Work should not happen for Corrupt Block > --- > > Key: HDFS-15375 > URL: https://issues.apache.org/jira/browse/HDFS-15375 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15375-testrepro.patch, HDFS-15375.001.patch > > > In BlockManager#updateNeededReconstructions, while updating > NeededReconstruction we are adding pending-reconstruction blocks to the live > replicas: > {code:java} > int pendingNum = pendingReconstruction.getNumReplicas(block); > int curExpectedReplicas = getExpectedRedundancyNum(block); > if (!hasEnoughEffectiveReplicas(block, repl, pendingNum)) { > neededReconstruction.update(block, repl.liveReplicas() + > pendingNum,{code} > But if two replicas are in pending reconstruction (due to corruption), and > the third replica is corrupted, the block should be in > QUEUE_WITH_CORRUPT_BLOCKS; because of the above logic it gets added to > QUEUE_LOW_REDUNDANCY instead, which makes the RedundancyMonitor reconstruct a > corrupted block, which is wrong.
[jira] [Commented] (HDFS-15375) Reconstruction Work should not happen for Corrupt Block
[ https://issues.apache.org/jira/browse/HDFS-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124096#comment-17124096 ] Surendra Singh Lilhore commented on HDFS-15375: --- {quote}- neededReconstruction.update(block, repl.liveReplicas() + pendingNum,{quote} We can't remove {{pendingNum}} from here; if this count doesn't include pendingNum, it will create extra replication tasks. In your case all the replicas are corrupted, which means the live replica count will be zero, so you can add some logic based on a live-replica-zero check. > Reconstruction Work should not happen for Corrupt Block > --- > > Key: HDFS-15375 > URL: https://issues.apache.org/jira/browse/HDFS-15375 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15375-testrepro.patch, HDFS-15375.001.patch > > > In BlockManager#updateNeededReconstructions, while updating > NeededReconstruction we add pending-reconstruction blocks to the live > replicas: > {code:java} > int pendingNum = pendingReconstruction.getNumReplicas(block); > int curExpectedReplicas = getExpectedRedundancyNum(block); > if (!hasEnoughEffectiveReplicas(block, repl, pendingNum)) { > neededReconstruction.update(block, repl.liveReplicas() + > pendingNum,{code} > But if two replicas are in pending reconstruction (due to corruption) and > the third replica is also corrupted, the block should be in > QUEUE_WITH_CORRUPT_BLOCKS; because of the above logic it gets added to > QUEUE_LOW_REDUNDANCY instead, which makes the RedundancyMonitor reconstruct a > corrupted block, which is wrong. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
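The live-replica-zero suggestion above can be sketched as follows; chooseQueue and its string return values are illustrative stand-ins, not the actual BlockManager code:

```java
// Illustrative sketch (not the actual BlockManager code) of the
// suggestion above: keep pendingNum in the effective-replica count so
// no extra replication tasks get scheduled, but route a block with
// zero live replicas to the corrupt queue, since reconstruction from
// zero live replicas cannot succeed.
class ReplQueueChooser {
    static String chooseQueue(int liveReplicas, int pendingNum, int expected) {
        if (liveReplicas == 0) {
            // All replicas are corrupt; reconstruction cannot help.
            return "QUEUE_WITH_CORRUPT_BLOCKS";
        }
        int effective = liveReplicas + pendingNum;
        return effective < expected ? "QUEUE_LOW_REDUNDANCY" : "NONE";
    }
}
```

In the scenario from the description (two replicas pending due to corruption, third replica corrupt) the live count is zero, so the block lands in the corrupt queue instead of being handed to the RedundancyMonitor.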
[jira] [Commented] (HDFS-15375) Reconstruction Work should not happen for Corrupt Block
[ https://issues.apache.org/jira/browse/HDFS-15375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119717#comment-17119717 ] Surendra Singh Lilhore commented on HDFS-15375: --- [~hemanthboyina], thanks for the patch. One doubt: without this fix, how long will it take to come out of QUEUE_LOW_REDUNDANCY if the third replica is also corrupted? > Reconstruction Work should not happen for Corrupt Block > --- > > Key: HDFS-15375 > URL: https://issues.apache.org/jira/browse/HDFS-15375 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15375-testrepro.patch, HDFS-15375.001.patch > > > In BlockManager#updateNeededReconstructions, while updating > NeededReconstruction we add pending-reconstruction blocks to the live > replicas: > {code:java} > int pendingNum = pendingReconstruction.getNumReplicas(block); > int curExpectedReplicas = getExpectedRedundancyNum(block); > if (!hasEnoughEffectiveReplicas(block, repl, pendingNum)) { > neededReconstruction.update(block, repl.liveReplicas() + > pendingNum,{code} > But if two replicas are in pending reconstruction (due to corruption) and > the third replica is also corrupted, the block should be in > QUEUE_WITH_CORRUPT_BLOCKS; because of the above logic it gets added to > QUEUE_LOW_REDUNDANCY instead, which makes the RedundancyMonitor reconstruct a > corrupted block, which is wrong. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14762) "Path(Path/String parent, String child)" will fail when "child" contains ":"
[ https://issues.apache.org/jira/browse/HDFS-14762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110069#comment-17110069 ] Surendra Singh Lilhore commented on HDFS-14762: --- Hi [~ayushtkn], {quote}File Name used IPv6? What is the relation of name & IPv6? {quote} We are trying HDFS with IPv6. The Datanode creates a block pool directory, and the block pool name contains the IP of the namenode. If the NN is started with IPv6 then this name contains ":" and the same problem occurs. > "Path(Path/String parent, String child)" will fail when "child" contains ":" > > > Key: HDFS-14762 > URL: https://issues.apache.org/jira/browse/HDFS-14762 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Shixiong Zhu >Priority: Major > Attachments: HDFS-14762.001.patch, HDFS-14762.002.patch, > HDFS-14762.003.patch, HDFS-14762.004.patch > > > When the "child" parameter contains ":", "Path(Path/String parent, String > child)" will throw the following exception: > {code} > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: ... > {code} > Not sure if this is a legit bug. But the following places will hit this error > when seeing a Path with a file name containing ":": > https://github.com/apache/hadoop/blob/f9029c4070e8eb046b403f5cb6d0a132c5d58448/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java#L101 > https://github.com/apache/hadoop/blob/f9029c4070e8eb046b403f5cb6d0a132c5d58448/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Globber.java#L270 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
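The ":" failure can be reproduced with plain java.net.URI, which Hadoop's Path wraps internally: a relative reference whose first segment contains ":" is parsed as "scheme:rest", which is why a block pool name carrying an IPv6 address breaks. A minimal sketch, where relativeChild is a hypothetical helper (not a Hadoop API) showing one common workaround, prefixing "./" so the colon-bearing name stays in the path component:

```java
// Minimal reproduction of the ':' problem discussed above, using only
// java.net.URI. relativeChild is a hypothetical helper for illustration.
import java.net.URI;
import java.net.URISyntaxException;

class ColonChild {
    static URI relativeChild(String child) {
        try {
            // A bare "BP-...-fe80::1-..." would parse its leading segment
            // as a URI scheme; a "./" prefix keeps it a relative path.
            return new URI(child.contains(":") ? "./" + child : child);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(child, e);
        }
    }
}
```

With the prefix, the colon sits in a non-first path segment, where RFC 2396 allows it, so the URI stays relative and Path concatenation no longer trips over "Relative path in absolute URI".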
[jira] [Reopened] (HDFS-14762) "Path(Path/String parent, String child)" will fail when "child" contains ":"
[ https://issues.apache.org/jira/browse/HDFS-14762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore reopened HDFS-14762: --- We should handle this scenario; it is a valid one. We faced the same problem when an IPv6 address was used in a file name. > "Path(Path/String parent, String child)" will fail when "child" contains ":" > > > Key: HDFS-14762 > URL: https://issues.apache.org/jira/browse/HDFS-14762 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Shixiong Zhu >Priority: Major > Attachments: HDFS-14762.001.patch, HDFS-14762.002.patch, > HDFS-14762.003.patch, HDFS-14762.004.patch > > > When the "child" parameter contains ":", "Path(Path/String parent, String > child)" will throw the following exception: > {code} > java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative > path in absolute URI: ... > {code} > Not sure if this is a legit bug. But the following places will hit this error > when seeing a Path with a file name containing ":": > https://github.com/apache/hadoop/blob/f9029c4070e8eb046b403f5cb6d0a132c5d58448/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java#L101 > https://github.com/apache/hadoop/blob/f9029c4070e8eb046b403f5cb6d0a132c5d58448/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Globber.java#L270 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14452) Make Op#valueOf() Public
[ https://issues.apache.org/jira/browse/HDFS-14452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17107990#comment-17107990 ] Surendra Singh Lilhore commented on HDFS-14452: --- +1 LGTM > Make Op#valueOf() Public > > > Key: HDFS-14452 > URL: https://issues.apache.org/jira/browse/HDFS-14452 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ipc >Affects Versions: 3.2.0 >Reporter: David Mollitor >Assignee: hemanthboyina >Priority: Minor > Labels: noob > Attachments: HDFS-14452.patch > > > Change signature of {{private static Op valueOf(byte code)}} to be public. > Right now, the only easy way to look up in Op is to pass in a {{DataInput}} > object, which is not all that flexible and efficient for other custom > implementations that want to store the Op code a different way. > https://github.com/apache/hadoop/blob/8c95cb9d6bef369fef6a8364f0c0764eba90e44a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/Op.java#L53 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15316) Deletion failure should not remove directory from snapshottables
[ https://issues.apache.org/jira/browse/HDFS-15316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15316: -- Fix Version/s: 3.4.0 Resolution: Fixed Status: Resolved (was: Patch Available) Thanks [~hemanthboyina] for the contribution. Committed to trunk. Thanks [~ayushtkn] for the review. > Deletion failure should not remove directory from snapshottables > > > Key: HDFS-15316 > URL: https://issues.apache.org/jira/browse/HDFS-15316 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15316.001.patch, HDFS-15316.002.patch > > > If deleting a directory doesn't succeed, we still remove the directory > from snapshottables. > This makes the system inconsistent: we will be able to create snapshots, but > snapshot diff throws "Directory is not snapshottable". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15316) Deletion failure should not remove directory from snapshottables
[ https://issues.apache.org/jira/browse/HDFS-15316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17099570#comment-17099570 ] Surendra Singh Lilhore commented on HDFS-15316: --- Thanks [~hemanthboyina] for the patch. It is a very rare scenario but good to handle. +1 Will commit tomorrow if there are no more comments. > Deletion failure should not remove directory from snapshottables > > > Key: HDFS-15316 > URL: https://issues.apache.org/jira/browse/HDFS-15316 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15316.001.patch, HDFS-15316.002.patch > > > If deleting a directory doesn't succeed, we still remove the directory > from snapshottables. > This makes the system inconsistent: we will be able to create snapshots, but > snapshot diff throws "Directory is not snapshottable". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15210) EC : File write hanged when DN is shutdown by admin command.
[ https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15210: -- Fix Version/s: 3.4.0 Resolution: Fixed Status: Resolved (was: Patch Available) Thanks [~ayushtkn] for review. Committed to trunk. > EC : File write hanged when DN is shutdown by admin command. > > > Key: HDFS-15210 > URL: https://issues.apache.org/jira/browse/HDFS-15210 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Fix For: 3.4.0 > > Attachments: HDFS-15210.001.patch, HDFS-15210.002.patch, > HDFS-15210.003.patch, dump.txt > > > EC Blocks : blk_-9223372036854291632_10668910, > blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, > blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910 > > Two block DN restarted : blk_-9223372036854291630_10668910 & > blk_-9223372036854291632_10668910 > {code:java} > 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 > 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code} > > Restarted streams are stuck in below stacktrace : > {code} > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110) > at > org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at > org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46) > {code} -- 
This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15210) EC : File write hanged when DN is shutdown by admin command.
[ https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091207#comment-17091207 ] Surendra Singh Lilhore commented on HDFS-15210: --- Attached v3 patch. > EC : File write hanged when DN is shutdown by admin command. > > > Key: HDFS-15210 > URL: https://issues.apache.org/jira/browse/HDFS-15210 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15210.001.patch, HDFS-15210.002.patch, > HDFS-15210.003.patch, dump.txt > > > EC Blocks : blk_-9223372036854291632_10668910, > blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, > blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910 > > Two block DN restarted : blk_-9223372036854291630_10668910 & > blk_-9223372036854291632_10668910 > {code:java} > 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 > 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code} > > Restarted streams are stuck in below stacktrace : > {code} > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110) > at > org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at > org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: 
hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15210) EC : File write hanged when DN is shutdown by admin command.
[ https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15210: -- Attachment: HDFS-15210.003.patch > EC : File write hanged when DN is shutdown by admin command. > > > Key: HDFS-15210 > URL: https://issues.apache.org/jira/browse/HDFS-15210 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15210.001.patch, HDFS-15210.002.patch, > HDFS-15210.003.patch, dump.txt > > > EC Blocks : blk_-9223372036854291632_10668910, > blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, > blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910 > > Two block DN restarted : blk_-9223372036854291630_10668910 & > blk_-9223372036854291632_10668910 > {code:java} > 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 > 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code} > > Restarted streams are stuck in below stacktrace : > {code} > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110) > at > org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at > org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: 
hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15218) RBF: MountTableRefresherService failed to refresh other router MountTableEntries in secure mode.
[ https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15218: -- Fix Version/s: (was: 3.40) 3.4.0 > RBF: MountTableRefresherService failed to refresh other router > MountTableEntries in secure mode. > > > Key: HDFS-15218 > URL: https://issues.apache.org/jira/browse/HDFS-15218 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Fix For: 3.3.0, 3.4.0 > > Attachments: HDFS-15218.001.patch > > > {code:java} > 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed > to refresh mount table entries cache at router X:25020 | > MountTableRefresherThread.java:69 > java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort > XXX/XXX:0. Failed on local exception: java.io.IOException: > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284) > at > org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15218) RBF: MountTableRefresherService failed to refresh other router MountTableEntries in secure mode.
[ https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15218: -- Fix Version/s: 3.3.0 3.40 Resolution: Fixed Status: Resolved (was: Patch Available) Thanks [~brahmareddy] & [~elgoiri] for review. Committed to trunk, branch-3.3. > RBF: MountTableRefresherService failed to refresh other router > MountTableEntries in secure mode. > > > Key: HDFS-15218 > URL: https://issues.apache.org/jira/browse/HDFS-15218 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Fix For: 3.40, 3.3.0 > > Attachments: HDFS-15218.001.patch > > > {code:java} > 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed > to refresh mount table entries cache at router X:25020 | > MountTableRefresherThread.java:69 > java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort > XXX/XXX:0. Failed on local exception: java.io.IOException: > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284) > at > org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15218) RBF: MountTableRefresherService failed to refresh other router MountTableEntries in secure mode.
[ https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15218: -- Summary: RBF: MountTableRefresherService failed to refresh other router MountTableEntries in secure mode. (was: RBF: MountTableRefresherService failed to refresh other router mount table in secure mode.) > RBF: MountTableRefresherService failed to refresh other router > MountTableEntries in secure mode. > > > Key: HDFS-15218 > URL: https://issues.apache.org/jira/browse/HDFS-15218 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15218.001.patch > > > {code:java} > 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed > to refresh mount table entries cache at router X:25020 | > MountTableRefresherThread.java:69 > java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort > XXX/XXX:0. Failed on local exception: java.io.IOException: > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284) > at > org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15218) RBF: MountTableRefresherService failed to refresh other router mount table in secure mode.
[ https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15218: -- Summary: RBF: MountTableRefresherService failed to refresh other router mount table in secure mode. (was: RBF: MountTableRefresherService fail in secure cluster.) > RBF: MountTableRefresherService failed to refresh other router mount table in > secure mode. > -- > > Key: HDFS-15218 > URL: https://issues.apache.org/jira/browse/HDFS-15218 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15218.001.patch > > > {code:java} > 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed > to refresh mount table entries cache at router X:25020 | > MountTableRefresherThread.java:69 > java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort > XXX/XXX:0. Failed on local exception: java.io.IOException: > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284) > at > org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15218) RBF: MountTableRefresherService fail in secure cluster.
[ https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085080#comment-17085080 ] Surendra Singh Lilhore edited comment on HDFS-15218 at 4/16/20, 4:46 PM: - [~brahmareddy] shall we go ahead with commit? It is important for 3.3.0. was (Author: surendrasingh): [~brahmareddy] shall we go ahead with commit? It is important for 3.3.0. > RBF: MountTableRefresherService fail in secure cluster. > --- > > Key: HDFS-15218 > URL: https://issues.apache.org/jira/browse/HDFS-15218 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15218.001.patch > > > {code:java} > 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed > to refresh mount table entries cache at router X:25020 | > MountTableRefresherThread.java:69 > java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort > XXX/XXX:0. Failed on local exception: java.io.IOException: > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284) > at > org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15218) RBF: MountTableRefresherService fail in secure cluster.
[ https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17085080#comment-17085080 ] Surendra Singh Lilhore commented on HDFS-15218: --- [~brahmareddy] shall we go ahead with commit? It is important for 3.3.0. > RBF: MountTableRefresherService fail in secure cluster. > --- > > Key: HDFS-15218 > URL: https://issues.apache.org/jira/browse/HDFS-15218 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15218.001.patch > > > {code:java} > 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed > to refresh mount table entries cache at router X:25020 | > MountTableRefresherThread.java:69 > java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort > XXX/XXX:0. Failed on local exception: java.io.IOException: > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284) > at > org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15210) EC : File write hanged when DN is shutdown by admin command.
[ https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17080583#comment-17080583 ] Surendra Singh Lilhore commented on HDFS-15210: --- Attached v2 patch > EC : File write hanged when DN is shutdown by admin command. > > > Key: HDFS-15210 > URL: https://issues.apache.org/jira/browse/HDFS-15210 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15210.001.patch, HDFS-15210.002.patch, dump.txt > > > EC Blocks : blk_-9223372036854291632_10668910, > blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, > blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910 > > Two block DN restarted : blk_-9223372036854291630_10668910 & > blk_-9223372036854291632_10668910 > {code:java} > 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 > 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code} > > Restarted streams are stuck in below stacktrace : > {code} > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110) > at > org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at > org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: 
hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15210) EC : File write hanged when DN is shutdown by admin command.
[ https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15210: -- Attachment: HDFS-15210.002.patch > EC : File write hanged when DN is shutdown by admin command. > > > Key: HDFS-15210 > URL: https://issues.apache.org/jira/browse/HDFS-15210 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15210.001.patch, HDFS-15210.002.patch, dump.txt > > > EC Blocks : blk_-9223372036854291632_10668910, > blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, > blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910 > > Two block DN restarted : blk_-9223372036854291630_10668910 & > blk_-9223372036854291632_10668910 > {code:java} > 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 > 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code} > > Restarted streams are stuck in below stacktrace : > {code} > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110) > at > org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at > org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional 
commands, e-mail: hdfs-issues-h...@hadoop.apache.org
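The stack trace in the message above ends in an unbounded LinkedBlockingQueue.take(), which is why the restarted streamers stay stuck indefinitely after the OOB_RESTART acks. A minimal sketch of the defensive pattern, with an assumed class name and timeout (this is not the actual DFSStripedOutputStream code):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: a bounded take() that surfaces a stalled
// pipeline as an error instead of parking the streamer thread forever.
// The class name and timeout are assumptions, not the actual patch.
public class BoundedEventQueue<T> {
  private final LinkedBlockingQueue<T> queue = new LinkedBlockingQueue<>();
  private final long timeoutMs;

  public BoundedEventQueue(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  public void offer(T event) {
    queue.offer(event);
  }

  /** Waits up to timeoutMs for an event; fails fast instead of blocking forever. */
  public T take() {
    try {
      // poll() with a deadline, unlike the unbounded take() in the stack trace
      T event = queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
      if (event == null) {
        throw new IllegalStateException(
            "no pipeline event within " + timeoutMs + " ms; aborting stream");
      }
      return event;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for event", e);
    }
  }
}
```

The point of the sketch is only the design choice: a writer blocked on a queue that will never be fed again should time out and fail the stream rather than hang the close path.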
[jira] [Commented] (HDFS-15198) RBF: In Secure Mode, Router can't refresh other router's mountTableEntries
[ https://issues.apache.org/jira/browse/HDFS-15198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079557#comment-17079557 ] Surendra Singh Lilhore commented on HDFS-15198: --- {quote}Should we merge the code change in HDFS-15218 and the unit test here? {quote} I am ok with this. > RBF: In Secure Mode, Router can't refresh other router's mountTableEntries > -- > > Key: HDFS-15198 > URL: https://issues.apache.org/jira/browse/HDFS-15198 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Reporter: zhengchenyu >Assignee: zhengchenyu >Priority: Major > Attachments: HDFS-15198.001.patch, HDFS-15198.002.patch, > HDFS-15198.003.patch > > Original Estimate: 48h > Remaining Estimate: 48h > > In issue HDFS-13443, update mount table cache imediately. The specified > router update their own mount table cache imediately, then update other's by > rpc protocol refreshMountTableEntries. But in secure mode, can't refresh > other's router's. In specified router's log, error like this > {code} > 2020-02-27 22:59:07,212 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server : > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > 2020-02-27 22:59:07,213 ERROR > org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread: > Failed to refresh mount table entries cache at router $host:8111 > java.io.IOException: DestHost:destPort host:8111 , LocalHost:localPort > $host/$ip:0. 
Failed on local exception: java.io.IOException: > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:288) > at > org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65) > 2020-02-27 22:59:07,214 INFO > org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver: Added > new mount point /test_11 to resolver > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15198) RBF: In Secure Mode, Router can't refresh other router's mountTableEntries
[ https://issues.apache.org/jira/browse/HDFS-15198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17079015#comment-17079015 ] Surendra Singh Lilhore commented on HDFS-15198: --- Please refer HDFS-15218 > RBF: In Secure Mode, Router can't refresh other router's mountTableEntries > -- > > Key: HDFS-15198 > URL: https://issues.apache.org/jira/browse/HDFS-15198 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Reporter: zhengchenyu >Assignee: zhengchenyu >Priority: Major > Attachments: HDFS-15198.001.patch, HDFS-15198.002.patch, > HDFS-15198.003.patch > > Original Estimate: 48h > Remaining Estimate: 48h > > In issue HDFS-13443, update mount table cache imediately. The specified > router update their own mount table cache imediately, then update other's by > rpc protocol refreshMountTableEntries. But in secure mode, can't refresh > other's router's. In specified router's log, error like this > {code} > 2020-02-27 22:59:07,212 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server : > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > 2020-02-27 22:59:07,213 ERROR > org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread: > Failed to refresh mount table entries cache at router $host:8111 > java.io.IOException: DestHost:destPort host:8111 , LocalHost:localPort > $host/$ip:0. 
Failed on local exception: java.io.IOException: > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:288) > at > org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65) > 2020-02-27 22:59:07,214 INFO > org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver: Added > new mount point /test_11 to resolver > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11298) Add storage policy info in FileStatus
[ https://issues.apache.org/jira/browse/HDFS-11298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-11298: -- Resolution: Won't Fix Status: Resolved (was: Patch Available) > Add storage policy info in FileStatus > - > > Key: HDFS-11298 > URL: https://issues.apache.org/jira/browse/HDFS-11298 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 2.7.2 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-11298.001.patch > > > Its good to add storagePolicy field in FileStatus. We no need to call > getStoragePolicy() API to get the policy. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15200) Delete Corrupt Replica Immediately Irrespective of Replicas On Stale Storage
[ https://issues.apache.org/jira/browse/HDFS-15200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060804#comment-17060804 ] Surendra Singh Lilhore commented on HDFS-15200: --- {quote}The default true was suggested by Akira Ajisaka above {quote} I agree with this, it should be true by default. > Delete Corrupt Replica Immediately Irrespective of Replicas On Stale Storage > - > > Key: HDFS-15200 > URL: https://issues.apache.org/jira/browse/HDFS-15200 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Attachments: HDFS-15200-01.patch, HDFS-15200-02.patch, > HDFS-15200-03.patch, HDFS-15200-04.patch, HDFS-15200-05.patch > > > Presently {{invalidateBlock(..)}} before adding a replica into invalidates, > checks whether any block replica is on stale storage, if any replica is on > stale storage, it postpones deletion of the replica. > Here : > {code:java} >// Check how many copies we have of the block > if (nr.replicasOnStaleNodes() > 0) { > blockLog.debug("BLOCK* invalidateBlocks: postponing " + > "invalidation of {} on {} because {} replica(s) are located on " + > "nodes with potentially out-of-date block reports", b, dn, > nr.replicasOnStaleNodes()); > postponeBlock(b.getCorrupted()); > return false; > {code} > > In case of corrupt replica, we can skip this logic and delete the corrupt > replica immediately, as a corrupt replica can't get corrected. > One outcome of this behavior presently is namenodes showing different block > states post failover, as: > If a replica is marked corrupt, the Active NN, will mark it as corrupt, and > mark it for deletion and remove it from corruptReplica's and > excessRedundancyMap. > If before the deletion of replica, Failover happens. > The standby Namenode will mark all the storages as stale. 
> Then will start processing IBR's, Now since the replica's would be on stale > storage, it will skip deletion, and removal from corruptReplica's > Hence both the namenode will show different numbers and different corrupt > replicas. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
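The proposal above — skip the stale-storage postponement when the replica is corrupt, since a corrupt replica can never be corrected — can be sketched as a small decision function. The names here are illustrative stand-ins, not the actual BlockManager code:

```java
// Illustrative sketch of the decision discussed above (not Hadoop code):
// a corrupt replica cannot become valid again, so its invalidation need
// not be postponed even while some replicas sit on stale storage.
public class InvalidationPolicy {
  public static boolean shouldPostpone(boolean replicaIsCorrupt,
                                       int replicasOnStaleNodes) {
    if (replicaIsCorrupt) {
      return false; // delete immediately; corruption cannot be corrected
    }
    // original guard: postpone while any replica is on a stale storage
    return replicasOnStaleNodes > 0;
  }
}
```

Under this policy both namenodes would remove the corrupt replica from corruptReplicas and the excess-redundancy map regardless of failover timing, avoiding the divergent counts described above.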
[jira] [Comment Edited] (HDFS-15227) FSCK -upgradedomains is failing for upgradedomains when more than 2 million blocks present in hdfs and write in progress of some blocks
[ https://issues.apache.org/jira/browse/HDFS-15227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060656#comment-17060656 ] Surendra Singh Lilhore edited comment on HDFS-15227 at 3/17/20, 6:17 AM: - Thanks [~ayushtkn] for patch. +1 was (Author: surendrasingh): Thanks [~ayushtkn] for patch. +1. Just add comment in patch for null check scenario. > FSCK -upgradedomains is failing for upgradedomains when more than 2 million > blocks present in hdfs and write in progress of some blocks > --- > > Key: HDFS-15227 > URL: https://issues.apache.org/jira/browse/HDFS-15227 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: krishna reddy >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-15227-01.patch, TestToRepro.patch > > > FSCK -upgradedomains is failing for upgradedomains when more than 2 million > blocks present in hdfs and write in progress of some blocks > "hdfs fsck / -files -blocks -upgradedomains" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15227) FSCK -upgradedomains is failing for upgradedomains when more than 2 million blocks present in hdfs and write in progress of some blocks
[ https://issues.apache.org/jira/browse/HDFS-15227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060656#comment-17060656 ] Surendra Singh Lilhore commented on HDFS-15227: --- Thanks [~ayushtkn] for patch. +1. Just add comment in patch for null check scenario. > FSCK -upgradedomains is failing for upgradedomains when more than 2 million > blocks present in hdfs and write in progress of some blocks > --- > > Key: HDFS-15227 > URL: https://issues.apache.org/jira/browse/HDFS-15227 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: krishna reddy >Assignee: Ayush Saxena >Priority: Major > Attachments: HDFS-15227-01.patch, TestToRepro.patch > > > FSCK -upgradedomains is failing for upgradedomains when more than 2 million > blocks present in hdfs and write in progress of some blocks > "hdfs fsck / -files -blocks -upgradedomains" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15211) EC: File write hangs during close in case of Exception during updatePipeline
[ https://issues.apache.org/jira/browse/HDFS-15211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15211: -- Fix Version/s: 3.2.2 3.1.4 3.3.0 Resolution: Fixed Status: Resolved (was: Patch Available) Thanks [~ayushtkn] for contribution. Committed to trunk, branch-3.2, branch-3.1. > EC: File write hangs during close in case of Exception during updatePipeline > > > Key: HDFS-15211 > URL: https://issues.apache.org/jira/browse/HDFS-15211 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.1, 3.3.0, 3.2.1 >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Fix For: 3.3.0, 3.1.4, 3.2.2 > > Attachments: HDFS-15211-01.patch, HDFS-15211-02.patch, > HDFS-15211-03.patch, HDFS-15211-04.patch, HDFS-15211-05.patch, > TestToRepro-01.patch, Thread-Dump, Thread-Dump-02 > > > Ec file write hangs during file close, if there is a exception due to closure > of slow stream, and number of data streamers failed increase more than parity > block. > Since in the close, the Stream will try to flush all the healthy streamers, > but the streamers won't be having any result due to exception. and the > streamers will stay stuck. > Hence the close will also get stuck. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
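The hang described above occurs because close() flushes the healthy streamers even after more streamers have failed than the parity count can tolerate, so it waits on acks that will never arrive. A minimal fail-fast check, with assumed names (this is a sketch of the idea, not the committed patch):

```java
// Illustrative sketch: for an EC write with p parity blocks, close()
// can only succeed while at most p streamers have failed. Checking the
// count before flushing the healthy streamers fails fast instead of
// blocking on results that will never come. Names are assumptions.
public class EcCloseCheck {
  public static void checkBeforeFlush(int failedStreamers, int numParityBlocks) {
    if (failedStreamers > numParityBlocks) {
      throw new IllegalStateException(failedStreamers
          + " streamers failed, which exceeds the " + numParityBlocks
          + " parity blocks; the file cannot be closed");
    }
  }
}
```

For RS-6-3, for example, a fourth streamer failure would abort the close immediately instead of hanging it.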
[jira] [Commented] (HDFS-15211) EC: File write hangs during close in case of Exception during updatePipeline
[ https://issues.apache.org/jira/browse/HDFS-15211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059707#comment-17059707 ] Surendra Singh Lilhore commented on HDFS-15211: --- +1 > EC: File write hangs during close in case of Exception during updatePipeline > > > Key: HDFS-15211 > URL: https://issues.apache.org/jira/browse/HDFS-15211 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.1, 3.3.0, 3.2.1 >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Attachments: HDFS-15211-01.patch, HDFS-15211-02.patch, > HDFS-15211-03.patch, HDFS-15211-04.patch, HDFS-15211-05.patch, > TestToRepro-01.patch, Thread-Dump, Thread-Dump-02 > > > Ec file write hangs during file close, if there is a exception due to closure > of slow stream, and number of data streamers failed increase more than parity > block. > Since in the close, the Stream will try to flush all the healthy streamers, > but the streamers won't be having any result due to exception. and the > streamers will stay stuck. > Hence the close will also get stuck. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15220) FSCK calls are redirecting to Active NN
[ https://issues.apache.org/jira/browse/HDFS-15220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058423#comment-17058423 ] Surendra Singh Lilhore commented on HDFS-15220: --- [~weichiu], how will fsck do the msync? It is an HTTP call. > FSCK calls are redirecting to Active NN > --- > > Key: HDFS-15220 > URL: https://issues.apache.org/jira/browse/HDFS-15220 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: krishna reddy >Assignee: Ravuri Sushma sree >Priority: Major > Attachments: screenshot-1.png > > > Run any fsck except (-delete & - move) should go to ONN as it is read > operation > In below image spikes indicates when it ran fsck / -storagepolicies > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15220) FSCK calls are redirecting to Active NN
[ https://issues.apache.org/jira/browse/HDFS-15220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17058089#comment-17058089 ] Surendra Singh Lilhore commented on HDFS-15220: --- Fsck calls should not be sent to the Observer; they should always go to the Active NameNode. Users run this command to check the current state of the server. > FSCK calls are redirecting to Active NN > --- > > Key: HDFS-15220 > URL: https://issues.apache.org/jira/browse/HDFS-15220 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: krishna reddy >Assignee: Ravuri Sushma sree >Priority: Major > Attachments: screenshot-1.png > > > Run any fsck except -delete & - move should go to ONN as it is read operation > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15211) EC: File write hangs during close in case of Exception during updatePipeline
[ https://issues.apache.org/jira/browse/HDFS-15211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057941#comment-17057941 ] Surendra Singh Lilhore commented on HDFS-15211: --- Thanks [~ayushtkn]. Minor comment. Please remove unrelated changes. {code:java} // failures when sending the last packet. We actually do not need to - // bump GS for this kind of failure. Thus counting the total number - // of failures may be good enough. + // bump GS for this kind of failure. Thus counting the total + // number of failures may be good enough.{code} Other changes are good. > EC: File write hangs during close in case of Exception during updatePipeline > > > Key: HDFS-15211 > URL: https://issues.apache.org/jira/browse/HDFS-15211 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.1.1, 3.3.0, 3.2.1 >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > Attachments: HDFS-15211-01.patch, HDFS-15211-02.patch, > HDFS-15211-03.patch, TestToRepro-01.patch, Thread-Dump, Thread-Dump-02 > > > Ec file write hangs during file close, if there is a exception due to closure > of slow stream, and number of data streamers failed increase more than parity > block. > Since in the close, the Stream will try to flush all the healthy streamers, > but the streamers won't be having any result due to exception. and the > streamers will stay stuck. > Hence the close will also get stuck. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-14442) Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId
[ https://issues.apache.org/jira/browse/HDFS-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057935#comment-17057935 ] Surendra Singh Lilhore edited comment on HDFS-14442 at 3/12/20, 1:48 PM: - Re-based test code before committing. Thanks [~Sushma_28] for contribution. Thanks [~xkrogen] & [~ayushtkn] for review. was (Author: surendrasingh): Thanks [~Sushma_28] for contribution. Thanks [~xkrogen] & [~ayushtkn] for review. > Disagreement between HAUtil.getAddressOfActive and > RpcInvocationHandler.getConnectionId > --- > > Key: HDFS-14442 > URL: https://issues.apache.org/jira/browse/HDFS-14442 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Erik Krogen >Assignee: Ravuri Sushma sree >Priority: Major > Fix For: 3.3.0, 3.2.2 > > Attachments: HDFS-14442.001.patch, HDFS-14442.002.patch, > HDFS-14442.003.patch, HDFS-14442.004.patch > > > While working on HDFS-14245, we noticed a discrepancy in some proxy-handling > code. > The description of {{RpcInvocationHandler.getConnectionId()}} states: > {code} > /** >* Returns the connection id associated with the InvocationHandler instance. >* @return ConnectionId >*/ > ConnectionId getConnectionId(); > {code} > It does not make any claims about whether this connection ID will be an > active proxy or not. Yet in {{HAUtil}} we have: > {code} > /** >* Get the internet address of the currently-active NN. This should rarely > be >* used, since callers of this method who connect directly to the NN using > the >* resulting InetSocketAddress will not be able to connect to the active NN > if >* a failover were to occur after this method has been called. >* >* @param fs the file system to get the active address of. >* @return the internet address of the currently-active NN. >* @throws IOException if an error occurs while resolving the active NN. 
>*/ > public static InetSocketAddress getAddressOfActive(FileSystem fs) > throws IOException { > if (!(fs instanceof DistributedFileSystem)) { > throw new IllegalArgumentException("FileSystem " + fs + " is not a > DFS."); > } > // force client address resolution. > fs.exists(new Path("/")); > DistributedFileSystem dfs = (DistributedFileSystem) fs; > DFSClient dfsClient = dfs.getClient(); > return RPC.getServerAddress(dfsClient.getNamenode()); > } > {code} > Where the call {{RPC.getServerAddress()}} eventually terminates into > {{RpcInvocationHandler#getConnectionId()}}, via {{RPC.getServerAddress()}} -> > {{RPC.getConnectionIdForProxy()}} -> > {{RpcInvocationHandler#getConnectionId()}}. {{HAUtil}} appears to be making > an incorrect assumption that {{RpcInvocationHandler}} will necessarily return > an _active_ connection ID. {{ObserverReadProxyProvider}} demonstrates a > counter-example to this, since the current connection ID may be pointing at, > for example, an Observer NameNode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14442) Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId
[ https://issues.apache.org/jira/browse/HDFS-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-14442: -- Fix Version/s: 3.2.2 3.3.0 Resolution: Fixed Status: Resolved (was: Patch Available) Thanks [~Sushma_28] for contribution. Thanks [~xkrogen] & [~ayushtkn] for review. > Disagreement between HAUtil.getAddressOfActive and > RpcInvocationHandler.getConnectionId > --- > > Key: HDFS-14442 > URL: https://issues.apache.org/jira/browse/HDFS-14442 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Erik Krogen >Assignee: Ravuri Sushma sree >Priority: Major > Fix For: 3.3.0, 3.2.2 > > Attachments: HDFS-14442.001.patch, HDFS-14442.002.patch, > HDFS-14442.003.patch, HDFS-14442.004.patch > > > While working on HDFS-14245, we noticed a discrepancy in some proxy-handling > code. > The description of {{RpcInvocationHandler.getConnectionId()}} states: > {code} > /** >* Returns the connection id associated with the InvocationHandler instance. >* @return ConnectionId >*/ > ConnectionId getConnectionId(); > {code} > It does not make any claims about whether this connection ID will be an > active proxy or not. Yet in {{HAUtil}} we have: > {code} > /** >* Get the internet address of the currently-active NN. This should rarely > be >* used, since callers of this method who connect directly to the NN using > the >* resulting InetSocketAddress will not be able to connect to the active NN > if >* a failover were to occur after this method has been called. >* >* @param fs the file system to get the active address of. >* @return the internet address of the currently-active NN. >* @throws IOException if an error occurs while resolving the active NN. 
>*/ > public static InetSocketAddress getAddressOfActive(FileSystem fs) > throws IOException { > if (!(fs instanceof DistributedFileSystem)) { > throw new IllegalArgumentException("FileSystem " + fs + " is not a > DFS."); > } > // force client address resolution. > fs.exists(new Path("/")); > DistributedFileSystem dfs = (DistributedFileSystem) fs; > DFSClient dfsClient = dfs.getClient(); > return RPC.getServerAddress(dfsClient.getNamenode()); > } > {code} > Where the call {{RPC.getServerAddress()}} eventually terminates into > {{RpcInvocationHandler#getConnectionId()}}, via {{RPC.getServerAddress()}} -> > {{RPC.getConnectionIdForProxy()}} -> > {{RpcInvocationHandler#getConnectionId()}}. {{HAUtil}} appears to be making > an incorrect assumption that {{RpcInvocationHandler}} will necessarily return > an _active_ connection ID. {{ObserverReadProxyProvider}} demonstrates a > counter-example to this, since the current connection ID may be pointing at, > for example, an Observer NameNode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15198) RBF: In Secure Mode, Router can't refresh other router's mountTableEntries
[ https://issues.apache.org/jira/browse/HDFS-15198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056291#comment-17056291 ] Surendra Singh Lilhore commented on HDFS-15198: --- Thanks [~zhengchenyu] for the patch. We can't change the RouterClient. {code:java} -this.ugi = UserGroupInformation.getCurrentUser(); +if (UserGroupInformation.isSecurityEnabled()) { + this.ugi = UserGroupInformation.getLoginUser(); +} else { + this.ugi = UserGroupInformation.getCurrentUser(); +} {code} It is also used in RouterAdmin, and there it should be currentUser() only. Please refer to DFSAdmin.java > RBF: In Secure Mode, Router can't refresh other router's mountTableEntries > -- > > Key: HDFS-15198 > URL: https://issues.apache.org/jira/browse/HDFS-15198 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Reporter: zhengchenyu >Assignee: zhengchenyu >Priority: Major > Attachments: HDFS-15198.001.patch, HDFS-15198.002.patch, > HDFS-15198.003.patch > > Original Estimate: 48h > Remaining Estimate: 48h > > In issue HDFS-13443, update mount table cache imediately. The specified > router update their own mount table cache imediately, then update other's by > rpc protocol refreshMountTableEntries. But in secure mode, can't refresh > other's router's. In specified router's log, error like this > {code} > 2020-02-27 22:59:07,212 WARN org.apache.hadoop.ipc.Client: Exception > encountered while connecting to the server : > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > 2020-02-27 22:59:07,213 ERROR > org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread: > Failed to refresh mount table entries cache at router $host:8111 > java.io.IOException: DestHost:destPort host:8111 , LocalHost:localPort > $host/$ip:0. 
Failed on local exception: java.io.IOException: > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:288) > at > org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65) > 2020-02-27 22:59:07,214 INFO > org.apache.hadoop.hdfs.server.federation.resolver.MountTableResolver: Added > new mount point /test_11 to resolver > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
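The review point in the comment above — the background mount-table refresher should authenticate as the service's Kerberos login user, while the RouterAdmin CLI must keep the current (invoking) user — can be sketched as a plain decision function. The enums and names here are stand-ins for illustration only, not Hadoop's UserGroupInformation API:

```java
// Illustrative sketch of the UGI choice discussed above. In a secure
// cluster the server-side refresher thread needs the service principal
// (login user from the keytab); an admin running the CLI must remain
// the current user. All names here are assumptions for illustration.
public class UgiChoice {
  public enum Caller { REFRESHER_THREAD, ADMIN_CLI }
  public enum Ugi { LOGIN_USER, CURRENT_USER }

  public static Ugi choose(Caller caller, boolean securityEnabled) {
    if (caller == Caller.REFRESHER_THREAD && securityEnabled) {
      return Ugi.LOGIN_USER;   // service principal from the keytab
    }
    return Ugi.CURRENT_USER;   // e.g. the admin invoking the CLI
  }
}
```

This is why the comment rejects changing RouterClient itself: the same client class serves both callers, and only the refresher path should switch identities.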
[jira] [Commented] (HDFS-15218) RBF: MountTableRefresherService fail in secure cluster.
[ https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056285#comment-17056285 ] Surendra Singh Lilhore commented on HDFS-15218: --- [~elgoiri], yes, it is the same. > RBF: MountTableRefresherService fail in secure cluster. > --- > > Key: HDFS-15218 > URL: https://issues.apache.org/jira/browse/HDFS-15218 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15218.001.patch > > > {code:java} > 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed > to refresh mount table entries cache at router X:25020 | > MountTableRefresherThread.java:69 > java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort > XXX/XXX:0. Failed on local exception: java.io.IOException: > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284) > at > org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15218) RBF: MountTableRefresherService fail in secure cluster.
[ https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15218: -- Status: Patch Available (was: Open) > RBF: MountTableRefresherService fail in secure cluster. > --- > > Key: HDFS-15218 > URL: https://issues.apache.org/jira/browse/HDFS-15218 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15218.001.patch > > > {code:java} > 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed > to refresh mount table entries cache at router X:25020 | > MountTableRefresherThread.java:69 > java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort > XXX/XXX:0. Failed on local exception: java.io.IOException: > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284) > at > org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15218) RBF: MountTableRefresherService fail in secure cluster.
[ https://issues.apache.org/jira/browse/HDFS-15218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15218: -- Attachment: HDFS-15218.001.patch > RBF: MountTableRefresherService fail in secure cluster. > --- > > Key: HDFS-15218 > URL: https://issues.apache.org/jira/browse/HDFS-15218 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15218.001.patch > > > {code:java} > 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed > to refresh mount table entries cache at router X:25020 | > MountTableRefresherThread.java:69 > java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort > XXX/XXX:0. Failed on local exception: java.io.IOException: > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284) > at > org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15210) EC : File write hanged when DN is shutdown by admin command.
[ https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15210: -- Status: Patch Available (was: Open) > EC : File write hanged when DN is shutdown by admin command. > > > Key: HDFS-15210 > URL: https://issues.apache.org/jira/browse/HDFS-15210 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15210.001.patch, dump.txt > > > EC Blocks : blk_-9223372036854291632_10668910, > blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, > blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910 > > Two block DN restarted : blk_-9223372036854291630_10668910 & > blk_-9223372036854291632_10668910 > {code:java} > 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 > 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code} > > Restarted streams are stuck in below stacktrace : > {code} > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110) > at > org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at > org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: 
hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15210) EC : File write hanged when DN is shutdown by admin command.
[ https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15210: -- Attachment: HDFS-15210.001.patch > EC : File write hanged when DN is shutdown by admin command. > > > Key: HDFS-15210 > URL: https://issues.apache.org/jira/browse/HDFS-15210 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15210.001.patch, dump.txt > > > EC Blocks : blk_-9223372036854291632_10668910, > blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, > blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910 > > Two block DN restarted : blk_-9223372036854291630_10668910 & > blk_-9223372036854291632_10668910 > {code:java} > 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 > 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code} > > Restarted streams are stuck in below stacktrace : > {code} > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110) > at > org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at > org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: 
hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.
[ https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15135: -- Fix Version/s: 3.2.2 3.3.0 Resolution: Fixed Status: Resolved (was: Patch Available) Thanks [~Sushma_28] for contribution. > EC : ArrayIndexOutOfBoundsException in > BlockRecoveryWorker#RecoveryTaskStriped. > --- > > Key: HDFS-15135 > URL: https://issues.apache.org/jira/browse/HDFS-15135 > Project: Hadoop HDFS > Issue Type: Bug > Components: erasure-coding >Reporter: Surendra Singh Lilhore >Assignee: Ravuri Sushma sree >Priority: Major > Fix For: 3.3.0, 3.2.2 > > Attachments: HDFS-15135-branch-3.2.001.patch, > HDFS-15135-branch-3.2.002.patch, HDFS-15135.001.patch, HDFS-15135.002.patch, > HDFS-15135.003.patch, HDFS-15135.004.patch, HDFS-15135.005.patch > > > {noformat} > java.lang.ArrayIndexOutOfBoundsException: 8 >at > org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464) >at > org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602) >at java.lang.Thread.run(Thread.java:745) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15218) RBF : MountTableRefresherService fail in secure cluster.
Surendra Singh Lilhore created HDFS-15218: - Summary: RBF : MountTableRefresherService fail in secure cluster. Key: HDFS-15218 URL: https://issues.apache.org/jira/browse/HDFS-15218 Project: Hadoop HDFS Issue Type: Bug Components: rbf Affects Versions: 3.1.1 Reporter: Surendra Singh Lilhore Assignee: Surendra Singh Lilhore {code:java} 2020-03-09 12:43:50,082 | ERROR | MountTableRefresh_linux-133:25020 | Failed to refresh mount table entries cache at router X:25020 | MountTableRefresherThread.java:69 java.io.IOException: DestHost:destPort X:25020 , LocalHost:localPort XXX/XXX:0. Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] at org.apache.hadoop.hdfs.protocolPB.RouterAdminProtocolTranslatorPB.refreshMountTableEntries(RouterAdminProtocolTranslatorPB.java:284) at org.apache.hadoop.hdfs.server.federation.router.MountTableRefresherThread.run(MountTableRefresherThread.java:65) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.
[ https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055688#comment-17055688 ] Surendra Singh Lilhore commented on HDFS-15135: --- +1, will merge today. > EC : ArrayIndexOutOfBoundsException in > BlockRecoveryWorker#RecoveryTaskStriped. > --- > > Key: HDFS-15135 > URL: https://issues.apache.org/jira/browse/HDFS-15135 > Project: Hadoop HDFS > Issue Type: Bug > Components: erasure-coding >Reporter: Surendra Singh Lilhore >Assignee: Ravuri Sushma sree >Priority: Major > Attachments: HDFS-15135-branch-3.2.001.patch, > HDFS-15135-branch-3.2.002.patch, HDFS-15135.001.patch, HDFS-15135.002.patch, > HDFS-15135.003.patch, HDFS-15135.004.patch, HDFS-15135.005.patch > > > {noformat} > java.lang.ArrayIndexOutOfBoundsException: 8 >at > org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464) >at > org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602) >at java.lang.Thread.run(Thread.java:745) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.
[ https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054328#comment-17054328 ] Surendra Singh Lilhore commented on HDFS-15135: --- Please fix the check-style issues. > EC : ArrayIndexOutOfBoundsException in > BlockRecoveryWorker#RecoveryTaskStriped. > --- > > Key: HDFS-15135 > URL: https://issues.apache.org/jira/browse/HDFS-15135 > Project: Hadoop HDFS > Issue Type: Bug > Components: erasure-coding >Reporter: Surendra Singh Lilhore >Assignee: Ravuri Sushma sree >Priority: Major > Attachments: HDFS-15135-branch-3.2.001.patch, HDFS-15135.001.patch, > HDFS-15135.002.patch, HDFS-15135.003.patch, HDFS-15135.004.patch, > HDFS-15135.005.patch > > > {noformat} > java.lang.ArrayIndexOutOfBoundsException: 8 >at > org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464) >at > org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602) >at java.lang.Thread.run(Thread.java:745) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15210) EC : File write hanged when DN is shutdown by admin command.
[ https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15210: -- Attachment: dump.txt > EC : File write hanged when DN is shutdown by admin command. > > > Key: HDFS-15210 > URL: https://issues.apache.org/jira/browse/HDFS-15210 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: dump.txt > > > EC Blocks : blk_-9223372036854291632_10668910, > blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, > blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910 > > Two block DN restarted : blk_-9223372036854291630_10668910 & > blk_-9223372036854291632_10668910 > {code:java} > 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 > 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code} > > Restarted streams are stuck in below stacktrace : > {code} > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110) > at > org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at > org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15210) EC : File write hanged when DN is shutdown by admin command.
[ https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15210: -- Description: EC Blocks : blk_-9223372036854291632_10668910, blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910 Two block DN restarted : blk_-9223372036854291630_10668910 & blk_-9223372036854291632_10668910 {code:java} 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code} Restarted streams are stuck in below stacktrace : {code} java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110) at org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46) {code} was: EC Blocks : blk_-9223372036854291632_10668910, blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910 Two block DN restarted : blk_-9223372036854291630_10668910 & blk_-9223372036854291632_10668910 {code:java} 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code} Restarted streams are stuck in below 
stacktrace : {noformat} java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110) at org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46){noformat} > EC : File write hanged when DN is shutdown by admin command. > > > Key: HDFS-15210 > URL: https://issues.apache.org/jira/browse/HDFS-15210 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > > EC Blocks : blk_-9223372036854291632_10668910, > blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, > blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910 > > Two block DN restarted : blk_-9223372036854291630_10668910 & > blk_-9223372036854291632_10668910 > {code:java} > 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 > 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code} > > Restarted streams are stuck in below stacktrace : > {code} > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110) > at > org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540) > at > 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at > org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15210) EC : File write hand when DN is shutdown by admin command.
Surendra Singh Lilhore created HDFS-15210: - Summary: EC : File write hand when DN is shutdown by admin command. Key: HDFS-15210 URL: https://issues.apache.org/jira/browse/HDFS-15210 Project: Hadoop HDFS Issue Type: Bug Components: ec Affects Versions: 3.1.1 Reporter: Surendra Singh Lilhore Assignee: Surendra Singh Lilhore EC Blocks : blk_-9223372036854291632_10668910, blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910 Two block DN restarted : blk_-9223372036854291630_10668910 & blk_-9223372036854291632_10668910 {code:java} 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code} Restarted streams are stuck in below stacktrace : {noformat} java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110) at org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140) at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540) at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276) at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46){noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
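The streamer in the dump above is parked in `MultipleBlockingQueue.take()`, which blocks indefinitely when the follow-up event for the restarted DN is never queued. A minimal, self-contained sketch (plain `java.util.concurrent`, not the HDFS code or the eventual patch; the class and method names are made up for illustration) of the difference between an unbounded `take()` and a bounded `poll()` that lets the caller fall back to recovery instead of hanging:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class TakeVsPoll {
    // Poll a queue that never receives an event, mimicking a streamer
    // waiting on a DN that was restarted and never re-joined the pipeline.
    static String pollWithTimeout(long millis) throws InterruptedException {
        LinkedBlockingQueue<String> events = new LinkedBlockingQueue<>();
        // events.take() here would block forever, as in the stacktrace above;
        // poll(timeout) returns null once the wait expires.
        return events.poll(millis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        if (pollWithTimeout(500) == null) {
            System.out.println("timed out; caller can trigger pipeline recovery");
        }
    }
}
```

The point is only the control-flow shape: a bounded wait gives the stream a chance to re-run its error handling rather than staying stuck in `setupPipelineInternal`.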
[jira] [Updated] (HDFS-15210) EC : File write hanged when DN is shutdown by admin command.
[ https://issues.apache.org/jira/browse/HDFS-15210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15210: -- Summary: EC : File write hanged when DN is shutdown by admin command. (was: EC : File write hand when DN is shutdown by admin command.) > EC : File write hanged when DN is shutdown by admin command. > > > Key: HDFS-15210 > URL: https://issues.apache.org/jira/browse/HDFS-15210 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > > EC Blocks : blk_-9223372036854291632_10668910, > blk_-9223372036854291631_10668910, blk_-9223372036854291630_10668910, > blk_-9223372036854291629_10668910, blk_-9223372036854291628_10668910 > > Two block DN restarted : blk_-9223372036854291630_10668910 & > blk_-9223372036854291632_10668910 > {code:java} > 2020-03-03 18:12:17,074 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 > 2020-03-03 18:13:39,469 DEBUG hdfs.DataStreamer: DFSClient seqno: -2 reply: > OOB_RESTART downstreamAckTimeNanos: 0 flag: 8 {code} > > Restarted streams are stuck in below stacktrace : > {noformat} > java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) > at > org.apache.hadoop.hdfs.DFSStripedOutputStream$MultipleBlockingQueue.take(DFSStripedOutputStream.java:110) > at > org.apache.hadoop.hdfs.StripedDataStreamer.setupPipelineInternal(StripedDataStreamer.java:140) > at > org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1540) > at > org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1276) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:669) at > org.apache.hadoop.hdfs.StripedDataStreamer.run(StripedDataStreamer.java:46){noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: 
hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14442) Disagreement between HAUtil.getAddressOfActive and RpcInvocationHandler.getConnectionId
[ https://issues.apache.org/jira/browse/HDFS-14442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051819#comment-17051819 ] Surendra Singh Lilhore commented on HDFS-14442: --- +1 {quote}v003 patch LGTM, I will commit once I get a chance to verify the tests locally. {quote} [~xkrogen], any comment ? > Disagreement between HAUtil.getAddressOfActive and > RpcInvocationHandler.getConnectionId > --- > > Key: HDFS-14442 > URL: https://issues.apache.org/jira/browse/HDFS-14442 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.3.0 >Reporter: Erik Krogen >Assignee: Ravuri Sushma sree >Priority: Major > Attachments: HDFS-14442.001.patch, HDFS-14442.002.patch, > HDFS-14442.003.patch, HDFS-14442.004.patch > > > While working on HDFS-14245, we noticed a discrepancy in some proxy-handling > code. > The description of {{RpcInvocationHandler.getConnectionId()}} states: > {code} > /** >* Returns the connection id associated with the InvocationHandler instance. >* @return ConnectionId >*/ > ConnectionId getConnectionId(); > {code} > It does not make any claims about whether this connection ID will be an > active proxy or not. Yet in {{HAUtil}} we have: > {code} > /** >* Get the internet address of the currently-active NN. This should rarely > be >* used, since callers of this method who connect directly to the NN using > the >* resulting InetSocketAddress will not be able to connect to the active NN > if >* a failover were to occur after this method has been called. >* >* @param fs the file system to get the active address of. >* @return the internet address of the currently-active NN. >* @throws IOException if an error occurs while resolving the active NN. >*/ > public static InetSocketAddress getAddressOfActive(FileSystem fs) > throws IOException { > if (!(fs instanceof DistributedFileSystem)) { > throw new IllegalArgumentException("FileSystem " + fs + " is not a > DFS."); > } > // force client address resolution. 
> fs.exists(new Path("/")); > DistributedFileSystem dfs = (DistributedFileSystem) fs; > DFSClient dfsClient = dfs.getClient(); > return RPC.getServerAddress(dfsClient.getNamenode()); > } > {code} > Where the call {{RPC.getServerAddress()}} eventually terminates into > {{RpcInvocationHandler#getConnectionId()}}, via {{RPC.getServerAddress()}} -> > {{RPC.getConnectionIdForProxy()}} -> > {{RpcInvocationHandler#getConnectionId()}}. {{HAUtil}} appears to be making > an incorrect assumption that {{RpcInvocationHandler}} will necessarily return > an _active_ connection ID. {{ObserverReadProxyProvider}} demonstrates a > counter-example to this, since the current connection ID may be pointing at, > for example, an Observer NameNode. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
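The crux of the issue above is that `RpcInvocationHandler.getConnectionId()` only promises *a* connection id, while `HAUtil.getAddressOfActive` treats it as the active one. A toy `java.lang.reflect.Proxy` sketch (hypothetical names, not Hadoop code) of an invocation handler whose reported endpoint is simply whatever it last talked to, which need not be the active node:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

public class ConnectionIdDemo {
    interface Service { String ping(); }

    // Rotates calls across endpoints, like a proxy provider that may route
    // reads to an observer. Its "connection id" is just the last endpoint used.
    static class RotatingHandler implements InvocationHandler {
        private final String[] endpoints = {"observer:8020", "active:8020"};
        private int next = 0;
        String lastEndpoint = "none";

        @Override
        public Object invoke(Object proxy, Method m, Object[] args) {
            lastEndpoint = endpoints[next++ % endpoints.length];
            return "pong from " + lastEndpoint;
        }
    }

    static String connectionIdAfterOneCall() {
        RotatingHandler handler = new RotatingHandler();
        Service s = (Service) Proxy.newProxyInstance(
            Service.class.getClassLoader(),
            new Class<?>[]{Service.class}, handler);
        s.ping();  // first call happens to land on the observer
        return handler.lastEndpoint;
    }

    public static void main(String[] args) {
        // The reported endpoint is the observer, not the active node.
        System.out.println(connectionIdAfterOneCall());
    }
}
```

This mirrors the `ObserverReadProxyProvider` counter-example in the description: nothing in the handler's contract forces the returned id to point at the active NN.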
[jira] [Commented] (HDFS-14977) Quota Usage and Content summary are not same in Truncate with Snapshot
[ https://issues.apache.org/jira/browse/HDFS-14977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050947#comment-17050947 ] Surendra Singh Lilhore commented on HDFS-14977: --- +1 > Quota Usage and Content summary are not same in Truncate with Snapshot > --- > > Key: HDFS-14977 > URL: https://issues.apache.org/jira/browse/HDFS-14977 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-14977.001.patch, HDFS-14977.002.patch, > HDFS-14977.003.patch > > > steps : hdfs dfs -mkdir /dir > hdfs dfs -put file /dir (file size = 10bytes) > hdfs dfsadmin -allowSnapshot /dir > hdfs dfs -createSnapshot /dir s1 > space consumed with Quotausage and Content Summary is 30bytes > hdfs dfs -truncate -w 5 /dir/file > space consumed with Quotausage , Content Summary is 45 bytes > hdfs dfs -deleteSnapshot /dir s1 > space consumed with Quotausage is 45bytes and Content Summary is 15bytes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15200) Delete Corrupt Replica Immediately Irrespective of Replicas On Stale Storage
[ https://issues.apache.org/jira/browse/HDFS-15200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17049202#comment-17049202 ] Surendra Singh Lilhore commented on HDFS-15200: --- I feel we can delete the corrupt replica, because there is no chance of it getting corrected. The replica on the stale storage will be reported live in the next BR, hopefully :). [~arp], [~aajisaka], [~weichiu], any thoughts on this? > Delete Corrupt Replica Immediately Irrespective of Replicas On Stale Storage > - > > Key: HDFS-15200 > URL: https://issues.apache.org/jira/browse/HDFS-15200 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Critical > > Presently {{invalidateBlock(..)}} before adding a replica into invalidates, > checks whether any block replica is on stale storage, if any replica is on stale storage, it postpones deletion of the replica. > Here : > {code:java} >// Check how many copies we have of the block > if (nr.replicasOnStaleNodes() > 0) { > blockLog.debug("BLOCK* invalidateBlocks: postponing " + > "invalidation of {} on {} because {} replica(s) are located on " + > "nodes with potentially out-of-date block reports", b, dn, > nr.replicasOnStaleNodes()); > postponeBlock(b.getCorrupted()); > return false; > {code} > > In case of corrupt replica, we can skip this logic and delete the corrupt > replica immediately, as a corrupt replica can't get corrected. > One outcome of this behavior presently is namenodes showing different block > states post failover, as: > If a replica is marked corrupt, the Active NN, will mark it as corrupt, and > mark it for deletion and remove it from corruptReplica's and > excessRedundancyMap. > If before the deletion of replica, Failover happens. > The standby Namenode will mark all the storages as stale. 
> Then will start processing IBR's, Now since the replica's would be on stale > storage, it will skip deletion, and removal from corruptReplica's > Hence both the namenode will show different numbers and different corrupt > replicas. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
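The proposal above reduces to a small decision rule: the stale-storage postponement protects replicas that a fresh block report might vindicate, but a corrupt replica can never be vindicated. A toy model (not the actual `invalidateBlock(..)` change; the method name is made up) of that rule:

```java
public class InvalidateDecision {
    /**
     * @param replicasOnStaleNodes replicas living on storages marked stale
     * @param isCorrupt            whether the replica being invalidated is corrupt
     * @return true if the replica should be deleted now, false if postponed
     */
    static boolean shouldDeleteNow(int replicasOnStaleNodes, boolean isCorrupt) {
        if (isCorrupt) {
            // A corrupt replica can never become valid again, so there is
            // nothing for a fresh block report to save; delete immediately.
            return true;
        }
        // Otherwise keep the existing behavior: wait until no replica sits
        // on a storage with potentially out-of-date block reports.
        return replicasOnStaleNodes == 0;
    }

    public static void main(String[] args) {
        System.out.println(shouldDeleteNow(2, true));   // corrupt: delete now
        System.out.println(shouldDeleteNow(2, false));  // healthy but stale: postpone
    }
}
```

Deleting eagerly in the corrupt case would also keep the active and standby NameNodes consistent across a failover, since neither would be left holding a postponed corrupt replica.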
[jira] [Commented] (HDFS-15159) Prevent adding same DN multiple times in PendingReconstructionBlocks
[ https://issues.apache.org/jira/browse/HDFS-15159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048640#comment-17048640 ] Surendra Singh Lilhore commented on HDFS-15159: --- [~hemanthboyina], thanks for the patch. Better to add a test here. You can mock the DN commands and assert the scheduled replica targets. > Prevent adding same DN multiple times in PendingReconstructionBlocks > > > Key: HDFS-15159 > URL: https://issues.apache.org/jira/browse/HDFS-15159 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-15159.001.patch > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-14977) Quota Usage and Content summary are not same in Truncate with Snapshot
[ https://issues.apache.org/jira/browse/HDFS-14977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048637#comment-17048637 ] Surendra Singh Lilhore edited comment on HDFS-14977 at 3/1/20 5:23 PM: --- Thanks [~hemanthboyina] for the patch. Changes look good. Some comments on the test code: simplify the variables as below and remove the string variables. {code:java} Path root = new Path("/"); Path dirPath = new Path(root, "dir"); assertTrue(fs.mkdirs(dirPath)); Path filePath = new Path(dirPath, "file"); {code} [~elgoiri], can we remove the {{csSpaceConsumed, qoSpaceConsumed}} variables and call the functions directly in the assert, like below? {code:java} assertEquals(fs.getContentSummary(root).getSpaceConsumed(), fs.getQuotaUsage(root).getSpaceConsumed());{code} was (Author: surendrasingh): Thanks [~hemanthboyina] for patch. Changes looks good. Some comments for test code. Simplify the variables like below. Remove string variables. {code:java} Path root = new Path("/"); Path dirPath = new Path(root,"dir"); assertTrue(fs.mkdirs(dirPath));; Path filePath = new Path(dirPath, "file"); {code} [~elgoiri], Can we remove \{{ csSpaceConsumed, qoSpaceConsumed}} variables and add function call in assert like below ? 
{code:java} assertEquals(fs.getContentSummary(root).getSpaceConsumed(), fs.getQuotaUsage(root).getSpaceConsumed());{code} > Quota Usage and Content summary are not same in Truncate with Snapshot > --- > > Key: HDFS-14977 > URL: https://issues.apache.org/jira/browse/HDFS-14977 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-14977.001.patch, HDFS-14977.002.patch > > > steps : hdfs dfs -mkdir /dir > hdfs dfs -put file /dir (file size = 10bytes) > hdfs dfsadmin -allowSnapshot /dir > hdfs dfs -createSnapshot /dir s1 > space consumed with Quotausage and Content Summary is 30bytes > hdfs dfs -truncate -w 5 /dir/file > space consumed with Quotausage , Content Summary is 45 bytes > hdfs dfs -deleteSnapshot /dir s1 > space consumed with Quotausage is 45bytes and Content Summary is 15bytes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14977) Quota Usage and Content summary are not same in Truncate with Snapshot
[ https://issues.apache.org/jira/browse/HDFS-14977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17048637#comment-17048637 ] Surendra Singh Lilhore commented on HDFS-14977: --- Thanks [~hemanthboyina] for the patch. Changes look good. Some comments on the test code: simplify the variables as below and remove the string variables. {code:java} Path root = new Path("/"); Path dirPath = new Path(root, "dir"); assertTrue(fs.mkdirs(dirPath)); Path filePath = new Path(dirPath, "file"); {code} [~elgoiri], can we remove the {{csSpaceConsumed, qoSpaceConsumed}} variables and call the functions directly in the assert, like below? {code:java} assertEquals(fs.getContentSummary(root).getSpaceConsumed(), fs.getQuotaUsage(root).getSpaceConsumed());{code} > Quota Usage and Content summary are not same in Truncate with Snapshot > --- > > Key: HDFS-14977 > URL: https://issues.apache.org/jira/browse/HDFS-14977 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: hemanthboyina >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-14977.001.patch, HDFS-14977.002.patch > > > steps : hdfs dfs -mkdir /dir > hdfs dfs -put file /dir (file size = 10bytes) > hdfs dfsadmin -allowSnapshot /dir > hdfs dfs -createSnapshot /dir s1 > space consumed with Quotausage and Content Summary is 30bytes > hdfs dfs -truncate -w 5 /dir/file > space consumed with Quotausage , Content Summary is 45 bytes > hdfs dfs -deleteSnapshot /dir s1 > space consumed with Quotausage is 45bytes and Content Summary is 15bytes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15199) NPE in BlockSender
[ https://issues.apache.org/jira/browse/HDFS-15199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15199: -- Fix Version/s: 3.2.2 3.1.4 3.3.0 Hadoop Flags: Reviewed Resolution: Fixed Status: Resolved (was: Patch Available) Committed to trunk, branch-3.2, branch-3.1. > NPE in BlockSender > -- > > Key: HDFS-15199 > URL: https://issues.apache.org/jira/browse/HDFS-15199 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2 > > Attachments: HDFS-15199-01.patch > > > {noformat} > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:662) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:819) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:766) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:607) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:152) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:104) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290) > at java.lang.Thread.run(Thread.java:748) > 2020-02-28 11:49:13,357 [stripedRead-0] INFO datanode.DataNode > (StripedBlockReader.java:call(182)) - Premature EOF reading from > org.apache.hadoop.net.SocketInputStream@8a99d11 > 2020-02-28 11:49:13,362 [ResponseProcessor for block > BP-1162371257-10.19.127.112-1582870703783:blk_-9223372036854775774_1004] WARN > hdfs.DataStreamer (DataStreamer.java:run(1217)) - Exception for > BP-1162371257-10.19.127.112-1582870703783:blk > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15199) NPE in BlockSender
[ https://issues.apache.org/jira/browse/HDFS-15199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17047607#comment-17047607 ] Surendra Singh Lilhore commented on HDFS-15199: --- Thanks [~ayushtkn] for the contribution.
[jira] [Commented] (HDFS-15199) NPE in BlockSender
[ https://issues.apache.org/jira/browse/HDFS-15199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17047574#comment-17047574 ] Surendra Singh Lilhore commented on HDFS-15199: --- +1
[jira] [Updated] (HDFS-15167) Block Report Interval shouldn't be reset apart from first Block Report
[ https://issues.apache.org/jira/browse/HDFS-15167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15167: -- Fix Version/s: 3.3.0 Resolution: Fixed Status: Resolved (was: Patch Available) Committed to trunk. Thanks [~elgoiri] for the review and [~ayushtkn] for the contribution. > Block Report Interval shouldn't be reset apart from first Block Report > -- > > Key: HDFS-15167 > URL: https://issues.apache.org/jira/browse/HDFS-15167 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Ayush Saxena >Assignee: Ayush Saxena >Priority: Major > Fix For: 3.3.0 > > Attachments: HDFS-15167-01.patch, HDFS-15167-02.patch, > HDFS-15167-03.patch, HDFS-15167-04.patch, HDFS-15167-05.patch, > HDFS-15167-06.patch, HDFS-15167-07.patch, HDFS-15167-08.patch > > > Presently the BlockReport interval is reset even when the BR is manually triggered or is triggered for a diskError, which isn't required. As the code comment also indicates, the reset is intended for the first BR only: > {code:java} > // If we have sent the first set of block reports, then wait a random > // time before we start the periodic block reports. > if (resetBlockReportTime) { > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15167) Block Report Interval shouldn't be reset apart from first Block Report
[ https://issues.apache.org/jira/browse/HDFS-15167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17046242#comment-17046242 ] Surendra Singh Lilhore commented on HDFS-15167: --- +1
[jira] [Commented] (HDFS-15167) Block Report Interval shouldn't be reset apart from first Block Report
[ https://issues.apache.org/jira/browse/HDFS-15167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037730#comment-17037730 ] Surendra Singh Lilhore commented on HDFS-15167: --- Thanks [~ayushtkn] for the patch. One doubt: do we need to use {{resetBlockReportTime}} in {{scheduleBlockReport()}}?
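The {{resetBlockReportTime}} behavior being discussed can be sketched as follows. This is a minimal illustration, not the actual DataNode code; the class and method names are assumptions (the real logic lives around BPServiceActor's block report scheduling in Hadoop).

```java
import java.util.Random;

// Sketch: only the FIRST block report gets a randomized delay; subsequent
// reports (including manually triggered or diskError-triggered ones) must
// keep the existing periodic cadence rather than resetting it.
class BlockReportSchedulerSketch {
  private final long intervalMs;
  private final Random rand = new Random();
  // True only until the first block report has been scheduled.
  private boolean resetBlockReportTime = true;
  private long nextBlockReportMs;

  BlockReportSchedulerSketch(long intervalMs) {
    this.intervalMs = intervalMs;
  }

  void scheduleNext(long nowMs) {
    if (resetBlockReportTime) {
      // Spread first reports over [0, interval) so all DataNodes in a large
      // cluster do not send their first full report at the same time.
      nextBlockReportMs = nowMs + (long) (rand.nextDouble() * intervalMs);
      resetBlockReportTime = false;
    } else {
      // Periodic case: extend the schedule by one interval, never re-randomize.
      nextBlockReportMs += intervalMs;
    }
  }

  long getNextBlockReportMs() {
    return nextBlockReportMs;
  }
}
```

The question above amounts to asking which callers should ever take the `resetBlockReportTime` branch: under this sketch, only the very first schedule does.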
[jira] [Comment Edited] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.
[ https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037723#comment-17037723 ] Surendra Singh Lilhore edited comment on HDFS-15135 at 2/16/20 7:20 AM: +1 Committed to trunk. [~Sushma_28], please attach the patch for branch-3.2. The test code needs to be rebased. was (Author: surendrasingh): Committed to trunk. [~Sushma_28], please attach the patch for branch-3.2. The test code needs to be rebased. > EC : ArrayIndexOutOfBoundsException in > BlockRecoveryWorker#RecoveryTaskStriped. > --- > > Key: HDFS-15135 > URL: https://issues.apache.org/jira/browse/HDFS-15135 > Project: Hadoop HDFS > Issue Type: Bug > Components: erasure-coding >Reporter: Surendra Singh Lilhore >Assignee: Ravuri Sushma sree >Priority: Major > Attachments: HDFS-15135.001.patch, HDFS-15135.002.patch, > HDFS-15135.003.patch, HDFS-15135.004.patch, HDFS-15135.005.patch > > > {noformat} > java.lang.ArrayIndexOutOfBoundsException: 8 >at > org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464) >at > org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602) >at java.lang.Thread.run(Thread.java:745) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.
[ https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037723#comment-17037723 ] Surendra Singh Lilhore commented on HDFS-15135: --- Committed to trunk. [~Sushma_28], please attach the patch for branch-3.2. The test code needs to be rebased.
[jira] [Commented] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.
[ https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036390#comment-17036390 ] Surendra Singh Lilhore commented on HDFS-15135: --- New build triggered.
[jira] [Updated] (HDFS-15086) Block scheduled counter never gets decremented if the block got deleted before replication.
[ https://issues.apache.org/jira/browse/HDFS-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15086: -- Fix Version/s: 3.2.2 3.1.4 3.3.0 Resolution: Fixed Status: Resolved (was: Patch Available) Committed to branch-3.2 & branch-3.1 > Block scheduled counter never gets decremented if the block got deleted before > replication. > --- > > Key: HDFS-15086 > URL: https://issues.apache.org/jira/browse/HDFS-15086 > Project: Hadoop HDFS > Issue Type: Improvement > Components: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: hemanthboyina >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2 > > Attachments: HDFS-15086.001.patch, HDFS-15086.002.patch, > HDFS-15086.003.patch, HDFS-15086.004.patch, HDFS-15086.005.patch > > > If a block is scheduled for replication and the same file gets deleted, the block will be reported as a bad block by the DN. The scheduled block counter is never decremented for this failed replication work. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15086) Block scheduled counter never gets decremented if the block got deleted before replication.
[ https://issues.apache.org/jira/browse/HDFS-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036144#comment-17036144 ] Surendra Singh Lilhore commented on HDFS-15086: --- Committed to trunk.
[jira] [Commented] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.
[ https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036127#comment-17036127 ] Surendra Singh Lilhore commented on HDFS-15135: --- Please handle the checkstyle issues.
[jira] [Commented] (HDFS-15086) Block scheduled counter never gets decremented if the block got deleted before replication.
[ https://issues.apache.org/jira/browse/HDFS-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17035925#comment-17035925 ] Surendra Singh Lilhore commented on HDFS-15086: --- +1
[jira] [Commented] (HDFS-15086) Block scheduled counter never gets decremented if the block got deleted before replication.
[ https://issues.apache.org/jira/browse/HDFS-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17033617#comment-17033617 ] Surendra Singh Lilhore commented on HDFS-15086: --- Triggered a new build.
[jira] [Commented] (HDFS-15086) Block scheduled counter never gets decremented if the block got deleted before replication.
[ https://issues.apache.org/jira/browse/HDFS-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032351#comment-17032351 ] Surendra Singh Lilhore commented on HDFS-15086: --- One more thing: can you create a new Jira for this? It is not related to this Jira. {code:java} + List<DatanodeStorageInfo> targets = + pendingReconstruction.getTargets(rw.getBlock()); + if (targets != null) { +for (DatanodeStorageInfo dn : targets) { + if (!excludedNodes.contains(dn.getDatanodeDescriptor())) { +excludedNodes.add(dn.getDatanodeDescriptor()); + } +} + } {code}
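The intent of the quoted snippet — nodes that already hold a pending reconstruction target for a block should not be chosen as targets again — can be sketched independently of the HDFS types. The names below are stand-ins (plain Strings instead of DatanodeDescriptor), not the actual BlockManager API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: merge a block's pending reconstruction targets into the excluded
// set used by the placement policy, so the same node is not picked twice.
class ExcludeTargetsSketch {
  static Set<String> excludePendingTargets(Set<String> excludedNodes,
                                           List<String> pendingTargets) {
    if (pendingTargets != null) {
      // Because excludedNodes is a Set, the explicit contains() check in the
      // quoted patch is unnecessary: duplicate add() calls are no-ops.
      excludedNodes.addAll(pendingTargets);
    }
    return excludedNodes;
  }
}
```

This also shows why the `contains()` guard in the snippet could be dropped if `excludedNodes` is a `Set`.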
[jira] [Assigned] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.
[ https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore reassigned HDFS-15135: - Assignee: Surendra Singh Lilhore (was: Ravuri Sushma sree)
[jira] [Assigned] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.
[ https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore reassigned HDFS-15135: - Assignee: Ravuri Sushma sree (was: Surendra Singh Lilhore)
[jira] [Comment Edited] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.
[ https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032275#comment-17032275 ] Surendra Singh Lilhore edited comment on HDFS-15135 at 2/7/20 10:13 AM: Thanks [~Sushma_28] for the patch. The changes look good. Some comments on the test case: # Move your UT into the {{TestBlockRecovery}} class. # No need to add a LOG in the test case; add a comment instead. # Handle the whitespace and checkstyle issues. was (Author: surendrasingh): Thanks [~Sushma_28] for the patch. The changes look good. Some comments on the test case: # Move your UT into the {{TestBlockRecovery}} class. # No need to add a LOG in the test case; add a comment instead.
[jira] [Commented] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.
[ https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032275#comment-17032275 ] Surendra Singh Lilhore commented on HDFS-15135: --- Thanks [~Sushma_28] for the patch. The changes look good. Some comments on the test case: # Move your UT into the {{TestBlockRecovery}} class. # No need to add a LOG in the test case; add a comment instead.
[jira] [Commented] (HDFS-15086) Block scheduled counter never gets decremented if the block got deleted before replication.
[ https://issues.apache.org/jira/browse/HDFS-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032230#comment-17032230 ] Surendra Singh Lilhore commented on HDFS-15086: --- Thanks [~hemanthboyina] for the patch. The changes look good. Some comments: # Please add comments around the changes, e.g. in {{DatanodeManager}} and {{BlockManager.computeReconstructionWorkForBlocks()}}. # In the UT, get the filesystem object inside the try block; {{cluster.getFileSystem()}} throws IOException.
[jira] [Commented] (HDFS-15086) Block scheduled counter never gets decremented if the block got deleted before replication.
[ https://issues.apache.org/jira/browse/HDFS-15086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031825#comment-17031825 ] Surendra Singh Lilhore commented on HDFS-15086: --- Thanks [~hemanthboyina], I will review it tomorrow.
[jira] [Commented] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.
[ https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030392#comment-17030392 ] Surendra Singh Lilhore commented on HDFS-15135: --- [~Sushma_28], please try to add a UT for lease recovery.
[jira] [Commented] (HDFS-15133) Use rocksdb to store NameNode inode and blockInfo
[ https://issues.apache.org/jira/browse/HDFS-15133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020765#comment-17020765 ] Surendra Singh Lilhore commented on HDFS-15133: --- bq. The RDBStore and TypedTable can be responsible for the kv store manager, so we can starts all work by the moving the code of RDBStore related to hadoop-common, so that ozone and hdfs or yarn and other component can use this wonderful feature without any more effort. [~maobaolong], good idea (y). > Use rocksdb to store NameNode inode and blockInfo > - > > Key: HDFS-15133 > URL: https://issues.apache.org/jira/browse/HDFS-15133 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.3.0 >Reporter: maobaolong >Priority: Major > > Maybe we don't need to checkpoint to an fsimage file; a RocksDB checkpoint can serve the same purpose. > This is how Ozone and Alluxio manage the master node's metadata. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13532) RBF: Adding security
[ https://issues.apache.org/jira/browse/HDFS-13532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-13532: -- Fix Version/s: 3.3.0 > RBF: Adding security > > > Key: HDFS-13532 > URL: https://issues.apache.org/jira/browse/HDFS-13532 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Íñigo Goiri >Assignee: CR Hota >Priority: Major > Fix For: 3.3.0 > > Attachments: RBF _ Security delegation token thoughts.pdf, RBF _ > Security delegation token thoughts_updated.pdf, RBF _ Security delegation > token thoughts_updated_2.pdf, RBF-DelegationToken-Approach1b.pdf, RBF_ > Security delegation token thoughts_updated_3.pdf, Security_for_Router-based > Federation_design_doc.pdf > > > HDFS Router based federation should support security. This includes > authentication and delegation tokens. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.
[ https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020058#comment-17020058 ] Surendra Singh Lilhore commented on HDFS-15135: --- {code:java} // notify Namenode the new size and locations final DatanodeID[] newLocs = new DatanodeID[totalBlkNum]; final String[] newStorages = new String[totalBlkNum]; for (int i = 0; i < blockIndices.length; i++) { newLocs[blockIndices[i]] = DatanodeID.EMPTY_DATANODE_ID; newStorages[blockIndices[i]] = ""; } {code} Here "blockIndices[i]" is evaluated with the wrong index, which causes the out-of-bounds access.
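The indexing hazard in striped-block recovery can be illustrated with plain types. This is a hypothetical sketch, not the actual BlockRecoveryWorker code: each recovered replica i carries a logical block index blockIndices[i] in [0, totalBlkNum), so per-index arrays must be written at slot blockIndices[i], and the loop must be bounded by the number of replicas present, not by totalBlkNum. Swapping either of these is what produces an ArrayIndexOutOfBoundsException like the one in this issue.

```java
// Sketch: map per-replica locations back to per-block-index slots.
// replicaLocs[i] is the location of the replica whose logical block index
// is blockIndices[i]; some internal blocks may be missing, so
// blockIndices.length can be smaller than totalBlkNum.
class StripedRecoveryMappingSketch {
  static String[] mapNewLocs(int totalBlkNum, int[] blockIndices,
                             String[] replicaLocs) {
    String[] newLocs = new String[totalBlkNum];
    for (int i = 0; i < blockIndices.length; i++) { // bound: replicas present
      newLocs[blockIndices[i]] = replicaLocs[i];    // slot: logical block index
    }
    return newLocs; // slots with no reported replica stay null
  }
}
```

For example, with RS-6-3 (totalBlkNum = 9) and only three replicas reported at indices {0, 4, 8}, iterating up to totalBlkNum instead of blockIndices.length would read past the end of blockIndices.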
[jira] [Created] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.
Surendra Singh Lilhore created HDFS-15135: - Summary: EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped. Key: HDFS-15135 URL: https://issues.apache.org/jira/browse/HDFS-15135 Project: Hadoop HDFS Issue Type: Bug Reporter: Surendra Singh Lilhore {noformat} java.lang.ArrayIndexOutOfBoundsException: 8 at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464) at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602) at java.lang.Thread.run(Thread.java:745) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15135) EC : ArrayIndexOutOfBoundsException in BlockRecoveryWorker#RecoveryTaskStriped.
[ https://issues.apache.org/jira/browse/HDFS-15135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15135: -- Description: {noformat} java.lang.ArrayIndexOutOfBoundsException: 8 at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464) at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602) at java.lang.Thread.run(Thread.java:745) {noformat} was: {noformat} java.lang.ArrayIndexOutOfBoundsException: 8 at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464) at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602) at java.lang.Thread.run(Thread.java:745) {noformat} > EC : ArrayIndexOutOfBoundsException in > BlockRecoveryWorker#RecoveryTaskStriped. > --- > > Key: HDFS-15135 > URL: https://issues.apache.org/jira/browse/HDFS-15135 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Surendra Singh Lilhore >Priority: Major > > {noformat} > java.lang.ArrayIndexOutOfBoundsException: 8 >at > org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskStriped.recover(BlockRecoveryWorker.java:464) >at > org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:602) >at java.lang.Thread.run(Thread.java:745) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15092) TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed
[ https://issues.apache.org/jira/browse/HDFS-15092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019847#comment-17019847 ] Surendra Singh Lilhore commented on HDFS-15092: --- Changes LGTM, I triggered jenkins build. > TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed > - > > Key: HDFS-15092 > URL: https://issues.apache.org/jira/browse/HDFS-15092 > Project: Hadoop HDFS > Issue Type: Test > Components: test >Affects Versions: 3.3.0 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Minor > Attachments: HDFS-15092.001.patch, HDFS-15092.002.patch > > > TestRedudantBlocks#testProcessOverReplicatedAndRedudantBlock sometimes failed > {quote} > java.lang.AssertionError: > Expected :5 > Actual :4 > > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at org.junit.Assert.assertEquals(Assert.java:631) > at > org.apache.hadoop.hdfs.server.namenode.TestRedudantBlocks.testProcessOverReplicatedAndRedudantBlock(TestRedudantBlocks.java:138) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at org.junit.runner.JUnitCore.run(JUnitCore.java:137) > at > com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68) > at > com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:51) > at > com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242) > at > com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70) > {quote} > Maybe we should increase sleep time -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
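Rather than increasing a fixed sleep, the usual remedy for this kind of timing-dependent assertion is to poll for the condition with a timeout (Hadoop's test utilities provide GenericTestUtils.waitFor for this). The helper below is a standalone stand-in sketching the idea; all names are hypothetical:

```java
import java.util.function.BooleanSupplier;

public class WaitForSketch {
    // Poll `condition` every checkMillis until it holds or timeoutMillis
    // elapses; returns the final state of the condition. The test passes as
    // soon as the cluster reaches the expected state instead of always
    // paying a worst-case fixed sleep.
    static boolean waitFor(BooleanSupplier condition, long checkMillis, long timeoutMillis)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(checkMillis);
        }
        return condition.getAsBoolean();
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // Condition becomes true after ~200 ms; waitFor returns promptly.
        boolean ok = waitFor(() -> System.currentTimeMillis() - start >= 200, 10, 5000);
        System.out.println("condition met: " + ok);
    }
}
```

In the failing test, the assertion on the replica count would then be preceded by such a wait on the expected block state instead of a bare sleep.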
[jira] [Commented] (HDFS-15067) Optimize heartbeat for large cluster
[ https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013555#comment-17013555 ] Surendra Singh Lilhore commented on HDFS-15067: --- Thanks [~ayushtkn] . {quote} I am of the opinion, rather than having two logics, have one. default value can be like a fallback, you don't configure or you configure it wrong, I go back to say x, rather than having two logics {quote} Will check this; I will try to use some fixed number. {quote}This condition checks in layman terms that if the known active turned to standby, in this case Ideally we should reset the heartbeats for all the bps, so that the new active can be identified, otherwise the bps tracking the standby will be at max dn interval, so it will be delayed in identifying the new active. {quote} Agreed, this needs to be handled. I will address it in the next patch along with the remaining UTs and documentation. > Optimize heartbeat for large cluster > > > Key: HDFS-15067 > URL: https://issues.apache.org/jira/browse/HDFS-15067 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15067.01.patch, HDFS-15067.02.patch, > image-2020-01-09-18-00-49-556.png > > > In a large cluster Namenode spend some time in processing heartbeats. For > example, in 10K node cluster namenode process 10K RPC's for heartbeat in each > 3sec. This will impact the client response time. This heart beat can be > optimized. DN can start skipping one heart beat if no > work(Write/replication/Delete) is allocated from long time. DN can start > sending heart beat in 6 sec. Once the DN stating getting work from NN , it > can start sending heart beat normally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15067) Optimize heartbeat for large cluster
[ https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17013221#comment-17013221 ] Surendra Singh Lilhore commented on HDFS-15067: --- Thanks [~ayushtkn] for review. {quote}I guess the standby/observer namenode will not be sending any response to the datanode, so the heartbeat interval for the standby shall always be the max configured, Just a opinion, the standby and observer, will in anyway, reach to max skip interval, may be we can shoot them directly to the max value post first heart beat rather than going exponentially. {quote} Do you think it would give much benefit? The standby/observer is not doing anything anyway, and sending the extra heartbeats from an independent thread will not cost anything. {quote} I think in case of failover, we should reset the counter to start, {quote} handled. {quote}In case of Connection Exception, or any connection issues {quote} handled {quote}For the default value the number has 3 in the defaults, in case of invalid that shoots to {{StaleInterval - 1 HeartBeat}} both seems at quite extremes, the first being at the lower and the later being at the higher, I think we can keep something is percent to stale interval, may be 40% or 50% to stale interval. {quote} The admin should touch this configuration only if they understand the NN and DN communication pattern. Configuring the wrong value in a big cluster is not acceptable, and if they do configure it, they should correct it when the system starts behaving abnormally. I don't think configuring this as a percentage is a good idea; heartbeats are a major mechanism and should be counted in absolute numbers. For example, if a doctor gives you some pills and asks you to take 10% of them daily, you have to calculate how many pills to take, but the doctor has no way of knowing whether your calculation gives the correct count.
Based on the configured heartbeat interval, the admin can easily work out the maximum number of heartbeats to skip while still running the system normally, even in the worst case. The admin should skip as few heartbeats as possible, since skipping delays other operations. I feel 3 heartbeats is ideal for a 3-second heartbeat interval. {quote}nit : in case of change in value specified, there should be a warn log, stating specified value is more then stale interval, using default of.. {quote} handled. > Optimize heartbeat for large cluster > > > Key: HDFS-15067 > URL: https://issues.apache.org/jira/browse/HDFS-15067 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15067.01.patch, HDFS-15067.02.patch, > image-2020-01-09-18-00-49-556.png > > > In a large cluster Namenode spend some time in processing heartbeats. For > example, in 10K node cluster namenode process 10K RPC's for heartbeat in each > 3sec. This will impact the client response time. This heart beat can be > optimized. DN can start skipping one heart beat if no > work(Write/replication/Delete) is allocated from long time. DN can start > sending heart beat in 6 sec. Once the DN stating getting work from NN , it > can start sending heart beat normally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
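The scheme under discussion — stretching the heartbeat interval while the DN gets no work, capped at a maximum skip count, with a reset on work arrival or failover — can be sketched as below. This is only an illustration of the idea in the comments, not the patch's actual code; all names and the linear back-off are hypothetical choices.

```java
// Sketch of the proposed heartbeat back-off: an idle DN stretches its
// heartbeat interval up to a capped maximum, and snaps back to the base
// interval as soon as the NN hands it work (or a failover occurs).
public class HeartbeatBackoffSketch {
    static final long BASE_INTERVAL_MS = 3000; // dfs.heartbeat.interval (3 s)
    static final int MAX_SKIP = 3;             // max heartbeats to skip (as argued above)

    private int idleHeartbeats = 0;            // consecutive responses with no work

    // Called after each heartbeat response from the active NN.
    void onHeartbeatResponse(boolean gotWork) {
        if (gotWork) {
            idleHeartbeats = 0;                // work arrived: back to normal cadence
        } else if (idleHeartbeats < MAX_SKIP) {
            idleHeartbeats++;                  // still idle: stretch the interval
        }
    }

    // On failover, reset so the new active is identified quickly.
    void onFailover() {
        idleHeartbeats = 0;
    }

    long nextIntervalMs() {
        return BASE_INTERVAL_MS * (1 + idleHeartbeats); // 3 s, 6 s, 9 s, 12 s max
    }

    public static void main(String[] args) {
        HeartbeatBackoffSketch s = new HeartbeatBackoffSketch();
        s.onHeartbeatResponse(false);
        System.out.println("idle interval: " + s.nextIntervalMs() + " ms"); // prints "idle interval: 6000 ms"
    }
}
```

With MAX_SKIP = 3 and a 3-second base interval, the worst-case gap is 12 seconds, which stays well under a typical 30-second stale-datanode interval, matching the argument for counting in heartbeats rather than as a percentage of the stale interval.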
[jira] [Updated] (HDFS-15067) Optimize heartbeat for large cluster
[ https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15067: -- Attachment: HDFS-15067.02.patch > Optimize heartbeat for large cluster > > > Key: HDFS-15067 > URL: https://issues.apache.org/jira/browse/HDFS-15067 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15067.01.patch, HDFS-15067.02.patch, > image-2020-01-09-18-00-49-556.png > > > In a large cluster Namenode spend some time in processing heartbeats. For > example, in 10K node cluster namenode process 10K RPC's for heartbeat in each > 3sec. This will impact the client response time. This heart beat can be > optimized. DN can start skipping one heart beat if no > work(Write/replication/Delete) is allocated from long time. DN can start > sending heart beat in 6 sec. Once the DN stating getting work from NN , it > can start sending heart beat normally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15067) Optimize heartbeat for large cluster
[ https://issues.apache.org/jira/browse/HDFS-15067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Surendra Singh Lilhore updated HDFS-15067: -- Issue Type: New Feature (was: Improvement) > Optimize heartbeat for large cluster > > > Key: HDFS-15067 > URL: https://issues.apache.org/jira/browse/HDFS-15067 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: Surendra Singh Lilhore >Priority: Major > Attachments: HDFS-15067.01.patch, image-2020-01-09-18-00-49-556.png > > > In a large cluster Namenode spend some time in processing heartbeats. For > example, in 10K node cluster namenode process 10K RPC's for heartbeat in each > 3sec. This will impact the client response time. This heart beat can be > optimized. DN can start skipping one heart beat if no > work(Write/replication/Delete) is allocated from long time. DN can start > sending heart beat in 6 sec. Once the DN stating getting work from NN , it > can start sending heart beat normally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org