[jira] [Assigned] (HDFS-17222) fix Federation doc
[ https://issues.apache.org/jira/browse/HDFS-17222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yanbin.zhang reassigned HDFS-17222:
-----------------------------------

    Assignee:     (was: yanbin.zhang)

> fix Federation doc
> ------------------
>
>                 Key: HDFS-17222
>                 URL: https://issues.apache.org/jira/browse/HDFS-17222
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: yanbin.zhang
>            Priority: Trivial
>
> Fix some wrong symbols in Federation.md.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-17222) fix Federation doc
[ https://issues.apache.org/jira/browse/HDFS-17222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yanbin.zhang resolved HDFS-17222.
---------------------------------
    Resolution: Not A Problem

> fix Federation doc
> ------------------
>
>                 Key: HDFS-17222
>                 URL: https://issues.apache.org/jira/browse/HDFS-17222
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: yanbin.zhang
>            Priority: Trivial
>
> Fix some wrong symbols in Federation.md.
[jira] [Assigned] (HDFS-17222) fix Federation doc
[ https://issues.apache.org/jira/browse/HDFS-17222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yanbin.zhang reassigned HDFS-17222:
-----------------------------------

    Assignee: yanbin.zhang

> fix Federation doc
> ------------------
>
>                 Key: HDFS-17222
>                 URL: https://issues.apache.org/jira/browse/HDFS-17222
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Trivial
>
> Fix some wrong symbols in Federation.md.
[jira] [Created] (HDFS-17222) fix Federation doc
yanbin.zhang created HDFS-17222:
-----------------------------------

             Summary: fix Federation doc
                 Key: HDFS-17222
                 URL: https://issues.apache.org/jira/browse/HDFS-17222
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: yanbin.zhang

Fix some wrong symbols in Federation.md.
[jira] [Commented] (HDFS-16525) System.err should be used when error occurs in multiple methods in DFSAdmin class
[ https://issues.apache.org/jira/browse/HDFS-16525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522032#comment-17522032 ]

yanbin.zhang commented on HDFS-16525:
-------------------------------------

[~weichiu] [~ferhui] Could you please give some comments or suggestions?

> System.err should be used when error occurs in multiple methods in DFSAdmin class
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-16525
>                 URL: https://issues.apache.org/jira/browse/HDFS-16525
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: dfsadmin
>    Affects Versions: 3.3.2
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> System.err should be used when an error occurs in multiple methods of the
> DFSAdmin class, as follows:
> {code:java}
> //DFSAdmin#refreshCallQueue
> ...
> try{
>   proxy.getProxy().refreshCallQueue();
>   System.out.println("Refresh call queue successful for "
>       + proxy.getAddress());
> }catch (IOException ioe){
>   System.out.println("Refresh call queue failed for "
>       + proxy.getAddress());
>   exceptions.add(ioe);
> }
> ...
> {code}
> The test method closed first in TestDFSAdminWithHA also needs to be modified,
> otherwise an error will be reported, similar to the following:
> {code:java}
> [ERROR] Failures:
> [ERROR] TestDFSAdminWithHA.testRefreshCallQueueNN1DownNN2Up:726->assertOutputMatches:77
> Expected output to match 'Refresh call queue failed for.*
> Refresh call queue successful for.*
> ' but err_output was:
> Refresh call queue failed for localhost/127.0.0.1:12876
> refreshCallQueue: Call From h110/10.1.234.110 to localhost:12876 failed on
> connection exception: java.net.ConnectException: Connection refused; For more
> details see: http://wiki.apache.org/hadoop/ConnectionRefused and output was:
> Refresh call queue successful for localhost/127.0.0.1:12878
> {code}
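The change the issue asks for boils down to routing failure messages to stderr while successes stay on stdout. Below is a minimal, self-contained sketch of the corrected pattern; the class and method here are illustrative stand-ins, not the actual DFSAdmin code, which operates on HA proxy objects:

```java
import java.io.IOException;

public class RefreshCallQueueSketch {

    /**
     * Simulates one refresh attempt against a NameNode address.
     * Returns the message it printed so the routing can be checked.
     */
    static String refresh(String address, boolean reachable) {
        try {
            if (!reachable) {
                throw new IOException("Connection refused");
            }
            String msg = "Refresh call queue successful for " + address;
            System.out.println(msg);   // success stays on stdout
            return msg;
        } catch (IOException ioe) {
            String msg = "Refresh call queue failed for " + address;
            System.err.println(msg);   // the fix: failures go to stderr
            return msg;
        }
    }

    public static void main(String[] args) {
        refresh("localhost/127.0.0.1:12878", true);
        refresh("localhost/127.0.0.1:12876", false);
    }
}
```

With this split, a test like TestDFSAdminWithHA can match the failure line against the captured error stream (`err_output`) rather than stdout, which is exactly the adjustment the description mentions.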
[jira] [Commented] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522029#comment-17522029 ]

yanbin.zhang commented on HDFS-16450:
-------------------------------------

OK, I'll try to do it.

> Give priority to releasing DNs with less free space
> ---------------------------------------------------
>
>                 Key: HDFS-16450
>                 URL: https://issues.apache.org/jira/browse/HDFS-16450
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>    Affects Versions: 3.3.0
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Major
>         Attachments: HDFS-16450.001.patch
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When deleting redundant replicas, the one with the least free space should be
> prioritized.
> {code:java}
> //BlockPlacementPolicyDefault#chooseReplicaToDelete
> final DatanodeStorageInfo storage;
> if (oldestHeartbeatStorage != null) {
>   storage = oldestHeartbeatStorage;
> } else if (minSpaceStorage != null) {
>   storage = minSpaceStorage;
> } else {
>   return null;
> }
> excessTypes.remove(storage.getStorageType());
> return storage;
> {code}
> Change the above logic to the following:
> {code:java}
> //BlockPlacementPolicyDefault#chooseReplicaToDelete
> final DatanodeStorageInfo storage;
> if (minSpaceStorage != null) {
>   storage = minSpaceStorage;
> } else if (oldestHeartbeatStorage != null) {
>   storage = oldestHeartbeatStorage;
> } else {
>   return null;
> }
> {code}
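The proposed reordering can be illustrated with a standalone sketch. This is a simplification: strings stand in for DatanodeStorageInfo, and the excessTypes bookkeeping from the real method is omitted:

```java
public class ReplicaToDeleteSketch {

    /**
     * After the proposed change, the replica on the storage with the
     * least free space is chosen first; the one with the oldest
     * heartbeat is only the fallback.
     */
    static String choose(String minSpaceStorage, String oldestHeartbeatStorage) {
        if (minSpaceStorage != null) {
            return minSpaceStorage;
        } else if (oldestHeartbeatStorage != null) {
            return oldestHeartbeatStorage;
        }
        return null;   // no candidate to delete
    }

    public static void main(String[] args) {
        // When both candidates exist, the low-space one wins.
        System.out.println(choose("dn3-low-space", "dn1-stale-heartbeat"));
    }
}
```

The current upstream code checks `oldestHeartbeatStorage` first; the patch simply swaps the order of the two branches so that space pressure takes priority over heartbeat age.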
[jira] [Comment Edited] (HDFS-16457) Make fs.getspaceused.classname reconfigurable
[ https://issues.apache.org/jira/browse/HDFS-16457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513260#comment-17513260 ]

yanbin.zhang edited comment on HDFS-16457 at 4/8/22 1:50 AM:
-------------------------------------------------------------

[~tasanuma] thanks for the review and merge.

was (Author: it_singer):
Dear God, can you help me to review my code, it took a long time to complete, I don't want to waste my time! [~weichiu] [~hexiaoqiao] [~csun]

> Make fs.getspaceused.classname reconfigurable
> ---------------------------------------------
>
>                 Key: HDFS-16457
>                 URL: https://issues.apache.org/jira/browse/HDFS-16457
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.3.0
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>          Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> Now if we want to switch fs.getspaceused.classname we need to restart the
> NameNode. It would be convenient if we could switch it at runtime.
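The kind of mechanism such a runtime switch implies can be sketched with plain reflection: re-instantiate the space-used calculator from a (possibly changed) class name without restarting the process. This is an illustration only, with made-up names, not the actual Hadoop reconfiguration protocol or the real `GetSpaceUsed` interface:

```java
public class SpaceUsedReloadSketch {

    /** Stand-in for the interface a space-used calculator implements. */
    public interface SpaceUsed {
        long getUsed();
    }

    /** A trivial implementation used to demonstrate the swap. */
    public static class DummyDU implements SpaceUsed {
        public long getUsed() { return 42L; }
    }

    // The currently active calculator; volatile so readers see the swap.
    static volatile SpaceUsed current;

    /** Reload the calculator from the configured class name at runtime. */
    static void reconfigure(String className) {
        try {
            current = (SpaceUsed) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("Cannot load " + className, e);
        }
    }

    public static void main(String[] args) {
        reconfigure("SpaceUsedReloadSketch$DummyDU");
        System.out.println(current.getUsed());
    }
}
```

The real change wires this idea into the DataNode's reconfiguration framework so that updating `fs.getspaceused.classname` and invoking the reconfig admin command takes effect without a restart.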
[jira] [Updated] (HDFS-16525) System.err should be used when error occurs in multiple methods in DFSAdmin class
[ https://issues.apache.org/jira/browse/HDFS-16525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yanbin.zhang updated HDFS-16525:
--------------------------------
    Description:
System.err should be used when an error occurs in multiple methods of the DFSAdmin class, as follows:
{code:java}
//DFSAdmin#refreshCallQueue
...
try{
  proxy.getProxy().refreshCallQueue();
  System.out.println("Refresh call queue successful for "
      + proxy.getAddress());
}catch (IOException ioe){
  System.out.println("Refresh call queue failed for "
      + proxy.getAddress());
  exceptions.add(ioe);
}
...
{code}
The test method closed first in TestDFSAdminWithHA also needs to be modified, otherwise an error will be reported, similar to the following:
{code:java}
[ERROR] Failures:
[ERROR] TestDFSAdminWithHA.testRefreshCallQueueNN1DownNN2Up:726->assertOutputMatches:77
Expected output to match 'Refresh call queue failed for.*
Refresh call queue successful for.*
' but err_output was:
Refresh call queue failed for localhost/127.0.0.1:12876
refreshCallQueue: Call From h110/10.1.234.110 to localhost:12876 failed on
connection exception: java.net.ConnectException: Connection refused; For more
details see: http://wiki.apache.org/hadoop/ConnectionRefused and output was:
Refresh call queue successful for localhost/127.0.0.1:12878
{code}

    was:
System.err should be used when an error occurs in multiple methods of the DFSAdmin class, as follows:
{code:java}
//DFSAdmin#refreshCallQueue
...
try{
  proxy.getProxy().refreshCallQueue();
  System.out.println("Refresh call queue successful for "
      + proxy.getAddress());
}catch (IOException ioe){
  System.out.println("Refresh call queue failed for "
      + proxy.getAddress());
  exceptions.add(ioe);
}
...
{code}
The test method closed first in TestDFSAdminWithHA also needs to be modified, otherwise an error will be reported:

> System.err should be used when error occurs in multiple methods in DFSAdmin class
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-16525
>                 URL: https://issues.apache.org/jira/browse/HDFS-16525
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: dfsadmin
>    Affects Versions: 3.3.2
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Major
>
> System.err should be used when an error occurs in multiple methods of the
> DFSAdmin class, as follows:
> {code:java}
> //DFSAdmin#refreshCallQueue
> ...
> try{
>   proxy.getProxy().refreshCallQueue();
>   System.out.println("Refresh call queue successful for "
>       + proxy.getAddress());
> }catch (IOException ioe){
>   System.out.println("Refresh call queue failed for "
>       + proxy.getAddress());
>   exceptions.add(ioe);
> }
> ...
> {code}
> The test method closed first in TestDFSAdminWithHA also needs to be modified,
> otherwise an error will be reported, similar to the following:
> {code:java}
> [ERROR] Failures:
> [ERROR] TestDFSAdminWithHA.testRefreshCallQueueNN1DownNN2Up:726->assertOutputMatches:77
> Expected output to match 'Refresh call queue failed for.*
> Refresh call queue successful for.*
> ' but err_output was:
> Refresh call queue failed for localhost/127.0.0.1:12876
> refreshCallQueue: Call From h110/10.1.234.110 to localhost:12876 failed on
> connection exception: java.net.ConnectException: Connection refused; For more
> details see: http://wiki.apache.org/hadoop/ConnectionRefused and output was:
> Refresh call queue successful for localhost/127.0.0.1:12878
> {code}
[jira] [Updated] (HDFS-16525) System.err should be used when error occurs in multiple methods in DFSAdmin class
[ https://issues.apache.org/jira/browse/HDFS-16525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yanbin.zhang updated HDFS-16525:
--------------------------------
    Description:
System.err should be used when an error occurs in multiple methods of the DFSAdmin class, as follows:
{code:java}
//DFSAdmin#refreshCallQueue
...
try{
  proxy.getProxy().refreshCallQueue();
  System.out.println("Refresh call queue successful for "
      + proxy.getAddress());
}catch (IOException ioe){
  System.out.println("Refresh call queue failed for "
      + proxy.getAddress());
  exceptions.add(ioe);
}
...
{code}
The test method closed first in TestDFSAdminWithHA also needs to be modified, otherwise an error will be reported:

    was:
System.err should be used when an error occurs in multiple methods of the DFSAdmin class, as follows:
{code:java}
//DFSAdmin#refreshCallQueue
...
try{
  proxy.getProxy().refreshCallQueue();
  System.out.println("Refresh call queue successful for "
      + proxy.getAddress());
}catch (IOException ioe){
  System.out.println("Refresh call queue failed for "
      + proxy.getAddress());
  exceptions.add(ioe);
}
...
{code}

> System.err should be used when error occurs in multiple methods in DFSAdmin class
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-16525
>                 URL: https://issues.apache.org/jira/browse/HDFS-16525
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: dfsadmin
>    Affects Versions: 3.3.2
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Major
>
> System.err should be used when an error occurs in multiple methods of the
> DFSAdmin class, as follows:
> {code:java}
> //DFSAdmin#refreshCallQueue
> ...
> try{
>   proxy.getProxy().refreshCallQueue();
>   System.out.println("Refresh call queue successful for "
>       + proxy.getAddress());
> }catch (IOException ioe){
>   System.out.println("Refresh call queue failed for "
>       + proxy.getAddress());
>   exceptions.add(ioe);
> }
> ...
> {code}
> The test method closed first in TestDFSAdminWithHA also needs to be modified,
> otherwise an error will be reported:
[jira] [Updated] (HDFS-16525) System.err should be used when error occurs in multiple methods in DFSAdmin class
[ https://issues.apache.org/jira/browse/HDFS-16525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yanbin.zhang updated HDFS-16525:
--------------------------------
    Description:
System.err should be used when an error occurs in multiple methods of the DFSAdmin class, as follows:
{code:java}
//DFSAdmin#refreshCallQueue
...
try{
  proxy.getProxy().refreshCallQueue();
  System.out.println("Refresh call queue successful for "
      + proxy.getAddress());
}catch (IOException ioe){
  System.out.println("Refresh call queue failed for "
      + proxy.getAddress());
  exceptions.add(ioe);
}
...
{code}

    was:
System.err should be used when an error occurs in multiple methods of the DFSAdmin class, as follows:
{code:java}
//DFSAdmin#refreshCallQueue
...
try{
  proxy.getProxy().refreshCallQueue();
  System.out.println("Refresh call queue successful for "
      + proxy.getAddress());
}catch (IOException ioe){
  System.out.println("Refresh call queue failed for "
      + proxy.getAddress());
  exceptions.add(ioe);
}
...
{code}

> System.err should be used when error occurs in multiple methods in DFSAdmin class
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-16525
>                 URL: https://issues.apache.org/jira/browse/HDFS-16525
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: dfsadmin
>    Affects Versions: 3.3.2
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Major
>
> System.err should be used when an error occurs in multiple methods of the
> DFSAdmin class, as follows:
> {code:java}
> //DFSAdmin#refreshCallQueue
> ...
> try{
>   proxy.getProxy().refreshCallQueue();
>   System.out.println("Refresh call queue successful for "
>       + proxy.getAddress());
> }catch (IOException ioe){
>   System.out.println("Refresh call queue failed for "
>       + proxy.getAddress());
>   exceptions.add(ioe);
> }
> ...
> {code}
[jira] [Updated] (HDFS-16525) System.err should be used when error occurs in multiple methods in DFSAdmin class
[ https://issues.apache.org/jira/browse/HDFS-16525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yanbin.zhang updated HDFS-16525:
--------------------------------
    Description:
System.err should be used when an error occurs in multiple methods of the DFSAdmin class, as follows:
{code:java}
//DFSAdmin#refreshCallQueue
...
try{
  proxy.getProxy().refreshCallQueue();
  System.out.println("Refresh call queue successful for "
      + proxy.getAddress());
}catch (IOException ioe){
  System.out.println("Refresh call queue failed for "
      + proxy.getAddress());
  exceptions.add(ioe);
}
...
{code}

> System.err should be used when error occurs in multiple methods in DFSAdmin class
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-16525
>                 URL: https://issues.apache.org/jira/browse/HDFS-16525
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: dfsadmin
>    Affects Versions: 3.3.2
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Major
>
> System.err should be used when an error occurs in multiple methods of the
> DFSAdmin class, as follows:
> {code:java}
> //DFSAdmin#refreshCallQueue
> ...
> try{
>   proxy.getProxy().refreshCallQueue();
>   System.out.println("Refresh call queue successful for "
>       + proxy.getAddress());
> }catch (IOException ioe){
>   System.out.println("Refresh call queue failed for "
>       + proxy.getAddress());
>   exceptions.add(ioe);
> }
> ...
> {code}
[jira] [Created] (HDFS-16525) System.err should be used when error occurs in multiple methods in DFSAdmin class
yanbin.zhang created HDFS-16525:
-----------------------------------

             Summary: System.err should be used when error occurs in multiple methods in DFSAdmin class
                 Key: HDFS-16525
                 URL: https://issues.apache.org/jira/browse/HDFS-16525
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: dfsadmin
    Affects Versions: 3.3.2
            Reporter: yanbin.zhang
            Assignee: yanbin.zhang
[jira] [Commented] (HDFS-16457) Make fs.getspaceused.classname reconfigurable
[ https://issues.apache.org/jira/browse/HDFS-16457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513260#comment-17513260 ]

yanbin.zhang commented on HDFS-16457:
-------------------------------------

Dear God, can you help me to review my code, it took a long time to complete, I don't want to waste my time! [~weichiu] [~hexiaoqiao] [~csun]

> Make fs.getspaceused.classname reconfigurable
> ---------------------------------------------
>
>                 Key: HDFS-16457
>                 URL: https://issues.apache.org/jira/browse/HDFS-16457
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.3.0
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Now if we want to switch fs.getspaceused.classname we need to restart the
> NameNode. It would be convenient if we could switch it at runtime.
[jira] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely
[ https://issues.apache.org/jira/browse/HDFS-16064 ]

yanbin.zhang deleted comment on HDFS-16064:
-------------------------------------------

was (Author: it_singer):
I think your root cause may not be here; we never seem to have this problem during our decommissioning process.

> HDFS-721 causes DataNode decommissioning to get stuck indefinitely
> ------------------------------------------------------------------
>
>                 Key: HDFS-16064
>                 URL: https://issues.apache.org/jira/browse/HDFS-16064
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 3.2.1
>            Reporter: Kevin Wikant
>            Priority: Major
>
> Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a
> non-issue under the assumption that if the namenode & a datanode get into an
> inconsistent state for a given block pipeline, there should be another
> datanode available to replicate the block to.
> While testing datanode decommissioning using "dfs.exclude.hosts", I have
> encountered a scenario where the decommissioning gets stuck indefinitely.
> Below is the progression of events:
> * there are initially 4 datanodes: DN1, DN2, DN3, DN4
> * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
> * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in
> order to satisfy their minimum replication factor of 2
> * during this replication process
> https://issues.apache.org/jira/browse/HDFS-721 is encountered, which causes
> the following inconsistent state:
> ** DN3 thinks it has the block pipeline in FINALIZED state
> ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
> (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]):
> DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654
> dst: /DN3:9866;
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block
> BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
> * the replication is attempted again, but:
> ** DN4 has the block
> ** DN1 and/or DN2 have the block, but don't count towards the minimum
> replication factor because they are being decommissioned
> ** DN3 does not have the block & cannot have the block replicated to it
> because of HDFS-721
> * the namenode repeatedly tries to replicate the block to DN3 & repeatedly
> fails; this continues indefinitely
> * therefore DN4 is the only live datanode with the block & the minimum
> replication factor of 2 cannot be satisfied
> * because the minimum replication factor cannot be satisfied for the
> block(s) being moved off DN1 & DN2, the datanode decommissioning can never
> be completed
> {code:java}
> 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0):
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0,
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 ,
> Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is
> current datanode entering maintenance: false
> ...
> 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0):
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0,
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 ,
> Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is
> current datanode entering maintenance: false
> {code}
> Being stuck in decommissioning state forever is not an intended behavior of
> DataNode decommissioning.
> A few potential solutions:
> * Address the root cause of the problem, which is an inconsistent state
> between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721
> * Detect when datanode decommissioning is stuck due to lack of available
> datanodes for satisfying the minimum replication factor, then recover by
> re-enabling the datanodes being decommissioned
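The arithmetic behind the stuck state in the log above can be sketched as follows: replicas on decommissioning datanodes do not count as live, so with only DN4 holding a usable copy and DN3 unable to accept the block, the expected replication of 2 can never be reached. The counting rule shown here is a simplification for illustration, not the NameNode's actual bookkeeping:

```java
import java.util.Map;

public class DecommissionStuckSketch {

    /**
     * Counts replicas on datanodes that are NOT being decommissioned.
     * Key: datanode name holding the block; value: decommissioning flag.
     */
    static long liveReplicas(Map<String, Boolean> holders) {
        return holders.values().stream()
                .filter(decommissioning -> !decommissioning)
                .count();
    }

    public static void main(String[] args) {
        // DN1 and DN2 hold the block but are decommissioning; DN4 is live.
        Map<String, Boolean> holders =
                Map.of("DN1", true, "DN2", true, "DN4", false);
        long live = liveReplicas(holders);
        int expected = 2;
        // Matches the log: live replicas: 1, decommissioning replicas: 2.
        System.out.println("live=" + live + " stuck=" + (live < expected));
    }
}
```

Since DN1 and DN2 cannot finish decommissioning until every block they hold reaches its expected replication elsewhere, and that replication cannot complete, the monitor loops forever, exactly as the repeated BlockStateChange lines show.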
[jira] [Commented] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511159#comment-17511159 ]

yanbin.zhang commented on HDFS-16064:
-------------------------------------

I think your root cause may not be here; we never seem to have this problem during our decommissioning process.

> HDFS-721 causes DataNode decommissioning to get stuck indefinitely
> ------------------------------------------------------------------
>
>                 Key: HDFS-16064
>                 URL: https://issues.apache.org/jira/browse/HDFS-16064
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 3.2.1
>            Reporter: Kevin Wikant
>            Priority: Major
>
> Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a
> non-issue under the assumption that if the namenode & a datanode get into an
> inconsistent state for a given block pipeline, there should be another
> datanode available to replicate the block to.
> While testing datanode decommissioning using "dfs.exclude.hosts", I have
> encountered a scenario where the decommissioning gets stuck indefinitely.
> Below is the progression of events:
> * there are initially 4 datanodes: DN1, DN2, DN3, DN4
> * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
> * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in
> order to satisfy their minimum replication factor of 2
> * during this replication process
> https://issues.apache.org/jira/browse/HDFS-721 is encountered, which causes
> the following inconsistent state:
> ** DN3 thinks it has the block pipeline in FINALIZED state
> ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
> (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]):
> DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654
> dst: /DN3:9866;
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block
> BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
> * the replication is attempted again, but:
> ** DN4 has the block
> ** DN1 and/or DN2 have the block, but don't count towards the minimum
> replication factor because they are being decommissioned
> ** DN3 does not have the block & cannot have the block replicated to it
> because of HDFS-721
> * the namenode repeatedly tries to replicate the block to DN3 & repeatedly
> fails; this continues indefinitely
> * therefore DN4 is the only live datanode with the block & the minimum
> replication factor of 2 cannot be satisfied
> * because the minimum replication factor cannot be satisfied for the
> block(s) being moved off DN1 & DN2, the datanode decommissioning can never
> be completed
> {code:java}
> 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0):
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0,
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 ,
> Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is
> current datanode entering maintenance: false
> ...
> 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0):
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0,
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 ,
> Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is
> current datanode entering maintenance: false
> {code}
> Being stuck in decommissioning state forever is not an intended behavior of
> DataNode decommissioning.
> A few potential solutions:
> * Address the root cause of the problem, which is an inconsistent state
> between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721
> * Detect when datanode decommissioning is stuck due to lack of available
> datanodes for satisfying the minimum replication factor, then recover by
> re-enabling the datanodes being decommissioned
[jira] [Commented] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511158#comment-17511158 ]

yanbin.zhang commented on HDFS-16064:
-------------------------------------

I think your root cause may not be here; we never seem to have this problem during our decommissioning process.

> HDFS-721 causes DataNode decommissioning to get stuck indefinitely
> ------------------------------------------------------------------
>
>                 Key: HDFS-16064
>                 URL: https://issues.apache.org/jira/browse/HDFS-16064
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 3.2.1
>            Reporter: Kevin Wikant
>            Priority: Major
>
> Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a
> non-issue under the assumption that if the namenode & a datanode get into an
> inconsistent state for a given block pipeline, there should be another
> datanode available to replicate the block to.
> While testing datanode decommissioning using "dfs.exclude.hosts", I have
> encountered a scenario where the decommissioning gets stuck indefinitely.
> Below is the progression of events:
> * there are initially 4 datanodes: DN1, DN2, DN3, DN4
> * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
> * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in
> order to satisfy their minimum replication factor of 2
> * during this replication process
> https://issues.apache.org/jira/browse/HDFS-721 is encountered, which causes
> the following inconsistent state:
> ** DN3 thinks it has the block pipeline in FINALIZED state
> ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
> (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]):
> DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654
> dst: /DN3:9866;
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block
> BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
> * the replication is attempted again, but:
> ** DN4 has the block
> ** DN1 and/or DN2 have the block, but don't count towards the minimum
> replication factor because they are being decommissioned
> ** DN3 does not have the block & cannot have the block replicated to it
> because of HDFS-721
> * the namenode repeatedly tries to replicate the block to DN3 & repeatedly
> fails; this continues indefinitely
> * therefore DN4 is the only live datanode with the block & the minimum
> replication factor of 2 cannot be satisfied
> * because the minimum replication factor cannot be satisfied for the
> block(s) being moved off DN1 & DN2, the datanode decommissioning can never
> be completed
> {code:java}
> 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0):
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0,
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 ,
> Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is
> current datanode entering maintenance: false
> ...
> 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0):
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0,
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 ,
> Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is
> current datanode entering maintenance: false
> {code}
> Being stuck in decommissioning state forever is not an intended behavior of
> DataNode decommissioning.
> A few potential solutions:
> * Address the root cause of the problem, which is an inconsistent state
> between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721
> * Detect when datanode decommissioning is stuck due to lack of available
> datanodes for satisfying the minimum replication factor, then recover by
> re-enabling the datanodes being decommissioned
[jira] [Commented] (HDFS-15646) Track failing tests in HDFS
[ https://issues.apache.org/jira/browse/HDFS-15646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510995#comment-17510995 ] yanbin.zhang commented on HDFS-15646: - [~ayushtkn] Yes, I submitted a unit test yesterday and it passed, but there are a lot of other errors not related to my test. > Track failing tests in HDFS > --- > > Key: HDFS-15646 > URL: https://issues.apache.org/jira/browse/HDFS-15646 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Ahmed Hussein >Priority: Blocker > > There are several unit tests that have been consistently failing on Yetus for a long > period of time. > The list keeps growing and it is driving the repository into unstable > status. Qbt reports more than *40 failing unit tests* on average. > Personally, over the last week, with every submitted patch, I have had to spend > considerable time looking at the same stack traces to double-check whether or > not the patch contributes to those failures. > I found out that the majority of those tests had been failing for quite some time, > but +no Jiras were filed+. > The main problem with those consistent failures is that they have side effects > on the runtime of the other JUnit tests by sucking up resources such as memory and > ports. > {{StripedFile}} and {{EC}} tests in particular show up 100% of the time in the list > of bad tests. > I looked at those tests and they certainly need some improvements (e.g., > HDFS-15459). Is anyone interested in those test cases? Can we just turn them > off? > I'd like to give a heads-up that we need more collaboration to enforce > the stability of the codebase. > * For all developers, please, {color:#ff}file a Jira once you see a > failing test, whether it is unrelated to your patch or not{color}. This gives a > heads-up to other developers about the potential failures. Please do not stop > at commenting on your patch "_+this is unrelated to my work+_". > * Volunteer to dedicate more time to fixing flaky tests.
> * Periodically, make sure that the list of failing tests does not exceed a > certain number of tests. We have Qbt reports to monitor that, but there is no > follow-up on their status. > * We should consider aggressive strategies such as blocking any merges until > the code is brought back to stability. > * We need a clear and well-defined process to address Yetus issues: > configuration, investigating running out of memory, slowness, etc. > * Turn off the JUnit tests within the modules that are not being actively used in > the community (e.g., EC, stripedFiles, etc.). > > CC: [~aajisaka], [~elgoiri], [~kihwal], [~daryn], [~weichiu] > Do you guys have any thoughts on the current status of HDFS? > > +The following is a quick list of failing JUnit tests from Qbt reports:+ > > !https://ci-hadoop.apache.org/static/0ead8630/images/16x16/document_add.png! > [org.apache.hadoop.crypto.key.kms.server.TestKMS.testKMSProviderCaching|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/299/testReport/org.apache.hadoop.crypto.key.kms.server/TestKMS/testKMSProviderCaching/]1.5 > sec[1|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/299/] > !https://ci-hadoop.apache.org/static/0ead8630/images/16x16/document_add.png! > [org.apache.hadoop.fs.azure.TestBlobMetadata.testFolderMetadata|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/299/testReport/org.apache.hadoop.fs.azure/TestBlobMetadata/testFolderMetadata/]42 > ms[3|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/297/] > !https://ci-hadoop.apache.org/static/0ead8630/images/16x16/document_add.png!
> [org.apache.hadoop.fs.azure.TestBlobMetadata.testFirstContainerVersionMetadata|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/299/testReport/org.apache.hadoop.fs.azure/TestBlobMetadata/testFirstContainerVersionMetadata/]46 > > ms[3|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/297/] > !https://ci-hadoop.apache.org/static/0ead8630/images/16x16/document_add.png! > [org.apache.hadoop.fs.azure.TestBlobMetadata.testPermissionMetadata|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/299/testReport/org.apache.hadoop.fs.azure/TestBlobMetadata/testPermissionMetadata/]27 > > ms[3|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/297/] > !https://ci-hadoop.apache.org/static/0ead8630/images/16x16/document_add.png! > [org.apache.hadoop.fs.azure.TestBlobMetadata.testOldPermissionMetadata|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/299/testReport/org.apache.hadoop.fs.azure/TestBlobMetadata/testOldPermissionMetadata/]19 > > ms[3|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/297/] >
[jira] [Work stopped] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16450 stopped by yanbin.zhang. --- > Give priority to releasing DNs with less free space > --- > > Key: HDFS-16450 > URL: https://issues.apache.org/jira/browse/HDFS-16450 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.0 >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Major > Attachments: HDFS-16450.001.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > When deleting redundant replicas, the one with the least free space should be > prioritized. > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else { > return null; > } > excessTypes.remove(storage.getStorageType()); > return storage; {code} > Change the above logic to the following: > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else { > return null; > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
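The proposed reordering can be illustrated with a simplified, self-contained model (all names here are illustrative; the real method operates on DatanodeStorageInfo and also handles excessTypes):

```java
import java.util.Comparator;
import java.util.List;

// Simplified model of the proposed selection order: prefer deleting the
// excess replica on the node with the least remaining space, falling back
// to the node with the oldest heartbeat. Class and field names are made up.
public final class ExcessReplicaChooser {
    static final class Storage {
        final String name;
        final long remaining;      // bytes free on this storage
        final long lastHeartbeat;  // ms since epoch of last heartbeat
        Storage(String name, long remaining, long lastHeartbeat) {
            this.name = name;
            this.remaining = remaining;
            this.lastHeartbeat = lastHeartbeat;
        }
    }

    static Storage choose(List<Storage> candidates) {
        if (candidates.isEmpty()) {
            return null;
        }
        Storage minSpace = candidates.stream()
            .min(Comparator.comparingLong(s -> s.remaining)).orElse(null);
        Storage oldestHeartbeat = candidates.stream()
            .min(Comparator.comparingLong(s -> s.lastHeartbeat)).orElse(null);
        // Proposed priority: least free space first, oldest heartbeat second.
        // (In the real method either candidate can be null due to other
        // placement constraints, which is why the fallback exists.)
        return minSpace != null ? minSpace : oldestHeartbeat;
    }

    public static void main(String[] args) {
        List<Storage> replicas = List.of(
            new Storage("dn1", 100L, 1_000L),
            new Storage("dn2", 50L, 2_000L),
            new Storage("dn4", 200L, 500L));
        System.out.println(choose(replicas).name); // prints dn2
    }
}
```

Under the current ordering the oldest-heartbeat node (dn4 here) would be chosen first; the patch makes the nearly-full dn2 the preferred victim.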
[jira] [Created] (HDFS-16457) Make fs.getspaceused.classname reconfigurable
yanbin.zhang created HDFS-16457: --- Summary: Make fs.getspaceused.classname reconfigurable Key: HDFS-16457 URL: https://issues.apache.org/jira/browse/HDFS-16457 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 3.3.0 Reporter: yanbin.zhang Assignee: yanbin.zhang Now, if we want to switch fs.getspaceused.classname, we need to restart the NameNode. It would be convenient if we could switch it at runtime. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
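Hadoop exposes runtime reconfiguration through org.apache.hadoop.conf.ReconfigurableBase, which the DataNode already uses for a few keys. A minimal standalone sketch of that pattern (simplified classes, not the actual Hadoop API; the property value used below is illustrative):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the reconfigurable-property pattern: a whitelist of
// keys that may change at runtime, and a hook invoked on each change.
// Illustrative only; Hadoop's real mechanism is ReconfigurableBase.
abstract class MiniReconfigurable {
    private final Map<String, String> conf = new ConcurrentHashMap<>();

    /** Keys this component allows to change without a restart. */
    protected abstract Set<String> reconfigurableProperties();

    /** Applies one changed property, e.g. by rebuilding a component. */
    protected abstract void reconfigureProperty(String key, String newValue);

    final void reconfigure(String key, String newValue) {
        if (!reconfigurableProperties().contains(key)) {
            throw new IllegalArgumentException(key + " is not reconfigurable");
        }
        conf.put(key, newValue);
        reconfigureProperty(key, newValue);
    }

    final String get(String key, String dflt) {
        return conf.getOrDefault(key, dflt);
    }
}

public final class ReconfigureDemo {
    public static void main(String[] args) {
        MiniReconfigurable dn = new MiniReconfigurable() {
            @Override protected Set<String> reconfigurableProperties() {
                return Set.of("fs.getspaceused.classname");
            }
            @Override protected void reconfigureProperty(String key, String v) {
                // The real daemon would rebuild its GetSpaceUsed instance here.
            }
        };
        // Assumed example value; the actual class name depends on the cluster.
        dn.reconfigure("fs.getspaceused.classname", "org.apache.hadoop.fs.DU");
        System.out.println(dn.get("fs.getspaceused.classname", "?"));
    }
}
```

Making fs.getspaceused.classname one of the whitelisted keys is essentially what the issue proposes: add it to the reconfigurable set and rebuild the space-used checker in the change hook.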
[jira] [Work started] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16450 started by yanbin.zhang.
[jira] [Reopened] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang reopened HDFS-16450:
[jira] [Updated] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-16450: Labels: (was: 无)
[jira] [Updated] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-16450: Fix Version/s: 3.3.2
[jira] [Updated] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-16450: Fix Version/s: (was: 3.3.2)
[jira] [Updated] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission
[ https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-14920: Labels: (was: 无) > Erasure Coding: Decommission may hang If one or more datanodes are out of > service during decommission > --- > > Key: HDFS-14920 > URL: https://issues.apache.org/jira/browse/HDFS-14920 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.0.3, 3.2.1, 3.1.3 >Reporter: Hui Fei >Assignee: Hui Fei >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2 > > Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch, > HDFS-14920.003.patch, HDFS-14920.004.patch, HDFS-14920.005.patch > > > Decommission test hangs in our clusters. > We have seen messages like the following: > {quote} > 2019-10-22 15:58:51,514 TRACE > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block > blk_-9223372035600425840_372987973 numExpected=9, numLive=5 > 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: > blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, > corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, > maintenance replicas: 0, live entering maintenance replicas: 0, excess > replicas: 0, Is Open File: false, Datanodes having this block: > 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 > 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 > 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current > datanode decommissioning: true, Is current datanode entering maintenance: > false > 2019-10-22 15:58:51,514 DEBUG > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node > 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate > to finish Decommission In Progress > {quote} > After digging into the source code and cluster logs, we guess it happens in the following steps: > # The storage strategy is RS-6-3-1024k. 
> # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8; b0 is on > datanode dn0, b1 is on datanode dn1, etc. > # At the beginning dn0 is in decommission progress, b0 is replicated > successfully, and dn0 is still in decommission progress. > # Later b1, b2, b3 are in decommission progress, and dn4 containing b4 is out of > service, so the block needs to be reconstructed, and an ErasureCodingWork is created to do it; in > the ErasureCodingWork, additionalReplRequired is 4. > # Because hasAllInternalBlocks is false, the NameNode will call > ErasureCodingWork#addTaskToDatanode -> > DatanodeDescriptor#addBlockToBeErasureCoded, and send a > BlockECReconstructionInfo task to the DataNode. > # The DataNode cannot reconstruct the block because targets is 4, greater > than 3 (the parity number). > There is a problem, as follows, from BlockManager.java#scheduleReconstruction: > {code} > // should reconstruct all the internal blocks before scheduling > // replication task for decommissioning node(s). > if (additionalReplRequired - numReplicas.decommissioning() - > numReplicas.liveEnteringMaintenanceReplicas() > 0) { > additionalReplRequired = additionalReplRequired - > numReplicas.decommissioning() - > numReplicas.liveEnteringMaintenanceReplicas(); > } > {code} > Reconstruction should happen first, and then replication for decommissioning. Here > numReplicas.decommissioning() is 4 and additionalReplRequired is 4, which is wrong: > numReplicas.decommissioning() should be 3, because it should exclude the replica that is also live. > Then additionalReplRequired would be 1 and reconstruction would be scheduled as > expected. After that, decommissioning goes on. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
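The correction described above reduces to a small arithmetic change, sketched here with a hypothetical helper (the real logic lives in BlockManager#scheduleReconstruction): decommissioning replicas that already have a live duplicate should not offset the additional replicas required. Passing 0 for the live-copy count reproduces the buggy behavior.

```java
// Hypothetical sketch of the corrected computation from the description:
// replicas on decommissioning nodes that already have a live duplicate
// should not be counted when reducing additionalReplRequired. Mirrors the
// "only subtract when the result stays positive" guard in the snippet above.
public final class ReconstructionMath {
    public static int adjustedReplRequired(int additionalReplRequired,
                                           int decommissioning,
                                           int decommissioningWithLiveCopy,
                                           int liveEnteringMaintenance) {
        int offset = (decommissioning - decommissioningWithLiveCopy)
            + liveEnteringMaintenance;
        return additionalReplRequired - offset > 0
            ? additionalReplRequired - offset
            : additionalReplRequired;
    }

    public static void main(String[] args) {
        // HDFS-14920 example: additionalReplRequired = 4, 4 decommissioning
        // replicas, one of which (b0) already has a live copy -> only 1
        // internal block actually needs reconstruction.
        System.out.println(adjustedReplRequired(4, 4, 1, 0)); // prints 1
        // Buggy counting (no live-copy exclusion) leaves it at 4.
        System.out.println(adjustedReplRequired(4, 4, 0, 0)); // prints 4
    }
}
```

With the corrected count of 1, the reconstruction task fits within the 3 parity blocks and decommissioning can proceed.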
[jira] [Updated] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission
[ https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-14920: Labels: 无 (was: pull)
[jira] [Updated] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission
[ https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-14920: Labels: pull (was: )
[jira] [Updated] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-16450: Labels: 无 (was: patch)
[jira] [Updated] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-16450: Labels: patch (was: pull-request-available)
[jira] [Resolved] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang resolved HDFS-16450. - Resolution: Done
[jira] [Comment Edited] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489575#comment-17489575 ] yanbin.zhang edited comment on HDFS-16450 at 2/14/22, 1:52 AM: --- Could you please give some comments or suggestions. [~weichiu] [~hexiaoqiao] [~ferhui] was (Author: it_singer): Could you please give some comments or suggestions?[~weichiu]
[jira] [Updated] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-16450: Description: When deleting redundant replicas, the one with the least free space should be prioritized. {code:java} //BlockPlacementPolicyDefault#chooseReplicaToDelete final DatanodeStorageInfo storage; if (oldestHeartbeatStorage != null) { storage = oldestHeartbeatStorage; } else if (minSpaceStorage != null) { storage = minSpaceStorage; } else { return null; } excessTypes.remove(storage.getStorageType()); return storage; {code} Change the above logic to the following: {code:java} //BlockPlacementPolicyDefault#chooseReplicaToDelete final DatanodeStorageInfo storage; if (minSpaceStorage != null) { storage = minSpaceStorage; } else if (oldestHeartbeatStorage != null) { storage = oldestHeartbeatStorage; } else { return null; } {code} was: When deleting redundant replicas, the one with the least free space should be prioritized. {code:java} //BlockPlacementPolicyDefault#chooseReplicaToDelete final DatanodeStorageInfo storage; if (oldestHeartbeatStorage != null) { storage = oldestHeartbeatStorage; } else if (minSpaceStorage != null) { storage = minSpaceStorage; } else { return null; } excessTypes.remove(storage.getStorageType()); return storage; {code} Change the above logic to the following: {code:java} //BlockPlacementPolicyDefault#chooseReplicaToDelete final DatanodeStorageInfo storage; if (minSpaceStorage != null) { storage = minSpaceStorage; } else if (oldestHeartbeatStorage != null) { storage = oldestHeartbeatStorage; } else { return null; } {code} > Give priority to releasing DNs with less free space > --- > > Key: HDFS-16450 > URL: https://issues.apache.org/jira/browse/HDFS-16450 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.0 >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Major > Labels: pull-request-available > Attachments: HDFS-16450.001.patch > > Time 
Spent: 1h 10m > Remaining Estimate: 0h > > When deleting redundant replicas, the one with the least free space should be > prioritized. > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else { > return null; > } > excessTypes.remove(storage.getStorageType()); > return storage; {code} > Change the above logic to the following: > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else { > return null; > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work started] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16450 started by yanbin.zhang. --- > Give priority to releasing DNs with less free space > --- > > Key: HDFS-16450 > URL: https://issues.apache.org/jira/browse/HDFS-16450 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.0 >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Major > Labels: pull-request-available > Attachments: HDFS-16450.001.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > When deleting redundant replicas, the one with the least free space should be > prioritized. > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else { > return null; > } > excessTypes.remove(storage.getStorageType()); > return storage; {code} > Change the above logic to the following: > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else { > return null; > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489575#comment-17489575 ] yanbin.zhang commented on HDFS-16450: - Could you please give some comments or suggestions?[~weichiu] > Give priority to releasing DNs with less free space > --- > > Key: HDFS-16450 > URL: https://issues.apache.org/jira/browse/HDFS-16450 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.0 >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Major > Labels: pull-request-available > Attachments: HDFS-16450.001.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > When deleting redundant replicas, the one with the least free space should be > prioritized. > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else { > return null; > } > excessTypes.remove(storage.getStorageType()); > return storage; {code} > Change the above logic to the following: > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else { > return null; > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-16450: Attachment: HDFS-16450.001.patch > Give priority to releasing DNs with less free space > --- > > Key: HDFS-16450 > URL: https://issues.apache.org/jira/browse/HDFS-16450 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.0 >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Major > Labels: pull-request-available > Attachments: HDFS-16450.001.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > When deleting redundant replicas, the one with the least free space should be > prioritized. > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else { > return null; > } > excessTypes.remove(storage.getStorageType()); > return storage; {code} > Change the above logic to the following: > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else { > return null; > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16450) Give priority to releasing DNs with less free space
yanbin.zhang created HDFS-16450: --- Summary: Give priority to releasing DNs with less free space Key: HDFS-16450 URL: https://issues.apache.org/jira/browse/HDFS-16450 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Affects Versions: 3.3.0 Reporter: yanbin.zhang Assignee: yanbin.zhang When deleting redundant replicas, the one with the least free space should be prioritized. {code:java} //BlockPlacementPolicyDefault#chooseReplicaToDelete final DatanodeStorageInfo storage; if (oldestHeartbeatStorage != null) { storage = oldestHeartbeatStorage; } else if (minSpaceStorage != null) { storage = minSpaceStorage; } else { return null; } excessTypes.remove(storage.getStorageType()); return storage; {code} Change the above logic to the following: {code:java} //BlockPlacementPolicyDefault#chooseReplicaToDelete final DatanodeStorageInfo storage; if (minSpaceStorage != null) { storage = minSpaceStorage; } else if (oldestHeartbeatStorage != null) { storage = oldestHeartbeatStorage; } else { return null; } {code}
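The reordering proposed in HDFS-16450 can be sketched as a standalone selection helper. This is a hedged illustration only: the `Candidate` class and its field names are invented for the example and are not HDFS's `DatanodeStorageInfo`; it simply shows "least free space first, oldest heartbeat as tie-breaker".

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class ReplicaSelection {
    // Illustrative stand-in for a replica's storage; not an HDFS class.
    static final class Candidate {
        final String id;
        final long remainingBytes;   // free space left on the storage
        final long lastHeartbeatMs;  // smaller value = older heartbeat
        Candidate(String id, long remainingBytes, long lastHeartbeatMs) {
            this.id = id;
            this.remainingBytes = remainingBytes;
            this.lastHeartbeatMs = lastHeartbeatMs;
        }
    }

    /**
     * Proposed order from the issue description: prefer the replica on the
     * storage with the least free space; fall back to the oldest heartbeat.
     */
    static Optional<Candidate> chooseReplicaToDelete(List<Candidate> candidates) {
        return candidates.stream()
            .min(Comparator.comparingLong((Candidate c) -> c.remainingBytes)
                           .thenComparingLong(c -> c.lastHeartbeatMs));
    }
}
```

Flipping the two comparator keys gives the pre-patch behavior, which is the whole content of the proposed change.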
[jira] [Work stopped] (HDFS-16186) Datanode kicks out hard disk logic optimization
[ https://issues.apache.org/jira/browse/HDFS-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16186 stopped by yanbin.zhang. --- > Datanode kicks out hard disk logic optimization > --- > > Key: HDFS-16186 > URL: https://issues.apache.org/jira/browse/HDFS-16186 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.1.2 > Environment: In the hadoop cluster, a certain hard disk in a certain > Datanode has a problem, but the datanode of hdfs did not kick out the hard > disk in time, causing the datanode to become a slow node >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Major > Labels: pull-request-available > Time Spent: 3h > Remaining Estimate: 0h > > 2021-08-24 08:56:10,456 WARN datanode.DataNode > (BlockSender.java:readChecksum(681)) - Could not read or failed to verify > checksum for data at offset 113115136 for block > BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709 > java.io.IOException: Input/output error > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at > org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:876) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at java.io.DataInputStream.read(DataInputStream.java:149) > at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:210) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:679) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:588) > at > 
org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:803) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:750) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.scanBlock(VolumeScanner.java:448) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:558) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633) > 2021-08-24 08:56:11,121 WARN datanode.VolumeScanner > (VolumeScanner.java:handle(292)) - Reporting bad > BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709 on > /data11/hdfs/data -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work started] (HDFS-16186) Datanode kicks out hard disk logic optimization
[ https://issues.apache.org/jira/browse/HDFS-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16186 started by yanbin.zhang. --- > Datanode kicks out hard disk logic optimization > --- > > Key: HDFS-16186 > URL: https://issues.apache.org/jira/browse/HDFS-16186 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.1.2 > Environment: In the hadoop cluster, a certain hard disk in a certain > Datanode has a problem, but the datanode of hdfs did not kick out the hard > disk in time, causing the datanode to become a slow node >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Major > Labels: pull-request-available > Time Spent: 3h > Remaining Estimate: 0h > > 2021-08-24 08:56:10,456 WARN datanode.DataNode > (BlockSender.java:readChecksum(681)) - Could not read or failed to verify > checksum for data at offset 113115136 for block > BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709 > java.io.IOException: Input/output error > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at > org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:876) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at java.io.DataInputStream.read(DataInputStream.java:149) > at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:210) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:679) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:588) > at > 
org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:803) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:750) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.scanBlock(VolumeScanner.java:448) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:558) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633) > 2021-08-24 08:56:11,121 WARN datanode.VolumeScanner > (VolumeScanner.java:handle(292)) - Reporting bad > BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709 on > /data11/hdfs/data -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-16186) Datanode kicks out hard disk logic optimization
[ https://issues.apache.org/jira/browse/HDFS-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang reassigned HDFS-16186: --- Assignee: yanbin.zhang > Datanode kicks out hard disk logic optimization > --- > > Key: HDFS-16186 > URL: https://issues.apache.org/jira/browse/HDFS-16186 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.1.2 > Environment: In the hadoop cluster, a certain hard disk in a certain > Datanode has a problem, but the datanode of hdfs did not kick out the hard > disk in time, causing the datanode to become a slow node >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Major > Labels: pull-request-available > Time Spent: 3h > Remaining Estimate: 0h > > 2021-08-24 08:56:10,456 WARN datanode.DataNode > (BlockSender.java:readChecksum(681)) - Could not read or failed to verify > checksum for data at offset 113115136 for block > BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709 > java.io.IOException: Input/output error > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at > org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:876) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at java.io.DataInputStream.read(DataInputStream.java:149) > at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:210) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:679) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:588) > at > 
org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:803) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:750) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.scanBlock(VolumeScanner.java:448) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:558) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633) > 2021-08-24 08:56:11,121 WARN datanode.VolumeScanner > (VolumeScanner.java:handle(292)) - Reporting bad > BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709 on > /data11/hdfs/data -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
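As a hedged sketch of the optimization direction discussed in HDFS-16186 (the class, method names, and thresholds below are illustrative inventions, not DataNode internals), a per-volume sliding-window failure counter would let repeated checksum-read errors mark a disk bad even when the periodic disk check itself passes:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class VolumeHealth {
    private final Deque<Long> failureTimesMs = new ArrayDeque<>();
    private final int maxFailures;  // failures tolerated inside the window
    private final long windowMs;    // sliding window length

    VolumeHealth(int maxFailures, long windowMs) {
        this.maxFailures = maxFailures;
        this.windowMs = windowMs;
    }

    /**
     * Record one I/O failure observed at nowMs. Returns true when the
     * failure count within the window reaches the threshold, i.e. the
     * volume should be kicked out instead of being re-marked healthy.
     */
    synchronized boolean recordFailure(long nowMs) {
        failureTimesMs.addLast(nowMs);
        // Drop failures that fell out of the sliding window.
        while (!failureTimesMs.isEmpty()
               && nowMs - failureTimesMs.peekFirst() > windowMs) {
            failureTimesMs.removeFirst();
        }
        return failureTimesMs.size() >= maxFailures;
    }
}
```

The point of the sketch: the decision is driven by observed read errors rather than solely by the success of the dedicated checker probe, which is how a two-day-long slow-node situation like the one described could be shortened.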
[jira] [Assigned] (HDFS-5920) Support rollback of rolling upgrade in NameNode and JournalNodes
[ https://issues.apache.org/jira/browse/HDFS-5920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang reassigned HDFS-5920: -- Assignee: (was: yanbin.zhang) > Support rollback of rolling upgrade in NameNode and JournalNodes > > > Key: HDFS-5920 > URL: https://issues.apache.org/jira/browse/HDFS-5920 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: journal-node, namenode >Reporter: Jing Zhao >Priority: Major > Attachments: HDFS-5920.000.patch, HDFS-5920.000.patch, > HDFS-5920.001.patch, HDFS-5920.002.patch, HDFS-5920.003.patch > > > This jira provides rollback functionality for NameNode and JournalNode in > rolling upgrade. > Currently the proposed rollback for rolling upgrade is: > 1. Shutdown both NN > 2. Start one of the NN using "-rollingUpgrade rollback" option > 3. This NN will load the special fsimage right before the upgrade marker, > then discard all the editlog segments after the txid of the fsimage > 4. The NN will also send RPC requests to all the JNs to discard editlog > segments. This call expects response from all the JNs. The NN will keep > running if the call succeeds. > 5. We start the other NN using bootstrapstandby rather than "-rollingUpgrade > rollback"
[jira] [Assigned] (HDFS-5920) Support rollback of rolling upgrade in NameNode and JournalNodes
[ https://issues.apache.org/jira/browse/HDFS-5920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang reassigned HDFS-5920: -- Assignee: yanbin.zhang (was: Jing Zhao) > Support rollback of rolling upgrade in NameNode and JournalNodes > > > Key: HDFS-5920 > URL: https://issues.apache.org/jira/browse/HDFS-5920 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: journal-node, namenode >Reporter: Jing Zhao >Assignee: yanbin.zhang >Priority: Major > Attachments: HDFS-5920.000.patch, HDFS-5920.000.patch, > HDFS-5920.001.patch, HDFS-5920.002.patch, HDFS-5920.003.patch > > > This jira provides rollback functionality for NameNode and JournalNode in > rolling upgrade. > Currently the proposed rollback for rolling upgrade is: > 1. Shutdown both NN > 2. Start one of the NN using "-rollingUpgrade rollback" option > 3. This NN will load the special fsimage right before the upgrade marker, > then discard all the editlog segments after the txid of the fsimage > 4. The NN will also send RPC requests to all the JNs to discard editlog > segments. This call expects response from all the JNs. The NN will keep > running if the call succeeds. > 5. We start the other NN using bootstrapstandby rather than "-rollingUpgrade > rollback" -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16437) ReverseXML processor doesn't accept XML files without the SnapshotDiffSection.
[ https://issues.apache.org/jira/browse/HDFS-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487639#comment-17487639 ] yanbin.zhang commented on HDFS-16437: - Thank you [~weichiu] for the merge and suggestion, thanks. > ReverseXML processor doesn't accept XML files without the SnapshotDiffSection. > -- > > Key: HDFS-16437 > URL: https://issues.apache.org/jira/browse/HDFS-16437 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1, 3.3.0 >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > In a cluster environment without snapshot, if you want to convert back to > fsimage through the generated xml, an error will be reported. > {code:java} > // code placeholder > [test@test001 ~]$ hdfs oiv -p ReverseXML -i fsimage_0257220.xml > -o fsimage_0257220 > OfflineImageReconstructor failed: FSImage XML ended prematurely, without > including section(s) SnapshotDiffSection > java.io.IOException: FSImage XML ended prematurely, without including > section(s) SnapshotDiffSection > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.processXml(OfflineImageReconstructor.java:1765) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.run(OfflineImageReconstructor.java:1842) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:211) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:149) > 22/01/25 15:56:52 INFO util.ExitUtil: Exiting with status 1: ExitException > {code}
[jira] [Comment Edited] (HDFS-16437) ReverseXML processor doesn't accept XML files without the SnapshotDiffSection.
[ https://issues.apache.org/jira/browse/HDFS-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486782#comment-17486782 ] yanbin.zhang edited comment on HDFS-16437 at 2/4/22, 1:58 AM: -- Yes, I'm singer-bin, thank you [~weichiu] ! was (Author: it_singer): Yes, I'm singer-bin, thank you > ReverseXML processor doesn't accept XML files without the SnapshotDiffSection. > -- > > Key: HDFS-16437 > URL: https://issues.apache.org/jira/browse/HDFS-16437 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1, 3.3.0 >Reporter: yanbin.zhang >Priority: Critical > Labels: pull-request-available > Time Spent: 4h 20m > Remaining Estimate: 0h > > In a cluster environment without snapshot, if you want to convert back to > fsimage through the generated xml, an error will be reported. > {code:java} > //代码占位符 > [test@test001 ~]$ hdfs oiv -p ReverseXML -i fsimage_0257220.xml > -o fsimage_0257220 > OfflineImageReconstructor failed: FSImage XML ended prematurely, without > including section(s) SnapshotDiffSection > java.io.IOException: FSImage XML ended prematurely, without including > section(s) SnapshotDiffSection > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.processXml(OfflineImageReconstructor.java:1765) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.run(OfflineImageReconstructor.java:1842) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:211) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:149) > 22/01/25 15:56:52 INFO util.ExitUtil: Exiting with status 1: ExitException > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16437) ReverseXML processor doesn't accept XML files without the SnapshotDiffSection.
[ https://issues.apache.org/jira/browse/HDFS-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486782#comment-17486782 ] yanbin.zhang commented on HDFS-16437: - Yes, I'm singer-bin, thank you > ReverseXML processor doesn't accept XML files without the SnapshotDiffSection. > -- > > Key: HDFS-16437 > URL: https://issues.apache.org/jira/browse/HDFS-16437 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1, 3.3.0 >Reporter: yanbin.zhang >Priority: Critical > Labels: pull-request-available > Time Spent: 4h 20m > Remaining Estimate: 0h > > In a cluster environment without snapshot, if you want to convert back to > fsimage through the generated xml, an error will be reported. > {code:java} > //代码占位符 > [test@test001 ~]$ hdfs oiv -p ReverseXML -i fsimage_0257220.xml > -o fsimage_0257220 > OfflineImageReconstructor failed: FSImage XML ended prematurely, without > including section(s) SnapshotDiffSection > java.io.IOException: FSImage XML ended prematurely, without including > section(s) SnapshotDiffSection > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.processXml(OfflineImageReconstructor.java:1765) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.run(OfflineImageReconstructor.java:1842) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:211) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:149) > 22/01/25 15:56:52 INFO util.ExitUtil: Exiting with status 1: ExitException > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16437) ReverseXML processor doesn't accept XML files without the SnapshotDiffSection.
[ https://issues.apache.org/jira/browse/HDFS-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17481622#comment-17481622 ] yanbin.zhang commented on HDFS-16437: - There is already a solution for this problem, please wait. > ReverseXML processor doesn't accept XML files without the SnapshotDiffSection. > -- > > Key: HDFS-16437 > URL: https://issues.apache.org/jira/browse/HDFS-16437 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1, 3.3.0 >Reporter: yanbin.zhang >Priority: Critical > > In a cluster environment without snapshot, if you want to convert back to > fsimage through the generated xml, an error will be reported. > {code:java} > //代码占位符 > [test@test001 ~]$ hdfs oiv -p ReverseXML -i fsimage_0257220.xml > -o fsimage_0257220 > OfflineImageReconstructor failed: FSImage XML ended prematurely, without > including section(s) SnapshotDiffSection > java.io.IOException: FSImage XML ended prematurely, without including > section(s) SnapshotDiffSection > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.processXml(OfflineImageReconstructor.java:1765) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.run(OfflineImageReconstructor.java:1842) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:211) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:149) > 22/01/25 15:56:52 INFO util.ExitUtil: Exiting with status 1: ExitException > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16437) ReverseXML processor doesn't accept XML files without the SnapshotDiffSection.
[ https://issues.apache.org/jira/browse/HDFS-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-16437: Description: In a cluster environment without snapshot, if you want to convert back to fsimage through the generated xml, an error will be reported. {code:java} //代码占位符 [test@test001 ~]$ hdfs oiv -p ReverseXML -i fsimage_0257220.xml -o fsimage_0257220 OfflineImageReconstructor failed: FSImage XML ended prematurely, without including section(s) SnapshotDiffSection java.io.IOException: FSImage XML ended prematurely, without including section(s) SnapshotDiffSection at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.processXml(OfflineImageReconstructor.java:1765) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.run(OfflineImageReconstructor.java:1842) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:211) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:149) 22/01/25 15:56:52 INFO util.ExitUtil: Exiting with status 1: ExitException {code} was:In a cluster environment without snapshot, if you want to convert back to fsimage through the generated xml, an error will be reported Environment: (was: {code:java} //代码占位符 [test@test001 ~]$ hdfs oiv -p ReverseXML -i fsimage_0257220.xml -o fsimage_0257220 OfflineImageReconstructor failed: FSImage XML ended prematurely, without including section(s) SnapshotDiffSection java.io.IOException: FSImage XML ended prematurely, without including section(s) SnapshotDiffSection at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.processXml(OfflineImageReconstructor.java:1765) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.run(OfflineImageReconstructor.java:1842) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:211) at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:149) 22/01/25 15:56:52 INFO util.ExitUtil: Exiting with status 1: ExitException {code}) > ReverseXML processor doesn't accept XML files without the SnapshotDiffSection. > -- > > Key: HDFS-16437 > URL: https://issues.apache.org/jira/browse/HDFS-16437 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1, 3.3.0 >Reporter: yanbin.zhang >Priority: Critical > > In a cluster environment without snapshot, if you want to convert back to > fsimage through the generated xml, an error will be reported. > {code:java} > //代码占位符 > [test@test001 ~]$ hdfs oiv -p ReverseXML -i fsimage_0257220.xml > -o fsimage_0257220 > OfflineImageReconstructor failed: FSImage XML ended prematurely, without > including section(s) SnapshotDiffSection > java.io.IOException: FSImage XML ended prematurely, without including > section(s) SnapshotDiffSection > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.processXml(OfflineImageReconstructor.java:1765) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.run(OfflineImageReconstructor.java:1842) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:211) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:149) > 22/01/25 15:56:52 INFO util.ExitUtil: Exiting with status 1: ExitException > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16437) ReverseXML processor doesn't accept XML files without the SnapshotDiffSection.
yanbin.zhang created HDFS-16437:
---
Summary: ReverseXML processor doesn't accept XML files without the SnapshotDiffSection.
Key: HDFS-16437
URL: https://issues.apache.org/jira/browse/HDFS-16437
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs
Affects Versions: 3.3.0, 3.1.1
Environment:
{code:java}
// code placeholder
[test@test001 ~]$ hdfs oiv -p ReverseXML -i fsimage_0257220.xml -o fsimage_0257220
OfflineImageReconstructor failed: FSImage XML ended prematurely, without including section(s) SnapshotDiffSection
java.io.IOException: FSImage XML ended prematurely, without including section(s) SnapshotDiffSection
at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.processXml(OfflineImageReconstructor.java:1765)
at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.run(OfflineImageReconstructor.java:1842)
at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:211)
at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:149)
22/01/25 15:56:52 INFO util.ExitUtil: Exiting with status 1: ExitException
{code}
Reporter: yanbin.zhang
In a cluster with no snapshots, converting the generated XML back into an fsimage fails with an error.
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
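As a hedged illustration of a possible operator-side workaround (not the fix requested by this issue, which is for the ReverseXML processor itself to accept the missing section): splice an empty SnapshotDiffSection into the exported XML before running `hdfs oiv -p ReverseXML`. The section name comes from the error message above; placing it just before the closing `</fsimage>` tag, and the acceptability of an empty section on a snapshot-free cluster, are assumptions.

```python
# Hedged workaround sketch for HDFS-16437 (not the fix shipped in Hadoop):
# splice an empty SnapshotDiffSection into an fsimage XML dump so the
# ReverseXML processor finds the section it insists on. The section name is
# taken from the error message; inserting it just before the closing
# </fsimage> tag is an assumption about the dump's layout.

def add_empty_snapshot_diff_section(xml_text: str) -> str:
    """Return xml_text with an empty SnapshotDiffSection if it lacks one."""
    if "<SnapshotDiffSection" in xml_text:
        return xml_text  # section already present; leave the dump unchanged
    closing = "</fsimage>"
    if closing not in xml_text:
        raise ValueError("input does not look like an fsimage XML dump")
    return xml_text.replace(
        closing, "<SnapshotDiffSection></SnapshotDiffSection>\n" + closing, 1
    )

if __name__ == "__main__":
    sample = "<fsimage><INodeSection></INodeSection></fsimage>"
    print(add_empty_snapshot_diff_section(sample))
```

For real dumps (which can be large), the same idea could be applied with a streaming rewrite rather than loading the whole file into memory.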
[jira] [Commented] (HDFS-16186) Datanode kicks out hard disk logic optimization
[ https://issues.apache.org/jira/browse/HDFS-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405111#comment-17405111 ] yanbin.zhang commented on HDFS-16186:
-
It took two days from the first disk error until the DataNode removed the hard disk: the faulty volume kept passing the check and reaching org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker.ResultHandler#markHealthy, so the DataNode became a slow node.
> Datanode kicks out hard disk logic optimization
> ---
>
> Key: HDFS-16186
> URL: https://issues.apache.org/jira/browse/HDFS-16186
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Affects Versions: 3.1.2
> Environment: In this Hadoop cluster, one hard disk in a DataNode failed, but HDFS did not remove the disk in time, causing the DataNode to become a slow node
> Reporter: yanbin.zhang
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> 2021-08-24 08:56:10,456 WARN datanode.DataNode (BlockSender.java:readChecksum(681)) - Could not read or failed to verify checksum for data at offset 113115136 for block BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709
> java.io.IOException: Input/output error
> at java.io.FileInputStream.readBytes(Native Method)
> at java.io.FileInputStream.read(FileInputStream.java:255)
> at org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:876)
> at java.io.FilterInputStream.read(FilterInputStream.java:133)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> at java.io.DataInputStream.read(DataInputStream.java:149)
> at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:210)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90)
> at org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:679)
> at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:588)
> at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:803)
> at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:750)
> at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.scanBlock(VolumeScanner.java:448)
> at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:558)
> at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633)
> 2021-08-24 08:56:11,121 WARN datanode.VolumeScanner (VolumeScanner.java:handle(292)) - Reporting bad BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709 on /data11/hdfs/data
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
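The optimization hinted at in the comment above can be sketched as a sliding-window error budget: instead of letting one shallow check that passes re-mark a repeatedly failing disk as healthy, recent I/O errors are remembered per volume and the volume is failed once they cross a threshold. The class, method names, and thresholds below are hypothetical illustrations, not Hadoop's actual DatasetVolumeChecker logic.

```python
# Hypothetical sketch (NOT Hadoop's actual DatasetVolumeChecker code):
# remember recent I/O errors per volume in a sliding time window so a
# disk that keeps throwing read errors cannot be re-marked healthy by a
# single passing check.
from collections import deque
from typing import Deque, Dict

class VolumeErrorTracker:
    def __init__(self, max_errors: int = 5, window_seconds: float = 3600.0):
        self.max_errors = max_errors  # errors tolerated inside the window
        self.window = window_seconds  # sliding window length in seconds
        self._errors: Dict[str, Deque[float]] = {}

    def record_io_error(self, volume: str, now: float) -> None:
        """Note one I/O error (e.g. a checksum read failure) on a volume."""
        self._errors.setdefault(volume, deque()).append(now)

    def should_fail_volume(self, volume: str, now: float) -> bool:
        """True once the volume has exceeded its error budget for the window."""
        q = self._errors.get(volume)
        if not q:
            return False
        while q and now - q[0] > self.window:
            q.popleft()  # forget errors older than the window
        return len(q) >= self.max_errors

if __name__ == "__main__":
    tracker = VolumeErrorTracker(max_errors=3, window_seconds=60.0)
    for t in (0.0, 1.0, 2.0):
        tracker.record_io_error("/data11/hdfs/data", now=t)
    print(tracker.should_fail_volume("/data11/hdfs/data", now=3.0))  # prints True
```

In a real DataNode the "fail" decision would feed into volume removal, and the threshold would need tuning against `dfs.datanode.failed.volumes.tolerated`-style operational limits; the point of the sketch is only that errors accumulate instead of being reset by each healthy check.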
[jira] [Created] (HDFS-16186) Datanode kicks out hard disk logic optimization
yanbin.zhang created HDFS-16186:
---
Summary: Datanode kicks out hard disk logic optimization
Key: HDFS-16186
URL: https://issues.apache.org/jira/browse/HDFS-16186
Project: Hadoop HDFS
Issue Type: Improvement
Components: datanode
Affects Versions: 3.1.2
Environment: In this Hadoop cluster, one hard disk in a DataNode failed, but HDFS did not remove the disk in time, causing the DataNode to become a slow node
Reporter: yanbin.zhang
2021-08-24 08:56:10,456 WARN datanode.DataNode (BlockSender.java:readChecksum(681)) - Could not read or failed to verify checksum for data at offset 113115136 for block BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709
java.io.IOException: Input/output error
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:255)
at org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:876)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:210)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:679)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:588)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:803)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:750)
at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.scanBlock(VolumeScanner.java:448)
at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:558)
at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633)
2021-08-24 08:56:11,121 WARN datanode.VolumeScanner (VolumeScanner.java:handle(292)) - Reporting bad BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709 on /data11/hdfs/data
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org