[jira] [Assigned] (HDFS-17222) fix Federation doc
[ https://issues.apache.org/jira/browse/HDFS-17222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yanbin.zhang reassigned HDFS-17222:
-----------------------------------

    Assignee:     (was: yanbin.zhang)

> fix Federation doc
> ------------------
>
>                 Key: HDFS-17222
>                 URL: https://issues.apache.org/jira/browse/HDFS-17222
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: yanbin.zhang
>            Priority: Trivial
>
> Fix some wrong symbols in Federation.md.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-17222) fix Federation doc
[ https://issues.apache.org/jira/browse/HDFS-17222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yanbin.zhang resolved HDFS-17222.
---------------------------------
    Resolution: Not A Problem

> fix Federation doc
> ------------------
>
>                 Key: HDFS-17222
>                 URL: https://issues.apache.org/jira/browse/HDFS-17222
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: yanbin.zhang
>            Priority: Trivial
>
> Fix some wrong symbols in Federation.md.
[jira] [Assigned] (HDFS-17222) fix Federation doc
[ https://issues.apache.org/jira/browse/HDFS-17222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yanbin.zhang reassigned HDFS-17222:
-----------------------------------

    Assignee: yanbin.zhang

> fix Federation doc
> ------------------
>
>                 Key: HDFS-17222
>                 URL: https://issues.apache.org/jira/browse/HDFS-17222
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Trivial
>
> Fix some wrong symbols in Federation.md.
[jira] [Created] (HDFS-17222) fix Federation doc
yanbin.zhang created HDFS-17222:
-----------------------------------

             Summary: fix Federation doc
                 Key: HDFS-17222
                 URL: https://issues.apache.org/jira/browse/HDFS-17222
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: yanbin.zhang

Fix some wrong symbols in Federation.md.
[jira] [Commented] (HDFS-16525) System.err should be used when error occurs in multiple methods in DFSAdmin class
[ https://issues.apache.org/jira/browse/HDFS-16525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522032#comment-17522032 ]

yanbin.zhang commented on HDFS-16525:
-------------------------------------

[~weichiu] [~ferhui] Could you please give some comments or suggestions?

> System.err should be used when error occurs in multiple methods in DFSAdmin class
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-16525
>                 URL: https://issues.apache.org/jira/browse/HDFS-16525
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: dfsadmin
>    Affects Versions: 3.3.2
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> System.err should be used when an error occurs in multiple methods of the
> DFSAdmin class, as follows:
> {code:java}
> //DFSAdmin#refreshCallQueue
> ...
> try{
>   proxy.getProxy().refreshCallQueue();
>   System.out.println("Refresh call queue successful for "
>       + proxy.getAddress());
> }catch (IOException ioe){
>   System.out.println("Refresh call queue failed for "
>       + proxy.getAddress());
>   exceptions.add(ioe);
> }
> ...
> {code}
> The test method closed first in TestDFSAdminWithHA also needs to be modified,
> otherwise an error will be reported, similar to the following:
> {code:java}
> [ERROR] Failures:
> [ERROR] TestDFSAdminWithHA.testRefreshCallQueueNN1DownNN2Up:726->assertOutputMatches:77
> Expected output to match 'Refresh call queue failed for.*
> Refresh call queue successful for.*
> ' but err_output was:
> Refresh call queue failed for localhost/127.0.0.1:12876
> refreshCallQueue: Call From h110/10.1.234.110 to localhost:12876 failed on
> connection exception: java.net.ConnectException: Connection refused; For more
> details see: http://wiki.apache.org/hadoop/ConnectionRefused and output was:
> Refresh call queue successful for localhost/127.0.0.1:12878
> {code}
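The change the issue asks for boils down to routing failure messages to stderr while successes stay on stdout. Below is a minimal, self-contained sketch of the corrected pattern; the class and method here are illustrative stand-ins, not the actual DFSAdmin code, which operates on HA proxy objects:

```java
import java.io.IOException;

public class RefreshCallQueueSketch {

    /**
     * Simulates one refresh attempt against a NameNode address.
     * Returns the message it printed so the routing can be checked.
     */
    static String refresh(String address, boolean reachable) {
        try {
            if (!reachable) {
                throw new IOException("Connection refused");
            }
            String msg = "Refresh call queue successful for " + address;
            System.out.println(msg);   // success stays on stdout
            return msg;
        } catch (IOException ioe) {
            String msg = "Refresh call queue failed for " + address;
            System.err.println(msg);   // the fix: failures go to stderr
            return msg;
        }
    }

    public static void main(String[] args) {
        refresh("localhost/127.0.0.1:12878", true);
        refresh("localhost/127.0.0.1:12876", false);
    }
}
```

With this split, a test like TestDFSAdminWithHA can match the failure line against the captured error stream (`err_output`) rather than stdout, which is exactly the adjustment the description mentions.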
[jira] [Commented] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522029#comment-17522029 ]

yanbin.zhang commented on HDFS-16450:
-------------------------------------

OK, I'll try to do it.

> Give priority to releasing DNs with less free space
> ---------------------------------------------------
>
>                 Key: HDFS-16450
>                 URL: https://issues.apache.org/jira/browse/HDFS-16450
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>    Affects Versions: 3.3.0
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Major
>         Attachments: HDFS-16450.001.patch
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> When deleting redundant replicas, the one with the least free space should be
> prioritized.
> {code:java}
> //BlockPlacementPolicyDefault#chooseReplicaToDelete
> final DatanodeStorageInfo storage;
> if (oldestHeartbeatStorage != null) {
>   storage = oldestHeartbeatStorage;
> } else if (minSpaceStorage != null) {
>   storage = minSpaceStorage;
> } else {
>   return null;
> }
> excessTypes.remove(storage.getStorageType());
> return storage;
> {code}
> Change the above logic to the following:
> {code:java}
> //BlockPlacementPolicyDefault#chooseReplicaToDelete
> final DatanodeStorageInfo storage;
> if (minSpaceStorage != null) {
>   storage = minSpaceStorage;
> } else if (oldestHeartbeatStorage != null) {
>   storage = oldestHeartbeatStorage;
> } else {
>   return null;
> }
> {code}
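The proposed reordering can be illustrated with a standalone sketch. This is a simplification: strings stand in for DatanodeStorageInfo, and the excessTypes bookkeeping from the real method is omitted:

```java
public class ReplicaToDeleteSketch {

    /**
     * After the proposed change, the replica on the storage with the
     * least free space is chosen first; the one with the oldest
     * heartbeat is only the fallback.
     */
    static String choose(String minSpaceStorage, String oldestHeartbeatStorage) {
        if (minSpaceStorage != null) {
            return minSpaceStorage;
        } else if (oldestHeartbeatStorage != null) {
            return oldestHeartbeatStorage;
        }
        return null;   // no candidate to delete
    }

    public static void main(String[] args) {
        // When both candidates exist, the low-space one wins.
        System.out.println(choose("dn3-low-space", "dn1-stale-heartbeat"));
    }
}
```

The current upstream code checks `oldestHeartbeatStorage` first; the patch simply swaps the order of the two branches so that space pressure takes priority over heartbeat age.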
[jira] [Comment Edited] (HDFS-16457) Make fs.getspaceused.classname reconfigurable
[ https://issues.apache.org/jira/browse/HDFS-16457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513260#comment-17513260 ]

yanbin.zhang edited comment on HDFS-16457 at 4/8/22 1:50 AM:
-------------------------------------------------------------

[~tasanuma] thanks for the review and merge.

was (Author: it_singer):
Dear God, can you help me to review my code, it took a long time to complete, I don't want to waste my time! [~weichiu] [~hexiaoqiao] [~csun]

> Make fs.getspaceused.classname reconfigurable
> ---------------------------------------------
>
>                 Key: HDFS-16457
>                 URL: https://issues.apache.org/jira/browse/HDFS-16457
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.3.0
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>          Time Spent: 6h 40m
>  Remaining Estimate: 0h
>
> Now if we want to switch fs.getspaceused.classname we need to restart the
> NameNode. It would be convenient if we could switch it at runtime.
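The kind of mechanism such a runtime switch implies can be sketched with plain reflection: re-instantiate the space-used calculator from a (possibly changed) class name without restarting the process. This is an illustration only, with made-up names, not the actual Hadoop reconfiguration protocol or the real `GetSpaceUsed` interface:

```java
public class SpaceUsedReloadSketch {

    /** Stand-in for the interface a space-used calculator implements. */
    public interface SpaceUsed {
        long getUsed();
    }

    /** A trivial implementation used to demonstrate the swap. */
    public static class DummyDU implements SpaceUsed {
        public long getUsed() { return 42L; }
    }

    // The currently active calculator; volatile so readers see the swap.
    static volatile SpaceUsed current;

    /** Reload the calculator from the configured class name at runtime. */
    static void reconfigure(String className) {
        try {
            current = (SpaceUsed) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("Cannot load " + className, e);
        }
    }

    public static void main(String[] args) {
        reconfigure("SpaceUsedReloadSketch$DummyDU");
        System.out.println(current.getUsed());
    }
}
```

The real change wires this idea into the DataNode's reconfiguration framework so that updating `fs.getspaceused.classname` and invoking the reconfig admin command takes effect without a restart.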
[jira] [Updated] (HDFS-16525) System.err should be used when error occurs in multiple methods in DFSAdmin class
[ https://issues.apache.org/jira/browse/HDFS-16525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yanbin.zhang updated HDFS-16525:
--------------------------------
    Description:
System.err should be used when an error occurs in multiple methods of the DFSAdmin class, as follows:
{code:java}
//DFSAdmin#refreshCallQueue
...
try{
  proxy.getProxy().refreshCallQueue();
  System.out.println("Refresh call queue successful for "
      + proxy.getAddress());
}catch (IOException ioe){
  System.out.println("Refresh call queue failed for "
      + proxy.getAddress());
  exceptions.add(ioe);
}
...
{code}
The test method closed first in TestDFSAdminWithHA also needs to be modified, otherwise an error will be reported, similar to the following:
{code:java}
[ERROR] Failures:
[ERROR] TestDFSAdminWithHA.testRefreshCallQueueNN1DownNN2Up:726->assertOutputMatches:77
Expected output to match 'Refresh call queue failed for.*
Refresh call queue successful for.*
' but err_output was:
Refresh call queue failed for localhost/127.0.0.1:12876
refreshCallQueue: Call From h110/10.1.234.110 to localhost:12876 failed on
connection exception: java.net.ConnectException: Connection refused; For more
details see: http://wiki.apache.org/hadoop/ConnectionRefused and output was:
Refresh call queue successful for localhost/127.0.0.1:12878
{code}

    was:
System.err should be used when an error occurs in multiple methods of the DFSAdmin class, as follows:
{code:java}
//DFSAdmin#refreshCallQueue
...
try{
  proxy.getProxy().refreshCallQueue();
  System.out.println("Refresh call queue successful for "
      + proxy.getAddress());
}catch (IOException ioe){
  System.out.println("Refresh call queue failed for "
      + proxy.getAddress());
  exceptions.add(ioe);
}
...
{code}
The test method closed first in TestDFSAdminWithHA also needs to be modified, otherwise an error will be reported:

> System.err should be used when error occurs in multiple methods in DFSAdmin class
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-16525
>                 URL: https://issues.apache.org/jira/browse/HDFS-16525
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: dfsadmin
>    Affects Versions: 3.3.2
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Major
>
> System.err should be used when an error occurs in multiple methods of the
> DFSAdmin class, as follows:
> {code:java}
> //DFSAdmin#refreshCallQueue
> ...
> try{
>   proxy.getProxy().refreshCallQueue();
>   System.out.println("Refresh call queue successful for "
>       + proxy.getAddress());
> }catch (IOException ioe){
>   System.out.println("Refresh call queue failed for "
>       + proxy.getAddress());
>   exceptions.add(ioe);
> }
> ...
> {code}
> The test method closed first in TestDFSAdminWithHA also needs to be modified,
> otherwise an error will be reported, similar to the following:
> {code:java}
> [ERROR] Failures:
> [ERROR] TestDFSAdminWithHA.testRefreshCallQueueNN1DownNN2Up:726->assertOutputMatches:77
> Expected output to match 'Refresh call queue failed for.*
> Refresh call queue successful for.*
> ' but err_output was:
> Refresh call queue failed for localhost/127.0.0.1:12876
> refreshCallQueue: Call From h110/10.1.234.110 to localhost:12876 failed on
> connection exception: java.net.ConnectException: Connection refused; For more
> details see: http://wiki.apache.org/hadoop/ConnectionRefused and output was:
> Refresh call queue successful for localhost/127.0.0.1:12878
> {code}
[jira] [Updated] (HDFS-16525) System.err should be used when error occurs in multiple methods in DFSAdmin class
[ https://issues.apache.org/jira/browse/HDFS-16525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yanbin.zhang updated HDFS-16525:
--------------------------------
    Description:
System.err should be used when an error occurs in multiple methods of the DFSAdmin class, as follows:
{code:java}
//DFSAdmin#refreshCallQueue
...
try{
  proxy.getProxy().refreshCallQueue();
  System.out.println("Refresh call queue successful for "
      + proxy.getAddress());
}catch (IOException ioe){
  System.out.println("Refresh call queue failed for "
      + proxy.getAddress());
  exceptions.add(ioe);
}
...
{code}
The test method closed first in TestDFSAdminWithHA also needs to be modified, otherwise an error will be reported:

    was:
System.err should be used when an error occurs in multiple methods of the DFSAdmin class, as follows:
{code:java}
//DFSAdmin#refreshCallQueue
...
try{
  proxy.getProxy().refreshCallQueue();
  System.out.println("Refresh call queue successful for "
      + proxy.getAddress());
}catch (IOException ioe){
  System.out.println("Refresh call queue failed for "
      + proxy.getAddress());
  exceptions.add(ioe);
}
...
{code}

> System.err should be used when error occurs in multiple methods in DFSAdmin class
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-16525
>                 URL: https://issues.apache.org/jira/browse/HDFS-16525
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: dfsadmin
>    Affects Versions: 3.3.2
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Major
>
> System.err should be used when an error occurs in multiple methods of the
> DFSAdmin class, as follows:
> {code:java}
> //DFSAdmin#refreshCallQueue
> ...
> try{
>   proxy.getProxy().refreshCallQueue();
>   System.out.println("Refresh call queue successful for "
>       + proxy.getAddress());
> }catch (IOException ioe){
>   System.out.println("Refresh call queue failed for "
>       + proxy.getAddress());
>   exceptions.add(ioe);
> }
> ...
> {code}
> The test method closed first in TestDFSAdminWithHA also needs to be modified,
> otherwise an error will be reported:
[jira] [Updated] (HDFS-16525) System.err should be used when error occurs in multiple methods in DFSAdmin class
[ https://issues.apache.org/jira/browse/HDFS-16525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yanbin.zhang updated HDFS-16525:
--------------------------------
    Description:
System.err should be used when an error occurs in multiple methods of the DFSAdmin class, as follows:
{code:java}
//DFSAdmin#refreshCallQueue
...
try{
  proxy.getProxy().refreshCallQueue();
  System.out.println("Refresh call queue successful for "
      + proxy.getAddress());
}catch (IOException ioe){
  System.out.println("Refresh call queue failed for "
      + proxy.getAddress());
  exceptions.add(ioe);
}
...
{code}

    was:
System.err should be used when an error occurs in multiple methods of the DFSAdmin class, as follows:
{code:java}
//DFSAdmin#refreshCallQueue
...
try{
  proxy.getProxy().refreshCallQueue();
  System.out.println("Refresh call queue successful for "
      + proxy.getAddress());
}catch (IOException ioe){
  System.out.println("Refresh call queue failed for "
      + proxy.getAddress());
  exceptions.add(ioe);
}
...
{code}

> System.err should be used when error occurs in multiple methods in DFSAdmin class
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-16525
>                 URL: https://issues.apache.org/jira/browse/HDFS-16525
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: dfsadmin
>    Affects Versions: 3.3.2
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Major
>
> System.err should be used when an error occurs in multiple methods of the
> DFSAdmin class, as follows:
> {code:java}
> //DFSAdmin#refreshCallQueue
> ...
> try{
>   proxy.getProxy().refreshCallQueue();
>   System.out.println("Refresh call queue successful for "
>       + proxy.getAddress());
> }catch (IOException ioe){
>   System.out.println("Refresh call queue failed for "
>       + proxy.getAddress());
>   exceptions.add(ioe);
> }
> ...
> {code}
[jira] [Updated] (HDFS-16525) System.err should be used when error occurs in multiple methods in DFSAdmin class
[ https://issues.apache.org/jira/browse/HDFS-16525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

yanbin.zhang updated HDFS-16525:
--------------------------------
    Description:
System.err should be used when an error occurs in multiple methods of the DFSAdmin class, as follows:
{code:java}
//DFSAdmin#refreshCallQueue
...
try{
  proxy.getProxy().refreshCallQueue();
  System.out.println("Refresh call queue successful for "
      + proxy.getAddress());
}catch (IOException ioe){
  System.out.println("Refresh call queue failed for "
      + proxy.getAddress());
  exceptions.add(ioe);
}
...
{code}

> System.err should be used when error occurs in multiple methods in DFSAdmin class
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-16525
>                 URL: https://issues.apache.org/jira/browse/HDFS-16525
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: dfsadmin
>    Affects Versions: 3.3.2
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Major
>
> System.err should be used when an error occurs in multiple methods of the
> DFSAdmin class, as follows:
> {code:java}
> //DFSAdmin#refreshCallQueue
> ...
> try{
>   proxy.getProxy().refreshCallQueue();
>   System.out.println("Refresh call queue successful for "
>       + proxy.getAddress());
> }catch (IOException ioe){
>   System.out.println("Refresh call queue failed for "
>       + proxy.getAddress());
>   exceptions.add(ioe);
> }
> ...
> {code}
[jira] [Created] (HDFS-16525) System.err should be used when error occurs in multiple methods in DFSAdmin class
yanbin.zhang created HDFS-16525:
-----------------------------------

             Summary: System.err should be used when error occurs in multiple methods in DFSAdmin class
                 Key: HDFS-16525
                 URL: https://issues.apache.org/jira/browse/HDFS-16525
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: dfsadmin
    Affects Versions: 3.3.2
            Reporter: yanbin.zhang
            Assignee: yanbin.zhang
[jira] [Commented] (HDFS-16457) Make fs.getspaceused.classname reconfigurable
[ https://issues.apache.org/jira/browse/HDFS-16457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513260#comment-17513260 ]

yanbin.zhang commented on HDFS-16457:
-------------------------------------

Dear God, can you help me to review my code, it took a long time to complete, I don't want to waste my time! [~weichiu] [~hexiaoqiao] [~csun]

> Make fs.getspaceused.classname reconfigurable
> ---------------------------------------------
>
>                 Key: HDFS-16457
>                 URL: https://issues.apache.org/jira/browse/HDFS-16457
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.3.0
>            Reporter: yanbin.zhang
>            Assignee: yanbin.zhang
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Now if we want to switch fs.getspaceused.classname we need to restart the
> NameNode. It would be convenient if we could switch it at runtime.
[jira] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely
[ https://issues.apache.org/jira/browse/HDFS-16064 ]

yanbin.zhang deleted comment on HDFS-16064:
-------------------------------------------

was (Author: it_singer):
I think your root cause may not be here; we never seem to have this problem during our decommissioning process.

> HDFS-721 causes DataNode decommissioning to get stuck indefinitely
> ------------------------------------------------------------------
>
>                 Key: HDFS-16064
>                 URL: https://issues.apache.org/jira/browse/HDFS-16064
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 3.2.1
>            Reporter: Kevin Wikant
>            Priority: Major
>
> Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a
> non-issue under the assumption that if the namenode & a datanode get into an
> inconsistent state for a given block pipeline, there should be another
> datanode available to replicate the block to.
> While testing datanode decommissioning using "dfs.exclude.hosts", I have
> encountered a scenario where the decommissioning gets stuck indefinitely.
> Below is the progression of events:
> * there are initially 4 datanodes: DN1, DN2, DN3, DN4
> * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
> * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in
> order to satisfy their minimum replication factor of 2
> * during this replication process
> https://issues.apache.org/jira/browse/HDFS-721 is encountered, which causes
> the following inconsistent state:
> ** DN3 thinks it has the block pipeline in FINALIZED state
> ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
> (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]):
> DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654
> dst: /DN3:9866;
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block
> BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
> * the replication is attempted again, but:
> ** DN4 has the block
> ** DN1 and/or DN2 have the block, but don't count towards the minimum
> replication factor because they are being decommissioned
> ** DN3 does not have the block & cannot have the block replicated to it
> because of HDFS-721
> * the namenode repeatedly tries to replicate the block to DN3 & repeatedly
> fails; this continues indefinitely
> * therefore DN4 is the only live datanode with the block & the minimum
> replication factor of 2 cannot be satisfied
> * because the minimum replication factor cannot be satisfied for the
> block(s) being moved off DN1 & DN2, the datanode decommissioning can never
> be completed
> {code:java}
> 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0):
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0,
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 ,
> Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is
> current datanode entering maintenance: false
> ...
> 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0):
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0,
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 ,
> Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is
> current datanode entering maintenance: false
> {code}
> Being stuck in decommissioning state forever is not an intended behavior of
> DataNode decommissioning.
> A few potential solutions:
> * Address the root cause of the problem, which is an inconsistent state
> between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721
> * Detect when datanode decommissioning is stuck due to lack of available
> datanodes for satisfying the minimum replication factor, then recover by
> re-enabling the datanodes being decommissioned
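The arithmetic behind the stuck state in the log above can be sketched as follows: replicas on decommissioning datanodes do not count as live, so with only DN4 holding a usable copy and DN3 unable to accept the block, the expected replication of 2 can never be reached. The counting rule shown here is a simplification for illustration, not the NameNode's actual bookkeeping:

```java
import java.util.Map;

public class DecommissionStuckSketch {

    /**
     * Counts replicas on datanodes that are NOT being decommissioned.
     * Key: datanode name holding the block; value: decommissioning flag.
     */
    static long liveReplicas(Map<String, Boolean> holders) {
        return holders.values().stream()
                .filter(decommissioning -> !decommissioning)
                .count();
    }

    public static void main(String[] args) {
        // DN1 and DN2 hold the block but are decommissioning; DN4 is live.
        Map<String, Boolean> holders =
                Map.of("DN1", true, "DN2", true, "DN4", false);
        long live = liveReplicas(holders);
        int expected = 2;
        // Matches the log: live replicas: 1, decommissioning replicas: 2.
        System.out.println("live=" + live + " stuck=" + (live < expected));
    }
}
```

Since DN1 and DN2 cannot finish decommissioning until every block they hold reaches its expected replication elsewhere, and that replication cannot complete, the monitor loops forever, exactly as the repeated BlockStateChange lines show.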
[jira] [Commented] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511159#comment-17511159 ]

yanbin.zhang commented on HDFS-16064:
-------------------------------------

I think your root cause may not be here; we never seem to have this problem during our decommissioning process.

> HDFS-721 causes DataNode decommissioning to get stuck indefinitely
> ------------------------------------------------------------------
>
>                 Key: HDFS-16064
>                 URL: https://issues.apache.org/jira/browse/HDFS-16064
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 3.2.1
>            Reporter: Kevin Wikant
>            Priority: Major
>
> Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a
> non-issue under the assumption that if the namenode & a datanode get into an
> inconsistent state for a given block pipeline, there should be another
> datanode available to replicate the block to.
> While testing datanode decommissioning using "dfs.exclude.hosts", I have
> encountered a scenario where the decommissioning gets stuck indefinitely.
> Below is the progression of events:
> * there are initially 4 datanodes: DN1, DN2, DN3, DN4
> * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
> * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in
> order to satisfy their minimum replication factor of 2
> * during this replication process
> https://issues.apache.org/jira/browse/HDFS-721 is encountered, which causes
> the following inconsistent state:
> ** DN3 thinks it has the block pipeline in FINALIZED state
> ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
> (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]):
> DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654
> dst: /DN3:9866;
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block
> BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
> * the replication is attempted again, but:
> ** DN4 has the block
> ** DN1 and/or DN2 have the block, but don't count towards the minimum
> replication factor because they are being decommissioned
> ** DN3 does not have the block & cannot have the block replicated to it
> because of HDFS-721
> * the namenode repeatedly tries to replicate the block to DN3 & repeatedly
> fails; this continues indefinitely
> * therefore DN4 is the only live datanode with the block & the minimum
> replication factor of 2 cannot be satisfied
> * because the minimum replication factor cannot be satisfied for the
> block(s) being moved off DN1 & DN2, the datanode decommissioning can never
> be completed
> {code:java}
> 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0):
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0,
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 ,
> Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is
> current datanode entering maintenance: false
> ...
> 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0):
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0,
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 ,
> Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is
> current datanode entering maintenance: false
> {code}
> Being stuck in decommissioning state forever is not an intended behavior of
> DataNode decommissioning.
> A few potential solutions:
> * Address the root cause of the problem, which is an inconsistent state
> between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721
> * Detect when datanode decommissioning is stuck due to lack of available
> datanodes for satisfying the minimum replication factor, then recover by
> re-enabling the datanodes being decommissioned
[jira] [Commented] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511158#comment-17511158 ]

yanbin.zhang commented on HDFS-16064:
-------------------------------------

I think your root cause may not be here; we never seem to have this problem during our decommissioning process.

> HDFS-721 causes DataNode decommissioning to get stuck indefinitely
> ------------------------------------------------------------------
>
>                 Key: HDFS-16064
>                 URL: https://issues.apache.org/jira/browse/HDFS-16064
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 3.2.1
>            Reporter: Kevin Wikant
>            Priority: Major
>
> Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a
> non-issue under the assumption that if the namenode & a datanode get into an
> inconsistent state for a given block pipeline, there should be another
> datanode available to replicate the block to.
> While testing datanode decommissioning using "dfs.exclude.hosts", I have
> encountered a scenario where the decommissioning gets stuck indefinitely.
> Below is the progression of events:
> * there are initially 4 datanodes: DN1, DN2, DN3, DN4
> * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
> * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in
> order to satisfy their minimum replication factor of 2
> * during this replication process
> https://issues.apache.org/jira/browse/HDFS-721 is encountered, which causes
> the following inconsistent state:
> ** DN3 thinks it has the block pipeline in FINALIZED state
> ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
> (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]):
> DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654
> dst: /DN3:9866;
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block
> BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
> * the replication is attempted again, but:
> ** DN4 has the block
> ** DN1 and/or DN2 have the block, but don't count towards the minimum
> replication factor because they are being decommissioned
> ** DN3 does not have the block & cannot have the block replicated to it
> because of HDFS-721
> * the namenode repeatedly tries to replicate the block to DN3 & repeatedly
> fails; this continues indefinitely
> * therefore DN4 is the only live datanode with the block & the minimum
> replication factor of 2 cannot be satisfied
> * because the minimum replication factor cannot be satisfied for the
> block(s) being moved off DN1 & DN2, the datanode decommissioning can never
> be completed
> {code:java}
> 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0):
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0,
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 ,
> Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is
> current datanode entering maintenance: false
> ...
> 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0):
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0,
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 ,
> Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is
> current datanode entering maintenance: false
> {code}
> Being stuck in decommissioning state forever is not an intended behavior of
> DataNode decommissioning.
> A few potential solutions:
> * Address the root cause of the problem, which is an inconsistent state
> between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721
> * Detect when datanode decommissioning is stuck due to lack of available
> datanodes for satisfying the minimum replication factor, then recover by
> re-enabling the datanodes being decommissioned
[jira] [Commented] (HDFS-15646) Track failing tests in HDFS
[ https://issues.apache.org/jira/browse/HDFS-15646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17510995#comment-17510995 ] yanbin.zhang commented on HDFS-15646: - [~ayushtkn] Yes, I submitted a unit test yesterday and it passed, but there are a lot of other errors not related to my test. > Track failing tests in HDFS > --- > > Key: HDFS-15646 > URL: https://issues.apache.org/jira/browse/HDFS-15646 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Reporter: Ahmed Hussein >Priority: Blocker > > There are several unit tests that have been consistently failing on Yetus for a long > period of time. > The list keeps growing and it is driving the repository into unstable > status. Qbt reports more than *40 failing unit tests* on average. > Personally, over the last week, with every submitted patch, I have had to spend > considerable time looking at the same stack traces to double-check whether or > not the patch contributes to those failures. > I found out that the majority of those tests had been failing for quite some time, > but +no Jiras were filed+. > The main problem with those consistent failures is that they have side effects > on the runtime of the other JUnit tests by sucking up resources such as memory and > ports. > {{StripedFile}} and {{EC}} tests in particular show up 100% of the time in the list > of bad tests. > I looked at those tests and they certainly need some improvements (e.g., > HDFS-15459). Is anyone interested in those test cases? Can we just turn them > off? > I'd like to give a heads-up that we need more collaboration to enforce > the stability of the codebase. > * For all developers, please, {color:#ff}file a Jira once you see a > failing test, whether it is unrelated to your patch or not{color}. This gives a > heads-up to other developers about the potential failures. Please do not stop > at commenting on your patch "_+this is unrelated to my work+_". > * Volunteer to dedicate more time to fixing flaky tests.
> * Periodically, make sure that the list of failing tests does not exceed a > certain number of tests. We have Qbt reports to monitor that, but there is no > follow-up on their status. > * We should consider aggressive strategies such as blocking any merges until > the code is brought back to stability. > * We need a clear and well-defined process to address Yetus issues: > configuration, investigating running out of memory, slowness, etc. > * Turn off the JUnit tests within the modules that are not being actively used in > the community (e.g., EC, stripedFiles, etc.). > > CC: [~aajisaka], [~elgoiri], [~kihwal], [~daryn], [~weichiu] > Do you guys have any thoughts on the current status of HDFS? > > +The following is a quick list of failing JUnit tests from Qbt reports:+ > > !https://ci-hadoop.apache.org/static/0ead8630/images/16x16/document_add.png! > [org.apache.hadoop.crypto.key.kms.server.TestKMS.testKMSProviderCaching|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/299/testReport/org.apache.hadoop.crypto.key.kms.server/TestKMS/testKMSProviderCaching/]1.5 > sec[1|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/299/] > !https://ci-hadoop.apache.org/static/0ead8630/images/16x16/document_add.png! > [org.apache.hadoop.fs.azure.TestBlobMetadata.testFolderMetadata|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/299/testReport/org.apache.hadoop.fs.azure/TestBlobMetadata/testFolderMetadata/]42 > ms[3|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/297/] > !https://ci-hadoop.apache.org/static/0ead8630/images/16x16/document_add.png!
> [org.apache.hadoop.fs.azure.TestBlobMetadata.testFirstContainerVersionMetadata|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/299/testReport/org.apache.hadoop.fs.azure/TestBlobMetadata/testFirstContainerVersionMetadata/]46 > > ms[3|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/297/] > !https://ci-hadoop.apache.org/static/0ead8630/images/16x16/document_add.png! > [org.apache.hadoop.fs.azure.TestBlobMetadata.testPermissionMetadata|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/299/testReport/org.apache.hadoop.fs.azure/TestBlobMetadata/testPermissionMetadata/]27 > > ms[3|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/297/] > !https://ci-hadoop.apache.org/static/0ead8630/images/16x16/document_add.png! > [org.apache.hadoop.fs.azure.TestBlobMetadata.testOldPermissionMetadata|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/299/testReport/org.apache.hadoop.fs.azure/TestBlobMetadata/testOldPermissionMetadata/]19 > > ms[3|https://ci-hadoop.apache.org/job/hadoop-qbt-trunk-java8-linux-x86_64/297/] >
[jira] [Work stopped] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16450 stopped by yanbin.zhang. --- > Give priority to releasing DNs with less free space > --- > > Key: HDFS-16450 > URL: https://issues.apache.org/jira/browse/HDFS-16450 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.0 >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Major > Attachments: HDFS-16450.001.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > When deleting redundant replicas, the one with the least free space should be > prioritized. > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else { > return null; > } > excessTypes.remove(storage.getStorageType()); > return storage; {code} > Change the above logic to the following: > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else { > return null; > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
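The proposed reordering can be illustrated with a simplified, self-contained model (all names here are illustrative; the real method operates on DatanodeStorageInfo and also handles excessTypes):

```java
import java.util.Comparator;
import java.util.List;

// Simplified model of the proposed selection order: prefer deleting the
// excess replica on the node with the least remaining space, falling back
// to the node with the oldest heartbeat. Class and field names are made up.
public final class ExcessReplicaChooser {
    static final class Storage {
        final String name;
        final long remaining;      // bytes free on this storage
        final long lastHeartbeat;  // ms since epoch of last heartbeat
        Storage(String name, long remaining, long lastHeartbeat) {
            this.name = name;
            this.remaining = remaining;
            this.lastHeartbeat = lastHeartbeat;
        }
    }

    static Storage choose(List<Storage> candidates) {
        if (candidates.isEmpty()) {
            return null;
        }
        Storage minSpace = candidates.stream()
            .min(Comparator.comparingLong(s -> s.remaining)).orElse(null);
        Storage oldestHeartbeat = candidates.stream()
            .min(Comparator.comparingLong(s -> s.lastHeartbeat)).orElse(null);
        // Proposed priority: least free space first, oldest heartbeat second.
        // (In the real method either candidate can be null due to other
        // placement constraints, which is why the fallback exists.)
        return minSpace != null ? minSpace : oldestHeartbeat;
    }

    public static void main(String[] args) {
        List<Storage> replicas = List.of(
            new Storage("dn1", 100L, 1_000L),
            new Storage("dn2", 50L, 2_000L),
            new Storage("dn4", 200L, 500L));
        System.out.println(choose(replicas).name); // prints dn2
    }
}
```

Under the current ordering the oldest-heartbeat node (dn4 here) would be chosen first; the patch makes the nearly-full dn2 the preferred victim.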
[jira] [Created] (HDFS-16457) Make fs.getspaceused.classname reconfigurable
yanbin.zhang created HDFS-16457: --- Summary: Make fs.getspaceused.classname reconfigurable Key: HDFS-16457 URL: https://issues.apache.org/jira/browse/HDFS-16457 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 3.3.0 Reporter: yanbin.zhang Assignee: yanbin.zhang Now, if we want to switch fs.getspaceused.classname, we need to restart the NameNode. It would be convenient if we could switch it at runtime. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
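Hadoop exposes runtime reconfiguration through org.apache.hadoop.conf.ReconfigurableBase, which the DataNode already uses for a few keys. A minimal standalone sketch of that pattern (simplified classes, not the actual Hadoop API; the property value used below is illustrative):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the reconfigurable-property pattern: a whitelist of
// keys that may change at runtime, and a hook invoked on each change.
// Illustrative only; Hadoop's real mechanism is ReconfigurableBase.
abstract class MiniReconfigurable {
    private final Map<String, String> conf = new ConcurrentHashMap<>();

    /** Keys this component allows to change without a restart. */
    protected abstract Set<String> reconfigurableProperties();

    /** Applies one changed property, e.g. by rebuilding a component. */
    protected abstract void reconfigureProperty(String key, String newValue);

    final void reconfigure(String key, String newValue) {
        if (!reconfigurableProperties().contains(key)) {
            throw new IllegalArgumentException(key + " is not reconfigurable");
        }
        conf.put(key, newValue);
        reconfigureProperty(key, newValue);
    }

    final String get(String key, String dflt) {
        return conf.getOrDefault(key, dflt);
    }
}

public final class ReconfigureDemo {
    public static void main(String[] args) {
        MiniReconfigurable dn = new MiniReconfigurable() {
            @Override protected Set<String> reconfigurableProperties() {
                return Set.of("fs.getspaceused.classname");
            }
            @Override protected void reconfigureProperty(String key, String v) {
                // The real daemon would rebuild its GetSpaceUsed instance here.
            }
        };
        // Assumed example value; the actual class name depends on the cluster.
        dn.reconfigure("fs.getspaceused.classname", "org.apache.hadoop.fs.DU");
        System.out.println(dn.get("fs.getspaceused.classname", "?"));
    }
}
```

Making fs.getspaceused.classname one of the whitelisted keys is essentially what the issue proposes: add it to the reconfigurable set and rebuild the space-used checker in the change hook.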
[jira] [Work started] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16450 started by yanbin.zhang.
[jira] [Reopened] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang reopened HDFS-16450:
[jira] [Updated] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-16450: Labels: (was: 无)
[jira] [Updated] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-16450: Fix Version/s: 3.3.2
[jira] [Updated] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-16450: Fix Version/s: (was: 3.3.2)
[jira] [Updated] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission
[ https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-14920: Labels: (was: 无) > Erasure Coding: Decommission may hang If one or more datanodes are out of > service during decommission > --- > > Key: HDFS-14920 > URL: https://issues.apache.org/jira/browse/HDFS-14920 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.0.3, 3.2.1, 3.1.3 >Reporter: Hui Fei >Assignee: Hui Fei >Priority: Major > Fix For: 3.3.0, 3.1.4, 3.2.2 > > Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch, > HDFS-14920.003.patch, HDFS-14920.004.patch, HDFS-14920.005.patch > > > Decommission test hangs in our clusters. > We have seen messages like the following: > {quote} > 2019-10-22 15:58:51,514 TRACE > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block > blk_-9223372035600425840_372987973 numExpected=9, numLive=5 > 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: > blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, > corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, > maintenance replicas: 0, live entering maintenance replicas: 0, excess > replicas: 0, Is Open File: false, Datanodes having this block: > 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 > 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 > 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current > datanode decommissioning: true, Is current datanode entering maintenance: > false > 2019-10-22 15:58:51,514 DEBUG > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node > 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate > to finish Decommission In Progress > {quote} > After digging into the source code and cluster logs, we guess it happens in the following steps: > # The storage strategy is RS-6-3-1024k. 
> # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8; b0 is on > datanode dn0, b1 is on datanode dn1, etc. > # At the beginning dn0 is in decommission progress, b0 is replicated > successfully, and dn0 is still in decommission progress. > # Later b1, b2, b3 are in decommission progress, and dn4 containing b4 is out of > service, so the block needs to be reconstructed, and an ErasureCodingWork is created to do it; in > the ErasureCodingWork, additionalReplRequired is 4. > # Because hasAllInternalBlocks is false, the NameNode will call > ErasureCodingWork#addTaskToDatanode -> > DatanodeDescriptor#addBlockToBeErasureCoded, and send a > BlockECReconstructionInfo task to the DataNode. > # The DataNode cannot reconstruct the block because targets is 4, greater > than 3 (the parity number). > There is a problem, as follows, from BlockManager.java#scheduleReconstruction: > {code} > // should reconstruct all the internal blocks before scheduling > // replication task for decommissioning node(s). > if (additionalReplRequired - numReplicas.decommissioning() - > numReplicas.liveEnteringMaintenanceReplicas() > 0) { > additionalReplRequired = additionalReplRequired - > numReplicas.decommissioning() - > numReplicas.liveEnteringMaintenanceReplicas(); > } > {code} > Reconstruction should happen first, and then replication for decommissioning. Here > numReplicas.decommissioning() is 4 and additionalReplRequired is 4, which is wrong: > numReplicas.decommissioning() should be 3, because it should exclude the replica that is also live. > Then additionalReplRequired would be 1 and reconstruction would be scheduled as > expected. After that, decommissioning goes on. > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
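The correction described above reduces to a small arithmetic change, sketched here with a hypothetical helper (the real logic lives in BlockManager#scheduleReconstruction): decommissioning replicas that already have a live duplicate should not offset the additional replicas required. Passing 0 for the live-copy count reproduces the buggy behavior.

```java
// Hypothetical sketch of the corrected computation from the description:
// replicas on decommissioning nodes that already have a live duplicate
// should not be counted when reducing additionalReplRequired. Mirrors the
// "only subtract when the result stays positive" guard in the snippet above.
public final class ReconstructionMath {
    public static int adjustedReplRequired(int additionalReplRequired,
                                           int decommissioning,
                                           int decommissioningWithLiveCopy,
                                           int liveEnteringMaintenance) {
        int offset = (decommissioning - decommissioningWithLiveCopy)
            + liveEnteringMaintenance;
        return additionalReplRequired - offset > 0
            ? additionalReplRequired - offset
            : additionalReplRequired;
    }

    public static void main(String[] args) {
        // HDFS-14920 example: additionalReplRequired = 4, 4 decommissioning
        // replicas, one of which (b0) already has a live copy -> only 1
        // internal block actually needs reconstruction.
        System.out.println(adjustedReplRequired(4, 4, 1, 0)); // prints 1
        // Buggy counting (no live-copy exclusion) leaves it at 4.
        System.out.println(adjustedReplRequired(4, 4, 0, 0)); // prints 4
    }
}
```

With the corrected count of 1, the reconstruction task fits within the 3 parity blocks and decommissioning can proceed.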
[jira] [Updated] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission
[ https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-14920: Labels: 无 (was: pull)
[jira] [Updated] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission
[ https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-14920: Labels: pull (was: )
[jira] [Updated] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-16450: Labels: 无 (was: patch)
[jira] [Updated] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-16450: Labels: patch (was: pull-request-available)
[jira] [Resolved] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang resolved HDFS-16450. - Resolution: Done
[jira] [Comment Edited] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489575#comment-17489575 ] yanbin.zhang edited comment on HDFS-16450 at 2/14/22, 1:52 AM: --- Could you please give some comments or suggestions. [~weichiu] [~hexiaoqiao] [~ferhui] was (Author: it_singer): Could you please give some comments or suggestions?[~weichiu]
[jira] [Updated] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-16450: Description: When deleting redundant replicas, the one with the least free space should be prioritized. {code:java} //BlockPlacementPolicyDefault#chooseReplicaToDelete final DatanodeStorageInfo storage; if (oldestHeartbeatStorage != null) { storage = oldestHeartbeatStorage; } else if (minSpaceStorage != null) { storage = minSpaceStorage; } else { return null; } excessTypes.remove(storage.getStorageType()); return storage; {code} Change the above logic to the following: {code:java} //BlockPlacementPolicyDefault#chooseReplicaToDelete final DatanodeStorageInfo storage; if (minSpaceStorage != null) { storage = minSpaceStorage; } else if (oldestHeartbeatStorage != null) { storage = oldestHeartbeatStorage; } else { return null; } {code} was: When deleting redundant replicas, the one with the least free space should be prioritized. {code:java} //BlockPlacementPolicyDefault#chooseReplicaToDelete final DatanodeStorageInfo storage; if (oldestHeartbeatStorage != null) { storage = oldestHeartbeatStorage; } else if (minSpaceStorage != null) { storage = minSpaceStorage; } else { return null; } excessTypes.remove(storage.getStorageType()); return storage; {code} Change the above logic to the following: {code:java} //BlockPlacementPolicyDefault#chooseReplicaToDelete final DatanodeStorageInfo storage; if (minSpaceStorage != null) { storage = minSpaceStorage; } else if (oldestHeartbeatStorage != null) { storage = oldestHeartbeatStorage; } else { return null; } {code} > Give priority to releasing DNs with less free space > --- > > Key: HDFS-16450 > URL: https://issues.apache.org/jira/browse/HDFS-16450 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.0 >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Major > Labels: pull-request-available > Attachments: HDFS-16450.001.patch > > Time 
Spent: 1h 10m > Remaining Estimate: 0h > > When deleting redundant replicas, the one with the least free space should be > prioritized. > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else { > return null; > } > excessTypes.remove(storage.getStorageType()); > return storage; {code} > Change the above logic to the following: > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else { > return null; > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work started] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16450 started by yanbin.zhang. --- > Give priority to releasing DNs with less free space > --- > > Key: HDFS-16450 > URL: https://issues.apache.org/jira/browse/HDFS-16450 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.0 >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Major > Labels: pull-request-available > Attachments: HDFS-16450.001.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > When deleting redundant replicas, the one with the least free space should be > prioritized. > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else { > return null; > } > excessTypes.remove(storage.getStorageType()); > return storage; {code} > Change the above logic to the following: > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else { > return null; > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489575#comment-17489575 ] yanbin.zhang commented on HDFS-16450: - Could you please give some comments or suggestions?[~weichiu] > Give priority to releasing DNs with less free space > --- > > Key: HDFS-16450 > URL: https://issues.apache.org/jira/browse/HDFS-16450 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.0 >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Major > Labels: pull-request-available > Attachments: HDFS-16450.001.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > When deleting redundant replicas, the one with the least free space should be > prioritized. > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else { > return null; > } > excessTypes.remove(storage.getStorageType()); > return storage; {code} > Change the above logic to the following: > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else { > return null; > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16450) Give priority to releasing DNs with less free space
[ https://issues.apache.org/jira/browse/HDFS-16450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-16450: Attachment: HDFS-16450.001.patch > Give priority to releasing DNs with less free space > --- > > Key: HDFS-16450 > URL: https://issues.apache.org/jira/browse/HDFS-16450 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.3.0 >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Major > Labels: pull-request-available > Attachments: HDFS-16450.001.patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > When deleting redundant replicas, the one with the least free space should be > prioritized. > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else { > return null; > } > excessTypes.remove(storage.getStorageType()); > return storage; {code} > Change the above logic to the following: > {code:java} > //BlockPlacementPolicyDefault#chooseReplicaToDelete > final DatanodeStorageInfo storage; > if (minSpaceStorage != null) { > storage = minSpaceStorage; > } else if (oldestHeartbeatStorage != null) { > storage = oldestHeartbeatStorage; > } else { > return null; > } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16450) Give priority to releasing DNs with less free space
yanbin.zhang created HDFS-16450: --- Summary: Give priority to releasing DNs with less free space Key: HDFS-16450 URL: https://issues.apache.org/jira/browse/HDFS-16450 Project: Hadoop HDFS Issue Type: Improvement Components: hdfs Affects Versions: 3.3.0 Reporter: yanbin.zhang Assignee: yanbin.zhang When deleting redundant replicas, the one with the least free space should be prioritized. {code:java} //BlockPlacementPolicyDefault#chooseReplicaToDelete final DatanodeStorageInfo storage; if (oldestHeartbeatStorage != null) { storage = oldestHeartbeatStorage; } else if (minSpaceStorage != null) { storage = minSpaceStorage; } else { return null; } excessTypes.remove(storage.getStorageType()); return storage; {code} Change the above logic to the following: {code:java} //BlockPlacementPolicyDefault#chooseReplicaToDelete final DatanodeStorageInfo storage; if (minSpaceStorage != null) { storage = minSpaceStorage; } else if (oldestHeartbeatStorage != null) { storage = oldestHeartbeatStorage; } else { return null; } {code}
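The reordering proposed in HDFS-16450 can be sketched as a standalone selection helper. This is a hedged illustration only: the `Candidate` class and its field names are invented for the example and are not HDFS's `DatanodeStorageInfo`; it simply shows "least free space first, oldest heartbeat as tie-breaker".

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class ReplicaSelection {
    // Illustrative stand-in for a replica's storage; not an HDFS class.
    static final class Candidate {
        final String id;
        final long remainingBytes;   // free space left on the storage
        final long lastHeartbeatMs;  // smaller value = older heartbeat
        Candidate(String id, long remainingBytes, long lastHeartbeatMs) {
            this.id = id;
            this.remainingBytes = remainingBytes;
            this.lastHeartbeatMs = lastHeartbeatMs;
        }
    }

    /**
     * Proposed order from the issue description: prefer the replica on the
     * storage with the least free space; fall back to the oldest heartbeat.
     */
    static Optional<Candidate> chooseReplicaToDelete(List<Candidate> candidates) {
        return candidates.stream()
            .min(Comparator.comparingLong((Candidate c) -> c.remainingBytes)
                           .thenComparingLong(c -> c.lastHeartbeatMs));
    }
}
```

Flipping the two comparator keys gives the pre-patch behavior, which is the whole content of the proposed change.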
[jira] [Work stopped] (HDFS-16186) Datanode kicks out hard disk logic optimization
[ https://issues.apache.org/jira/browse/HDFS-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16186 stopped by yanbin.zhang. --- > Datanode kicks out hard disk logic optimization > --- > > Key: HDFS-16186 > URL: https://issues.apache.org/jira/browse/HDFS-16186 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.1.2 > Environment: In the hadoop cluster, a certain hard disk in a certain > Datanode has a problem, but the datanode of hdfs did not kick out the hard > disk in time, causing the datanode to become a slow node >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Major > Labels: pull-request-available > Time Spent: 3h > Remaining Estimate: 0h > > 2021-08-24 08:56:10,456 WARN datanode.DataNode > (BlockSender.java:readChecksum(681)) - Could not read or failed to verify > checksum for data at offset 113115136 for block > BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709 > java.io.IOException: Input/output error > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at > org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:876) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at java.io.DataInputStream.read(DataInputStream.java:149) > at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:210) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:679) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:588) > at > 
org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:803) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:750) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.scanBlock(VolumeScanner.java:448) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:558) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633) > 2021-08-24 08:56:11,121 WARN datanode.VolumeScanner > (VolumeScanner.java:handle(292)) - Reporting bad > BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709 on > /data11/hdfs/data -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work started] (HDFS-16186) Datanode kicks out hard disk logic optimization
[ https://issues.apache.org/jira/browse/HDFS-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-16186 started by yanbin.zhang. --- > Datanode kicks out hard disk logic optimization > --- > > Key: HDFS-16186 > URL: https://issues.apache.org/jira/browse/HDFS-16186 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.1.2 > Environment: In the hadoop cluster, a certain hard disk in a certain > Datanode has a problem, but the datanode of hdfs did not kick out the hard > disk in time, causing the datanode to become a slow node >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Major > Labels: pull-request-available > Time Spent: 3h > Remaining Estimate: 0h > > 2021-08-24 08:56:10,456 WARN datanode.DataNode > (BlockSender.java:readChecksum(681)) - Could not read or failed to verify > checksum for data at offset 113115136 for block > BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709 > java.io.IOException: Input/output error > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at > org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:876) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at java.io.DataInputStream.read(DataInputStream.java:149) > at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:210) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:679) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:588) > at > 
org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:803) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:750) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.scanBlock(VolumeScanner.java:448) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:558) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633) > 2021-08-24 08:56:11,121 WARN datanode.VolumeScanner > (VolumeScanner.java:handle(292)) - Reporting bad > BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709 on > /data11/hdfs/data -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-16186) Datanode kicks out hard disk logic optimization
[ https://issues.apache.org/jira/browse/HDFS-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang reassigned HDFS-16186: --- Assignee: yanbin.zhang > Datanode kicks out hard disk logic optimization > --- > > Key: HDFS-16186 > URL: https://issues.apache.org/jira/browse/HDFS-16186 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.1.2 > Environment: In the hadoop cluster, a certain hard disk in a certain > Datanode has a problem, but the datanode of hdfs did not kick out the hard > disk in time, causing the datanode to become a slow node >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Major > Labels: pull-request-available > Time Spent: 3h > Remaining Estimate: 0h > > 2021-08-24 08:56:10,456 WARN datanode.DataNode > (BlockSender.java:readChecksum(681)) - Could not read or failed to verify > checksum for data at offset 113115136 for block > BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709 > java.io.IOException: Input/output error > at java.io.FileInputStream.readBytes(Native Method) > at java.io.FileInputStream.read(FileInputStream.java:255) > at > org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:876) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) > at java.io.BufferedInputStream.read(BufferedInputStream.java:345) > at java.io.DataInputStream.read(DataInputStream.java:149) > at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:210) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:679) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:588) > at > 
org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:803) > at > org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:750) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.scanBlock(VolumeScanner.java:448) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:558) > at > org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633) > 2021-08-24 08:56:11,121 WARN datanode.VolumeScanner > (VolumeScanner.java:handle(292)) - Reporting bad > BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709 on > /data11/hdfs/data -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
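As a hedged sketch of the optimization direction discussed in HDFS-16186 (the class, method names, and thresholds below are illustrative inventions, not DataNode internals), a per-volume sliding-window failure counter would let repeated checksum-read errors mark a disk bad even when the periodic disk check itself passes:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class VolumeHealth {
    private final Deque<Long> failureTimesMs = new ArrayDeque<>();
    private final int maxFailures;  // failures tolerated inside the window
    private final long windowMs;    // sliding window length

    VolumeHealth(int maxFailures, long windowMs) {
        this.maxFailures = maxFailures;
        this.windowMs = windowMs;
    }

    /**
     * Record one I/O failure observed at nowMs. Returns true when the
     * failure count within the window reaches the threshold, i.e. the
     * volume should be kicked out instead of being re-marked healthy.
     */
    synchronized boolean recordFailure(long nowMs) {
        failureTimesMs.addLast(nowMs);
        // Drop failures that fell out of the sliding window.
        while (!failureTimesMs.isEmpty()
               && nowMs - failureTimesMs.peekFirst() > windowMs) {
            failureTimesMs.removeFirst();
        }
        return failureTimesMs.size() >= maxFailures;
    }
}
```

The point of the sketch: the decision is driven by observed read errors rather than solely by the success of the dedicated checker probe, which is how a two-day-long slow-node situation like the one described could be shortened.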
[jira] [Assigned] (HDFS-5920) Support rollback of rolling upgrade in NameNode and JournalNodes
[ https://issues.apache.org/jira/browse/HDFS-5920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang reassigned HDFS-5920: -- Assignee: (was: yanbin.zhang) > Support rollback of rolling upgrade in NameNode and JournalNodes > > > Key: HDFS-5920 > URL: https://issues.apache.org/jira/browse/HDFS-5920 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: journal-node, namenode >Reporter: Jing Zhao >Priority: Major > Attachments: HDFS-5920.000.patch, HDFS-5920.000.patch, > HDFS-5920.001.patch, HDFS-5920.002.patch, HDFS-5920.003.patch > > > This jira provides rollback functionality for NameNode and JournalNode in > rolling upgrade. > Currently the proposed rollback for rolling upgrade is: > 1. Shutdown both NN > 2. Start one of the NN using "-rollingUpgrade rollback" option > 3. This NN will load the special fsimage right before the upgrade marker, > then discard all the editlog segments after the txid of the fsimage > 4. The NN will also send RPC requests to all the JNs to discard editlog > segments. This call expects response from all the JNs. The NN will keep > running if the call succeeds. > 5. We start the other NN using bootstrapstandby rather than "-rollingUpgrade > rollback"
[jira] [Assigned] (HDFS-5920) Support rollback of rolling upgrade in NameNode and JournalNodes
[ https://issues.apache.org/jira/browse/HDFS-5920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang reassigned HDFS-5920: -- Assignee: yanbin.zhang (was: Jing Zhao) > Support rollback of rolling upgrade in NameNode and JournalNodes > > > Key: HDFS-5920 > URL: https://issues.apache.org/jira/browse/HDFS-5920 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: journal-node, namenode >Reporter: Jing Zhao >Assignee: yanbin.zhang >Priority: Major > Attachments: HDFS-5920.000.patch, HDFS-5920.000.patch, > HDFS-5920.001.patch, HDFS-5920.002.patch, HDFS-5920.003.patch > > > This jira provides rollback functionality for NameNode and JournalNode in > rolling upgrade. > Currently the proposed rollback for rolling upgrade is: > 1. Shutdown both NN > 2. Start one of the NN using "-rollingUpgrade rollback" option > 3. This NN will load the special fsimage right before the upgrade marker, > then discard all the editlog segments after the txid of the fsimage > 4. The NN will also send RPC requests to all the JNs to discard editlog > segments. This call expects response from all the JNs. The NN will keep > running if the call succeeds. > 5. We start the other NN using bootstrapstandby rather than "-rollingUpgrade > rollback" -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16437) ReverseXML processor doesn't accept XML files without the SnapshotDiffSection.
[ https://issues.apache.org/jira/browse/HDFS-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487639#comment-17487639 ] yanbin.zhang commented on HDFS-16437: - Thank you [~weichiu] for the merge and suggestion, thanks. > ReverseXML processor doesn't accept XML files without the SnapshotDiffSection. > -- > > Key: HDFS-16437 > URL: https://issues.apache.org/jira/browse/HDFS-16437 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1, 3.3.0 >Reporter: yanbin.zhang >Assignee: yanbin.zhang >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > In a cluster environment without snapshot, if you want to convert back to > fsimage through the generated xml, an error will be reported. > {code:java} > // code placeholder > [test@test001 ~]$ hdfs oiv -p ReverseXML -i fsimage_0257220.xml > -o fsimage_0257220 > OfflineImageReconstructor failed: FSImage XML ended prematurely, without > including section(s) SnapshotDiffSection > java.io.IOException: FSImage XML ended prematurely, without including > section(s) SnapshotDiffSection > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.processXml(OfflineImageReconstructor.java:1765) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.run(OfflineImageReconstructor.java:1842) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:211) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:149) > 22/01/25 15:56:52 INFO util.ExitUtil: Exiting with status 1: ExitException > {code}
[jira] [Comment Edited] (HDFS-16437) ReverseXML processor doesn't accept XML files without the SnapshotDiffSection.
[ https://issues.apache.org/jira/browse/HDFS-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486782#comment-17486782 ] yanbin.zhang edited comment on HDFS-16437 at 2/4/22, 1:58 AM: -- Yes, I'm singer-bin, thank you [~weichiu] ! was (Author: it_singer): Yes, I'm singer-bin, thank you > ReverseXML processor doesn't accept XML files without the SnapshotDiffSection. > -- > > Key: HDFS-16437 > URL: https://issues.apache.org/jira/browse/HDFS-16437 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1, 3.3.0 >Reporter: yanbin.zhang >Priority: Critical > Labels: pull-request-available > Time Spent: 4h 20m > Remaining Estimate: 0h > > In a cluster environment without snapshot, if you want to convert back to > fsimage through the generated xml, an error will be reported. > {code:java} > //代码占位符 > [test@test001 ~]$ hdfs oiv -p ReverseXML -i fsimage_0257220.xml > -o fsimage_0257220 > OfflineImageReconstructor failed: FSImage XML ended prematurely, without > including section(s) SnapshotDiffSection > java.io.IOException: FSImage XML ended prematurely, without including > section(s) SnapshotDiffSection > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.processXml(OfflineImageReconstructor.java:1765) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.run(OfflineImageReconstructor.java:1842) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:211) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:149) > 22/01/25 15:56:52 INFO util.ExitUtil: Exiting with status 1: ExitException > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16437) ReverseXML processor doesn't accept XML files without the SnapshotDiffSection.
[ https://issues.apache.org/jira/browse/HDFS-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17486782#comment-17486782 ] yanbin.zhang commented on HDFS-16437: - Yes, I'm singer-bin, thank you > ReverseXML processor doesn't accept XML files without the SnapshotDiffSection. > -- > > Key: HDFS-16437 > URL: https://issues.apache.org/jira/browse/HDFS-16437 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1, 3.3.0 >Reporter: yanbin.zhang >Priority: Critical > Labels: pull-request-available > Time Spent: 4h 20m > Remaining Estimate: 0h > > In a cluster environment without snapshot, if you want to convert back to > fsimage through the generated xml, an error will be reported. > {code:java} > //代码占位符 > [test@test001 ~]$ hdfs oiv -p ReverseXML -i fsimage_0257220.xml > -o fsimage_0257220 > OfflineImageReconstructor failed: FSImage XML ended prematurely, without > including section(s) SnapshotDiffSection > java.io.IOException: FSImage XML ended prematurely, without including > section(s) SnapshotDiffSection > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.processXml(OfflineImageReconstructor.java:1765) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.run(OfflineImageReconstructor.java:1842) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:211) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:149) > 22/01/25 15:56:52 INFO util.ExitUtil: Exiting with status 1: ExitException > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16437) ReverseXML processor doesn't accept XML files without the SnapshotDiffSection.
[ https://issues.apache.org/jira/browse/HDFS-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17481622#comment-17481622 ] yanbin.zhang commented on HDFS-16437: - There is already a solution for this problem, please wait. > ReverseXML processor doesn't accept XML files without the SnapshotDiffSection. > -- > > Key: HDFS-16437 > URL: https://issues.apache.org/jira/browse/HDFS-16437 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1, 3.3.0 >Reporter: yanbin.zhang >Priority: Critical > > In a cluster environment without snapshot, if you want to convert back to > fsimage through the generated xml, an error will be reported. > {code:java} > //代码占位符 > [test@test001 ~]$ hdfs oiv -p ReverseXML -i fsimage_0257220.xml > -o fsimage_0257220 > OfflineImageReconstructor failed: FSImage XML ended prematurely, without > including section(s) SnapshotDiffSection > java.io.IOException: FSImage XML ended prematurely, without including > section(s) SnapshotDiffSection > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.processXml(OfflineImageReconstructor.java:1765) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.run(OfflineImageReconstructor.java:1842) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:211) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:149) > 22/01/25 15:56:52 INFO util.ExitUtil: Exiting with status 1: ExitException > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16437) ReverseXML processor doesn't accept XML files without the SnapshotDiffSection.
[ https://issues.apache.org/jira/browse/HDFS-16437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yanbin.zhang updated HDFS-16437: Description: In a cluster environment without snapshot, if you want to convert back to fsimage through the generated xml, an error will be reported. {code:java} //代码占位符 [test@test001 ~]$ hdfs oiv -p ReverseXML -i fsimage_0257220.xml -o fsimage_0257220 OfflineImageReconstructor failed: FSImage XML ended prematurely, without including section(s) SnapshotDiffSection java.io.IOException: FSImage XML ended prematurely, without including section(s) SnapshotDiffSection at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.processXml(OfflineImageReconstructor.java:1765) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.run(OfflineImageReconstructor.java:1842) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:211) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:149) 22/01/25 15:56:52 INFO util.ExitUtil: Exiting with status 1: ExitException {code} was:In a cluster environment without snapshot, if you want to convert back to fsimage through the generated xml, an error will be reported Environment: (was: {code:java} //代码占位符 [test@test001 ~]$ hdfs oiv -p ReverseXML -i fsimage_0257220.xml -o fsimage_0257220 OfflineImageReconstructor failed: FSImage XML ended prematurely, without including section(s) SnapshotDiffSection java.io.IOException: FSImage XML ended prematurely, without including section(s) SnapshotDiffSection at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.processXml(OfflineImageReconstructor.java:1765) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.run(OfflineImageReconstructor.java:1842) at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:211) at 
org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:149) 22/01/25 15:56:52 INFO util.ExitUtil: Exiting with status 1: ExitException {code}) > ReverseXML processor doesn't accept XML files without the SnapshotDiffSection. > -- > > Key: HDFS-16437 > URL: https://issues.apache.org/jira/browse/HDFS-16437 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1, 3.3.0 >Reporter: yanbin.zhang >Priority: Critical > > In a cluster environment without snapshot, if you want to convert back to > fsimage through the generated xml, an error will be reported. > {code:java} > //代码占位符 > [test@test001 ~]$ hdfs oiv -p ReverseXML -i fsimage_0257220.xml > -o fsimage_0257220 > OfflineImageReconstructor failed: FSImage XML ended prematurely, without > including section(s) SnapshotDiffSection > java.io.IOException: FSImage XML ended prematurely, without including > section(s) SnapshotDiffSection > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.processXml(OfflineImageReconstructor.java:1765) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.run(OfflineImageReconstructor.java:1842) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:211) > at > org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:149) > 22/01/25 15:56:52 INFO util.ExitUtil: Exiting with status 1: ExitException > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16437) ReverseXML processor doesn't accept XML files without the SnapshotDiffSection.
yanbin.zhang created HDFS-16437:
---
Summary: ReverseXML processor doesn't accept XML files without the SnapshotDiffSection.
Key: HDFS-16437
URL: https://issues.apache.org/jira/browse/HDFS-16437
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs
Affects Versions: 3.3.0, 3.1.1
Environment:
{code:java}
// code placeholder
[test@test001 ~]$ hdfs oiv -p ReverseXML -i fsimage_0257220.xml -o fsimage_0257220
OfflineImageReconstructor failed: FSImage XML ended prematurely, without including section(s) SnapshotDiffSection
java.io.IOException: FSImage XML ended prematurely, without including section(s) SnapshotDiffSection
at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.processXml(OfflineImageReconstructor.java:1765)
at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageReconstructor.run(OfflineImageReconstructor.java:1842)
at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.run(OfflineImageViewerPB.java:211)
at org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewerPB.main(OfflineImageViewerPB.java:149)
22/01/25 15:56:52 INFO util.ExitUtil: Exiting with status 1: ExitException
{code}
Reporter: yanbin.zhang
In a cluster with no snapshots, converting the generated XML back into an fsimage fails with an error.
-- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
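As a hedged illustration of a possible operator-side workaround (not the fix requested by this issue, which is for the ReverseXML processor itself to accept the missing section): splice an empty SnapshotDiffSection into the exported XML before running `hdfs oiv -p ReverseXML`. The section name comes from the error message above; placing it just before the closing `</fsimage>` tag, and the acceptability of an empty section on a snapshot-free cluster, are assumptions.

```python
# Hedged workaround sketch for HDFS-16437 (not the fix shipped in Hadoop):
# splice an empty SnapshotDiffSection into an fsimage XML dump so the
# ReverseXML processor finds the section it insists on. The section name is
# taken from the error message; inserting it just before the closing
# </fsimage> tag is an assumption about the dump's layout.

def add_empty_snapshot_diff_section(xml_text: str) -> str:
    """Return xml_text with an empty SnapshotDiffSection if it lacks one."""
    if "<SnapshotDiffSection" in xml_text:
        return xml_text  # section already present; leave the dump unchanged
    closing = "</fsimage>"
    if closing not in xml_text:
        raise ValueError("input does not look like an fsimage XML dump")
    return xml_text.replace(
        closing, "<SnapshotDiffSection></SnapshotDiffSection>\n" + closing, 1
    )

if __name__ == "__main__":
    sample = "<fsimage><INodeSection></INodeSection></fsimage>"
    print(add_empty_snapshot_diff_section(sample))
```

For real dumps (which can be large), the same idea could be applied with a streaming rewrite rather than loading the whole file into memory.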
[jira] [Commented] (HDFS-16186) Datanode kicks out hard disk logic optimization
[ https://issues.apache.org/jira/browse/HDFS-16186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405111#comment-17405111 ] yanbin.zhang commented on HDFS-16186:
-
It took two days from the first disk error until the DataNode removed the hard disk: the faulty volume kept passing the check and reaching org.apache.hadoop.hdfs.server.datanode.checker.DatasetVolumeChecker.ResultHandler#markHealthy, so the DataNode became a slow node.
> Datanode kicks out hard disk logic optimization
> ---
>
> Key: HDFS-16186
> URL: https://issues.apache.org/jira/browse/HDFS-16186
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Affects Versions: 3.1.2
> Environment: In this Hadoop cluster, one hard disk in a DataNode failed, but HDFS did not remove the disk in time, causing the DataNode to become a slow node
> Reporter: yanbin.zhang
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> 2021-08-24 08:56:10,456 WARN datanode.DataNode (BlockSender.java:readChecksum(681)) - Could not read or failed to verify checksum for data at offset 113115136 for block BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709
> java.io.IOException: Input/output error
> at java.io.FileInputStream.readBytes(Native Method)
> at java.io.FileInputStream.read(FileInputStream.java:255)
> at org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:876)
> at java.io.FilterInputStream.read(FilterInputStream.java:133)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
> at java.io.DataInputStream.read(DataInputStream.java:149)
> at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:210)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90)
> at org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:679)
> at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:588)
> at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:803)
> at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:750)
> at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.scanBlock(VolumeScanner.java:448)
> at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:558)
> at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633)
> 2021-08-24 08:56:11,121 WARN datanode.VolumeScanner (VolumeScanner.java:handle(292)) - Reporting bad BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709 on /data11/hdfs/data
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
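The optimization hinted at in the comment above can be sketched as a sliding-window error budget: instead of letting one shallow check that passes re-mark a repeatedly failing disk as healthy, recent I/O errors are remembered per volume and the volume is failed once they cross a threshold. The class, method names, and thresholds below are hypothetical illustrations, not Hadoop's actual DatasetVolumeChecker logic.

```python
# Hypothetical sketch (NOT Hadoop's actual DatasetVolumeChecker code):
# remember recent I/O errors per volume in a sliding time window so a
# disk that keeps throwing read errors cannot be re-marked healthy by a
# single passing check.
from collections import deque
from typing import Deque, Dict

class VolumeErrorTracker:
    def __init__(self, max_errors: int = 5, window_seconds: float = 3600.0):
        self.max_errors = max_errors  # errors tolerated inside the window
        self.window = window_seconds  # sliding window length in seconds
        self._errors: Dict[str, Deque[float]] = {}

    def record_io_error(self, volume: str, now: float) -> None:
        """Note one I/O error (e.g. a checksum read failure) on a volume."""
        self._errors.setdefault(volume, deque()).append(now)

    def should_fail_volume(self, volume: str, now: float) -> bool:
        """True once the volume has exceeded its error budget for the window."""
        q = self._errors.get(volume)
        if not q:
            return False
        while q and now - q[0] > self.window:
            q.popleft()  # forget errors older than the window
        return len(q) >= self.max_errors

if __name__ == "__main__":
    tracker = VolumeErrorTracker(max_errors=3, window_seconds=60.0)
    for t in (0.0, 1.0, 2.0):
        tracker.record_io_error("/data11/hdfs/data", now=t)
    print(tracker.should_fail_volume("/data11/hdfs/data", now=3.0))  # prints True
```

In a real DataNode the "fail" decision would feed into volume removal, and the threshold would need tuning against `dfs.datanode.failed.volumes.tolerated`-style operational limits; the point of the sketch is only that errors accumulate instead of being reset by each healthy check.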
[jira] [Created] (HDFS-16186) Datanode kicks out hard disk logic optimization
yanbin.zhang created HDFS-16186:
---
Summary: Datanode kicks out hard disk logic optimization
Key: HDFS-16186
URL: https://issues.apache.org/jira/browse/HDFS-16186
Project: Hadoop HDFS
Issue Type: Improvement
Components: datanode
Affects Versions: 3.1.2
Environment: In this Hadoop cluster, one hard disk in a DataNode failed, but HDFS did not remove the disk in time, causing the DataNode to become a slow node
Reporter: yanbin.zhang
2021-08-24 08:56:10,456 WARN datanode.DataNode (BlockSender.java:readChecksum(681)) - Could not read or failed to verify checksum for data at offset 113115136 for block BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709
java.io.IOException: Input/output error
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:255)
at org.apache.hadoop.hdfs.server.datanode.FileIoProvider$WrappedFileInputStream.read(FileIoProvider.java:876)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:210)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.ReplicaInputStreams.readChecksumFully(ReplicaInputStreams.java:90)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.readChecksum(BlockSender.java:679)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:588)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.doSendBlock(BlockSender.java:803)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:750)
at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.scanBlock(VolumeScanner.java:448)
at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.runLoop(VolumeScanner.java:558)
at org.apache.hadoop.hdfs.server.datanode.VolumeScanner.run(VolumeScanner.java:633)
2021-08-24 08:56:11,121 WARN datanode.VolumeScanner (VolumeScanner.java:handle(292)) - Reporting bad BP-1801371083-x.x.x.x-1603704063698:blk_5635828768_4563943709 on /data11/hdfs/data
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org