[jira] [Commented] (HDFS-14694) Call recoverLease on DFSOutputStream close exception

2020-09-03 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190570#comment-17190570
 ] 

Lisheng Sun commented on HDFS-14694:


The failed UT is not related to this patch.

> Call recoverLease on DFSOutputStream close exception
> 
>
> Key: HDFS-14694
> URL: https://issues.apache.org/jira/browse/HDFS-14694
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Chen Zhang
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, 
> HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, 
> HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch, 
> HDFS-14694.009.patch, HDFS-14694.010.patch, HDFS-14694.011.patch
>
>
> HDFS uses file leases to manage opened files: when a file is not closed 
> normally, the NN recovers the lease automatically after the hard limit is 
> exceeded. But for a long-running service (e.g. HBase), the hdfs-client never 
> dies, so the NN never gets a chance to recover the file.
> Usually client programs need to handle exceptions themselves to avoid this 
> condition (e.g. HBase automatically calls recoverLease for files that were 
> not closed normally), but in our experience most services (in our company) 
> don't handle this condition properly, which leaves lots of files in an 
> abnormal status or even causes data loss.
> This Jira proposes a feature that calls the recoverLease operation 
> automatically when DFSOutputStream close encounters an exception. It should 
> be disabled by default, but when somebody builds a long-running service based 
> on HDFS, they can enable this option.
> We've had this feature in our internal Hadoop distribution for more than 3 
> years, and it has been quite useful in our experience.
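
For context, a minimal sketch of the client-side pattern this feature would 
automate, using the public {{DistributedFileSystem#recoverLease(Path)}} API 
(error handling simplified; the patch's actual config key is not shown here):
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

class SafeClose {
  /** Close the stream; if close fails, ask the NN to recover the lease. */
  static void closeOrRecover(DistributedFileSystem dfs, Path file,
      FSDataOutputStream out) throws IOException {
    try {
      out.close();
    } catch (IOException e) {
      // recoverLease() is asynchronous: false means recovery was started
      // but the file is not closed yet, so callers may poll it again.
      boolean closed = dfs.recoverLease(file);
      if (!closed) {
        // Poll recoverLease(file) (with backoff) until it returns true.
      }
    }
  }
}
{code}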



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14694) Call recoverLease on DFSOutputStream close exception

2020-09-03 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190568#comment-17190568
 ] 

Hadoop QA commented on HDFS-14694:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m 
20s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
10s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m 
33s{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m  
5s{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
12s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
18m 35s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
27s{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
56s{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  3m 
13s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
46s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
24s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
 3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m 
31s{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  4m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
59s{color} | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 13s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
23s{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
48s{color} | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  6m  
2s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| 

[jira] [Work logged] (HDFS-15554) RBF: force router check file existence in destinations before adding/updating mount points

2020-09-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15554?focusedWorklogId=478944=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478944
 ]

ASF GitHub Bot logged work on HDFS-15554:
-

Author: ASF GitHub Bot
Created on: 04/Sep/20 05:44
Start Date: 04/Sep/20 05:44
Worklog Time Spent: 10m 
  Work Description: fengnanli commented on a change in pull request #2266:
URL: https://github.com/apache/hadoop/pull/2266#discussion_r483399376



##
File path: 
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterAdminServer.java
##
@@ -562,11 +595,35 @@ public GetDestinationResponse getDestination(
   LOG.error("Cannot get location for {}: {}",
   src, ioe.getMessage());
 }
-if (nsIds.isEmpty() && !locations.isEmpty()) {
-  String nsId = locations.get(0).getNameserviceId();
-  nsIds.add(nsId);
+return nsIds;
+  }
+
+  /**
+   * Verify the file exists in destination nameservices to avoid dangling
+   * mount points.
+   *
+   * @param entry the new mount points added, could be from add or update.
+   * @return destination nameservices where the file doesn't exist.
+   * @throws IOException
+   */
+  private List<String> verifyFileInDestinations(MountTable entry)

Review comment:
   @goiri Uploaded an early version that tries to fix all the tests. This is 
pretty tedious work, so before I spend more time on it, let me know your 
thoughts.
   There are mainly two types of tests when dealing with the mount table:
   1. Tests that use a mock RouterRpcServer and so on; this way no downstream 
namenode calls are made. I added the mock as well, see the change for 
TestRouterAdmin.java.
   2. Tests with real downstream namenode interaction, see 
TestRouterMountTable.java. I created the paths before changing the mount 
points.
   
   I keep thinking a much easier way would be to add a Router server-side 
config that turns this check on, with the default being on (see the sketch 
after this comment). In the tests I could just turn the config off explicitly, 
and that way I wouldn't need to deal with individual tests.
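
   A minimal sketch of that toggle (the config key name and default are 
hypothetical; the patch would define the actual ones):
   ```java
   // Hypothetical key, shown only to illustrate the idea.
   static final String MOUNT_VERIFY_DEST_KEY =
       "dfs.federation.router.admin.mount-table.verify-destination";
   static final boolean MOUNT_VERIFY_DEST_DEFAULT = true;

   // In the add/update mount table entry handling:
   if (conf.getBoolean(MOUNT_VERIFY_DEST_KEY, MOUNT_VERIFY_DEST_DEFAULT)) {
     List<String> missing = verifyFileInDestinations(entry);
     if (!missing.isEmpty()) {
       throw new IOException(
           "File not found in destination nameservices: " + missing);
     }
   }
   ```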





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 478944)
Time Spent: 1.5h  (was: 1h 20m)

> RBF: force router check file existence in destinations before adding/updating 
> mount points
> --
>
> Key: HDFS-15554
> URL: https://issues.apache.org/jira/browse/HDFS-15554
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Adding/updating mount points is currently a router-only action, with no 
> validation in the downstream namenodes that the destination files/directories 
> exist.
> In practice we have set up dangling mount points: when clients call 
> listStatus they get the file returned, but if they then try to access the 
> file, a FileNotFoundException is thrown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13678) StorageType is incompatible when rolling upgrade to 2.6/2.6+ versions

2020-09-03 Thread Masatake Iwasaki (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190567#comment-17190567
 ] 

Masatake Iwasaki commented on HDFS-13678:
-

Updated the target version while preparing the 2.10.1 release.

> StorageType is incompatible when rolling upgrade to 2.6/2.6+ versions
> -
>
> Key: HDFS-13678
> URL: https://issues.apache.org/jira/browse/HDFS-13678
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rolling upgrades
>Affects Versions: 2.5.0
>Reporter: Yiqun Lin
>Priority: Major
>
> In version 2.6.0, we supported more storage types in HDFS, implemented in 
> HDFS-6584. But this seems to be an incompatible change: when we 
> rolling-upgrade our cluster from 2.5.0 to 2.6.0, the following error is 
> thrown.
> {noformat}
> 2018-06-14 11:43:39,246 ERROR [DataNode: 
> [[[DISK]file:/home/vipshop/hard_disk/dfs/, [DISK]file:/data1/dfs/, 
> [DISK]file:/data2/dfs/]] heartbeating to xx.xx.xx.xx:8022] 
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService 
> for Block pool BP-670256553-xx.xx.xx.xx-1528795419404 (Datanode Uuid 
> ab150e05-fcb7-49ed-b8ba-f05c27593fee) service to xx.xx.xx.xx:8022
> java.lang.ArrayStoreException
>  at java.util.ArrayList.toArray(ArrayList.java:412)
>  at 
> java.util.Collections$UnmodifiableCollection.toArray(Collections.java:1034)
>  at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1030)
>  at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:836)
>  at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:146)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:566)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:664)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:835)
>  at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The scenario is that an old DN fails to parse the StorageType it got from a 
> new NN. This error occurs while sending heartbeats to the NN, so blocks won't 
> be reported to the NN successfully, which leads to subsequent errors.
> Corresponding logic in 2.5.0:
> {code}
>   public static BlockCommand convert(BlockCommandProto blkCmd) {
> ...
> StorageType[][] targetStorageTypes = new StorageType[targetList.size()][];
> List<StorageTypesProto> targetStorageTypesList = 
> blkCmd.getTargetStorageTypesList();
> if (targetStorageTypesList.isEmpty()) { // missing storage types
>   for(int i = 0; i < targetStorageTypes.length; i++) {
> targetStorageTypes[i] = new StorageType[targets[i].length];
> Arrays.fill(targetStorageTypes[i], StorageType.DEFAULT);
>   }
> } else {
>   for(int i = 0; i < targetStorageTypes.length; i++) {
> List<StorageTypeProto> p = 
> targetStorageTypesList.get(i).getStorageTypesList();
> targetStorageTypes[i] = p.toArray(new StorageType[p.size()]);  <-- 
> error here
>   }
> }
> {code}
> But given the current logic, it would be better to return the default type 
> instead of throwing an exception in case StorageType changes (new fields or 
> new types added) in newer versions during a rolling upgrade.
> {code:java}
> public static StorageType convertStorageType(StorageTypeProto type) {
> switch(type) {
> case DISK:
>   return StorageType.DISK;
> case SSD:
>   return StorageType.SSD;
> case ARCHIVE:
>   return StorageType.ARCHIVE;
> case RAM_DISK:
>   return StorageType.RAM_DISK;
> case PROVIDED:
>   return StorageType.PROVIDED;
> default:
>   throw new IllegalStateException(
>   "BUG: StorageTypeProto not found, type=" + type);
> }
>   }
> {code}
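
A minimal sketch of the change suggested above, where only the default branch 
differs (this is the reporter's proposal, not a committed fix):
{code:java}
public static StorageType convertStorageType(StorageTypeProto type) {
  switch (type) {
  case DISK:
    return StorageType.DISK;
  case SSD:
    return StorageType.SSD;
  case ARCHIVE:
    return StorageType.ARCHIVE;
  case RAM_DISK:
    return StorageType.RAM_DISK;
  case PROVIDED:
    return StorageType.PROVIDED;
  default:
    // Unknown value from a newer NN during a rolling upgrade: fall back
    // to the default type instead of failing the heartbeat.
    return StorageType.DEFAULT;
  }
}
{code}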



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13678) StorageType is incompatible when rolling upgrade to 2.6/2.6+ versions

2020-09-03 Thread Masatake Iwasaki (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated HDFS-13678:

Target Version/s: 2.9.3, 2.10.2  (was: 2.9.3, 2.10.1)

> StorageType is incompatible when rolling upgrade to 2.6/2.6+ versions
> -
>
> Key: HDFS-13678
> URL: https://issues.apache.org/jira/browse/HDFS-13678
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rolling upgrades
>Affects Versions: 2.5.0
>Reporter: Yiqun Lin
>Priority: Major
>
> In version 2.6.0, we supported more storage types in HDFS, implemented in 
> HDFS-6584. But this seems to be an incompatible change: when we 
> rolling-upgrade our cluster from 2.5.0 to 2.6.0, the following error is 
> thrown.
> {noformat}
> 2018-06-14 11:43:39,246 ERROR [DataNode: 
> [[[DISK]file:/home/vipshop/hard_disk/dfs/, [DISK]file:/data1/dfs/, 
> [DISK]file:/data2/dfs/]] heartbeating to xx.xx.xx.xx:8022] 
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService 
> for Block pool BP-670256553-xx.xx.xx.xx-1528795419404 (Datanode Uuid 
> ab150e05-fcb7-49ed-b8ba-f05c27593fee) service to xx.xx.xx.xx:8022
> java.lang.ArrayStoreException
>  at java.util.ArrayList.toArray(ArrayList.java:412)
>  at 
> java.util.Collections$UnmodifiableCollection.toArray(Collections.java:1034)
>  at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1030)
>  at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:836)
>  at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:146)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:566)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:664)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:835)
>  at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The scenario is that an old DN fails to parse the StorageType it got from a 
> new NN. This error occurs while sending heartbeats to the NN, so blocks won't 
> be reported to the NN successfully, which leads to subsequent errors.
> Corresponding logic in 2.5.0:
> {code}
>   public static BlockCommand convert(BlockCommandProto blkCmd) {
> ...
> StorageType[][] targetStorageTypes = new StorageType[targetList.size()][];
> List<StorageTypesProto> targetStorageTypesList = 
> blkCmd.getTargetStorageTypesList();
> if (targetStorageTypesList.isEmpty()) { // missing storage types
>   for(int i = 0; i < targetStorageTypes.length; i++) {
> targetStorageTypes[i] = new StorageType[targets[i].length];
> Arrays.fill(targetStorageTypes[i], StorageType.DEFAULT);
>   }
> } else {
>   for(int i = 0; i < targetStorageTypes.length; i++) {
> List<StorageTypeProto> p = 
> targetStorageTypesList.get(i).getStorageTypesList();
> targetStorageTypes[i] = p.toArray(new StorageType[p.size()]);  <-- 
> error here
>   }
> }
> {code}
> But given the current logic, it would be better to return the default type 
> instead of throwing an exception in case StorageType changes (new fields or 
> new types added) in newer versions during a rolling upgrade.
> {code:java}
> public static StorageType convertStorageType(StorageTypeProto type) {
> switch(type) {
> case DISK:
>   return StorageType.DISK;
> case SSD:
>   return StorageType.SSD;
> case ARCHIVE:
>   return StorageType.ARCHIVE;
> case RAM_DISK:
>   return StorageType.RAM_DISK;
> case PROVIDED:
>   return StorageType.PROVIDED;
> default:
>   throw new IllegalStateException(
>   "BUG: StorageTypeProto not found, type=" + type);
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14794) [SBN read] reportBadBlock is rejected by Observer.

2020-09-03 Thread Masatake Iwasaki (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190565#comment-17190565
 ] 

Masatake Iwasaki commented on HDFS-14794:
-

Updated the target version while preparing the 2.10.1 release.

> [SBN read] reportBadBlock is rejected by Observer.
> --
>
> Key: HDFS-14794
> URL: https://issues.apache.org/jira/browse/HDFS-14794
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Priority: Major
>
> {{reportBadBlock}} is rejected by the Observer via a StandbyException:
> {code}StandbyException: Operation category WRITE is not supported in state 
> observer{code}
> We should investigate the consequences of this and whether we should treat 
> {{reportBadBlock}} as IBRs. Note that {{reportBadBlock}} is part of both 
> {{ClientProtocol}} and {{DatanodeProtocol}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14794) [SBN read] reportBadBlock is rejected by Observer.

2020-09-03 Thread Masatake Iwasaki (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated HDFS-14794:

Target Version/s: 2.10.2  (was: 2.10.1)

> [SBN read] reportBadBlock is rejected by Observer.
> --
>
> Key: HDFS-14794
> URL: https://issues.apache.org/jira/browse/HDFS-14794
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Priority: Major
>
> {{reportBadBlock}} is rejected by the Observer via a StandbyException:
> {code}StandbyException: Operation category WRITE is not supported in state 
> observer{code}
> We should investigate the consequences of this and whether we should treat 
> {{reportBadBlock}} as IBRs. Note that {{reportBadBlock}} is part of both 
> {{ClientProtocol}} and {{DatanodeProtocol}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15004) Refactor TestBalancer for faster execution.

2020-09-03 Thread Masatake Iwasaki (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190564#comment-17190564
 ] 

Masatake Iwasaki commented on HDFS-15004:
-

Updated the target version while preparing the 2.10.1 release.

> Refactor TestBalancer for faster execution.
> ---
>
> Key: HDFS-15004
> URL: https://issues.apache.org/jira/browse/HDFS-15004
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs, test
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Simbarashe Dzinamarira
>Priority: Major
>
> {{TestBalancer}} is a big test by itself, and it is also part of many other 
> tests. Running these tests involves spinning up a {{MiniDFSCluster}} and 
> shutting it down for every test case, which is inefficient. Many of the test 
> cases could run against the same {{MiniDFSCluster}} instance, but not all of 
> them. It would be good to refactor the tests to optimize their running time.
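
A minimal sketch of the shared-cluster shape (JUnit 4 style; the class name 
and node count are illustrative, not the actual refactoring):
{code:java}
import org.junit.AfterClass;
import org.junit.BeforeClass;

import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class TestBalancerShared {
  private static MiniDFSCluster cluster;

  @BeforeClass
  public static void startCluster() throws Exception {
    // One cluster shared by all compatible cases instead of one per test.
    cluster = new MiniDFSCluster.Builder(new HdfsConfiguration())
        .numDataNodes(3).build();
    cluster.waitActive();
  }

  @AfterClass
  public static void stopCluster() {
    if (cluster != null) {
      cluster.shutdown();
    }
  }
}
{code}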



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15037) Encryption Zone operations should not block other RPC calls while retrieving encryption keys.

2020-09-03 Thread Masatake Iwasaki (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190563#comment-17190563
 ] 

Masatake Iwasaki commented on HDFS-15037:
-

Updated the target version while preparing the 2.10.1 release.

> Encryption Zone operations should not block other RPC calls while retrieving 
> encryption keys.
> -
>
> Key: HDFS-15037
> URL: https://issues.apache.org/jira/browse/HDFS-15037
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: encryption, namenode
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Priority: Major
>
> I believe the intention was to avoid blocking other operations while 
> retrieving keys by holding only {{FSDirectory.dirLock}}. But in reality all 
> other operations first enter the {{FSNamesystemLock}} and then {{dirLock}}, 
> so they are all blocked waiting for the key.
> We see a substantial increase in RPC wait time ({{RpcQueueTimeAvgTime}}) on 
> the NameNode when encryption operations are intermixed with regular 
> workloads.
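
To make the ordering concrete, a schematic sketch of the problem described 
above ({{fsn}}, {{dir}}, and {{kmsProvider}} stand in for the real fields; 
this is not actual NameNode code):
{code:java}
// Schematic only: every RPC handler takes the global lock first, so a
// slow remote KMS call made while it is held stalls all other RPCs.
fsn.writeLock();                 // FSNamesystemLock, taken by all ops
try {
  dir.writeLock();               // FSDirectory.dirLock, taken second
  try {
    // Remote call to the KMS while both locks are held.
    edek = kmsProvider.generateEncryptedKey(ezKeyName);
  } finally {
    dir.writeUnlock();
  }
} finally {
  fsn.writeUnlock();
}
{code}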



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15004) Refactor TestBalancer for faster execution.

2020-09-03 Thread Masatake Iwasaki (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated HDFS-15004:

Target Version/s: 2.10.2  (was: 2.10.1)

> Refactor TestBalancer for faster execution.
> ---
>
> Key: HDFS-15004
> URL: https://issues.apache.org/jira/browse/HDFS-15004
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs, test
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Simbarashe Dzinamarira
>Priority: Major
>
> {{TestBalancer}} is a big test by itself, and it is also part of many other 
> tests. Running these tests involves spinning up a {{MiniDFSCluster}} and 
> shutting it down for every test case, which is inefficient. Many of the test 
> cases could run against the same {{MiniDFSCluster}} instance, but not all of 
> them. It would be good to refactor the tests to optimize their running time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15037) Encryption Zone operations should not block other RPC calls while retrieving encryption keys.

2020-09-03 Thread Masatake Iwasaki (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated HDFS-15037:

Target Version/s: 2.10.2  (was: 2.10.1)

> Encryption Zone operations should not block other RPC calls while retrieving 
> encryption keys.
> -
>
> Key: HDFS-15037
> URL: https://issues.apache.org/jira/browse/HDFS-15037
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: encryption, namenode
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Priority: Major
>
> I believe the intention was to avoid blocking other operations while 
> retrieving keys by holding only {{FSDirectory.dirLock}}. But in reality all 
> other operations first enter the {{FSNamesystemLock}} and then {{dirLock}}, 
> so they are all blocked waiting for the key.
> We see a substantial increase in RPC wait time ({{RpcQueueTimeAvgTime}}) on 
> the NameNode when encryption operations are intermixed with regular 
> workloads.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15163) hdfs-2.10.0-webapps-secondary-status.html misses moment.js

2020-09-03 Thread Masatake Iwasaki (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190562#comment-17190562
 ] 

Masatake Iwasaki commented on HDFS-15163:
-

Updated the target version while preparing the 2.10.1 release.

> hdfs-2.10.0-webapps-secondary-status.html misses moment.js
> 
>
> Key: HDFS-15163
> URL: https://issues.apache.org/jira/browse/HDFS-15163
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.10.0
>Reporter: 谢波
>Priority: Minor
> Attachments: 微信截图_20200212183444.png
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> hdfs-2.10.0-webapps-secondary-status.html misses moment.js
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15163) hdfs-2.10.0-webapps-secondary-status.html misses moment.js

2020-09-03 Thread Masatake Iwasaki (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated HDFS-15163:

Target Version/s: 2.10.2  (was: 2.10.1)

> hdfs-2.10.0-webapps-secondary-status.html misses moment.js
> 
>
> Key: HDFS-15163
> URL: https://issues.apache.org/jira/browse/HDFS-15163
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.10.0
>Reporter: 谢波
>Priority: Minor
> Attachments: 微信截图_20200212183444.png
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> hdfs-2.10.0-webapps-secondary-status.html misses moment.js
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15357) Do not trust bad block reports from clients

2020-09-03 Thread Masatake Iwasaki (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated HDFS-15357:

Target Version/s: 3.4.0, 2.10.2  (was: 2.10.1, 3.4.0)

> Do not trust bad block reports from clients
> ---
>
> Key: HDFS-15357
> URL: https://issues.apache.org/jira/browse/HDFS-15357
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Priority: Major
>
> {{reportBadBlocks()}} is implemented by both ClientNamenodeProtocol and 
> DatanodeProtocol. When DFSClient is the caller, a faulty client can cause 
> data availability issues in a cluster.
> In the past we had such an incident, where a node with a faulty NIC was 
> randomly corrupting data. All clients running on that machine reported all 
> accessed blocks, and all associated replicas, to be corrupt. More recently, a 
> single faulty client process caused a small number of missing blocks. In all 
> cases, the actual data was fine.
> Bad block reports from clients shouldn't be trusted blindly. Instead, the 
> namenode should send a datanode command to verify the claim. A bonus would be 
> to keep the record for a while and ignore repeated reports from the same 
> nodes.
> At minimum, there should be an option to ignore bad block reports from 
> clients, perhaps after logging them. A very crude way would be to 
> short-circuit them in 
> {{ClientNamenodeProtocolServerSideTranslatorPB#reportBadBlocks()}}. A more 
> sophisticated way would be to check for the datanode user name in 
> {{FSNamesystem#reportBadBlocks()}} so that it can be easily logged, or 
> optionally do further processing.
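
A rough sketch of the crude option described above (the config key and flag 
are hypothetical, for illustration only; not an existing HDFS setting):
{code:java}
// Hypothetical flag; the real key name would be decided in a patch.
private final boolean ignoreClientBadBlockReports = conf.getBoolean(
    "dfs.namenode.ignore-client-bad-block-reports", false);

void reportBadBlocks(LocatedBlock[] blocks) throws IOException {
  if (ignoreClientBadBlockReports) {
    // Log the claim, but do not mark replicas corrupt on a client's word.
    LOG.info("Ignoring bad block report from client {} for {} block(s)",
        Server.getRemoteUser(), blocks.length);
    return;
  }
  // ... existing handling: verify and mark the reported replicas ...
}
{code}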



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15357) Do not trust bad block reports from clients

2020-09-03 Thread Masatake Iwasaki (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190559#comment-17190559
 ] 

Masatake Iwasaki commented on HDFS-15357:
-

Updated the target version while preparing the 2.10.1 release.

> Do not trust bad block reports from clients
> ---
>
> Key: HDFS-15357
> URL: https://issues.apache.org/jira/browse/HDFS-15357
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Priority: Major
>
> {{reportBadBlocks()}} is implemented by both ClientNamenodeProtocol and 
> DatanodeProtocol. When DFSClient is the caller, a faulty client can cause 
> data availability issues in a cluster.
> In the past we had such an incident, where a node with a faulty NIC was 
> randomly corrupting data. All clients running on that machine reported all 
> accessed blocks, and all associated replicas, to be corrupt. More recently, a 
> single faulty client process caused a small number of missing blocks. In all 
> cases, the actual data was fine.
> Bad block reports from clients shouldn't be trusted blindly. Instead, the 
> namenode should send a datanode command to verify the claim. A bonus would be 
> to keep the record for a while and ignore repeated reports from the same 
> nodes.
> At minimum, there should be an option to ignore bad block reports from 
> clients, perhaps after logging them. A very crude way would be to 
> short-circuit them in 
> {{ClientNamenodeProtocolServerSideTranslatorPB#reportBadBlocks()}}. A more 
> sophisticated way would be to check for the datanode user name in 
> {{FSNamesystem#reportBadBlocks()}} so that it can be easily logged, or 
> optionally do further processing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14277) [SBN read] Observer benchmark results

2020-09-03 Thread Masatake Iwasaki (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190554#comment-17190554
 ] 

Masatake Iwasaki commented on HDFS-14277:
-

I set the target version to 2.10.2 while preparing the 2.10.1 release. 
[~weichiu], let me know if this should be a blocker for 2.10.1.

> [SBN read] Observer benchmark results
> -
>
> Key: HDFS-14277
> URL: https://issues.apache.org/jira/browse/HDFS-14277
> Project: Hadoop HDFS
>  Issue Type: Task
>  Components: ha, namenode
>Affects Versions: 2.10.0, 3.3.0
> Environment: Hardware: 4-node cluster, each node has 4 core, Xeon 
> 2.5Ghz, 25GB memory.
> Software: CentOS 7.4, CDH 6.0 + Consistent Reads from Standby, Kerberos, SSL, 
> RPC encryption + Data Transfer Encryption, Cloudera Navigator.
>Reporter: Wei-Chiu Chuang
>Priority: Blocker
> Attachments: Observer profiler.png, Screen Shot 2019-02-14 at 
> 11.50.37 AM.png, observer RPC queue processing time.png
>
>
> Ran a few benchmarks and a profiler (VisualVM) today on an Observer-enabled 
> cluster. I would like to share the results with the community. The cluster 
> has 1 Observer node.
> h2. NNThroughputBenchmark
> Generate 1 million files and send fileStatus RPCs.
> {code:java}
> hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs 
>   -op fileStatus -threads 100 -files 100 -useExisting 
> -keepResults
> {code}
> h3. Kerberos, SSL, RPC encryption, Data Transfer Encryption enabled:
> ||Node||fileStatus (Ops per sec)||
> |Active NameNode|4865|
> |Observer|3996|
> h3. Kerberos, SSL:
> ||Node||fileStatus (Ops per sec)||
> |Active NameNode|7078|
> |Observer|6459|
> Observations:
>  * Due to the edit-tailing overhead, the Observer node consumes 30% CPU 
> utilization even when the cluster is idle.
>  * While the Active NN has less than 1ms RPC processing time, the Observer 
> node has > 5ms RPC processing time. I am still looking for the source of the 
> longer processing time, which may be the cause of the performance degradation 
> compared to the Active NN. Note the cluster has Cloudera Navigator installed, 
> which adds additional overhead to RPC processing time.
>  * {{GlobalStateIdContext#isCoordinatedCall()}} pops up as one of the top 
> hotspots in the profiler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14277) [SBN read] Observer benchmark results

2020-09-03 Thread Masatake Iwasaki (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated HDFS-14277:

Target Version/s: 2.10.2  (was: 2.10.1)

> [SBN read] Observer benchmark results
> -
>
> Key: HDFS-14277
> URL: https://issues.apache.org/jira/browse/HDFS-14277
> Project: Hadoop HDFS
>  Issue Type: Task
>  Components: ha, namenode
>Affects Versions: 2.10.0, 3.3.0
> Environment: Hardware: 4-node cluster, each node has 4 core, Xeon 
> 2.5Ghz, 25GB memory.
> Software: CentOS 7.4, CDH 6.0 + Consistent Reads from Standby, Kerberos, SSL, 
> RPC encryption + Data Transfer Encryption, Cloudera Navigator.
>Reporter: Wei-Chiu Chuang
>Priority: Blocker
> Attachments: Observer profiler.png, Screen Shot 2019-02-14 at 
> 11.50.37 AM.png, observer RPC queue processing time.png
>
>
> Ran a few benchmarks and a profiler (VisualVM) today on an Observer-enabled 
> cluster. I would like to share the results with the community. The cluster 
> has 1 Observer node.
> h2. NNThroughputBenchmark
> Generate 1 million files and send fileStatus RPCs.
> {code:java}
> hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs 
>   -op fileStatus -threads 100 -files 100 -useExisting 
> -keepResults
> {code}
> h3. Kerberos, SSL, RPC encryption, Data Transfer Encryption enabled:
> ||Node||fileStatus (Ops per sec)||
> |Active NameNode|4865|
> |Observer|3996|
> h3. Kerberos, SSL:
> ||Node||fileStatus (Ops per sec)||
> |Active NameNode|7078|
> |Observer|6459|
> Observations:
>  * Due to the edit-tailing overhead, the Observer node consumes 30% CPU 
> utilization even when the cluster is idle.
>  * While the Active NN has less than 1ms RPC processing time, the Observer 
> node has > 5ms RPC processing time. I am still looking for the source of the 
> longer processing time, which may be the cause of the performance degradation 
> compared to the Active NN. Note the cluster has Cloudera Navigator installed, 
> which adds additional overhead to RPC processing time.
>  * {{GlobalStateIdContext#isCoordinatedCall()}} pops up as one of the top 
> hotspots in the profiler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14694) Call recoverLease on DFSOutputStream close exception

2020-09-03 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190546#comment-17190546
 ] 

Hadoop QA commented on HDFS-14694:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
38s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
11s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m  
0s{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
39s{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
11s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 58s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
30s{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m  
4s{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  2m 
55s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
16s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
26s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
54s{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
34s{color} | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 46s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
20s{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
55s{color} | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
26s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| 

[jira] [Work logged] (HDFS-15551) Tiny Improve for DeadNode detector

2020-09-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15551?focusedWorklogId=478934=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478934
 ]

ASF GitHub Bot logged work on HDFS-15551:
-

Author: ASF GitHub Bot
Created on: 04/Sep/20 04:49
Start Date: 04/Sep/20 04:49
Worklog Time Spent: 10m 
  Work Description: leosunli commented on a change in pull request #2265:
URL: https://github.com/apache/hadoop/pull/2265#discussion_r483382751



##
File path: 
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DeadNodeDetector.java
##
@@ -475,6 +475,7 @@ public synchronized void addNodeToDetect(DFSInputStream 
dfsInputStream,
   datanodeInfos.add(datanodeInfo);
 }
 
+LOG.warn("Add datanode {} to suspectAndDeadNodes", datanodeInfo);

Review comment:
   One case: when there are a lot of invalid replicas, will this log flood?

##
File path: 
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DeadNodeDetector.java
##
@@ -396,13 +395,13 @@ private void probeCallBack(Probe probe, boolean success) {
 probe.getDatanodeInfo());
 removeDeadNode(probe.getDatanodeInfo());
   } else if (probe.getType() == ProbeType.CHECK_SUSPECT) {
-LOG.debug("Remove the node out from suspect node list: {}.",
+LOG.info("Remove the node out from suspect node list: {}.",

Review comment:
   When there are a lot of invalid replicas, there should be many suspect 
nodes but not dead nodes.
   All of these nodes will print this log.
   What is the purpose of printing it?
   The client can still access the suspect node normally.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 478934)
Time Spent: 50m  (was: 40m)

> Tiny Improve for DeadNode detector
> --
>
> Key: HDFS-15551
> URL: https://issues.apache.org/jira/browse/HDFS-15551
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Affects Versions: 3.3.0
>Reporter: dark_num
>Assignee: imbajin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> # add or improve some logs for adding local & global deadnodes
>  # logic improve
>  # fix typo



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15551) Tiny Improve for DeadNode detector

2020-09-03 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190537#comment-17190537
 ] 

Lisheng Sun commented on HDFS-15551:


Yeah, I will review it soon.

> Tiny Improve for DeadNode detector
> --
>
> Key: HDFS-15551
> URL: https://issues.apache.org/jira/browse/HDFS-15551
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Affects Versions: 3.3.0
>Reporter: dark_num
>Assignee: imbajin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> # add or improve some logs for adding local & global deadnodes
>  # logic improve
>  # fix typo



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15557) Log the reason why a storage log file can't be deleted

2020-09-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15557?focusedWorklogId=478927=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478927
 ]

ASF GitHub Bot logged work on HDFS-15557:
-

Author: ASF GitHub Bot
Created on: 04/Sep/20 04:21
Start Date: 04/Sep/20 04:21
Worklog Time Spent: 10m 
  Work Description: liuml07 commented on a change in pull request #2274:
URL: https://github.com/apache/hadoop/pull/2274#discussion_r483377105



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/AtomicFileOutputStream.java
##
@@ -75,8 +76,13 @@ public void close() throws IOException {
 boolean renamed = tmpFile.renameTo(origFile);
 if (!renamed) {
   // On windows, renameTo does not replace.
-  if (origFile.exists() && !origFile.delete()) {
-throw new IOException("Could not delete original file " + 
origFile);
+  if (origFile.exists()) {
+try {
+  Files.delete(origFile.toPath());
+} catch (IOException e) {
+  throw new IOException("Could not delete original file " + 
origFile

Review comment:
   Is it simpler
   ```
   throw new IOException("Could not delete original file " + origFile, e);
   ```
   
   Other than that, +1





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 478927)
Time Spent: 50m  (was: 40m)

> Log the reason why a storage log file can't be deleted
> --
>
> Key: HDFS-15557
> URL: https://issues.apache.org/jira/browse/HDFS-15557
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ye Ni
>Assignee: Ye Ni
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Before
>  
> {code:java}
> 2020-09-02 06:48:31,983 WARN [IPC Server handler 206 on 8020] 
> org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage 
> failed on Storage Directory root= K:\data\hdfs\namenode; location= null; 
> type= IMAGE; isShared= false; lock= 
> sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= 
> null java.io.IOException: Could not delete original file 
> K:\data\hdfs\namenode\current\seen_txid{code}
>  
> After
>  
> {code:java}
> 2020-09-02 17:43:29,421 WARN [IPC Server handler 111 on 8020] 
> org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage 
> failed on Storage Directory root= K:\data\hdfs\namenode; location= null; 
> type= IMAGE; isShared= false; lock= 
> sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= 
> null java.io.IOException: Could not delete original file 
> K:\data\hdfs\namenode\current\seen_txid due to failure: 
> java.nio.file.FileSystemException: K:\data\hdfs\namenode\current\seen_txid: 
> The process cannot access the file because it is being used by another 
> process.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15551) Tiny Improve for DeadNode detector

2020-09-03 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190526#comment-17190526
 ] 

Xiaoqiao He commented on HDFS-15551:


Thanks [~imbajin] for involving me here.
Added [~imbajin] to the contributor list and assigned this JIRA to him. 
[~leosun08], would you like to take another review?

> Tiny Improve for DeadNode detector
> --
>
> Key: HDFS-15551
> URL: https://issues.apache.org/jira/browse/HDFS-15551
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Affects Versions: 3.3.0
>Reporter: dark_num
>Assignee: imbajin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> # add or improve some logs for adding local & global deadnodes
>  # logic improve
>  # fix typo



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15551) Tiny Improve for DeadNode detector

2020-09-03 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned HDFS-15551:
--

Assignee: imbajin  (was: dark_num)

> Tiny Improve for DeadNode detector
> --
>
> Key: HDFS-15551
> URL: https://issues.apache.org/jira/browse/HDFS-15551
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Affects Versions: 3.3.0
>Reporter: dark_num
>Assignee: imbajin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> # add or improve some logs for adding local & global deadnodes
>  # logic improve
>  # fix typo



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15551) Tiny Improve for DeadNode detector

2020-09-03 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reassigned HDFS-15551:
--

Assignee: dark_num

> Tiny Improve for DeadNode detector
> --
>
> Key: HDFS-15551
> URL: https://issues.apache.org/jira/browse/HDFS-15551
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Affects Versions: 3.3.0
>Reporter: dark_num
>Assignee: dark_num
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> # add or improve some logs for adding local & global deadnodes
>  # logic improve
>  # fix typo



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15557) Log the reason why a storage log file can't be deleted

2020-09-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15557?focusedWorklogId=478920=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478920
 ]

ASF GitHub Bot logged work on HDFS-15557:
-

Author: ASF GitHub Bot
Created on: 04/Sep/20 03:29
Start Date: 04/Sep/20 03:29
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2274:
URL: https://github.com/apache/hadoop/pull/2274#issuecomment-686880097


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |::|--:|:|:|
   | +0 :ok: |  reexec  |   0m 30s |  Docker mode activated.  |
   ||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  No case conflicting files 
found.  |
   | +1 :green_heart: |  @author  |   0m  0s |  The patch does not contain any 
@author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
   ||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  28m  6s |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 16s |  trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1  |
   | +1 :green_heart: |  compile  |   1m 12s |  trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01  |
   | +1 :green_heart: |  checkstyle  |   0m 48s |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 20s |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  16m 25s |  branch has no errors when 
building and testing our client artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 52s |  trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 27s |  trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01  |
   | +0 :ok: |  spotbugs  |   3m  0s |  Used deprecated FindBugs config; 
considering switching to SpotBugs.  |
   | +1 :green_heart: |  findbugs  |   2m 58s |  trunk passed  |
   ||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 10s |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 10s |  the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1  |
   | +1 :green_heart: |  javac  |   1m 10s |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  3s |  the patch passed with JDK 
Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01  |
   | +1 :green_heart: |  javac  |   1m  3s |  the patch passed  |
   | +1 :green_heart: |  checkstyle  |   0m 39s |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m  8s |  the patch passed  |
   | +1 :green_heart: |  whitespace  |   0m  0s |  The patch has no whitespace 
issues.  |
   | +1 :green_heart: |  shadedclient  |  13m 54s |  patch has no errors when 
building and testing our client artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 45s |  the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 19s |  the patch passed with JDK 
Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01  |
   | +1 :green_heart: |  findbugs  |   3m  4s |  the patch passed  |
   ||| _ Other Tests _ |
   | -1 :x: |  unit  |  94m 42s |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 43s |  The patch does not generate 
ASF License warnings.  |
   |  |   | 176m 26s |   |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.hdfs.server.namenode.TestNameNodeRetryCacheMetrics |
   |   | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier |
   |   | hadoop.hdfs.TestReconstructStripedFile |
   |   | hadoop.hdfs.TestFileChecksumCompositeCrc |
   |   | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped |
   |   | hadoop.hdfs.TestFileChecksum |
   |   | hadoop.hdfs.TestGetFileChecksum |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2274/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2274 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient findbugs checkstyle |
   | uname | Linux 68f01445e536 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 
16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 139a43e98e2 |
   | Default Java | Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 

[jira] [Comment Edited] (HDFS-15551) Tiny Improve for DeadNode detector

2020-09-03 Thread imbajin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190501#comment-17190501
 ] 

imbajin edited comment on HDFS-15551 at 9/4/20, 2:56 AM:
-

 [~hexiaoqiao], [~linyiqun], [~weichiu] Could you take a look at the patch? 
Thanks.

Also, I wonder how this issue got *assigned* to me? (It seems I can't do that 
myself.)


was (Author: imbajin):
 [~hexiaoqiao], [~linyiqun], [~weichiu] wang Could you take a look at the 
patch? Thanks.

Also, I wonder how this issue got *assigned* to me? (It seems I can't do that 
myself.)

> Tiny Improve for DeadNode detector
> --
>
> Key: HDFS-15551
> URL: https://issues.apache.org/jira/browse/HDFS-15551
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Affects Versions: 3.3.0
>Reporter: dark_num
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> # add or improve some logs for adding local & global deadnodes
>  # improve some logic
>  # fix a typo



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15551) Tiny Improve for DeadNode detector

2020-09-03 Thread imbajin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190501#comment-17190501
 ] 

imbajin edited comment on HDFS-15551 at 9/4/20, 2:55 AM:
-

 [~hexiaoqiao], [~linyiqun], [~weichiu] wang Could you take a look at the 
patch? Thanks.

Also, I wonder how this issue got *assigned* to me? (It seems I can't do that 
myself.)


was (Author: imbajin):
 [~hexiaoqiao], [~linyiqun], Could you take a look at the patch? Thanks.

Also, I wonder how this issue got *assigned* to me? (It seems I can't do that 
myself.)

> Tiny Improve for DeadNode detector
> --
>
> Key: HDFS-15551
> URL: https://issues.apache.org/jira/browse/HDFS-15551
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Affects Versions: 3.3.0
>Reporter: dark_num
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> # add or improve some logs for adding local & global deadnodes
>  # improve some logic
>  # fix a typo



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15551) Tiny Improve for DeadNode detector

2020-09-03 Thread imbajin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190501#comment-17190501
 ] 

imbajin commented on HDFS-15551:


 [~hexiaoqiao], [~linyiqun], Could you take a look at the patch? Thanks.

Also, I wonder how this issue got *assigned* to me? (It seems I can't do that 
myself.)

> Tiny Improve for DeadNode detector
> --
>
> Key: HDFS-15551
> URL: https://issues.apache.org/jira/browse/HDFS-15551
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs-client
>Affects Versions: 3.3.0
>Reporter: dark_num
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> # add or improve some logs for adding local & global deadnodes
>  # improve some logic
>  # fix a typo



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13522) Support observer node from Router-Based Federation

2020-09-03 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190483#comment-17190483
 ] 

Chao Sun commented on HDFS-13522:
-

[~hemanthboyina] feel free to take this over. I haven't had a chance to work 
on it, but I think it is an important feature. I may be able to help with code 
review.

> Support observer node from Router-Based Federation
> --
>
> Key: HDFS-13522
> URL: https://issues.apache.org/jira/browse/HDFS-13522
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: federation, namenode
>Reporter: Erik Krogen
>Assignee: Chao Sun
>Priority: Major
> Attachments: HDFS-13522.001.patch, HDFS-13522_WIP.patch, RBF_ 
> Observer support.pdf, Router+Observer RPC clogging.png, 
> ShortTerm-Routers+Observer.png
>
>
> Changes will need to occur to the router to support the new observer node.
> One such change will be to make the router understand the observer state, 
> e.g. {{FederationNamenodeServiceState}}.
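
For illustration, a sketch of what "understanding the observer state" could 
look like; the constant set and ordering below are assumptions for the sake of 
the example, not the actual patch:

{code:java}
// Sketch: extend the router's view of namenode states with OBSERVER so
// read requests can be routed to observer namenodes.
public enum FederationNamenodeServiceStateSketch {
  ACTIVE,      // serves reads and writes
  OBSERVER,    // serves consistent reads only (the new state)
  STANDBY,     // ready to take over, serves nothing
  EXPIRED,     // last heartbeat too old
  UNAVAILABLE  // cannot be reached
}
{code}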



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14694) Call recoverLease on DFSOutputStream close exception

2020-09-03 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190479#comment-17190479
 ] 

Lisheng Sun commented on HDFS-14694:


Thanks [~hexiaoqiao] for the patient review.

The v011 patch removes the unused code.

> Call recoverLease on DFSOutputStream close exception
> 
>
> Key: HDFS-14694
> URL: https://issues.apache.org/jira/browse/HDFS-14694
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Chen Zhang
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, 
> HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, 
> HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch, 
> HDFS-14694.009.patch, HDFS-14694.010.patch, HDFS-14694.011.patch
>
>
> HDFS uses file leases to manage open files. When a file is not closed 
> normally, the NN will recover the lease automatically after the hard limit 
> is exceeded. But for a long-running service (e.g. HBase), the hdfs-client 
> never dies, so the NN never gets a chance to recover the file.
> Usually the client program needs to handle exceptions by itself to avoid 
> this condition (e.g. HBase automatically calls recoverLease for files that 
> were not closed normally), but in our experience most services (in our 
> company) don't handle this condition properly, which leaves lots of files in 
> an abnormal state or even causes data loss.
> This Jira proposes adding a feature that calls the recoverLease operation 
> automatically when DFSOutputStream close encounters an exception. It should 
> be disabled by default, but when somebody builds a long-running service 
> based on HDFS, they can enable this option.
> We've had this feature in our internal Hadoop distribution for more than 3 
> years, and it's quite useful in our experience.
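
For illustration, a minimal sketch of the pattern this feature automates. It 
uses only the public DistributedFileSystem#recoverLease(Path) API; the class 
name, retry interval, and loop structure are illustrative assumptions, not the 
patch's actual code:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Sketch: what HBase-style clients do by hand today, and what this feature
// would do automatically inside the client when close() fails.
public class CloseWithLeaseRecovery {
  public static void closeOrRecover(DistributedFileSystem dfs,
      FSDataOutputStream out, Path file) throws IOException {
    try {
      out.close();
    } catch (IOException e) {
      // close() failed: without intervention the NN holds the lease until
      // the hard limit expires, so trigger lease recovery explicitly.
      boolean closed = dfs.recoverLease(file);
      while (!closed) {
        try {
          Thread.sleep(1000L); // illustrative poll interval
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IOException("Interrupted while recovering lease", ie);
        }
        closed = dfs.recoverLease(file);
      }
      throw e; // still surface the original close failure to the caller
    }
  }
}
{code}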



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14694) Call recoverLease on DFSOutputStream close exception

2020-09-03 Thread Lisheng Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lisheng Sun updated HDFS-14694:
---
Attachment: HDFS-14694.011.patch

> Call recoverLease on DFSOutputStream close exception
> 
>
> Key: HDFS-14694
> URL: https://issues.apache.org/jira/browse/HDFS-14694
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Chen Zhang
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, 
> HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, 
> HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch, 
> HDFS-14694.009.patch, HDFS-14694.010.patch, HDFS-14694.011.patch
>
>
> HDFS uses file leases to manage open files. When a file is not closed 
> normally, the NN will recover the lease automatically after the hard limit 
> is exceeded. But for a long-running service (e.g. HBase), the hdfs-client 
> never dies, so the NN never gets a chance to recover the file.
> Usually the client program needs to handle exceptions by itself to avoid 
> this condition (e.g. HBase automatically calls recoverLease for files that 
> were not closed normally), but in our experience most services (in our 
> company) don't handle this condition properly, which leaves lots of files in 
> an abnormal state or even causes data loss.
> This Jira proposes adding a feature that calls the recoverLease operation 
> automatically when DFSOutputStream close encounters an exception. It should 
> be disabled by default, but when somebody builds a long-running service 
> based on HDFS, they can enable this option.
> We've had this feature in our internal Hadoop distribution for more than 3 
> years, and it's quite useful in our experience.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14694) Call recoverLease on DFSOutputStream close exception

2020-09-03 Thread Lisheng Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lisheng Sun updated HDFS-14694:
---
Attachment: (was: HDFS-14694.010.patch)

> Call recoverLease on DFSOutputStream close exception
> 
>
> Key: HDFS-14694
> URL: https://issues.apache.org/jira/browse/HDFS-14694
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Chen Zhang
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, 
> HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, 
> HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch, 
> HDFS-14694.009.patch, HDFS-14694.010.patch
>
>
> HDFS uses file leases to manage open files. When a file is not closed 
> normally, the NN will recover the lease automatically after the hard limit 
> is exceeded. But for a long-running service (e.g. HBase), the hdfs-client 
> never dies, so the NN never gets a chance to recover the file.
> Usually the client program needs to handle exceptions by itself to avoid 
> this condition (e.g. HBase automatically calls recoverLease for files that 
> were not closed normally), but in our experience most services (in our 
> company) don't handle this condition properly, which leaves lots of files in 
> an abnormal state or even causes data loss.
> This Jira proposes adding a feature that calls the recoverLease operation 
> automatically when DFSOutputStream close encounters an exception. It should 
> be disabled by default, but when somebody builds a long-running service 
> based on HDFS, they can enable this option.
> We've had this feature in our internal Hadoop distribution for more than 3 
> years, and it's quite useful in our experience.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14694) Call recoverLease on DFSOutputStream close exception

2020-09-03 Thread Lisheng Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lisheng Sun updated HDFS-14694:
---
Attachment: HDFS-14694.010.patch

> Call recoverLease on DFSOutputStream close exception
> 
>
> Key: HDFS-14694
> URL: https://issues.apache.org/jira/browse/HDFS-14694
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Chen Zhang
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, 
> HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, 
> HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch, 
> HDFS-14694.009.patch, HDFS-14694.010.patch
>
>
> HDFS uses file leases to manage open files. When a file is not closed 
> normally, the NN will recover the lease automatically after the hard limit 
> is exceeded. But for a long-running service (e.g. HBase), the hdfs-client 
> never dies, so the NN never gets a chance to recover the file.
> Usually the client program needs to handle exceptions by itself to avoid 
> this condition (e.g. HBase automatically calls recoverLease for files that 
> were not closed normally), but in our experience most services (in our 
> company) don't handle this condition properly, which leaves lots of files in 
> an abnormal state or even causes data loss.
> This Jira proposes adding a feature that calls the recoverLease operation 
> automatically when DFSOutputStream close encounters an exception. It should 
> be disabled by default, but when somebody builds a long-running service 
> based on HDFS, they can enable this option.
> We've had this feature in our internal Hadoop distribution for more than 3 
> years, and it's quite useful in our experience.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15557) Log the reason why a storage log file can't be deleted

2020-09-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15557?focusedWorklogId=478886&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478886
 ]

ASF GitHub Bot logged work on HDFS-15557:
-

Author: ASF GitHub Bot
Created on: 04/Sep/20 01:52
Start Date: 04/Sep/20 01:52
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2274:
URL: https://github.com/apache/hadoop/pull/2274#issuecomment-686852824


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |::|--:|:|:|
   | +0 :ok: |  reexec  |  28m 28s |  Docker mode activated.  |
   ||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  No case conflicting files 
found.  |
   | +1 :green_heart: |  @author  |   0m  0s |  The patch does not contain any 
@author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
   ||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  28m 13s |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 17s |  trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1  |
   | +1 :green_heart: |  compile  |   1m 11s |  trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01  |
   | +1 :green_heart: |  checkstyle  |   0m 49s |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 18s |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  16m 15s |  branch has no errors when 
building and testing our client artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 51s |  trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 22s |  trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01  |
   | +0 :ok: |  spotbugs  |   2m 58s |  Used deprecated FindBugs config; 
considering switching to SpotBugs.  |
   | +1 :green_heart: |  findbugs  |   2m 56s |  trunk passed  |
   ||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 11s |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  8s |  the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1  |
   | +1 :green_heart: |  javac  |   1m  8s |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  4s |  the patch passed with JDK 
Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01  |
   | +1 :green_heart: |  javac  |   1m  4s |  the patch passed  |
   | -0 :warning: |  checkstyle  |   0m 40s |  hadoop-hdfs-project/hadoop-hdfs: 
The patch generated 2 new + 3 unchanged - 0 fixed = 5 total (was 3)  |
   | +1 :green_heart: |  mvnsite  |   1m 12s |  the patch passed  |
   | +1 :green_heart: |  whitespace  |   0m  0s |  The patch has no whitespace 
issues.  |
   | +1 :green_heart: |  shadedclient  |  13m 51s |  patch has no errors when 
building and testing our client artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 51s |  the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 25s |  the patch passed with JDK 
Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01  |
   | +1 :green_heart: |  findbugs  |   3m  7s |  the patch passed  |
   ||| _ Other Tests _ |
   | -1 :x: |  unit  |  97m 52s |  hadoop-hdfs in the patch failed.  |
   | +1 :green_heart: |  asflicense  |   0m 42s |  The patch does not generate 
ASF License warnings.  |
   |  |   | 207m 42s |   |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestFileChecksumCompositeCrc |
   |   | hadoop.hdfs.TestMultipleNNPortQOP |
   |   | hadoop.hdfs.TestFileAppend4 |
   |   | hadoop.hdfs.TestErasureCodingExerciseAPIs |
   |   | hadoop.hdfs.TestFileChecksum |
   |   | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier |
   |   | hadoop.hdfs.TestDFSStripedOutputStream |
   |   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure |
   |   | hadoop.hdfs.server.balancer.TestBalancerWithMultipleNameNodes |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2274/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2274 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient findbugs checkstyle |
   | uname | Linux c407a908478b 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 139a43e98e2 |
   | Default Java | Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
   | Multi-JDK versions | 

[jira] [Commented] (HDFS-15557) Log the reason why a storage log file can't be deleted

2020-09-03 Thread Ye Ni (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190452#comment-17190452
 ] 

Ye Ni commented on HDFS-15557:
--

[~inigoiri] testWriteTransactionIdHandlesIOE()?

> Log the reason why a storage log file can't be deleted
> --
>
> Key: HDFS-15557
> URL: https://issues.apache.org/jira/browse/HDFS-15557
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ye Ni
>Assignee: Ye Ni
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Before
>  
> {code:java}
> 2020-09-02 06:48:31,983 WARN [IPC Server handler 206 on 8020] 
> org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage 
> failed on Storage Directory root= K:\data\hdfs\namenode; location= null; 
> type= IMAGE; isShared= false; lock= 
> sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= 
> null java.io.IOException: Could not delete original file 
> K:\data\hdfs\namenode\current\seen_txid{code}
>  
> After
>  
> {code:java}
> 2020-09-02 17:43:29,421 WARN [IPC Server handler 111 on 8020] 
> org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage 
> failed on Storage Directory root= K:\data\hdfs\namenode; location= null; 
> type= IMAGE; isShared= false; lock= 
> sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= 
> null java.io.IOException: Could not delete original file 
> K:\data\hdfs\namenode\current\seen_txid due to failure: 
> java.nio.file.FileSystemException: K:\data\hdfs\namenode\current\seen_txid: 
> The process cannot access the file because it is being used by another 
> process.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15557) Log the reason why a storage log file can't be deleted

2020-09-03 Thread Ye Ni (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Ni updated HDFS-15557:
-
Summary: Log the reason why a storage log file can't be deleted  (was: Log 
the reason why a file can't be deleted)

> Log the reason why a storage log file can't be deleted
> --
>
> Key: HDFS-15557
> URL: https://issues.apache.org/jira/browse/HDFS-15557
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ye Ni
>Assignee: Ye Ni
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Before
>  
> {code:java}
> 2020-09-02 06:48:31,983 WARN [IPC Server handler 206 on 8020] 
> org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage 
> failed on Storage Directory root= K:\data\hdfs\namenode; location= null; 
> type= IMAGE; isShared= false; lock= 
> sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= 
> null java.io.IOException: Could not delete original file 
> K:\data\hdfs\namenode\current\seen_txid{code}
>  
> After
>  
> {code:java}
> 2020-09-02 17:43:29,421 WARN [IPC Server handler 111 on 8020] 
> org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage 
> failed on Storage Directory root= K:\data\hdfs\namenode; location= null; 
> type= IMAGE; isShared= false; lock= 
> sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= 
> null java.io.IOException: Could not delete original file 
> K:\data\hdfs\namenode\current\seen_txid due to failure: 
> java.nio.file.FileSystemException: K:\data\hdfs\namenode\current\seen_txid: 
> The process cannot access the file because it is being used by another 
> process.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15557) Log the reason why a file can't be deleted

2020-09-03 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190425#comment-17190425
 ] 

Íñigo Goiri commented on HDFS-15557:


[~NickyYe] thanks for the patch.
Can you rename the JIRA to indicate this is used for the storage logs?
Which tests cover this, BTW?

> Log the reason why a file can't be deleted
> --
>
> Key: HDFS-15557
> URL: https://issues.apache.org/jira/browse/HDFS-15557
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ye Ni
>Assignee: Ye Ni
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Before
>  
> {code:java}
> 2020-09-02 06:48:31,983 WARN [IPC Server handler 206 on 8020] 
> org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage 
> failed on Storage Directory root= K:\data\hdfs\namenode; location= null; 
> type= IMAGE; isShared= false; lock= 
> sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= 
> null java.io.IOException: Could not delete original file 
> K:\data\hdfs\namenode\current\seen_txid{code}
>  
> After
>  
> {code:java}
> 2020-09-02 17:43:29,421 WARN [IPC Server handler 111 on 8020] 
> org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage 
> failed on Storage Directory root= K:\data\hdfs\namenode; location= null; 
> type= IMAGE; isShared= false; lock= 
> sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= 
> null java.io.IOException: Could not delete original file 
> K:\data\hdfs\namenode\current\seen_txid due to failure: 
> java.nio.file.FileSystemException: K:\data\hdfs\namenode\current\seen_txid: 
> The process cannot access the file because it is being used by another 
> process.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15557) Log the reason why a file can't be deleted

2020-09-03 Thread Jira


 [ 
https://issues.apache.org/jira/browse/HDFS-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Íñigo Goiri reassigned HDFS-15557:
--

Assignee: Ye Ni

> Log the reason why a file can't be deleted
> --
>
> Key: HDFS-15557
> URL: https://issues.apache.org/jira/browse/HDFS-15557
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ye Ni
>Assignee: Ye Ni
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Before
>  
> {code:java}
> 2020-09-02 06:48:31,983 WARN [IPC Server handler 206 on 8020] 
> org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage 
> failed on Storage Directory root= K:\data\hdfs\namenode; location= null; 
> type= IMAGE; isShared= false; lock= 
> sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= 
> null java.io.IOException: Could not delete original file 
> K:\data\hdfs\namenode\current\seen_txid{code}
>  
> After
>  
> {code:java}
> 2020-09-02 17:43:29,421 WARN [IPC Server handler 111 on 8020] 
> org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage 
> failed on Storage Directory root= K:\data\hdfs\namenode; location= null; 
> type= IMAGE; isShared= false; lock= 
> sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= 
> null java.io.IOException: Could not delete original file 
> K:\data\hdfs\namenode\current\seen_txid due to failure: 
> java.nio.file.FileSystemException: K:\data\hdfs\namenode\current\seen_txid: 
> The process cannot access the file because it is being used by another 
> process.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15557) Log the reason why a file can't be deleted

2020-09-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15557?focusedWorklogId=478829&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478829
 ]

ASF GitHub Bot logged work on HDFS-15557:
-

Author: ASF GitHub Bot
Created on: 03/Sep/20 22:11
Start Date: 03/Sep/20 22:11
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2274:
URL: https://github.com/apache/hadoop/pull/2274#issuecomment-686789909


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Comment |
   |::|--:|:|:|
   | +0 :ok: |  reexec  |   0m 30s |  Docker mode activated.  |
   ||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  No case conflicting files 
found.  |
   | +1 :green_heart: |  @author  |   0m  0s |  The patch does not contain any 
@author tags.  |
   | -1 :x: |  test4tests  |   0m  0s |  The patch doesn't appear to include 
any new or modified tests. Please justify why no new tests are needed for this 
patch. Also please list what manual steps were performed to verify this patch.  
|
   ||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  29m 28s |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 17s |  trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1  |
   | +1 :green_heart: |  compile  |   1m 10s |  trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01  |
   | +1 :green_heart: |  checkstyle  |   0m 48s |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 20s |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  16m 16s |  branch has no errors when 
building and testing our client artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 53s |  trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 26s |  trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01  |
   | +0 :ok: |  spotbugs  |   3m  0s |  Used deprecated FindBugs config; 
considering switching to SpotBugs.  |
   | +1 :green_heart: |  findbugs  |   2m 57s |  trunk passed  |
   ||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m  7s |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  8s |  the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1  |
   | +1 :green_heart: |  javac  |   1m  8s |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  2s |  the patch passed with JDK 
Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01  |
   | +1 :green_heart: |  javac  |   1m  2s |  the patch passed  |
   | +1 :green_heart: |  checkstyle  |   0m 39s |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m  9s |  the patch passed  |
   | +1 :green_heart: |  whitespace  |   0m  0s |  The patch has no whitespace 
issues.  |
   | +1 :green_heart: |  shadedclient  |  13m 53s |  patch has no errors when 
building and testing our client artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 46s |  the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 24s |  the patch passed with JDK 
Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01  |
   | +1 :green_heart: |  findbugs  |   3m  4s |  the patch passed  |
   ||| _ Other Tests _ |
   | -1 :x: |  unit  |  94m 29s |  hadoop-hdfs in the patch failed.  |
   | +1 :green_heart: |  asflicense  |   0m 44s |  The patch does not generate 
ASF License warnings.  |
   |  |   | 177m 27s |   |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.balancer.TestBalancer |
   |   | hadoop.hdfs.server.blockmanagement.TestBlockStatsMXBean |
   |   | hadoop.hdfs.TestFileChecksum |
   |   | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks |
   |   | hadoop.hdfs.server.namenode.TestNameNodeRetryCacheMetrics |
   |   | hadoop.hdfs.TestFileChecksumCompositeCrc |
   |   | hadoop.hdfs.TestDFSInotifyEventInputStreamKerberized |
   |   | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.40 ServerAPI=1.40 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2274/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2274 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient findbugs checkstyle |
   | uname | Linux a5a6b91bbc09 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 139a43e98e2 |
   | Default Java | Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 

[jira] [Commented] (HDFS-12548) HDFS Jenkins build is unstable on branch-2

2020-09-03 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190411#comment-17190411
 ] 

Jim Brennan commented on HDFS-12548:


I propose we close this issue or at least reduce its priority.  It's three 
years old and I don't see any evidence that we've seen it again.  Also, 
haven't we switched over to Cloudbees since then?


> HDFS Jenkins build is unstable on branch-2
> --
>
> Key: HDFS-12548
> URL: https://issues.apache.org/jira/browse/HDFS-12548
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: build
>Affects Versions: 2.9.0
>Reporter: Rushabh Shah
>Priority: Critical
>
> Feel free to move the ticket to another project (e.g. infra).
> Recently I attached a branch-2 patch while working on one jira 
> [HDFS-12386|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180676&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180676]
> There were at least 100 failed and timed-out tests. I am sure they are not 
> related to my patch.
> I also came across another jira that was just a javadoc-related change, and 
> it had around 100 failed tests.
> Below are the details of the pre-commit runs that failed on branch-2:
> 1. [HDFS-12386 attempt 
> 1|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180069&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180069]
> {noformat}
> Ran on slave: asf912.gq1.ygridcore.net/H12
> Failed with following error message:
> Build timed out (after 300 minutes). Marking the build as aborted.
> Build was aborted
> Performing Post build task...
> {noformat}
> 2. [HDFS-12386 attempt 
> 2|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180676&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180676]
> {noformat}
> Ran on slave: asf900.gq1.ygridcore.net
> Failed with following error message:
> FATAL: command execution failed
> Command close created at
>   at hudson.remoting.Command.<init>(Command.java:60)
>   at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1123)
>   at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1121)
>   at hudson.remoting.Channel.close(Channel.java:1281)
>   at hudson.remoting.Channel.close(Channel.java:1263)
>   at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1128)
> Caused: hudson.remoting.Channel$OrderlyShutdown
>   at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1129)
>   at hudson.remoting.Channel$1.handle(Channel.java:527)
>   at 
> hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:83)
> Caused: java.io.IOException: Backing channel 'H0' is disconnected.
>   at 
> hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:192)
>   at 
> hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:257)
>   at com.sun.proxy.$Proxy125.isAlive(Unknown Source)
>   at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1043)
>   at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1035)
>   at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:155)
>   at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:109)
>   at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
>   at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:735)
>   at hudson.model.Build$BuildExecution.build(Build.java:206)
>   at hudson.model.Build$BuildExecution.doRun(Build.java:163)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:490)
>   at hudson.model.Run.execute(Run.java:1735)
>   at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
>   at hudson.model.ResourceController.execute(ResourceController.java:97)
>   at hudson.model.Executor.run(Executor.java:405)
> {noformat}
> 3. [HDFS-12531 attempt 
> 1|https://issues.apache.org/jira/browse/HDFS-12531?focusedCommentId=16176493&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16176493]
> {noformat}
> Ran on slave:  asf911.gq1.ygridcore.net
> Failed with following error message:
> FATAL: command execution failed
> Command close created at
>   at hudson.remoting.Command.<init>(Command.java:60)
>   at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1123)
>   at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1121)
>   at hudson.remoting.Channel.close(Channel.java:1281)
>   at hudson.remoting.Channel.close(Channel.java:1263)
>   at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1128)
> Caused: hudson.remoting.Channel$OrderlyShutdown
>   

[jira] [Created] (HDFS-15558) ViewDistributedFileSystem#recoverLease should call super.recoverLease when there are no mounts configured

2020-09-03 Thread Uma Maheswara Rao G (Jira)
Uma Maheswara Rao G created HDFS-15558:
--

 Summary: ViewDistributedFileSystem#recoverLease should call 
super.recoverLease when there are no mounts configured
 Key: HDFS-15558
 URL: https://issues.apache.org/jira/browse/HDFS-15558
 Project: Hadoop HDFS
  Issue Type: Sub-task
Reporter: Uma Maheswara Rao G
Assignee: Uma Maheswara Rao G
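
The summary describes the intended behavior; below is a minimal sketch under 
that assumption. The {{hasMounts()}} helper is a hypothetical stand-in for the 
real mount-table check, not the actual implementation:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Sketch: with no mounts configured, defer to DistributedFileSystem.
class ViewDistributedFileSystemSketch extends DistributedFileSystem {
  private boolean hasMounts() {
    return false; // placeholder: the real code consults the mount table
  }

  @Override
  public boolean recoverLease(Path f) throws IOException {
    if (!hasMounts()) {
      // No mount points configured: behave exactly like the superclass.
      return super.recoverLease(f);
    }
    // Otherwise resolve f through the mount table first (omitted here).
    throw new UnsupportedOperationException("mount resolution omitted");
  }
}
{code}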






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15543) RBF: Write Should allow, when a subcluster is unavailable for RANDOM mount points with fault Tolerance enabled.

2020-09-03 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190381#comment-17190381
 ] 

Hadoop QA commented on HDFS-15543:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 33m  
8s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
1s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
31s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
37s{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
32s{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
34s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 20s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
34s{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
52s{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  1m  
9s{color} | {color:blue} Used deprecated FindBugs config; considering switching 
to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m  
8s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
31s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
29s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 21s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
32s{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
48s{color} | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 
13s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}  8m 39s{color} 
| {color:red} hadoop-hdfs-rbf in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
28s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |

[jira] [Work logged] (HDFS-15557) Log the reason why a file can't be deleted

2020-09-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15557?focusedWorklogId=478769&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478769
 ]

ASF GitHub Bot logged work on HDFS-15557:
-

Author: ASF GitHub Bot
Created on: 03/Sep/20 19:12
Start Date: 03/Sep/20 19:12
Worklog Time Spent: 10m 
  Work Description: NickyYe opened a new pull request #2274:
URL: https://github.com/apache/hadoop/pull/2274


   https://issues.apache.org/jira/browse/HDFS-15557
   
   ## NOTICE
   
   Please create an issue in ASF JIRA before opening a pull request,
   and you need to set the title of the pull request which starts with
   the corresponding JIRA issue number. (e.g. HADOOP-X. Fix a typo in YYY.)
   For more details, please see 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 478769)
Remaining Estimate: 0h
Time Spent: 10m

> Log the reason why a file can't be deleted
> --
>
> Key: HDFS-15557
> URL: https://issues.apache.org/jira/browse/HDFS-15557
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ye Ni
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Before
>  
> {code:java}
> 2020-09-02 06:48:31,983 WARN [IPC Server handler 206 on 8020] 
> org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage 
> failed on Storage Directory root= K:\data\hdfs\namenode; location= null; 
> type= IMAGE; isShared= false; lock= 
> sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= 
> null java.io.IOException: Could not delete original file 
> K:\data\hdfs\namenode\current\seen_txid{code}
>  
> After
>  
> {code:java}
> 2020-09-02 17:43:29,421 WARN [IPC Server handler 111 on 8020] 
> org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage 
> failed on Storage Directory root= K:\data\hdfs\namenode; location= null; 
> type= IMAGE; isShared= false; lock= 
> sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= 
> null java.io.IOException: Could not delete original file 
> K:\data\hdfs\namenode\current\seen_txid due to failure: 
> java.nio.file.FileSystemException: K:\data\hdfs\namenode\current\seen_txid: 
> The process cannot access the file because it is being used by another 
> process.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15557) Log the reason why a file can't be deleted

2020-09-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-15557:
--
Labels: pull-request-available  (was: )

> Log the reason why a file can't be deleted
> --
>
> Key: HDFS-15557
> URL: https://issues.apache.org/jira/browse/HDFS-15557
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ye Ni
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Before
>  
> {code:java}
> 2020-09-02 06:48:31,983 WARN [IPC Server handler 206 on 8020] 
> org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage 
> failed on Storage Directory root= K:\data\hdfs\namenode; location= null; 
> type= IMAGE; isShared= false; lock= 
> sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= 
> null java.io.IOException: Could not delete original file 
> K:\data\hdfs\namenode\current\seen_txid{code}
>  
> After
>  
> {code:java}
> 2020-09-02 17:43:29,421 WARN [IPC Server handler 111 on 8020] 
> org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage 
> failed on Storage Directory root= K:\data\hdfs\namenode; location= null; 
> type= IMAGE; isShared= false; lock= 
> sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= 
> null java.io.IOException: Could not delete original file 
> K:\data\hdfs\namenode\current\seen_txid due to failure: 
> java.nio.file.FileSystemException: K:\data\hdfs\namenode\current\seen_txid: 
> The process cannot access the file because it is being used by another 
> process.{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15557) Log the reason why a file can't be deleted

2020-09-03 Thread Ye Ni (Jira)
Ye Ni created HDFS-15557:


 Summary: Log the reason why a file can't be deleted
 Key: HDFS-15557
 URL: https://issues.apache.org/jira/browse/HDFS-15557
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Ye Ni


Before

 
{code:java}
2020-09-02 06:48:31,983 WARN [IPC Server handler 206 on 8020] 
org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage 
failed on Storage Directory root= K:\data\hdfs\namenode; location= null; type= 
IMAGE; isShared= false; lock= sun.nio.ch.FileLockImpl[0:9223372036854775807 
exclusive valid]; storageUuid= null java.io.IOException: Could not delete 
original file K:\data\hdfs\namenode\current\seen_txid{code}
 

After

 
{code:java}

2020-09-02 17:43:29,421 WARN [IPC Server handler 111 on 8020] 
org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage 
failed on Storage Directory root= K:\data\hdfs\namenode; location= null; type= 
IMAGE; isShared= false; lock= sun.nio.ch.FileLockImpl[0:9223372036854775807 
exclusive valid]; storageUuid= null java.io.IOException: Could not delete 
original file K:\data\hdfs\namenode\current\seen_txid due to failure: 
java.nio.file.FileSystemException: K:\data\hdfs\namenode\current\seen_txid: The 
process cannot access the file because it is being used by another 
process.{code}
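
A minimal sketch of how the extra detail in the "After" log can be obtained, 
assuming the change is essentially to delete via java.nio.file.Files#delete() 
(which throws a descriptive FileSystemException) instead of checking the 
boolean result of java.io.File#delete(); the helper name is illustrative, not 
the patch's actual code:

{code:java}
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Sketch: File#delete() only returns false on failure, while Files#delete()
// throws an exception that names the cause (e.g. "being used by another
// process" on Windows), which we append to the message.
public final class DeleteWithReason {
  private DeleteWithReason() {}

  public static void deleteOrExplain(File file) throws IOException {
    try {
      Files.delete(file.toPath());
    } catch (IOException e) {
      throw new IOException("Could not delete original file " + file
          + " due to failure: " + e, e);
    }
  }
}
{code}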



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15025) Applying NVDIMM storage media to HDFS

2020-09-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15025?focusedWorklogId=478768&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478768
 ]

ASF GitHub Bot logged work on HDFS-15025:
-

Author: ASF GitHub Bot
Created on: 03/Sep/20 19:03
Start Date: 03/Sep/20 19:03
Worklog Time Spent: 10m 
  Work Description: liuml07 commented on a change in pull request #2189:
URL: https://github.com/apache/hadoop/pull/2189#discussion_r483194125



##
File path: 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/StorageType.java
##
@@ -34,28 +34,35 @@
 @InterfaceStability.Unstable
 public enum StorageType {
   // sorted by the speed of the storage types, from fast to slow
-  RAM_DISK(true),
-  SSD(false),
-  DISK(false),
-  ARCHIVE(false),
-  PROVIDED(false);
+  RAM_DISK(true, true),
+  NVDIMM(false, true),
+  SSD(false, false),
+  DISK(false, false),
+  ARCHIVE(false, false),
+  PROVIDED(false, false);
 
   private final boolean isTransient;
+  private final boolean isRAM;
 
   public static final StorageType DEFAULT = DISK;
 
   public static final StorageType[] EMPTY_ARRAY = {};
 
   private static final StorageType[] VALUES = values();
 
-  StorageType(boolean isTransient) {
+  StorageType(boolean isTransient, boolean isRAM) {
 this.isTransient = isTransient;
+this.isRAM = isRAM;
   }
 
   public boolean isTransient() {
 return isTransient;
   }
 
+  public boolean isRAM() {
+return isRAM;
+  }

Review comment:
   Oh, I was thinking that allowing the Balancer to move NVDIMM data is by 
design, since it is not volatile. But if that is not the case, then we can 
update the Balancer code by replacing the `isTransient()` call with an 
`isRAM()` call. Not sure if this makes more sense?
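
   For illustration, a sketch of the kind of check being discussed; 
`isEligibleForBalancing` is a hypothetical helper, not the actual Balancer 
code:

{code:java}
import org.apache.hadoop.fs.StorageType;

// Sketch: skip anything RAM-backed, not just transient storage, so that
// NVDIMM (persistent but RAM-based) is also excluded from balancing.
public final class BalancerEligibilitySketch {
  private BalancerEligibilitySketch() {}

  static boolean isEligibleForBalancing(StorageType type) {
    // Before: !type.isTransient() would still let NVDIMM be balanced.
    return !type.isRAM();
  }
}
{code}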





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 478768)
Time Spent: 1.5h  (was: 1h 20m)

> Applying NVDIMM storage media to HDFS
> -
>
> Key: HDFS-15025
> URL: https://issues.apache.org/jira/browse/HDFS-15025
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, hdfs
>Reporter: YaYun Wang
>Assignee: YaYun Wang
>Priority: Major
>  Labels: pull-request-available
> Attachments: Applying NVDIMM to HDFS.pdf, HDFS-15025.001.patch, 
> HDFS-15025.002.patch, HDFS-15025.003.patch, HDFS-15025.004.patch, 
> HDFS-15025.005.patch, HDFS-15025.006.patch, NVDIMM_patch(WIP).patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Non-volatile NVDIMM memory is faster than SSD and can be used alongside 
> RAM, DISK, and SSD. Storing HDFS data directly on NVDIMM not only improves 
> the response rate of HDFS but also ensures the reliability of the data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15543) RBF: Write Should allow, when a subcluster is unavailable for RANDOM mount points with fault Tolerance enabled.

2020-09-03 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190381#comment-17190381
 ] 

Íñigo Goiri commented on HDFS-15543:


Another thing is that this is touching a part of the code around HDFS-1 
where [~aajisaka] is trying to support general socket exceptions.

> RBF: Write Should allow, when a subcluster is unavailable for RANDOM mount 
> points with fault Tolerance enabled. 
> 
>
> Key: HDFS-15543
> URL: https://issues.apache.org/jira/browse/HDFS-15543
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.1.1
>Reporter: Harshakiran Reddy
>Assignee: Hemanth Boyina
>Priority: Major
> Attachments: HDFS-15543.001.patch, HDFS-15543.002.patch, 
> HDFS-15543.003.patch, HDFS-15543_testrepro.patch
>
>
> A RANDOM mount point should allow creating new files when one subcluster is 
> down if fault tolerance is enabled, but here it fails.
> MultiDestination_client]# hdfs dfsrouteradmin -ls /test_ec
> *Mount Table Entries:*
> Source Destinations Owner Group Mode Quota/Usage
> /test_ec *hacluster->/tes_ec,hacluster1->/tes_ec* test ficommon rwxr-xr-x 
> [NsQuota: -/-, SsQuota: -/-]
> *The file write throws this exception:*
> 2020-08-26 19:13:21,839 WARN hdfs.DataStreamer: Abandoning blk_1073743375_2551
>  2020-08-26 19:13:21,877 WARN hdfs.DataStreamer: Excluding datanode 
> DatanodeInfoWithStorage[DISK]
>  2020-08-26 19:13:21,878 WARN hdfs.DataStreamer: DataStreamer Exception
>  java.io.IOException: Unable to create new block.
>  at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1758)
>  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:718)
>  2020-08-26 19:13:21,879 WARN hdfs.DataStreamer: Could not get block 
> locations. Source file "/test_ec/f1._COPYING_" - Aborting...block==null
>  put: Could not get block locations. Source file "/test_ec/f1._COPYING_" - 
> Aborting...block==null



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13522) Support observer node from Router-Based Federation

2020-09-03 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190360#comment-17190360
 ] 

Íñigo Goiri commented on HDFS-13522:


Thanks [~hemanthboyina] for the update.
The patch has a couple of things that I would try to fix, but it looks like 
the right approach to me.
We may want to discuss adding the contexts and so on, but I would move forward 
with that.

> Support observer node from Router-Based Federation
> --
>
> Key: HDFS-13522
> URL: https://issues.apache.org/jira/browse/HDFS-13522
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: federation, namenode
>Reporter: Erik Krogen
>Assignee: Chao Sun
>Priority: Major
> Attachments: HDFS-13522.001.patch, HDFS-13522_WIP.patch, RBF_ 
> Observer support.pdf, Router+Observer RPC clogging.png, 
> ShortTerm-Routers+Observer.png
>
>
> Changes will need to occur to the router to support the new observer node.
> One such change will be to make the router understand the observer state, 
> e.g. {{FederationNamenodeServiceState}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-15529) getChildFilesystems should include fallback fs as well

2020-09-03 Thread Uma Maheswara Rao G (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uma Maheswara Rao G resolved HDFS-15529.

Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Thanks [~ayushsaxena] for the review. I have committed this to trunk.

> getChildFilesystems should include fallback fs as well
> --
>
> Key: HDFS-15529
> URL: https://issues.apache.org/jira/browse/HDFS-15529
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: viewfs, viewfsOverloadScheme
>Affects Versions: 3.4.0
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently the getChildFileSystems API is used by many other APIs, like 
> getAdditionalTokenIssuers, getTrashRoots, etc.
> If the fallback filesystem is not included in the child filesystems, an 
> application like YARN that uses getAdditionalTokenIssuers would not get 
> delegation tokens for the fallback fs. This would be a critical bug for 
> secure clusters.
> Similarly for trashRoots: when applications use trashRoots, trash folders 
> from the fallback are not considered, so they leak past the cleanup logic.
>  
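
To make the intent concrete, a minimal self-contained sketch of the fix's idea; 
FileSystem is stood in by String so the snippet compiles on its own, and the 
names are hypothetical, not the actual ViewFileSystem internals:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch of the intended behavior, not the committed patch.
public class ChildFsSketch {
  static List<String> getChildFileSystems(List<String> mountFs, String fallbackFs) {
    List<String> children = new ArrayList<>(mountFs);
    if (fallbackFs != null) {
      children.add(fallbackFs); // the fix: the fallback fs now contributes
    }                           // delegation tokens and trash roots too
    return children;
  }

  public static void main(String[] args) {
    System.out.println(getChildFileSystems(
        List.of("hdfs://ns1", "hdfs://ns2"), "hdfs://fallback"));
  }
}
{code}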



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15529) getChildFilesystems should include fallback fs as well

2020-09-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15529?focusedWorklogId=478737=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478737
 ]

ASF GitHub Bot logged work on HDFS-15529:
-

Author: ASF GitHub Bot
Created on: 03/Sep/20 18:07
Start Date: 03/Sep/20 18:07
Worklog Time Spent: 10m 
  Work Description: umamaheswararao commented on pull request #2234:
URL: https://github.com/apache/hadoop/pull/2234#issuecomment-686660352


   Thanks @ayushtkn for the review.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 478737)
Time Spent: 40m  (was: 0.5h)

> getChildFilesystems should include fallback fs as well
> --
>
> Key: HDFS-15529
> URL: https://issues.apache.org/jira/browse/HDFS-15529
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: viewfs, viewfsOverloadScheme
>Affects Versions: 3.4.0
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently the getChildFileSystems API is used by many other APIs, like 
> getAdditionalTokenIssuers, getTrashRoots, etc.
> If the fallback filesystem is not included in the child filesystems, an 
> application like YARN that uses getAdditionalTokenIssuers would not get 
> delegation tokens for the fallback fs. This would be a critical bug for 
> secure clusters.
> Similarly for trashRoots: when applications use trashRoots, trash folders 
> from the fallback are not considered, so they leak past the cleanup logic.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15529) getChildFilesystems should include fallback fs as well

2020-09-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15529?focusedWorklogId=478736=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478736
 ]

ASF GitHub Bot logged work on HDFS-15529:
-

Author: ASF GitHub Bot
Created on: 03/Sep/20 18:06
Start Date: 03/Sep/20 18:06
Worklog Time Spent: 10m 
  Work Description: umamaheswararao merged pull request #2234:
URL: https://github.com/apache/hadoop/pull/2234


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 478736)
Time Spent: 0.5h  (was: 20m)

> getChildFilesystems should include fallback fs as well
> --
>
> Key: HDFS-15529
> URL: https://issues.apache.org/jira/browse/HDFS-15529
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: viewfs, viewfsOverloadScheme
>Affects Versions: 3.4.0
>Reporter: Uma Maheswara Rao G
>Assignee: Uma Maheswara Rao G
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently the getChildFileSystems API is used by many other APIs, like 
> getAdditionalTokenIssuers, getTrashRoots, etc.
> If the fallback filesystem is not included in the child filesystems, an 
> application like YARN that uses getAdditionalTokenIssuers would not get 
> delegation tokens for the fallback fs. This would be a critical bug for 
> secure clusters.
> Similarly for trashRoots: when applications use trashRoots, trash folders 
> from the fallback are not considered, so they leak past the cleanup logic.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15543) RBF: Write Should allow, when a subcluster is unavailable for RANDOM mount points with fault Tolerance enabled.

2020-09-03 Thread Hemanth Boyina (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190337#comment-17190337
 ] 

Hemanth Boyina commented on HDFS-15543:
---

Thanks for the review, [~elgoiri].

I have updated the patch by removing the repeated parts; please review.

> RBF: Write Should allow, when a subcluster is unavailable for RANDOM mount 
> points with fault Tolerance enabled. 
> 
>
> Key: HDFS-15543
> URL: https://issues.apache.org/jira/browse/HDFS-15543
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.1.1
>Reporter: Harshakiran Reddy
>Assignee: Hemanth Boyina
>Priority: Major
> Attachments: HDFS-15543.001.patch, HDFS-15543.002.patch, 
> HDFS-15543.003.patch, HDFS-15543_testrepro.patch
>
>
> A RANDOM mount point should allow creating new files when one subcluster is 
> down, provided fault tolerance is enabled; but here it fails.
> MultiDestination_client]# hdfs dfsrouteradmin -ls /test_ec
> *Mount Table Entries:*
> Source Destinations Owner Group Mode Quota/Usage
> /test_ec *hacluster->/tes_ec,hacluster1->/tes_ec* test ficommon rwxr-xr-x 
> [NsQuota: -/-, SsQuota: -/-]
> *File write threw the exception:*
> 2020-08-26 19:13:21,839 WARN hdfs.DataStreamer: Abandoning blk_1073743375_2551
>  2020-08-26 19:13:21,877 WARN hdfs.DataStreamer: Excluding datanode 
> DatanodeInfoWithStorage[DISK]
>  2020-08-26 19:13:21,878 WARN hdfs.DataStreamer: DataStreamer Exception
>  java.io.IOException: Unable to create new block.
>  at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1758)
>  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:718)
>  2020-08-26 19:13:21,879 WARN hdfs.DataStreamer: Could not get block 
> locations. Source file "/test_ec/f1._COPYING_" - Aborting...block==null
>  put: Could not get block locations. Source file "/test_ec/f1._COPYING_" - 
> Aborting...block==null



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15543) RBF: Write Should allow, when a subcluster is unavailable for RANDOM mount points with fault Tolerance enabled.

2020-09-03 Thread Hemanth Boyina (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hemanth Boyina updated HDFS-15543:
--
Attachment: HDFS-15543.003.patch

> RBF: Write Should allow, when a subcluster is unavailable for RANDOM mount 
> points with fault Tolerance enabled. 
> 
>
> Key: HDFS-15543
> URL: https://issues.apache.org/jira/browse/HDFS-15543
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rbf
>Affects Versions: 3.1.1
>Reporter: Harshakiran Reddy
>Assignee: Hemanth Boyina
>Priority: Major
> Attachments: HDFS-15543.001.patch, HDFS-15543.002.patch, 
> HDFS-15543.003.patch, HDFS-15543_testrepro.patch
>
>
> A RANDOM mount point should allow creating new files when one subcluster is 
> down, provided fault tolerance is enabled; but here it fails.
> MultiDestination_client]# hdfs dfsrouteradmin -ls /test_ec
> *Mount Table Entries:*
> Source Destinations Owner Group Mode Quota/Usage
> /test_ec *hacluster->/tes_ec,hacluster1->/tes_ec* test ficommon rwxr-xr-x 
> [NsQuota: -/-, SsQuota: -/-]
> *File write threw the exception:*
> 2020-08-26 19:13:21,839 WARN hdfs.DataStreamer: Abandoning blk_1073743375_2551
>  2020-08-26 19:13:21,877 WARN hdfs.DataStreamer: Excluding datanode 
> DatanodeInfoWithStorage[DISK]
>  2020-08-26 19:13:21,878 WARN hdfs.DataStreamer: DataStreamer Exception
>  java.io.IOException: Unable to create new block.
>  at 
> org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1758)
>  at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:718)
>  2020-08-26 19:13:21,879 WARN hdfs.DataStreamer: Could not get block 
> locations. Source file "/test_ec/f1._COPYING_" - Aborting...block==null
>  put: Could not get block locations. Source file "/test_ec/f1._COPYING_" - 
> Aborting...block==null



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15025) Applying NVDIMM storage media to HDFS

2020-09-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15025?focusedWorklogId=478697=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478697
 ]

ASF GitHub Bot logged work on HDFS-15025:
-

Author: ASF GitHub Bot
Created on: 03/Sep/20 17:00
Start Date: 03/Sep/20 17:00
Worklog Time Spent: 10m 
  Work Description: brahmareddybattula commented on a change in pull 
request #2189:
URL: https://github.com/apache/hadoop/pull/2189#discussion_r483125623



##
File path: 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/StorageType.java
##
@@ -34,28 +34,35 @@
 @InterfaceStability.Unstable
 public enum StorageType {
   // sorted by the speed of the storage types, from fast to slow
-  RAM_DISK(true),
-  SSD(false),
-  DISK(false),
-  ARCHIVE(false),
-  PROVIDED(false);
+  RAM_DISK(true, true),
+  NVDIMM(false, true),
+  SSD(false, false),
+  DISK(false, false),
+  ARCHIVE(false, false),
+  PROVIDED(false, false);
 
   private final boolean isTransient;
+  private final boolean isRAM;
 
   public static final StorageType DEFAULT = DISK;
 
   public static final StorageType[] EMPTY_ARRAY = {};
 
   private static final StorageType[] VALUES = values();
 
-  StorageType(boolean isTransient) {
+  StorageType(boolean isTransient, boolean isRAM) {
 this.isTransient = isTransient;
+this.isRAM = isRAM;
   }
 
   public boolean isTransient() {
 return isTransient;
   }
 
+  public boolean isRAM() {
+return isRAM;
+  }

Review comment:
   Balancer and mover decide which blocks to move based on `isTransient` 
(they call getMovableTypes(..)). I feel blocks on NVDIMM should not be moved 
either (NVDIMM is also a RAM-type storage, so there is no need to move it), 
but as per this change they will be moved.
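
   To make the concern concrete, a hedged fragment meant to slot into the 
StorageType enum quoted above; the filter on isRAM is the suggested behavior, 
not the current Hadoop code (which filters on isTransient):

{code:java}
  // Sketch of the suggested behavior: exclude every RAM-backed type from
  // balancer/mover moves, so NVDIMM (isRAM == true) is not movable either.
  // The current code filters on isTransient instead, which lets NVDIMM move.
  public static List<StorageType> getMovableTypes() {
    List<StorageType> movable = new ArrayList<>();
    for (StorageType t : VALUES) {
      if (!t.isRAM) {
        movable.add(t);
      }
    }
    return movable;
  }
{code}

   (java.util.ArrayList and java.util.List imports assumed at the top of the file.)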





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 478697)
Time Spent: 1h 20m  (was: 1h 10m)

> Applying NVDIMM storage media to HDFS
> -
>
> Key: HDFS-15025
> URL: https://issues.apache.org/jira/browse/HDFS-15025
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, hdfs
>Reporter: YaYun Wang
>Assignee: YaYun Wang
>Priority: Major
>  Labels: pull-request-available
> Attachments: Applying NVDIMM to HDFS.pdf, HDFS-15025.001.patch, 
> HDFS-15025.002.patch, HDFS-15025.003.patch, HDFS-15025.004.patch, 
> HDFS-15025.005.patch, HDFS-15025.006.patch, NVDIMM_patch(WIP).patch
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Non-volatile memory (NVDIMM) is faster than SSD and can be used 
> alongside RAM, DISK, and SSD. Storing HDFS data directly on 
> NVDIMM not only improves the response rate of HDFS but also ensures the 
> reliability of the data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15554) RBF: force router check file existence in destinations before adding/updating mount points

2020-09-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15554?focusedWorklogId=478696=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478696
 ]

ASF GitHub Bot logged work on HDFS-15554:
-

Author: ASF GitHub Bot
Created on: 03/Sep/20 17:00
Start Date: 03/Sep/20 17:00
Worklog Time Spent: 10m 
  Work Description: fengnanli commented on a change in pull request #2266:
URL: https://github.com/apache/hadoop/pull/2266#discussion_r483126174



##
File path: 
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterAdminServer.java
##
@@ -562,11 +595,35 @@ public GetDestinationResponse getDestination(
   LOG.error("Cannot get location for {}: {}",
   src, ioe.getMessage());
 }
-if (nsIds.isEmpty() && !locations.isEmpty()) {
-  String nsId = locations.get(0).getNameserviceId();
-  nsIds.add(nsId);
+return nsIds;
+  }
+
+  /**
+   * Verify the file exists in destination nameservices to avoid dangling
+   * mount points.
+   *
+   * @param entry the new mount points added, could be from add or update.
+   * @return destination nameservices where the file doesn't exist.
+   * @throws IOException
+   */
+  private List<String> verifyFileInDestinations(MountTable entry)

Review comment:
   Thanks for the suggestion. I want to involve more people as well: when I 
started to fix the tests, I found quite a few tests targeting cases with 
dangling mount points.
   @aajisaka Can you share your thoughts as well?
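
   For context, a hedged sketch of where such a check could sit in the add 
path; only verifyFileInDestinations comes from the patch above, the guard 
method itself is illustrative:

{code:java}
  // Illustrative pre-check the router could run before persisting a new
  // mount entry; the exact wiring in RouterAdminServer may differ.
  private void checkDestinationsExist(MountTable entry) throws IOException {
    List<String> missing = verifyFileInDestinations(entry);
    if (!missing.isEmpty()) {
      // refuse to create a dangling mount point
      throw new IOException(
          "Destination path not found in nameservices: " + missing);
    }
  }
{code}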





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 478696)
Time Spent: 1h 20m  (was: 1h 10m)

> RBF: force router check file existence in destinations before adding/updating 
> mount points
> --
>
> Key: HDFS-15554
> URL: https://issues.apache.org/jira/browse/HDFS-15554
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Adding/updating mount points right now is a router-only action, with no 
> validation in the downstream namenodes that the destination files/directories exist.
> In practice we have ended up with dangling mount points: when clients call 
> listStatus the file is returned, but if they then try to access 
> the file, a FileNotFoundException is thrown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15554) RBF: force router check file existence in destinations before adding/updating mount points

2020-09-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15554?focusedWorklogId=478693=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478693
 ]

ASF GitHub Bot logged work on HDFS-15554:
-

Author: ASF GitHub Bot
Created on: 03/Sep/20 17:00
Start Date: 03/Sep/20 17:00
Worklog Time Spent: 10m 
  Work Description: fengnanli commented on a change in pull request #2266:
URL: https://github.com/apache/hadoop/pull/2266#discussion_r483126174



##
File path: 
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterAdminServer.java
##
@@ -562,11 +595,35 @@ public GetDestinationResponse getDestination(
   LOG.error("Cannot get location for {}: {}",
   src, ioe.getMessage());
 }
-if (nsIds.isEmpty() && !locations.isEmpty()) {
-  String nsId = locations.get(0).getNameserviceId();
-  nsIds.add(nsId);
+return nsIds;
+  }
+
+  /**
+   * Verify the file exists in destination nameservices to avoid dangling
+   * mount points.
+   *
+   * @param entry the new mount points added, could be from add or update.
+   * @return destination nameservices where the file doesn't exist.
+   * @throws IOException
+   */
+  private List<String> verifyFileInDestinations(MountTable entry)

Review comment:
   Thanks for the suggestion. I want to involve more people as well: when I 
started to fix the tests, I found quite a few tests targeting the logic of 
dangling mount points.
   @aajisaka Can you share your thoughts as well?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 478693)
Time Spent: 1h 10m  (was: 1h)

> RBF: force router check file existence in destinations before adding/updating 
> mount points
> --
>
> Key: HDFS-15554
> URL: https://issues.apache.org/jira/browse/HDFS-15554
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Adding/updating mount points right now is a router-only action, with no 
> validation in the downstream namenodes that the destination files/directories exist.
> In practice we have ended up with dangling mount points: when clients call 
> listStatus the file is returned, but if they then try to access 
> the file, a FileNotFoundException is thrown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15025) Applying NVDIMM storage media to HDFS

2020-09-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15025?focusedWorklogId=478692=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478692
 ]

ASF GitHub Bot logged work on HDFS-15025:
-

Author: ASF GitHub Bot
Created on: 03/Sep/20 16:59
Start Date: 03/Sep/20 16:59
Worklog Time Spent: 10m 
  Work Description: brahmareddybattula commented on a change in pull 
request #2189:
URL: https://github.com/apache/hadoop/pull/2189#discussion_r483125623



##
File path: 
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/StorageType.java
##
@@ -34,28 +34,35 @@
 @InterfaceStability.Unstable
 public enum StorageType {
   // sorted by the speed of the storage types, from fast to slow
-  RAM_DISK(true),
-  SSD(false),
-  DISK(false),
-  ARCHIVE(false),
-  PROVIDED(false);
+  RAM_DISK(true, true),
+  NVDIMM(false, true),
+  SSD(false, false),
+  DISK(false, false),
+  ARCHIVE(false, false),
+  PROVIDED(false, false);
 
   private final boolean isTransient;
+  private final boolean isRAM;
 
   public static final StorageType DEFAULT = DISK;
 
   public static final StorageType[] EMPTY_ARRAY = {};
 
   private static final StorageType[] VALUES = values();
 
-  StorageType(boolean isTransient) {
+  StorageType(boolean isTransient, boolean isRAM) {
 this.isTransient = isTransient;
+this.isRAM = isRAM;
   }
 
   public boolean isTransient() {
 return isTransient;
   }
 
+  public boolean isRAM() {
+return isRAM;
+  }

Review comment:
   Balancer and mover decide which blocks to move based on `isTransient` 
(they call getMovableTypes(..)). I feel blocks on NVDIMM should not be moved, 
but as per this change they will be.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 478692)
Time Spent: 1h 10m  (was: 1h)

> Applying NVDIMM storage media to HDFS
> -
>
> Key: HDFS-15025
> URL: https://issues.apache.org/jira/browse/HDFS-15025
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, hdfs
>Reporter: YaYun Wang
>Assignee: YaYun Wang
>Priority: Major
>  Labels: pull-request-available
> Attachments: Applying NVDIMM to HDFS.pdf, HDFS-15025.001.patch, 
> HDFS-15025.002.patch, HDFS-15025.003.patch, HDFS-15025.004.patch, 
> HDFS-15025.005.patch, HDFS-15025.006.patch, NVDIMM_patch(WIP).patch
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Non-volatile memory (NVDIMM) is faster than SSD and can be used 
> alongside RAM, DISK, and SSD. Storing HDFS data directly on 
> NVDIMM not only improves the response rate of HDFS but also ensures the 
> reliability of the data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13522) Support observer node from Router-Based Federation

2020-09-03 Thread Hemanth Boyina (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190290#comment-17190290
 ] 

Hemanth Boyina commented on HDFS-13522:
---

Thanks everyone for the discussions here.

At Huawei, we have developed and have been using the router with observer nodes 
for quite some time; please check  [^HDFS-13522_WIP.patch]

> Support observer node from Router-Based Federation
> --
>
> Key: HDFS-13522
> URL: https://issues.apache.org/jira/browse/HDFS-13522
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: federation, namenode
>Reporter: Erik Krogen
>Assignee: Chao Sun
>Priority: Major
> Attachments: HDFS-13522.001.patch, HDFS-13522_WIP.patch, RBF_ 
> Observer support.pdf, Router+Observer RPC clogging.png, 
> ShortTerm-Routers+Observer.png
>
>
> Changes will need to occur to the router to support the new observer node.
> One such change will be to make the router understand the observer state, 
> e.g. {{FederationNamenodeServiceState}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13522) Support observer node from Router-Based Federation

2020-09-03 Thread Hemanth Boyina (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hemanth Boyina updated HDFS-13522:
--
Attachment: HDFS-13522_WIP.patch

> Support observer node from Router-Based Federation
> --
>
> Key: HDFS-13522
> URL: https://issues.apache.org/jira/browse/HDFS-13522
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: federation, namenode
>Reporter: Erik Krogen
>Assignee: Chao Sun
>Priority: Major
> Attachments: HDFS-13522.001.patch, HDFS-13522_WIP.patch, RBF_ 
> Observer support.pdf, Router+Observer RPC clogging.png, 
> ShortTerm-Routers+Observer.png
>
>
> Changes will need to occur to the router to support the new observer node.
> One such change will be to make the router understand the observer state, 
> e.g. {{FederationNamenodeServiceState}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14694) Call recoverLease on DFSOutputStream close exception

2020-09-03 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190172#comment-17190172
 ] 

Hadoop QA commented on HDFS-14694:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
50s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green} No case conflicting files found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
10s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 
44s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  4m  
1s{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
39s{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m  
9s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m 23s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
28s{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  2m  
5s{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  2m 
57s{color} | {color:blue} Used deprecated FindBugs config; considering 
switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
13s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
24s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
53s{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  3m 
34s{color} | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  3m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 38s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
22s{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
55s{color} | {color:green} the patch passed with JDK Private 
Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m 
59s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| 

[jira] [Commented] (HDFS-14694) Call recoverLease on DFSOutputStream close exception

2020-09-03 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190162#comment-17190162
 ] 

Xiaoqiao He commented on HDFS-14694:


Thanks [~leosun08] for your continued patches.
I will give my +1 on  [^HDFS-14694.010.patch] once the unused print 
`System.out.println("sls close:" + closed);` is removed. Thanks again.

> Call recoverLease on DFSOutputStream close exception
> 
>
> Key: HDFS-14694
> URL: https://issues.apache.org/jira/browse/HDFS-14694
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Chen Zhang
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, 
> HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, 
> HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch, 
> HDFS-14694.009.patch, HDFS-14694.010.patch
>
>
> HDFS uses file leases to manage opened files; when a file is not closed 
> normally, the NN recovers the lease automatically after the hard limit is 
> exceeded. But for a long-running service (e.g. HBase), the hdfs-client never 
> dies, so the NN never gets a chance to recover the file.
> Usually the client program needs to handle exceptions itself to avoid this 
> condition (e.g. HBase automatically calls recoverLease for files that were not 
> closed normally), but in our experience most services (in our company) don't 
> handle this condition properly, which leaves lots of files in an abnormal 
> state or even causes data loss.
> This Jira proposes adding a feature that calls the recoverLease operation 
> automatically when DFSOutputStream close encounters an exception. It should be 
> disabled by default, but when somebody builds a long-running service on top of 
> HDFS, they can enable this option.
> We've had this feature in our internal Hadoop distribution for more than 3 
> years, and it's quite useful in our experience.
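
For readers unfamiliar with the manual pattern this feature automates, a hedged 
sketch of what long-running services do today; the retry/polling policy below 
is an assumption, not the patch itself:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Sketch only: close, and on failure ask the NameNode to recover the lease
// so the file does not stay open until the hard limit expires.
public class CloseWithRecovery {
  static void closeSafely(FSDataOutputStream out, DistributedFileSystem dfs,
      Path path) throws InterruptedException {
    try {
      out.close();
    } catch (IOException e) {
      try {
        while (!dfs.recoverLease(path)) {
          Thread.sleep(1000L); // crude poll; tune the interval in real code
        }
      } catch (IOException re) {
        // recovery also failed; log or rethrow as appropriate
      }
    }
  }
}
{code}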



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13351) Revert HDFS-11156 from branch-2/branch-2.8

2020-09-03 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190161#comment-17190161
 ] 

Hadoop QA commented on HDFS-13351:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m 11s{color} 
| {color:red} HDFS-13351 does not apply to branch-2. Rebase required? Wrong 
Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-13351 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12918911/HDFS-13351-branch-2.003.patch
 |
| Console output | 
https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/125/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org |


This message was automatically generated.



> Revert HDFS-11156 from branch-2/branch-2.8
> --
>
> Key: HDFS-13351
> URL: https://issues.apache.org/jira/browse/HDFS-13351
> Project: Hadoop HDFS
>  Issue Type: Task
>  Components: webhdfs
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: release-blocker
> Attachments: HDFS-13351-branch-2.001.patch, 
> HDFS-13351-branch-2.002.patch, HDFS-13351-branch-2.003.patch
>
>
> Per the discussion in HDFS-11156, let's revert the change from branch-2 and 
> branch-2.8. A new patch can be tracked in HDFS-12459.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13351) Revert HDFS-11156 from branch-2/branch-2.8

2020-09-03 Thread Masatake Iwasaki (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190159#comment-17190159
 ] 

Masatake Iwasaki commented on HDFS-13351:
-

I updated the 'Target Version/s:' to 2.9.3, since HDFS-11156 has already been 
reverted from branch-2.10. 2.9.3 will not be released, as there is an ongoing 
vote for the EOL of branch-2.9.

> Revert HDFS-11156 from branch-2/branch-2.8
> --
>
> Key: HDFS-13351
> URL: https://issues.apache.org/jira/browse/HDFS-13351
> Project: Hadoop HDFS
>  Issue Type: Task
>  Components: webhdfs
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: release-blocker
> Attachments: HDFS-13351-branch-2.001.patch, 
> HDFS-13351-branch-2.002.patch, HDFS-13351-branch-2.003.patch
>
>
> Per the discussion in HDFS-11156, let's revert the change from branch-2 and 
> branch-2.8. A new patch can be tracked in HDFS-12459.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13351) Revert HDFS-11156 from branch-2/branch-2.8

2020-09-03 Thread Masatake Iwasaki (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-13351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated HDFS-13351:

Target Version/s: 2.9.3  (was: 2.10.1)

> Revert HDFS-11156 from branch-2/branch-2.8
> --
>
> Key: HDFS-13351
> URL: https://issues.apache.org/jira/browse/HDFS-13351
> Project: Hadoop HDFS
>  Issue Type: Task
>  Components: webhdfs
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Major
>  Labels: release-blocker
> Attachments: HDFS-13351-branch-2.001.patch, 
> HDFS-13351-branch-2.002.patch, HDFS-13351-branch-2.003.patch
>
>
> Per the discussion in HDFS-11156, let's revert the change from branch-2 and 
> branch-2.8. A new patch can be tracked in HDFS-12459.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread Hongbing Wang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190115#comment-17190115
 ] 

Hongbing Wang commented on HDFS-15556:
--

BPServiceActor uses the `initialRegistrationComplete` variable, of type 
`CountDownLatch(1)`, to ensure that the sendLifeline thread only runs after 
registration has completed.
It seems this guard does not take effect on reRegister, because 
`initialRegistrationComplete` was already counted down during the first registration.
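
A minimal sketch of the gating pattern described above (names illustrative, not 
the actual BPServiceActor code), showing why a one-shot latch cannot guard a 
re-registration:

{code:java}
import java.util.concurrent.CountDownLatch;

public class RegistrationGate {
  // a CountDownLatch(1) can only fire once
  private final CountDownLatch initialRegistrationComplete =
      new CountDownLatch(1);

  void register() {
    // ... perform (re-)registration with the NameNode ...
    initialRegistrationComplete.countDown(); // no-op after the first call
  }

  void lifelineLoop() throws InterruptedException {
    initialRegistrationComplete.await(); // blocks only before the FIRST register()
    // ... send lifelines; nothing pauses this during a later reRegister ...
  }
}
{code}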

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode hits an NPE when processing lifeline messages 
> sent by a DataNode, which corrupts the maxLoad value calculated by the NN.
> Because DataNodes are then identified as busy and no available nodes can be 
> allocated when choosing a DataNode, the resulting retry loop drives CPU usage 
> high and reduces the processing performance of the cluster.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: NN_DN.LOG

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode hits an NPE when processing lifeline messages 
> sent by a DataNode, which corrupts the maxLoad value calculated by the NN.
> Because DataNodes are then identified as busy and no available nodes can be 
> allocated when choosing a DataNode, the resulting retry loop drives CPU usage 
> high and reduces the processing performance of the cluster.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: (was: NN_DN.LOG)

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode hits an NPE when processing lifeline messages 
> sent by a DataNode, which corrupts the maxLoad value calculated by the NN.
> Because DataNodes are then identified as busy and no available nodes can be 
> allocated when choosing a DataNode, the resulting retry loop drives CPU usage 
> high and reduces the processing performance of the cluster.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12548) HDFS Jenkins build is unstable on branch-2

2020-09-03 Thread Masatake Iwasaki (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190111#comment-17190111
 ] 

Masatake Iwasaki commented on HDFS-12548:
-

Since there has been no update for a long time, I updated 'Target Version/s:' 
in preparation for the 2.10.1 release.

> HDFS Jenkins build is unstable on branch-2
> --
>
> Key: HDFS-12548
> URL: https://issues.apache.org/jira/browse/HDFS-12548
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: build
>Affects Versions: 2.9.0
>Reporter: Rushabh Shah
>Priority: Critical
>
> Feel free to move the ticket to another project (e.g. infra).
> Recently I attached a branch-2 patch while working on another jira, 
> [HDFS-12386|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180676=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180676]
> There were at least 100 failed and timed-out tests. I am sure they are not 
> related to my patch.
> Also, I came across another jira which was just a javadoc-related change, and 
> there were around 100 failed tests.
> Below are the details for pre-commits that failed in branch-2
> 1 [HDFS-12386 attempt 
> 1|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180069=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180069]
> {noformat}
> Ran on slave: asf912.gq1.ygridcore.net/H12
> Failed with following error message:
> Build timed out (after 300 minutes). Marking the build as aborted.
> Build was aborted
> Performing Post build task...
> {noformat}
> 2. [HDFS-12386 attempt 
> 2|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180676=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180676]
> {noformat}
> Ran on slave: asf900.gq1.ygridcore.net
> Failed with following error message:
> FATAL: command execution failed
> Command close created at
>   at hudson.remoting.Command.<init>(Command.java:60)
>   at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1123)
>   at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1121)
>   at hudson.remoting.Channel.close(Channel.java:1281)
>   at hudson.remoting.Channel.close(Channel.java:1263)
>   at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1128)
> Caused: hudson.remoting.Channel$OrderlyShutdown
>   at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1129)
>   at hudson.remoting.Channel$1.handle(Channel.java:527)
>   at 
> hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:83)
> Caused: java.io.IOException: Backing channel 'H0' is disconnected.
>   at 
> hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:192)
>   at 
> hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:257)
>   at com.sun.proxy.$Proxy125.isAlive(Unknown Source)
>   at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1043)
>   at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1035)
>   at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:155)
>   at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:109)
>   at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
>   at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:735)
>   at hudson.model.Build$BuildExecution.build(Build.java:206)
>   at hudson.model.Build$BuildExecution.doRun(Build.java:163)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:490)
>   at hudson.model.Run.execute(Run.java:1735)
>   at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
>   at hudson.model.ResourceController.execute(ResourceController.java:97)
>   at hudson.model.Executor.run(Executor.java:405)
> {noformat}
> 3. [HDFS-12531 attempt 
> 1|https://issues.apache.org/jira/browse/HDFS-12531?focusedCommentId=16176493=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16176493]
> {noformat}
> Ran on slave:  asf911.gq1.ygridcore.net
> Failed with following error message:
> FATAL: command execution failed
> Command close created at
>   at hudson.remoting.Command.<init>(Command.java:60)
>   at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1123)
>   at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1121)
>   at hudson.remoting.Channel.close(Channel.java:1281)
>   at hudson.remoting.Channel.close(Channel.java:1263)
>   at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1128)
> Caused: hudson.remoting.Channel$OrderlyShutdown
>   at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1129)
> 

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190108#comment-17190108
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:48 PM:


[~hexiaoqiao] Thanks for your comments.
{quote}
Great catch here. v001 is fair for me, it will be better if add new unit test 
to cover.
{quote}
I'll add a unit test to cover it later.

{quote}
I am interested that why storage is null here. Anywhere not synchronized 
storageMap where should do that?
{quote}

The root cause of the problem is:
{quote}
1. One DataNode's heartbeats to the NN timed out; a DNA_REGISTER command is 
issued when the service is restored:
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2. When the NN runs registerDatanode, it calls DatanodeDescriptor#pruneStorageMap 
(clearing the storageMap) for the registered DN.
3. The DN re-registration took about a minute; after the heartbeat was overdue 
by more than 9 seconds, the lifeline reported to the NN,
  but at that point the DN's storageMap recorded at the NN was empty, so the 
NPE occurred.
{quote}

Detailed execution log:
 [^NN_DN.LOG] 

The relevant source code:
HeartbeatManager#updateLifeline
{code:java}
  synchronized void updateLifeline(final DatanodeDescriptor node,
      StorageReport[] reports, long cacheCapacity, long cacheUsed,
      int xceiverCount, int failedVolumes,
      VolumeFailureSummary volumeFailureSummary) {
    stats.subtract(node);
    // on every DN report, nodesInServiceXceiverCount is first decremented
    // by this DN's current xceiver count
    ...
    node.updateHeartbeatState(reports, cacheCapacity, cacheUsed, xceiverCount,
        failedVolumes, volumeFailureSummary);  // the NPE is thrown here

    stats.add(node);  // never reached when the NPE above is thrown
  }
{code}

BlockPlacementPolicyDefault#excludeNodeByLoad
{code:java}
  boolean excludeNodeByLoad(DatanodeDescriptor node) {
    final double maxLoad = considerLoadFactor *
        stats.getInServiceXceiverAverage();
    // getInServiceXceiverAverage() =
    //   heartbeatManager.getInServiceXceiverCount() / getNumDatanodesInService(),
    // so the skewed xceiver count above corrupts the final maxLoad value
    final int nodeLoad = node.getXceiverCount();
    if ((nodeLoad > maxLoad) && (maxLoad > 0)) {
      logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY,
          "(load: " + nodeLoad + " > " + maxLoad + ")");
      return true;
    }
    return false;
  }
{code}
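
For completeness, a sketch of the kind of null guard the fix could add in 
DatanodeDescriptor#updateStorageStats; the exact handling in 
HDFS-15556.001.patch may differ:

{code:java}
for (StorageReport report : reports) {
  DatanodeStorageInfo storage;
  synchronized (storageMap) {
    storage = storageMap.get(report.getStorage().getStorageID());
  }
  if (storage == null) {
    // the storageMap was pruned by a concurrent re-registration; skip the
    // report instead of dereferencing null
    LOG.warn("Unknown storage {} in report, skipping",
        report.getStorage().getStorageID());
    continue;
  }
  if (checkFailedStorages) {
    failedStorageInfos.remove(storage);
  }
  storage.receivedHeartbeat(report);
  // ... rest of the accounting unchanged ...
}
{code}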




was (Author: haiyang hu):
3. The root cause of the problem is:
{quote}
1. One DataNode's heartbeats to the NN timed out; a DNA_REGISTER command is 
issued when the service is restored:
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2. When the NN runs registerDatanode, it calls DatanodeDescriptor#pruneStorageMap 
(clearing the storageMap) for the registered DN.
3. The DN re-registration took about a minute; after the heartbeat was overdue 
by more than 9 seconds, the lifeline reported to the NN,
  but at that point the DN's storageMap recorded at the NN was empty, so the 
NPE occurred.
{quote}

4. Detailed execution log:
 [^NN_DN.LOG] 

5. The relevant source code:
HeartbeatManager#updateLifeline
{code:java}
  synchronized void updateLifeline(final DatanodeDescriptor node,
      StorageReport[] reports, long cacheCapacity, long cacheUsed,
      int xceiverCount, int failedVolumes,
      VolumeFailureSummary volumeFailureSummary) {
    stats.subtract(node);
    // on every DN report, nodesInServiceXceiverCount is first decremented
    // by this DN's current xceiver count
    ...
    node.updateHeartbeatState(reports, cacheCapacity, cacheUsed, xceiverCount,
        failedVolumes, volumeFailureSummary);  // the NPE is thrown here

    stats.add(node);  // never reached when the NPE above is thrown
  }
{code}

BlockPlacementPolicyDefault#excludeNodeByLoad
{code:java}
  boolean excludeNodeByLoad(DatanodeDescriptor node) {
    final double maxLoad = considerLoadFactor *
        stats.getInServiceXceiverAverage();
    // getInServiceXceiverAverage() =
    //   heartbeatManager.getInServiceXceiverCount() / getNumDatanodesInService(),
    // so the skewed xceiver count above corrupts the final maxLoad value
    final int nodeLoad = node.getXceiverCount();
    if ((nodeLoad > maxLoad) && (maxLoad > 0)) {
      logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY,
          "(load: " + nodeLoad + " > " + maxLoad + ")");
      return true;
    }
    return false;
  }
{code}



> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode appears NPE when processing lifeline 

[jira] [Updated] (HDFS-12548) HDFS Jenkins build is unstable on branch-2

2020-09-03 Thread Masatake Iwasaki (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-12548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated HDFS-12548:

Target Version/s: 2.10.2  (was: 2.10.1)

> HDFS Jenkins build is unstable on branch-2
> --
>
> Key: HDFS-12548
> URL: https://issues.apache.org/jira/browse/HDFS-12548
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: build
>Affects Versions: 2.9.0
>Reporter: Rushabh Shah
>Priority: Critical
>
> Feel free to move the ticket to another project (e.g. infra).
> Recently I attached a branch-2 patch while working on another jira, 
> [HDFS-12386|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180676=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180676]
> There were at least 100 failed and timed-out tests. I am sure they are not 
> related to my patch.
> Also, I came across another jira which was just a javadoc-related change, and 
> there were around 100 failed tests.
> Below are the details for pre-commits that failed in branch-2
> 1. [HDFS-12386 attempt 
> 1|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180069=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180069]
> {noformat}
> Ran on slave: asf912.gq1.ygridcore.net/H12
> Failed with following error message:
> Build timed out (after 300 minutes). Marking the build as aborted.
> Build was aborted
> Performing Post build task...
> {noformat}
> 2. [HDFS-12386 attempt 
> 2|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180676=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180676]
> {noformat}
> Ran on slave: asf900.gq1.ygridcore.net
> Failed with following error message:
> FATAL: command execution failed
> Command close created at
>   at hudson.remoting.Command.<init>(Command.java:60)
>   at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1123)
>   at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1121)
>   at hudson.remoting.Channel.close(Channel.java:1281)
>   at hudson.remoting.Channel.close(Channel.java:1263)
>   at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1128)
> Caused: hudson.remoting.Channel$OrderlyShutdown
>   at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1129)
>   at hudson.remoting.Channel$1.handle(Channel.java:527)
>   at 
> hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:83)
> Caused: java.io.IOException: Backing channel 'H0' is disconnected.
>   at 
> hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:192)
>   at 
> hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:257)
>   at com.sun.proxy.$Proxy125.isAlive(Unknown Source)
>   at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1043)
>   at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1035)
>   at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:155)
>   at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:109)
>   at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66)
>   at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:735)
>   at hudson.model.Build$BuildExecution.build(Build.java:206)
>   at hudson.model.Build$BuildExecution.doRun(Build.java:163)
>   at 
> hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:490)
>   at hudson.model.Run.execute(Run.java:1735)
>   at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
>   at hudson.model.ResourceController.execute(ResourceController.java:97)
>   at hudson.model.Executor.run(Executor.java:405)
> {noformat}
> 3. [HDFS-12531 attempt 
> 1|https://issues.apache.org/jira/browse/HDFS-12531?focusedCommentId=16176493=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16176493]
> {noformat}
> Ran on slave:  asf911.gq1.ygridcore.net
> Failed with following error message:
> FATAL: command execution failed
> Command close created at
>   at hudson.remoting.Command.<init>(Command.java:60)
>   at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1123)
>   at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1121)
>   at hudson.remoting.Channel.close(Channel.java:1281)
>   at hudson.remoting.Channel.close(Channel.java:1263)
>   at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1128)
> Caused: hudson.remoting.Channel$OrderlyShutdown
>   at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1129)
>   at hudson.remoting.Channel$1.handle(Channel.java:527)
>   at 
> 

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190108#comment-17190108
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:43 PM:


3. The cause of the problem is:
{quote}
1. One DataNode's heartbeat to the NN timed out; when the service recovered, a 
DNA_REGISTER command was issued: 
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2. While handling registerDatanode, the NN called 
DatanodeDescriptor#pruneStorageMap (clearing storageMap) for the registering DN
3. The DN's re-registration took about a minute; once the heartbeat was more 
than 9 seconds overdue, the Lifeline was reported to the NN,
  but at that point the DN's storageMap recorded on the NN was empty, so the 
NPE occurred
{quote}
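For context on the timing in point 3: the 9 seconds lines up with the default 
lifeline schedule, assuming default configuration, since 
dfs.datanode.lifeline.interval.seconds falls back to 3 * dfs.heartbeat.interval 
(3 seconds) when it is not set explicitly:

{code:xml}
<!-- Assumed defaults shown; the lifeline interval defaults to 3x the heartbeat. -->
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>
</property>
<property>
  <name>dfs.datanode.lifeline.interval.seconds</name>
  <value>9</value>
</property>
{code}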

4. Detailed execution log:
 [^NN_DN.LOG] 

5. Source code:
HeartbeatManager#updateLifeline
{code:java}
synchronized void updateLifeline(final DatanodeDescriptor node,
    StorageReport[] reports, long cacheCapacity, long cacheUsed,
    int xceiverCount, int failedVolumes,
    VolumeFailureSummary volumeFailureSummary) {
  stats.subtract(node);
  // On every DN heartbeat/lifeline report, nodesInServiceXceiverCount is
  // first decremented by the current DN's XceiverCount
  ...
  node.updateHeartbeatState(reports, cacheCapacity, cacheUsed, xceiverCount,
      failedVolumes, volumeFailureSummary);
  // the NPE is thrown here

  stats.add(node);  // this line is never reached
}
{code}
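Whatever shape the final fix takes, the subtract/add pair above has to stay 
balanced when updateHeartbeatState throws. A minimal, self-contained sketch of 
one possible hardening using try/finally (illustrative names, not the real 
HDFS classes and not necessarily what the attached patch does):

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Models the HeartbeatManager bookkeeping: subtract the node's xceivers, run
// the (possibly throwing) update, and rely on finally so the matching add
// always executes and the in-service count cannot drift.
public class BalancedStatsDemo {
  static final AtomicInteger inServiceXceiverCount = new AtomicInteger(400);

  static void updateLifeline(int nodeXceivers, Runnable updateHeartbeatState) {
    inServiceXceiverCount.addAndGet(-nodeXceivers);   // stats.subtract(node)
    try {
      updateHeartbeatState.run();                     // may throw the NPE
    } finally {
      inServiceXceiverCount.addAndGet(nodeXceivers);  // stats.add(node) always runs
    }
  }

  public static void main(String[] args) {
    try {
      updateLifeline(100, () -> {
        throw new NullPointerException("storage is null");
      });
    } catch (NullPointerException expected) {
      // the RPC layer would propagate this back to the DN
    }
    System.out.println(inServiceXceiverCount.get());  // 400: no drift
  }
}
{code}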

BlockPlacementPolicyDefault#excludeNodeByLoad
{code:java}
boolean excludeNodeByLoad(DatanodeDescriptor node) {
  final double maxLoad = considerLoadFactor *
      stats.getInServiceXceiverAverage();
  // stats.getInServiceXceiverAverage() =
  //   heartbeatManager.getInServiceXceiverCount() / getNumDatanodesInService(),
  // so the final maxLoad value is skewed
  final int nodeLoad = node.getXceiverCount();
  if ((nodeLoad > maxLoad) && (maxLoad > 0)) {
    logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY,
        "(load: " + nodeLoad + " > " + maxLoad + ")");
    return true;
  }
  return false;
}
{code}




was (Author: haiyang hu):
3. The cause of the problem is:
{quote}
1. One DataNode's heartbeat to the NN timed out; when the service recovered, a 
DNA_REGISTER command was issued: 
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2. While handling registerDatanode, the NN called 
DatanodeDescriptor#pruneStorageMap (clearing storageMap) for the registering DN
3. The DN's re-registration took about a minute; once the heartbeat was more 
than 9 seconds overdue, the Lifeline was reported to the NN,
  but at that point the DN's storageMap recorded on the NN was empty, so the 
NPE occurred
{quote}

4. Detailed execution log:
 [^NN_DN.LOG] 

5. Source code:

{code:java}
HeartbeatManager#updateLifeline
synchronized void updateLifeline(final DatanodeDescriptor node,
    StorageReport[] reports, long cacheCapacity, long cacheUsed,
    int xceiverCount, int failedVolumes,
    VolumeFailureSummary volumeFailureSummary) {
  stats.subtract(node);
  // On every DN heartbeat/lifeline report, nodesInServiceXceiverCount is
  // first decremented by the current DN's XceiverCount
  ...
  node.updateHeartbeatState(reports, cacheCapacity, cacheUsed, xceiverCount,
      failedVolumes, volumeFailureSummary);
  // the NPE is thrown here

  stats.add(node);  // this line is never reached
}
{code}


{code:java}
BlockPlacementPolicyDefault#excludeNodeByLoad
boolean excludeNodeByLoad(DatanodeDescriptor node) {
  final double maxLoad = considerLoadFactor *
      stats.getInServiceXceiverAverage();
  // stats.getInServiceXceiverAverage() =
  //   heartbeatManager.getInServiceXceiverCount() / getNumDatanodesInService(),
  // so the final maxLoad value is skewed
  final int nodeLoad = node.getXceiverCount();
  if ((nodeLoad > maxLoad) && (maxLoad > 0)) {
    logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY,
        "(load: " + nodeLoad + " > " + maxLoad + ")");
    return true;
  }
  return false;
}
{code}



> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode throws an NPE when processing lifeline messages 
> sent by a DataNode, which corrupts the maxLoad value calculated by the NN.
> Because the DataNode is then identified as busy and no available node can be 
> allocated during DataNode selection, the placement loop keeps retrying, 
> driving CPU usage high and degrading the cluster's processing performance.
> 

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190108#comment-17190108
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:42 PM:


3. The cause of the problem is:
{quote}
1. One DataNode's heartbeat to the NN timed out; when the service recovered, a 
DNA_REGISTER command was issued: 
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2. While handling registerDatanode, the NN called 
DatanodeDescriptor#pruneStorageMap (clearing storageMap) for the registering DN
3. The DN's re-registration took about a minute; once the heartbeat was more 
than 9 seconds overdue, the Lifeline was reported to the NN,
  but at that point the DN's storageMap recorded on the NN was empty, so the 
NPE occurred
{quote}

4. Detailed execution log:
 [^NN_DN.LOG] 

5. Source code:

{code:java}
HeartbeatManager#updateLifeline
synchronized void updateLifeline(final DatanodeDescriptor node,
    StorageReport[] reports, long cacheCapacity, long cacheUsed,
    int xceiverCount, int failedVolumes,
    VolumeFailureSummary volumeFailureSummary) {
  stats.subtract(node);
  // On every DN heartbeat/lifeline report, nodesInServiceXceiverCount is
  // first decremented by the current DN's XceiverCount
  ...
  node.updateHeartbeatState(reports, cacheCapacity, cacheUsed, xceiverCount,
      failedVolumes, volumeFailureSummary);
  // the NPE is thrown here

  stats.add(node);  // this line is never reached
}
{code}


{code:java}
BlockPlacementPolicyDefault#excludeNodeByLoad
boolean excludeNodeByLoad(DatanodeDescriptor node) {
  final double maxLoad = considerLoadFactor *
      stats.getInServiceXceiverAverage();
  // stats.getInServiceXceiverAverage() =
  //   heartbeatManager.getInServiceXceiverCount() / getNumDatanodesInService(),
  // so the final maxLoad value is skewed
  final int nodeLoad = node.getXceiverCount();
  if ((nodeLoad > maxLoad) && (maxLoad > 0)) {
    logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY,
        "(load: " + nodeLoad + " > " + maxLoad + ")");
    return true;
  }
  return false;
}
{code}




was (Author: haiyang hu):
3. The cause of the problem is:
{quote}
1. One DataNode's heartbeat to the NN timed out; when the service recovered, a 
DNA_REGISTER command was issued: 
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2. While handling registerDatanode, the NN called 
DatanodeDescriptor#pruneStorageMap (clearing storageMap) for the registering DN
3. The DN's re-registration took about a minute; once the heartbeat was more 
than 9 seconds overdue, the Lifeline was reported to the NN,
  but at that point the DN's storageMap recorded on the NN was empty, so the 
NPE occurred
{quote}

4. Detailed execution log:
 [^NN_DN.LOG] 

5. Source code:

{code:java}
HeartbeatManager#updateLifeline

  synchronized void updateLifeline(final DatanodeDescriptor node,
      StorageReport[] reports, long cacheCapacity, long cacheUsed,
      int xceiverCount, int failedVolumes,
      VolumeFailureSummary volumeFailureSummary) {
    stats.subtract(node); // on every DN heartbeat/lifeline report,
        // nodesInServiceXceiverCount is first decremented by the current
        // DN's XceiverCount
    ...
    node.updateHeartbeatState(reports, cacheCapacity, cacheUsed,
        xceiverCount, failedVolumes, volumeFailureSummary); // the NPE is
        // thrown here
    stats.add(node);  // this line is never reached
  }

BlockPlacementPolicyDefault#excludeNodeByLoad
  boolean excludeNodeByLoad(DatanodeDescriptor node) {
    final double maxLoad = considerLoadFactor *
        stats.getInServiceXceiverAverage();
    // stats.getInServiceXceiverAverage() =
    //   heartbeatManager.getInServiceXceiverCount() / getNumDatanodesInService(),
    // so the final maxLoad value is skewed
    final int nodeLoad = node.getXceiverCount();
    if ((nodeLoad > maxLoad) && (maxLoad > 0)) {
      logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY,
          "(load: " + nodeLoad + " > " + maxLoad + ")");
      return true;
    }
    return false;
  }
{code}


> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode throws an NPE when processing lifeline messages 
> sent by a DataNode, which corrupts the maxLoad value calculated by the NN.
> Because the DataNode is then identified as busy and no available node can be 
> allocated during DataNode selection, the placement loop keeps retrying, 
> driving CPU usage high and degrading the cluster's processing performance.
> 

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190108#comment-17190108
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:41 PM:


3. The cause of the problem is:
{quote}
1. One DataNode's heartbeat to the NN timed out; when the service recovered, a 
DNA_REGISTER command was issued: 
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2. While handling registerDatanode, the NN called 
DatanodeDescriptor#pruneStorageMap (clearing storageMap) for the registering DN
3. The DN's re-registration took about a minute; once the heartbeat was more 
than 9 seconds overdue, the Lifeline was reported to the NN,
  but at that point the DN's storageMap recorded on the NN was empty, so the 
NPE occurred
{quote}

4. Detailed execution log:
 [^NN_DN.LOG] 

5. Source code:

{code:java}
HeartbeatManager#updateLifeline

  synchronized void updateLifeline(final DatanodeDescriptor node,
      StorageReport[] reports, long cacheCapacity, long cacheUsed,
      int xceiverCount, int failedVolumes,
      VolumeFailureSummary volumeFailureSummary) {
    stats.subtract(node); // on every DN heartbeat/lifeline report,
        // nodesInServiceXceiverCount is first decremented by the current
        // DN's XceiverCount
    ...
    node.updateHeartbeatState(reports, cacheCapacity, cacheUsed,
        xceiverCount, failedVolumes, volumeFailureSummary); // the NPE is
        // thrown here
    stats.add(node);  // this line is never reached
  }

BlockPlacementPolicyDefault#excludeNodeByLoad
  boolean excludeNodeByLoad(DatanodeDescriptor node) {
    final double maxLoad = considerLoadFactor *
        stats.getInServiceXceiverAverage();
    // stats.getInServiceXceiverAverage() =
    //   heartbeatManager.getInServiceXceiverCount() / getNumDatanodesInService(),
    // so the final maxLoad value is skewed
    final int nodeLoad = node.getXceiverCount();
    if ((nodeLoad > maxLoad) && (maxLoad > 0)) {
      logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY,
          "(load: " + nodeLoad + " > " + maxLoad + ")");
      return true;
    }
    return false;
  }
{code}



was (Author: haiyang hu):
3. The cause of the problem is:
{quote}
1. One DataNode's heartbeat to the NN timed out; when the service recovered, a 
DNA_REGISTER command was issued: 
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2. While handling registerDatanode, the NN called 
DatanodeDescriptor#pruneStorageMap (clearing storageMap) for the registering DN
3. The DN's re-registration took about a minute; once the heartbeat was more 
than 9 seconds overdue, the Lifeline was reported to the NN,
  but at that point the DN's storageMap recorded on the NN was empty, so the 
NPE occurred
{quote}

Execution log:
 [^NN_DN.LOG] 


> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode throws an NPE when processing lifeline messages 
> sent by a DataNode, which corrupts the maxLoad value calculated by the NN.
> Because the DataNode is then identified as busy and no available node can be 
> allocated during DataNode selection, the placement loop keeps retrying, 
> driving CPU usage high and degrading the cluster's processing performance.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> 

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190108#comment-17190108
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:39 PM:


3. The cause of the problem is:
{quote}
1. One DataNode's heartbeat to the NN timed out; when the service recovered, a 
DNA_REGISTER command was issued: 
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2. While handling registerDatanode, the NN called 
DatanodeDescriptor#pruneStorageMap (clearing storageMap) for the registering DN
3. The DN's re-registration took about a minute; once the heartbeat was more 
than 9 seconds overdue, the Lifeline was reported to the NN,
  but at that point the DN's storageMap recorded on the NN was empty, so the 
NPE occurred
{quote}

Execution log:
 [^NN_DN.LOG] 



was (Author: haiyang hu):
3. The cause of the problem is:
{quote}
1. One DataNode's heartbeat to the NN timed out; when the service recovered, a 
DNA_REGISTER command was issued: 
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2. While handling registerDatanode, the NN called 
DatanodeDescriptor#pruneStorageMap (clearing storageMap) for the registering DN
3. The DN's re-registration took about a minute; once the heartbeat was more 
than 9 seconds overdue, the Lifeline was reported to the NN,
  but at that point the DN's storageMap recorded on the NN was empty, so the 
NPE occurred
{quote}

Execution log:



> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode throws an NPE when processing lifeline messages 
> sent by a DataNode, which corrupts the maxLoad value calculated by the NN.
> Because the DataNode is then identified as busy and no available node can be 
> allocated during DataNode selection, the placement loop keeps retrying, 
> driving CPU usage high and degrading the cluster's processing performance.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
> storage.receivedHeartbeat(report);  // the NPE occurs here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This 

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190108#comment-17190108
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:38 PM:


3. The cause of the problem is:
{quote}
1. One DataNode's heartbeat to the NN timed out; when the service recovered, a 
DNA_REGISTER command was issued: 
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2. While handling registerDatanode, the NN called 
DatanodeDescriptor#pruneStorageMap (clearing storageMap) for the registering DN
3. The DN's re-registration took about a minute; once the heartbeat was more 
than 9 seconds overdue, the Lifeline was reported to the NN,
  but at that point the DN's storageMap recorded on the NN was empty, so the 
NPE occurred
{quote}

Execution log:




was (Author: haiyang hu):
3. The cause of the problem is:
{quote}
1. One DataNode's heartbeat to the NN timed out; when the service recovered, a 
DNA_REGISTER command was issued: 
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2. While handling registerDatanode, the NN called 
DatanodeDescriptor#pruneStorageMap (clearing storageMap) for the registering DN
3. The DN's re-registration took about a minute; once the heartbeat was more 
than 9 seconds overdue, the Lifeline was reported to the NN,
  but at that point the DN's storageMap recorded on the NN was empty, so the 
NPE occurred
{quote}


{code:java}
//execution log
//NameNode LOG:
#registered DN:
2020-08-25 00:58:53,977 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
registerDatanode: from DatanodeRegistration(xxx:50010,xxx) storage xxx
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Removing a 
node: xxx:50010
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: xx:50010
2020-08-25 00:58:53,977 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: 
[DISK]:NORMAL:xxx:50010 failed.
2020-08-25 00:58:53,978 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Removed 
storage [DISK]xxx:FAILED:xxx:50010 from DataNode xxx:50010
...

#sendLifeline NPE: from 2020-08-25 00:59:02,977 to 2020-08-25 00:59:45,668, 
the NPE kept occurring 
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from DN:34766
java.lang.NullPointerException

...
2020-08-25 00:59:45,668 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 
on 8022, call Call#67833 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from DN:34766
java.lang.NullPointerException
...

#DN sendHeartBeat the NN will add storageMap:
2020-08-25 00:59:46,632 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new 
storage ID xxx for DN xxx:50010

DN LOG:
#DN run DNA_REGISTER
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeCommand action : DNA_REGISTER from NN:8021 with active state
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Block pool BP-xxx (Datanode Uuid xxx) service to NN:8021 beginning handshake 
with NN
2020-08-25 00:59:02,976 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
IOException in LifelineSender for Block pool XXX service to NN:8021
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: NN_DN.LOG

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode throws an NPE when processing lifeline messages 
> sent by a DataNode, which corrupts the maxLoad value calculated by the NN.
> Because the DataNode is then identified as busy and no available node can be 
> allocated during DataNode selection, the placement loop keeps retrying, 
> driving CPU usage high and degrading the cluster's processing performance.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
> storage.receivedHeartbeat(report);  // the NPE occurs here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190108#comment-17190108
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:36 PM:


3. The cause of the problem is:
{quote}
1. One DataNode's heartbeat to the NN timed out; when the service recovered, a 
DNA_REGISTER command was issued: 
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2. While handling registerDatanode, the NN called 
DatanodeDescriptor#pruneStorageMap (clearing storageMap) for the registering DN
3. The DN's re-registration took about a minute; once the heartbeat was more 
than 9 seconds overdue, the Lifeline was reported to the NN,
  but at that point the DN's storageMap recorded on the NN was empty, so the 
NPE occurred
{quote}


{code:java}
//execution log
//NameNode LOG:
#registered DN:
2020-08-25 00:58:53,977 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
registerDatanode: from DatanodeRegistration(xxx:50010,xxx) storage xxx
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Removing a 
node: xxx:50010
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: xx:50010
2020-08-25 00:58:53,977 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: 
[DISK]:NORMAL:xxx:50010 failed.
2020-08-25 00:58:53,978 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Removed 
storage [DISK]xxx:FAILED:xxx:50010 from DataNode xxx:50010
...

#sendLifeline NPE: from 2020-08-25 00:59:02,977 to 2020-08-25 00:59:45,668, 
the NPE kept occurring 
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from DN:34766
java.lang.NullPointerException

...
2020-08-25 00:59:45,668 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 
on 8022, call Call#67833 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from DN:34766
java.lang.NullPointerException
...

#DN sendHeartBeat the NN will add storageMap:
2020-08-25 00:59:46,632 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new 
storage ID xxx for DN xxx:50010

DN LOG:
#DN run DNA_REGISTER
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeCommand action : DNA_REGISTER from NN:8021 with active state
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Block pool BP-xxx (Datanode Uuid xxx) service to NN:8021 beginning handshake 
with NN
2020-08-25 00:59:02,976 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
IOException in LifelineSender for Block pool XXX service to NN:8021
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
at org.apache.hadoop.ipc.Client.call(Client.java:1457)
at org.apache.hadoop.ipc.Client.call(Client.java:1367)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at 

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190108#comment-17190108
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:36 PM:


3. The cause of the problem is:
{quote}
1. One DataNode's heartbeat to the NN timed out; when the service recovered, a 
DNA_REGISTER command was issued: 
BPServiceActor#run-->offerService-->processCommand-->reRegister-->sendHeartBeat
2. While handling registerDatanode, the NN called 
DatanodeDescriptor#pruneStorageMap (clearing storageMap) for the registering DN
3. The DN's re-registration took about a minute; once the heartbeat was more 
than 9 seconds overdue, the Lifeline was reported to the NN,
  but at that point the DN's storageMap recorded on the NN was empty, so the 
NPE occurred
{quote}


{code:java}
//execution log
//NameNode LOG:
#registered DN:
2020-08-25 00:58:53,977 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
registerDatanode: from DatanodeRegistration(xxx:50010,xxx) storage xxx
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Removing a 
node: xxx:50010
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: xx:50010
2020-08-25 00:58:53,977 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: 
[DISK]:NORMAL:xxx:50010 failed.
2020-08-25 00:58:53,978 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Removed 
storage [DISK]xxx:FAILED:xxx:50010 from DataNode xxx:50010
...

#sendLifeline NPE: from 2020-08-25 00:59:02,977 to 2020-08-25 00:59:45,668, 
the NPE kept occurring 
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from DN:34766
java.lang.NullPointerException

...
2020-08-25 00:59:45,668 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 
on 8022, call Call#67833 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from DN:34766
java.lang.NullPointerException
...

#DN sendHeartBeat the NN will add storageMap:
2020-08-25 00:59:46,632 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new 
storage ID xxx for DN xxx:50010

DN LOG:
#DN run DNA_REGISTER
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeCommand action : DNA_REGISTER from NN:8021 with active state
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Block pool BP-xxx (Datanode Uuid xxx) service to NN:8021 beginning handshake 
with NN
2020-08-25 00:59:02,976 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
IOException in LifelineSender for Block pool XXX service to NN:8021
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
at org.apache.hadoop.ipc.Client.call(Client.java:1457)
at org.apache.hadoop.ipc.Client.call(Client.java:1367)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at 

[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190108#comment-17190108
 ] 

huhaiyang commented on HDFS-15556:
--

3. The cause of the problem is:
{quote}
1. One DataNode's heartbeat to the NN timed out; when the service recovered, a 
DNA_REGISTER command was issued: 
#BPServiceActor#run-->offerService-->processCommand-->reRegister-->sendHeartBeat
2. While handling registerDatanode, the NN called 
DatanodeDescriptor#pruneStorageMap (clearing storageMap) for the registering DN
3. The DN's re-registration took about a minute; once the heartbeat was more 
than 9 seconds overdue, the Lifeline was reported to the NN,
  but at that point the DN's storageMap recorded on the NN was empty, so the 
NPE occurred
{quote}


{code:java}
//execution log
//NameNode LOG:
#registered DN:
2020-08-25 00:58:53,977 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
registerDatanode: from DatanodeRegistration(xxx:50010,xxx) storage xxx
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Removing a 
node: xxx:50010
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: xx:50010
2020-08-25 00:58:53,977 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: 
[DISK]:NORMAL:xxx:50010 failed.
2020-08-25 00:58:53,978 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Removed 
storage [DISK]xxx:FAILED:xxx:50010 from DataNode xxx:50010
...

#sendLifeline NPE: from 2020-08-25 00:59:02,977 to 2020-08-25 00:59:45,668, 
the NPE kept occurring 
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from DN:34766
java.lang.NullPointerException

...
2020-08-25 00:59:45,668 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 
on 8022, call Call#67833 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from DN:34766
java.lang.NullPointerException
...

#DN sendHeartBeat the NN will add storageMap:
2020-08-25 00:59:46,632 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new 
storage ID xxx for DN xxx:50010

DN LOG:
#DN run DNA_REGISTER
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeCommand action : DNA_REGISTER from NN:8021 with active state
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Block pool BP-xxx (Datanode Uuid xxx) service to NN:8021 beginning handshake 
with NN
2020-08-25 00:59:02,976 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
IOException in LifelineSender for Block pool XXX service to NN:8021
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
at org.apache.hadoop.ipc.Client.call(Client.java:1457)
at org.apache.hadoop.ipc.Client.call(Client.java:1367)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy21.sendLifeline(Unknown Source)
at 

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190102#comment-17190102
 ] 

Xiaoqiao He edited comment on HDFS-15556 at 9/3/20, 12:30 PM:
--

[~haiyang Hu] Thanks for the report. Great catch here. v001 looks fair to me; 
it would be better to add a new unit test to cover this.
I am curious why {{storage}} is null here. Is there anywhere that fails to 
synchronize on {{storageMap}} but should?


was (Author: hexiaoqiao):
[~haiyang Hu] Great catch here. v001 looks fair to me; it would be better to 
add a new unit test to cover this.
I am curious why {{storage}} is null here. Is there anywhere that fails to 
synchronize on {{storageMap}} but should?

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png
>
>
> In our cluster, the NameNode throws an NPE when processing lifeline messages 
> sent by a DataNode, which corrupts the maxLoad value calculated by the NN.
> Because the DataNode is then identified as busy and no available node can be 
> allocated during DataNode selection, the placement loop keeps retrying, 
> driving CPU usage high and degrading the cluster's processing performance.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
> storage.receivedHeartbeat(report);  // the NPE occurs here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}
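On where the null comes from: the window is between pruneStorageMap during 
re-registration and the next full heartbeat that re-adds the storage, so any 
lifeline landing in that window looks up a storageID that is no longer in 
storageMap. A minimal, self-contained sketch of a defensive guard for that 
window (illustrative types, not the real HDFS classes and not necessarily what 
v001 does):

{code:java}
import java.util.HashMap;
import java.util.Map;

// Models the null guard: a lifeline report that arrives after the prune finds
// no storage entry and is skipped instead of dereferencing null.
public class LifelineNullGuardDemo {
  static final Map<String, String> storageMap = new HashMap<>();

  static void receiveReport(String storageId) {
    String storage;
    synchronized (storageMap) {            // mirror the synchronized lookup
      storage = storageMap.get(storageId);
    }
    if (storage == null) {
      // DN re-registered and the NN pruned its storages; drop the stale
      // report and let the next full heartbeat re-add the storage.
      System.out.println("skipping unknown storage " + storageId);
      return;
    }
    System.out.println("updating stats for " + storage);
  }

  public static void main(String[] args) {
    storageMap.put("DS-1", "NORMAL");
    receiveReport("DS-1");   // updates normally
    storageMap.clear();      // models DatanodeDescriptor#pruneStorageMap
    receiveReport("DS-1");   // skipped: no NPE
  }
}
{code}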



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190102#comment-17190102
 ] 

Xiaoqiao He commented on HDFS-15556:


[~haiyang Hu] Great catch here. v001 looks fair to me; it would be better to 
add a new unit test to cover this.
I am curious why {{storage}} is null here. Is there anywhere that fails to 
synchronize on {{storageMap}} but should?

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png
>
>
> In our cluster, the NameNode throws an NPE when processing lifeline messages 
> sent by a DataNode, which corrupts the maxLoad value calculated by the NN.
> Because the DataNode is then identified as busy and no available node can be 
> allocated during DataNode selection, the placement loop keeps retrying, 
> driving CPU usage high and degrading the cluster's processing performance.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
> storage.receivedHeartbeat(report);  // the NPE occurs here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15163) hdfs-2.10.0-webapps-secondary-status.html miss moment.js

2020-09-03 Thread Masatake Iwasaki (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated HDFS-15163:

Target Version/s: 2.10.1  (was: 2.10.0)

> hdfs-2.10.0-webapps-secondary-status.html miss moment.js
> 
>
> Key: HDFS-15163
> URL: https://issues.apache.org/jira/browse/HDFS-15163
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.10.0
>Reporter: 谢波
>Priority: Minor
> Fix For: 2.10.1
>
> Attachments: 微信截图_20200212183444.png
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> hdfs-2.10.0-webapps-secondary-status.html is missing moment.js
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15163) hdfs-2.10.0-webapps-secondary-status.html miss moment.js

2020-09-03 Thread Masatake Iwasaki (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated HDFS-15163:

Fix Version/s: (was: 2.10.1)

> hdfs-2.10.0-webapps-secondary-status.html miss moment.js
> 
>
> Key: HDFS-15163
> URL: https://issues.apache.org/jira/browse/HDFS-15163
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.10.0
>Reporter: 谢波
>Priority: Minor
> Attachments: 微信截图_20200212183444.png
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> hdfs-2.10.0-webapps-secondary-status.html is missing moment.js
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14694) Call recoverLease on DFSOutputStream close exception

2020-09-03 Thread Lisheng Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lisheng Sun updated HDFS-14694:
---
Attachment: HDFS-14694.010.patch

> Call recoverLease on DFSOutputStream close exception
> 
>
> Key: HDFS-14694
> URL: https://issues.apache.org/jira/browse/HDFS-14694
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Reporter: Chen Zhang
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, 
> HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, 
> HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch, 
> HDFS-14694.009.patch, HDFS-14694.010.patch
>
>
> HDFS uses file leases to manage open files; when a file is not closed 
> normally, the NN recovers the lease automatically after the hard limit is 
> exceeded. But for a long-running service (e.g. HBase), the hdfs-client never 
> dies, so the NN never gets a chance to recover the file.
> Usually the client program needs to handle exceptions itself to avoid this 
> condition (e.g. HBase automatically calls lease recovery for files that were 
> not closed normally), but in our experience most services (in our company) 
> don't handle this condition properly, which leaves lots of files in an 
> abnormal state or even causes data loss.
> This Jira proposes adding a feature that calls the recoverLease operation 
> automatically when DFSOutputStream close encounters an exception. It should 
> be disabled by default, but when somebody builds a long-running service based 
> on HDFS, they can enable this option.
> We've had this feature in our internal Hadoop distribution for more than 3 
> years; it's quite useful in our experience.
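
For comparison, what a long-running service can already do by hand is catch
the close failure and trigger lease recovery itself. The sketch below is a
hypothetical client-side workaround, not the proposed feature:
{{DistributedFileSystem#recoverLease(Path)}} is a real API, but the wrapper
class and method names are made up.

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public final class SafeClose {
  // Hypothetical helper: close the stream, and on failure ask the NN to
  // start lease recovery immediately instead of waiting for the hard limit.
  public static void closeOrRecover(DistributedFileSystem fs,
      FSDataOutputStream out, Path path) throws IOException {
    try {
      out.close();
    } catch (IOException e) {
      // recoverLease returns true when the file is already closed; false
      // means recovery was started and finishes asynchronously, so callers
      // can poll fs.isFileClosed(path) before reusing the file.
      fs.recoverLease(path);
      throw e;
    }
  }
}
{code}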



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189991#comment-17189991
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 9:30 AM:
---

1. NameNode CPU is high; the thread stack is:

{code:java}
"IPC Server handler 59 on 8020" #244 daemon prio=5 os_prio=0 
tid=0x7f18b0ff7800 nid=0x1c006 runnable [0x7f185cbfc000]
   java.lang.Thread.State: RUNNABLE
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at 
org.apache.hadoop.net.NetworkTopology.getNode(NetworkTopology.java:263)
at 
org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:678)
at 
org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
at 
org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:903)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:800)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:768)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseFromNextRack(BlockPlacementPolicyDefault.java:719)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:687)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:534)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:440)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:310)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:149)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:174)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2239)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2828)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:913)
{code}

  
2. There are a large number of logs, and in extreme cases no DataNode in the 
cluster satisfies the allocation:

{code:java}
2020-08-25 01:38:50,370 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not enough 
replicas was chosen. Reason:{NODE_TOO_BUSY=xxx}
2020-08-25 01:38:50,370 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
place enough replicas, still in need of 3 to reach 3 (unavailableStorages=...) 
... from storage xxx node DatanodeRegistration(:50010, datanodeUuid=xxx, 
infoPort=50075, infoSecurePort=0, ipcPort=50020, 
storageInfo=lv=-57;cid=xxx;nsid=;c=0), blocks: 2266, hasStaleStorage: false, 
processing time: 7 msecs, invalidatedBlocks: 0
2020-08-25 01:38:50,370 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not enough 
replicas was chosen. Reason:{NODE_TOO_BUSY=xxx}
{code}
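
For context on why the whole cluster ends up NODE_TOO_BUSY: the default
placement policy skips any node whose xceiver count exceeds a multiple of the
cluster-wide in-service average. A simplified sketch of that considerLoad
check (paraphrased, not the exact HDFS source):

{code:java}
// Paraphrased from BlockPlacementPolicyDefault; names simplified.
// If the NPE corrupts the aggregate xceiver statistics, maxLoad can become
// abnormally small, so every DataNode fails this test and is rejected with
// reason NODE_TOO_BUSY, and block allocation keeps retrying in a loop.
boolean excludeNodeByLoad(DatanodeDescriptor node,
    double considerLoadFactor, double inServiceXceiverAverage) {
  final double maxLoad = considerLoadFactor * inServiceXceiverAverage;
  return maxLoad > 0 && node.getXceiverCount() > maxLoad;
}
{code}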






was (Author: haiyang hu):
1. CPU NameNode high, thread stack is

{code:java}
"IPC Server handler 59 on 8020" #244 daemon prio=5 os_prio=0 
tid=0x7f18b0ff7800 nid=0x1c006 runnable [0x7f185cbfc000]
   java.lang.Thread.State: RUNNABLE
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at 
org.apache.hadoop.net.NetworkTopology.getNode(NetworkTopology.java:263)
at 
org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:678)
at 
org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
at 
org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:903)
at 

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189991#comment-17189991
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 9:25 AM:
---

1. NameNode CPU is high; the thread stack is:

{code:java}
"IPC Server handler 59 on 8020" #244 daemon prio=5 os_prio=0 
tid=0x7f18b0ff7800 nid=0x1c006 runnable [0x7f185cbfc000]
   java.lang.Thread.State: RUNNABLE
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at 
org.apache.hadoop.net.NetworkTopology.getNode(NetworkTopology.java:263)
at 
org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:678)
at 
org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
at 
org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:903)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:800)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:768)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseFromNextRack(BlockPlacementPolicyDefault.java:719)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:687)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:534)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:440)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:310)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:149)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:174)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2239)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2828)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:913)
{code}

  
2.


was (Author: haiyang hu):
# CPU NameNode high, thread stack is
  
# 

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png
>
>
> In our cluster, the NameNode hits an NPE when processing lifeline messages 
> sent by a DataNode, which causes the NN to miscalculate maxLoad. Because the 
> DataNode is then identified as busy and no available node can be allocated 
> when choosing a DataNode, the program loops, resulting in high CPU usage and 
> reduced cluster processing performance.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> 

[jira] [Commented] (HDFS-14351) RBF: Optimize configuration item resolving for monitor namenode

2020-09-03 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189992#comment-17189992
 ] 

Fei Hui commented on HDFS-14351:


Maybe it would be helpful to backport this to the other 3.x branches. Thanks

> RBF: Optimize configuration item resolving for monitor namenode
> ---
>
> Key: HDFS-14351
> URL: https://issues.apache.org/jira/browse/HDFS-14351
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: rbf
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0, HDFS-13891
>
> Attachments: HDFS-14351-HDFS-13891.001.patch, 
> HDFS-14351-HDFS-13891.002.patch, HDFS-14351-HDFS-13891.003.patch, 
> HDFS-14351-HDFS-13891.004.patch, HDFS-14351-HDFS-13891.005.patch, 
> HDFS-14351-HDFS-13891.006.patch, HDFS-14351.001.patch, HDFS-14351.002.patch
>
>
> We invoke {{configuration.get}} to resolve the configuration item 
> `dfs.federation.router.monitor.namenode` in `Router.java`, then split the 
> value by comma to get the nsId and nnId. This may confuse users, since it 
> does not tolerate blank spaces while other common parameters do. The 
> following segment shows an example whose resolution fails.
> {code:java}
> <property>
>   <name>dfs.federation.router.monitor.namenode</name>
>   <value>nameservice1.nn1, nameservice1.nn2</value>
>   <description>
>     The identifier of the namenodes to monitor and heartbeat.
>   </description>
> </property>
> {code}
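
For reference, a hedged sketch of whitespace-tolerant parsing (assumed shape
only; the committed patch may differ). {{Configuration#getTrimmedStrings}} is
a real Hadoop API that splits on commas and trims each token:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class MonitorNamenodeParse {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("dfs.federation.router.monitor.namenode",
        "nameservice1.nn1, nameservice1.nn2");
    // getTrimmedStrings tolerates the blank space after the comma, so both
    // entries resolve to clean "<nsId>.<nnId>" tokens.
    for (String nn : conf.getTrimmedStrings(
        "dfs.federation.router.monitor.namenode")) {
      String[] parts = nn.split("\\.");
      System.out.println("nsId=" + parts[0] + ", nnId=" + parts[1]);
    }
  }
}
{code}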



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189991#comment-17189991
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 9:24 AM:
---

# NameNode CPU is high; the thread stack is
  
# 


was (Author: haiyang hu):
# CPU NameNode high, thread stack is
  !NN-jstack.png! 
# 

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png
>
>
> In our cluster, the NameNode hits an NPE when processing lifeline messages 
> sent by a DataNode, which causes the NN to miscalculate maxLoad. Because the 
> DataNode is then identified as busy and no available node can be allocated 
> when choosing a DataNode, the program loops, resulting in high CPU usage and 
> reduced cluster processing performance.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: (was: NN-jstack.png)

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png
>
>
> In our cluster, the NameNode hits an NPE when processing lifeline messages 
> sent by a DataNode, which causes the NN to miscalculate maxLoad. Because the 
> DataNode is then identified as busy and no available node can be allocated 
> when choosing a DataNode, the program loops, resulting in high CPU usage and 
> reduced cluster processing performance.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: NN-jstack.png

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN-jstack.png
>
>
> In our cluster, the NameNode hits an NPE when processing lifeline messages 
> sent by a DataNode, which causes the NN to miscalculate maxLoad. Because the 
> DataNode is then identified as busy and no available node can be allocated 
> when choosing a DataNode, the program loops, resulting in high CPU usage and 
> reduced cluster processing performance.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189991#comment-17189991
 ] 

huhaiyang commented on HDFS-15556:
--

# NameNode CPU is high; the thread stack is:
  !NN-jstack.png! 
# 

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN-jstack.png
>
>
> In our cluster, the NameNode hits an NPE when processing lifeline messages 
> sent by a DataNode, which causes the NN to miscalculate maxLoad. Because the 
> DataNode is then identified as busy and no available node can be allocated 
> when choosing a DataNode, the program loops, resulting in high CPU usage and 
> reduced cluster processing performance.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: screenshot-1.png

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN-jstack.png
>
>
> In our cluster, the NameNode hits an NPE when processing lifeline messages 
> sent by a DataNode, which causes the NN to miscalculate maxLoad. Because the 
> DataNode is then identified as busy and no available node can be allocated 
> when choosing a DataNode, the program loops, resulting in high CPU usage and 
> reduced cluster processing performance.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: (was: screenshot-1.png)

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN-jstack.png
>
>
> In our cluster, the NameNode hits an NPE when processing lifeline messages 
> sent by a DataNode, which causes the NN to miscalculate maxLoad. Because the 
> DataNode is then identified as busy and no available node can be allocated 
> when choosing a DataNode, the program loops, resulting in high CPU usage and 
> reduced cluster processing performance.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: NN-CPU.png

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png
>
>
> In our cluster, the NameNode hits an NPE when processing lifeline messages 
> sent by a DataNode, which causes the NN to miscalculate maxLoad. Because the 
> DataNode is then identified as busy and no available node can be allocated 
> when choosing a DataNode, the program loops, resulting in high CPU usage and 
> reduced cluster processing performance.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: HDFS-15556.001.patch

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch
>
>
> In our cluster, the NameNode hits an NPE when processing lifeline messages 
> sent by a DataNode, which causes the NN to miscalculate maxLoad. Because the 
> DataNode is then identified as busy and no available node can be allocated 
> when choosing a DataNode, the program loops, resulting in high CPU usage and 
> reduced cluster processing performance.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: (was: NN-CPU.png)

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch
>
>
> In our cluster, the NameNode hits an NPE when processing lifeline messages 
> sent by a DataNode, which causes the NN to miscalculate maxLoad. Because the 
> DataNode is then identified as busy and no available node can be allocated 
> when choosing a DataNode, the program loops, resulting in high CPU usage and 
> reduced cluster processing performance.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: NN-CPU.png

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: NN-CPU.png
>
>
> In our cluster, the NameNode hits an NPE when processing lifeline messages 
> sent by a DataNode, which causes the NN to miscalculate maxLoad. Because the 
> DataNode is then identified as busy and no available node can be allocated 
> when choosing a DataNode, the program loops, resulting in high CPU usage and 
> reduced cluster processing performance.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode hits an NPE when processing lifeline messages 
sent by a DataNode, which causes the NN to miscalculate maxLoad. Because the 
DataNode is then identified as busy and no available node can be allocated 
when choosing a DataNode, the program loops, resulting in high CPU usage and 
reduced cluster processing performance.

*NameNode exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


{code:java}
// DatanodeDescriptor#updateStorageStats
...
for (StorageReport report : reports) {

  DatanodeStorageInfo storage = null;
  synchronized (storageMap) {
storage =
storageMap.get(report.getStorage().getStorageID());
  }
  if (checkFailedStorages) {
failedStorageInfos.remove(storage);
  }

  storage.receivedHeartbeat(report);  //  NPE exception occurred here 
  // skip accounting for capacity of PROVIDED storages!
  if (StorageType.PROVIDED.equals(storage.getStorageType())) {
continue;
  }
...
{code}


  was:
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
because DataNode is identified as busy and unable to allocate available nodes 
in choose  DataNode, program loop execution results in high CPU and reduces the 
processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at 

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode hits an NPE when processing lifeline messages 
sent by a DataNode, which causes the NN to miscalculate maxLoad. Because the 
DataNode is then identified as busy and no available node can be allocated 
when choosing a DataNode, the program loops, resulting in high CPU usage and 
reduced cluster processing performance.

*NameNode exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


{code:java}
// DatanodeDescriptor#updateStorageStats
...
for (StorageReport report : reports) {

  DatanodeStorageInfo storage = null;
  synchronized (storageMap) {
storage =
storageMap.get(report.getStorage().getStorageID());
  }
  if (checkFailedStorages) {
failedStorageInfos.remove(storage);
  }

  storage.receivedHeartbeat(report);  //  NPE exception occurred here 
  // skip accounting for capacity of PROVIDED storages!
  if (StorageType.PROVIDED.equals(storage.getStorageType())) {
continue;
  }
...
{code}


  was:
In our cluster, the NameNode throws an NPE when processing lifeline messages sent 
by the DataNode, which causes the NameNode to compute an incorrect maxLoad. 
In chooseDataNode, because the DataNode is identified as busy and no available 
nodes can be allocated, the program loops, resulting in high CPU and reduced 
processing performance of the cluster.

*NameNode exception stack trace*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8022, call Call#20535 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from x:34766
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
    at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
    at org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
    at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
    at org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
    at org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
{code}
