[jira] [Commented] (HDFS-14694) Call recoverLease on DFSOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190570#comment-17190570 ] Lisheng Sun commented on HDFS-14694: The failed UT is not related to this patch. > Call recoverLease on DFSOutputStream close exception > > > Key: HDFS-14694 > URL: https://issues.apache.org/jira/browse/HDFS-14694 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Chen Zhang >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, > HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, > HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch, > HDFS-14694.009.patch, HDFS-14694.010.patch, HDFS-14694.011.patch > > > HDFS uses file leases to manage open files; when a file is not closed > normally, the NN recovers the lease automatically after the hard limit is exceeded. But > for a long-running service (e.g. HBase), the hdfs-client never dies, so the NN > never gets a chance to recover the file. > Usually client programs need to handle exceptions themselves to avoid this > condition (e.g. HBase automatically calls recoverLease for files that were not > closed normally), but in our experience most services (in our company) don't > handle this condition properly, which causes lots of files in abnormal > status or even data loss. > This Jira proposes adding a feature that calls the recoverLease operation > automatically when DFSOutputStream close encounters an exception. It should be > disabled by default, but anyone building a long-running service on > HDFS can enable this option. > We've had this feature in our internal Hadoop distribution for more than 3 > years, and it has been quite useful in our experience. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
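The close-then-recover pattern the issue describes can be sketched as a small standalone program. The `Stream` and `LeaseRecoverer` types below are hypothetical stand-ins for DFSOutputStream and the client's recoverLease call, so the sketch runs without Hadoop; swallowing the close failure is also a simplification of whatever the real patch does.

```java
import java.io.IOException;

// Standalone sketch of the proposed behavior: if close() throws, trigger
// lease recovery so the NN can release the file instead of the lease
// lingering until a long-running client finally dies. Stand-in types only.
public class CloseRecoverySketch {

    interface Stream { void close() throws IOException; }

    interface LeaseRecoverer { void recoverLease(String path) throws IOException; }

    /** Returns true if lease recovery was invoked. */
    static boolean closeWithRecovery(Stream out, String path,
            LeaseRecoverer client, boolean recoverOnCloseException)
            throws IOException {
        try {
            out.close();
            return false;
        } catch (IOException e) {
            if (recoverOnCloseException) {
                client.recoverLease(path); // ask the NN to recover the lease
                return true;
            }
            throw e; // feature disabled (the default): behave as before
        }
    }

    public static void main(String[] args) throws IOException {
        final boolean[] recovered = {false};
        boolean invoked = closeWithRecovery(
            () -> { throw new IOException("pipeline failure"); },
            "/tmp/f",
            p -> { recovered[0] = true; },
            true);
        System.out.println(invoked + " " + recovered[0]); // prints "true true"
    }
}
```

Keeping the flag off by default, as the issue proposes, preserves today's behavior of propagating the close exception.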
[jira] [Commented] (HDFS-14694) Call recoverLease on DFSOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190568#comment-17190568 ] Hadoop QA commented on HDFS-14694: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 20s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 10s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 32s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 33s{color} | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 5s{color} | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 12s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 18m 35s{color} | {color:green} branch 
has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 27s{color} | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 56s{color} | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 3m 13s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 46s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 24s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 31s{color} | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 4m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 59s{color} | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 59s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 2s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | 
{color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 13s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 23s{color} | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 48s{color} | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 6m 2s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || |
[jira] [Work logged] (HDFS-15554) RBF: force router check file existence in destinations before adding/updating mount points
[ https://issues.apache.org/jira/browse/HDFS-15554?focusedWorklogId=478944=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478944 ] ASF GitHub Bot logged work on HDFS-15554: - Author: ASF GitHub Bot Created on: 04/Sep/20 05:44 Start Date: 04/Sep/20 05:44 Worklog Time Spent: 10m Work Description: fengnanli commented on a change in pull request #2266: URL: https://github.com/apache/hadoop/pull/2266#discussion_r483399376 ## File path: hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterAdminServer.java ## @@ -562,11 +595,35 @@ public GetDestinationResponse getDestination( LOG.error("Cannot get location for {}: {}", src, ioe.getMessage()); } -if (nsIds.isEmpty() && !locations.isEmpty()) { - String nsId = locations.get(0).getNameserviceId(); - nsIds.add(nsId); +return nsIds; + } + + /** + * Verify the file exists in destination nameservices to avoid dangling + * mount points. + * + * @param entry the new mount points added, could be from add or update. + * @return destination nameservices where the file doesn't exist. + * @throws IOException + */ + private List<String> verifyFileInDestinations(MountTable entry) Review comment: @goiri Uploaded an early version of trying to fix all tests. This is pretty tedious work, so before I spend more time on it, let me know your thoughts. There are mainly two types of tests dealing with the mount table: 1. Tests that use a mock RouterRpcServer and so on, so no downstream namenode calls are made. I added the mock as well; see the change to TestRouterAdmin.java. 2. Tests that use real downstream namenode interaction; see TestRouterMountTable.java. I created the paths before calling the mount point change. A much easier way, I keep thinking, would be to add a Router server-side config to turn this check on, with the default on. In the tests I could just turn the config off explicitly, and that way I wouldn't need to deal with individual tests. This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 478944) Time Spent: 1.5h (was: 1h 20m) > RBF: force router check file existence in destinations before adding/updating > mount points > -- > > Key: HDFS-15554 > URL: https://issues.apache.org/jira/browse/HDFS-15554 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Fengnan Li >Assignee: Fengnan Li >Priority: Minor > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > > Adding/updating mount points is currently a router-only action, with no > validation in the downstream namenodes of the destination files/directories. > In practice we have ended up with dangling mount points: when clients call > listStatus the file is returned, but when they then try to access > the file a FileNotFoundException is thrown.
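A hedged sketch of the verification the review comment discusses: given a mount entry's destinations, return the nameservices where the target path does not exist. The `Destination` record and the `exists` predicate are illustrative stand-ins for the actual RouterAdminServer lookup (which would issue a file-status call to each downstream nameservice):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch only: Destination and the exists predicate stand in for the real
// mount-table entry and the downstream getFileInfo() call.
public class VerifyDestinationsSketch {

    record Destination(String nameservice, String path) {}

    // Returns the nameservices whose destination path is missing; a
    // non-empty result means the new mount point would be dangling.
    static List<String> verifyFileInDestinations(
            List<Destination> destinations, Predicate<Destination> exists) {
        List<String> missing = new ArrayList<>();
        for (Destination d : destinations) {
            if (!exists.test(d)) {
                missing.add(d.nameservice());
            }
        }
        return missing;
    }
}
```

A server-side config flag, as the comment suggests, would simply bypass this check when disabled.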
[jira] [Commented] (HDFS-13678) StorageType is incompatible when rolling upgrade to 2.6/2.6+ versions
[ https://issues.apache.org/jira/browse/HDFS-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190567#comment-17190567 ] Masatake Iwasaki commented on HDFS-13678: - updated the target version for preparing 2.10.1 release. > StorageType is incompatible when rolling upgrade to 2.6/2.6+ versions > - > > Key: HDFS-13678 > URL: https://issues.apache.org/jira/browse/HDFS-13678 > Project: Hadoop HDFS > Issue Type: Bug > Components: rolling upgrades >Affects Versions: 2.5.0 >Reporter: Yiqun Lin >Priority: Major > > In version 2.6.0, we supported more storage types in HDFS, implemented in > HDFS-6584. But this appears to be an incompatible change: when we rolling-upgrade our > cluster from 2.5.0 to 2.6.0, the following error is thrown. > {noformat} > 2018-06-14 11:43:39,246 ERROR [DataNode: > [[[DISK]file:/home/vipshop/hard_disk/dfs/, [DISK]file:/data1/dfs/, > [DISK]file:/data2/dfs/]] heartbeating to xx.xx.xx.xx:8022] > org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService > for Block pool BP-670256553-xx.xx.xx.xx-1528795419404 (Datanode Uuid > ab150e05-fcb7-49ed-b8ba-f05c27593fee) service to xx.xx.xx.xx:8022 > java.lang.ArrayStoreException > at java.util.ArrayList.toArray(ArrayList.java:412) > at > java.util.Collections$UnmodifiableCollection.toArray(Collections.java:1034) > at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1030) > at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:836) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:146) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:566) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:664) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:835) > at java.lang.Thread.run(Thread.java:748) > {noformat} > The scenario is that the old DN fails to parse the StorageType it got from the new NN. > This error occurs while sending heartbeats to the NN, so blocks won't be > reported to the NN successfully, which leads to subsequent errors. > Corresponding logic in 2.5.0: > {code} > public static BlockCommand convert(BlockCommandProto blkCmd) { > ... > StorageType[][] targetStorageTypes = new StorageType[targetList.size()][]; > List<StorageTypesProto> targetStorageTypesList = > blkCmd.getTargetStorageTypesList(); > if (targetStorageTypesList.isEmpty()) { // missing storage types > for(int i = 0; i < targetStorageTypes.length; i++) { > targetStorageTypes[i] = new StorageType[targets[i].length]; > Arrays.fill(targetStorageTypes[i], StorageType.DEFAULT); > } > } else { > for(int i = 0; i < targetStorageTypes.length; i++) { > List<StorageTypeProto> p = > targetStorageTypesList.get(i).getStorageTypesList(); > targetStorageTypes[i] = p.toArray(new StorageType[p.size()]); <== > error here > } > } > {code} > Given the current logic, it would be better to return the > default type instead of throwing an exception in case StorageType changes (new fields > or new types are added) in newer versions during a rolling upgrade. > {code:java} > public static StorageType convertStorageType(StorageTypeProto type) { > switch(type) { > case DISK: > return StorageType.DISK; > case SSD: > return StorageType.SSD; > case ARCHIVE: > return StorageType.ARCHIVE; > case RAM_DISK: > return StorageType.RAM_DISK; > case PROVIDED: > return StorageType.PROVIDED; > default: > throw new IllegalStateException( > "BUG: StorageTypeProto not found, type=" + type); > } > } > {code}
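The fallback the reporter suggests can be sketched standalone. The enums below are simplified stand-ins for Hadoop's StorageTypeProto and StorageType; in HDFS, StorageType.DEFAULT is DISK:

```java
// Sketch of the suggested fix: an unrecognized proto value degrades to
// StorageType.DEFAULT instead of throwing, so an old DN can tolerate
// storage types added in newer releases during a rolling upgrade.
// Simplified stand-in enums, not the Hadoop classes.
public class StorageTypeFallbackSketch {

    enum StorageTypeProto { DISK, SSD, ARCHIVE, RAM_DISK, PROVIDED }

    enum StorageType {
        DISK, SSD, ARCHIVE, RAM_DISK, PROVIDED;
        static final StorageType DEFAULT = DISK;
    }

    static StorageType convertStorageType(StorageTypeProto type) {
        switch (type) {
            case DISK:     return StorageType.DISK;
            case SSD:      return StorageType.SSD;
            case ARCHIVE:  return StorageType.ARCHIVE;
            case RAM_DISK: return StorageType.RAM_DISK;
            case PROVIDED: return StorageType.PROVIDED;
            default:
                // Unknown / newer type: fall back rather than fail the
                // heartbeat with an exception.
                return StorageType.DEFAULT;
        }
    }
}
```

Falling back silently trades strictness for availability during the upgrade window, which is the trade-off the report argues for.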
[jira] [Updated] (HDFS-13678) StorageType is incompatible when rolling upgrade to 2.6/2.6+ versions
[ https://issues.apache.org/jira/browse/HDFS-13678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated HDFS-13678: Target Version/s: 2.9.3, 2.10.2 (was: 2.9.3, 2.10.1)
[jira] [Commented] (HDFS-14794) [SBN read] reportBadBlock is rejected by Observer.
[ https://issues.apache.org/jira/browse/HDFS-14794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190565#comment-17190565 ] Masatake Iwasaki commented on HDFS-14794: - updated the target version for preparing 2.10.1 release. > [SBN read] reportBadBlock is rejected by Observer. > -- > > Key: HDFS-14794 > URL: https://issues.apache.org/jira/browse/HDFS-14794 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Priority: Major > > {{reportBadBlock}} is rejected by the Observer via a StandbyException: > {code}StandbyException: Operation category WRITE is not supported in state > observer{code} > We should investigate what the consequences of this are and whether we should > treat {{reportBadBlock}} like IBRs. Note that {{reportBadBlock}} is part of > both {{ClientProtocol}} and {{DatanodeProtocol}}.
[jira] [Updated] (HDFS-14794) [SBN read] reportBadBlock is rejected by Observer.
[ https://issues.apache.org/jira/browse/HDFS-14794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated HDFS-14794: Target Version/s: 2.10.2 (was: 2.10.1)
[jira] [Commented] (HDFS-15004) Refactor TestBalancer for faster execution.
[ https://issues.apache.org/jira/browse/HDFS-15004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190564#comment-17190564 ] Masatake Iwasaki commented on HDFS-15004: - updated the target version for preparing 2.10.1 release. > Refactor TestBalancer for faster execution. > --- > > Key: HDFS-15004 > URL: https://issues.apache.org/jira/browse/HDFS-15004 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs, test >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Assignee: Simbarashe Dzinamarira >Priority: Major > > {{TestBalancer}} is a big test by itself, and it is also a part of many other > tests. Running these tests involves spinning up a {{MiniDFSCluster}} and > shutting it down for every test case, which is inefficient. Many of the test > cases can run against the same instance of {{MiniDFSCluster}}, but not all of > them. It would be good to refactor the tests to optimize their running time.
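The refactoring idea, sketched without JUnit or Hadoop: keep one lazily started, shared fixture for the test cases that tolerate sharing, so the expensive startup happens once. `MiniCluster` here is a hypothetical stand-in for `MiniDFSCluster`:

```java
// Sketch: a shared fixture created once and reused across test cases,
// instead of a fresh (expensive) instance per test.
public class SharedClusterSketch {

    // Stand-in for MiniDFSCluster; startups counts the expensive boots.
    static final class MiniCluster {
        static int startups = 0;
        MiniCluster() { startups++; }
        void shutdown() {}
    }

    private static MiniCluster shared;

    // Test cases that can share call this; incompatible ones construct
    // their own private MiniCluster as before.
    static MiniCluster getSharedCluster() {
        if (shared == null) {
            shared = new MiniCluster(); // pay the startup cost only once
        }
        return shared;
    }
}
```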
[jira] [Commented] (HDFS-15037) Encryption Zone operations should not block other RPC calls while retrieving encryption keys.
[ https://issues.apache.org/jira/browse/HDFS-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190563#comment-17190563 ] Masatake Iwasaki commented on HDFS-15037: - updated the target version for preparing 2.10.1 release. > Encryption Zone operations should not block other RPC calls while retrieving > encryption keys. > - > > Key: HDFS-15037 > URL: https://issues.apache.org/jira/browse/HDFS-15037 > Project: Hadoop HDFS > Issue Type: Bug > Components: encryption, namenode >Affects Versions: 2.10.0 >Reporter: Konstantin Shvachko >Priority: Major > > I believe the intention was to avoid blocking other operations while > retrieving keys by holding only {{FSDirectory.dirLock}}. But in reality all > other operations enter {{FSNamesystemLock}} first, then {{dirLock}}, so they > are all blocked waiting for the key. > We see a substantial increase in RPC wait time ({{RpcQueueTimeAvgTime}}) on the > NameNode when encryption operations are intermixed with regular workloads.
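The hazard described above, in a standalone sketch: a slow call (such as a remote KMS key fetch) made while holding the lock every RPC needs stalls them all, whereas performing the retrieval before taking the lock keeps the critical section short. Lock and method names here are illustrative, not the FSNamesystem API:

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Sketch of the two orderings; namesystemLock stands in for FSNamesystemLock.
public class KeyFetchLockingSketch {

    private final ReentrantLock namesystemLock = new ReentrantLock();

    // Anti-pattern: the slow fetch runs inside the shared lock, so every
    // other RPC queues behind it.
    String createZoneHoldingLock(Supplier<String> slowKeyFetch) {
        namesystemLock.lock();
        try {
            return slowKeyFetch.get();
        } finally {
            namesystemLock.unlock();
        }
    }

    // Preferred: fetch the key first with no locks held, then take the
    // lock only for the fast in-memory update.
    String createZoneOutsideLock(Supplier<String> slowKeyFetch) {
        String key = slowKeyFetch.get();
        namesystemLock.lock();
        try {
            return key;
        } finally {
            namesystemLock.unlock();
        }
    }
}
```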
[jira] [Updated] (HDFS-15004) Refactor TestBalancer for faster execution.
[ https://issues.apache.org/jira/browse/HDFS-15004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated HDFS-15004: Target Version/s: 2.10.2 (was: 2.10.1)
[jira] [Updated] (HDFS-15037) Encryption Zone operations should not block other RPC calls while retrieving encryption keys.
[ https://issues.apache.org/jira/browse/HDFS-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated HDFS-15037: Target Version/s: 2.10.2 (was: 2.10.1)
[jira] [Commented] (HDFS-15163) hdfs-2.10.0-webapps-secondary-status.html miss moment.js
[ https://issues.apache.org/jira/browse/HDFS-15163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190562#comment-17190562 ] Masatake Iwasaki commented on HDFS-15163: - updated the target version for preparing 2.10.1 release. > hdfs-2.10.0-webapps-secondary-status.html miss moment.js > > > Key: HDFS-15163 > URL: https://issues.apache.org/jira/browse/HDFS-15163 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.10.0 >Reporter: 谢波 >Priority: Minor > Attachments: 微信截图_20200212183444.png > > Original Estimate: 96h > Remaining Estimate: 96h > > hdfs-2.10.0-webapps-secondary-status.html miss moment.js
[jira] [Updated] (HDFS-15163) hdfs-2.10.0-webapps-secondary-status.html miss moment.js
[ https://issues.apache.org/jira/browse/HDFS-15163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated HDFS-15163: Target Version/s: 2.10.2 (was: 2.10.1)
[jira] [Updated] (HDFS-15357) Do not trust bad block reports from clients
[ https://issues.apache.org/jira/browse/HDFS-15357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated HDFS-15357: Target Version/s: 3.4.0, 2.10.2 (was: 2.10.1, 3.4.0) > Do not trust bad block reports from clients > --- > > Key: HDFS-15357 > URL: https://issues.apache.org/jira/browse/HDFS-15357 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kihwal Lee >Priority: Major > > {{reportBadBlocks()}} is implemented by both ClientNamenodeProtocol and > DatanodeProtocol. When DFSClient calls it, a faulty client can cause > data availability issues in a cluster. > In the past we had such an incident, where a node with a faulty NIC was > randomly corrupting data. All clients running on that machine reported all > accessed blocks, and all associated replicas, to be corrupt. More recently, a > single faulty client process caused a small number of missing blocks. In > all cases, the actual data was fine. > Bad block reports from clients shouldn't be trusted blindly. Instead, the > namenode should send a datanode command to verify the claim. A bonus would be > to keep the record for a while and ignore repeated reports from the same > nodes. > At minimum, there should be an option to ignore bad block reports from > clients, perhaps after logging them. A very crude way would be to short-circuit > in {{ClientNamenodeProtocolServerSideTranslatorPB#reportBadBlocks()}}. > A more sophisticated way would be to check for the datanode user name in > {{FSNamesystem#reportBadBlocks()}} so that it can be easily logged, or > optionally do further processing.
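The minimal option described above can be sketched as a caller-identity gate; the user-name check and the counter are illustrative, not the actual FSNamesystem code:

```java
import java.util.List;

// Sketch: trust bad-block reports only from the datanode user; log and
// drop client-originated ones instead of marking replicas corrupt on a
// possibly faulty client's say-so.
public class BadBlockReportGateSketch {

    static int trustedReports = 0;

    static void reportBadBlocks(String callerUser, String datanodeUser,
                                List<String> blockIds) {
        if (!callerUser.equals(datanodeUser)) {
            // Client-originated report: record the claim but take no action.
            System.out.println("Ignoring bad block report from client "
                + callerUser + ": " + blockIds);
            return;
        }
        trustedReports += blockIds.size(); // datanode report: process it
    }
}
```

The verification-command and repeated-report-suppression ideas in the description would layer on top of this gate.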
[jira] [Commented] (HDFS-15357) Do not trust bad block reports from clients
[ https://issues.apache.org/jira/browse/HDFS-15357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190559#comment-17190559 ] Masatake Iwasaki commented on HDFS-15357: - updated the target version for preparing 2.10.1 release.
[jira] [Commented] (HDFS-14277) [SBN read] Observer benchmark results
[ https://issues.apache.org/jira/browse/HDFS-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190554#comment-17190554 ] Masatake Iwasaki commented on HDFS-14277: - I set the target version to 2.10.2 for preparing release of 2.10.1. [~weichiu] Let me know if this should be blocker of 2.10.1. > [SBN read] Observer benchmark results > - > > Key: HDFS-14277 > URL: https://issues.apache.org/jira/browse/HDFS-14277 > Project: Hadoop HDFS > Issue Type: Task > Components: ha, namenode >Affects Versions: 2.10.0, 3.3.0 > Environment: Hardware: 4-node cluster, each node has 4 core, Xeon > 2.5Ghz, 25GB memory. > Software: CentOS 7.4, CDH 6.0 + Consistent Reads from Standby, Kerberos, SSL, > RPC encryption + Data Transfer Encryption, Cloudera Navigator. >Reporter: Wei-Chiu Chuang >Priority: Blocker > Attachments: Observer profiler.png, Screen Shot 2019-02-14 at > 11.50.37 AM.png, observer RPC queue processing time.png > > > Ran a few benchmarks and profiler (VisualVM) today on an Observer-enabled > cluster. Would like to share the results with the community. The cluster has > 1 Observer node. > h2. NNThroughputBenchmark > Generate 1 million files and send fileStatus RPCs. > {code:java} > hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs > -op fileStatus -threads 100 -files 100 -useExisting > -keepResults > {code} > h3. Kerberos, SSL, RPC encryption, Data Transfer Encryption enabled: > ||Node||fileStatus (Ops per sec)|| > |Active NameNode|4865| > |Observer|3996| > h3. Kerberos, SSL: > ||Node||fileStatus (Ops per sec)|| > |Active NameNode|7078| > |Observer|6459| > Observation: > * due to the edit tailing overhead, Observer node consume 30% CPU > utilization even if the cluster is idle. > * While Active NN has less than 1ms RPC processing time, Observer node has > > 5ms RPC processing time. I am still looking for the source of the longer > processing time. 
The longer RPC processing time may be the cause of the > performance degradation compared to that of the Active NN. Note that the cluster has > Cloudera Navigator installed, which adds additional overhead to RPC processing > time. > * {{GlobalStateIdContext#isCoordinatedCall()}} pops up as one of the top > hotspots in the profiler. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14277) [SBN read] Observer benchmark results
[ https://issues.apache.org/jira/browse/HDFS-14277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated HDFS-14277: Target Version/s: 2.10.2 (was: 2.10.1) > [SBN read] Observer benchmark results > - > > Key: HDFS-14277 > URL: https://issues.apache.org/jira/browse/HDFS-14277 > Project: Hadoop HDFS > Issue Type: Task > Components: ha, namenode >Affects Versions: 2.10.0, 3.3.0 > Environment: Hardware: 4-node cluster, each node has 4 cores, Xeon > 2.5GHz, 25GB memory. > Software: CentOS 7.4, CDH 6.0 + Consistent Reads from Standby, Kerberos, SSL, > RPC encryption + Data Transfer Encryption, Cloudera Navigator. >Reporter: Wei-Chiu Chuang >Priority: Blocker > Attachments: Observer profiler.png, Screen Shot 2019-02-14 at > 11.50.37 AM.png, observer RPC queue processing time.png > > > Ran a few benchmarks and a profiler (VisualVM) today on an Observer-enabled > cluster. Would like to share the results with the community. The cluster has > 1 Observer node. > h2. NNThroughputBenchmark > Generate 1 million files and send fileStatus RPCs. > {code:java} > hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs > -op fileStatus -threads 100 -files 100 -useExisting > -keepResults > {code} > h3. Kerberos, SSL, RPC encryption, Data Transfer Encryption enabled: > ||Node||fileStatus (Ops per sec)|| > |Active NameNode|4865| > |Observer|3996| > h3. Kerberos, SSL: > ||Node||fileStatus (Ops per sec)|| > |Active NameNode|7078| > |Observer|6459| > Observation: > * Due to the edit tailing overhead, the Observer node consumes 30% CPU > utilization even when the cluster is idle. > * While the Active NN has less than 1ms RPC processing time, the Observer node has > > 5ms RPC processing time. I am still looking for the source of the longer > processing time. The longer RPC processing time may be the cause of the > performance degradation compared to that of the Active NN. 
Note that the cluster has > Cloudera Navigator installed, which adds additional overhead to RPC processing > time. > * {{GlobalStateIdContext#isCoordinatedCall()}} pops up as one of the top > hotspots in the profiler. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14694) Call recoverLease on DFSOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190546#comment-17190546 ] Hadoop QA commented on HDFS-14694: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 38s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 11s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 58s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 0s{color} | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 39s{color} | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 57s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 11s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 58s{color} | {color:green} branch 
has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 30s{color} | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 4s{color} | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 2m 55s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 16s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 26s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 56s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 54s{color} | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 34s{color} | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 34s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 50s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | 
{color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 46s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 20s{color} | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 55s{color} | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 26s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || |
[jira] [Work logged] (HDFS-15551) Tiny Improve for DeadNode detector
[ https://issues.apache.org/jira/browse/HDFS-15551?focusedWorklogId=478934=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478934 ] ASF GitHub Bot logged work on HDFS-15551: - Author: ASF GitHub Bot Created on: 04/Sep/20 04:49 Start Date: 04/Sep/20 04:49 Worklog Time Spent: 10m Work Description: leosunli commented on a change in pull request #2265: URL: https://github.com/apache/hadoop/pull/2265#discussion_r483382751 ## File path: hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DeadNodeDetector.java ## @@ -475,6 +475,7 @@ public synchronized void addNodeToDetect(DFSInputStream dfsInputStream, datanodeInfos.add(datanodeInfo); } +LOG.warn("Add datanode {} to suspectAndDeadNodes", datanodeInfo); Review comment: One case: when there are a lot of invalid replicas, will this log flood? ## File path: hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DeadNodeDetector.java ## @@ -396,13 +395,13 @@ private void probeCallBack(Probe probe, boolean success) { probe.getDatanodeInfo()); removeDeadNode(probe.getDatanodeInfo()); } else if (probe.getType() == ProbeType.CHECK_SUSPECT) { -LOG.debug("Remove the node out from suspect node list: {}.", +LOG.info("Remove the node out from suspect node list: {}.", Review comment: When there are a lot of invalid replicas, there should be many suspect nodes but not dead nodes, and all of these nodes will print this log. What is the purpose of printing it? The client can still access a suspect node normally. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 478934) Time Spent: 50m (was: 40m) > Tiny Improve for DeadNode detector > -- > > Key: HDFS-15551 > URL: https://issues.apache.org/jira/browse/HDFS-15551 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs-client >Affects Versions: 3.3.0 >Reporter: dark_num >Assignee: imbajin >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 50m > Remaining Estimate: 0h > > # add or improve some logs for adding local & global deadnodes > # logic improve > # fix typo -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
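The log-flood concern raised in the review above is commonly addressed by per-key throttling: emit at most one message per datanode per interval, so a burst of invalid replicas cannot spam the log. The sketch below is illustrative only; the class and method names are invented for the example and this is not actual DeadNodeDetector code (Hadoop also ships log-throttling utilities that a real patch might reuse):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of per-key log throttling: shouldLog returns true at most once
 * per key per interval. Names are hypothetical, for illustration only.
 */
public class PerKeyLogThrottle {
  private final long intervalMs;
  private final Map<String, Long> lastLogged = new ConcurrentHashMap<>();

  public PerKeyLogThrottle(long intervalMs) {
    this.intervalMs = intervalMs;
  }

  /** Returns true if the caller should log for this key at time nowMs. */
  public boolean shouldLog(String key, long nowMs) {
    Long prev = lastLogged.get(key);
    if (prev == null || nowMs - prev >= intervalMs) {
      lastLogged.put(key, nowMs);
      return true;
    }
    return false;
  }
}
```

The caller would wrap the noisy statement, e.g. log "Add datanode ... to suspectAndDeadNodes" only when shouldLog(datanodeId, now) is true. (The check-then-put here is not atomic; under contention a couple of extra lines may slip through, which is acceptable for throttling.)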
[jira] [Commented] (HDFS-15551) Tiny Improve for DeadNode detector
[ https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190537#comment-17190537 ] Lisheng Sun commented on HDFS-15551: Yeah, I will review it soon. > Tiny Improve for DeadNode detector > -- > > Key: HDFS-15551 > URL: https://issues.apache.org/jira/browse/HDFS-15551 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs-client >Affects Versions: 3.3.0 >Reporter: dark_num >Assignee: imbajin >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 40m > Remaining Estimate: 0h > > # add or improve some logs for adding local & global deadnodes > # logic improve > # fix typo -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15557) Log the reason why a storage log file can't be deleted
[ https://issues.apache.org/jira/browse/HDFS-15557?focusedWorklogId=478927=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478927 ] ASF GitHub Bot logged work on HDFS-15557: - Author: ASF GitHub Bot Created on: 04/Sep/20 04:21 Start Date: 04/Sep/20 04:21 Worklog Time Spent: 10m Work Description: liuml07 commented on a change in pull request #2274: URL: https://github.com/apache/hadoop/pull/2274#discussion_r483377105 ## File path: hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/util/AtomicFileOutputStream.java ## @@ -75,8 +76,13 @@ public void close() throws IOException { boolean renamed = tmpFile.renameTo(origFile); if (!renamed) { // On windows, renameTo does not replace. - if (origFile.exists() && !origFile.delete()) { -throw new IOException("Could not delete original file " + origFile); + if (origFile.exists()) { +try { + Files.delete(origFile.toPath()); +} catch (IOException e) { + throw new IOException("Could not delete original file " + origFile Review comment: Is it simpler ``` throw new IOException("Could not delete original file " + origFile, e); ``` Other than that, +1 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 478927) Time Spent: 50m (was: 40m) > Log the reason why a storage log file can't be deleted > -- > > Key: HDFS-15557 > URL: https://issues.apache.org/jira/browse/HDFS-15557 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ye Ni >Assignee: Ye Ni >Priority: Minor > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > Before > > {code:java} > 2020-09-02 06:48:31,983 WARN [IPC Server handler 206 on 8020] > org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage > failed on Storage Directory root= K:\data\hdfs\namenode; location= null; > type= IMAGE; isShared= false; lock= > sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= > null java.io.IOException: Could not delete original file > K:\data\hdfs\namenode\current\seen_txid{code} > > After > > {code:java} > 2020-09-02 17:43:29,421 WARN [IPC Server handler 111 on 8020] > org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage > failed on Storage Directory root= K:\data\hdfs\namenode; location= null; > type= IMAGE; isShared= false; lock= > sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= > null java.io.IOException: Could not delete original file > K:\data\hdfs\namenode\current\seen_txid due to failure: > java.nio.file.FileSystemException: K:\data\hdfs\namenode\current\seen_txid: > The process cannot access the file because it is being used by another > process.{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
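The change under review swaps {{File#delete()}}, which only returns a boolean, for {{java.nio.file.Files#delete(Path)}}, which throws an IOException describing why the deletion failed; the reviewer additionally suggests the two-argument IOException constructor so the underlying cause stays attached to the rethrown message. A minimal self-contained illustration of the difference (not the actual AtomicFileOutputStream code):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Shows why Files.delete gives better diagnostics than File.delete. */
public class DeleteDiagnostics {

  /** Old style: on failure we only learn "false", not the reason. */
  public static boolean deleteQuietly(File f) {
    return f.delete();
  }

  /**
   * New style: failure surfaces as an exception carrying the reason
   * (e.g. "being used by another process" on Windows), which we chain
   * into our own message via the two-argument constructor.
   */
  public static void deleteWithReason(Path p) throws IOException {
    try {
      Files.delete(p);
    } catch (IOException e) {
      // Chaining keeps the underlying cause attached for the log.
      throw new IOException("Could not delete original file " + p, e);
    }
  }
}
```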
[jira] [Commented] (HDFS-15551) Tiny Improve for DeadNode detector
[ https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190526#comment-17190526 ] Xiaoqiao He commented on HDFS-15551: Thanks [~imbajin] for involving me here. Added [~imbajin] to the contributor list and assigned this JIRA to him. [~leosun08] would you like to take another review? > Tiny Improve for DeadNode detector > -- > > Key: HDFS-15551 > URL: https://issues.apache.org/jira/browse/HDFS-15551 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs-client >Affects Versions: 3.3.0 >Reporter: dark_num >Assignee: imbajin >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 40m > Remaining Estimate: 0h > > # add or improve some logs for adding local & global deadnodes > # logic improve > # fix typo -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-15551) Tiny Improve for DeadNode detector
[ https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaoqiao He reassigned HDFS-15551: -- Assignee: imbajin (was: dark_num) > Tiny Improve for DeadNode detector > -- > > Key: HDFS-15551 > URL: https://issues.apache.org/jira/browse/HDFS-15551 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs-client >Affects Versions: 3.3.0 >Reporter: dark_num >Assignee: imbajin >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 40m > Remaining Estimate: 0h > > # add or improve some logs for adding local & global deadnodes > # logic improve > # fix typo -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-15551) Tiny Improve for DeadNode detector
[ https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaoqiao He reassigned HDFS-15551: -- Assignee: dark_num > Tiny Improve for DeadNode detector > -- > > Key: HDFS-15551 > URL: https://issues.apache.org/jira/browse/HDFS-15551 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs-client >Affects Versions: 3.3.0 >Reporter: dark_num >Assignee: dark_num >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 40m > Remaining Estimate: 0h > > # add or improve some logs for adding local & global deadnodes > # logic improve > # fix typo -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15557) Log the reason why a storage log file can't be deleted
[ https://issues.apache.org/jira/browse/HDFS-15557?focusedWorklogId=478920=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478920 ] ASF GitHub Bot logged work on HDFS-15557: - Author: ASF GitHub Bot Created on: 04/Sep/20 03:29 Start Date: 04/Sep/20 03:29 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2274: URL: https://github.com/apache/hadoop/pull/2274#issuecomment-686880097 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Comment | |::|--:|:|:| | +0 :ok: | reexec | 0m 30s | Docker mode activated. | ||| _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | No case conflicting files found. | | +1 :green_heart: | @author | 0m 0s | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | ||| _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 28m 6s | trunk passed | | +1 :green_heart: | compile | 1m 16s | trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 | | +1 :green_heart: | compile | 1m 12s | trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 | | +1 :green_heart: | checkstyle | 0m 48s | trunk passed | | +1 :green_heart: | mvnsite | 1m 20s | trunk passed | | +1 :green_heart: | shadedclient | 16m 25s | branch has no errors when building and testing our client artifacts. | | +1 :green_heart: | javadoc | 0m 52s | trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 | | +1 :green_heart: | javadoc | 1m 27s | trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 | | +0 :ok: | spotbugs | 3m 0s | Used deprecated FindBugs config; considering switching to SpotBugs. 
| | +1 :green_heart: | findbugs | 2m 58s | trunk passed | ||| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 10s | the patch passed | | +1 :green_heart: | compile | 1m 10s | the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 | | +1 :green_heart: | javac | 1m 10s | the patch passed | | +1 :green_heart: | compile | 1m 3s | the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 | | +1 :green_heart: | javac | 1m 3s | the patch passed | | +1 :green_heart: | checkstyle | 0m 39s | the patch passed | | +1 :green_heart: | mvnsite | 1m 8s | the patch passed | | +1 :green_heart: | whitespace | 0m 0s | The patch has no whitespace issues. | | +1 :green_heart: | shadedclient | 13m 54s | patch has no errors when building and testing our client artifacts. | | +1 :green_heart: | javadoc | 0m 45s | the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 | | +1 :green_heart: | javadoc | 1m 19s | the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 | | +1 :green_heart: | findbugs | 3m 4s | the patch passed | ||| _ Other Tests _ | | -1 :x: | unit | 94m 42s | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 43s | The patch does not generate ASF License warnings. 
| | | | 176m 26s | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.TestNameNodeRetryCacheMetrics | | | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier | | | hadoop.hdfs.TestReconstructStripedFile | | | hadoop.hdfs.TestFileChecksumCompositeCrc | | | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped | | | hadoop.hdfs.TestFileChecksum | | | hadoop.hdfs.TestGetFileChecksum | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2274/3/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/2274 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 68f01445e536 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 139a43e98e2 | | Default Java | Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private
[jira] [Comment Edited] (HDFS-15551) Tiny Improve for DeadNode detector
[ https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190501#comment-17190501 ] imbajin edited comment on HDFS-15551 at 9/4/20, 2:56 AM: - [~hexiaoqiao], [~linyiqun], [~weichiu] Could you take a look at the patch? Thanks! And I wonder how this issue got *assigned* to me? (It seems I can't do that by myself.) was (Author: imbajin): [~hexiaoqiao] , [~linyiqun], [~weichiu] wang Could u take a view for the patch? THX And I wonder how this issue is *assigned* to me? (Seems I can't do this by myself) > Tiny Improve for DeadNode detector > -- > > Key: HDFS-15551 > URL: https://issues.apache.org/jira/browse/HDFS-15551 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs-client >Affects Versions: 3.3.0 >Reporter: dark_num >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 40m > Remaining Estimate: 0h > > # add or improve some logs for adding local & global deadnodes > # logic improve > # fix typo -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15551) Tiny Improve for DeadNode detector
[ https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190501#comment-17190501 ] imbajin edited comment on HDFS-15551 at 9/4/20, 2:55 AM: - [~hexiaoqiao], [~linyiqun], [~weichiu] wang Could you take a look at the patch? Thanks! And I wonder how this issue got *assigned* to me? (It seems I can't do that by myself.) was (Author: imbajin): [~hexiaoqiao] , [~linyiqun], Could u take a view for the patch? THX And I wonder how this issue is *assigned* to me? (Seems I can't do this by myself) > Tiny Improve for DeadNode detector > -- > > Key: HDFS-15551 > URL: https://issues.apache.org/jira/browse/HDFS-15551 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs-client >Affects Versions: 3.3.0 >Reporter: dark_num >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 40m > Remaining Estimate: 0h > > # add or improve some logs for adding local & global deadnodes > # logic improve > # fix typo -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15551) Tiny Improve for DeadNode detector
[ https://issues.apache.org/jira/browse/HDFS-15551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190501#comment-17190501 ] imbajin commented on HDFS-15551: [~hexiaoqiao], [~linyiqun], could you take a look at the patch? Thanks! And I wonder how this issue got *assigned* to me? (It seems I can't do that by myself.) > Tiny Improve for DeadNode detector > -- > > Key: HDFS-15551 > URL: https://issues.apache.org/jira/browse/HDFS-15551 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: hdfs-client >Affects Versions: 3.3.0 >Reporter: dark_num >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 40m > Remaining Estimate: 0h > > # add or improve some logs for adding local & global deadnodes > # logic improve > # fix typo -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13522) Support observer node from Router-Based Federation
[ https://issues.apache.org/jira/browse/HDFS-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190483#comment-17190483 ] Chao Sun commented on HDFS-13522: - [~hemanthboyina] feel free to take this over. I haven't had a chance to work on it, but I think it is an important feature. I may be able to help with code review. > Support observer node from Router-Based Federation > -- > > Key: HDFS-13522 > URL: https://issues.apache.org/jira/browse/HDFS-13522 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: federation, namenode >Reporter: Erik Krogen >Assignee: Chao Sun >Priority: Major > Attachments: HDFS-13522.001.patch, HDFS-13522_WIP.patch, RBF_ > Observer support.pdf, Router+Observer RPC clogging.png, > ShortTerm-Routers+Observer.png > > > Changes will need to occur to the router to support the new observer node. > One such change will be to make the router understand the observer state, > e.g. {{FederationNamenodeServiceState}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14694) Call recoverLease on DFSOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190479#comment-17190479 ] Lisheng Sun commented on HDFS-14694: Thanks [~hexiaoqiao] for the patient review. The v011 patch removes the unused code. > Call recoverLease on DFSOutputStream close exception > > > Key: HDFS-14694 > URL: https://issues.apache.org/jira/browse/HDFS-14694 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Chen Zhang >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, > HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, > HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch, > HDFS-14694.009.patch, HDFS-14694.010.patch, HDFS-14694.011.patch > > > HDFS uses file leases to manage open files; when a file is not closed > normally, the NN will recover the lease automatically after the hard limit is > exceeded. But for a long-running service (e.g. HBase), the hdfs-client never > dies, so the NN never gets a chance to recover the file. > Usually client programs need to handle exceptions themselves to avoid this > condition (e.g. HBase automatically calls lease recovery for files that were > not closed normally), but in our experience most services (in our company) > don't handle this condition properly, which leaves lots of files in an > abnormal state or even causes data loss. > This Jira proposes adding a feature that calls the recoverLease operation > automatically when DFSOutputStream close encounters an exception. It should be > disabled by default, but when somebody builds a long-running service based on > HDFS, they can enable this option. > We've had this feature in our internal Hadoop distribution for more than 3 > years; it's quite useful in our experience. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
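The behaviour proposed in this issue, recovering the lease when a DFSOutputStream close fails, can be approximated today from the client side with the existing {{DistributedFileSystem#recoverLease(Path)}} API. The snippet below is a self-contained sketch of that pattern; {{LeaseRecoverable}} and {{closeSafely}} are hypothetical stand-ins for illustration, not the patch's actual code:

```java
import java.io.Closeable;
import java.io.IOException;

/**
 * Client-side sketch of the behaviour this Jira proposes to build into
 * DFSOutputStream#close: if close fails, ask the NameNode to recover the
 * lease so the file does not stay open until the hard limit expires.
 * LeaseRecoverable is a hypothetical stand-in for
 * DistributedFileSystem#recoverLease.
 */
public class CloseWithLeaseRecovery {

  public interface LeaseRecoverable {
    boolean recoverLease(String path) throws IOException;
  }

  /** Close the stream; on failure, attempt best-effort lease recovery
   *  and rethrow the original close error. */
  public static void closeSafely(Closeable out, LeaseRecoverable fs,
                                 String path) throws IOException {
    try {
      out.close();
    } catch (IOException closeError) {
      try {
        fs.recoverLease(path);      // let the NN reclaim the lease
      } catch (IOException ignored) {
        // recovery is best-effort; surface the original failure below
      }
      throw closeError;
    }
  }
}
```

This is essentially what long-running services like HBase do by hand today; the proposal moves it behind a client config option so services that forget this handling still release their leases.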
[jira] [Updated] (HDFS-14694) Call recoverLease on DFSOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14694: --- Attachment: HDFS-14694.011.patch > Call recoverLease on DFSOutputStream close exception > > > Key: HDFS-14694 > URL: https://issues.apache.org/jira/browse/HDFS-14694 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Chen Zhang >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, > HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, > HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch, > HDFS-14694.009.patch, HDFS-14694.010.patch, HDFS-14694.011.patch > > > HDFS uses file leases to manage open files; when a file is not closed > normally, the NN will recover the lease automatically after the hard limit is > exceeded. But for a long-running service (e.g. HBase), the hdfs-client never > dies, so the NN never gets a chance to recover the file. > Usually client programs need to handle exceptions themselves to avoid this > condition (e.g. HBase automatically calls lease recovery for files that were > not closed normally), but in our experience most services (in our company) > don't handle this condition properly, which leaves lots of files in an > abnormal state or even causes data loss. > This Jira proposes adding a feature that calls the recoverLease operation > automatically when DFSOutputStream close encounters an exception. It should be > disabled by default, but when somebody builds a long-running service based on > HDFS, they can enable this option. > We've had this feature in our internal Hadoop distribution for more than 3 > years; it's quite useful in our experience. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14694) Call recoverLease on DFSOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14694: --- Attachment: (was: HDFS-14694.010.patch) > Call recoverLease on DFSOutputStream close exception > > > Key: HDFS-14694 > URL: https://issues.apache.org/jira/browse/HDFS-14694 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Chen Zhang >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, > HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, > HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch, > HDFS-14694.009.patch, HDFS-14694.010.patch > > > HDFS uses file leases to manage open files; when a file is not closed > normally, the NN will recover the lease automatically after the hard limit is > exceeded. But for a long-running service (e.g. HBase), the hdfs-client never > dies, so the NN never gets a chance to recover the file. > Usually client programs need to handle exceptions themselves to avoid this > condition (e.g. HBase automatically calls lease recovery for files that were > not closed normally), but in our experience most services (in our company) > don't handle this condition properly, which leaves lots of files in an > abnormal state or even causes data loss. > This Jira proposes adding a feature that calls the recoverLease operation > automatically when DFSOutputStream close encounters an exception. It should be > disabled by default, but when somebody builds a long-running service based on > HDFS, they can enable this option. > We've had this feature in our internal Hadoop distribution for more than 3 > years; it's quite useful in our experience. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14694) Call recoverLease on DFSOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14694: --- Attachment: HDFS-14694.010.patch
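The behavior the HDFS-14694 description proposes — call recoverLease when DFSOutputStream close fails, disabled by default — can be sketched without Hadoop on the classpath. This is not the actual patch code; `LeaseRecoverer` is a hypothetical stand-in for `DistributedFileSystem#recoverLease`, and the flag parameter stands in for whatever configuration key the patch introduces:

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch only, assuming a recoverLease-style callback: if close() throws,
// optionally trigger lease recovery so the NameNode can reclaim the file
// without waiting for the hard-limit expiry, then rethrow the original error.
class RecoveringClose {

    // Hypothetical stand-in for DistributedFileSystem#recoverLease(Path).
    interface LeaseRecoverer {
        boolean recoverLease() throws IOException;
    }

    /** Closes the stream; on failure, optionally triggers lease recovery. */
    static void closeWithRecovery(Closeable out, LeaseRecoverer recoverer,
                                  boolean recoverOnException) throws IOException {
        try {
            out.close();
        } catch (IOException e) {
            if (recoverOnException) {
                try {
                    recoverer.recoverLease(); // best effort; may return false
                } catch (IOException re) {
                    e.addSuppressed(re);      // keep the recovery failure too
                }
            }
            throw e; // still surface the original close failure to the caller
        }
    }
}
```

With `recoverOnException` left false (the proposed default), a failed close behaves exactly as today; long-running services would opt in.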
[jira] [Work logged] (HDFS-15557) Log the reason why a storage log file can't be deleted
[ https://issues.apache.org/jira/browse/HDFS-15557?focusedWorklogId=478886=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478886 ] ASF GitHub Bot logged work on HDFS-15557: - Author: ASF GitHub Bot Created on: 04/Sep/20 01:52 Start Date: 04/Sep/20 01:52 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2274: URL: https://github.com/apache/hadoop/pull/2274#issuecomment-686852824 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Comment | |::|--:|:|:| | +0 :ok: | reexec | 28m 28s | Docker mode activated. | ||| _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | No case conflicting files found. | | +1 :green_heart: | @author | 0m 0s | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | ||| _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 28m 13s | trunk passed | | +1 :green_heart: | compile | 1m 17s | trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 | | +1 :green_heart: | compile | 1m 11s | trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 | | +1 :green_heart: | checkstyle | 0m 49s | trunk passed | | +1 :green_heart: | mvnsite | 1m 18s | trunk passed | | +1 :green_heart: | shadedclient | 16m 15s | branch has no errors when building and testing our client artifacts. | | +1 :green_heart: | javadoc | 0m 51s | trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 | | +1 :green_heart: | javadoc | 1m 22s | trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 | | +0 :ok: | spotbugs | 2m 58s | Used deprecated FindBugs config; considering switching to SpotBugs. 
| | +1 :green_heart: | findbugs | 2m 56s | trunk passed | ||| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 11s | the patch passed | | +1 :green_heart: | compile | 1m 8s | the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 | | +1 :green_heart: | javac | 1m 8s | the patch passed | | +1 :green_heart: | compile | 1m 4s | the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 | | +1 :green_heart: | javac | 1m 4s | the patch passed | | -0 :warning: | checkstyle | 0m 40s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 3 unchanged - 0 fixed = 5 total (was 3) | | +1 :green_heart: | mvnsite | 1m 12s | the patch passed | | +1 :green_heart: | whitespace | 0m 0s | The patch has no whitespace issues. | | +1 :green_heart: | shadedclient | 13m 51s | patch has no errors when building and testing our client artifacts. | | +1 :green_heart: | javadoc | 0m 51s | the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 | | +1 :green_heart: | javadoc | 1m 25s | the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 | | +1 :green_heart: | findbugs | 3m 7s | the patch passed | ||| _ Other Tests _ | | -1 :x: | unit | 97m 52s | hadoop-hdfs in the patch failed. | | +1 :green_heart: | asflicense | 0m 42s | The patch does not generate ASF License warnings.
| | | | 207m 42s | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestFileChecksumCompositeCrc | | | hadoop.hdfs.TestMultipleNNPortQOP | | | hadoop.hdfs.TestFileAppend4 | | | hadoop.hdfs.TestErasureCodingExerciseAPIs | | | hadoop.hdfs.TestFileChecksum | | | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier | | | hadoop.hdfs.TestDFSStripedOutputStream | | | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure | | | hadoop.hdfs.server.balancer.TestBalancerWithMultipleNameNodes | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2274/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/2274 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux c407a908478b 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 139a43e98e2 | | Default Java | Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 | | Multi-JDK versions |
[jira] [Commented] (HDFS-15557) Log the reason why a storage log file can't be deleted
[ https://issues.apache.org/jira/browse/HDFS-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190452#comment-17190452 ] Ye Ni commented on HDFS-15557: -- [~inigoiri] testWriteTransactionIdHandlesIOE()? > Log the reason why a storage log file can't be deleted > -- > > Key: HDFS-15557 > URL: https://issues.apache.org/jira/browse/HDFS-15557 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Ye Ni >Assignee: Ye Ni >Priority: Minor > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > Before > > {code:java} > 2020-09-02 06:48:31,983 WARN [IPC Server handler 206 on 8020] > org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage > failed on Storage Directory root= K:\data\hdfs\namenode; location= null; > type= IMAGE; isShared= false; lock= > sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= > null java.io.IOException: Could not delete original file > K:\data\hdfs\namenode\current\seen_txid{code} > > After > > {code:java} > 2020-09-02 17:43:29,421 WARN [IPC Server handler 111 on 8020] > org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage > failed on Storage Directory root= K:\data\hdfs\namenode; location= null; > type= IMAGE; isShared= false; lock= > sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= > null java.io.IOException: Could not delete original file > K:\data\hdfs\namenode\current\seen_txid due to failure: > java.nio.file.FileSystemException: K:\data\hdfs\namenode\current\seen_txid: > The process cannot access the file because it is being used by another > process.{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15557) Log the reason why a storage log file can't be deleted
[ https://issues.apache.org/jira/browse/HDFS-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ye Ni updated HDFS-15557: - Summary: Log the reason why a storage log file can't be deleted (was: Log the reason why a file can't be deleted)
[jira] [Commented] (HDFS-15557) Log the reason why a file can't be deleted
[ https://issues.apache.org/jira/browse/HDFS-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190425#comment-17190425 ] Íñigo Goiri commented on HDFS-15557: [~NickyYe] thanks for the patch. Can you rename the JIRA to indicate this is used for the storage logs? Which tests cover this, BTW?
[jira] [Assigned] (HDFS-15557) Log the reason why a file can't be deleted
[ https://issues.apache.org/jira/browse/HDFS-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Íñigo Goiri reassigned HDFS-15557: -- Assignee: Ye Ni
[jira] [Work logged] (HDFS-15557) Log the reason why a file can't be deleted
[ https://issues.apache.org/jira/browse/HDFS-15557?focusedWorklogId=478829=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478829 ] ASF GitHub Bot logged work on HDFS-15557: - Author: ASF GitHub Bot Created on: 03/Sep/20 22:11 Start Date: 03/Sep/20 22:11 Worklog Time Spent: 10m Work Description: hadoop-yetus commented on pull request #2274: URL: https://github.com/apache/hadoop/pull/2274#issuecomment-686789909 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Comment | |::|--:|:|:| | +0 :ok: | reexec | 0m 30s | Docker mode activated. | ||| _ Prechecks _ | | +1 :green_heart: | dupname | 0m 1s | No case conflicting files found. | | +1 :green_heart: | @author | 0m 0s | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | ||| _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 29m 28s | trunk passed | | +1 :green_heart: | compile | 1m 17s | trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 | | +1 :green_heart: | compile | 1m 10s | trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 | | +1 :green_heart: | checkstyle | 0m 48s | trunk passed | | +1 :green_heart: | mvnsite | 1m 20s | trunk passed | | +1 :green_heart: | shadedclient | 16m 16s | branch has no errors when building and testing our client artifacts. | | +1 :green_heart: | javadoc | 0m 53s | trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 | | +1 :green_heart: | javadoc | 1m 26s | trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 | | +0 :ok: | spotbugs | 3m 0s | Used deprecated FindBugs config; considering switching to SpotBugs. 
| | +1 :green_heart: | findbugs | 2m 57s | trunk passed | ||| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 7s | the patch passed | | +1 :green_heart: | compile | 1m 8s | the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 | | +1 :green_heart: | javac | 1m 8s | the patch passed | | +1 :green_heart: | compile | 1m 2s | the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 | | +1 :green_heart: | javac | 1m 2s | the patch passed | | +1 :green_heart: | checkstyle | 0m 39s | the patch passed | | +1 :green_heart: | mvnsite | 1m 9s | the patch passed | | +1 :green_heart: | whitespace | 0m 0s | The patch has no whitespace issues. | | +1 :green_heart: | shadedclient | 13m 53s | patch has no errors when building and testing our client artifacts. | | +1 :green_heart: | javadoc | 0m 46s | the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 | | +1 :green_heart: | javadoc | 1m 24s | the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 | | +1 :green_heart: | findbugs | 3m 4s | the patch passed | ||| _ Other Tests _ | | -1 :x: | unit | 94m 29s | hadoop-hdfs in the patch failed. | | +1 :green_heart: | asflicense | 0m 44s | The patch does not generate ASF License warnings.
| | | | 177m 27s | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.balancer.TestBalancer | | | hadoop.hdfs.server.blockmanagement.TestBlockStatsMXBean | | | hadoop.hdfs.TestFileChecksum | | | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks | | | hadoop.hdfs.server.namenode.TestNameNodeRetryCacheMetrics | | | hadoop.hdfs.TestFileChecksumCompositeCrc | | | hadoop.hdfs.TestDFSInotifyEventInputStreamKerberized | | | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.40 ServerAPI=1.40 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2274/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/2274 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux a5a6b91bbc09 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 139a43e98e2 | | Default Java | Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1
[jira] [Commented] (HDFS-12548) HDFS Jenkins build is unstable on branch-2
[ https://issues.apache.org/jira/browse/HDFS-12548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190411#comment-17190411 ] Jim Brennan commented on HDFS-12548: I propose we close this issue or at least reduce the priority. It's three years old and I don't see any evidence that we've seen it again. Haven't we switched over to Cloudbees as well? > HDFS Jenkins build is unstable on branch-2 > -- > > Key: HDFS-12548 > URL: https://issues.apache.org/jira/browse/HDFS-12548 > Project: Hadoop HDFS > Issue Type: Bug > Components: build >Affects Versions: 2.9.0 >Reporter: Rushabh Shah >Priority: Critical > > Feel free to move the ticket to another project (e.g. infra). > Recently I attached a branch-2 patch while working on one jira > [HDFS-12386|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180676=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180676] > There were at least 100 failed and timed-out tests. I am sure they are not > related to my patch. > Also I came across another jira which was just a javadoc-related change and > there were around 100 failed tests. > Below are the details for the pre-commits that failed on branch-2: > 1. [HDFS-12386 attempt > 1|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180069=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180069] > {noformat} > Ran on slave: asf912.gq1.ygridcore.net/H12 > Failed with the following error message: > Build timed out (after 300 minutes). Marking the build as aborted. > Build was aborted > Performing Post build task... > {noformat} > 2. 
[HDFS-12386 attempt > 2|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180676=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180676] > {noformat} > Ran on slave: asf900.gq1.ygridcore.net > Failed with following error message: > FATAL: command execution failed > Command close created at > at hudson.remoting.Command.(Command.java:60) > at hudson.remoting.Channel$CloseCommand.(Channel.java:1123) > at hudson.remoting.Channel$CloseCommand.(Channel.java:1121) > at hudson.remoting.Channel.close(Channel.java:1281) > at hudson.remoting.Channel.close(Channel.java:1263) > at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1128) > Caused: hudson.remoting.Channel$OrderlyShutdown > at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1129) > at hudson.remoting.Channel$1.handle(Channel.java:527) > at > hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:83) > Caused: java.io.IOException: Backing channel 'H0' is disconnected. 
> at > hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:192) > at > hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:257) > at com.sun.proxy.$Proxy125.isAlive(Unknown Source) > at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1043) > at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1035) > at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:155) > at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:109) > at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66) > at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20) > at > hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:735) > at hudson.model.Build$BuildExecution.build(Build.java:206) > at hudson.model.Build$BuildExecution.doRun(Build.java:163) > at > hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:490) > at hudson.model.Run.execute(Run.java:1735) > at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) > at hudson.model.ResourceController.execute(ResourceController.java:97) > at hudson.model.Executor.run(Executor.java:405) > {noformat} > 3. [HDFS-12531 attempt > 1|https://issues.apache.org/jira/browse/HDFS-12531?focusedCommentId=16176493=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16176493] > {noformat} > Ran on slave: asf911.gq1.ygridcore.net > Failed with following error message: > FATAL: command execution failed > Command close created at > at hudson.remoting.Command.(Command.java:60) > at hudson.remoting.Channel$CloseCommand.(Channel.java:1123) > at hudson.remoting.Channel$CloseCommand.(Channel.java:1121) > at hudson.remoting.Channel.close(Channel.java:1281) > at hudson.remoting.Channel.close(Channel.java:1263) > at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1128) > Caused: hudson.remoting.Channel$OrderlyShutdown >
[jira] [Created] (HDFS-15558) ViewDistributedFileSystem#recoverLease should call super.recoverLease when there are no mounts configured
Uma Maheswara Rao G created HDFS-15558: -- Summary: ViewDistributedFileSystem#recoverLease should call super.recoverLease when there are no mounts configured Key: HDFS-15558 URL: https://issues.apache.org/jira/browse/HDFS-15558 Project: Hadoop HDFS Issue Type: Sub-task Reporter: Uma Maheswara Rao G Assignee: Uma Maheswara Rao G -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15543) RBF: Write Should allow, when a subcluster is unavailable for RANDOM mount points with fault Tolerance enabled.
[ https://issues.apache.org/jira/browse/HDFS-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190381#comment-17190381 ] Hadoop QA commented on HDFS-15543: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 33m 8s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 1s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 31s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 37s{color} | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s{color} | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 20s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 34s{color} | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 52s{color} | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 1m 9s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 8s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s{color} | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 31s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s{color} | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 25s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 29s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. 
{color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 21s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 32s{color} | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 48s{color} | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 13s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 8m 39s{color} | {color:red} hadoop-hdfs-rbf in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 28s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
[jira] [Work logged] (HDFS-15557) Log the reason why a file can't be deleted
[ https://issues.apache.org/jira/browse/HDFS-15557?focusedWorklogId=478769=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478769 ] ASF GitHub Bot logged work on HDFS-15557: - Author: ASF GitHub Bot Created on: 03/Sep/20 19:12 Start Date: 03/Sep/20 19:12 Worklog Time Spent: 10m Work Description: NickyYe opened a new pull request #2274: URL: https://github.com/apache/hadoop/pull/2274 https://issues.apache.org/jira/browse/HDFS-15557 Issue Time Tracking --- Worklog Id: (was: 478769) Remaining Estimate: 0h Time Spent: 10m
[jira] [Updated] (HDFS-15557) Log the reason why a file can't be deleted
[ https://issues.apache.org/jira/browse/HDFS-15557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-15557: -- Labels: pull-request-available (was: )
[jira] [Created] (HDFS-15557) Log the reason why a file can't be deleted
Ye Ni created HDFS-15557: Summary: Log the reason why a file can't be deleted Key: HDFS-15557 URL: https://issues.apache.org/jira/browse/HDFS-15557 Project: Hadoop HDFS Issue Type: Improvement Reporter: Ye Ni Before {code:java} 2020-09-02 06:48:31,983 WARN [IPC Server handler 206 on 8020] org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage failed on Storage Directory root= K:\data\hdfs\namenode; location= null; type= IMAGE; isShared= false; lock= sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= null java.io.IOException: Could not delete original file K:\data\hdfs\namenode\current\seen_txid{code} After {code:java} 2020-09-02 17:43:29,421 WARN [IPC Server handler 111 on 8020] org.apache.hadoop.hdfs.server.common.Storage: writeTransactionIdToStorage failed on Storage Directory root= K:\data\hdfs\namenode; location= null; type= IMAGE; isShared= false; lock= sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid]; storageUuid= null java.io.IOException: Could not delete original file K:\data\hdfs\namenode\current\seen_txid due to failure: java.nio.file.FileSystemException: K:\data\hdfs\namenode\current\seen_txid: The process cannot access the file because it is being used by another process.{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
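The Before/After logs above hinge on a standard-library distinction: `java.io.File.delete()` only reports `false` on failure, while `java.nio.file.Files.delete()` throws an `IOException` (e.g. `FileSystemException`) that carries the OS-level reason such as "being used by another process". A minimal sketch of the idea — the `deleteWithReason` helper is illustrative, not the actual patch:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class DeleteWithReason {

  /**
   * Delete a file via NIO so that a failure surfaces the underlying
   * reason (NoSuchFileException, FileSystemException, ...) instead of
   * the bare boolean returned by File.delete().
   */
  public static void deleteWithReason(File file) throws IOException {
    try {
      Files.delete(file.toPath());
    } catch (IOException e) {
      // Mirror the improved log message: keep the original text and
      // append the root cause reported by the filesystem.
      throw new IOException(
          "Could not delete original file " + file + " due to failure: " + e, e);
    }
  }

  public static void main(String[] args) throws IOException {
    File tmp = File.createTempFile("seen_txid", null);
    deleteWithReason(tmp); // succeeds; the file is gone
    try {
      deleteWithReason(tmp); // fails: the file no longer exists
    } catch (IOException e) {
      System.out.println(e.getMessage()); // message now includes the NIO cause
    }
  }
}
```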
[jira] [Work logged] (HDFS-15025) Applying NVDIMM storage media to HDFS
[ https://issues.apache.org/jira/browse/HDFS-15025?focusedWorklogId=478768=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478768 ] ASF GitHub Bot logged work on HDFS-15025: - Author: ASF GitHub Bot Created on: 03/Sep/20 19:03 Start Date: 03/Sep/20 19:03 Worklog Time Spent: 10m Work Description: liuml07 commented on a change in pull request #2189: URL: https://github.com/apache/hadoop/pull/2189#discussion_r483194125 ## File path: hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/StorageType.java ## @@ -34,28 +34,35 @@ @InterfaceStability.Unstable public enum StorageType { // sorted by the speed of the storage types, from fast to slow - RAM_DISK(true), - SSD(false), - DISK(false), - ARCHIVE(false), - PROVIDED(false); + RAM_DISK(true, true), + NVDIMM(false, true), + SSD(false, false), + DISK(false, false), + ARCHIVE(false, false), + PROVIDED(false, false); private final boolean isTransient; + private final boolean isRAM; public static final StorageType DEFAULT = DISK; public static final StorageType[] EMPTY_ARRAY = {}; private static final StorageType[] VALUES = values(); - StorageType(boolean isTransient) { + StorageType(boolean isTransient, boolean isRAM) { this.isTransient = isTransient; +this.isRAM = isRAM; } public boolean isTransient() { return isTransient; } + public boolean isRAM() { +return isRAM; + } Review comment: Oh, I was thinking that allowing Balancer to move the NVDIMM data is by design since they are not volatile. But if that is case, then we can update Balancer code by replacing `isTransient()` call with `isRAM()` call. Not sure if this makes more sense? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 478768) Time Spent: 1.5h (was: 1h 20m) > Applying NVDIMM storage media to HDFS > - > > Key: HDFS-15025 > URL: https://issues.apache.org/jira/browse/HDFS-15025 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, hdfs >Reporter: YaYun Wang >Assignee: YaYun Wang >Priority: Major > Labels: pull-request-available > Attachments: Applying NVDIMM to HDFS.pdf, HDFS-15025.001.patch, > HDFS-15025.002.patch, HDFS-15025.003.patch, HDFS-15025.004.patch, > HDFS-15025.005.patch, HDFS-15025.006.patch, NVDIMM_patch(WIP).patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > The non-volatile memory NVDIMM is faster than SSD and can be used > simultaneously with RAM, DISK, and SSD. Storing HDFS data directly on NVDIMM > not only improves the response rate of HDFS but also ensures the reliability > of the data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
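The review comment above suggests that code which must skip memory-backed storage (such as the Balancer's movable-type filter) could test the new `isRAM` flag instead of `isTransient`. A hedged sketch of that idea — the enum mirrors the diff in the thread, but `movableTypes()` is an illustrative helper, not the actual Balancer code:

```java
import java.util.ArrayList;
import java.util.List;

public class StorageTypeSketch {

  // Mirrors the enum change proposed in the diff above:
  // (isTransient, isRAM) per storage type, fast to slow.
  public enum StorageType {
    RAM_DISK(true, true),
    NVDIMM(false, true),
    SSD(false, false),
    DISK(false, false),
    ARCHIVE(false, false),
    PROVIDED(false, false);

    private final boolean isTransient;
    private final boolean isRAM;

    StorageType(boolean isTransient, boolean isRAM) {
      this.isTransient = isTransient;
      this.isRAM = isRAM;
    }

    public boolean isTransient() { return isTransient; }
    public boolean isRAM() { return isRAM; }
  }

  /**
   * Illustrative filter: if the Balancer keyed its movable-type check on
   * isRAM() rather than isTransient(), NVDIMM would be excluded from
   * movement along with RAM_DISK, matching the reviewer's concern that
   * RAM-resident blocks need not be moved.
   */
  public static List<StorageType> movableTypes() {
    List<StorageType> out = new ArrayList<>();
    for (StorageType t : StorageType.values()) {
      if (!t.isRAM()) {
        out.add(t);
      }
    }
    return out;
  }
}
```

Note that under this sketch NVDIMM stays non-transient (data survives restarts) while still being treated as RAM for movement purposes, which is exactly the distinction the two flags encode.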
[jira] [Commented] (HDFS-15543) RBF: Write Should allow, when a subcluster is unavailable for RANDOM mount points with fault Tolerance enabled.
[ https://issues.apache.org/jira/browse/HDFS-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190363#comment-17190363 ] Íñigo Goiri commented on HDFS-15543: Another thing is that this is touching a part of the code around HDFS-1 where [~aajisaka] is trying to support general socket exceptions. > RBF: Write Should allow, when a subcluster is unavailable for RANDOM mount > points with fault Tolerance enabled. > > > Key: HDFS-15543 > URL: https://issues.apache.org/jira/browse/HDFS-15543 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.1.1 >Reporter: Harshakiran Reddy >Assignee: Hemanth Boyina >Priority: Major > Attachments: HDFS-15543.001.patch, HDFS-15543.002.patch, > HDFS-15543.003.patch, HDFS-15543_testrepro.patch > > > A RANDOM mount point should allow creating new files when one subcluster is > down and Fault Tolerance is enabled, but here it fails. > MultiDestination_client]# hdfs dfsrouteradmin -ls /test_ec > *Mount Table Entries:* > Source Destinations Owner Group Mode Quota/Usage > /test_ec *hacluster->/tes_ec,hacluster1->/tes_ec* test ficommon rwxr-xr-x > [NsQuota: -/-, SsQuota: -/-] > *File write throws the exception:* > 2020-08-26 19:13:21,839 WARN hdfs.DataStreamer: Abandoning blk_1073743375_2551 > 2020-08-26 19:13:21,877 WARN hdfs.DataStreamer: Excluding datanode > DatanodeInfoWithStorage[DISK] > 2020-08-26 19:13:21,878 WARN hdfs.DataStreamer: DataStreamer Exception > java.io.IOException: Unable to create new block. > at > org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1758) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:718) > 2020-08-26 19:13:21,879 WARN hdfs.DataStreamer: Could not get block > locations. Source file "/test_ec/f1._COPYING_" - Aborting...block==null > put: Could not get block locations. Source file "/test_ec/f1._COPYING_" - > Aborting...block==null -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13522) Support observer node from Router-Based Federation
[ https://issues.apache.org/jira/browse/HDFS-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190360#comment-17190360 ] Íñigo Goiri commented on HDFS-13522: Thanks [~hemanthboyina] for the update. The patch has a couple of things that I would try to fix but it looks like the right approach to me. We may want to discuss adding the contexts and so on but I would move forward with that. > Support observer node from Router-Based Federation > -- > > Key: HDFS-13522 > URL: https://issues.apache.org/jira/browse/HDFS-13522 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: federation, namenode >Reporter: Erik Krogen >Assignee: Chao Sun >Priority: Major > Attachments: HDFS-13522.001.patch, HDFS-13522_WIP.patch, RBF_ > Observer support.pdf, Router+Observer RPC clogging.png, > ShortTerm-Routers+Observer.png > > > Changes will need to occur to the router to support the new observer node. > One such change will be to make the router understand the observer state, > e.g. {{FederationNamenodeServiceState}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-15529) getChildFilesystems should include fallback fs as well
[ https://issues.apache.org/jira/browse/HDFS-15529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uma Maheswara Rao G resolved HDFS-15529. Fix Version/s: 3.4.0 Hadoop Flags: Reviewed Resolution: Fixed Thanks [~ayushsaxena] for the review. I have committed this to trunk. > getChildFilesystems should include fallback fs as well > -- > > Key: HDFS-15529 > URL: https://issues.apache.org/jira/browse/HDFS-15529 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: viewfs, viewfsOverloadScheme >Affects Versions: 3.4.0 >Reporter: Uma Maheswara Rao G >Assignee: Uma Maheswara Rao G >Priority: Critical > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Currently the getChildFileSystems API is used by many other APIs, like > getAdditionalTokenIssuers, getTrashRoots, etc. > If the fallback filesystem is not included in the child filesystems, > applications like YARN that use getAdditionalTokenIssuers would not get > delegation tokens for the fallback fs. This would be a critical bug for > secure clusters. > Similarly for trashRoots: when applications try to use trashRoots, it will > not consider trash folders from the fallback, so they will leak from cleanup > logic. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
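The fix described above is essentially set arithmetic over the mounted file systems: the child set must also contain the fallback file system, or consumers such as token issuers and trash-root enumeration silently miss it. A minimal sketch under stated assumptions — `MockFs`, `mountTargets`, and `fallbackFs` are illustrative stand-ins, not the real ViewFileSystem internals:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ChildFsSketch {

  /** Minimal stand-in for a child file system. */
  public static class MockFs {
    final String uri;
    public MockFs(String uri) { this.uri = uri; }
  }

  // Target file systems of regular mount points.
  public final List<MockFs> mountTargets = new ArrayList<>();
  // Fallback file system; null when no linkFallback is configured.
  public MockFs fallbackFs;

  /**
   * Sketch of the fix: return mount targets plus the fallback fs.
   * Previously the fallback was omitted, so delegation-token issuers
   * and trash-root scans never reached it.
   */
  public List<MockFs> getChildFileSystems() {
    Set<MockFs> children = new LinkedHashSet<>(mountTargets);
    if (fallbackFs != null) {
      children.add(fallbackFs); // the one-line idea behind the fix
    }
    return new ArrayList<>(children);
  }
}
```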
[jira] [Work logged] (HDFS-15529) getChildFilesystems should include fallback fs as well
[ https://issues.apache.org/jira/browse/HDFS-15529?focusedWorklogId=478737=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478737 ] ASF GitHub Bot logged work on HDFS-15529: - Author: ASF GitHub Bot Created on: 03/Sep/20 18:07 Start Date: 03/Sep/20 18:07 Worklog Time Spent: 10m Work Description: umamaheswararao commented on pull request #2234: URL: https://github.com/apache/hadoop/pull/2234#issuecomment-686660352 Thanks @ayushtkn for the review. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 478737) Time Spent: 40m (was: 0.5h) > getChildFilesystems should include fallback fs as well > -- > > Key: HDFS-15529 > URL: https://issues.apache.org/jira/browse/HDFS-15529 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: viewfs, viewfsOverloadScheme >Affects Versions: 3.4.0 >Reporter: Uma Maheswara Rao G >Assignee: Uma Maheswara Rao G >Priority: Critical > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > Currently getChildSystems API used by many other APIs, like > getAdditionalTokenIssuers, getTrashRoots etc. > If fallBack filesystem not included in child filesystems, Application like > YARN who uses getAdditionalTokenIssuers, would not get delegation tokens for > fallback fs. This would be a critical bug for secure clusters. > Similarly, trashRoots. when applications tried to use trashRoots, it will not > considers trash folders from fallback. So, it will leak from cleanup logics. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15529) getChildFilesystems should include fallback fs as well
[ https://issues.apache.org/jira/browse/HDFS-15529?focusedWorklogId=478736=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478736 ] ASF GitHub Bot logged work on HDFS-15529: - Author: ASF GitHub Bot Created on: 03/Sep/20 18:06 Start Date: 03/Sep/20 18:06 Worklog Time Spent: 10m Work Description: umamaheswararao merged pull request #2234: URL: https://github.com/apache/hadoop/pull/2234 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 478736) Time Spent: 0.5h (was: 20m) > getChildFilesystems should include fallback fs as well > -- > > Key: HDFS-15529 > URL: https://issues.apache.org/jira/browse/HDFS-15529 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: viewfs, viewfsOverloadScheme >Affects Versions: 3.4.0 >Reporter: Uma Maheswara Rao G >Assignee: Uma Maheswara Rao G >Priority: Critical > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Currently getChildSystems API used by many other APIs, like > getAdditionalTokenIssuers, getTrashRoots etc. > If fallBack filesystem not included in child filesystems, Application like > YARN who uses getAdditionalTokenIssuers, would not get delegation tokens for > fallback fs. This would be a critical bug for secure clusters. > Similarly, trashRoots. when applications tried to use trashRoots, it will not > considers trash folders from fallback. So, it will leak from cleanup logics. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15543) RBF: Write Should allow, when a subcluster is unavailable for RANDOM mount points with fault Tolerance enabled.
[ https://issues.apache.org/jira/browse/HDFS-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190337#comment-17190337 ] Hemanth Boyina commented on HDFS-15543: --- Thanks for the review, [~elgoiri]. I have updated the patch by removing the repeated parts; please review. > RBF: Write Should allow, when a subcluster is unavailable for RANDOM mount > points with fault Tolerance enabled. > > > Key: HDFS-15543 > URL: https://issues.apache.org/jira/browse/HDFS-15543 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.1.1 >Reporter: Harshakiran Reddy >Assignee: Hemanth Boyina >Priority: Major > Attachments: HDFS-15543.001.patch, HDFS-15543.002.patch, > HDFS-15543.003.patch, HDFS-15543_testrepro.patch > > > A RANDOM mount point should allow creating new files when one subcluster is > down and Fault Tolerance is enabled, but here it fails. > MultiDestination_client]# hdfs dfsrouteradmin -ls /test_ec > *Mount Table Entries:* > Source Destinations Owner Group Mode Quota/Usage > /test_ec *hacluster->/tes_ec,hacluster1->/tes_ec* test ficommon rwxr-xr-x > [NsQuota: -/-, SsQuota: -/-] > *File write throws the exception:* > 2020-08-26 19:13:21,839 WARN hdfs.DataStreamer: Abandoning blk_1073743375_2551 > 2020-08-26 19:13:21,877 WARN hdfs.DataStreamer: Excluding datanode > DatanodeInfoWithStorage[DISK] > 2020-08-26 19:13:21,878 WARN hdfs.DataStreamer: DataStreamer Exception > java.io.IOException: Unable to create new block. > at > org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1758) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:718) > 2020-08-26 19:13:21,879 WARN hdfs.DataStreamer: Could not get block > locations. Source file "/test_ec/f1._COPYING_" - Aborting...block==null > put: Could not get block locations. Source file "/test_ec/f1._COPYING_" - > Aborting...block==null -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15543) RBF: Write Should allow, when a subcluster is unavailable for RANDOM mount points with fault Tolerance enabled.
[ https://issues.apache.org/jira/browse/HDFS-15543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hemanth Boyina updated HDFS-15543: -- Attachment: HDFS-15543.003.patch > RBF: Write Should allow, when a subcluster is unavailable for RANDOM mount > points with fault Tolerance enabled. > > > Key: HDFS-15543 > URL: https://issues.apache.org/jira/browse/HDFS-15543 > Project: Hadoop HDFS > Issue Type: Bug > Components: rbf >Affects Versions: 3.1.1 >Reporter: Harshakiran Reddy >Assignee: Hemanth Boyina >Priority: Major > Attachments: HDFS-15543.001.patch, HDFS-15543.002.patch, > HDFS-15543.003.patch, HDFS-15543_testrepro.patch > > > A RANDOM mount point should allow to creating new files if one subcluster is > down also with Fault Tolerance was enabled. but here it's failed. > MultiDestination_client]# hdfs dfsrouteradmin -ls /test_ec > *Mount Table Entries:* > Source Destinations Owner Group Mode Quota/Usage > /test_ec *hacluster->/tes_ec,hacluster1->/tes_ec* test ficommon rwxr-xr-x > [NsQuota: -/-, SsQuota: -/-] > *File Write throne the Exception:-* > 2020-08-26 19:13:21,839 WARN hdfs.DataStreamer: Abandoning blk_1073743375_2551 > 2020-08-26 19:13:21,877 WARN hdfs.DataStreamer: Excluding datanode > DatanodeInfoWithStorage[DISK] > 2020-08-26 19:13:21,878 WARN hdfs.DataStreamer: DataStreamer Exception > java.io.IOException: Unable to create new block. > at > org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1758) > at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:718) > 2020-08-26 19:13:21,879 WARN hdfs.DataStreamer: Could not get block > locations. Source file "/test_ec/f1._COPYING_" - Aborting...block==null > put: Could not get block locations. Source file "/test_ec/f1._COPYING_" - > Aborting...block==null -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15025) Applying NVDIMM storage media to HDFS
[ https://issues.apache.org/jira/browse/HDFS-15025?focusedWorklogId=478697=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478697 ] ASF GitHub Bot logged work on HDFS-15025: - Author: ASF GitHub Bot Created on: 03/Sep/20 17:00 Start Date: 03/Sep/20 17:00 Worklog Time Spent: 10m Work Description: brahmareddybattula commented on a change in pull request #2189: URL: https://github.com/apache/hadoop/pull/2189#discussion_r483125623 ## File path: hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/StorageType.java ## @@ -34,28 +34,35 @@ @InterfaceStability.Unstable public enum StorageType { // sorted by the speed of the storage types, from fast to slow - RAM_DISK(true), - SSD(false), - DISK(false), - ARCHIVE(false), - PROVIDED(false); + RAM_DISK(true, true), + NVDIMM(false, true), + SSD(false, false), + DISK(false, false), + ARCHIVE(false, false), + PROVIDED(false, false); private final boolean isTransient; + private final boolean isRAM; public static final StorageType DEFAULT = DISK; public static final StorageType[] EMPTY_ARRAY = {}; private static final StorageType[] VALUES = values(); - StorageType(boolean isTransient) { + StorageType(boolean isTransient, boolean isRAM) { this.isTransient = isTransient; +this.isRAM = isRAM; } public boolean isTransient() { return isTransient; } + public boolean isRAM() { +return isRAM; + } Review comment: Balancer and mover will not move the blocks based on the `isTransient` ( they call getMovableTypes(..))..The blocks which are in NVDIMM shouldn't moved I feel(as this also exists in RAM and no need to move),but as per this change it will move. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 478697) Time Spent: 1h 20m (was: 1h 10m) > Applying NVDIMM storage media to HDFS > - > > Key: HDFS-15025 > URL: https://issues.apache.org/jira/browse/HDFS-15025 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, hdfs >Reporter: YaYun Wang >Assignee: YaYun Wang >Priority: Major > Labels: pull-request-available > Attachments: Applying NVDIMM to HDFS.pdf, HDFS-15025.001.patch, > HDFS-15025.002.patch, HDFS-15025.003.patch, HDFS-15025.004.patch, > HDFS-15025.005.patch, HDFS-15025.006.patch, NVDIMM_patch(WIP).patch > > Time Spent: 1h 20m > Remaining Estimate: 0h > > The non-volatile memory NVDIMM is faster than SSD, it can be used > simultaneously with RAM, DISK, SSD. The data of HDFS stored directly on > NVDIMM can not only improves the response rate of HDFS, but also ensure the > reliability of the data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15554) RBF: force router check file existence in destinations before adding/updating mount points
[ https://issues.apache.org/jira/browse/HDFS-15554?focusedWorklogId=478696=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478696 ] ASF GitHub Bot logged work on HDFS-15554: - Author: ASF GitHub Bot Created on: 03/Sep/20 17:00 Start Date: 03/Sep/20 17:00 Worklog Time Spent: 10m Work Description: fengnanli commented on a change in pull request #2266: URL: https://github.com/apache/hadoop/pull/2266#discussion_r483126174 ## File path: hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterAdminServer.java ## @@ -562,11 +595,35 @@ public GetDestinationResponse getDestination( LOG.error("Cannot get location for {}: {}", src, ioe.getMessage()); } -if (nsIds.isEmpty() && !locations.isEmpty()) { - String nsId = locations.get(0).getNameserviceId(); - nsIds.add(nsId); +return nsIds; + } + + /** + * Verify the file exists in destination nameservices to avoid dangling + * mount points. + * + * @param entry the new mount points added, could be from add or update. + * @return destination nameservices where the file doesn't exist. + * @throws IOException + */ + private List verifyFileInDestinations(MountTable entry) Review comment: Thanks for the suggestion. I want to involve more people as well since when I started to fix the tests, I found there are quite a few tests targeting/testing cases for dangling mount points. @aajisaka Can you share your thoughts as well? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 478696) Time Spent: 1h 20m (was: 1h 10m) > RBF: force router check file existence in destinations before adding/updating > mount points > -- > > Key: HDFS-15554 > URL: https://issues.apache.org/jira/browse/HDFS-15554 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Fengnan Li >Assignee: Fengnan Li >Priority: Minor > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > Adding/Updating mount points right now is only a router action without > validation in the downstream namenodes for the destination files/directories. > In practice we have set up the dangling mount points and when clients call > listStatus they would get the file returned, but then if they try to access > the file FileNotFoundException would be thrown out. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
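The verification being reviewed above checks, for each destination nameservice of a new mount entry, whether the target path already exists, and rejects the admin call when it does not, so no dangling mount point is created. A hedged sketch of that control flow — `MountEntry` and the `pathExists` predicate are illustrative stand-ins for the real `MountTable` entry and the `getFileInfo` RPC to each nameservice:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

public class MountVerifySketch {

  /** Minimal stand-in for a mount entry: a list of (nsId, destPath) pairs. */
  public static class MountEntry {
    final List<String[]> destinations = new ArrayList<>();
    public void addDestination(String nsId, String destPath) {
      destinations.add(new String[] {nsId, destPath});
    }
  }

  /**
   * Sketch of verifyFileInDestinations: ask each destination nameservice
   * whether the target path exists (pathExists stands in for a
   * getFileInfo RPC) and return the nameservices where it does not.
   * A non-empty result means the add/update would create a dangling
   * mount point and should be rejected or flagged.
   */
  public static List<String> verifyFileInDestinations(
      MountEntry entry, BiPredicate<String, String> pathExists) {
    List<String> missing = new ArrayList<>();
    for (String[] dest : entry.destinations) {
      if (!pathExists.test(dest[0], dest[1])) {
        missing.add(dest[0]); // this destination would dangle
      }
    }
    return missing;
  }
}
```

This also illustrates the trade-off raised in the thread: tests that deliberately create dangling mount points would start failing once such a check is enforced.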
[jira] [Work logged] (HDFS-15554) RBF: force router check file existence in destinations before adding/updating mount points
[ https://issues.apache.org/jira/browse/HDFS-15554?focusedWorklogId=478693=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478693 ] ASF GitHub Bot logged work on HDFS-15554: - Author: ASF GitHub Bot Created on: 03/Sep/20 17:00 Start Date: 03/Sep/20 17:00 Worklog Time Spent: 10m Work Description: fengnanli commented on a change in pull request #2266: URL: https://github.com/apache/hadoop/pull/2266#discussion_r483126174 ## File path: hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RouterAdminServer.java ## @@ -562,11 +595,35 @@ public GetDestinationResponse getDestination( LOG.error("Cannot get location for {}: {}", src, ioe.getMessage()); } -if (nsIds.isEmpty() && !locations.isEmpty()) { - String nsId = locations.get(0).getNameserviceId(); - nsIds.add(nsId); +return nsIds; + } + + /** + * Verify the file exists in destination nameservices to avoid dangling + * mount points. + * + * @param entry the new mount points added, could be from add or update. + * @return destination nameservices where the file doesn't exist. + * @throws IOException + */ + private List verifyFileInDestinations(MountTable entry) Review comment: Thanks for the suggestion. I want to involve more people as well since when I started to fix the tests, I found there are quite a few tests targeting the logic of dangling mount points. @aajisaka Can you share your thoughts as well? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 478693) Time Spent: 1h 10m (was: 1h) > RBF: force router check file existence in destinations before adding/updating > mount points > -- > > Key: HDFS-15554 > URL: https://issues.apache.org/jira/browse/HDFS-15554 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Fengnan Li >Assignee: Fengnan Li >Priority: Minor > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > Adding/Updating mount points right now is only a router action without > validation in the downstream namenodes for the destination files/directories. > In practice we have set up the dangling mount points and when clients call > listStatus they would get the file returned, but then if they try to access > the file FileNotFoundException would be thrown out. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDFS-15025) Applying NVDIMM storage media to HDFS
[ https://issues.apache.org/jira/browse/HDFS-15025?focusedWorklogId=478692=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-478692 ] ASF GitHub Bot logged work on HDFS-15025: - Author: ASF GitHub Bot Created on: 03/Sep/20 16:59 Start Date: 03/Sep/20 16:59 Worklog Time Spent: 10m Work Description: brahmareddybattula commented on a change in pull request #2189: URL: https://github.com/apache/hadoop/pull/2189#discussion_r483125623 ## File path: hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/StorageType.java ## @@ -34,28 +34,35 @@ @InterfaceStability.Unstable public enum StorageType { // sorted by the speed of the storage types, from fast to slow - RAM_DISK(true), - SSD(false), - DISK(false), - ARCHIVE(false), - PROVIDED(false); + RAM_DISK(true, true), + NVDIMM(false, true), + SSD(false, false), + DISK(false, false), + ARCHIVE(false, false), + PROVIDED(false, false); private final boolean isTransient; + private final boolean isRAM; public static final StorageType DEFAULT = DISK; public static final StorageType[] EMPTY_ARRAY = {}; private static final StorageType[] VALUES = values(); - StorageType(boolean isTransient) { + StorageType(boolean isTransient, boolean isRAM) { this.isTransient = isTransient; +this.isRAM = isRAM; } public boolean isTransient() { return isTransient; } + public boolean isRAM() { +return isRAM; + } Review comment: Balancer and mover will not move the blocks based on the `isTransient` ( they call getMovableTypes(..))..the blocks which NVDIMM shouldn't moved I feel,but as per this change it will move. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 478692) Time Spent: 1h 10m (was: 1h) > Applying NVDIMM storage media to HDFS > - > > Key: HDFS-15025 > URL: https://issues.apache.org/jira/browse/HDFS-15025 > Project: Hadoop HDFS > Issue Type: New Feature > Components: datanode, hdfs >Reporter: YaYun Wang >Assignee: YaYun Wang >Priority: Major > Labels: pull-request-available > Attachments: Applying NVDIMM to HDFS.pdf, HDFS-15025.001.patch, > HDFS-15025.002.patch, HDFS-15025.003.patch, HDFS-15025.004.patch, > HDFS-15025.005.patch, HDFS-15025.006.patch, NVDIMM_patch(WIP).patch > > Time Spent: 1h 10m > Remaining Estimate: 0h > > The non-volatile memory NVDIMM is faster than SSD, it can be used > simultaneously with RAM, DISK, SSD. The data of HDFS stored directly on > NVDIMM can not only improves the response rate of HDFS, but also ensure the > reliability of the data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13522) Support observer node from Router-Based Federation
[ https://issues.apache.org/jira/browse/HDFS-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190290#comment-17190290 ] Hemanth Boyina commented on HDFS-13522: --- Thanks everyone for the discussions. Here at Huawei, we have developed and have been using the router with observer node support for quite some time; please check [^HDFS-13522_WIP.patch] > Support observer node from Router-Based Federation > -- > > Key: HDFS-13522 > URL: https://issues.apache.org/jira/browse/HDFS-13522 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: federation, namenode >Reporter: Erik Krogen >Assignee: Chao Sun >Priority: Major > Attachments: HDFS-13522.001.patch, HDFS-13522_WIP.patch, RBF_ > Observer support.pdf, Router+Observer RPC clogging.png, > ShortTerm-Routers+Observer.png > > > Changes will need to occur to the router to support the new observer node. > One such change will be to make the router understand the observer state, > e.g. {{FederationNamenodeServiceState}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-13522) Support observer node from Router-Based Federation
[ https://issues.apache.org/jira/browse/HDFS-13522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hemanth Boyina updated HDFS-13522: -- Attachment: HDFS-13522_WIP.patch > Support observer node from Router-Based Federation > -- > > Key: HDFS-13522 > URL: https://issues.apache.org/jira/browse/HDFS-13522 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: federation, namenode >Reporter: Erik Krogen >Assignee: Chao Sun >Priority: Major > Attachments: HDFS-13522.001.patch, HDFS-13522_WIP.patch, RBF_ > Observer support.pdf, Router+Observer RPC clogging.png, > ShortTerm-Routers+Observer.png > > > Changes will need to occur to the router to support the new observer node. > One such change will be to make the router understand the observer state, > e.g. {{FederationNamenodeServiceState}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14694) Call recoverLease on DFSOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190172#comment-17190172 ] Hadoop QA commented on HDFS-14694: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 50s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 10s{color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 44s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 4m 1s{color} | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 39s{color} | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 56s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 9s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 17m 23s{color} | {color:green} branch 
has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 28s{color} | {color:green} trunk passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 5s{color} | {color:green} trunk passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue} 2m 57s{color} | {color:blue} Used deprecated FindBugs config; considering switching to SpotBugs. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 13s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 24s{color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 55s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 53s{color} | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 53s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 3m 34s{color} | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 3m 34s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 50s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 58s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | 
{color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 38s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 22s{color} | {color:green} the patch passed with JDK Ubuntu-11.0.8+10-post-Ubuntu-0ubuntu118.04.1 {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 55s{color} | {color:green} the patch passed with JDK Private Build-1.8.0_265-8u265-b01-0ubuntu2~18.04-b01 {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 59s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || |
[jira] [Commented] (HDFS-14694) Call recoverLease on DFSOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190162#comment-17190162 ] Xiaoqiao He commented on HDFS-14694: Thanks [~leosun08] for your continued patches. I will give my +1 on [^HDFS-14694.010.patch] after removing the unused print `System.out.println("sls close:" + closed);` Thanks again. > Call recoverLease on DFSOutputStream close exception > > > Key: HDFS-14694 > URL: https://issues.apache.org/jira/browse/HDFS-14694 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Chen Zhang >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, > HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, > HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch, > HDFS-14694.009.patch, HDFS-14694.010.patch > > > HDFS uses file leases to manage open files; when a file is not closed > normally, the NN recovers the lease automatically after the hard limit is > exceeded. But for a long-running service (e.g. HBase), the hdfs-client never > dies, so the NN never gets a chance to recover the file. > Usually the client program needs to handle exceptions itself to avoid this > condition (e.g. HBase automatically calls recoverLease for files that were not > closed normally), but in our experience most services (in our company) don't > handle this condition properly, which leaves lots of files in an abnormal > state or even causes data loss. > This Jira proposes to add a feature that calls the recoverLease operation > automatically when DFSOutputStream close encounters an exception. It should be > disabled by default, but when somebody builds a long-running service based on > HDFS, they can enable this option. > We've had this feature in our internal Hadoop distribution for more than 3 > years, and it's quite useful in our experience. 
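The behavior the issue proposes — trigger lease recovery when close() fails instead of waiting for the hard-limit expiry — can be sketched generically. This is a hypothetical illustration, not the HDFS-14694 patch: `closeOrRecover` and the `Runnable` recovery hook are invented names; in real HDFS the hook would call `DistributedFileSystem#recoverLease(Path)`.

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch of the proposed close-time behavior (assumed names, not the patch):
// if close() throws, run a lease-recovery callback so the file's lease can be
// released immediately rather than after the NN's hard-limit timeout.
public final class RecoverOnClose {
    private RecoverOnClose() {}

    /**
     * Attempts to close the stream. Returns true on a clean close; on an
     * IOException, invokes the recovery hook and returns false.
     */
    public static boolean closeOrRecover(Closeable out, Runnable recoverLease) {
        try {
            out.close();
            return true;
        } catch (IOException e) {
            // Close failed mid-pipeline: ask the NameNode to recover the lease
            // so the file does not stay open under the dead writer.
            recoverLease.run();
            return false;
        }
    }
}
```

As the issue notes, this should be opt-in: recovery on close is only appropriate for long-running writers that would otherwise hold the lease indefinitely.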
[jira] [Commented] (HDFS-13351) Revert HDFS-11156 from branch-2/branch-2.8
[ https://issues.apache.org/jira/browse/HDFS-13351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190161#comment-17190161 ] Hadoop QA commented on HDFS-13351: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} | | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 11s{color} | {color:red} HDFS-13351 does not apply to branch-2. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | JIRA Issue | HDFS-13351 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12918911/HDFS-13351-branch-2.003.patch | | Console output | https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/125/console | | versions | git=2.17.1 | | Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org | This message was automatically generated. > Revert HDFS-11156 from branch-2/branch-2.8 > -- > > Key: HDFS-13351 > URL: https://issues.apache.org/jira/browse/HDFS-13351 > Project: Hadoop HDFS > Issue Type: Task > Components: webhdfs >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Major > Labels: release-blocker > Attachments: HDFS-13351-branch-2.001.patch, > HDFS-13351-branch-2.002.patch, HDFS-13351-branch-2.003.patch > > > Per discussion in HDFS-11156, lets revert the change from branch-2 and > branch-2.8. New patch can be tracked in HDFS-12459 . -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-13351) Revert HDFS-11156 from branch-2/branch-2.8
[ https://issues.apache.org/jira/browse/HDFS-13351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190159#comment-17190159 ] Masatake Iwasaki commented on HDFS-13351: - I updated the 'Target Version/s:' to 2.9.3 since HDFS-11156 has already been reverted from branch-2.10. 2.9.3 will not be released, since there is an ongoing vote for the EOL of branch-2.9. > Revert HDFS-11156 from branch-2/branch-2.8 > -- > > Key: HDFS-13351 > URL: https://issues.apache.org/jira/browse/HDFS-13351 > Project: Hadoop HDFS > Issue Type: Task > Components: webhdfs >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Major > Labels: release-blocker > Attachments: HDFS-13351-branch-2.001.patch, > HDFS-13351-branch-2.002.patch, HDFS-13351-branch-2.003.patch > > > Per discussion in HDFS-11156, let's revert the change from branch-2 and > branch-2.8. The new patch can be tracked in HDFS-12459.
[jira] [Updated] (HDFS-13351) Revert HDFS-11156 from branch-2/branch-2.8
[ https://issues.apache.org/jira/browse/HDFS-13351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated HDFS-13351: Target Version/s: 2.9.3 (was: 2.10.1) > Revert HDFS-11156 from branch-2/branch-2.8 > -- > > Key: HDFS-13351 > URL: https://issues.apache.org/jira/browse/HDFS-13351 > Project: Hadoop HDFS > Issue Type: Task > Components: webhdfs >Reporter: Weiwei Yang >Assignee: Weiwei Yang >Priority: Major > Labels: release-blocker > Attachments: HDFS-13351-branch-2.001.patch, > HDFS-13351-branch-2.002.patch, HDFS-13351-branch-2.003.patch > > > Per discussion in HDFS-11156, lets revert the change from branch-2 and > branch-2.8. New patch can be tracked in HDFS-12459 .
[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190115#comment-17190115 ] Hongbing Wang commented on HDFS-15556: -- BPServiceActor uses the `initialRegistrationComplete` field (a `CountDownLatch(1)`) to ensure that the sendLifeline thread runs only after registration has completed. This guard does not take effect on a reRegister, because `initialRegistrationComplete` was already counted down during the first registration. > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG > > > In our cluster, the NameNode hits an NPE when processing lifeline messages > sent by a DataNode, which corrupts the maxLoad value calculated by the NN. > Because the DataNode is then identified as busy and cannot be chosen when > allocating available nodes, block placement loops repeatedly, driving CPU > high and reducing the processing performance of the cluster. 
> *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... 
> for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
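The `updateStorageStats` snippet above dereferences the `storageMap` lookup without a null check, which is where the NPE fires when a concurrent re-registration has pruned the storage. A minimal sketch of the needed guard — the names `applyReports` and the `Map<String, String>` stand-in for `storageMap` are illustrative assumptions, not the actual HDFS-15556 patch:

```java
import java.util.Map;

// Sketch (hypothetical names): a lifeline can carry a report for a storage
// that registerDatanode/pruneStorageMap just removed, so the map lookup may
// return null and must be skipped rather than dereferenced.
public final class LifelineGuard {
    private LifelineGuard() {}

    /**
     * Applies storage reports by id, skipping ids with no known storage.
     * Returns the number of reports actually applied.
     */
    public static int applyReports(Map<String, String> storageMap, String[] reportIds) {
        int applied = 0;
        for (String id : reportIds) {
            String storage;
            synchronized (storageMap) {
                storage = storageMap.get(id);
            }
            if (storage == null) {
                // Storage pruned by a concurrent reRegister: skip this report
                // instead of hitting the NPE seen in updateStorageStats.
                continue;
            }
            applied++; // real code would call storage.receivedHeartbeat(report)
        }
        return applied;
    }
}
```

The same pattern (check the lookup result before `receivedHeartbeat`) would close the window without changing the lifeline protocol.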
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: NN_DN.LOG > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. > *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > 
at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: (was: NN_DN.LOG) > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. > *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-12548) HDFS Jenkins build is unstable on branch-2
[ https://issues.apache.org/jira/browse/HDFS-12548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190111#comment-17190111 ] Masatake Iwasaki commented on HDFS-12548: - Since there has been no update for a long time, I updated 'Target Version/s:' for preparing 2.10.1 release. > HDFS Jenkins build is unstable on branch-2 > -- > > Key: HDFS-12548 > URL: https://issues.apache.org/jira/browse/HDFS-12548 > Project: Hadoop HDFS > Issue Type: Bug > Components: build >Affects Versions: 2.9.0 >Reporter: Rushabh Shah >Priority: Critical > > Feel free move the ticket to another project (e.g. infra). > Recently I attached branch-2 patch while working on one jira > [HDFS-12386|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180676=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180676] > There were at-least 100 failed and timed out tests. I am sure they are not > related to my patch. > Also I came across another jira which was just a javadoc related change and > there were around 100 failed tests. > Below are the details for pre-commits that failed in branch-2 > 1 [HDFS-12386 attempt > 1|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180069=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180069] > {noformat} > Ran on slave: asf912.gq1.ygridcore.net/H12 > Failed with following error message: > Build timed out (after 300 minutes). Marking the build as aborted. > Build was aborted > Performing Post build task... > {noformat} > 2. 
[HDFS-12386 attempt > 2|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180676=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180676] > {noformat} > Ran on slave: asf900.gq1.ygridcore.net > Failed with following error message: > FATAL: command execution failed > Command close created at > at hudson.remoting.Command.(Command.java:60) > at hudson.remoting.Channel$CloseCommand.(Channel.java:1123) > at hudson.remoting.Channel$CloseCommand.(Channel.java:1121) > at hudson.remoting.Channel.close(Channel.java:1281) > at hudson.remoting.Channel.close(Channel.java:1263) > at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1128) > Caused: hudson.remoting.Channel$OrderlyShutdown > at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1129) > at hudson.remoting.Channel$1.handle(Channel.java:527) > at > hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:83) > Caused: java.io.IOException: Backing channel 'H0' is disconnected. 
> at > hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:192) > at > hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:257) > at com.sun.proxy.$Proxy125.isAlive(Unknown Source) > at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1043) > at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1035) > at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:155) > at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:109) > at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66) > at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20) > at > hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:735) > at hudson.model.Build$BuildExecution.build(Build.java:206) > at hudson.model.Build$BuildExecution.doRun(Build.java:163) > at > hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:490) > at hudson.model.Run.execute(Run.java:1735) > at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) > at hudson.model.ResourceController.execute(ResourceController.java:97) > at hudson.model.Executor.run(Executor.java:405) > {noformat} > 3. 
[HDFS-12531 attempt > 1|https://issues.apache.org/jira/browse/HDFS-12531?focusedCommentId=16176493=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16176493] > {noformat} > Ran on slave: asf911.gq1.ygridcore.net > Failed with following error message: > FATAL: command execution failed > Command close created at > at hudson.remoting.Command.(Command.java:60) > at hudson.remoting.Channel$CloseCommand.(Channel.java:1123) > at hudson.remoting.Channel$CloseCommand.(Channel.java:1121) > at hudson.remoting.Channel.close(Channel.java:1281) > at hudson.remoting.Channel.close(Channel.java:1263) > at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1128) > Caused: hudson.remoting.Channel$OrderlyShutdown > at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1129) >
[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190108#comment-17190108 ] huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:48 PM: [~hexiaoqiao] Thanks for your comments. {quote} Great catch here. v001 is fair for me, it will be better if add new unit test to cover. {quote} I'll add to it later unit test {quote} I am interested that why storage is null here. Anywhere not synchronized storageMap where should do that? {quote} the cause of occurred the problem is: {quote} 1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be occurred when the service is restored: BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat 2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove storageMap) for the registered DN 3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, the Lifeline reports to NN, But at this point, the storageMap is null of the DN is recorded at the NN occurred NPE {quote} detailed execution log [^NN_DN.LOG] Source code is: HeartbeatManager#updateLifeline {code:java} synchronized void updateLifeline(final DatanodeDescriptor node,StorageReport[] reports, long cacheCapacity, long cacheUsed,int xceiverCount, int failedVolumes, VolumeFailureSummary volumeFailureSummary) { stats.subtract(node); //Every time DN heartbeat report,nodesInServiceXceiverCount will be minus the XceiverCount of the DN of the current ... 
node.updateHeartbeatState(reports, cacheCapacity, cacheUsed, xceiverCount, failedVolumes, volumeFailureSummary); //NPE exception occurred here throws stats.add(node); //Here logic is never executed } {code} BlockPlacementPolicyDefault#excludeNodeByLoad {code:java} boolean excludeNodeByLoad(DatanodeDescriptor node){ final double maxLoad = considerLoadFactor * stats.getInServiceXceiverAverage(); //stats.getInServiceXceiverAverage()= heartbeatManager.getInServiceXceiverCount()/getNumDatanodesInService() //the final maxLoad value will be affected final int nodeLoad = node.getXceiverCount(); if ((nodeLoad > maxLoad) && (maxLoad > 0)) { logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY, "(load: " + nodeLoad + " > " + maxLoad + ")"); return true; } return false; } {code} was (Author: haiyang hu): 3.the cause of occurred the problem is: {quote} 1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be occurred when the service is restored: BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat 2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove storageMap) for the registered DN 3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, the Lifeline reports to NN, But at this point, the storageMap is null of the DN is recorded at the NN occurred NPE {quote} 4. detailed execution log [^NN_DN.LOG] 5.Source code is: HeartbeatManager#updateLifeline {code:java} synchronized void updateLifeline(final DatanodeDescriptor node,StorageReport[] reports, long cacheCapacity, long cacheUsed,int xceiverCount, int failedVolumes, VolumeFailureSummary volumeFailureSummary) { stats.subtract(node); //Every time DN heartbeat report,nodesInServiceXceiverCount will be minus the XceiverCount of the DN of the current ... 
node.updateHeartbeatState(reports, cacheCapacity, cacheUsed, xceiverCount, failedVolumes, volumeFailureSummary); //NPE exception occurred here throws stats.add(node); //Here logic is never executed } {code} BlockPlacementPolicyDefault#excludeNodeByLoad {code:java} boolean excludeNodeByLoad(DatanodeDescriptor node){ final double maxLoad = considerLoadFactor * stats.getInServiceXceiverAverage(); //stats.getInServiceXceiverAverage()= heartbeatManager.getInServiceXceiverCount()/getNumDatanodesInService() //the final maxLoad value will be affected final int nodeLoad = node.getXceiverCount(); if ((nodeLoad > maxLoad) && (maxLoad > 0)) { logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY, "(load: " + nodeLoad + " > " + maxLoad + ")"); return true; } return false; } {code} > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG > > > In our cluster, the NameNode appears NPE when processing lifeline
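The comment above observes that `initialRegistrationComplete` is a one-shot `CountDownLatch(1)`, so it gates lifelines only for the first registration, not for a reRegister. One way the race could be closed is by re-arming the gate before re-registering; the sketch below is a hypothetical illustration of that idea (`RegistrationGate`, `resetGate` are invented names, not code from any HDFS-15556 patch):

```java
import java.util.concurrent.CountDownLatch;

// Sketch of the gating described in the comments: the lifeline thread waits
// on a latch that registration counts down. A one-shot latch stays open
// forever after the first registration; resetGate() re-arms it so a
// reRegister is covered too (hypothetical fix, for illustration only).
public final class RegistrationGate {
    private volatile CountDownLatch gate = new CountDownLatch(1);

    /** Called by the heartbeat thread when (re)registration finishes. */
    public void registrationComplete() {
        gate.countDown();
    }

    /** Hypothetical: re-arm the gate just before starting a reRegister. */
    public void resetGate() {
        gate = new CountDownLatch(1);
    }

    /** Lifeline thread blocks here until the current registration completes. */
    public void awaitRegistered() throws InterruptedException {
        gate.await();
    }

    /** True once the current registration has completed. */
    public boolean isOpen() {
        return gate.getCount() == 0;
    }
}
```

With only the one-shot latch, `isOpen()` would stay true across a reRegister, which is exactly the window in which the NN-side NPE above is reachable.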
[jira] [Updated] (HDFS-12548) HDFS Jenkins build is unstable on branch-2
[ https://issues.apache.org/jira/browse/HDFS-12548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated HDFS-12548: Target Version/s: 2.10.2 (was: 2.10.1) > HDFS Jenkins build is unstable on branch-2 > -- > > Key: HDFS-12548 > URL: https://issues.apache.org/jira/browse/HDFS-12548 > Project: Hadoop HDFS > Issue Type: Bug > Components: build >Affects Versions: 2.9.0 >Reporter: Rushabh Shah >Priority: Critical > > Feel free move the ticket to another project (e.g. infra). > Recently I attached branch-2 patch while working on one jira > [HDFS-12386|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180676=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180676] > There were at-least 100 failed and timed out tests. I am sure they are not > related to my patch. > Also I came across another jira which was just a javadoc related change and > there were around 100 failed tests. > Below are the details for pre-commits that failed in branch-2 > 1 [HDFS-12386 attempt > 1|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180069=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180069] > {noformat} > Ran on slave: asf912.gq1.ygridcore.net/H12 > Failed with following error message: > Build timed out (after 300 minutes). Marking the build as aborted. > Build was aborted > Performing Post build task... > {noformat} > 2. 
[HDFS-12386 attempt
> 2|https://issues.apache.org/jira/browse/HDFS-12386?focusedCommentId=16180676=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16180676]
> {noformat}
> Ran on slave: asf900.gq1.ygridcore.net
> Failed with following error message:
> FATAL: command execution failed
> Command close created at
> at hudson.remoting.Command.<init>(Command.java:60)
> at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1123)
> at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1121)
> at hudson.remoting.Channel.close(Channel.java:1281)
> at hudson.remoting.Channel.close(Channel.java:1263)
> at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1128)
> Caused: hudson.remoting.Channel$OrderlyShutdown
> at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1129)
> at hudson.remoting.Channel$1.handle(Channel.java:527)
> at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:83)
> Caused: java.io.IOException: Backing channel 'H0' is disconnected.
> at > hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:192) > at > hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:257) > at com.sun.proxy.$Proxy125.isAlive(Unknown Source) > at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1043) > at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1035) > at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:155) > at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:109) > at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:66) > at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20) > at > hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:735) > at hudson.model.Build$BuildExecution.build(Build.java:206) > at hudson.model.Build$BuildExecution.doRun(Build.java:163) > at > hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:490) > at hudson.model.Run.execute(Run.java:1735) > at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43) > at hudson.model.ResourceController.execute(ResourceController.java:97) > at hudson.model.Executor.run(Executor.java:405) > {noformat} > 3. 
[HDFS-12531 attempt
> 1|https://issues.apache.org/jira/browse/HDFS-12531?focusedCommentId=16176493=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16176493]
> {noformat}
> Ran on slave: asf911.gq1.ygridcore.net
> Failed with following error message:
> FATAL: command execution failed
> Command close created at
> at hudson.remoting.Command.<init>(Command.java:60)
> at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1123)
> at hudson.remoting.Channel$CloseCommand.<init>(Channel.java:1121)
> at hudson.remoting.Channel.close(Channel.java:1281)
> at hudson.remoting.Channel.close(Channel.java:1263)
> at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1128)
> Caused: hudson.remoting.Channel$OrderlyShutdown
> at hudson.remoting.Channel$CloseCommand.execute(Channel.java:1129)
> at hudson.remoting.Channel$1.handle(Channel.java:527)
> at >
[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190108#comment-17190108 ] huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:43 PM: 3. The cause of the problem:
{quote}
1. A DataNode's heartbeat to the NN times out; when the service recovers, a DNA_REGISTER command is processed: BPServiceActor#run -> offerService -> processCommand -> reRegister -> sendHeartBeat
2. While handling registerDatanode, the NN calls DatanodeDescriptor#pruneStorageMap for the registering DN, removing its entries from storageMap
3. The DN's re-registration takes about a minute; once the heartbeat is more than 9 seconds late, the LifelineSender reports to the NN, but at that point the NN no longer has a storageMap entry for the DN, so the NPE occurs
{quote}
4. Detailed execution log: [^NN_DN.LOG]
5. Source code:
HeartbeatManager#updateLifeline
{code:java}
synchronized void updateLifeline(final DatanodeDescriptor node,
    StorageReport[] reports, long cacheCapacity, long cacheUsed,
    int xceiverCount, int failedVolumes,
    VolumeFailureSummary volumeFailureSummary) {
  // On every DN report, nodesInServiceXceiverCount is decremented by this DN's xceiver count
  stats.subtract(node);
  node.updateHeartbeatState(reports, cacheCapacity, cacheUsed, xceiverCount,
      failedVolumes, volumeFailureSummary); // the NPE is thrown here, so...
  stats.add(node); // ...this line is never executed
}
{code}
BlockPlacementPolicyDefault#excludeNodeByLoad
{code:java}
boolean excludeNodeByLoad(DatanodeDescriptor node) {
  // stats.getInServiceXceiverAverage() =
  //   heartbeatManager.getInServiceXceiverCount() / getNumDatanodesInService(),
  // so the skipped stats.add(node) skews the final maxLoad value
  final double maxLoad = considerLoadFactor * stats.getInServiceXceiverAverage();
  final int nodeLoad = node.getXceiverCount();
  if ((nodeLoad > maxLoad) && (maxLoad > 0)) {
    logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY,
        "(load: " + nodeLoad + " > " + maxLoad + ")");
    return true;
  }
  return false;
}
{code}
> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
> In our cluster, the NameNode hits an NPE while processing lifeline messages
> sent by the DataNode, which corrupts the maxLoad value calculated by the NN.
> Because the DataNode is then identified as busy, no available node can be
> allocated when choosing DataNodes; the resulting retry loop drives CPU usage
> high and reduces the processing performance of the cluster.
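The subtract-then-add pattern in updateLifeline is the accounting hazard here. A minimal, self-contained sketch (hypothetical numbers and a plain static counter, not HDFS code) of how an exception thrown between stats.subtract(node) and stats.add(node) leaves the aggregate permanently understated:

```java
// Sketch (hypothetical, simplified): an exception between subtract and add
// leaves the shared counter low, just like the skipped stats.add(node).
public class StatsImbalanceSketch {
    static long inServiceXceiverCount = 30;

    static void updateLifeline(int nodeXceivers, boolean throwNpe) {
        inServiceXceiverCount -= nodeXceivers;   // stands in for stats.subtract(node)
        if (throwNpe) {
            // stands in for the NPE from updateHeartbeatState
            throw new NullPointerException("storage pruned");
        }
        inServiceXceiverCount += nodeXceivers;   // stats.add(node) never runs
    }

    public static void main(String[] args) {
        try {
            updateLifeline(25, true);
        } catch (NullPointerException ignored) {
            // the RPC handler logs and returns; nothing restores the counter
        }
        System.out.println(inServiceXceiverCount); // 5, not 30: average now skewed
    }
}
```

Because the handler swallows the RPC failure, every subsequent lifeline from the affected DN repeats the imbalance, which is why the NPE "keeps occurring" in the logs until a full heartbeat re-adds the storage.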
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: Attachment: NN_DN.LOG
> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
> In our cluster, the NameNode hits an NPE while processing lifeline messages
> sent by the DataNode, which corrupts the maxLoad value calculated by the NN.
> Because the DataNode is then identified as busy, no available node can be
> allocated when choosing DataNodes; the resulting retry loop drives CPU usage
> high and reduces the processing performance of the cluster.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler
> 5 on 8022, call Call#20535 Retry#0
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline
> from x:34766
> java.lang.NullPointerException
> at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
>
at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
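The storageMap lookup in the updateStorageStats snippet above can return null when pruneStorageMap has raced with an in-flight lifeline. The following is a self-contained simulation of that race using a plain HashMap (hypothetical structure and names, not the committed patch): a null check turns the pruned-storage case into a skipped report instead of an NPE.

```java
import java.util.HashMap;
import java.util.Map;

// Self-contained simulation (hypothetical, not HDFS code): a lifeline arrives
// after pruneStorageMap removed the entry; the null guard skips the pruned
// storage instead of throwing NPE on storage.receivedHeartbeat(report).
public class LifelineGuardSketch {
    // storageMap maps storage ID -> a one-element counter standing in for
    // DatanodeStorageInfo; returns how many reports were actually applied
    static int updateStorageStats(Map<String, int[]> storageMap, String[] reportIds) {
        int updated = 0;
        for (String id : reportIds) {
            int[] storage;
            synchronized (storageMap) {
                storage = storageMap.get(id);
            }
            if (storage == null) {
                // pruned between re-registration and this lifeline;
                // the next full heartbeat re-adds it, so skip for now
                continue;
            }
            storage[0]++; // stands in for storage.receivedHeartbeat(report)
            updated++;
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<String, int[]> storageMap = new HashMap<>();
        storageMap.put("DS-1", new int[1]);
        // "DS-2" was pruned by re-registration; without the guard this would NPE
        System.out.println(updateStorageStats(storageMap, new String[]{"DS-1", "DS-2"}));
    }
}
```

Whether the real fix should skip the report, re-create the storage entry, or reject the whole lifeline is a design decision for the patch; this sketch only demonstrates the guard that removes the crash.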
[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190108#comment-17190108 ] huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:36 PM: 3. The cause of the problem:
{quote}
1. A DataNode's heartbeat to the NN times out; when the service recovers, a DNA_REGISTER command is processed: BPServiceActor#run -> offerService -> processCommand -> reRegister -> sendHeartBeat
2. While handling registerDatanode, the NN calls DatanodeDescriptor#pruneStorageMap for the registering DN, removing its entries from storageMap
3. The DN's re-registration takes about a minute; once the heartbeat is more than 9 seconds late, the LifelineSender reports to the NN, but at that point the NN no longer has a storageMap entry for the DN, so the NPE occurs
{quote}
{code:java}
// execution log
// NameNode LOG:
# registered DN:
2020-08-25 00:58:53,977 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* registerDatanode: from DatanodeRegistration(xxx:50010,xxx) storage xxx
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: xxx:50010
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: xx:50010
2020-08-25 00:58:53,977 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: [DISK]:NORMAL:xxx:50010 failed.
2020-08-25 00:58:53,978 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Removed storage [DISK]xxx:FAILED:xxx:50010 from DataNode xxx:50010
...
# sendLifeline NPE: the NPE keeps occurring from 2020-08-25 00:59:02,977 to 2020-08-25 00:59:45,668
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8022, call Call#20535 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from DN:34766 java.lang.NullPointerException ...
2020-08-25 00:59:45,668 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8022, call Call#67833 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from DN:34766 java.lang.NullPointerException ... #DN sendHeartBeat the NN will add storageMap: 2020-08-25 00:59:46,632 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new storage ID xxx for DN xxx:50010 DN LOG: #DN run DNA_REGISTER 2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action : DNA_REGISTER from NN:8021 with active state 2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-xxx (Datanode Uuid xxx) service to NN:8021 beginning handshake with NN 2020-08-25 00:59:02,976 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in LifelineSender for Block pool XXX service to NN:8021 org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) at org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) at org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) at org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511) at org.apache.hadoop.ipc.Client.call(Client.java:1457) at org.apache.hadoop.ipc.Client.call(Client.java:1367) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) at
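The event ordering in the logs above can be replayed deterministically in miniature. This is an illustrative sketch only, with a plain `HashMap` standing in for the NN's `storageMap` (not HDFS code): the lifeline read lands in the window between `pruneStorageMap` and the next full heartbeat re-adding the storage.

```java
import java.util.HashMap;
import java.util.Map;

class LifelineRaceDemo {
    // Replays the sequence from the log: register -> prune -> lifeline -> heartbeat.
    static boolean lifelineSeesNull() {
        Map<String, String> storageMap = new HashMap<>();
        storageMap.put("DS-1", "storageInfo");          // DN registered earlier
        storageMap.remove("DS-1");                      // DNA_REGISTER -> pruneStorageMap
        String seenByLifeline = storageMap.get("DS-1"); // lifeline report races in here
        storageMap.put("DS-1", "storageInfo");          // next sendHeartBeat re-adds it
        return seenByLifeline == null;                  // this null is what triggered the NPE
    }
}
```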
[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190102#comment-17190102 ] Xiaoqiao He edited comment on HDFS-15556 at 9/3/20, 12:30 PM: -- [~haiyang Hu] Thanks for the report. Great catch here. v001 looks fair to me; it would be better to add a new unit test to cover it. I am interested in why {{storage}} is null here. Is {{storageMap}} accessed anywhere without synchronization where it should be synchronized? was (Author: hexiaoqiao): [~haiyang Hu] Great catch here. v001 is fair for me, it will be better if add new unit test to cover. I am interested that why {{storage}} is null here. Anywhere not synchronized {{storageMap}} where should do that? > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png > > > In our cluster, the NameNode hits an NPE when processing lifeline messages > sent by a DataNode, which causes the maxLoad calculated by the NN to be wrong. > Because the DataNode is identified as busy and cannot be allocated as an > available node when choosing DataNodes, the placement loop executes repeatedly, > resulting in high CPU and reduced processing performance of the cluster. 
> *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... 
> for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
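Given the race described above, one defensive fix is to null-check the storage fetched from {{storageMap}} and skip that report instead of dereferencing it (the actual HDFS-15556 patch may differ). A minimal sketch, with a plain `Map` standing in for `storageMap` and a counter standing in for `storage.receivedHeartbeat(report)`:

```java
import java.util.Map;

class StorageStatsGuard {
    // Processes the storage reports whose storage is still present in the map;
    // a null entry means DNA_REGISTER pruned it and the next heartbeat has not
    // re-added it yet, so the report is skipped rather than causing an NPE.
    static int processReports(Map<String, Object> storageMap, String[] reportIds) {
        int processed = 0;
        for (String id : reportIds) {
            Object storage;
            synchronized (storageMap) {
                storage = storageMap.get(id);
            }
            if (storage == null) {
                continue; // pruned by re-registration; skip instead of NPE-ing
            }
            processed++; // stand-in for storage.receivedHeartbeat(report)
        }
        return processed;
    }
}
```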
[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17190102#comment-17190102 ] Xiaoqiao He commented on HDFS-15556: [~haiyang Hu] Great catch here. v001 is fair for me, it will be better if add new unit test to cover. I am interested that why {{storage}} is null here. Anywhere not synchronized {{storageMap}} where should do that? > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. 
> *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... 
> for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15163) hdfs-2.10.0-webapps-secondary-status.html miss moment.js
[ https://issues.apache.org/jira/browse/HDFS-15163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated HDFS-15163: Target Version/s: 2.10.1 (was: 2.10.0) > hdfs-2.10.0-webapps-secondary-status.html miss moment.js > > > Key: HDFS-15163 > URL: https://issues.apache.org/jira/browse/HDFS-15163 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.10.0 >Reporter: 谢波 >Priority: Minor > Fix For: 2.10.1 > > Attachments: 微信截图_20200212183444.png > > Original Estimate: 96h > Remaining Estimate: 96h > > hdfs-2.10.0-webapps-secondary-status.html miss moment.js > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15163) hdfs-2.10.0-webapps-secondary-status.html miss moment.js
[ https://issues.apache.org/jira/browse/HDFS-15163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated HDFS-15163: Fix Version/s: (was: 2.10.1) > hdfs-2.10.0-webapps-secondary-status.html miss moment.js > > > Key: HDFS-15163 > URL: https://issues.apache.org/jira/browse/HDFS-15163 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 2.10.0 >Reporter: 谢波 >Priority: Minor > Attachments: 微信截图_20200212183444.png > > Original Estimate: 96h > Remaining Estimate: 96h > > hdfs-2.10.0-webapps-secondary-status.html miss moment.js > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14694) Call recoverLease on DFSOutputStream close exception
[ https://issues.apache.org/jira/browse/HDFS-14694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14694: --- Attachment: HDFS-14694.010.patch > Call recoverLease on DFSOutputStream close exception > > > Key: HDFS-14694 > URL: https://issues.apache.org/jira/browse/HDFS-14694 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Reporter: Chen Zhang >Assignee: Lisheng Sun >Priority: Major > Attachments: HDFS-14694.001.patch, HDFS-14694.002.patch, > HDFS-14694.003.patch, HDFS-14694.004.patch, HDFS-14694.005.patch, > HDFS-14694.006.patch, HDFS-14694.007.patch, HDFS-14694.008.patch, > HDFS-14694.009.patch, HDFS-14694.010.patch > > > HDFS uses file leases to manage opened files; when a file is not closed > normally, the NN recovers the lease automatically after the hard limit is > exceeded. But for a long-running service (e.g. HBase), the hdfs-client never > dies, so the NN never gets a chance to recover the file. > Usually the client program needs to handle exceptions itself to avoid this > condition (e.g. HBase automatically calls recoverLease for files that were not > closed normally), but in our experience most services (in our company) don't > handle this condition properly, which causes lots of files in an abnormal > status or even data loss. > This Jira proposes adding a feature that calls the recoverLease operation > automatically when DFSOutputStream close encounters an exception. It should be > disabled by default, but when somebody builds a long-running service based on > HDFS, they can enable this option. > We've had this feature in our internal Hadoop distribution for more than 3 > years, and it's quite useful in our experience. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
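The behavior this Jira proposes can also be approximated at the application level while waiting for a patched client. The sketch below is illustrative only: the class and hook names are invented (this is not the HDFS-14694 patch), and the recovery hook would typically delegate to `DistributedFileSystem#recoverLease`. A `close()` failure optionally triggers lease recovery, while the original exception is still rethrown to the caller.

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical wrapper demonstrating the proposed "recover lease on close
// exception" pattern. Disabled by default, matching the Jira's proposal.
class RecoveringOutputStream extends OutputStream {
    interface LeaseRecoverer { void recoverLease() throws IOException; }

    private final OutputStream delegate;
    private final LeaseRecoverer recoverer;
    private final boolean recoverOnCloseException;

    RecoveringOutputStream(OutputStream delegate, LeaseRecoverer recoverer,
                           boolean recoverOnCloseException) {
        this.delegate = delegate;
        this.recoverer = recoverer;
        this.recoverOnCloseException = recoverOnCloseException;
    }

    @Override public void write(int b) throws IOException { delegate.write(b); }

    @Override public void close() throws IOException {
        try {
            delegate.close();
        } catch (IOException e) {
            if (recoverOnCloseException) {
                try {
                    recoverer.recoverLease(); // let the NN release the file now
                } catch (IOException suppressed) {
                    e.addSuppressed(suppressed);
                }
            }
            throw e; // still surface the original close failure to the caller
        }
    }
}
```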
[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189991#comment-17189991 ] huhaiyang edited comment on HDFS-15556 at 9/3/20, 9:30 AM: --- 1. NameNode CPU is high; the thread stack is:
{code:java}
"IPC Server handler 59 on 8020" #244 daemon prio=5 os_prio=0 tid=0x7f18b0ff7800 nid=0x1c006 runnable [0x7f185cbfc000]
   java.lang.Thread.State: RUNNABLE
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
	at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
	at org.apache.hadoop.net.NetworkTopology.getNode(NetworkTopology.java:263)
	at org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:678)
	at org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
	at org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:903)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:800)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:768)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseFromNextRack(BlockPlacementPolicyDefault.java:719)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:687)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:534)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:440)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:310)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:149)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:174)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2239)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2828)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:913)
{code}
2. There are a large number of these logs, and in extreme cases no DN node in the cluster satisfies the allocation:
{code:java}
2020-08-25 01:38:50,370 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not enough replicas was chosen. Reason:{NODE_TOO_BUSY=xxx}
2020-08-25 01:38:50,370 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 3 to reach 3 (unavailableStoragrom storage xxx node DatanodeRegistration(:50010, datanodeUuid=xxx, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-57;cid=xxx;nsid=;c=0), blocks: 2266, hasStaleStorage: false, processing time: 7 msecs, invalidatedBlocks: 0
2020-08-25 01:38:50,370 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not enough replicas was chosen. Reason:{NODE_TOO_BUSY=xxx}
{code}
was (Author: haiyang hu): 1. 
CPU NameNode high, thread stack is {code:java} "IPC Server handler 59 on 8020" #244 daemon prio=5 os_prio=0 tid=0x7f18b0ff7800 nid=0x1c006 runnable [0x7f185cbfc000] java.lang.Thread.State: RUNNABLE at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) at org.apache.hadoop.net.NetworkTopology.getNode(NetworkTopology.java:263) at org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:678) at org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533) at org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:903) at
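The NODE_TOO_BUSY rejections in the log above come from the NameNode's considerLoad placement check: roughly, a node is excluded when its active-transfer (xceiver) count exceeds a configurable factor times the cluster's average load, so when the NPE keeps load stats from being updated, stale counts can mark every node busy and keep chooseTarget looping, which matches the high-CPU stack above. A toy sketch of that predicate (constants and names here are illustrative, not the BlockPlacementPolicyDefault source):

```java
class LoadChecker {
    // A node is considered too busy when its xceiver count exceeds
    // factor * average cluster load (the factor is configurable in HDFS).
    static boolean tooBusy(int nodeXceivers, double avgLoad, double factor) {
        return nodeXceivers > factor * avgLoad;
    }
}
```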
[jira] [Commented] (HDFS-14351) RBF: Optimize configuration item resolving for monitor namenode
[ https://issues.apache.org/jira/browse/HDFS-14351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189992#comment-17189992 ] Fei Hui commented on HDFS-14351: Maybe it would be helpful to backport this to the other 3.x branches. Thanks > RBF: Optimize configuration item resolving for monitor namenode > --- > > Key: HDFS-14351 > URL: https://issues.apache.org/jira/browse/HDFS-14351 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: rbf >Reporter: Xiaoqiao He >Assignee: Xiaoqiao He >Priority: Major > Fix For: 3.3.0, HDFS-13891 > > Attachments: HDFS-14351-HDFS-13891.001.patch, > HDFS-14351-HDFS-13891.002.patch, HDFS-14351-HDFS-13891.003.patch, > HDFS-14351-HDFS-13891.004.patch, HDFS-14351-HDFS-13891.005.patch, > HDFS-14351-HDFS-13891.006.patch, HDFS-14351.001.patch, HDFS-14351.002.patch > > > We invoke {{configuration.get}} to resolve the configuration item > `dfs.federation.router.monitor.namenode` in `Router.java`, then split the > value by comma to get the nsId and nnId; this may confuse users since it does > not tolerate blank space, while other common parameters do. The following > segment shows an example where resolution fails.
> {code:java}
> <property>
>   <name>dfs.federation.router.monitor.namenode</name>
>   <value>nameservice1.nn1, nameservice1.nn2</value>
>   <description>
>     The identifier of the namenodes to monitor and heartbeat.
>   </description>
> </property>
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
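The comma-splitting pitfall the description mentions can be shown in a few lines. The sketch below is a hypothetical helper, not the Router code (Hadoop's `Configuration#getTrimmedStrings` does the equivalent trimming): each token is trimmed so a value with blank space after the comma still resolves to the right nsId and nnId.

```java
import java.util.ArrayList;
import java.util.List;

class MonitorNamenodeParser {
    // Splits "ns1.nn1, ns1.nn2" into {nsId, nnId} pairs. Without trim(),
    // " nameservice1.nn2" keeps its leading space and the lookup fails,
    // which is the behavior HDFS-14351 fixes.
    static List<String[]> parse(String raw) {
        List<String[]> result = new ArrayList<>();
        for (String token : raw.split(",")) {
            String t = token.trim();        // tolerate blank space around commas
            if (t.isEmpty()) continue;
            int dot = t.indexOf('.');
            result.add(dot < 0
                ? new String[]{t}
                : new String[]{t.substring(0, dot), t.substring(dot + 1)});
        }
        return result;
    }
}
```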
[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189991#comment-17189991 ] huhaiyang edited comment on HDFS-15556 at 9/3/20, 9:24 AM: --- NameNode CPU is high; the thread stack is: was (Author: haiyang hu): NameNode CPU is high; the thread stack is: !NN-jstack.png! > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode > Affects Versions: 3.2.0 > Reporter: huhaiyang > Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png > > > In our cluster, the NameNode throws an NPE when processing lifeline messages sent by a DataNode, which causes the NN to compute an incorrect maxLoad. Because the DataNode is then identified as busy, no available nodes can be allocated when choosing a DataNode; the resulting retry loop drives CPU usage high and reduces the processing performance of the cluster.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8022, call Call#20535 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from x:34766
> java.lang.NullPointerException
>   at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
>   at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
>   at org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
>   at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
>   at org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
>   at org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
>     storage = storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
>     failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report); // NPE occurs here
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
>     continue;
>   }
> ...
> {code}
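The NPE arises because `storageMap.get` can return null when a lifeline reports a storage ID the NameNode has not registered yet, and the snippet above dereferences the result unguarded. A minimal standalone sketch of the null guard follows; the map and value types are simplified stand-ins, not the actual HDFS classes, and this is not the attached patch:

```java
import java.util.HashMap;
import java.util.Map;

public class LifelineNullGuard {
    // Simplified stand-in for DatanodeDescriptor's storageMap.
    static final Map<String, String> storageMap = new HashMap<>();

    // Returns false (and skips the report) instead of throwing an NPE when
    // the storage ID from a lifeline report is not in the map yet.
    static boolean processReport(String storageId) {
        String storage;
        synchronized (storageMap) {
            storage = storageMap.get(storageId);
        }
        if (storage == null) {
            // Unknown storage, e.g. a lifeline raced ahead of the first
            // full heartbeat: skip it rather than dereference null.
            return false;
        }
        // In the real code, storage.receivedHeartbeat(report) would run here.
        return true;
    }

    public static void main(String[] args) {
        storageMap.put("DS-1", "storageInfo-1");
        System.out.println(processReport("DS-1"));   // known storage: processed
        System.out.println(processReport("DS-404")); // unknown: safely skipped
    }
}
```

The design point is that a lifeline is advisory, so silently skipping an unregistered storage is safer than failing the whole RPC, which is what the unguarded dereference effectively does.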
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: (was: NN-jstack.png)
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: NN-jstack.png
[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189991#comment-17189991 ] huhaiyang commented on HDFS-15556: -- NameNode CPU is high; the thread stack is: !NN-jstack.png!
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: screenshot-1.png
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: (was: screenshot-1.png)
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: NN-CPU.png
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: HDFS-15556.001.patch
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: (was: NN-CPU.png)
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handling DN Lifeline

[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

huhaiyang updated HDFS-15556:
-----------------------------
    Attachment: NN-CPU.png