[jira] [Commented] (HDFS-15792) ClasscastException while loading FSImage

2021-02-02 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277780#comment-17277780
 ] 

Xiaoqiao He commented on HDFS-15792:


[~prasad-acit] the following lambda expression needs to be changed to a 
general expression for branch-2.10.
{code:java}
   @Override
   public int decrementAndGetRefCount() {
-return (refCount > 0) ? --refCount : 0;
+return value.updateAndGet(i -> i > 0 ? i - 1 : i);
   }
{code}
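A minimal sketch of an equivalent pre-Java-8 form for the branch-2.10 backport 
(assuming, as in the trunk change above, that {{value}} is a 
{{java.util.concurrent.atomic.AtomicInteger}}; the actual backport patch may 
differ):
{code:java}
   @Override
   public int decrementAndGetRefCount() {
     // CAS loop equivalent of value.updateAndGet(i -> i > 0 ? i - 1 : i),
     // usable on Java 7, where AtomicInteger.updateAndGet does not exist.
     for (;;) {
       int current = value.get();
       int next = current > 0 ? current - 1 : current;
       if (value.compareAndSet(current, next)) {
         return next;
       }
     }
   }
{code}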

> ClasscastException while loading FSImage
> 
>
> Key: HDFS-15792
> URL: https://issues.apache.org/jira/browse/HDFS-15792
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nn
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Major
> Fix For: 3.3.1, 3.4.0
>
> Attachments: HDFS-15792-branch-2.10.001.patch, HDFS-15792.001.patch, 
> HDFS-15792.002.patch, HDFS-15792.003.patch, HDFS-15792.004.patch, 
> HDFS-15792.005.patch, HDFS-15792.addendum.001.patch, 
> image-2021-01-27-12-00-34-846.png
>
>
> FSImage loading has failed with ClassCastException - 
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode.
> This is a usage issue with HashMap in concurrent scenarios.
> The same issue has been reported against Java and closed as a usage issue - 
> https://bugs.openjdk.java.net/browse/JDK-8173671
> 2020-12-28 11:36:26,127 | ERROR | main | An exception occurred when loading 
> INODE from fsiamge. | FSImageFormatProtobuf.java:442
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode
>   at java.util.HashMap$TreeNode.moveRootToFront(HashMap.java:1835)
>   at java.util.HashMap$TreeNode.treeify(HashMap.java:1951)
>   at java.util.HashMap.treeifyBin(HashMap.java:772)
>   at java.util.HashMap.putVal(HashMap.java:644)
>   at java.util.HashMap.put(HashMap.java:612)
>   at 
> org.apache.hadoop.hdfs.util.ReferenceCountMap.put(ReferenceCountMap.java:53)
>   at 
> org.apache.hadoop.hdfs.server.namenode.AclStorage.addAclFeature(AclStorage.java:391)
>   at 
> org.apache.hadoop.hdfs.server.namenode.INodeWithAdditionalFields.addAclFeature(INodeWithAdditionalFields.java:349)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectory(FSImageFormatPBINode.java:225)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINode(FSImageFormatPBINode.java:406)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.readPBINodes(FSImageFormatPBINode.java:367)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeSection(FSImageFormatPBINode.java:342)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader$2.call(FSImageFormatProtobuf.java:469)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2020-12-28 11:36:26,130 | ERROR | main | Failed to load image from 
> FSImageFile(file=/srv/BigData/namenode/current/fsimage_00198227480, 
> cpktTxId=00198227480) | FSImage.java:738
> java.io.IOException: java.lang.ClassCastException: java.util.HashMap$Node 
> cannot be cast to java.util.HashMap$TreeNode
>   at 
> org.apache.hadoop.io.MultipleIOException$Builder.add(MultipleIOException.java:68)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.runLoaderTasks(FSImageFormatProtobuf.java:444)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:360)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:263)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:227)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:971)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:955)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:820)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:733)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1113)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:730)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:648)
>   at 
> 
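As a hedged standalone reproducer (it may or may not fail on any given run or 
JVM): two threads mutating a plain {{java.util.HashMap}} can corrupt its bins 
and surface exactly this ClassCastException from {{HashMap$TreeNode}}. Per the 
JDK bug linked above this is a usage issue, and the usual remedies are 
external synchronization or {{ConcurrentHashMap}}:
{code:java}
import java.util.HashMap;
import java.util.Map;

public class HashMapRaceDemo {
  public static void main(String[] args) throws InterruptedException {
    final Map<String, String> map = new HashMap<>();
    Runnable writer = () -> {
      for (int i = 0; i < 1_000_000; i++) {
        String k = Integer.toString(i);
        map.put(k, k); // unsynchronized concurrent put: undefined behavior
      }
    };
    Thread t1 = new Thread(writer);
    Thread t2 = new Thread(writer);
    t1.start();
    t2.start();
    t1.join();
    t2.join();
    // Typical symptoms: a wrong size, a hung treeify, or the
    // HashMap$TreeNode ClassCastException from the stack trace above.
    System.out.println("size=" + map.size());
  }
}
{code}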

[jira] [Commented] (HDFS-15815) if required storageType are unavailable log the failed reason during choosing Datanode

2021-02-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277743#comment-17277743
 ] 

Hadoop QA commented on HDFS-15815:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
45s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green}  0m  0s{color} 
| {color:green}test4tests{color} | {color:green} The patch appears to include 1 
new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
50s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
21s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
15s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 1s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
19s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 13s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
54s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
30s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  3m  
5s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs config; 
considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m  
3s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
12s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
13s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
13s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
8s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
8s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 53s{color} | 
{color:orange}https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/453/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt{color}
 | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 
70 unchanged - 0 fixed = 71 total (was 70) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
14s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 36s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
48s{color} | {color:green}{color} | 

[jira] [Work logged] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15624?focusedWorklogId=546884=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546884
 ]

ASF GitHub Bot logged work on HDFS-15624:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 06:47
Start Date: 03/Feb/21 06:47
Worklog Time Spent: 10m 
  Work Description: liuml07 commented on pull request #2377:
URL: https://github.com/apache/hadoop/pull/2377#issuecomment-772277416


   Merged and resolved the JIRA. Thank you all!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546884)
Time Spent: 9h 40m  (was: 9.5h)

>  Fix the SetQuotaByStorageTypeOp problem after updating hadoop 
> ---
>
> Key: HDFS-15624
> URL: https://issues.apache.org/jira/browse/HDFS-15624
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: YaYun Wang
>Assignee: huangtianhua
>Priority: Major
>  Labels: pull-request-available, release-blocker
> Fix For: 3.4.0
>
>  Time Spent: 9h 40m
>  Remaining Estimate: 0h
>
> HDFS-15025 adds a new storage type, NVDIMM, which changes the ordinal() of 
> the StorageType enum. Setting a quota by storage type depends on the 
> ordinal(), so the quota setting may become invalid after an upgrade.
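A minimal illustration of the hazard (the enums here are hypothetical, not the 
real StorageType declaration): an integer ordinal persisted before the upgrade 
resolves to a different constant once a value is inserted mid-enum.
{code:java}
// Hypothetical "before" and "after" enums for illustration only.
enum OldStorageType { RAM_DISK, SSD, DISK, ARCHIVE }
enum NewStorageType { RAM_DISK, NVDIMM, SSD, DISK, ARCHIVE }

public class OrdinalDemo {
  public static void main(String[] args) {
    // An old NameNode persisted a quota op using SSD's ordinal:
    int persisted = OldStorageType.SSD.ordinal(); // 1
    // Replayed after the upgrade, the same integer means something else:
    System.out.println(NewStorageType.values()[persisted]); // NVDIMM, not SSD
  }
}
{code}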



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop

2021-02-02 Thread Mingliang Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingliang Liu resolved HDFS-15624.
--
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed

Committed to the trunk branch. Thank you [~huangtianhua] and su xu for your 
contribution. Thank you [~ayushtkn] and [~vinayakumarb] for your helpful 
reviews.

>  Fix the SetQuotaByStorageTypeOp problem after updating hadoop 
> ---
>
> Key: HDFS-15624
> URL: https://issues.apache.org/jira/browse/HDFS-15624
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: YaYun Wang
>Assignee: huangtianhua
>Priority: Major
>  Labels: pull-request-available, release-blocker
> Fix For: 3.4.0
>
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> HDFS-15025 adds a new storage type, NVDIMM, which changes the ordinal() of 
> the StorageType enum. Setting a quota by storage type depends on the 
> ordinal(), so the quota setting may become invalid after an upgrade.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15624?focusedWorklogId=546882=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546882
 ]

ASF GitHub Bot logged work on HDFS-15624:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 06:45
Start Date: 03/Feb/21 06:45
Worklog Time Spent: 10m 
  Work Description: huangtianhua commented on pull request #2377:
URL: https://github.com/apache/hadoop/pull/2377#issuecomment-772276331


   @ayushtkn would you please approve this? Thanks very much.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546882)
Time Spent: 9h 20m  (was: 9h 10m)

>  Fix the SetQuotaByStorageTypeOp problem after updating hadoop 
> ---
>
> Key: HDFS-15624
> URL: https://issues.apache.org/jira/browse/HDFS-15624
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: YaYun Wang
>Assignee: YaYun Wang
>Priority: Major
>  Labels: pull-request-available, release-blocker
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> HDFS-15025 adds a new storage type, NVDIMM, which changes the ordinal() of 
> the StorageType enum. Setting a quota by storage type depends on the 
> ordinal(), so the quota setting may become invalid after an upgrade.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop

2021-02-02 Thread Mingliang Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingliang Liu reassigned HDFS-15624:


Assignee: YaYun Wang

>  Fix the SetQuotaByStorageTypeOp problem after updating hadoop 
> ---
>
> Key: HDFS-15624
> URL: https://issues.apache.org/jira/browse/HDFS-15624
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: YaYun Wang
>Assignee: YaYun Wang
>Priority: Major
>  Labels: pull-request-available, release-blocker
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> HDFS-15025 adds a new storage type, NVDIMM, which changes the ordinal() of 
> the StorageType enum. Setting a quota by storage type depends on the 
> ordinal(), so the quota setting may become invalid after an upgrade.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15624?focusedWorklogId=546883=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546883
 ]

ASF GitHub Bot logged work on HDFS-15624:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 06:45
Start Date: 03/Feb/21 06:45
Worklog Time Spent: 10m 
  Work Description: huangtianhua removed a comment on pull request #2377:
URL: https://github.com/apache/hadoop/pull/2377#issuecomment-772276331


   @ayushtkn would you please approve this? Thanks very much.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546883)
Time Spent: 9.5h  (was: 9h 20m)

>  Fix the SetQuotaByStorageTypeOp problem after updating hadoop 
> ---
>
> Key: HDFS-15624
> URL: https://issues.apache.org/jira/browse/HDFS-15624
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: YaYun Wang
>Assignee: huangtianhua
>Priority: Major
>  Labels: pull-request-available, release-blocker
>  Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> HDFS-15025 adds a new storage type, NVDIMM, which changes the ordinal() of 
> the StorageType enum. Setting a quota by storage type depends on the 
> ordinal(), so the quota setting may become invalid after an upgrade.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop

2021-02-02 Thread Mingliang Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mingliang Liu reassigned HDFS-15624:


Assignee: huangtianhua  (was: YaYun Wang)

>  Fix the SetQuotaByStorageTypeOp problem after updating hadoop 
> ---
>
> Key: HDFS-15624
> URL: https://issues.apache.org/jira/browse/HDFS-15624
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: YaYun Wang
>Assignee: huangtianhua
>Priority: Major
>  Labels: pull-request-available, release-blocker
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> HDFS-15025 adds a new storage type, NVDIMM, which changes the ordinal() of 
> the StorageType enum. Setting a quota by storage type depends on the 
> ordinal(), so the quota setting may become invalid after an upgrade.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15624?focusedWorklogId=546881=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546881
 ]

ASF GitHub Bot logged work on HDFS-15624:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 06:44
Start Date: 03/Feb/21 06:44
Worklog Time Spent: 10m 
  Work Description: liuml07 merged pull request #2377:
URL: https://github.com/apache/hadoop/pull/2377


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546881)
Time Spent: 9h 10m  (was: 9h)

>  Fix the SetQuotaByStorageTypeOp problem after updating hadoop 
> ---
>
> Key: HDFS-15624
> URL: https://issues.apache.org/jira/browse/HDFS-15624
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: YaYun Wang
>Priority: Major
>  Labels: pull-request-available, release-blocker
>  Time Spent: 9h 10m
>  Remaining Estimate: 0h
>
> HDFS-15025 adds a new storage type, NVDIMM, which changes the ordinal() of 
> the StorageType enum. Setting a quota by storage type depends on the 
> ordinal(), so the quota setting may become invalid after an upgrade.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15812) after deleting data of hbase table hdfs size is not decreasing

2021-02-02 Thread Satya Gaurav (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277725#comment-17277725
 ] 

Satya Gaurav commented on HDFS-15812:
-

[~surendralilhore] I have sent an email to u...@hadoop.apache.org

> after deleting data of hbase table hdfs size is not decreasing
> --
>
> Key: HDFS-15812
> URL: https://issues.apache.org/jira/browse/HDFS-15812
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.0.2-alpha
> Environment: HDP 3.1.4.0-315
> Hbase 2.0.2.3.1.4.0-315
>Reporter: Satya Gaurav
>Priority: Major
>
> I am deleting data from an HBase table; it is deleted from the HBase table, 
> but the size of the HDFS directory is not reducing. I even ran a major 
> compaction, but after that the HDFS size still did not reduce. Any solution 
> for this issue?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15624) Fix the SetQuotaByStorageTypeOp problem after updating hadoop

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15624?focusedWorklogId=546874=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546874
 ]

ASF GitHub Bot logged work on HDFS-15624:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 06:12
Start Date: 03/Feb/21 06:12
Worklog Time Spent: 10m 
  Work Description: huangtianhua commented on pull request #2377:
URL: https://github.com/apache/hadoop/pull/2377#issuecomment-772261120


   @liuml07 could you approve this? Thanks very much.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546874)
Time Spent: 9h  (was: 8h 50m)

>  Fix the SetQuotaByStorageTypeOp problem after updating hadoop 
> ---
>
> Key: HDFS-15624
> URL: https://issues.apache.org/jira/browse/HDFS-15624
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: YaYun Wang
>Priority: Major
>  Labels: pull-request-available, release-blocker
>  Time Spent: 9h
>  Remaining Estimate: 0h
>
> HDFS-15025 adds a new storage type, NVDIMM, which changes the ordinal() of 
> the StorageType enum. Setting a quota by storage type depends on the 
> ordinal(), so the quota setting may become invalid after an upgrade.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15812) after deleting data of hbase table hdfs size is not decreasing

2021-02-02 Thread Satya Gaurav (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277710#comment-17277710
 ] 

Satya Gaurav commented on HDFS-15812:
-

[~surendralilhore] it is not moving into the trash; even after 2 days the size 
is the same.

> after deleting data of hbase table hdfs size is not decreasing
> --
>
> Key: HDFS-15812
> URL: https://issues.apache.org/jira/browse/HDFS-15812
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.0.2-alpha
> Environment: HDP 3.1.4.0-315
> Hbase 2.0.2.3.1.4.0-315
>Reporter: Satya Gaurav
>Priority: Major
>
> I am deleting data from an HBase table; it is deleted from the HBase table, 
> but the size of the HDFS directory is not reducing. I even ran a major 
> compaction, but after that the HDFS size still did not reduce. Any solution 
> for this issue?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15812) after deleting data of hbase table hdfs size is not decreasing

2021-02-02 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277683#comment-17277683
 ] 

Surendra Singh Lilhore commented on HDFS-15812:
---

please send your query to 
[u...@hadoop.apache.org|mailto:u...@hadoop.apache.org].

> after deleting data of hbase table hdfs size is not decreasing
> --
>
> Key: HDFS-15812
> URL: https://issues.apache.org/jira/browse/HDFS-15812
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.0.2-alpha
> Environment: HDP 3.1.4.0-315
> Hbase 2.0.2.3.1.4.0-315
>Reporter: Satya Gaurav
>Priority: Major
>
> I am deleting data from an HBase table; it is deleted from the HBase table, 
> but the size of the HDFS directory is not reducing. I even ran a major 
> compaction, but after that the HDFS size still did not reduce. Any solution 
> for this issue?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15812) after deleting data of hbase table hdfs size is not decreasing

2021-02-02 Thread Surendra Singh Lilhore (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277682#comment-17277682
 ] 

Surendra Singh Lilhore commented on HDFS-15812:
---

[~satycse06], it will take time to delete the data from HDFS if it is moved to 
the trash.
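A hedged sketch of the mechanics (the path below is hypothetical): a 
trash-enabled delete only *moves* the data, and the blocks are reclaimed only 
after the emptier purges the checkpoint, roughly {{fs.trash.interval}} minutes 
later.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/hbase/archive/some-table"); // hypothetical path
    // Moves p under the owner's .Trash; space is NOT freed yet.
    if (!Trash.moveToAppropriateTrash(fs, p, conf)) {
      fs.delete(p, true); // trash disabled: delete immediately, freeing space
    }
  }
}
{code}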

> after deleting data of hbase table hdfs size is not decreasing
> --
>
> Key: HDFS-15812
> URL: https://issues.apache.org/jira/browse/HDFS-15812
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.0.2-alpha
> Environment: HDP 3.1.4.0-315
> Hbase 2.0.2.3.1.4.0-315
>Reporter: Satya Gaurav
>Priority: Major
>
> I am deleting data from an HBase table; it is deleted from the HBase table, 
> but the size of the HDFS directory is not reducing. I even ran a major 
> compaction, but after that the HDFS size still did not reduce. Any solution 
> for this issue?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-13609) [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via RPC

2021-02-02 Thread xuzq (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277680#comment-17277680
 ] 

xuzq edited comment on HDFS-13609 at 2/3/21, 4:59 AM:
--

Give one example to illustrate what I think.

We have 5 journals, jn1 ~ jn5.

And the Active writes edits like:
|Txid|SuccessWriteJournalId|FailedJournalId|
|TxId1|jn2, jn3, jn4, jn5|jn1 (written into the cache, but the disk write failed)|
|TxId2|jn2, jn3, jn4, jn5| |
|TxId3|jn2, jn3, jn4, jn5| |
|TxId4|jn2, jn3, jn4, jn5| |
|TxId5|jn2, jn3, jn4, jn5| |

When we attempt to fail over the standby to active, the standby needs to catch 
up on all edits from TxId1 ~ TxId5, and then change to active.

But before the failover, jn4 and jn5 are delayed, so the responseCounts look 
like (0 (jn1), 5 (jn2), 5 (jn3)) during 
_editLogTailer.catchupDuringFailover()._

The Standby NameNode expects to get all edits from TxId1 ~ TxId5, but it only 
gets TxId1; TxId2 ~ TxId5 are not applied to the fsImage.

And that will cause the Standby NameNode to crash in 
_getFSImage().editLog.openForWrite()._

I think we should use responseCounts(2) ~ responseCounts(4) to ensure we can 
catch up on all edits.

But the last edit in responseCounts(2) ~ responseCounts(4) may still be being 
written by the active, i.e. it may not yet be on a quorum of JNs.

That would cause the Observer NameNode or Standby NameNode to tail un-quorum 
edits.

Or maybe we could write to disk first, then write to the cache in the 
JournalNode.

[~xkrogen] On this question, if you have some good ideas, please tell me, 
thanks.


was (Author: xuzq_zander):
Give one example to illustrate what I think.

We have 5 journals, jn1 ~ jn5.

And the Active writes edits like:
|Txid|SuccessWriteJournalId|FailedJournalId|
|TxId1|jn2, jn3, jn4, jn5|jn1 (written into the cache, but the disk write failed)|
|TxId2|jn2, jn3, jn4, jn5| |
|TxId3|jn2, jn3, jn4, jn5| |
|TxId4|jn2, jn3, jn4, jn5| |
|TxId5|jn2, jn3, jn4, jn5| |

When we attempt to fail over the standby to active, the standby needs to catch 
up on all edits from TxId1 ~ TxId5, and then change to active.

But before the failover, jn4 and jn5 are delayed, so the responseCounts look 
like (0 (jn1), 5 (jn2), 5 (jn3)) during 
_editLogTailer.catchupDuringFailover()._

The Standby NameNode expects to get all edits from TxId1 ~ TxId5, but it only 
gets TxId1; TxId2 ~ TxId5 are not applied to the fsImage.

And that will cause the Standby NameNode to crash in 
_getFSImage().editLog.openForWrite()._

I think we should use responseCounts(2) ~ responseCounts(4) to ensure we can 
catch up on all edits.

But the last edit in responseCounts(2) ~ responseCounts(4) may still be being 
written by the active, i.e. it may not yet be on a quorum of JNs.

That would cause the Observer NameNode or Standby NameNode to tail un-quorum 
edits.

[~xkrogen] On this question, if you have some good ideas, please tell me, 
thanks.

> [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via 
> RPC
> -
>
> Key: HDFS-13609
> URL: https://issues.apache.org/jira/browse/HDFS-13609
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, namenode
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: HDFS-12943, 3.3.0
>
> Attachments: HDFS-13609-HDFS-12943.000.patch, 
> HDFS-13609-HDFS-12943.001.patch, HDFS-13609-HDFS-12943.002.patch, 
> HDFS-13609-HDFS-12943.003.patch, HDFS-13609-HDFS-12943.004.patch
>
>
> See HDFS-13150 for the full design.
> This JIRA is targeted at the NameNode-side changes to enable tailing 
> in-progress edits via the RPC mechanism added in HDFS-13608. Most changes are 
> in the QuorumJournalManager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13609) [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via RPC

2021-02-02 Thread xuzq (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277680#comment-17277680
 ] 

xuzq commented on HDFS-13609:
-

Give one example to illustrate what I think.

We have 5 journals, jn1 ~ jn5.

And the Active writes edits like:
|Txid|SuccessWriteJournalId|FailedJournalId|
|TxId1|jn2, jn3, jn4, jn5|jn1 (written into the cache, but the disk write failed)|
|TxId2|jn2, jn3, jn4, jn5| |
|TxId3|jn2, jn3, jn4, jn5| |
|TxId4|jn2, jn3, jn4, jn5| |
|TxId5|jn2, jn3, jn4, jn5| |

When we attempt to fail over the standby to active, the standby needs to catch 
up on all edits from TxId1 ~ TxId5, and then change to active.

But before the failover, jn4 and jn5 are delayed, so the responseCounts look 
like (0 (jn1), 5 (jn2), 5 (jn3)) during 
_editLogTailer.catchupDuringFailover()._

The Standby NameNode expects to get all edits from TxId1 ~ TxId5, but it only 
gets TxId1; TxId2 ~ TxId5 are not applied to the fsImage.

And that will cause the Standby NameNode to crash in 
_getFSImage().editLog.openForWrite()._

I think we should use responseCounts(2) ~ responseCounts(4) to ensure we can 
catch up on all edits.

But the last edit in responseCounts(2) ~ responseCounts(4) may still be being 
written by the active, i.e. it may not yet be on a quorum of JNs.

That would cause the Observer NameNode or Standby NameNode to tail un-quorum 
edits.

[~xkrogen] On this question, if you have some good ideas, please tell me, 
thanks.
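A hedged sketch of the durability rule in play (illustrative only, not the 
actual QuorumJournalManager code): with k responses from an n-JN ensemble 
whose majority size is m, a txn is provably on a quorum only when at least m 
of the k responders report it, i.e. the sorted response at index k - m.
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class DurableTxnPick {
  static int provablyDurable(List<Integer> responses, int majority) {
    List<Integer> sorted = new ArrayList<>(responses);
    Collections.sort(sorted);
    // Highest txn count reported by at least `majority` responders.
    return sorted.get(sorted.size() - majority);
  }

  public static void main(String[] args) {
    // The scenario above: jn1 cached but failed its disk write, jn4/jn5 lag,
    // so the tailer sees only (0, 5, 5) from a 5-JN ensemble (majority 3):
    System.out.println(provablyDurable(Arrays.asList(0, 5, 5), 3)); // 0
    // Had a fourth JN answered, txns 1..5 would be provably on a quorum:
    System.out.println(provablyDurable(Arrays.asList(0, 5, 5, 5), 3)); // 5
  }
}
{code}
With exactly a bare majority of responses, index k - m is 0, i.e. the minimum, 
which is why a single broken first responder can pin the durable txid at 0.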

 

 

 

> [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via 
> RPC
> -
>
> Key: HDFS-13609
> URL: https://issues.apache.org/jira/browse/HDFS-13609
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, namenode
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: HDFS-12943, 3.3.0
>
> Attachments: HDFS-13609-HDFS-12943.000.patch, 
> HDFS-13609-HDFS-12943.001.patch, HDFS-13609-HDFS-12943.002.patch, 
> HDFS-13609-HDFS-12943.003.patch, HDFS-13609-HDFS-12943.004.patch
>
>
> See HDFS-13150 for the full design.
> This JIRA is targeted at the NameNode-side changes to enable tailing 
> in-progress edits via the RPC mechanism added in HDFS-13608. Most changes are 
> in the QuorumJournalManager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15792) ClasscastException while loading FSImage

2021-02-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277670#comment-17277670
 ] 

Hadoop QA commented on HDFS-15792:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red}  0m 
59s{color} | {color:red}{color} | {color:red} Docker failed to build 
yetus/hadoop:7257b17793d. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-15792 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13019864/HDFS-15792-branch-2.10.001.patch
 |
| Console output | 
https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/455/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org |


This message was automatically generated.



> ClasscastException while loading FSImage
> 
>
> Key: HDFS-15792
> URL: https://issues.apache.org/jira/browse/HDFS-15792
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nn
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Major
> Fix For: 3.3.1, 3.4.0
>
> Attachments: HDFS-15792-branch-2.10.001.patch, HDFS-15792.001.patch, 
> HDFS-15792.002.patch, HDFS-15792.003.patch, HDFS-15792.004.patch, 
> HDFS-15792.005.patch, HDFS-15792.addendum.001.patch, 
> image-2021-01-27-12-00-34-846.png
>
>
> FSImage loading has failed with ClassCastException - 
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode.
> This is a usage issue with HashMap in concurrent scenarios.
> The same issue has been reported against Java and closed as a usage issue - 
> https://bugs.openjdk.java.net/browse/JDK-8173671
> 2020-12-28 11:36:26,127 | ERROR | main | An exception occurred when loading 
> INODE from fsiamge. | FSImageFormatProtobuf.java:442
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode
>   at java.util.HashMap$TreeNode.moveRootToFront(HashMap.java:1835)
>   at java.util.HashMap$TreeNode.treeify(HashMap.java:1951)
>   at java.util.HashMap.treeifyBin(HashMap.java:772)
>   at java.util.HashMap.putVal(HashMap.java:644)
>   at java.util.HashMap.put(HashMap.java:612)
>   at 
> org.apache.hadoop.hdfs.util.ReferenceCountMap.put(ReferenceCountMap.java:53)
>   at 
> org.apache.hadoop.hdfs.server.namenode.AclStorage.addAclFeature(AclStorage.java:391)
>   at 
> org.apache.hadoop.hdfs.server.namenode.INodeWithAdditionalFields.addAclFeature(INodeWithAdditionalFields.java:349)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectory(FSImageFormatPBINode.java:225)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINode(FSImageFormatPBINode.java:406)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.readPBINodes(FSImageFormatPBINode.java:367)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeSection(FSImageFormatPBINode.java:342)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader$2.call(FSImageFormatProtobuf.java:469)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2020-12-28 11:36:26,130 | ERROR | main | Failed to load image from 
> FSImageFile(file=/srv/BigData/namenode/current/fsimage_00198227480, 
> cpktTxId=00198227480) | FSImage.java:738
> java.io.IOException: java.lang.ClassCastException: java.util.HashMap$Node 
> cannot be cast to java.util.HashMap$TreeNode
>   at 
> org.apache.hadoop.io.MultipleIOException$Builder.add(MultipleIOException.java:68)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.runLoaderTasks(FSImageFormatProtobuf.java:444)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:360)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:263)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:227)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:971)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:955)
>   at 
> 

[jira] [Commented] (HDFS-15792) ClasscastException while loading FSImage

2021-02-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277654#comment-17277654
 ] 

Hadoop QA commented on HDFS-15792:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} docker {color} | {color:red}  9m 
58s{color} | {color:red}{color} | {color:red} Docker failed to build 
yetus/hadoop:7257b17793d. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-15792 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13019864/HDFS-15792-branch-2.10.001.patch
 |
| Console output | 
https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/454/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org |


This message was automatically generated.



> ClasscastException while loading FSImage
> 
>
> Key: HDFS-15792
> URL: https://issues.apache.org/jira/browse/HDFS-15792
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nn
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Major
> Fix For: 3.3.1, 3.4.0
>
> Attachments: HDFS-15792-branch-2.10.001.patch, HDFS-15792.001.patch, 
> HDFS-15792.002.patch, HDFS-15792.003.patch, HDFS-15792.004.patch, 
> HDFS-15792.005.patch, HDFS-15792.addendum.001.patch, 
> image-2021-01-27-12-00-34-846.png
>
>
> FSImage loading has failed with ClassCastException - 
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode.
> This is a usage issue with HashMap in concurrent scenarios.
> The same issue has been reported against Java and closed as a usage issue - 
> https://bugs.openjdk.java.net/browse/JDK-8173671
> 2020-12-28 11:36:26,127 | ERROR | main | An exception occurred when loading 
> INODE from fsiamge. | FSImageFormatProtobuf.java:442
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode
>   at java.util.HashMap$TreeNode.moveRootToFront(HashMap.java:1835)
>   at java.util.HashMap$TreeNode.treeify(HashMap.java:1951)
>   at java.util.HashMap.treeifyBin(HashMap.java:772)
>   at java.util.HashMap.putVal(HashMap.java:644)
>   at java.util.HashMap.put(HashMap.java:612)
>   at 
> org.apache.hadoop.hdfs.util.ReferenceCountMap.put(ReferenceCountMap.java:53)
>   at 
> org.apache.hadoop.hdfs.server.namenode.AclStorage.addAclFeature(AclStorage.java:391)
>   at 
> org.apache.hadoop.hdfs.server.namenode.INodeWithAdditionalFields.addAclFeature(INodeWithAdditionalFields.java:349)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectory(FSImageFormatPBINode.java:225)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINode(FSImageFormatPBINode.java:406)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.readPBINodes(FSImageFormatPBINode.java:367)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeSection(FSImageFormatPBINode.java:342)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader$2.call(FSImageFormatProtobuf.java:469)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2020-12-28 11:36:26,130 | ERROR | main | Failed to load image from 
> FSImageFile(file=/srv/BigData/namenode/current/fsimage_00198227480, 
> cpktTxId=00198227480) | FSImage.java:738
> java.io.IOException: java.lang.ClassCastException: java.util.HashMap$Node 
> cannot be cast to java.util.HashMap$TreeNode
>   at 
> org.apache.hadoop.io.MultipleIOException$Builder.add(MultipleIOException.java:68)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.runLoaderTasks(FSImageFormatProtobuf.java:444)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:360)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:263)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:227)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:971)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:955)
>   at 
> 

[jira] [Updated] (HDFS-15792) ClasscastException while loading FSImage

2021-02-02 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He updated HDFS-15792:
---
Status: Patch Available  (was: Reopened)

> ClasscastException while loading FSImage
> 
>
> Key: HDFS-15792
> URL: https://issues.apache.org/jira/browse/HDFS-15792
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nn
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Major
> Fix For: 3.3.1, 3.4.0
>
> Attachments: HDFS-15792-branch-2.10.001.patch, HDFS-15792.001.patch, 
> HDFS-15792.002.patch, HDFS-15792.003.patch, HDFS-15792.004.patch, 
> HDFS-15792.005.patch, HDFS-15792.addendum.001.patch, 
> image-2021-01-27-12-00-34-846.png
>
>
> FSImage loading has failed with ClassCastException - 
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode.
> This is a usage issue with HashMap in concurrent scenarios.
> The same issue has been reported against Java and closed as a usage issue - 
> https://bugs.openjdk.java.net/browse/JDK-8173671
> 2020-12-28 11:36:26,127 | ERROR | main | An exception occurred when loading 
> INODE from fsiamge. | FSImageFormatProtobuf.java:442
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode
>   at java.util.HashMap$TreeNode.moveRootToFront(HashMap.java:1835)
>   at java.util.HashMap$TreeNode.treeify(HashMap.java:1951)
>   at java.util.HashMap.treeifyBin(HashMap.java:772)
>   at java.util.HashMap.putVal(HashMap.java:644)
>   at java.util.HashMap.put(HashMap.java:612)
>   at 
> org.apache.hadoop.hdfs.util.ReferenceCountMap.put(ReferenceCountMap.java:53)
>   at 
> org.apache.hadoop.hdfs.server.namenode.AclStorage.addAclFeature(AclStorage.java:391)
>   at 
> org.apache.hadoop.hdfs.server.namenode.INodeWithAdditionalFields.addAclFeature(INodeWithAdditionalFields.java:349)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectory(FSImageFormatPBINode.java:225)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINode(FSImageFormatPBINode.java:406)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.readPBINodes(FSImageFormatPBINode.java:367)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeSection(FSImageFormatPBINode.java:342)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader$2.call(FSImageFormatProtobuf.java:469)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2020-12-28 11:36:26,130 | ERROR | main | Failed to load image from 
> FSImageFile(file=/srv/BigData/namenode/current/fsimage_00198227480, 
> cpktTxId=00198227480) | FSImage.java:738
> java.io.IOException: java.lang.ClassCastException: java.util.HashMap$Node 
> cannot be cast to java.util.HashMap$TreeNode
>   at 
> org.apache.hadoop.io.MultipleIOException$Builder.add(MultipleIOException.java:68)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.runLoaderTasks(FSImageFormatProtobuf.java:444)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:360)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:263)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:227)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:971)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:955)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:820)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:733)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1113)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:730)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:648)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:710)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:953)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:926)
>   at 
> 

[jira] [Reopened] (HDFS-15792) ClasscastException while loading FSImage

2021-02-02 Thread Xiaoqiao He (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaoqiao He reopened HDFS-15792:


> ClasscastException while loading FSImage
> 
>
> Key: HDFS-15792
> URL: https://issues.apache.org/jira/browse/HDFS-15792
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nn
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Major
> Fix For: 3.3.1, 3.4.0
>
> Attachments: HDFS-15792-branch-2.10.001.patch, HDFS-15792.001.patch, 
> HDFS-15792.002.patch, HDFS-15792.003.patch, HDFS-15792.004.patch, 
> HDFS-15792.005.patch, HDFS-15792.addendum.001.patch, 
> image-2021-01-27-12-00-34-846.png
>
>
> FSImage loading has failed with ClassCastException - 
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode.
> This is a usage issue with HashMap in concurrent scenarios.
> The same issue has been reported against Java and closed as a usage issue - 
> https://bugs.openjdk.java.net/browse/JDK-8173671
> 2020-12-28 11:36:26,127 | ERROR | main | An exception occurred when loading 
> INODE from fsiamge. | FSImageFormatProtobuf.java:442
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode
>   at java.util.HashMap$TreeNode.moveRootToFront(HashMap.java:1835)
>   at java.util.HashMap$TreeNode.treeify(HashMap.java:1951)
>   at java.util.HashMap.treeifyBin(HashMap.java:772)
>   at java.util.HashMap.putVal(HashMap.java:644)
>   at java.util.HashMap.put(HashMap.java:612)
>   at 
> org.apache.hadoop.hdfs.util.ReferenceCountMap.put(ReferenceCountMap.java:53)
>   at 
> org.apache.hadoop.hdfs.server.namenode.AclStorage.addAclFeature(AclStorage.java:391)
>   at 
> org.apache.hadoop.hdfs.server.namenode.INodeWithAdditionalFields.addAclFeature(INodeWithAdditionalFields.java:349)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectory(FSImageFormatPBINode.java:225)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINode(FSImageFormatPBINode.java:406)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.readPBINodes(FSImageFormatPBINode.java:367)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeSection(FSImageFormatPBINode.java:342)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader$2.call(FSImageFormatProtobuf.java:469)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2020-12-28 11:36:26,130 | ERROR | main | Failed to load image from 
> FSImageFile(file=/srv/BigData/namenode/current/fsimage_00198227480, 
> cpktTxId=00198227480) | FSImage.java:738
> java.io.IOException: java.lang.ClassCastException: java.util.HashMap$Node 
> cannot be cast to java.util.HashMap$TreeNode
>   at 
> org.apache.hadoop.io.MultipleIOException$Builder.add(MultipleIOException.java:68)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.runLoaderTasks(FSImageFormatProtobuf.java:444)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:360)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:263)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:227)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:971)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:955)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:820)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:733)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1113)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:730)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:648)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:710)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:953)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:926)
>   at 
> 

[jira] [Updated] (HDFS-15815) if required storageType are unavailable log the failed reason during choosing Datanode

2021-02-02 Thread Yang Yun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Yun updated HDFS-15815:

Attachment: HDFS-15815.001.patch
Status: Patch Available  (was: Open)

>  if required storageType are unavailable log the failed reason during 
> choosing Datanode
> ---
>
> Key: HDFS-15815
> URL: https://issues.apache.org/jira/browse/HDFS-15815
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: block placement
>Reporter: Yang Yun
>Assignee: Yang Yun
>Priority: Minor
> Attachments: HDFS-15815.001.patch
>
>
> For better debugging, if the required storageType is unavailable, log the 
> failure reason "NO_REQUIRED_STORAGE_TYPE" when choosing a Datanode.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15815) if required storageType are unavailable log the failed reason during choosing Datanode

2021-02-02 Thread Yang Yun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Yun updated HDFS-15815:

Summary:  if required storageType are unavailable log the failed reason 
during choosing Datanode  (was:  if required storageType are unavailable Log 
the failed reason when choosing Datanode)

>  if required storageType are unavailable log the failed reason during 
> choosing Datanode
> ---
>
> Key: HDFS-15815
> URL: https://issues.apache.org/jira/browse/HDFS-15815
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: block placement
>Reporter: Yang Yun
>Assignee: Yang Yun
>Priority: Minor
>
> For better debugging, if the required storageType is unavailable, log the 
> failure reason "NO_REQUIRED_STORAGE_TYPE" when choosing a Datanode.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15815) if required storageType are unavailable Log the failed reason when choosing Datanode

2021-02-02 Thread Yang Yun (Jira)
Yang Yun created HDFS-15815:
---

 Summary:  if required storageType are unavailable Log the failed 
reason when choosing Datanode
 Key: HDFS-15815
 URL: https://issues.apache.org/jira/browse/HDFS-15815
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: block placement
Reporter: Yang Yun
Assignee: Yang Yun


For better debugging, if the required storageType is unavailable, log the 
failure reason "NO_REQUIRED_STORAGE_TYPE" when choosing a Datanode.

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-13609) [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via RPC

2021-02-02 Thread xuzq (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277640#comment-17277640
 ] 

xuzq edited comment on HDFS-13609 at 2/3/21, 2:36 AM:
--

Thanks [~xkrogen]. It is when onlyDurableTxns is true that we get 
responseCounts.get(0).

In our production environment, one NameNode crashed when we failed it over to 
active, and we caught an exception like:

 
{code:java}
2021-02-01 20:38:23,402 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode 
IPC Server handler 227 on 8022: Error encountered requiring N
N shutdown. Shutting down immediately.
java.lang.IllegalStateException: Cannot start writing at txid 58504771317 when 
there is a stream available for read: org.apache.hadoop.hdfs
.server.namenode.RedundantEditLogInputStream@57d3ac44
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:324)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1417)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1969)
at 
org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
at 
org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:58)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1826)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1658)
at 
org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslator
PB.java:111)
at 
org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:54
09)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:620)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1125)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3246)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3242)
at 
java.base/java.security.AccessController.doPrivileged(AccessController.java:689)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3240)
{code}
 

After looking at code, I think is _editLogTailer.catchupDuringFailover()_ can't 
catchup all edits, caused check failed when 
_getFSImage().editLog.openForWrite()_.

And one journal is wrong when write edit into disk after write it into cache 
successfully.

And as _onlyDurableTxns_ is true, then we get _responseCounts.get(0),_ and the 
wrong journal's response is _responseCounts.get(0),_ so caused 
_editLogTailer.catchupDuringFailover()_ can't catchup all edits.

And the response like  {{(0, 1000, 1000).}}

 
{quote}Thus since we only got 3 responses, we have to take the lowest txn that 
any of those responses are aware of.
{quote}
It maybe causes we can't tail any edits when the first response Journal is 
wrong.
 * It maybe caused _editLogTailer.catchupDuringFailover()_ can't catchup all 
edits, and cause NN crash when failover it to active.
 * It maybe caused Observer NameNode can't supported read rpc.
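
For readers following the thread, a minimal, self-contained illustration of 
why {{onlyDurableTxns}} turns responses like {{(0, 1000, 1000)}} into zero 
tailable transactions (an illustration of the logic under discussion, not the 
actual QuorumJournalManager code):

{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class DurableTxnSketch {
  public static void main(String[] args) {
    // Highest txn counts reported by the three responding journals; the
    // first journal is the broken one that persisted nothing.
    List<Integer> responseCounts = new ArrayList<>(Arrays.asList(1000, 0, 1000));
    Collections.sort(responseCounts); // (0, 1000, 1000)
    boolean onlyDurableTxns = true;
    int maxAllowedTxns = onlyDurableTxns
        ? responseCounts.get(0)                           // durable on ALL -> 0
        : responseCounts.get(responseCounts.size() - 1);  // visible on ANY
    System.out.println("maxAllowedTxns = " + maxAllowedTxns); // prints 0
  }
}
{code}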


was (Author: xuzq_zander):
Thanks [~xkrogen], It is when onlyDurableTxns is true that we get 
responseCounts.get(0).

In our production environment, one nameNode is crashed when we failover it to 
active, and cached one exception like:

 
{code:java}
2021-02-01 20:38:23,402 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode 
IPC Server handler 227 on 8022: Error encountered requiring N
N shutdown. Shutting down immediately.
java.lang.IllegalStateException: Cannot start writing at txid 58504771317 when 
there is a stream available for read: org.apache.hadoop.hdfs
.server.namenode.RedundantEditLogInputStream@57d3ac44
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:324)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1417)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1969)
at 
org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
at 
org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:58)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1826)
at 

[jira] [Comment Edited] (HDFS-13609) [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via RPC

2021-02-02 Thread xuzq (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277640#comment-17277640
 ] 

xuzq edited comment on HDFS-13609 at 2/3/21, 2:35 AM:
--

Thanks [~xkrogen]. It is when onlyDurableTxns is true that we get 
responseCounts.get(0).

In our production environment, one NameNode crashed when we failed it over to 
active, and we caught an exception like:

{code:java}
2021-02-01 20:38:23,402 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode IPC Server handler 227 on 8022: Error encountered requiring NN shutdown. Shutting down immediately.
java.lang.IllegalStateException: Cannot start writing at txid 58504771317 when there is a stream available for read: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@57d3ac44
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:324)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1417)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1969)
    at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
    at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:58)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1826)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1658)
    at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:111)
    at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:5409)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:620)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1125)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3246)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3242)
    at java.base/java.security.AccessController.doPrivileged(AccessController.java:689)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3240)
{code}

After looking at the code, I think _editLogTailer.catchupDuringFailover()_ 
could not catch up all edits, which caused the check in 
_getFSImage().editLog.openForWrite()_ to fail.

One journal went wrong while writing an edit to disk after the edit had been 
written into its cache successfully.

And since _onlyDurableTxns_ is true, we take _responseCounts.get(0)_; the 
broken journal's response is _responseCounts.get(0)_, which is why 
_editLogTailer.catchupDuringFailover()_ could not catch up all edits.

The responses looked like {{(0, 1000, 1000)}}.

{quote}Thus since we only got 3 responses, we have to take the lowest txn that 
any of those responses are aware of.
{quote}
This means we may not be able to tail any edits when the journal behind the 
first (lowest) response is broken:
 * It may cause _editLogTailer.catchupDuringFailover()_ to miss edits, and 
crash the NN when it transitions to active.
 * It may leave the Observer NameNode unable to serve read RPCs.


was (Author: xuzq_zander):
Thanks [~xkrogen], It is when onlyDurableTxns is true that we get 
responseCounts.get(0).

In our production environment, one nameNode is down when we failover it to 
active, and cached one exception like:

 
{code:java}
2021-02-01 20:38:23,402 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode 
IPC Server handler 227 on 8022: Error encountered requiring N
N shutdown. Shutting down immediately.
java.lang.IllegalStateException: Cannot start writing at txid 58504771317 when 
there is a stream available for read: org.apache.hadoop.hdfs
.server.namenode.RedundantEditLogInputStream@57d3ac44
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:324)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1417)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1969)
at 
org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
at 
org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:58)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1826)
at 

[jira] [Work logged] (HDFS-15683) Allow configuring DISK/ARCHIVE capacity for individual volumes

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15683?focusedWorklogId=546818=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546818
 ]

ASF GitHub Bot logged work on HDFS-15683:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 02:32
Start Date: 03/Feb/21 02:32
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2625:
URL: https://github.com/apache/hadoop/pull/2625#issuecomment-772171696


   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |  14m 22s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  1s |  |  No case conflicting files 
found.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 2 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  32m 22s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 21s |  |  trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |   1m 18s |  |  trunk passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01  |
   | +1 :green_heart: |  checkstyle  |   1m 20s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 26s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  16m  7s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 54s |  |  trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 29s |  |  trunk passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01  |
   | +0 :ok: |  spotbugs  |   3m  6s |  |  Used deprecated FindBugs config; 
considering switching to SpotBugs.  |
   | +1 :green_heart: |  findbugs  |   3m  2s |  |  trunk passed  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 14s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 13s |  |  the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |   1m 13s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  3s |  |  the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01  |
   | +1 :green_heart: |  javac  |   1m  3s |  |  the patch passed  |
   | +1 :green_heart: |  checkstyle  |   1m 12s |  |  
hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 741 unchanged - 1 
fixed = 741 total (was 742)  |
   | +1 :green_heart: |  mvnsite  |   1m 13s |  |  the patch passed  |
   | +1 :green_heart: |  whitespace  |   0m  0s |  |  The patch has no 
whitespace issues.  |
   | +1 :green_heart: |  xml  |   0m  1s |  |  The patch has no ill-formed XML 
file.  |
   | +1 :green_heart: |  shadedclient  |  12m 54s |  |  patch has no errors 
when building and testing our client artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 50s |  |  the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   1m 25s |  |  the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01  |
   | +1 :green_heart: |  findbugs  |   3m  7s |  |  the patch passed  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 197m 57s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2625/7/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 42s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 297m 55s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.namenode.TestNamenodeRetryCache |
   |   | hadoop.hdfs.server.datanode.fsdataset.impl.TestFsDatasetImpl |
   |   | hadoop.hdfs.server.namenode.TestFSEditLogLoader |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2625/7/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2625 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient findbugs checkstyle xml |
   | uname | Linux b84ecd5c541c 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / f37bf651993 |
   | Default Java | Private Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 |
   | Multi-JDK versions | 

[jira] [Comment Edited] (HDFS-13609) [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via RPC

2021-02-02 Thread xuzq (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277640#comment-17277640
 ] 

xuzq edited comment on HDFS-13609 at 2/3/21, 2:32 AM:
--

Thanks [~xkrogen]. It is when onlyDurableTxns is true that we get 
responseCounts.get(0).

In our production environment, one NameNode went down when we failed it over 
to active, and we caught an exception like:

{code:java}
2021-02-01 20:38:23,402 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode IPC Server handler 227 on 8022: Error encountered requiring NN shutdown. Shutting down immediately.
java.lang.IllegalStateException: Cannot start writing at txid 58504771317 when there is a stream available for read: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@57d3ac44
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:324)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1417)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1969)
    at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
    at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:58)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1826)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1658)
    at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:111)
    at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:5409)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:620)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1125)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3246)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3242)
    at java.base/java.security.AccessController.doPrivileged(AccessController.java:689)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3240)
{code}

After looking at the code, I think _editLogTailer.catchupDuringFailover()_ 
could not catch up all edits, which caused the check in 
_getFSImage().editLog.openForWrite()_ to fail.

One journal went wrong while writing an edit to disk after the edit had been 
written into its cache successfully.

And since _onlyDurableTxns_ is true, we take _responseCounts.get(0)_; the 
broken journal's response is _responseCounts.get(0)_, which is why 
_editLogTailer.catchupDuringFailover()_ could not catch up all edits.

The responses looked like {{(0, 1000, 1000)}}.

{quote}Thus since we only got 3 responses, we have to take the lowest txn that 
any of those responses are aware of.
{quote}
This means we may not be able to tail any edits when the journal behind the 
first (lowest) response is broken:
 * It may cause _editLogTailer.catchupDuringFailover()_ to miss edits, and 
crash the NN when it transitions to active.
 * It may leave the Observer NameNode unable to serve read RPCs.


was (Author: xuzq_zander):
Thanks [~xkrogen], It is when onlyDurableTxns is true that we get 
responseCounts.get(0).

In our production environment, one nameNode is down when we failover it to 
active, and cached one exception like:

 
{code:java}
2021-02-01 20:38:23,402 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode 
IPC Server handler 227 on 8022: Error encountered requiring N
N shutdown. Shutting down immediately.
java.lang.IllegalStateException: Cannot start writing at txid 58504771317 when 
there is a stream available for read: org.apache.hadoop.hdfs
.server.namenode.RedundantEditLogInputStream@57d3ac44
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:324)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1417)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1969)
at 
org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
at 
org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:58)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1826)
at 

[jira] [Commented] (HDFS-13609) [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via RPC

2021-02-02 Thread xuzq (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277637#comment-17277637
 ] 

xuzq commented on HDFS-13609:
-

Thanks [~xkrogen] for the comment. It is when {{onlyDurableTxns}} is true that 
we get {{responseCounts.get(0)}}.

In our production environment, one NameNode went down when we failed it over 
to active, and we caught an exception like:

{code:java}
2021-02-01 20:38:23,402 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode IPC Server handler 227 on 8022: Error encountered requiring NN shutdown. Shutting down immediately.
java.lang.IllegalStateException: Cannot start writing at txid 58504771317 when there is a stream available for read: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@57d3ac44
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:324)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1417)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1969)
    at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
    at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:58)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1826)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1658)
    at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:111)
    at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:5409)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:620)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1125)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3246)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3242)
    at java.base/java.security.AccessController.doPrivileged(AccessController.java:689)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3240)
{code}

After looking at the code, I think _editLogTailer.catchupDuringFailover()_ 
could not catch up all edits, which caused the check in 
_getFSImage().editLog.openForWrite()_ to fail.

When _{{onlyDurableTxns}}_ is true we take _responseCounts.get(0)_, which 
caused _editLogTailer.catchupDuringFailover()_ to miss edits: one journal went 
wrong while writing an edit to disk after it had been written into the cache, 
and that journal's response is _responseCounts.get(0)_.

{quote}Thus since we only got 3 responses, we have to take the lowest txn that 
any of those responses are aware of.
{quote}
 * It may cause *_editLogTailer.catchupDuringFailover()_ to miss edits* when 
_maxAllowedTxns = {{responseCounts.get(0)}} = 0_.
 * It may also cause doTailEdits to be unable to tail any edits.

 

 

> [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via 
> RPC
> -
>
> Key: HDFS-13609
> URL: https://issues.apache.org/jira/browse/HDFS-13609
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, namenode
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: HDFS-12943, 3.3.0
>
> Attachments: HDFS-13609-HDFS-12943.000.patch, 
> HDFS-13609-HDFS-12943.001.patch, HDFS-13609-HDFS-12943.002.patch, 
> HDFS-13609-HDFS-12943.003.patch, HDFS-13609-HDFS-12943.004.patch
>
>
> See HDFS-13150 for the full design.
> This JIRA is targetted at the NameNode-side changes to enable tailing 
> in-progress edits via the RPC mechanism added in HDFS-13608. Most changes are 
> in the QuorumJournalManager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-13609) [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via RPC

2021-02-02 Thread xuzq (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277640#comment-17277640
 ] 

xuzq edited comment on HDFS-13609 at 2/3/21, 2:29 AM:
--

Thanks [~xkrogen]. It is when onlyDurableTxns is true that we get 
responseCounts.get(0).

In our production environment, one NameNode went down when we failed it over 
to active, and we caught an exception like:

{code:java}
2021-02-01 20:38:23,402 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode IPC Server handler 227 on 8022: Error encountered requiring NN shutdown. Shutting down immediately.
java.lang.IllegalStateException: Cannot start writing at txid 58504771317 when there is a stream available for read: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@57d3ac44
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:324)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1417)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1969)
    at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
    at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:58)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1826)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1658)
    at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:111)
    at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:5409)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:620)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1125)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3246)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3242)
    at java.base/java.security.AccessController.doPrivileged(AccessController.java:689)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3240)
{code}

After looking at the code, I think _editLogTailer.catchupDuringFailover()_ 
could not catch up all edits, which caused the check in 
_getFSImage().editLog.openForWrite()_ to fail.

One journal went wrong while writing an edit to disk after the edit had been 
written into its cache successfully.

And since _onlyDurableTxns_ is true, we take _responseCounts.get(0)_; the 
broken journal's response is _responseCounts.get(0)_, which is why 
_editLogTailer.catchupDuringFailover()_ could not catch up all edits.

The responses looked like {{(0, 1000, 1000)}}.

{quote}Thus since we only got 3 responses, we have to take the lowest txn that 
any of those responses are aware of.
{quote}
This means we may not be able to tail any edits when the journal behind the 
first (lowest) response is broken:
 * It may cause _editLogTailer.catchupDuringFailover()_ to miss edits, and 
crash the NN when it transitions to active.
 * It may leave the Observer NameNode unable to serve read RPCs.


was (Author: xuzq_zander):
Thanks [~xkrogen], It is when onlyDurableTxns is true that we get 
responseCounts.get(0).

In our production environment, one nameNode is down when we failover it to 
active, and cached one exception like:

 
{code:java}
2021-02-01 20:38:23,402 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode 
IPC Server handler 227 on 8022: Error encountered requiring N
N shutdown. Shutting down immediately.
java.lang.IllegalStateException: Cannot start writing at txid 58504771317 when 
there is a stream available for read: org.apache.hadoop.hdfs
.server.namenode.RedundantEditLogInputStream@57d3ac44
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:324)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1417)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1969)
at 
org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
at 
org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:58)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1826)
at 

[jira] [Comment Edited] (HDFS-13609) [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via RPC

2021-02-02 Thread xuzq (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277637#comment-17277637
 ] 

xuzq edited comment on HDFS-13609 at 2/3/21, 2:18 AM:
--

Thanks [~xkrogen] for the comment. It is when {{onlyDurableTxns}} is true that 
we get {{responseCounts.get(0)}}.

In our production environment, one NameNode went down when we failed it over 
to active, and we caught an exception like:

{code:java}
2021-02-01 20:38:23,402 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode IPC Server handler 227 on 8022: Error encountered requiring NN shutdown. Shutting down immediately.
java.lang.IllegalStateException: Cannot start writing at txid 58504771317 when there is a stream available for read: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@57d3ac44
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:324)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1417)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1969)
    at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
    at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:58)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1826)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1658)
    at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:111)
    at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:5409)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:620)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1125)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3246)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3242)
    at java.base/java.security.AccessController.doPrivileged(AccessController.java:689)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3240)
{code}

After looking at the code, I think _editLogTailer.catchupDuringFailover()_ 
could not catch up all edits, which caused the check in 
_getFSImage().editLog.openForWrite()_ to fail.

When _{{onlyDurableTxns}}_ is true we take _responseCounts.get(0)_, which 
caused _editLogTailer.catchupDuringFailover()_ to miss edits: one journal went 
wrong while writing an edit to disk after it had been written into the cache 
successfully, and that broken journal's response is _responseCounts.get(0)_.

{quote}Thus since we only got 3 responses, we have to take the lowest txn that 
any of those responses are aware of.
{quote}
 * It may cause *_editLogTailer.catchupDuringFailover()_ to miss edits* when 
_maxAllowedTxns = {{responseCounts.get(0)}} = 0_.
 * It may also cause doTailEdits to be unable to tail any edits.

 

 


was (Author: xuzq_zander):
Thanks [~xkrogen] for the comment. It is when {{onlyDurableTxns}} is true that 
we get {{responseCounts.get(0)}}

In our production  environment,  one nameNode is down when we failover it to 
active, and cache one exception like:

 
{code:java}
2021-02-01 20:38:23,402 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode 
IPC Server handler 227 on 8022: Error encountered requiring N
N shutdown. Shutting down immediately.
java.lang.IllegalStateException: Cannot start writing at txid 58504771317 when 
there is a stream available for read: org.apache.hadoop.hdfs
.server.namenode.RedundantEditLogInputStream@57d3ac44
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:324)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1417)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1969)
at 
org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
at 
org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:58)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1826)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1658)
at 

[jira] [Commented] (HDFS-13609) [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via RPC

2021-02-02 Thread xuzq (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277640#comment-17277640
 ] 

xuzq commented on HDFS-13609:
-

Thanks [~xkrogen]. It is when onlyDurableTxns is true that we get 
responseCounts.get(0).

In our production environment, one NameNode went down when we failed it over 
to active, and we caught an exception like:

{code:java}
2021-02-01 20:38:23,402 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode IPC Server handler 227 on 8022: Error encountered requiring NN shutdown. Shutting down immediately.
java.lang.IllegalStateException: Cannot start writing at txid 58504771317 when there is a stream available for read: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@57d3ac44
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:324)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1417)
    at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1969)
    at org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
    at org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
    at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:58)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1826)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1658)
    at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:111)
    at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:5409)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:620)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1125)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3246)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3242)
    at java.base/java.security.AccessController.doPrivileged(AccessController.java:689)
    at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3240)
{code}

After looking at the code, I think _editLogTailer.catchupDuringFailover()_ 
could not catch up all edits, which caused the check in 
_getFSImage().editLog.openForWrite()_ to fail.

One journal went wrong while writing an edit to disk after the edit had been 
written into its cache successfully.

And since _onlyDurableTxns_ is true, we take _responseCounts.get(0)_; the 
broken journal's response is _responseCounts.get(0)_, which is why 
_editLogTailer.catchupDuringFailover()_ could not catch up all edits.

{quote}Thus since we only got 3 responses, we have to take the lowest txn that 
any of those responses are aware of.
{quote}
This means we may not be able to tail any edits when the journal behind the 
first (lowest) response is broken:
 * It may cause _editLogTailer.catchupDuringFailover()_ to miss edits, and 
crash the NN when it transitions to active.
 * It may leave the Observer NameNode unable to serve read RPCs.

> [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via 
> RPC
> -
>
> Key: HDFS-13609
> URL: https://issues.apache.org/jira/browse/HDFS-13609
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, namenode
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: HDFS-12943, 3.3.0
>
> Attachments: HDFS-13609-HDFS-12943.000.patch, 
> HDFS-13609-HDFS-12943.001.patch, HDFS-13609-HDFS-12943.002.patch, 
> HDFS-13609-HDFS-12943.003.patch, HDFS-13609-HDFS-12943.004.patch
>
>
> See HDFS-13150 for the full design.
> This JIRA is targetted at the NameNode-side changes to enable tailing 
> in-progress edits via the RPC mechanism added in HDFS-13608. Most changes are 
> in the QuorumJournalManager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (HDFS-13609) [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via RPC

2021-02-02 Thread xuzq (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuzq updated HDFS-13609:

Comment: was deleted

(was: Thanks [~xkrogen] for the comment. It is when {{onlyDurableTxns}} is true 
that we get {{responseCounts.get(0)}}

In our production  environment,  one nameNode is down when we failover it to 
active, and cache one exception like:

 
{code:java}
2021-02-01 20:38:23,402 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode 
IPC Server handler 227 on 8022: Error encountered requiring N
N shutdown. Shutting down immediately.
java.lang.IllegalStateException: Cannot start writing at txid 58504771317 when 
there is a stream available for read: org.apache.hadoop.hdfs
.server.namenode.RedundantEditLogInputStream@57d3ac44
at 
org.apache.hadoop.hdfs.server.namenode.FSEditLog.openForWrite(FSEditLog.java:324)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:1417)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startActiveServices(NameNode.java:1969)
at 
org.apache.hadoop.hdfs.server.namenode.ha.ActiveState.enterState(ActiveState.java:61)
at 
org.apache.hadoop.hdfs.server.namenode.ha.HAState.setStateInternal(HAState.java:64)
at 
org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.setState(StandbyState.java:58)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1826)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1658)
at 
org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslator
PB.java:111)
at 
org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:54
09)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:620)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1125)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3246)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:3242)
at 
java.base/java.security.AccessController.doPrivileged(AccessController.java:689)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1796)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3240)
{code}
 After looking at the code, i think _editLogTailer.catchupDuringFailover()_ 
can't catchup all edits, cause check failed when 
_getFSImage().editLog.openForWrite()_.

As when _{{onlyDurableTxns}}_ is true that we get {{_responseCounts.get(0)_, 
caused }}{{_editLogTailer.catchupDuringFailover()_ can't catchup all edits, 
because one journal is wrong when write edit on disk after write it into cache 
successfully, and this wrong journal's response is 
}}{{_responseCounts.get(0)._}}{{}}{{}}

 
{quote}Thus since we only got 3 responses, we have to take the lowest txn that 
any of those responses are aware of.
{quote} * It maybe cause *_editLogTailer.catchupDuringFailover()_ can't catchup 
all edits* when _maxAllowedTxns={{responseCounts.get(0)=0.}}_
 * And It maybe cause doTailEdits can't tail any edits too.

 

 )

> [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via 
> RPC
> -
>
> Key: HDFS-13609
> URL: https://issues.apache.org/jira/browse/HDFS-13609
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, namenode
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: HDFS-12943, 3.3.0
>
> Attachments: HDFS-13609-HDFS-12943.000.patch, 
> HDFS-13609-HDFS-12943.001.patch, HDFS-13609-HDFS-12943.002.patch, 
> HDFS-13609-HDFS-12943.003.patch, HDFS-13609-HDFS-12943.004.patch
>
>
> See HDFS-13150 for the full design.
> This JIRA is targetted at the NameNode-side changes to enable tailing 
> in-progress edits via the RPC mechanism added in HDFS-13608. Most changes are 
> in the QuorumJournalManager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-02-02 Thread Hui Fei (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277629#comment-17277629
 ] 

Hui Fei commented on HDFS-15798:


[~sodonnell] Thanks for the comments; [~haiyang Hu], thanks for the update.

+1 on [^HDFS-15798.003.patch]

 

> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch, 
> HDFS-15798.003.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress in 
> processErasureCodingTasks operates on an inconsistent value, so the 
> DataNode's XmitsInProgress can become negative. This affects how the NN 
> chooses pending tasks based on the ratio between the lengths of the 
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
> increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  task failed 
> decrement,  XmitsInProgress is decremented by the previous value
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
> initDecoderIfNecessary();
>...
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it set to zero, we cannot to measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete 
> decrement
> ...
>   }
> }{code}
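
A self-contained illustration of the invariant the annotated code above 
describes: decrement by exactly what was incremented, so the in-progress 
counter cannot go negative when a task fails part-way (illustrative names, 
not the actual HDFS-15798 patch):

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

class XmitsCounterSketch {
  private final AtomicInteger xmitsInProgress = new AtomicInteger();

  void runTask(Runnable task, int xmits) {
    int incremented = 0;
    try {
      xmitsInProgress.addAndGet(xmits);
      incremented = xmits;                     // record what was actually added
      task.run();
    } finally {
      xmitsInProgress.addAndGet(-incremented); // symmetric with the increment
    }
  }

  int current() {
    return xmitsInProgress.get();              // never drops below zero here
  }
}
{code}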



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15799) Make DisallowedDatanodeException terse

2021-02-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277625#comment-17277625
 ] 

Hadoop QA commented on HDFS-15799:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  6m 
31s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red}{color} | {color:red} The patch doesn't appear to 
include any new or modified tests. Please justify why no new tests are needed 
for this patch. Also please list what manual steps were performed to verify 
this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 23m 
 4s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
22s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
13s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
59s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
24s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m 41s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
53s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
21s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  3m 
17s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs 
config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
14s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
18s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
16s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
16s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
7s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
7s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
53s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
13s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m  0s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
47s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| 

[jira] [Work logged] (HDFS-15795) EC: Wrong checksum when reconstruction was failed by exception

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15795?focusedWorklogId=546756=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546756
 ]

ASF GitHub Bot logged work on HDFS-15795:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 01:35
Start Date: 03/Feb/21 01:35
Worklog Time Spent: 10m 
  Work Description: sodonnel merged pull request #2657:
URL: https://github.com/apache/hadoop/pull/2657


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546756)
Time Spent: 1h 50m  (was: 1h 40m)

> EC: Wrong checksum when reconstruction was failed by exception
> --
>
> Key: HDFS-15795
> URL: https://issues.apache.org/jira/browse/HDFS-15795
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, ec, erasure-coding
>Reporter: Yushi Hayasaka
>Assignee: Yushi Hayasaka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> If the reconstruction task fails on StripedBlockChecksumReconstructor with 
> an exception, the checksum becomes a wrong one because it is calculated from 
> the blocks excluding the failed one.
> This is caused by catching the exception in an inappropriate way; as a 
> result, the failed block is not fetched again.
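
A minimal sketch of the failure mode described above (illustrative code, not 
the actual StripedBlockChecksumReconstructor):

{code:java}
import java.io.IOException;
import java.util.List;
import java.util.zip.CRC32;

class GroupChecksumSketch {
  // If a missing block is silently skipped, the checksum is computed from
  // the remaining blocks and is simply wrong; throwing instead lets the
  // caller fall back to refetching the failed block.
  long checksum(List<byte[]> blocks) throws IOException {
    CRC32 crc = new CRC32();
    for (byte[] block : blocks) {
      if (block == null) {
        throw new IOException("block missing; caller should refetch");
      }
      crc.update(block);
    }
    return crc.getValue();
  }
}
{code}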



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15757) RBF: Improving Router Connection Management

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15757?focusedWorklogId=546625=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546625
 ]

ASF GitHub Bot logged work on HDFS-15757:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 01:23
Start Date: 03/Feb/21 01:23
Worklog Time Spent: 10m 
  Work Description: fengnanli commented on a change in pull request #2651:
URL: https://github.com/apache/hadoop/pull/2651#discussion_r568866912



##
File path: 
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/ConnectionPool.java
##
@@ -252,19 +252,23 @@ public synchronized void addConnection(ConnectionContext 
conn) {
*/
   public synchronized List<ConnectionContext> removeConnections(int num) {
     List<ConnectionContext> removed = new LinkedList<>();
-
-    // Remove and close the last connection
-    List<ConnectionContext> tmpConnections = new ArrayList<>();
-    for (int i = 0; i < this.connections.size(); i++) {
...
+    if (this.connections.size() > this.minSize) {
+      int targetCount = Math.min(num, this.connections.size() - this.minSize);

Review comment:
   I don't think it can be negative here, since the only place connections 
shrink is in this function, at the swap with the tmpConnections. The 
other place where this var gets assigned is in the creation path, and that can 
only increase the value.
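
A rough sketch of the greedy removal under discussion (field and method names are assumptions taken from the quoted diff, not the exact patch); the Math.max guard keeps targetCount non-negative even in the case argued impossible above:

{code:java}
// Sketch only; assumes the pool fields shown in the diff above.
public synchronized List<ConnectionContext> removeConnections(int num) {
  List<ConnectionContext> removed = new LinkedList<>();
  if (this.connections.size() > this.minSize) {
    // Never shrink below the minimum pool size; Math.max guards the
    // (argued impossible) case where size already dipped under minSize.
    int targetCount =
        Math.max(0, Math.min(num, this.connections.size() - this.minSize));
    List<ConnectionContext> tmpConnections = new ArrayList<>();
    for (ConnectionContext conn : this.connections) {
      // Greedily pick idle connections until the target is met.
      if (removed.size() < targetCount && !conn.isActiveRecently()) {
        removed.add(conn);
      } else {
        tmpConnections.add(conn);
      }
    }
    this.connections = tmpConnections;
  }
  return removed; // the caller closes these outside the lock
}
{code}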

##
File path: 
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/ConnectionContext.java
##
@@ -57,6 +62,17 @@ public synchronized boolean isActive() {
 return this.numThreads > 0;
   }
 
+  /**
+   * Check if the connection is/was active recently.
+   *
+   * @return True if the connection is active or
+   * was active in the past period of time.
+   */
+  public synchronized boolean isActiveRecently() {
+return isActive() ||

Review comment:
   That can be removed, since the time-window calculation covers the active 
case. Updated.
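
A sketch of the timestamp-only check agreed on here (constant and field names assumed from the quoted diff and review; not the exact patch):

{code:java}
// Sketch only.
private static final long ACTIVE_WINDOW_TIME = TimeUnit.SECONDS.toMillis(30);

/** True if the connection is active now or was used within the window. */
public synchronized boolean isActiveRecently() {
  // lastActiveTs is refreshed on every use, so a connection that is active
  // right now also falls inside the window; no separate isActive() needed.
  return System.currentTimeMillis() - this.lastActiveTs <= ACTIVE_WINDOW_TIME;
}
{code}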





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546625)
Time Spent: 3h 10m  (was: 3h)

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: RBF_ Improving Router Connection Management_v2.pdf, RBF_ 
> Improving Router Connection Management_v3.pdf, RBF_ Router Connection 
> Management.pdf
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> We have seen a high number of connections from the Router to namenodes, leaving 
> namenodes unstable.
> This ticket tries to reduce connections through some changes. Please take 
> a look at the design and leave comments. 
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15757) RBF: Improving Router Connection Management

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15757?focusedWorklogId=546612=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546612
 ]

ASF GitHub Bot logged work on HDFS-15757:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 01:21
Start Date: 03/Feb/21 01:21
Worklog Time Spent: 10m 
  Work Description: fengnanli commented on pull request #2651:
URL: https://github.com/apache/hadoop/pull/2651#issuecomment-771903717


   > Thanks @fengnanli for your work here. Left some nit comments inline.
   > Sorry, I do not get why the change can reduce connections here after reviewing 
the changes; is it related to "Be greedy here to close as many connections as 
possible in one shot"? It would be helpful if we added some javadocs explicitly. 
Thanks.
   
   Thanks for the review @Hexiaoqiao. I put the reasoning behind this change in 
the design doc in the original JIRA ticket. In short, I implemented synchronous 
connection closing + better connection picking + greedy connection closing. I 
have seen a 50% reduction in the number of connections and better ProxyTime. It 
would be great if you can try it in your setup as well.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546612)
Time Spent: 3h  (was: 2h 50m)

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: RBF_ Improving Router Connection Management_v2.pdf, RBF_ 
> Improving Router Connection Management_v3.pdf, RBF_ Router Connection 
> Management.pdf
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> We have seen a high number of connections from the Router to namenodes, leaving 
> namenodes unstable.
> This ticket tries to reduce connections through some changes. Please take 
> a look at the design and leave comments. 
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block

2021-02-02 Thread Hui Fei (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277592#comment-17277592
 ] 

Hui Fei commented on HDFS-15779:


[~wanghongbing] Thanks for the update, will commit later.

> EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
> -
>
> Key: HDFS-15779
> URL: https://issues.apache.org/jira/browse/HDFS-15779
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Hongbing Wang
>Assignee: Hongbing Wang
>Priority: Major
> Attachments: HDFS-15779.001.patch, HDFS-15779.002.patch
>
>
> The NullPointerException in the DN log is as follows: 
> {code:java}
> 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY
> //...
> 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Connection timed out
> 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Failed to reconstruct striped block: 
> BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Receiving 
> BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 
> src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50
> 010
> {code}
> NPE occurs at `writer.getTargetBuffer()` in the following code:
> {code:java}
> // StripedWriter#clearBuffers
> void clearBuffers() {
>   for (StripedBlockWriter writer : writers) {
> ByteBuffer targetBuffer = writer.getTargetBuffer();
> if (targetBuffer != null) {
>   targetBuffer.clear();
> }
>   }
> }
> {code}
> So, why is the writer null? Let's track when the writer is initialized and 
> when reconstruct() is called, as follows:
> {code:java}
> // StripedBlockReconstructor#run
> public void run() {
>   try {
> initDecoderIfNecessary();
> getStripedReader().init();
> stripedWriter.init();  //①
> reconstruct();  //②
> stripedWriter.endTargetBlocks();
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> // ...{code}
> They are called at ① and ② above respectively. `stripedWriter.init()` -> 
> `initTargetStreams()`, as follows:
> {code:java}
> // StripedWriter#initTargetStreams
> int initTargetStreams() {
>   int nSuccess = 0;
>   for (short i = 0; i < targets.length; i++) {
> try {
>   writers[i] = createWriter(i);
>   nSuccess++;
>   targetsStatus[i] = true;
> } catch (Throwable e) {
>   LOG.warn(e.getMessage());
> }
>   }
>   return nSuccess;
> }
> {code}
> NPE occurs when createWriter() throws an exception and 0 < nSuccess < 
> targets.length.
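
A defensive sketch of the fix implied by this analysis (it mirrors the quoted clearBuffers with an added null check; not necessarily the exact committed patch):

{code:java}
// StripedWriter#clearBuffers, guarding the writer slots that
// createWriter() never filled in.
void clearBuffers() {
  for (StripedBlockWriter writer : writers) {
    if (writer == null) {
      // createWriter() threw for this target, so writers[i] was never set.
      continue;
    }
    ByteBuffer targetBuffer = writer.getTargetBuffer();
    if (targetBuffer != null) {
      targetBuffer.clear();
    }
  }
}
{code}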



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15814) Make some parameters configurable for DataNodeDiskMetrics

2021-02-02 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-15814:
---
External issue URL: https://github.com/apache/hadoop/pull/2676

> Make some parameters configurable for DataNodeDiskMetrics
> -
>
> Key: HDFS-15814
> URL: https://issues.apache.org/jira/browse/HDFS-15814
> Project: Hadoop HDFS
>  Issue Type: Wish
>  Components: hdfs
>Reporter: tomscut
>Priority: Major
>
> For ease of use, especially for small clusters, we can make some 
> parameters (MIN_OUTLIER_DETECTION_DISKS, SLOW_DISK_LOW_THRESHOLD_MS) 
> configurable.
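
A rough sketch of what "configurable" could look like (the config keys and defaults below are hypothetical placeholders, not keys from any committed patch):

{code:java}
import org.apache.hadoop.conf.Configuration;

// Illustrative only; key names and defaults are invented for this sketch.
class DiskMetricsConfSketch {
  static final String MIN_OUTLIER_DETECTION_DISKS_KEY =
      "dfs.datanode.min.outlier.detection.disks";
  static final int MIN_OUTLIER_DETECTION_DISKS_DEFAULT = 5;
  static final String SLOW_DISK_LOW_THRESHOLD_MS_KEY =
      "dfs.datanode.slowdisk.low.threshold.ms";
  static final long SLOW_DISK_LOW_THRESHOLD_MS_DEFAULT = 20;

  final int minOutlierDetectionDisks;
  final long slowDiskLowThresholdMs;

  DiskMetricsConfSketch(Configuration conf) {
    // Fall back to the previously hard-coded values when the keys are unset.
    this.minOutlierDetectionDisks = conf.getInt(
        MIN_OUTLIER_DETECTION_DISKS_KEY, MIN_OUTLIER_DETECTION_DISKS_DEFAULT);
    this.slowDiskLowThresholdMs = conf.getLong(
        SLOW_DISK_LOW_THRESHOLD_MS_KEY, SLOW_DISK_LOW_THRESHOLD_MS_DEFAULT);
  }
}
{code}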



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15814) Make some parameters configurable for DataNodeDiskMetrics

2021-02-02 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-15814:
---
External issue URL:   (was: https://github.com/apache/hadoop/pull/2676)

> Make some parameters configurable for DataNodeDiskMetrics
> -
>
> Key: HDFS-15814
> URL: https://issues.apache.org/jira/browse/HDFS-15814
> Project: Hadoop HDFS
>  Issue Type: Wish
>  Components: hdfs
>Reporter: tomscut
>Priority: Major
>
> For ease of use, especially for small clusters, we can make some 
> parameters (MIN_OUTLIER_DETECTION_DISKS, SLOW_DISK_LOW_THRESHOLD_MS) 
> configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15814) Make some parameters configurable for DataNodeDiskMetrics

2021-02-02 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-15814:
---
External issue URL: https://github.com/apache/hadoop/pull/2676

> Make some parameters configurable for DataNodeDiskMetrics
> -
>
> Key: HDFS-15814
> URL: https://issues.apache.org/jira/browse/HDFS-15814
> Project: Hadoop HDFS
>  Issue Type: Wish
>  Components: hdfs
>Reporter: tomscut
>Priority: Major
>
> For ease of use, especially for small clusters, we can make some 
> parameters (MIN_OUTLIER_DETECTION_DISKS, SLOW_DISK_LOW_THRESHOLD_MS) 
> configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15814) Make some parameters configurable for DataNodeDiskMetrics

2021-02-02 Thread tomscut (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tomscut updated HDFS-15814:
---
External issue URL:   (was: https://github.com/apache/hadoop/pull/2676)

> Make some parameters configurable for DataNodeDiskMetrics
> -
>
> Key: HDFS-15814
> URL: https://issues.apache.org/jira/browse/HDFS-15814
> Project: Hadoop HDFS
>  Issue Type: Wish
>  Components: hdfs
>Reporter: tomscut
>Priority: Major
>
> For ease of use, especially for small clusters, we can make some 
> parameters (MIN_OUTLIER_DETECTION_DISKS, SLOW_DISK_LOW_THRESHOLD_MS) 
> configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15801) Backport HDFS-14582 to branch-2.10 (Failed to start DN with ArithmeticException when NULL checksum used)

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15801?focusedWorklogId=546491=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546491
 ]

ASF GitHub Bot logged work on HDFS-15801:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 01:10
Start Date: 03/Feb/21 01:10
Worklog Time Spent: 10m 
  Work Description: jojochuang merged pull request #2659:
URL: https://github.com/apache/hadoop/pull/2659


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546491)
Time Spent: 1h  (was: 50m)

> Backport HDFS-14582 to branch-2.10 (Failed to start DN with 
> ArithmeticException when NULL checksum used)
> 
>
> Key: HDFS-15801
> URL: https://issues.apache.org/jira/browse/HDFS-15801
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Janus Chow
>Assignee: Janus Chow
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.10.2
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In HDFS-14582, the error message is clearer, as follows:
> {code:java}
> Caused by: java.lang.ArithmeticException: / by zero
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.validateIntegrityAndSetLength(BlockPoolSlice.java:823)
> at 
> {code}
> But in branch-2.10.1, the exception message is omitted as follows:
> {code:java}
> 2021-01-29 14:20:30,694 INFO  impl.FsDatasetImpl (FsVolumeList.java:run(204)) 
> - Caught exception while adding replicas from /mnt/disk/0/hdfs/data/current. 
> Will throw later.
> java.io.IOException: Failed to start sub tasks to add replica in replica map 
> :java.lang.ArithmeticExceptionjava.io.IOException: Failed to start sub tasks 
> to add replica in replica map :java.lang.ArithmeticException at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.getVolumeMap(BlockPoolSlice.java:434)
>  at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getVolumeMap(FsVolumeImpl.java:930)
>  at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$1.run(FsVolumeList.java:196)
> {code}
> The specific error message is omitted, making it harder to find the root 
> cause.
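
The difference comes down to how the per-replica exception is surfaced; a minimal sketch of the message-preserving pattern (method and variable names are illustrative, not the exact branch-2.10 code):

{code:java}
try {
  validateIntegrityAndSetLength(replicaFile); // may throw ArithmeticException
} catch (Exception e) {
  // Concatenating only the exception's class name is what produces the bare
  // ":java.lang.ArithmeticException". Passing e as the cause preserves the
  // "/ by zero" message and the original stack trace in the log.
  throw new IOException(
      "Failed to start sub tasks to add replica in replica map", e);
}
{code}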



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15808) Add metrics for FSNamesystem read/write lock warnings

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15808?focusedWorklogId=546495=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546495
 ]

ASF GitHub Bot logged work on HDFS-15808:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 01:10
Start Date: 03/Feb/21 01:10
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #2668:
URL: https://github.com/apache/hadoop/pull/2668#issuecomment-771457006


   Failed junit tests 
   hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints 
   hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks 
   
   Sorry. I didn't change those two unit tests, and they worked fine locally.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546495)
Time Spent: 1h  (was: 50m)

> Add metrics for FSNamesystem read/write lock warnings
> -
>
> Key: HDFS-15808
> URL: https://issues.apache.org/jira/browse/HDFS-15808
> Project: Hadoop HDFS
>  Issue Type: Wish
>  Components: hdfs
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: hdfs, lock, metrics, pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> To monitor how often read/write locks exceed thresholds, we can add two 
> metrics (ReadLockWarning/WriteLockWarning), which are exposed in JMX.
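
A rough sketch of such counters using Hadoop's metrics2 annotations (wiring and names assumed; not the exact HDFS-15808 patch):

{code:java}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;

@Metrics(context = "namenode")
class LockWarningMetricsSketch {
  @Metric("Times the read lock was held past the warn threshold")
  MutableCounterLong readLockWarning;

  @Metric("Times the write lock was held past the warn threshold")
  MutableCounterLong writeLockWarning;

  void onReadUnlock(long heldMs, long warnThresholdMs) {
    if (heldMs > warnThresholdMs) {
      readLockWarning.incr(); // surfaced through JMX by the metrics system
    }
  }

  void onWriteUnlock(long heldMs, long warnThresholdMs) {
    if (heldMs > warnThresholdMs) {
      writeLockWarning.incr();
    }
  }
}
{code}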



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15814) Make some parameters configurable for DataNodeDiskMetrics

2021-02-02 Thread tomscut (Jira)
tomscut created HDFS-15814:
--

 Summary: Make some parameters configurable for DataNodeDiskMetrics
 Key: HDFS-15814
 URL: https://issues.apache.org/jira/browse/HDFS-15814
 Project: Hadoop HDFS
  Issue Type: Wish
  Components: hdfs
Reporter: tomscut


For ease of use, especially for small clusters, we can make some 
parameters (MIN_OUTLIER_DETECTION_DISKS, SLOW_DISK_LOW_THRESHOLD_MS) 
configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15757) RBF: Improving Router Connection Management

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15757?focusedWorklogId=546454=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546454
 ]

ASF GitHub Bot logged work on HDFS-15757:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 01:06
Start Date: 03/Feb/21 01:06
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2651:
URL: https://github.com/apache/hadoop/pull/2651#issuecomment-771379941







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546454)
Time Spent: 2h 50m  (was: 2h 40m)

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: RBF_ Improving Router Connection Management_v2.pdf, RBF_ 
> Improving Router Connection Management_v3.pdf, RBF_ Router Connection 
> Management.pdf
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> We have seen a high number of connections from the Router to namenodes, leaving 
> namenodes unstable.
> This ticket tries to reduce connections through some changes. Please take 
> a look at the design and leave comments. 
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15757) RBF: Improving Router Connection Management

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15757?focusedWorklogId=546422=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546422
 ]

ASF GitHub Bot logged work on HDFS-15757:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 01:03
Start Date: 03/Feb/21 01:03
Worklog Time Spent: 10m 
  Work Description: goiri commented on a change in pull request #2651:
URL: https://github.com/apache/hadoop/pull/2651#discussion_r568315552



##
File path: 
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/ConnectionContext.java
##
@@ -42,7 +44,10 @@
   private int numThreads = 0;
   /** If the connection is closed. */
   private boolean closed = false;
-
+  /** Last timestamp the connection was active. */
+  private long lastActiveTs = 0;
+  /** The connection's active status would expire after this window. */
+  private long activeWindow = TimeUnit.SECONDS.toMillis(30);

Review comment:
   I agree, and it should be called ACTIVE_WINDOW_TIME.

##
File path: 
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/ConnectionContext.java
##
@@ -57,6 +62,17 @@ public synchronized boolean isActive() {
 return this.numThreads > 0;
   }
 
+  /**
+   * Check if the connection is/was active recently.
+   *
+   * @return True if the connection is active or
+   * was active in the past period of time.
+   */
+  public synchronized boolean isActiveRecently() {
+return isActive() ||

Review comment:
   If we go timestamp-based, do we even need to check for 
isActive(), or is the timestamp comparison enough?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546422)
Time Spent: 2h 40m  (was: 2.5h)

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: RBF_ Improving Router Connection Management_v2.pdf, RBF_ 
> Improving Router Connection Management_v3.pdf, RBF_ Router Connection 
> Management.pdf
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> We have seen a high number of connections from the Router to namenodes, leaving 
> namenodes unstable.
> This ticket tries to reduce connections through some changes. Please take 
> a look at the design and leave comments. 
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15811) completeFile should log final file size

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15811?focusedWorklogId=546376=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546376
 ]

ASF GitHub Bot logged work on HDFS-15811:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 00:58
Start Date: 03/Feb/21 00:58
Worklog Time Spent: 10m 
  Work Description: sunchao commented on a change in pull request #2670:
URL: https://github.com/apache/hadoop/pull/2670#discussion_r568348691



##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
##
@@ -3146,23 +3148,30 @@ INodeFile checkLease(INodesInPath iip, String holder, 
long fileId)
   boolean completeFile(final String src, String holder,
ExtendedBlock last, long fileId)
 throws IOException {
+final String operationName = CMD_COMPLETE_FILE;
 boolean success = false;
+FileStatus stat = null;
 checkOperation(OperationCategory.WRITE);
 final FSPermissionChecker pc = getPermissionChecker();
 FSPermissionChecker.setOperationType(null);
 writeLock();
 try {
   checkOperation(OperationCategory.WRITE);
   checkNameNodeSafeMode("Cannot complete file " + src);
-  success = FSDirWriteFileOp.completeFile(this, pc, src, holder, last,
+  INodesInPath iip = dir.resolvePath(pc, src, fileId);
+  success = FSDirWriteFileOp.completeFile(this, iip, src, holder, last,
   fileId);
+  if (success) {
+stat = dir.getAuditFileInfo(iip);
+  }
 } finally {
-  writeUnlock("completeFile");
+  writeUnlock(operationName);

Review comment:
   why change this?

##
File path: 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java
##
@@ -8667,6 +8676,9 @@ public void logAuditEvent(boolean succeeded, String 
userName,
 }
 sb.append("\t").append("proto=")
 .append(Server.getProtocol());
+if (cmd.equals(CMD_COMPLETE_FILE) && status != null) {
+  sb.append("\t").append("fileSize=").append(status.getLen());

Review comment:
   we shouldn't only add a new field for this particular command, as it 
will probably break lots of applications parsing the audit log. See 
https://issues.apache.org/jira/browse/HDFS-9184 for some more context. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546376)
Time Spent: 0.5h  (was: 20m)

> completeFile should log final file size
> ---
>
> Key: HDFS-15811
> URL: https://issues.apache.org/jira/browse/HDFS-15811
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Zehao Chen
>Assignee: Zehao Chen
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Jobs, particularly Hive queries by non-headless users, can create an 
> excessive number of files (many hundreds of thousands). A single user's query 
> can generate a sustained burst of 60-80% of all creates for tens of minutes 
> or more and impact overall cluster performance. Adding the file size to the 
> log line allows us to identify excessively small or large files.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15808) Add metrics for FSNamesystem read/write lock warnings

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15808?focusedWorklogId=546358=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546358
 ]

ASF GitHub Bot logged work on HDFS-15808:
-

Author: ASF GitHub Bot
Created on: 03/Feb/21 00:56
Start Date: 03/Feb/21 00:56
Worklog Time Spent: 10m 
  Work Description: tomscut edited a comment on pull request #2668:
URL: https://github.com/apache/hadoop/pull/2668#issuecomment-771457006


   Failed junit tests 
   hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints 
   hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks 
   
   Sorry. I didn't update those two unit tests, and they worked fine locally.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546358)
Time Spent: 50m  (was: 40m)

> Add metrics for FSNamesystem read/write lock warnings
> -
>
> Key: HDFS-15808
> URL: https://issues.apache.org/jira/browse/HDFS-15808
> Project: Hadoop HDFS
>  Issue Type: Wish
>  Components: hdfs
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: hdfs, lock, metrics, pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> To monitor how often read/write locks exceed thresholds, we can add two 
> metrics (ReadLockWarning/WriteLockWarning), which are exposed in JMX.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13148) Unit test for EZ with KMS and Federation

2021-02-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277555#comment-17277555
 ] 

Hadoop QA commented on HDFS-13148:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
50s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:green}+1{color} | {color:green} {color} | {color:green}  0m  0s{color} 
| {color:green}test4tests{color} | {color:green} The patch appears to include 1 
new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 
49s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
42s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
25s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
12s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
32s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
19m 58s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
1s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
39s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  3m 
49s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs 
config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
46s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
29s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
31s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  1m 31s{color} 
| 
{color:red}https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/451/artifact/out/diff-compile-javac-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04.txt{color}
 | {color:red} 
hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 with 
JDK Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 generated 1 new + 594 unchanged - 0 
fixed = 595 total (was 594) {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
18s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red}  1m 18s{color} 
| 
{color:red}https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/451/artifact/out/diff-compile-javac-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01.txt{color}
 | {color:red} 
hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01
 with JDK Private Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 generated 1 new 
+ 578 unchanged - 0 fixed = 579 total (was 578) {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 2s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
27s{color} | {color:green}{color} | {color:green} the patch passed {color} |

[jira] [Work logged] (HDFS-15757) RBF: Improving Router Connection Management

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15757?focusedWorklogId=546266=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546266
 ]

ASF GitHub Bot logged work on HDFS-15757:
-

Author: ASF GitHub Bot
Created on: 02/Feb/21 21:24
Start Date: 02/Feb/21 21:24
Worklog Time Spent: 10m 
  Work Description: hadoop-yetus commented on pull request #2651:
URL: https://github.com/apache/hadoop/pull/2651#issuecomment-771998442


   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   6m 45s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |   |   0m  0s | [test4tests](test4tests) |  The patch 
appears to include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  40m 40s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 50s |  |  trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  compile  |   0m 46s |  |  trunk passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01  |
   | +1 :green_heart: |  checkstyle  |   0m 34s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 53s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  19m 42s |  |  branch has no errors 
when building and testing our client artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 48s |  |  trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   1m  6s |  |  trunk passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01  |
   | +0 :ok: |  spotbugs  |   1m 48s |  |  Used deprecated FindBugs config; 
considering switching to SpotBugs.  |
   | +1 :green_heart: |  findbugs  |   1m 42s |  |  trunk passed  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 44s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 45s |  |  the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javac  |   0m 45s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 36s |  |  the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01  |
   | +1 :green_heart: |  javac  |   0m 36s |  |  the patch passed  |
   | +1 :green_heart: |  checkstyle  |   0m 17s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   0m 36s |  |  the patch passed  |
   | +1 :green_heart: |  whitespace  |   0m  0s |  |  The patch has no 
whitespace issues.  |
   | +1 :green_heart: |  shadedclient  |  17m  1s |  |  patch has no errors 
when building and testing our client artifacts.  |
   | +1 :green_heart: |  javadoc  |   0m 39s |  |  the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04  |
   | +1 :green_heart: |  javadoc  |   0m 57s |  |  the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01  |
   | +1 :green_heart: |  findbugs  |   1m 36s |  |  the patch passed  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  20m  1s |  |  hadoop-hdfs-rbf in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 33s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 120m 39s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2651/6/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/2651 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient findbugs checkstyle |
   | uname | Linux b7d09608d928 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 
23:41:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / f37bf651993 |
   | Default Java | Private Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2651/6/testReport/ |
   | Max. process+thread count | 2389 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: 
hadoop-hdfs-project/hadoop-hdfs-rbf |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-2651/6/console |
   | versions | git=2.25.1 maven=3.6.3 findbugs=4.0.6 |
   | Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org |

[jira] [Updated] (HDFS-15799) Make DisallowedDatanodeException terse

2021-02-02 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated HDFS-15799:
---
Affects Version/s: 3.4.0
   2.10.1

> Make DisallowedDatanodeException terse
> --
>
> Key: HDFS-15799
> URL: https://issues.apache.org/jira/browse/HDFS-15799
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 2.10.1, 3.4.0
>Reporter: Richard
>Assignee: Richard
>Priority: Minor
> Attachments: HDFS-15799.001.patch
>
>
> When org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException is 
> thrown back to a datanode, the namenode logs a full stack trace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15799) Make DisallowedDatanodeException terse

2021-02-02 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated HDFS-15799:
---
Target Version/s:   (was: 3.3.0)

> Make DisallowedDatanodeException terse
> --
>
> Key: HDFS-15799
> URL: https://issues.apache.org/jira/browse/HDFS-15799
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Richard
>Assignee: Richard
>Priority: Minor
> Attachments: HDFS-15799.001.patch
>
>
> When org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException is 
> thrown back to a datanode, the namenode logs a full stack trace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15799) Make DisallowedDatanodeException terse

2021-02-02 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated HDFS-15799:
---
Status: Patch Available  (was: Open)

> Make DisallowedDatanodeException terse
> --
>
> Key: HDFS-15799
> URL: https://issues.apache.org/jira/browse/HDFS-15799
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Richard
>Assignee: Richard
>Priority: Minor
> Attachments: HDFS-15799.001.patch
>
>
> When org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException is 
> thrown back to a datanode, the namenode logs a full stack trace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15757) RBF: Improving Router Connection Management

2021-02-02 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277429#comment-17277429
 ] 

Íñigo Goiri commented on HDFS-15757:


Thank you for the updated document with the data.
I think these results justify this improvement.
I'm fine going forward with this.

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: RBF_ Improving Router Connection Management_v2.pdf, RBF_ 
> Improving Router Connection Management_v3.pdf, RBF_ Router Connection 
> Management.pdf
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> We have seen a high number of connections from the Router to namenodes, leaving 
> namenodes unstable.
> This ticket tries to reduce connections through some changes. Please take 
> a look at the design and leave comments. 
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15757) RBF: Improving Router Connection Management

2021-02-02 Thread Fengnan Li (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277412#comment-17277412
 ] 

Fengnan Li commented on HDFS-15757:
---

[~elgoiri] [~hexiaoqiao] 
Addressed the comments in the PR. More importantly, it would be great if you 
could try this in your own setups, since this is essentially an optimization 
that only a metrics improvement can justify.

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: RBF_ Improving Router Connection Management_v2.pdf, RBF_ 
> Improving Router Connection Management_v3.pdf, RBF_ Router Connection 
> Management.pdf
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> We have seen a high number of connections from the Router to namenodes, leaving 
> namenodes unstable.
> This ticket tries to reduce connections through some changes. Please take 
> a look at the design and leave comments. 
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15757) RBF: Improving Router Connection Management

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15757?focusedWorklogId=546221=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546221
 ]

ASF GitHub Bot logged work on HDFS-15757:
-

Author: ASF GitHub Bot
Created on: 02/Feb/21 19:21
Start Date: 02/Feb/21 19:21
Worklog Time Spent: 10m 
  Work Description: fengnanli commented on a change in pull request #2651:
URL: https://github.com/apache/hadoop/pull/2651#discussion_r568868167



##
File path: 
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/ConnectionContext.java
##
@@ -57,6 +62,17 @@ public synchronized boolean isActive() {
 return this.numThreads > 0;
   }
 
+  /**
+   * Check if the connection is/was active recently.
+   *
+   * @return True if the connection is active or
+   * was active in the past period of time.
+   */
+  public synchronized boolean isActiveRecently() {
+return isActive() ||

Review comment:
   That can be removed, since the time-window calculation covers the active 
case. Updated.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546221)
Time Spent: 2h 20m  (was: 2h 10m)

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: RBF_ Improving Router Connection Management_v2.pdf, RBF_ 
> Improving Router Connection Management_v3.pdf, RBF_ Router Connection 
> Management.pdf
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> We have seen a high number of connections from the Router to namenodes, leaving 
> namenodes unstable.
> This ticket tries to reduce connections through some changes. Please take 
> a look at the design and leave comments. 
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15757) RBF: Improving Router Connection Management

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15757?focusedWorklogId=546220=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546220
 ]

ASF GitHub Bot logged work on HDFS-15757:
-

Author: ASF GitHub Bot
Created on: 02/Feb/21 19:20
Start Date: 02/Feb/21 19:20
Worklog Time Spent: 10m 
  Work Description: fengnanli commented on a change in pull request #2651:
URL: https://github.com/apache/hadoop/pull/2651#discussion_r568866912



##
File path: 
hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/ConnectionPool.java
##
@@ -252,19 +252,23 @@ public synchronized void addConnection(ConnectionContext 
conn) {
*/
   public synchronized List<ConnectionContext> removeConnections(int num) {
     List<ConnectionContext> removed = new LinkedList<>();
-
-    // Remove and close the last connection
-    List<ConnectionContext> tmpConnections = new ArrayList<>();
-    for (int i = 0; i < this.connections.size(); i++) {
...
+    if (this.connections.size() > this.minSize) {
+      int targetCount = Math.min(num, this.connections.size() - this.minSize);

Review comment:
   I don't think it can be negative here, since the only place connections 
shrink is in this function, at the swap with the tmpConnections. The 
other place where this var gets assigned is in the creation path, and that can 
only increase the value.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546220)
Time Spent: 2h 10m  (was: 2h)

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: RBF_ Improving Router Connection Management_v2.pdf, RBF_ 
> Improving Router Connection Management_v3.pdf, RBF_ Router Connection 
> Management.pdf
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We have seen a high number of connections from the Router to namenodes, leaving 
> namenodes unstable.
> This ticket tries to reduce connections through some changes. Please take 
> a look at the design and leave comments. 
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDFS-15757) RBF: Improving Router Connection Management

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15757?focusedWorklogId=546216=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-546216
 ]

ASF GitHub Bot logged work on HDFS-15757:
-

Author: ASF GitHub Bot
Created on: 02/Feb/21 19:14
Start Date: 02/Feb/21 19:14
Worklog Time Spent: 10m 
  Work Description: fengnanli commented on pull request #2651:
URL: https://github.com/apache/hadoop/pull/2651#issuecomment-771903717


   > Thanks @fengnanli for your work here. Left some nit comments inline.
   > Sorry, I do not get why the change can reduce connections here after reviewing 
the changes; is it related to "Be greedy here to close as many connections as 
possible in one shot"? It would be helpful if we added some javadocs explicitly. 
Thanks.
   
   Thanks for the review @Hexiaoqiao. I put the reasoning behind this change in 
the design doc in the original JIRA ticket. In short, I implemented synchronous 
connection closing + better connection picking + greedy connection closing. I 
have seen a 50% reduction in the number of connections and better ProxyTime. It 
would be great if you can try it in your setup as well.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 546216)
Time Spent: 2h  (was: 1h 50m)

> RBF: Improving Router Connection Management
> ---
>
> Key: HDFS-15757
> URL: https://issues.apache.org/jira/browse/HDFS-15757
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Reporter: Fengnan Li
>Assignee: Fengnan Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: RBF_ Improving Router Connection Management_v2.pdf, RBF_ 
> Improving Router Connection Management_v3.pdf, RBF_ Router Connection 
> Management.pdf
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> We have seen a high number of connections from the Router to namenodes, leaving 
> namenodes unstable.
> This ticket tries to reduce connections through some changes. Please take 
> a look at the design and leave comments. 
> Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15813) DataStreamer: keep sending heartbeat packets while streaming

2021-02-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277389#comment-17277389
 ] 

Hadoop QA commented on HDFS-15813:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
41s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red}{color} | {color:red} The patch doesn't appear to 
include any new or modified tests. Please justify why no new tests are needed 
for this patch. Also please list what manual steps were performed to verify 
this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
 6s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
59s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
51s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
30s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
56s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 28s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
38s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
37s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  2m 
23s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs 
config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
21s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 
47s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
50s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
50s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
43s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
43s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
20s{color} | {color:green}{color} | {color:green} 
hadoop-hdfs-project/hadoop-hdfs-client: The patch generated 0 new + 73 
unchanged - 1 fixed = 73 total (was 74) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 
45s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 38s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
36s{color} | {color:green}{color} | 

[jira] [Updated] (HDFS-15799) Make DisallowedDatanodeException terse

2021-02-02 Thread Richard (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard updated HDFS-15799:
---
Attachment: HDFS-15799.001.patch

> Make DisallowedDatanodeException terse
> --
>
> Key: HDFS-15799
> URL: https://issues.apache.org/jira/browse/HDFS-15799
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Reporter: Richard
>Assignee: Richard
>Priority: Minor
> Attachments: HDFS-15799.001.patch
>
>
> When org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException is 
> thrown back to a datanode, the namenode logs a full stack trace.
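The patch is not shown in this message. For context, a minimal sketch of the
kind of change such a patch typically makes, assuming (as an illustration
only, not confirmed by the attachment) that the registration happens in
NameNodeRpcServer: Hadoop's ipc.Server logs "terse" exceptions with their
message only, skipping the stack trace.

{code:java}
// Sketch only, assumed placement in NameNodeRpcServer's constructor:
// registering the exception class as terse makes the IPC server log the
// message without the full stack trace.
clientRpcServer.addTerseExceptions(
    org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException.class);
if (serviceRpcServer != null) {
  serviceRpcServer.addTerseExceptions(
      org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException.class);
}
{code}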



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-02-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277354#comment-17277354
 ] 

Hadoop QA commented on HDFS-15798:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
45s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red}{color} | {color:red} The patch doesn't appear to 
include any new or modified tests. Please justify why no new tests are needed 
for this patch. Also please list what manual steps were performed to verify 
this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 
 8s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
25s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
19s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
59s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
27s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m 21s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
57s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
39s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  3m 
22s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs 
config; considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
19s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
18s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
20s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
20s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
6s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
6s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
57s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
15s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m  2s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
54s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| 

[jira] [Updated] (HDFS-15813) DataStreamer: keep sending heartbeat packets while streaming

2021-02-02 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated HDFS-15813:
---
Attachment: HDFS-15813.002.patch

> DataStreamer: keep sending heartbeat packets while streaming
> 
>
> Key: HDFS-15813
> URL: https://issues.apache.org/jira/browse/HDFS-15813
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: HDFS-15813.001.patch, HDFS-15813.002.patch
>
>
> In response to [HDFS-5032], [~daryn] made a change to our internal code to 
> ensure that heartbeats continue during data streaming, even in the face of a 
> slow disk.
> As [~kihwal] noted, the absence of heartbeats during flush will be fixed in a 
> separate jira. It doesn't look like this change was ever pushed back to 
> Apache, so I am providing it here.
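The attached patch itself is not reproduced in this thread. As a rough
illustration of the idea only, the sketch below uses assumed names
(heartbeatIntervalMs, createHeartbeatPacket(), blockStream), loosely modeled
on the DFSClient DataStreamer; it is not [~daryn]'s actual change:

{code:java}
// Sketch only: track when the last packet went out, and inject a zero-data
// heartbeat packet whenever streaming stalls for too long, so a slow disk in
// the pipeline cannot starve downstream nodes of heartbeats.
private long lastPacketSentTimeMs;

private void sendHeartbeatIfNeeded() throws IOException {
  long now = Time.monotonicNow();
  if (now - lastPacketSentTimeMs >= heartbeatIntervalMs / 2) {
    DFSPacket heartbeat = createHeartbeatPacket(); // carries no user data
    heartbeat.writeTo(blockStream);                // keeps the pipeline alive
    blockStream.flush();
    lastPacketSentTimeMs = now;
  }
}

// Called from the streaming loop, e.g.:
//   while (streaming) {
//     sendHeartbeatIfNeeded();
//     writeNextDataPacket();   // may block on a slow disk
//   }
{code}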



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15813) DataStreamer: keep sending heartbeat packets while streaming

2021-02-02 Thread Jim Brennan (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277309#comment-17277309
 ] 

Jim Brennan commented on HDFS-15813:


Looks like I need to update the patch.


> DataStreamer: keep sending heartbeat packets while streaming
> 
>
> Key: HDFS-15813
> URL: https://issues.apache.org/jira/browse/HDFS-15813
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: HDFS-15813.001.patch
>
>
> In response to [HDFS-5032], [~daryn] made a change to our internal code to 
> ensure that heartbeats continue during data streaming, even in the face of a 
> slow disk.
> As [~kihwal] noted, the absence of heartbeats during flush will be fixed in a 
> separate jira. It doesn't look like this change was ever pushed back to 
> Apache, so I am providing it here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15813) DataStreamer: keep sending heartbeat packets while streaming

2021-02-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277303#comment-17277303
 ] 

Hadoop QA commented on HDFS-15813:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  
0s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m 11s{color} 
| {color:red}{color} | {color:red} HDFS-15813 does not apply to trunk. Rebase 
required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for 
help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-15813 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/13019865/HDFS-15813.001.patch |
| Console output | 
https://ci-hadoop.apache.org/job/PreCommit-HDFS-Build/449/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.13.0-SNAPSHOT https://yetus.apache.org |


This message was automatically generated.



> DataStreamer: keep sending heartbeat packets while streaming
> 
>
> Key: HDFS-15813
> URL: https://issues.apache.org/jira/browse/HDFS-15813
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: HDFS-15813.001.patch
>
>
> In response to [HDFS-5032], [~daryn] made a change to our internal code to 
> ensure that heartbeats continue during data streaming, even in the face of a 
> slow disk.
> As [~kihwal] noted, the absence of heartbeats during flush will be fixed in a 
> separate jira. It doesn't look like this change was ever pushed back to 
> Apache, so I am providing it here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15813) DataStreamer: keep sending heartbeat packets while streaming

2021-02-02 Thread Jim Brennan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Brennan updated HDFS-15813:
---
Attachment: HDFS-15813.001.patch
Status: Patch Available  (was: Open)

Submitting patch - we have been running with this change in production for 
years.


> DataStreamer: keep sending heartbeat packets while streaming
> 
>
> Key: HDFS-15813
> URL: https://issues.apache.org/jira/browse/HDFS-15813
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0
>Reporter: Jim Brennan
>Assignee: Jim Brennan
>Priority: Major
> Attachments: HDFS-15813.001.patch
>
>
> In response to [HDFS-5032], [~daryn] made a change to our internal code to 
> ensure that heartbeats continue during data streaming, even in the face of a 
> slow disk.
> As [~kihwal] noted, the absence of heartbeats during flush will be fixed in a 
> separate jira. It doesn't look like this change was ever pushed back to 
> Apache, so I am providing it here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15813) DataStreamer: keep sending heartbeat packets while streaming

2021-02-02 Thread Jim Brennan (Jira)
Jim Brennan created HDFS-15813:
--

 Summary: DataStreamer: keep sending heartbeat packets while 
streaming
 Key: HDFS-15813
 URL: https://issues.apache.org/jira/browse/HDFS-15813
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 3.4.0
Reporter: Jim Brennan
Assignee: Jim Brennan


In response to [HDFS-5032], [~daryn] made a change to our internal code to 
ensure that heartbeats continue during data streaming, even in the face of a 
slow disk.
As [~kihwal] noted, the absence of heartbeats during flush will be fixed in a 
separate jira. It doesn't look like this change was ever pushed back to 
Apache, so I am providing it here.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15792) ClasscastException while loading FSImage

2021-02-02 Thread Renukaprasad C (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277279#comment-17277279
 ] 

Renukaprasad C edited comment on HDFS-15792 at 2/2/21, 4:51 PM:


[~hexiaoqiao] I have added the patch for branch-2.10. Included checkstyle fixes 
as well. Please review. Thank you.


was (Author: prasad-acit):
[~hexiaoqiao] I have added the patch for branch-2.10. Included checkstyle 
issues as well. Please review. Thank you.

> ClasscastException while loading FSImage
> 
>
> Key: HDFS-15792
> URL: https://issues.apache.org/jira/browse/HDFS-15792
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nn
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Major
> Fix For: 3.3.1, 3.4.0
>
> Attachments: HDFS-15792-branch-2.10.001.patch, HDFS-15792.001.patch, 
> HDFS-15792.002.patch, HDFS-15792.003.patch, HDFS-15792.004.patch, 
> HDFS-15792.005.patch, HDFS-15792.addendum.001.patch, 
> image-2021-01-27-12-00-34-846.png
>
>
> FSImage loading has failed with ClasscastException - 
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode.
> This is a usage issue with HashMap in concurrent scenarios.
> The same issue has been reported against Java and closed as a usage issue - 
> https://bugs.openjdk.java.net/browse/JDK-8173671
> 2020-12-28 11:36:26,127 | ERROR | main | An exception occurred when loading 
> INODE from fsiamge. | FSImageFormatProtobuf.java:442
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode
>   at java.util.HashMap$TreeNode.moveRootToFront(HashMap.java:1835)
>   at java.util.HashMap$TreeNode.treeify(HashMap.java:1951)
>   at java.util.HashMap.treeifyBin(HashMap.java:772)
>   at java.util.HashMap.putVal(HashMap.java:644)
>   at java.util.HashMap.put(HashMap.java:612)
>   at 
> org.apache.hadoop.hdfs.util.ReferenceCountMap.put(ReferenceCountMap.java:53)
>   at 
> org.apache.hadoop.hdfs.server.namenode.AclStorage.addAclFeature(AclStorage.java:391)
>   at 
> org.apache.hadoop.hdfs.server.namenode.INodeWithAdditionalFields.addAclFeature(INodeWithAdditionalFields.java:349)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectory(FSImageFormatPBINode.java:225)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINode(FSImageFormatPBINode.java:406)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.readPBINodes(FSImageFormatPBINode.java:367)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeSection(FSImageFormatPBINode.java:342)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader$2.call(FSImageFormatProtobuf.java:469)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2020-12-28 11:36:26,130 | ERROR | main | Failed to load image from 
> FSImageFile(file=/srv/BigData/namenode/current/fsimage_00198227480, 
> cpktTxId=00198227480) | FSImage.java:738
> java.io.IOException: java.lang.ClassCastException: java.util.HashMap$Node 
> cannot be cast to java.util.HashMap$TreeNode
>   at 
> org.apache.hadoop.io.MultipleIOException$Builder.add(MultipleIOException.java:68)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.runLoaderTasks(FSImageFormatProtobuf.java:444)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:360)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:263)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:227)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:971)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:955)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:820)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:733)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1113)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:730)
>   at 
> 

[jira] [Commented] (HDFS-15792) ClasscastException while loading FSImage

2021-02-02 Thread Renukaprasad C (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277279#comment-17277279
 ] 

Renukaprasad C commented on HDFS-15792:
---

[~hexiaoqiao] I have added the patch for branch-2.10. Included checkstyle fixes 
as well. Please review. Thank you.

> ClasscastException while loading FSImage
> 
>
> Key: HDFS-15792
> URL: https://issues.apache.org/jira/browse/HDFS-15792
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nn
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Major
> Fix For: 3.3.1, 3.4.0
>
> Attachments: HDFS-15792-branch-2.10.001.patch, HDFS-15792.001.patch, 
> HDFS-15792.002.patch, HDFS-15792.003.patch, HDFS-15792.004.patch, 
> HDFS-15792.005.patch, HDFS-15792.addendum.001.patch, 
> image-2021-01-27-12-00-34-846.png
>
>
> FSImage loading has failed with ClasscastException - 
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode.
> This is a usage issue with HashMap in concurrent scenarios.
> The same issue has been reported against Java and closed as a usage issue - 
> https://bugs.openjdk.java.net/browse/JDK-8173671
> 2020-12-28 11:36:26,127 | ERROR | main | An exception occurred when loading 
> INODE from fsiamge. | FSImageFormatProtobuf.java:442
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode
>   at java.util.HashMap$TreeNode.moveRootToFront(HashMap.java:1835)
>   at java.util.HashMap$TreeNode.treeify(HashMap.java:1951)
>   at java.util.HashMap.treeifyBin(HashMap.java:772)
>   at java.util.HashMap.putVal(HashMap.java:644)
>   at java.util.HashMap.put(HashMap.java:612)
>   at 
> org.apache.hadoop.hdfs.util.ReferenceCountMap.put(ReferenceCountMap.java:53)
>   at 
> org.apache.hadoop.hdfs.server.namenode.AclStorage.addAclFeature(AclStorage.java:391)
>   at 
> org.apache.hadoop.hdfs.server.namenode.INodeWithAdditionalFields.addAclFeature(INodeWithAdditionalFields.java:349)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectory(FSImageFormatPBINode.java:225)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINode(FSImageFormatPBINode.java:406)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.readPBINodes(FSImageFormatPBINode.java:367)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeSection(FSImageFormatPBINode.java:342)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader$2.call(FSImageFormatProtobuf.java:469)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2020-12-28 11:36:26,130 | ERROR | main | Failed to load image from 
> FSImageFile(file=/srv/BigData/namenode/current/fsimage_00198227480, 
> cpktTxId=00198227480) | FSImage.java:738
> java.io.IOException: java.lang.ClassCastException: java.util.HashMap$Node 
> cannot be cast to java.util.HashMap$TreeNode
>   at 
> org.apache.hadoop.io.MultipleIOException$Builder.add(MultipleIOException.java:68)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.runLoaderTasks(FSImageFormatProtobuf.java:444)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:360)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:263)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:227)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:971)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:955)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:820)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:733)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1113)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:730)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:648)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:710)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:953)
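For readers unfamiliar with the failure mode referenced above (JDK-8173671),
here is a minimal, self-contained demo, hypothetical and not part of any
patch, that can reproduce the same ClassCastException by racing two writers
on a plain java.util.HashMap:

{code:java}
import java.util.HashMap;
import java.util.Map;

public class HashMapRaceDemo {
  public static void main(String[] args) throws InterruptedException {
    // HashMap is not thread-safe; concurrent puts can corrupt the internal
    // tree bins that are built once a bucket is treeified.
    final Map<Integer, Integer> map = new HashMap<>();
    Runnable writer = () -> {
      for (int i = 0; i < 1_000_000; i++) {
        map.put(i, i);
      }
    };
    Thread t1 = new Thread(writer);
    Thread t2 = new Thread(writer);
    t1.start();
    t2.start();
    t1.join();
    t2.join();
    // Nondeterministic: runs may finish cleanly, lose entries, or throw
    // ClassCastException: HashMap$Node cannot be cast to HashMap$TreeNode,
    // as in the fsimage loading stack trace quoted above.
  }
}
{code}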
> 

[jira] [Updated] (HDFS-15792) ClasscastException while loading FSImage

2021-02-02 Thread Renukaprasad C (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renukaprasad C updated HDFS-15792:
--
Attachment: HDFS-15792-branch-2.10.001.patch

> ClasscastException while loading FSImage
> 
>
> Key: HDFS-15792
> URL: https://issues.apache.org/jira/browse/HDFS-15792
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nn
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Major
> Fix For: 3.3.1, 3.4.0
>
> Attachments: HDFS-15792-branch-2.10.001.patch, HDFS-15792.001.patch, 
> HDFS-15792.002.patch, HDFS-15792.003.patch, HDFS-15792.004.patch, 
> HDFS-15792.005.patch, HDFS-15792.addendum.001.patch, 
> image-2021-01-27-12-00-34-846.png
>
>
> FSImage loading has failed with ClasscastException - 
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode.
> This is a usage issue with HashMap in concurrent scenarios.
> The same issue has been reported against Java and closed as a usage issue - 
> https://bugs.openjdk.java.net/browse/JDK-8173671
> 2020-12-28 11:36:26,127 | ERROR | main | An exception occurred when loading 
> INODE from fsiamge. | FSImageFormatProtobuf.java:442
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode
>   at java.util.HashMap$TreeNode.moveRootToFront(HashMap.java:1835)
>   at java.util.HashMap$TreeNode.treeify(HashMap.java:1951)
>   at java.util.HashMap.treeifyBin(HashMap.java:772)
>   at java.util.HashMap.putVal(HashMap.java:644)
>   at java.util.HashMap.put(HashMap.java:612)
>   at 
> org.apache.hadoop.hdfs.util.ReferenceCountMap.put(ReferenceCountMap.java:53)
>   at 
> org.apache.hadoop.hdfs.server.namenode.AclStorage.addAclFeature(AclStorage.java:391)
>   at 
> org.apache.hadoop.hdfs.server.namenode.INodeWithAdditionalFields.addAclFeature(INodeWithAdditionalFields.java:349)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectory(FSImageFormatPBINode.java:225)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINode(FSImageFormatPBINode.java:406)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.readPBINodes(FSImageFormatPBINode.java:367)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeSection(FSImageFormatPBINode.java:342)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader$2.call(FSImageFormatProtobuf.java:469)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2020-12-28 11:36:26,130 | ERROR | main | Failed to load image from 
> FSImageFile(file=/srv/BigData/namenode/current/fsimage_00198227480, 
> cpktTxId=00198227480) | FSImage.java:738
> java.io.IOException: java.lang.ClassCastException: java.util.HashMap$Node 
> cannot be cast to java.util.HashMap$TreeNode
>   at 
> org.apache.hadoop.io.MultipleIOException$Builder.add(MultipleIOException.java:68)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.runLoaderTasks(FSImageFormatProtobuf.java:444)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:360)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:263)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:227)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:971)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:955)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:820)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:733)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1113)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:730)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:648)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:710)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:953)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:926)
>   at 
> 

[jira] [Commented] (HDFS-13609) [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via RPC

2021-02-02 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277250#comment-17277250
 ] 

Erik Krogen commented on HDFS-13609:


Hi [~xuzq_zander], thanks for taking a look.

{quote}
when onlyDurableTxns is false, maxAllowedTxns = responseCounts.get(0)
{quote}
Correct me if I'm wrong, but I think you have this backwards. If 
{{onlyDurableTxns}} is false, then {{maxAllowedTxns = highestTxnCount}}, which 
is {{responseCounts.get(2)}}.

It is when {{onlyDurableTxns}} is true that you get {{responseCounts.get(0)}}. 
In this case, we really do need to take the lowest of the returned values. 
Since we only got 3 responses, we can't make any assumptions about the other 2 
JNs, so just assume they have 0 txns. We only want to take txns that have 
landed on a quorum of JNs (thus becoming durable). Thus since we only got 3 
responses, we have to take the lowest txn that any of those responses are aware 
of. For example if we got back {{(5, 10, 20)}}, then only txns 1-5 are 
available on all 3 JNs we got responses from, so those are the only 
transactions we know are durable. Of course more _might_ be durable if they 
were persisted on the two JNs we didn't get responses from, but we don't know 
that.

Let me know if that clears things up.
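To make the arithmetic concrete, here is a small self-contained sketch with
assumed variable names (the real logic lives in the QuorumJournalManager
changes from this JIRA, not in this snippet):

{code:java}
import java.util.Arrays;
import java.util.List;

public class DurableTxnExample {
  public static void main(String[] args) {
    // 5 JNs, 3 responses received, sorted ascending.
    List<Long> responseCounts = Arrays.asList(5L, 10L, 20L);
    boolean onlyDurableTxns = true;

    long maxAllowedTxns;
    if (onlyDurableTxns) {
      // Durable means "present on a quorum (3 of 5)". The 2 silent JNs are
      // assumed to hold 0 txns, so the quorum-durable count is the smallest
      // value among the 3 responses: responseCounts.get(0).
      maxAllowedTxns = responseCounts.get(0);                          // 5
    } else {
      // Otherwise serve up to the highest count any responder reported.
      maxAllowedTxns = responseCounts.get(responseCounts.size() - 1);  // 20
    }
    System.out.println("can serve txns 1.." + maxAllowedTxns);         // 1..5
  }
}
{code}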

> [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via 
> RPC
> -
>
> Key: HDFS-13609
> URL: https://issues.apache.org/jira/browse/HDFS-13609
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, namenode
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: HDFS-12943, 3.3.0
>
> Attachments: HDFS-13609-HDFS-12943.000.patch, 
> HDFS-13609-HDFS-12943.001.patch, HDFS-13609-HDFS-12943.002.patch, 
> HDFS-13609-HDFS-12943.003.patch, HDFS-13609-HDFS-12943.004.patch
>
>
> See HDFS-13150 for the full design.
> This JIRA is targetted at the NameNode-side changes to enable tailing 
> in-progress edits via the RPC mechanism added in HDFS-13608. Most changes are 
> in the QuorumJournalManager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15812) after deleting data of hbase table hdfs size is not decreasing

2021-02-02 Thread Satya Gaurav (Jira)
Satya Gaurav created HDFS-15812:
---

 Summary: after deleting data of hbase table hdfs size is not 
decreasing
 Key: HDFS-15812
 URL: https://issues.apache.org/jira/browse/HDFS-15812
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Affects Versions: 2.0.2-alpha
 Environment: HDP 3.1.4.0-315

Hbase 2.0.2.3.1.4.0-315
Reporter: Satya Gaurav


I am deleting data from an HBase table; the rows are removed from the table, 
but the size of the HDFS directory is not decreasing. I even ran a major 
compaction, but the HDFS size still did not decrease. Is there any solution 
for this issue?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block

2021-02-02 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277174#comment-17277174
 ] 

Hadoop QA commented on HDFS-15779:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime ||  Logfile || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 24m 
34s{color} | {color:blue}{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} || ||
| {color:green}+1{color} | {color:green} dupname {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} No case conflicting files 
found. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green}{color} | {color:green} The patch does not contain any 
@author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red}{color} | {color:red} The patch doesn't appear to 
include any new or modified tests. Please justify why no new tests are needed 
for this patch. Also please list what manual steps were performed to verify 
this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
24s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
20s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
15s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
56s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
17s{color} | {color:green}{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
15m 37s{color} | {color:green}{color} | {color:green} branch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
51s{color} | {color:green}{color} | {color:green} trunk passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
27s{color} | {color:green}{color} | {color:green} trunk passed with JDK Private 
Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:blue}0{color} | {color:blue} spotbugs {color} | {color:blue}  3m  
1s{color} | {color:blue}{color} | {color:blue} Used deprecated FindBugs config; 
considering switching to SpotBugs. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
59s{color} | {color:green}{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} || ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
12s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
13s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
13s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
12s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Private Build-1.8.0_275-8u275-b01-0ubuntu1~20.04-b01 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
12s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
54s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
18s{color} | {color:green}{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green}{color} | {color:green} The patch has no whitespace 
issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
12m 56s{color} | {color:green}{color} | {color:green} patch has no errors when 
building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
48s{color} | {color:green}{color} | {color:green} the patch passed with JDK 
Ubuntu-11.0.9.1+1-Ubuntu-0ubuntu1.20.04 {color} |
| 

[jira] [Commented] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-02-02 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277120#comment-17277120
 ] 

huhaiyang commented on HDFS-15798:
--

Uploaded the v003 patch according to your suggestions. 
 

> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch, 
> HDFS-15798.003.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress call in 
> processErasureCodingTasks operates on an incorrect value;
>  as a result, the DN's XmitsInProgress can go negative, which affects how the 
> NN chooses pending tasks based on the ratio between the lengths of the 
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
> increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  task failed 
> decrement,  XmitsInProgress is decremented by the previous value
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
> initDecoderIfNecessary();
>...
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it is set to zero, we cannot measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete 
> decrement
> ...
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-02-02 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Attachment: HDFS-15798.003.patch

> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch, 
> HDFS-15798.003.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress call in 
> processErasureCodingTasks operates on an incorrect value;
>  as a result, the DN's XmitsInProgress can go negative, which affects how the 
> NN chooses pending tasks based on the ratio between the lengths of the 
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
> increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  task failed 
> decrement,  XmitsInProgress is decremented by the previous value
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
> initDecoderIfNecessary();
>...
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it is set to zero, we cannot measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete 
> decrement
> ...
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-02-02 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277075#comment-17277075
 ] 

huhaiyang commented on HDFS-15798:
--

[~ferhui] [~sodonnell] Thank you for your advice!

I think it makes sense. I will submit a new patch later.

> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress call in 
> processErasureCodingTasks operates on an incorrect value;
>  as a result, the DN's XmitsInProgress can go negative, which affects how the 
> NN chooses pending tasks based on the ratio between the lengths of the 
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
> increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  task failed 
> decrement,  XmitsInProgress is decremented by the previous value
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
> initDecoderIfNecessary();
>...
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it is set to zero, we cannot measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete 
> decrement
> ...
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15792) ClasscastException while loading FSImage

2021-02-02 Thread Renukaprasad C (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277069#comment-17277069
 ] 

Renukaprasad C commented on HDFS-15792:
---

Thanks [~hexiaoqiao].
Sure, I will create a patch for branch-2.10 and submit it.

> ClasscastException while loading FSImage
> 
>
> Key: HDFS-15792
> URL: https://issues.apache.org/jira/browse/HDFS-15792
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: nn
>Reporter: Renukaprasad C
>Assignee: Renukaprasad C
>Priority: Major
> Fix For: 3.3.1, 3.4.0
>
> Attachments: HDFS-15792.001.patch, HDFS-15792.002.patch, 
> HDFS-15792.003.patch, HDFS-15792.004.patch, HDFS-15792.005.patch, 
> HDFS-15792.addendum.001.patch, image-2021-01-27-12-00-34-846.png
>
>
> FSImage loading has failed with ClasscastException - 
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode.
> This is a usage issue with HashMap in concurrent scenarios.
> The same issue has been reported against Java and closed as a usage issue - 
> https://bugs.openjdk.java.net/browse/JDK-8173671
> 2020-12-28 11:36:26,127 | ERROR | main | An exception occurred when loading 
> INODE from fsiamge. | FSImageFormatProtobuf.java:442
> java.lang.ClassCastException: java.util.HashMap$Node cannot be cast to 
> java.util.HashMap$TreeNode
>   at java.util.HashMap$TreeNode.moveRootToFront(HashMap.java:1835)
>   at java.util.HashMap$TreeNode.treeify(HashMap.java:1951)
>   at java.util.HashMap.treeifyBin(HashMap.java:772)
>   at java.util.HashMap.putVal(HashMap.java:644)
>   at java.util.HashMap.put(HashMap.java:612)
>   at 
> org.apache.hadoop.hdfs.util.ReferenceCountMap.put(ReferenceCountMap.java:53)
>   at 
> org.apache.hadoop.hdfs.server.namenode.AclStorage.addAclFeature(AclStorage.java:391)
>   at 
> org.apache.hadoop.hdfs.server.namenode.INodeWithAdditionalFields.addAclFeature(INodeWithAdditionalFields.java:349)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectory(FSImageFormatPBINode.java:225)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINode(FSImageFormatPBINode.java:406)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.readPBINodes(FSImageFormatPBINode.java:367)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeSection(FSImageFormatPBINode.java:342)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader$2.call(FSImageFormatProtobuf.java:469)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2020-12-28 11:36:26,130 | ERROR | main | Failed to load image from 
> FSImageFile(file=/srv/BigData/namenode/current/fsimage_00198227480, 
> cpktTxId=00198227480) | FSImage.java:738
> java.io.IOException: java.lang.ClassCastException: java.util.HashMap$Node 
> cannot be cast to java.util.HashMap$TreeNode
>   at 
> org.apache.hadoop.io.MultipleIOException$Builder.add(MultipleIOException.java:68)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.runLoaderTasks(FSImageFormatProtobuf.java:444)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:360)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:263)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:227)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:971)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:955)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:820)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:733)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1113)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:730)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:648)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:710)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:953)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:926)
>   

[jira] [Commented] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-02-02 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277067#comment-17277067
 ] 

Stephen O'Donnell commented on HDFS-15798:
--

Yes I had wondered about that too. I think it makes sense to have:

{code}
...
stripedReconstructionPool.submit(task);
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted);
...
{code}

That way, if we have some issue submitting the task, the xmits will not get 
incremented at all.

I think we can also drop the change in the test.

[~haiyang Hu] Would you like to submit a new patch with these changes?



> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress call in 
> processErasureCodingTasks operates on an incorrect value;
>  as a result, the DN's XmitsInProgress can go negative, which affects how the 
> NN chooses pending tasks based on the ratio between the lengths of the 
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
> increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  task failed 
> decrement,  XmitsInProgress is decremented by the previous value
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
> initDecoderIfNecessary();
>...
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it is set to zero, we cannot measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete 
> decrement
> ...
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15803) EC: Remove unnecessary method (getWeight) in StripedReconstructionInfo

2021-02-02 Thread Hui Fei (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277060#comment-17277060
 ] 

Hui Fei commented on HDFS-15803:


+1

[~haiyang Hu] Thanks for the report and fix, [~sodonnell] thanks for the review!

Will commit tomorrow.

> EC: Remove unnecessary method (getWeight) in StripedReconstructionInfo 
> ---
>
> Key: HDFS-15803
> URL: https://issues.apache.org/jira/browse/HDFS-15803
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Trivial
> Attachments: HDFS-15803_001.patch
>
>
>  Removing the unused method from StripedReconstructionInfo
> {code:java}
> // StripedReconstructionInfo.java
> /**
>  * Return the weight of this EC reconstruction task.
>  *
>  * DN uses it to coordinate with NN to adjust the speed of scheduling the
>  * reconstructions tasks to this DN.
>  *
>  * @return the weight of this reconstruction task.
>  * @see HDFS-12044
>  */
> int getWeight() {
>   // See HDFS-12044. The weight of a RS(n, k) is calculated by the network
>   // connections it opens.
>   return sources.length + targets.length;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-15795) EC: Wrong checksum when reconstruction was failed by exception

2021-02-02 Thread Stephen O'Donnell (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen O'Donnell resolved HDFS-15795.
--
Resolution: Fixed

> EC: Wrong checksum when reconstruction was failed by exception
> --
>
> Key: HDFS-15795
> URL: https://issues.apache.org/jira/browse/HDFS-15795
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, ec, erasure-coding
>Reporter: Yushi Hayasaka
>Assignee: Yushi Hayasaka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> If the reconstruction task fails on StripedBlockChecksumReconstructor with an 
> exception, the resulting checksum is wrong, because it is calculated from all 
> blocks except the failed one.
> The cause is that the exception is caught in an inappropriate way; as a 
> result, the failed block is never fetched again.
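
A minimal sketch of the failure mode described above, using hypothetical types 
rather than the actual patch: if the reconstruction exception is swallowed, the 
digest of the blocks read so far escapes as if it covered the whole block group.
{code:java}
import java.io.IOException;

interface ChecksumReconstructor {
  void reconstruct() throws IOException; // may fail on a missing block
  byte[] getChecksum();                  // digest of whatever was read so far
}

class ChecksumExample {
  // Buggy shape: wrapping r.reconstruct() in try/catch, logging, and then
  // returning r.getChecksum() yields a digest that silently omits the
  // failed block. Letting the exception propagate avoids that:
  static byte[] checksumOrFail(ChecksumReconstructor r) throws IOException {
    r.reconstruct();        // propagate any failure to the caller...
    return r.getChecksum(); // ...so this line is reached only on full success
  }
}
{code}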






[jira] [Commented] (HDFS-15795) EC: Wrong checksum when reconstruction was failed by exception

2021-02-02 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277035#comment-17277035
 ] 

Stephen O'Donnell commented on HDFS-15795:
--

Committed to trunk on GitHub, and it cherry-picked cleanly down to branch-3.1.

Thanks for the contribution [~yhaya]. This was a good find.

> EC: Wrong checksum when reconstruction was failed by exception
> --
>
> Key: HDFS-15795
> URL: https://issues.apache.org/jira/browse/HDFS-15795
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, ec, erasure-coding
>Reporter: Yushi Hayasaka
>Assignee: Yushi Hayasaka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> If the reconstruction task fails on StripedBlockChecksumReconstructor with an 
> exception, the resulting checksum is wrong, because it is calculated from all 
> blocks except the failed one.
> The cause is that the exception is caught in an inappropriate way; as a 
> result, the failed block is never fetched again.






[jira] [Updated] (HDFS-15795) EC: Wrong checksum when reconstruction was failed by exception

2021-02-02 Thread Stephen O'Donnell (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen O'Donnell updated HDFS-15795:
-
Fix Version/s: 3.2.3
   3.1.5
   3.4.0
   3.3.1

> EC: Wrong checksum when reconstruction was failed by exception
> --
>
> Key: HDFS-15795
> URL: https://issues.apache.org/jira/browse/HDFS-15795
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, ec, erasure-coding
>Reporter: Yushi Hayasaka
>Assignee: Yushi Hayasaka
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0, 3.1.5, 3.2.3
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> If the reconstruction task fails on StripedBlockChecksumReconstructor with an 
> exception, the resulting checksum is wrong, because it is calculated from all 
> blocks except the failed one.
> The cause is that the exception is caught in an inappropriate way; as a 
> result, the failed block is never fetched again.






[jira] [Commented] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block

2021-02-02 Thread Hongbing Wang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277006#comment-17277006
 ] 

Hongbing Wang commented on HDFS-15779:
--

[~ferhui] Thanks for the guidance. Fixed the code style in [^HDFS-15779.002.patch].

> EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
> -
>
> Key: HDFS-15779
> URL: https://issues.apache.org/jira/browse/HDFS-15779
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Hongbing Wang
>Assignee: Hongbing Wang
>Priority: Major
> Attachments: HDFS-15779.001.patch, HDFS-15779.002.patch
>
>
> The NullPointerException in the DN log is as follows:
> {code:java}
> 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY
> //...
> 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Connection timed out
> 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Failed to reconstruct striped block: 
> BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Receiving 
> BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 
> src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50
> 010
> {code}
> The NPE occurs at `writer.getTargetBuffer()` in the following code:
> {code:java}
> // StripedWriter#clearBuffers
> void clearBuffers() {
>   for (StripedBlockWriter writer : writers) {
> ByteBuffer targetBuffer = writer.getTargetBuffer();
> if (targetBuffer != null) {
>   targetBuffer.clear();
> }
>   }
> }
> {code}
> So why is the writer null? Let's track when the writers are initialized and 
> when reconstruct() is called:
> {code:java}
> // StripedBlockReconstructor#run
> public void run() {
>   try {
> initDecoderIfNecessary();
> getStripedReader().init();
> stripedWriter.init();  //①
> reconstruct();  //②
> stripedWriter.endTargetBlocks();
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> // ...{code}
> They are called at ① and ② above respectively. `stripedWriter.init()` -> 
> `initTargetStreams()`, as follows:
> {code:java}
> // StripedWriter#initTargetStreams
> int initTargetStreams() {
>   int nSuccess = 0;
>   for (short i = 0; i < targets.length; i++) {
> try {
>   writers[i] = createWriter(i);
>   nSuccess++;
>   targetsStatus[i] = true;
> } catch (Throwable e) {
>   LOG.warn(e.getMessage());
> }
>   }
>   return nSuccess;
> }
> {code}
> The NPE occurs when createWriter() throws an exception and 0 < nSuccess < 
> targets.length, leaving some entries of writers[] null.
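
One possible hardening along the lines of this analysis (a sketch; the attached 
patch may differ) is to make clearBuffers() tolerate writers that were never 
created because createWriter(i) threw:
{code:java}
// StripedWriter#clearBuffers, null-tolerant sketch
void clearBuffers() {
  for (StripedBlockWriter writer : writers) {
    if (writer == null) {
      continue; // createWriter(i) failed for this target; nothing to clear
    }
    ByteBuffer targetBuffer = writer.getTargetBuffer();
    if (targetBuffer != null) {
      targetBuffer.clear();
    }
  }
}
{code}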






[jira] [Commented] (HDFS-13609) [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via RPC

2021-02-02 Thread xuzq (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17277005#comment-17277005
 ] 

xuzq commented on HDFS-13609:
-

Hi [~xkrogen] and [~linyiqun], I have recently been studying *Consistent Reads 
from Standby Node*.

{code:java}
private void selectRpcInputStreams(Collection<EditLogInputStream> streams,
    long fromTxnId, boolean onlyDurableTxns) throws IOException {
  QuorumCall<AsyncLogger, GetJournaledEditsResponseProto> q =
      loggers.getJournaledEdits(fromTxnId, maxTxnsPerRpc);
  Map<AsyncLogger, GetJournaledEditsResponseProto> responseMap =
      loggers.waitForWriteQuorum(q, selectInputStreamsTimeoutMs,
          "selectRpcInputStreams");
  assert responseMap.size() >= loggers.getMajoritySize() :
      "Quorum call returned without a majority";

  List<Integer> responseCounts = new ArrayList<>();
  for (GetJournaledEditsResponseProto resp : responseMap.values()) {
responseCounts.add(resp.getTxnCount());
  }
  Collections.sort(responseCounts);
  int highestTxnCount = responseCounts.get(responseCounts.size() - 1);
  ...
  // Cancel any outstanding calls to JN's.
  q.cancelCalls();

  int maxAllowedTxns = !onlyDurableTxns ? highestTxnCount :
  responseCounts.get(responseCounts.size() - loggers.getMajoritySize());
  if (maxAllowedTxns == 0) {
LOG.debug("No new edits available in logs; requested starting from " +
"ID " + fromTxnId);
return;
  }
  ...
}
{code}
 

 

Maybe something is wrong in
{code:java}
int maxAllowedTxns = !onlyDurableTxns ? highestTxnCount :
    responseCounts.get(responseCounts.size() - loggers.getMajoritySize());{code}
 * Let's say we have 5 JournalNodes, and loggers.getMajoritySize() is 3.
 * _loggers.getJournaledEdits(fromTxnId, maxTxnsPerRpc)_ only needs a quorum of 
responses, so responseCounts.size() may be just 3.
 * When _onlyDurableTxns_ is true, _maxAllowedTxns_ = responseCounts.get(0).
 * _responseCounts.get(0)_ may not reflect the expected quorum result, and it 
may not even contain any transactions starting from _fromTxnId_.
 ** For example, one JournalNode's disk may be failing, so the edits from 
_fromTxnId_ exist only in its cache (see the sketch below).
 

[~xkrogen] and [~linyiqun], if you have time, please take a look at this 
question. Thanks!

> [Edit Tail Fast Path Pt 3] NameNode-side changes to support tailing edits via 
> RPC
> -
>
> Key: HDFS-13609
> URL: https://issues.apache.org/jira/browse/HDFS-13609
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: ha, namenode
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: HDFS-12943, 3.3.0
>
> Attachments: HDFS-13609-HDFS-12943.000.patch, 
> HDFS-13609-HDFS-12943.001.patch, HDFS-13609-HDFS-12943.002.patch, 
> HDFS-13609-HDFS-12943.003.patch, HDFS-13609-HDFS-12943.004.patch
>
>
> See HDFS-13150 for the full design.
> This JIRA is targetted at the NameNode-side changes to enable tailing 
> in-progress edits via the RPC mechanism added in HDFS-13608. Most changes are 
> in the QuorumJournalManager.






[jira] [Updated] (HDFS-15779) EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block

2021-02-02 Thread Hongbing Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hongbing Wang updated HDFS-15779:
-
Attachment: HDFS-15779.002.patch

> EC: fix NPE caused by StripedWriter.clearBuffers during reconstruct block
> -
>
> Key: HDFS-15779
> URL: https://issues.apache.org/jira/browse/HDFS-15779
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: Hongbing Wang
>Assignee: Hongbing Wang
>Priority: Major
> Attachments: HDFS-15779.001.patch, HDFS-15779.002.patch
>
>
> The NullPointerException in the DN log is as follows:
> {code:java}
> 2020-12-28 15:49:25,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY
> //...
> 2020-12-28 15:51:25,551 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Connection timed out
> 2020-12-28 15:51:25,553 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Failed to reconstruct striped block: 
> BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036804064064_6311920695
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedWriter.clearBuffers(StripedWriter.java:299)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.clearBuffers(StripedBlockReconstructor.java:139)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:115)
> at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2020-12-28 15:51:25,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Receiving 
> BP-1922004198-10.83.xx.xx-1515033360950:blk_-9223372036799445643_6313197139 
> src: /10.83.xxx.52:53198 dest: /10.83.xxx.52:50
> 010
> {code}
> The NPE occurs at `writer.getTargetBuffer()` in the following code:
> {code:java}
> // StripedWriter#clearBuffers
> void clearBuffers() {
>   for (StripedBlockWriter writer : writers) {
> ByteBuffer targetBuffer = writer.getTargetBuffer();
> if (targetBuffer != null) {
>   targetBuffer.clear();
> }
>   }
> }
> {code}
> So why is the writer null? Let's track when the writers are initialized and 
> when reconstruct() is called:
> {code:java}
> // StripedBlockReconstructor#run
> public void run() {
>   try {
> initDecoderIfNecessary();
> getStripedReader().init();
> stripedWriter.init();  //①
> reconstruct();  //②
> stripedWriter.endTargetBlocks();
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> // ...{code}
> They are called at ① and ② above respectively. `stripedWriter.init()` -> 
> `initTargetStreams()`, as follows:
> {code:java}
> // StripedWriter#initTargetStreams
> int initTargetStreams() {
>   int nSuccess = 0;
>   for (short i = 0; i < targets.length; i++) {
> try {
>   writers[i] = createWriter(i);
>   nSuccess++;
>   targetsStatus[i] = true;
> } catch (Throwable e) {
>   LOG.warn(e.getMessage());
> }
>   }
>   return nSuccess;
> }
> {code}
> The NPE occurs when createWriter() throws an exception and 0 < nSuccess < 
> targets.length, leaving some entries of writers[] null.






[jira] [Work logged] (HDFS-15795) EC: Wrong checksum when reconstruction was failed by exception

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15795?focusedWorklogId=545883=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545883
 ]

ASF GitHub Bot logged work on HDFS-15795:
-

Author: ASF GitHub Bot
Created on: 02/Feb/21 09:02
Start Date: 02/Feb/21 09:02
Worklog Time Spent: 10m 
  Work Description: sodonnel merged pull request #2657:
URL: https://github.com/apache/hadoop/pull/2657


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 545883)
Time Spent: 1h 40m  (was: 1.5h)

> EC: Wrong checksum when reconstruction was failed by exception
> --
>
> Key: HDFS-15795
> URL: https://issues.apache.org/jira/browse/HDFS-15795
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, ec, erasure-coding
>Reporter: Yushi Hayasaka
>Assignee: Yushi Hayasaka
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> If the reconstruction task fails on StripedBlockChecksumReconstructor with an 
> exception, the resulting checksum is wrong, because it is calculated from all 
> blocks except the failed one.
> The cause is that the exception is caught in an inappropriate way; as a 
> result, the failed block is never fetched again.






[jira] [Updated] (HDFS-15795) EC: Wrong checksum when reconstruction was failed by exception

2021-02-02 Thread Stephen O'Donnell (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen O'Donnell updated HDFS-15795:
-
Summary: EC: Wrong checksum when reconstruction was failed by exception  
(was: EC: Returned wrong checksum when reconstruction was failed by exception)

> EC: Wrong checksum when reconstruction was failed by exception
> --
>
> Key: HDFS-15795
> URL: https://issues.apache.org/jira/browse/HDFS-15795
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, ec, erasure-coding
>Reporter: Yushi Hayasaka
>Assignee: Yushi Hayasaka
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> If the reconstruction task fails on StripedBlockChecksumReconstructor with an 
> exception, the resulting checksum is wrong, because it is calculated from all 
> blocks except the failed one.
> The cause is that the exception is caught in an inappropriate way; as a 
> result, the failed block is never fetched again.






[jira] [Work logged] (HDFS-15808) Add metrics for FSNamesystem read/write lock warnings

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15808?focusedWorklogId=545865=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545865
 ]

ASF GitHub Bot logged work on HDFS-15808:
-

Author: ASF GitHub Bot
Created on: 02/Feb/21 08:22
Start Date: 02/Feb/21 08:22
Worklog Time Spent: 10m 
  Work Description: tomscut edited a comment on pull request #2668:
URL: https://github.com/apache/hadoop/pull/2668#issuecomment-771457006


   Failed junit tests:
   * hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints
   * hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks

   Sorry. I didn't update those two unit tests, and they worked fine locally.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 545865)
Time Spent: 40m  (was: 0.5h)

> Add metrics for FSNamesystem read/write lock warnings
> -
>
> Key: HDFS-15808
> URL: https://issues.apache.org/jira/browse/HDFS-15808
> Project: Hadoop HDFS
>  Issue Type: Wish
>  Components: hdfs
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: hdfs, lock, metrics, pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> To monitor how often read/write lock holds exceed their warning thresholds, we 
> can add two metrics (ReadLockWarning/WriteLockWarning), which are exposed via JMX.
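
A minimal sketch of what such counters could look like with the Hadoop metrics2 
library (class and method names assumed, not necessarily the actual pull 
request):
{code:java}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;

@Metrics(about = "FSNamesystem lock warning metrics", context = "dfs")
class FSNamesystemLockWarningMetrics {
  @Metric("Times the read lock was held past the warn threshold")
  MutableCounterLong readLockWarning;
  @Metric("Times the write lock was held past the warn threshold")
  MutableCounterLong writeLockWarning;

  // Called when a lock is released; heldMs is the measured hold time.
  void record(boolean isWriteLock, long heldMs, long warnThresholdMs) {
    if (heldMs > warnThresholdMs) {
      (isWriteLock ? writeLockWarning : readLockWarning).incr();
    }
  }
}
{code}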






[jira] [Work logged] (HDFS-15808) Add metrics for FSNamesystem read/write lock warnings

2021-02-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15808?focusedWorklogId=545864=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-545864
 ]

ASF GitHub Bot logged work on HDFS-15808:
-

Author: ASF GitHub Bot
Created on: 02/Feb/21 08:21
Start Date: 02/Feb/21 08:21
Worklog Time Spent: 10m 
  Work Description: tomscut commented on pull request #2668:
URL: https://github.com/apache/hadoop/pull/2668#issuecomment-771457006


   Failed junit tests:
   * hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints
   * hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks

   Sorry. I didn't change those two unit tests, and they worked fine locally.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 545864)
Time Spent: 0.5h  (was: 20m)

> Add metrics for FSNamesystem read/write lock warnings
> -
>
> Key: HDFS-15808
> URL: https://issues.apache.org/jira/browse/HDFS-15808
> Project: Hadoop HDFS
>  Issue Type: Wish
>  Components: hdfs
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: hdfs, lock, metrics, pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> To monitor how often read/write lock holds exceed their warning thresholds, we 
> can add two metrics (ReadLockWarning/WriteLockWarning), which are exposed via JMX.


