[jira] [Commented] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk

2020-02-20 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041633#comment-17041633
 ] 

Ayush Saxena commented on HDFS-15182:
-

Jenkins seems to be having complaints.
Have triggered it again.

> TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
> 
>
> Key: HDFS-15182
> URL: https://issues.apache.org/jira/browse/HDFS-15182
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Minor
> Attachments: HDFS-15182-001.patch, HDFS-15182-002.patch, 
> HDFS-15182-003.patch
>
>
> When only the single UT TestBlockManager#testOneOfTwoRacksDecommissioned() is 
> run, it fails with a NullPointerException.
> Since NameNode#metrics is a static variable, running all the UTs in 
> TestBlockManager passes because another UT initializes the metrics.
> But running only testOneOfTwoRacksDecommissioned() without the metrics 
> initialized throws a NullPointerException.
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addBlock(BlockManager.java:4088)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.fulfillPipeline(TestBlockManager.java:518)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestOneOfTwoRacksDecommissioned(TestBlockManager.java:388)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testOneOfTwoRacksDecommissioned(TestBlockManager.java:353)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {code}






[jira] [Commented] (HDFS-15174) Optimize ReplicaCachingGetSpaceUsed by reducing unnecessary io operations

2020-02-20 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041619#comment-17041619
 ] 

Lisheng Sun commented on HDFS-15174:


XiaoMi HDFS uses the in-memory replica size via replicaInfo.getBytesOnDisk(), 
so this reduces IO operations and disk pressure.
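
A minimal sketch of how the meta size can then be computed without IO, assuming 
the usual meta-file layout of a fixed header followed by one checksum per data 
chunk; the variable names (checksum, blockDataLength) are illustrative, not 
taken from the patch:

{code:java}
// Sketch only: derive the meta-file length from in-memory state instead of
// calling File.length(), so no disk IO is needed.
int bytesPerChecksum = checksum.getBytesPerChecksum();
long numChunks = (blockDataLength + bytesPerChecksum - 1) / bytesPerChecksum;
long metaLength = BlockMetadataHeader.getHeaderSize()
    + numChunks * checksum.getChecksumSize();
{code}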

> Optimize ReplicaCachingGetSpaceUsed by reducing unnecessary io operations
> -
>
> Key: HDFS-15174
> URL: https://issues.apache.org/jira/browse/HDFS-15174
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-15174-001.patch
>
>
> Calculating the size of each block and of its meta file requires an IO 
> operation in ReplicaCachingGetSpaceUsed#refresh(), which pressures disk 
> performance when there are many blocks. HDFS-14313 is intended to reduce IO 
> operations, so get the block size from ReplicaInfo and the meta size from 
> DataChecksum#getChecksumSize().
> {code:java}
> @Override
> protected void refresh() {
>   if (CollectionUtils.isNotEmpty(replicaInfos)) {
>     for (ReplicaInfo replicaInfo : replicaInfos) {
>       if (Objects.equals(replicaInfo.getVolume().getStorageID(),
>           volume.getStorageID())) {
>         dfsUsed += replicaInfo.getBlockDataLength();
>         dfsUsed += replicaInfo.getMetadataLength();
>         count++;
>       }
>     }
>   }
> }
> {code}






[jira] [Updated] (HDFS-15174) Optimize ReplicaCachingGetSpaceUsed by reducing unnecessary io operations

2020-02-20 Thread Lisheng Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lisheng Sun updated HDFS-15174:
---
Attachment: HDFS-15174-001.patch
Status: Patch Available  (was: Open)

> Optimize ReplicaCachingGetSpaceUsed by reducing unnecessary io operations
> -
>
> Key: HDFS-15174
> URL: https://issues.apache.org/jira/browse/HDFS-15174
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-15174-001.patch
>
>
> Calculating the size of each block and of its meta file requires an IO 
> operation in ReplicaCachingGetSpaceUsed#refresh(), which pressures disk 
> performance when there are many blocks. HDFS-14313 is intended to reduce IO 
> operations, so get the block size from ReplicaInfo and the meta size from 
> DataChecksum#getChecksumSize().
> {code:java}
> @Override
> protected void refresh() {
>   if (CollectionUtils.isNotEmpty(replicaInfos)) {
>     for (ReplicaInfo replicaInfo : replicaInfos) {
>       if (Objects.equals(replicaInfo.getVolume().getStorageID(),
>           volume.getStorageID())) {
>         dfsUsed += replicaInfo.getBlockDataLength();
>         dfsUsed += replicaInfo.getMetadataLength();
>         count++;
>       }
>     }
>   }
> }
> {code}






[jira] [Comment Edited] (HDFS-15039) Cache meta file length of FinalizedReplica to reduce call File.length()

2020-02-20 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041583#comment-17041583
 ] 

Lisheng Sun edited comment on HDFS-15039 at 2/21/20 6:53 AM:
-

The problem of excessive IO operations does exist.
LGTM.

HDFS-15174 optimizes ReplicaCachingGetSpaceUsed by reducing unnecessary IO 
operations, e.g. getting the block size from ReplicaInfo and the meta size by 
DataChecksum#getChecksumSize().


was (Author: leosun08):
The problem of excessive IO operations does exist, 
but there are also IO operations in LocalReplica#getMetaFile in this patch.

HDFS-15174 optimizes ReplicaCachingGetSpaceUsed by reducing unnecessary IO 
operations, e.g. getting the block size from ReplicaInfo and the meta size by 
DataChecksum#getChecksumSize().

> Cache meta file length of FinalizedReplica to reduce call File.length()
> ---
>
> Key: HDFS-15039
> URL: https://issues.apache.org/jira/browse/HDFS-15039
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Yang Yun
>Assignee: Yang Yun
>Priority: Minor
> Attachments: HDFS-15039.patch
>
>
> When using ReplicaCachingGetSpaceUsed to get the volume space used, it will 
> call File.length() for every replica's meta file. That adds more disk IO; we 
> found the slow log below. For a finalized replica, the size of the meta file 
> does not change, so I think we can cache the value.
> {code:java}
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed:
>  Refresh dfs used, bpid: BP-898717543-10.75.1.240-1519386995727 replicas 
> size: 1166 dfsUsed: 72227113183 on volume: 
> DS-3add8d62-d69a-4f5a-a29f-b7bbb400af2e duration: 17206ms{code}
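
A minimal sketch of the caching idea, with illustrative names (cachedMetaLength, 
getMetaFile) rather than the patch's actual code:

{code:java}
// Sketch only: a finalized replica's meta file no longer changes, so the
// first File.length() result can be cached instead of hitting the disk on
// every refresh.
private long cachedMetaLength = -1;

@Override
public long getMetadataLength() {
  if (cachedMetaLength < 0) {        // first call: one disk IO
    cachedMetaLength = getMetaFile().length();
  }
  return cachedMetaLength;           // later calls: served from memory
}
{code}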






[jira] [Comment Edited] (HDFS-15039) Cache meta file length of FinalizedReplica to reduce call File.length()

2020-02-20 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041583#comment-17041583
 ] 

Lisheng Sun edited comment on HDFS-15039 at 2/21/20 6:49 AM:
-

The problem of excessive IO operations does exist, 
but there are also IO operations in LocalReplica#getMetaFile in this patch.

HDFS-15174 optimizes ReplicaCachingGetSpaceUsed by reducing unnecessary IO 
operations, e.g. getting the block size from ReplicaInfo and the meta size by 
DataChecksum#getChecksumSize().


was (Author: leosun08):
LGTM.
The problem of excessive IO operations does exist.
HDFS-15174 optimizes ReplicaCachingGetSpaceUsed by reducing unnecessary IO 
operations, e.g. getting the block size from ReplicaInfo and the meta size by 
DataChecksum#getChecksumSize().

> Cache meta file length of FinalizedReplica to reduce call File.length()
> ---
>
> Key: HDFS-15039
> URL: https://issues.apache.org/jira/browse/HDFS-15039
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Yang Yun
>Assignee: Yang Yun
>Priority: Minor
> Attachments: HDFS-15039.patch
>
>
> When using ReplicaCachingGetSpaceUsed to get the volume space used, it will 
> call File.length() for every replica's meta file. That adds more disk IO; we 
> found the slow log below. For a finalized replica, the size of the meta file 
> does not change, so I think we can cache the value.
> {code:java}
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed:
>  Refresh dfs used, bpid: BP-898717543-10.75.1.240-1519386995727 replicas 
> size: 1166 dfsUsed: 72227113183 on volume: 
> DS-3add8d62-d69a-4f5a-a29f-b7bbb400af2e duration: 17206ms{code}






[jira] [Commented] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover

2020-02-20 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041603#comment-17041603
 ] 

Hadoop QA commented on HDFS-15187:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m  
7s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 21m 
39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
3s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
46s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
17m 37s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
53s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
41s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m  1s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}106m 38s{color} 
| {color:red} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
32s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}175m 17s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.TestDeadDatanode |
|   | hadoop.hdfs.server.balancer.TestBalancer |
|   | hadoop.hdfs.TestRollingUpgrade |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.6 Server=19.03.6 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-15187 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12994046/HDFS-15187-02.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux b2d096ad0770 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 
08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 7f35676 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_232 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28819/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28819/testReport/ |
| Max. process+thread count | 3053 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output | 

[jira] [Comment Edited] (HDFS-15039) Cache meta file length of FinalizedReplica to reduce call File.length()

2020-02-20 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041583#comment-17041583
 ] 

Lisheng Sun edited comment on HDFS-15039 at 2/21/20 6:29 AM:
-

LGTM.
The problem of excessive IO operations does exist.
HDFS-15174 optimizes ReplicaCachingGetSpaceUsed by reducing unnecessary IO 
operations, e.g. getting the block size from ReplicaInfo and the meta size by 
DataChecksum#getChecksumSize().


was (Author: leosun08):
LGTM.
HDFS-15174 optimizes ReplicaCachingGetSpaceUsed by reducing unnecessary IO 
operations, e.g. getting the block size from ReplicaInfo and the meta size by 
DataChecksum#getChecksumSize().

> Cache meta file length of FinalizedReplica to reduce call File.length()
> ---
>
> Key: HDFS-15039
> URL: https://issues.apache.org/jira/browse/HDFS-15039
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Yang Yun
>Assignee: Yang Yun
>Priority: Minor
> Attachments: HDFS-15039.patch
>
>
> When using ReplicaCachingGetSpaceUsed to get the volume space used, it will 
> call File.length() for every replica's meta file. That adds more disk IO; we 
> found the slow log below. For a finalized replica, the size of the meta file 
> does not change, so I think we can cache the value.
> {code:java}
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed:
>  Refresh dfs used, bpid: BP-898717543-10.75.1.240-1519386995727 replicas 
> size: 1166 dfsUsed: 72227113183 on volume: 
> DS-3add8d62-d69a-4f5a-a29f-b7bbb400af2e duration: 17206ms{code}






[jira] [Commented] (HDFS-15039) Cache meta file length of FinalizedReplica to reduce call File.length()

2020-02-20 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041583#comment-17041583
 ] 

Lisheng Sun commented on HDFS-15039:


LGTM.
HDFS-15174 optimizes ReplicaCachingGetSpaceUsed by reducing unnecessary IO 
operations, e.g. getting the block size from ReplicaInfo and the meta size by 
DataChecksum#getChecksumSize().

> Cache meta file length of FinalizedReplica to reduce call File.length()
> ---
>
> Key: HDFS-15039
> URL: https://issues.apache.org/jira/browse/HDFS-15039
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Yang Yun
>Assignee: Yang Yun
>Priority: Minor
> Attachments: HDFS-15039.patch
>
>
> When using ReplicaCachingGetSpaceUsed to get the volume space used, it will 
> call File.length() for every replica's meta file. That adds more disk IO; we 
> found the slow log below. For a finalized replica, the size of the meta file 
> does not change, so I think we can cache the value.
> {code:java}
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed:
>  Refresh dfs used, bpid: BP-898717543-10.75.1.240-1519386995727 replicas 
> size: 1166 dfsUsed: 72227113183 on volume: 
> DS-3add8d62-d69a-4f5a-a29f-b7bbb400af2e duration: 17206ms{code}






[jira] [Commented] (HDFS-15176) Enable GcTimePercentage Metric in NameNode's JvmMetrics.

2020-02-20 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041582#comment-17041582
 ] 

Hadoop QA commented on HDFS-15176:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
34s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m  
8s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 
32s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 14m 
58s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
39s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
19m 12s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
54s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
24s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 14m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 14m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  2m 
36s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  2m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
2s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 12s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  5m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
55s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  8m 
55s{color} | {color:green} hadoop-common in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 95m 29s{color} 
| {color:red} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
58s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}211m  3s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.6 Server=19.03.6 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-15176 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12994059/HDFS-15176.005.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  xml  |
| uname | Linux 0c7d3cd735b8 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 
11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 181e6d0 |
| maven | version: 

[jira] [Commented] (HDFS-15171) Add a thread to call saveDfsUsed periodically, to prevent datanode too long restart time.

2020-02-20 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041574#comment-17041574
 ] 

zhuqi commented on HDFS-15171:
--

Hi [~weichiu],
There is no cache file if the datanode shuts down ungracefully, so changing 
dfs.datanode.cached-dfsused.check.interval.ms will not help my case.

HDFS-14313 should be able to reduce the refresh time; I will try it.

Thanks.

> Add a thread to call saveDfsUsed periodically, to prevent datanode too long 
> restart time.  
> ---
>
> Key: HDFS-15171
> URL: https://issues.apache.org/jira/browse/HDFS-15171
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
>
> There are 30 storage dirs per datanode in our production cluster, so it can 
> take too much time to restart, because sometimes the datanode didn't shut 
> down gracefully. Currently only the datanode's graceful shutdown hook and the 
> BlockPoolSlice shutdown trigger the saveDfsUsed function, so the datanode 
> restart sometimes can't reuse the dfsUsed cache. I think we can add a thread 
> to periodically call the saveDfsUsed function.
>  






[jira] [Comment Edited] (HDFS-15171) Add a thread to call saveDfsUsed periodically, to prevent datanode too long restart time.

2020-02-20 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041574#comment-17041574
 ] 

zhuqi edited comment on HDFS-15171 at 2/21/20 6:08 AM:
---

Hi [~weichiu],
There is no cache file if the datanode shuts down ungracefully, so changing 
dfs.datanode.cached-dfsused.check.interval.ms will not help my case.

HDFS-14313 should be able to reduce the refresh time; I will try it.

Thanks.


was (Author: zhuqi):
Hi [~weichiu],
There is no cache file if the datanode shuts down ungracefully, so changing 
dfs.datanode.cached-dfsused.check.interval.ms will not help my case.

HDFS-14313 should be able to reduce the refresh time; I will try it.

Thanks.

> Add a thread to call saveDfsUsed periodically, to prevent datanode too long 
> restart time.  
> ---
>
> Key: HDFS-15171
> URL: https://issues.apache.org/jira/browse/HDFS-15171
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
>
> There are 30 storage dirs per datanode in our production cluster, so it can 
> take too much time to restart, because sometimes the datanode didn't shut 
> down gracefully. Currently only the datanode's graceful shutdown hook and the 
> BlockPoolSlice shutdown trigger the saveDfsUsed function, so the datanode 
> restart sometimes can't reuse the dfsUsed cache. I think we can add a thread 
> to periodically call the saveDfsUsed function.
>  






[jira] [Commented] (HDFS-15171) Add a thread to call saveDfsUsed periodically, to prevent datanode too long restart time.

2020-02-20 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041564#comment-17041564
 ] 

zhuqi commented on HDFS-15171:
--

Hi [~sodonnell]

Thanks for your patient reply.

First, the every-10-minutes thread in CachingGetSpaceUsed now refreshes with a 
random jitter; if we also persist the value to the cache file whenever it 
refreshes, the cache stays as close to real time as possible.

Second, when the value refreshes, we can compare it with the last one; if they 
are the same, we can skip the persist operation to reduce disk IO.

To reduce disk IO further, we can add a configurable fixed time interval and 
only persist the value to disk once the time since the last persist exceeds 
that interval.

Then we can remove the shutdown-hook persist operation and no longer need to 
work out a suitable value for dfs.datanode.cached-dfsused.check.interval.ms.

This would also resolve my problem, which is caused by the datanode shutting 
down ungracefully.

What do you think about my advice?
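
A minimal sketch of the proposal, with stand-in names (getDfsUsed, saveDfsUsed, 
saveIntervalMs) for whatever a real patch would use:

{code:java}
// Sketch only: persist dfsUsed periodically, skipping the write when the
// value is unchanged, so an ungraceful shutdown loses at most one interval.
ScheduledExecutorService saver =
    Executors.newSingleThreadScheduledExecutor();
AtomicLong lastSaved = new AtomicLong(-1);

saver.scheduleWithFixedDelay(() -> {
  long current = getDfsUsed();        // the freshly refreshed value
  if (current != lastSaved.get()) {   // skip the write if nothing changed
    saveDfsUsed(current);             // persist to the cache file
    lastSaved.set(current);
  }
}, saveIntervalMs, saveIntervalMs, TimeUnit.MILLISECONDS);
{code}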

> Add a thread to call saveDfsUsed periodically, to prevent datanode too long 
> restart time.  
> ---
>
> Key: HDFS-15171
> URL: https://issues.apache.org/jira/browse/HDFS-15171
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
>
> There are 30 storage dirs per datanode in our production cluster, so it can 
> take too much time to restart, because sometimes the datanode didn't shut 
> down gracefully. Currently only the datanode's graceful shutdown hook and the 
> BlockPoolSlice shutdown trigger the saveDfsUsed function, so the datanode 
> restart sometimes can't reuse the dfsUsed cache. I think we can add a thread 
> to periodically call the saveDfsUsed function.
>  






[jira] [Commented] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover

2020-02-20 Thread Vinayakumar B (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041561#comment-17041561
 ] 

Vinayakumar B commented on HDFS-15187:
--

Thanks for the nice catch, [~ayushtkn].
The changes look fine to me.
+1, pending the typo fix.
Please confirm the test failures as well.

> CORRUPT replica mismatch between namenodes after failover
> -
>
> Key: HDFS-15187
> URL: https://issues.apache.org/jira/browse/HDFS-15187
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
> Attachments: HDFS-15187-01.patch, HDFS-15187-02.patch
>
>
> A corrupt replica identified by the Active Namenode isn't identified by the 
> other Namenode once that one fails over to Active, in the case where the 
> replica is marked corrupt due to updatePipeline.
> Scenario to repro:
> 1. Create a file; while writing, take one datanode down to trigger an update 
> pipeline.
> 2. Write some more data.
> 3. Close the file.
> 4. Turn the shut-down datanode back on.
> 5. The replica on that datanode will be identified as CORRUPT and the corrupt 
> count will be 1.
> 6. Fail over to the other Namenode.
> 7. Wait for all pending IBR processing.
> 8. The corrupt count will not be the same, and FSCK won't show the corrupt 
> replica.
> 9. Fail over back to the first Namenode.
> 10. The corrupt count and corrupt replica will be there.
> The two Namenodes show different state.






[jira] [Updated] (HDFS-15052) WebHDFS getTrashRoot leads to OOM due to FileSystem object creation

2020-02-20 Thread Masatake Iwasaki (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Masatake Iwasaki updated HDFS-15052:

Fix Version/s: 2.10.1
   3.2.2
   3.1.4
   3.3.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Committed to trunk, branch-3.2, branch-3.1 and branch-2.10. Thanks, [~weichiu], 
[~sodonnell] and [~kihwal].

> WebHDFS getTrashRoot leads to OOM due to FileSystem object creation
> ---
>
> Key: HDFS-15052
> URL: https://issues.apache.org/jira/browse/HDFS-15052
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Affects Versions: 2.9.0, 3.0.0-alpha2
>Reporter: Wei-Chiu Chuang
>Assignee: Masatake Iwasaki
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
>
> Quoting [~daryn] in HDFS-10756 :
> {quote}Surprised nobody has discovered this will lead to an inevitable OOM in 
> the NN. The NN should not be creating filesystems to itself, and must never 
> create filesystems in a remote user's context or the cache will explode.
> {quote}
> I guess the problem lies inside NamenodeWebHdfsMethods#getTrashRoot
> {code:java}
> private static String getTrashRoot(String fullPath,
>     Configuration conf) throws IOException {
>   FileSystem fs =
>       FileSystem.get(conf != null ? conf : new Configuration());
>   return fs.getTrashRoot(
>       new org.apache.hadoop.fs.Path(fullPath)).toUri().getPath();
> }
> {code}
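
For contrast, a minimal sketch of the direction such a fix can take: derive the 
trash root from the caller instead of creating a FileSystem. This ignores 
encryption-zone trash roots and is illustrative only, not the committed change:

{code:java}
// Sketch only: build the trash root string directly, avoiding
// FileSystem.get(conf), which caches one FileSystem instance per remote user
// inside the NameNode and eventually exhausts the heap.
private static String getTrashRoot(UserGroupInformation ugi) {
  return "/user/" + ugi.getShortUserName() + "/" + FileSystem.TRASH_PREFIX;
}
{code}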






[jira] [Commented] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover

2020-02-20 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041533#comment-17041533
 ] 

Íñigo Goiri commented on HDFS-15187:


Small typo: isPreviousMessgageProcessed

> CORRUPT replica mismatch between namenodes after failover
> -
>
> Key: HDFS-15187
> URL: https://issues.apache.org/jira/browse/HDFS-15187
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
> Attachments: HDFS-15187-01.patch, HDFS-15187-02.patch
>
>
> A corrupt replica identified by the Active Namenode isn't identified by the 
> other Namenode once that one fails over to Active, in the case where the 
> replica is marked corrupt due to updatePipeline.
> Scenario to repro:
> 1. Create a file; while writing, take one datanode down to trigger an update 
> pipeline.
> 2. Write some more data.
> 3. Close the file.
> 4. Turn the shut-down datanode back on.
> 5. The replica on that datanode will be identified as CORRUPT and the corrupt 
> count will be 1.
> 6. Fail over to the other Namenode.
> 7. Wait for all pending IBR processing.
> 8. The corrupt count will not be the same, and FSCK won't show the corrupt 
> replica.
> 9. Fail over back to the first Namenode.
> 10. The corrupt count and corrupt replica will be there.
> The two Namenodes show different state.






[jira] [Commented] (HDFS-15177) Split datanode invalide block deletion, to avoid the FsDatasetImpl lock too much time.

2020-02-20 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041510#comment-17041510
 ] 

zhuqi commented on HDFS-15177:
--

cc [~sodonnell]

Thanks for your patient reply. I will change the lock to fair.

> Split datanode invalide block deletion, to avoid the FsDatasetImpl lock too 
> much time.
> --
>
> Key: HDFS-15177
> URL: https://issues.apache.org/jira/browse/HDFS-15177
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: image-2020-02-18-22-39-00-642.png, 
> image-2020-02-18-22-51-28-624.png, image-2020-02-18-22-52-59-202.png, 
> image-2020-02-18-22-55-38-661.png
>
>
> In our cluster, the datanode receives delete commands with too many block 
> deletions when many blockpools share the same datanode and the datanode has 
> about 30 storage dirs; this causes the FsDatasetImpl lock to be held for too 
> long.
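
A minimal sketch of one way to split the deletion so the lock is released 
between batches; every name here (datasetLock, removeReplicaFromMem, 
asyncDiskService) is illustrative, not the patch:

{code:java}
// Sketch only: do the in-memory bookkeeping under the dataset lock in small
// batches and push the file deletion to an async disk service, so the
// FsDatasetImpl lock is never held across one huge deletion.
static final int BATCH_SIZE = 1000;

void invalidateInBatches(String bpid, List<ReplicaInfo> replicas) {
  for (int i = 0; i < replicas.size(); i += BATCH_SIZE) {
    List<ReplicaInfo> batch =
        replicas.subList(i, Math.min(i + BATCH_SIZE, replicas.size()));
    synchronized (datasetLock) {       // lock held per batch only
      for (ReplicaInfo r : batch) {
        removeReplicaFromMem(bpid, r); // in-memory state change
      }
    }
    for (ReplicaInfo r : batch) {
      asyncDiskService.deleteAsync(r); // disk IO outside the lock
    }
  }
}
{code}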
>  






[jira] [Commented] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk

2020-02-20 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041506#comment-17041506
 ] 

Hadoop QA commented on HDFS-15182:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
44s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 28m 
48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
17s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
55s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
23m 31s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 
36s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  
0s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
32s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
20m 13s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m 
42s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
59s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 17m  0s{color} 
| {color:red} hadoop-hdfs in the patch passed. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 
43s{color} | {color:red} The patch generated 22 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}109m 47s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestDFSOutputStream |
|   | hadoop.hdfs.TestDatanodeReport |
|   | hadoop.hdfs.TestFileStatusWithDefaultECPolicy |
|   | hadoop.hdfs.TestAppendSnapshotTruncate |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.6 Server=19.03.6 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-15182 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12994055/HDFS-15182-003.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux dc9b81484b1c 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 
08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 181e6d0 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_242 |
| findbugs | v3.1.0-RC1 |
| findbugs | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28817/artifact/out/patch-findbugs-hadoop-hdfs-project_hadoop-hdfs.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28817/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28817/testReport/ |
| asflicense | 

[jira] [Commented] (HDFS-15052) WebHDFS getTrashRoot leads to OOM due to FileSystem object creation

2020-02-20 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041489#comment-17041489
 ] 

Hudson commented on HDFS-15052:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17971 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/17971/])
HDFS-15052. WebHDFS getTrashRoot leads to OOM due to FileSystem object (github: 
rev 2338d25dc7150d75fbda84cc95422380b564)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/web/TestWebHDFS.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSUtilClient.java
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/web/resources/NamenodeWebHdfsMethods.java


> WebHDFS getTrashRoot leads to OOM due to FileSystem object creation
> ---
>
> Key: HDFS-15052
> URL: https://issues.apache.org/jira/browse/HDFS-15052
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Affects Versions: 2.9.0, 3.0.0-alpha2
>Reporter: Wei-Chiu Chuang
>Assignee: Masatake Iwasaki
>Priority: Major
>
> Quoting [~daryn] in HDFS-10756 :
> {quote}Surprised nobody has discovered this will lead to an inevitable OOM in 
> the NN. The NN should not be creating filesystems to itself, and must never 
> create filesystems in a remote user's context or the cache will explode.
> {quote}
> I guess the problem lies inside NamenodeWebHdfsMethods#getTrashRoot
> {code:java}
> private static String getTrashRoot(String fullPath,
>     Configuration conf) throws IOException {
>   FileSystem fs =
>       FileSystem.get(conf != null ? conf : new Configuration());
>   return fs.getTrashRoot(
>       new org.apache.hadoop.fs.Path(fullPath)).toUri().getPath();
> }
> {code}






[jira] [Commented] (HDFS-15185) StartupProgress reports edits segments until the entire startup completes

2020-02-20 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041488#comment-17041488
 ] 

Hadoop QA commented on HDFS-15185:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
21s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
10s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
7s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m  9s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
39s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
0s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 19s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
35s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 98m 25s{color} 
| {color:red} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
55s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}163m 45s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.web.TestWebHDFS |
|   | hadoop.hdfs.TestSafeModeWithStripedFile |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-15185 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12993964/HDFS-15185.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 45add0e916c7 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 
08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 181e6d0 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_242 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28816/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28816/testReport/ |
| Max. process+thread count | 3185 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28816/console |
| Powered by | Apache Yetus 0.8.0   

[jira] [Commented] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover

2020-02-20 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041486#comment-17041486
 ] 

Hadoop QA commented on HDFS-15187:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
40s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
19s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
4s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
15s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 54s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
48s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  
8s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 44s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  3m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
47s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}147m  7s{color} 
| {color:red} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
42s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}220m  0s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.TestDecommissioningStatus |
|   | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped |
|   | hadoop.hdfs.server.namenode.TestNamenodeCapacityReport |
|   | hadoop.hdfs.server.namenode.TestCheckPointForSecurityTokens |
|   | hadoop.hdfs.server.namenode.TestProtectedDirectories |
|   | hadoop.hdfs.server.namenode.TestFSNamesystemMBean |
|   | hadoop.hdfs.server.balancer.TestBalancer |
|   | hadoop.hdfs.server.namenode.TestStartup |
|   | hadoop.hdfs.server.namenode.TestFSImageWithSnapshot |
|   | hadoop.hdfs.server.namenode.TestFileContextAcl |
|   | hadoop.hdfs.server.namenode.TestFSImageWithAcl |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.6 Server=19.03.6 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-15187 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12994046/HDFS-15187-02.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 973f6db62f7d 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 
08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 181e6d0 |
| maven | version: 

[jira] [Commented] (HDFS-15176) Enable GcTimePercentage Metric in NameNode's JvmMetrics.

2020-02-20 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041475#comment-17041475
 ] 

Jinglun commented on HDFS-15176:


Agree with [~weichiu]; patch v05 enables GcTimeMonitor by default.

> Enable GcTimePercentage Metric in NameNode's JvmMetrics.
> 
>
> Key: HDFS-15176
> URL: https://issues.apache.org/jira/browse/HDFS-15176
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Minor
> Attachments: HDFS-15176.001.patch, HDFS-15176.002.patch, 
> HDFS-15176.003.patch, HDFS-15176.004.patch, HDFS-15176.005.patch
>
>
> The GcTimePercentage (computed by GcTimeMonitor) could be used as a dimension 
> to analyze the NameNode GC. We should add a switch config to enable the 
> GcTimePercentage metric in HDFS.
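
For reference, a minimal sketch of wiring GcTimeMonitor up, assuming the 
hadoop-common constructor and GcData accessor; the window and interval values 
are placeholders, not the patch's config defaults:

{code:java}
// Sketch only: track the percentage of wall-clock time spent in GC over a
// sliding observation window, then read it out as a metric.
GcTimeMonitor gcTimeMonitor = new GcTimeMonitor(
    60_000,  // observation window, ms
    1_000,   // sampling interval, ms
    null);   // no alert handler; metric only
gcTimeMonitor.start();

// e.g. when JvmMetrics builds a metrics record:
int gcTimePercentage =
    gcTimeMonitor.getLatestGcData().getGcTimePercentage();
{code}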






[jira] [Updated] (HDFS-15176) Enable GcTimePercentage Metric in NameNode's JvmMetrics.

2020-02-20 Thread Jinglun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinglun updated HDFS-15176:
---
Attachment: HDFS-15176.005.patch

> Enable GcTimePercentage Metric in NameNode's JvmMetrics.
> 
>
> Key: HDFS-15176
> URL: https://issues.apache.org/jira/browse/HDFS-15176
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Minor
> Attachments: HDFS-15176.001.patch, HDFS-15176.002.patch, 
> HDFS-15176.003.patch, HDFS-15176.004.patch, HDFS-15176.005.patch
>
>
> The GcTimePercentage (computed by GcTimeMonitor) could be used as a dimension 
> to analyze the NameNode GC. We should add a switch config to enable the 
> GcTimePercentage metric in HDFS.






[jira] [Commented] (HDFS-15172) Remove unnecessary deadNodeDetectInterval in DeadNodeDetector#checkDeadNodes()

2020-02-20 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041436#comment-17041436
 ] 

Lisheng Sun commented on HDFS-15172:


Hi [~elgoiri], should we commit it to trunk? HDFS-15149 depends on it. Thank 
you.

> Remove unnecessary  deadNodeDetectInterval in 
> DeadNodeDetector#checkDeadNodes()
> ---
>
> Key: HDFS-15172
> URL: https://issues.apache.org/jira/browse/HDFS-15172
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Attachments: HDFS-15172-001.patch, HDFS-15172-002.patch
>
>
> Every call to checkDeadNodes() changes the state to IDLE, forcing the 
> DeadNodeDetector to sleep for IDLE_SLEEP_MS, so we don't need a 
> deadNodeDetectInterval between checkDeadNodes() calls.
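
An abridged sketch of the control flow the description refers to (state names follow DeadNodeDetector; the probe details are elided), showing why the extra interval is redundant: every cycle already pauses in IDLE before the next CHECK_DEAD.

{code:java}
// Abridged sketch of the DeadNodeDetector run loop; not a patch. Each pass
// through CHECK_DEAD returns to IDLE, and IDLE sleeps IDLE_SLEEP_MS, so the
// loop is already paced without a separate deadNodeDetectInterval wait.
public void run() {
  while (!Thread.currentThread().isInterrupted()) {
    switch (state) {
    case CHECK_DEAD:
      checkDeadNodes();       // probes dead nodes, then always...
      state = State.IDLE;     // ...returns to IDLE
      break;
    case IDLE:
      try {
        Thread.sleep(IDLE_SLEEP_MS);  // the only pacing the loop needs
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      state = State.CHECK_DEAD;
      break;
    default:
      break;
    }
  }
}
{code}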



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk

2020-02-20 Thread Lisheng Sun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041432#comment-17041432
 ] 

Lisheng Sun commented on HDFS-15182:


Doing the init in @Before would affect the testAsyncIBR UT, so I moved it out of @Before.
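
To make the trade-off concrete, a minimal sketch (not the attached patch; apart from the method names visible in the stack trace, all names and arguments here are assumptions): initialize the static metrics inside the standalone test instead of in the shared @Before.

{code:java}
// Sketch only, not the attached patch; the initializer call and argument
// are assumed. Because NameNode#metrics is static and normally initialized
// by other UTs, a standalone run must set it up itself -- without touching
// the shared @Before that testAsyncIBR relies on.
@Test
public void testOneOfTwoRacksDecommissioned() throws Exception {
  NameNode.initMetrics(new Configuration(), NamenodeRole.NAMENODE); // assumed accessible
  doTestOneOfTwoRacksDecommissioned(1); // argument value assumed
}
{code}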

> TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
> 
>
> Key: HDFS-15182
> URL: https://issues.apache.org/jira/browse/HDFS-15182
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Minor
> Attachments: HDFS-15182-001.patch, HDFS-15182-002.patch, 
> HDFS-15182-003.patch
>
>
> When only the single UT TestBlockManager#testOneOfTwoRacksDecommissioned() is 
> run, it fails with a NullPointerException.
> NameNode#metrics is a static variable; when all UTs in TestBlockManager run, 
> another UT has already initialized the metrics.
> But running only testOneOfTwoRacksDecommissioned, without the metrics 
> initialized, throws a NullPointerException.
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addBlock(BlockManager.java:4088)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.fulfillPipeline(TestBlockManager.java:518)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestOneOfTwoRacksDecommissioned(TestBlockManager.java:388)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testOneOfTwoRacksDecommissioned(TestBlockManager.java:353)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk

2020-02-20 Thread Lisheng Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lisheng Sun updated HDFS-15182:
---
Attachment: HDFS-15182-003.patch

> TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
> 
>
> Key: HDFS-15182
> URL: https://issues.apache.org/jira/browse/HDFS-15182
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Minor
> Attachments: HDFS-15182-001.patch, HDFS-15182-002.patch, 
> HDFS-15182-003.patch
>
>
> When only the single UT TestBlockManager#testOneOfTwoRacksDecommissioned() is 
> run, it fails with a NullPointerException.
> NameNode#metrics is a static variable; when all UTs in TestBlockManager run, 
> another UT has already initialized the metrics.
> But running only testOneOfTwoRacksDecommissioned, without the metrics 
> initialized, throws a NullPointerException.
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addBlock(BlockManager.java:4088)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.fulfillPipeline(TestBlockManager.java:518)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestOneOfTwoRacksDecommissioned(TestBlockManager.java:388)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testOneOfTwoRacksDecommissioned(TestBlockManager.java:353)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15039) Cache meta file length of FinalizedReplica to reduce call File.length()

2020-02-20 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041383#comment-17041383
 ] 

Wei-Chiu Chuang commented on HDFS-15039:


Looks good to me. [~leosun08] fyi

> Cache meta file length of FinalizedReplica to reduce call File.length()
> ---
>
> Key: HDFS-15039
> URL: https://issues.apache.org/jira/browse/HDFS-15039
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: Yang Yun
>Assignee: Yang Yun
>Priority: Minor
> Attachments: HDFS-15039.patch
>
>
> When ReplicaCachingGetSpaceUsed is used to get the volume space used, it 
> calls File.length() for the meta file of every replica. That adds disk IO; we 
> found the slow log below. For a finalized replica the size of the meta file 
> never changes, so I think we can cache the value.
> {code:java}
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed:
>  Refresh dfs used, bpid: BP-898717543-10.75.1.240-1519386995727 replicas 
> size: 1166 dfsUsed: 72227113183 on volume: 
> DS-3add8d62-d69a-4f5a-a29f-b7bbb400af2e duration: 17206ms{code}
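
A minimal sketch of the memoization idea (class, field, and accessor names are illustrative, not the attached patch): since a finalized replica's meta file is immutable, one File.length() call can be cached and reused.

{code:java}
import java.io.File;

// Illustrative sketch, not the attached patch; names are assumed. A
// finalized replica's meta file never changes, so the length can be read
// from disk once and reused by ReplicaCachingGetSpaceUsed refreshes.
class CachedMetaLength {
  private long cachedMetaLength = -1; // -1 means "not read from disk yet"

  long getMetaLength(File metaFile) {
    if (cachedMetaLength < 0) {
      cachedMetaLength = metaFile.length(); // single disk stat, then memoized
    }
    return cachedMetaLength;
  }
}
{code}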



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15171) Add a thread to call saveDfsUsed periodically, to prevent datanode too long restart time.

2020-02-20 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041367#comment-17041367
 ] 

Wei-Chiu Chuang commented on HDFS-15171:


The first thing that comes to mind is increasing 
dfs.datanode.cached-dfsused.check.interval.ms, as Stephen said.
Otherwise, HDFS-14313 may be useful too.

> Add a thread to call saveDfsUsed periodically, to prevent datanode too long 
> restart time.  
> ---
>
> Key: HDFS-15171
> URL: https://issues.apache.org/jira/browse/HDFS-15171
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
>
> There are 30 storage dirs per datanode in our production cluster, so a 
> restart takes too long when the datanode did not shut down gracefully. 
> Currently only the datanode graceful-shutdown hook and the blockpoolslice 
> shutdown call the saveDfsUsed function, so a restarted datanode sometimes 
> cannot reuse the dfsUsed cache. I think we can add a thread to periodically 
> call the saveDfsUsed function.
>  
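
A minimal sketch of the proposal (not a patch; the executor wiring, the interval value, and the saveDfsUsed() receiver are assumptions here): persist the dfsUsed cache on a schedule so that even an ungraceful shutdown leaves a recent cache file behind.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the proposal, not a patch. Today only the shutdown paths call
// saveDfsUsed(); scheduling it keeps the on-disk cache fresh enough for a
// restart to reuse even after a crash.
ScheduledExecutorService saver = Executors.newSingleThreadScheduledExecutor();
long intervalMs = 600_000L; // e.g. 10 minutes; would come from a new config key
saver.scheduleWithFixedDelay(
    () -> blockPoolSlice.saveDfsUsed(), // receiver assumed
    intervalMs, intervalMs, TimeUnit.MILLISECONDS);
{code}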



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15176) Enable GcTimePercentage Metric in NameNode's JvmMetrics.

2020-02-20 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041364#comment-17041364
 ] 

Wei-Chiu Chuang commented on HDFS-15176:


Thanks, I didn't even realize there's a GcTimeMonitor in our code. Shouldn't we 
enable it by default? It doesn't look like a heavyweight operation...

> Enable GcTimePercentage Metric in NameNode's JvmMetrics.
> 
>
> Key: HDFS-15176
> URL: https://issues.apache.org/jira/browse/HDFS-15176
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Minor
> Attachments: HDFS-15176.001.patch, HDFS-15176.002.patch, 
> HDFS-15176.003.patch, HDFS-15176.004.patch
>
>
> The GcTimePercentage (computed by GcTimeMonitor) could be used as a dimension 
> to analyze NameNode GC. We should add a switch config to enable the 
> GcTimePercentage metric in HDFS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover

2020-02-20 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-15187:

Attachment: HDFS-15187-02.patch

> CORRUPT replica mismatch between namenodes after failover
> -
>
> Key: HDFS-15187
> URL: https://issues.apache.org/jira/browse/HDFS-15187
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
> Attachments: HDFS-15187-01.patch, HDFS-15187-02.patch
>
>
> The corrupt replica identified by the Active NameNode isn't identified by the 
> other NameNode once that one fails over to Active, in the case where the 
> replica is marked corrupt due to updatePipeline.
> Scenario to repro:
> 1. Create a file; while writing, take one datanode down to trigger an update 
> pipeline.
> 2. Write some more data.
> 3. Close the file.
> 4. Turn the shut-down datanode back on.
> 5. The replica on that datanode will be identified as CORRUPT and the corrupt 
> count will be 1.
> 6. Fail over to the other NameNode.
> 7. Wait for all pending IBR processing.
> 8. The corrupt count will not be the same, and FSCK won't show the corrupt 
> replica.
> 9. Fail back to the first NameNode.
> 10. The corrupt count and corrupt replica will be there.
> The two NameNodes show different state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover

2020-02-20 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-15187:

Attachment: HDFS-15187-02.patch

> CORRUPT replica mismatch between namenodes after failover
> -
>
> Key: HDFS-15187
> URL: https://issues.apache.org/jira/browse/HDFS-15187
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
> Attachments: HDFS-15187-01.patch
>
>
> The corrupt replica identified by the Active NameNode isn't identified by the 
> other NameNode once that one fails over to Active, in the case where the 
> replica is marked corrupt due to updatePipeline.
> Scenario to repro:
> 1. Create a file; while writing, take one datanode down to trigger an update 
> pipeline.
> 2. Write some more data.
> 3. Close the file.
> 4. Turn the shut-down datanode back on.
> 5. The replica on that datanode will be identified as CORRUPT and the corrupt 
> count will be 1.
> 6. Fail over to the other NameNode.
> 7. Wait for all pending IBR processing.
> 8. The corrupt count will not be the same, and FSCK won't show the corrupt 
> replica.
> 9. Fail back to the first NameNode.
> 10. The corrupt count and corrupt replica will be there.
> The two NameNodes show different state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover

2020-02-20 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-15187:

Attachment: (was: HDFS-15187-02.patch)

> CORRUPT replica mismatch between namenodes after failover
> -
>
> Key: HDFS-15187
> URL: https://issues.apache.org/jira/browse/HDFS-15187
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
> Attachments: HDFS-15187-01.patch
>
>
> The corrupt replica identified by the Active NameNode isn't identified by the 
> other NameNode once that one fails over to Active, in the case where the 
> replica is marked corrupt due to updatePipeline.
> Scenario to repro:
> 1. Create a file; while writing, take one datanode down to trigger an update 
> pipeline.
> 2. Write some more data.
> 3. Close the file.
> 4. Turn the shut-down datanode back on.
> 5. The replica on that datanode will be identified as CORRUPT and the corrupt 
> count will be 1.
> 6. Fail over to the other NameNode.
> 7. Wait for all pending IBR processing.
> 8. The corrupt count will not be the same, and FSCK won't show the corrupt 
> replica.
> 9. Fail back to the first NameNode.
> 10. The corrupt count and corrupt replica will be there.
> The two NameNodes show different state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-20 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041319#comment-17041319
 ] 

Wei-Chiu Chuang commented on HDFS-15186:


[~ferhui] would you be interested in reviewing this patch? You've fixed a 
number of critical EC decomm bugs in the past. Thanks.

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Critical
> Attachments: HDFS-15186.001.patch
>
>
> I found parity blocks whose content is all 0 after decommissioning more than 
> one DataNode from a cluster, and the probability is significant (parts per 
> thousand). This is a serious problem: if we read data from a zero parity 
> block, or use it to recover another block, we consume corrupted data without 
> knowing it.
> Some cases are shown below (B: busy DataNode, D: decommissioning DataNode, 
> others are normal):
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class 
> StripedReconstructionInfo) of length 2.
> A targets length of 2 means the DataNode must recover 2 internal blocks under 
> the current code, but from liveIndices only 1 missing block can be found, so 
> StripedWriter#initTargetIndices uses 0 as the default recovery index without 
> checking whether index 0 is already among the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
> recover target indices [6, 0]. Index 0 appears in both the source and target 
> indices in this case, and the returned target buffer for index 6 is then 
> always 0, so one could call this the EC algorithm's problem, since it should 
> be more fault tolerant. I tried to fix it there, but it is too hard because 
> there are too many cases (the second example above leads to source indices 
> [1, 2, 3, 4, 5, 7] recovering target indices [0, 6, 0]). So I changed my 
> mind and invoke the EC algorithm with correct parameters instead, i.e. 
> remove the duplicate target index 0 in this case. That is how I fixed it.
> 
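
The fix direction in the last paragraph can be summarized in a short sketch (a sketch of the idea, not HDFS-15186.001.patch): drop any target index that also appears among the sources before handing both arrays to the decoder.

{code:java}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the fix direction described above, not the attached patch:
// filter out any "target" index that is already a live source index before
// invoking the EC decoder, so targets [6, 0] with sources [0..5] become [6].
static int[] dedupTargetIndices(int[] sourceIndices, int[] targetIndices) {
  Set<Integer> sources = new HashSet<>();
  for (int s : sourceIndices) {
    sources.add(s);
  }
  List<Integer> kept = new ArrayList<>();
  for (int t : targetIndices) {
    if (!sources.contains(t)) { // an index that is both source and target is bogus
      kept.add(t);
    }
  }
  int[] result = new int[kept.size()];
  for (int i = 0; i < result.length; i++) {
    result[i] = kept.get(i);
  }
  return result;
}
{code}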



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-20 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041306#comment-17041306
 ] 

Wei-Chiu Chuang commented on HDFS-15186:


[~avijayan], could you take a look at the patch? Or [~ayushtkn], please.

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Critical
> Attachments: HDFS-15186.001.patch
>
>
> I found parity blocks whose content is all 0 after decommissioning more than 
> one DataNode from a cluster, and the probability is significant (parts per 
> thousand). This is a serious problem: if we read data from a zero parity 
> block, or use it to recover another block, we consume corrupted data without 
> knowing it.
> Some cases are shown below (B: busy DataNode, D: decommissioning DataNode, 
> others are normal):
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class 
> StripedReconstructionInfo) of length 2.
> A targets length of 2 means the DataNode must recover 2 internal blocks under 
> the current code, but from liveIndices only 1 missing block can be found, so 
> StripedWriter#initTargetIndices uses 0 as the default recovery index without 
> checking whether index 0 is already among the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
> recover target indices [6, 0]. Index 0 appears in both the source and target 
> indices in this case, and the returned target buffer for index 6 is then 
> always 0, so one could call this the EC algorithm's problem, since it should 
> be more fault tolerant. I tried to fix it there, but it is too hard because 
> there are too many cases (the second example above leads to source indices 
> [1, 2, 3, 4, 5, 7] recovering target indices [0, 6, 0]). So I changed my 
> mind and invoke the EC algorithm with correct parameters instead, i.e. 
> remove the duplicate target index 0 in this case. That is how I fixed it.
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15185) StartupProgress reports edits segments until the entire startup completes

2020-02-20 Thread Chen Liang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041290#comment-17041290
 ] 

Chen Liang commented on HDFS-15185:
---

I have tested this fix on a real cluster; the patch did get rid of the 
excessive ByteString displays. +1 once the Jenkins warnings are addressed.

> StartupProgress reports edits segments until the entire startup completes
> -
>
> Key: HDFS-15185
> URL: https://issues.apache.org/jira/browse/HDFS-15185
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: Konstantin Shvachko
>Priority: Major
> Attachments: HDFS-15185.001.patch
>
>
> The Startup Progress page keeps reporting edits segments after the 
> {{LOAD_EDITS}} stage is complete. New steps are added to StartupProgress 
> during journal tailing until all startup phases are completed, which adds a 
> lot of edits steps, since the {{SAFEMODE}} phase can take a long time on a 
> large cluster.
> With fast tailing the segments are small, but their number is large (160K). 
> This makes the page load forever.
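
One way to read the fix (a sketch with assumed phase, status, and step names, not the attached patch): stop registering new edits steps once startup has left the edits-loading phase.

{code:java}
// Illustrative sketch only; the names and the Step constructor shape are
// assumptions. The idea is that journal tailing stops adding steps once
// edits loading is done, instead of continuing until the whole startup
// (including the long SAFEMODE phase) completes.
if (prog.getStatus(Phase.LOADING_EDITS) != Status.COMPLETE) {
  Step step = new Step(segmentName, segmentSize); // constructor shape assumed
  prog.beginStep(Phase.LOADING_EDITS, step);
  applyEditsSegment();                            // placeholder for the real work
  prog.endStep(Phase.LOADING_EDITS, step);
}
{code}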



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover

2020-02-20 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041247#comment-17041247
 ] 

Hadoop QA commented on HDFS-15187:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
42s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m  4s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
39s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 58s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
38s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}114m  3s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
31s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}178m 52s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestReconstructStripedFile |
|   | hadoop.hdfs.TestRollingUpgrade |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.6 Server=19.03.6 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-15187 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12994013/HDFS-15187-01.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 2866567ae8e6 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 
08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 181e6d0 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_232 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28814/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28814/testReport/ |
| Max. process+thread count | 2995 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| Console output | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28814/console |
| Powered by | Apache Yetus 0.8.0   

[jira] [Commented] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover

2020-02-20 Thread Jira


[ 
https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041212#comment-17041212
 ] 

Íñigo Goiri commented on HDFS-15187:


Thanks [~ayushtkn] for the patch.
* Can we add a javadoc to processAndHandleReportedBlock explaining what true 
and false mean?
* I would like a more descriptive name for "response."
* The code where we do the "continue" is a little hard to follow; what about:
{code}
  private void processQueuedMessages(Iterable<ReportedBlockInfo> rbis)
      throws IOException {
    boolean response = true;
    for (ReportedBlockInfo rbi : rbis) {
      LOG.debug("Processing previously queued message {}", rbi);
      if (rbi.getReportedState() == null) {
        // This is a DELETE_BLOCK request
        DatanodeStorageInfo storageInfo = rbi.getStorageInfo();
        removeStoredBlock(getStoredBlock(rbi.getBlock()),
            storageInfo.getDatanodeDescriptor());
      } else if (!response) {
        // if the previous IBR processing was skipped, skip processing all
        // further IBRs so as to ensure the same sequence of processing.
        queueReportedBlock(rbi.getStorageInfo(), rbi.getBlock(),
            rbi.getReportedState(), QUEUE_REASON_FUTURE_GENSTAMP);
      } else {
        response = processAndHandleReportedBlock(
            rbi.getStorageInfo(), rbi.getBlock(), rbi.getReportedState(), null);
      }
    }
  }
{code}



> CORRUPT replica mismatch between namenodes after failover
> -
>
> Key: HDFS-15187
> URL: https://issues.apache.org/jira/browse/HDFS-15187
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
> Attachments: HDFS-15187-01.patch
>
>
> The corrupt replica identified by the Active NameNode isn't identified by the 
> other NameNode once that one fails over to Active, in the case where the 
> replica is marked corrupt due to updatePipeline.
> Scenario to repro:
> 1. Create a file; while writing, take one datanode down to trigger an update 
> pipeline.
> 2. Write some more data.
> 3. Close the file.
> 4. Turn the shut-down datanode back on.
> 5. The replica on that datanode will be identified as CORRUPT and the corrupt 
> count will be 1.
> 6. Fail over to the other NameNode.
> 7. Wait for all pending IBR processing.
> 8. The corrupt count will not be the same, and FSCK won't show the corrupt 
> replica.
> 9. Fail back to the first NameNode.
> 10. The corrupt count and corrupt replica will be there.
> The two NameNodes show different state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15177) Split datanode invalide block deletion, to avoid the FsDatasetImpl lock too much time.

2020-02-20 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041204#comment-17041204
 ] 

Stephen O'Donnell commented on HDFS-15177:
--

Initially the lock used the default setting, but HDFS-12137 changed it to fair 
(hardcoded). In HDFS-15150 we have made it configurable.
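
For reference, the fairness difference in plain java.util.concurrent terms:

{code:java}
import java.util.concurrent.locks.ReentrantLock;

// ReentrantLock defaults to the non-fair sync; passing true selects the
// fair (FIFO hand-off) variant that HDFS-12137 hardcoded and HDFS-15150
// now makes configurable.
ReentrantLock nonFair = new ReentrantLock();   // default: NonfairSync
ReentrantLock fair = new ReentrantLock(true);  // FairSync: longest waiter first
{code}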

> Split datanode invalide block deletion, to avoid the FsDatasetImpl lock too 
> much time.
> --
>
> Key: HDFS-15177
> URL: https://issues.apache.org/jira/browse/HDFS-15177
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: image-2020-02-18-22-39-00-642.png, 
> image-2020-02-18-22-51-28-624.png, image-2020-02-18-22-52-59-202.png, 
> image-2020-02-18-22-55-38-661.png
>
>
> In our cluster, the datanode receives delete commands with too many block 
> deletions when many blockpools share the same datanode and the datanode has 
> about 30 storage dirs; this causes the FsDatasetImpl lock to be held too 
> long.
>  
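
A minimal sketch of the splitting idea in the title (names are illustrative, not a patch): delete invalidated blocks in bounded chunks, re-acquiring the dataset lock per chunk so other DN operations can interleave.

{code:java}
import java.util.List;

// Sketch of the splitting idea, not a patch; names are illustrative. Rather
// than holding the FsDatasetImpl lock across one huge invalidate list, take
// the lock once per bounded chunk so readers/writers can interleave.
class ChunkedInvalidator {
  private static final int CHUNK_SIZE = 1000; // assumed bound per lock hold
  private final Object datasetLock = new Object(); // stand-in for the real lock

  void invalidateInChunks(List<String> blockIds) {
    for (int i = 0; i < blockIds.size(); i += CHUNK_SIZE) {
      List<String> chunk =
          blockIds.subList(i, Math.min(i + CHUNK_SIZE, blockIds.size()));
      synchronized (datasetLock) {
        for (String id : chunk) {
          unlinkReplicaFromMemory(id); // assumed helper: in-memory removal only
        }
      }
      // Actual file deletion stays outside the lock (async disk IO).
    }
  }

  private void unlinkReplicaFromMemory(String blockId) { /* elided */ }
}
{code}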



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover

2020-02-20 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-15187:

Status: Patch Available  (was: Open)

> CORRUPT replica mismatch between namenodes after failover
> -
>
> Key: HDFS-15187
> URL: https://issues.apache.org/jira/browse/HDFS-15187
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
> Attachments: HDFS-15187-01.patch
>
>
> The corrupt replica identified by the Active NameNode isn't identified by the 
> other NameNode once that one fails over to Active, in the case where the 
> replica is marked corrupt due to updatePipeline.
> Scenario to repro:
> 1. Create a file; while writing, take one datanode down to trigger an update 
> pipeline.
> 2. Write some more data.
> 3. Close the file.
> 4. Turn the shut-down datanode back on.
> 5. The replica on that datanode will be identified as CORRUPT and the corrupt 
> count will be 1.
> 6. Fail over to the other NameNode.
> 7. Wait for all pending IBR processing.
> 8. The corrupt count will not be the same, and FSCK won't show the corrupt 
> replica.
> 9. Fail back to the first NameNode.
> 10. The corrupt count and corrupt replica will be there.
> The two NameNodes show different state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover

2020-02-20 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-15187:

Attachment: HDFS-15187-01.patch

> CORRUPT replica mismatch between namenodes after failover
> -
>
> Key: HDFS-15187
> URL: https://issues.apache.org/jira/browse/HDFS-15187
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Critical
> Attachments: HDFS-15187-01.patch
>
>
> The corrupt replica identified by the Active NameNode isn't identified by the 
> other NameNode once that one fails over to Active, in the case where the 
> replica is marked corrupt due to updatePipeline.
> Scenario to repro:
> 1. Create a file; while writing, take one datanode down to trigger an update 
> pipeline.
> 2. Write some more data.
> 3. Close the file.
> 4. Turn the shut-down datanode back on.
> 5. The replica on that datanode will be identified as CORRUPT and the corrupt 
> count will be 1.
> 6. Fail over to the other NameNode.
> 7. Wait for all pending IBR processing.
> 8. The corrupt count will not be the same, and FSCK won't show the corrupt 
> replica.
> 9. Fail back to the first NameNode.
> 10. The corrupt count and corrupt replica will be there.
> The two NameNodes show different state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover

2020-02-20 Thread Ayush Saxena (Jira)
Ayush Saxena created HDFS-15187:
---

 Summary: CORRUPT replica mismatch between namenodes after failover
 Key: HDFS-15187
 URL: https://issues.apache.org/jira/browse/HDFS-15187
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Ayush Saxena
Assignee: Ayush Saxena


The corrupt replica identified by the Active NameNode isn't identified by the 
other NameNode once that one fails over to Active, in the case where the 
replica is marked corrupt due to updatePipeline.

Scenario to repro:
1. Create a file; while writing, take one datanode down to trigger an update 
pipeline.
2. Write some more data.
3. Close the file.
4. Turn the shut-down datanode back on.
5. The replica on that datanode will be identified as CORRUPT and the corrupt 
count will be 1.
6. Fail over to the other NameNode.
7. Wait for all pending IBR processing.
8. The corrupt count will not be the same, and FSCK won't show the corrupt 
replica.
9. Fail back to the first NameNode.
10. The corrupt count and corrupt replica will be there.

The two NameNodes show different state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-20 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040973#comment-17040973
 ] 

Hadoop QA commented on HDFS-15186:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
48s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
 1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
7s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m 10s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
40s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
55s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  
0m 38s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch 
generated 6 new + 5 unchanged - 0 fixed = 11 total (was 5) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 21s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
37s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}107m 18s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
33s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}172m 53s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestDeadNodeDetection |
|   | hadoop.hdfs.server.blockmanagement.TestBlockStatsMXBean |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.6 Server=19.03.6 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-15186 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12993977/HDFS-15186.001.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 22a273674b3b 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 
08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / ec75071 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_232 |
| findbugs | v3.1.0-RC1 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28813/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt
 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28813/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28813/testReport/ |
| Max. process+thread 

[jira] [Updated] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-20 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-15186:
---
Target Version/s: 3.3.0, 3.1.4, 3.2.2  (was: 3.3.0)

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Critical
> Attachments: HDFS-15186.001.patch
>
>
> I found parity blocks whose content is all 0 after decommissioning more than 
> one DataNode from a cluster, and the probability is significant (parts per 
> thousand). This is a serious problem: if we read data from a zero parity 
> block, or use it to recover another block, we consume corrupted data without 
> knowing it.
> Some cases are shown below (B: busy DataNode, D: decommissioning DataNode, 
> others are normal):
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class 
> StripedReconstructionInfo) of length 2.
> A targets length of 2 means the DataNode must recover 2 internal blocks under 
> the current code, but from liveIndices only 1 missing block can be found, so 
> StripedWriter#initTargetIndices uses 0 as the default recovery index without 
> checking whether index 0 is already among the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
> recover target indices [6, 0]. Index 0 appears in both the source and target 
> indices in this case, and the returned target buffer for index 6 is then 
> always 0, so one could call this the EC algorithm's problem, since it should 
> be more fault tolerant. I tried to fix it there, but it is too hard because 
> there are too many cases (the second example above leads to source indices 
> [1, 2, 3, 4, 5, 7] recovering target indices [0, 6, 0]). So I changed my 
> mind and invoke the EC algorithm with correct parameters instead, i.e. 
> remove the duplicate target index 0 in this case. That is how I fixed it.
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-20 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-15186:
---
Fix Version/s: (was: 3.3.0)

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Critical
> Attachments: HDFS-15186.001.patch
>
>
> I found parity blocks whose content is all 0 after decommissioning more than 
> one DataNode from a cluster, and the probability is significant (parts per 
> thousand). This is a serious problem: if we read data from a zero parity 
> block, or use it to recover another block, we consume corrupted data without 
> knowing it.
> Some cases are shown below (B: busy DataNode, D: decommissioning DataNode, 
> others are normal):
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class 
> StripedReconstructionInfo) of length 2.
> A targets length of 2 means the DataNode must recover 2 internal blocks under 
> the current code, but from liveIndices only 1 missing block can be found, so 
> StripedWriter#initTargetIndices uses 0 as the default recovery index without 
> checking whether index 0 is already among the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
> recover target indices [6, 0]. Index 0 appears in both the source and target 
> indices in this case, and the returned target buffer for index 6 is then 
> always 0, so one could call this the EC algorithm's problem, since it should 
> be more fault tolerant. I tried to fix it there, but it is too hard because 
> there are too many cases (the second example above leads to source indices 
> [1, 2, 3, 4, 5, 7] recovering target indices [0, 6, 0]). So I changed my 
> mind and invoke the EC algorithm with correct parameters instead, i.e. 
> remove the duplicate target index 0 in this case. That is how I fixed it.
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15120) Refresh BlockPlacementPolicy at runtime.

2020-02-20 Thread Jinglun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040876#comment-17040876
 ] 

Jinglun commented on HDFS-15120:


The failed test is unrelated. Hi [~ayushtkn], would you help review v05? 
Thanks!

> Refresh BlockPlacementPolicy at runtime.
> 
>
> Key: HDFS-15120
> URL: https://issues.apache.org/jira/browse/HDFS-15120
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-15120.001.patch, HDFS-15120.002.patch, 
> HDFS-15120.003.patch, HDFS-15120.004.patch, HDFS-15120.005.patch
>
>
> Currently, if we want to switch BlockPlacementPolicies we need to restart the 
> NameNode. It would be convenient if we could switch at runtime, for example 
> between AvailableSpaceBlockPlacementPolicy and BlockPlacementPolicyDefault as 
> needed.
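
A minimal sketch of the refresh idea (not the v05 patch; the getInstance arguments and the reconfiguration wiring are assumptions): rebuild the policy from the new configuration and swap it behind a volatile reference.

{code:java}
// Sketch of the runtime-switch idea, not the attached patch; the factory
// arguments are assumed. The policy is rebuilt from the new conf and
// swapped atomically, so in-flight placements finish on the old instance
// while new calls see the new one.
private volatile BlockPlacementPolicy placementPolicy;

void refreshBlockPlacementPolicy(Configuration newConf) {
  placementPolicy = BlockPlacementPolicy.getInstance(
      newConf, clusterStats, networkTopology, host2NodesMap); // assumed args
}
{code}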



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-20 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena reassigned HDFS-15186:
---

Assignee: Yao Guangdong

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Critical
> Fix For: 3.3.0
>
> Attachments: HDFS-15186.001.patch
>
>
> I found parity blocks whose content is all 0 after decommissioning more than 
> one DataNode from a cluster, and the probability is significant (parts per 
> thousand). This is a serious problem: if we read data from a zero parity 
> block, or use it to recover another block, we consume corrupted data without 
> knowing it.
> Some cases are shown below (B: busy DataNode, D: decommissioning DataNode, 
> others are normal):
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class 
> StripedReconstructionInfo) of length 2.
> A targets length of 2 means the DataNode must recover 2 internal blocks under 
> the current code, but from liveIndices only 1 missing block can be found, so 
> StripedWriter#initTargetIndices uses 0 as the default recovery index without 
> checking whether index 0 is already among the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
> recover target indices [6, 0]. Index 0 appears in both the source and target 
> indices in this case, and the returned target buffer for index 6 is then 
> always 0, so one could call this the EC algorithm's problem, since it should 
> be more fault tolerant. I tried to fix it there, but it is too hard because 
> there are too many cases (the second example above leads to source indices 
> [1, 2, 3, 4, 5, 7] recovering target indices [0, 6, 0]). So I changed my 
> mind and invoke the EC algorithm with correct parameters instead, i.e. 
> remove the duplicate target index 0 in this case. That is how I fixed it.
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-20 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-15186:

Status: Patch Available  (was: Open)

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.1.3, 3.2.1, 3.0.3
>Reporter: Yao Guangdong
>Priority: Critical
> Fix For: 3.3.0
>
> Attachments: HDFS-15186.001.patch
>
>
> I found parity blocks whose content is all 0 after decommissioning more than 
> one DataNode from a cluster, and the probability is significant (parts per 
> thousand). This is a serious problem: if we read data from a zero parity 
> block, or use it to recover another block, we consume corrupted data without 
> knowing it.
> Some cases are shown below (B: busy DataNode, D: decommissioning DataNode, 
> others are normal):
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class 
> StripedReconstructionInfo) of length 2.
> A targets length of 2 means the DataNode must recover 2 internal blocks under 
> the current code, but from liveIndices only 1 missing block can be found, so 
> StripedWriter#initTargetIndices uses 0 as the default recovery index without 
> checking whether index 0 is already among the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
> recover target indices [6, 0]. Index 0 appears in both the source and target 
> indices in this case, and the returned target buffer for index 6 is then 
> always 0, so one could call this the EC algorithm's problem, since it should 
> be more fault tolerant. I tried to fix it there, but it is too hard because 
> there are too many cases (the second example above leads to source indices 
> [1, 2, 3, 4, 5, 7] recovering target indices [0, 6, 0]). So I changed my 
> mind and invoke the EC algorithm with correct parameters instead, i.e. 
> remove the duplicate target index 0 in this case. That is how I fixed it.
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15177) Split datanode invalide block deletion, to avoid the FsDatasetImpl lock too much time.

2020-02-20 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040816#comment-17040816
 ] 

zhuqi commented on HDFS-15177:
--

Hi [~sodonnell]

Thanks for your reply.

I will monitor the FoldedTreeSet problem, such as HDFS-15131.
You also said that on the 3.x branch the locking in the DN has been changed 
to a fair lock for some time now, but I find that AutoCloseableLock uses a 
ReentrantLock, which by default uses NonfairSync. When will the DN use the 
fair lock?
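
For reference, a minimal JDK-only sketch (not the DataNode code) of the distinction in question: a ReentrantLock is non-fair unless fairness is requested explicitly in its constructor, so a wrapper built on the no-arg constructor is indeed non-fair.

{code:java}
import java.util.concurrent.locks.ReentrantLock;

public class LockFairnessDemo {
  public static void main(String[] args) {
    ReentrantLock defaultLock = new ReentrantLock();   // NonfairSync by default
    ReentrantLock fairLock = new ReentrantLock(true);  // FairSync when requested
    System.out.println(defaultLock.isFair());          // false
    System.out.println(fairLock.isFair());             // true
  }
}
{code}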



 

> Split datanode invalid block deletion, to avoid holding the FsDatasetImpl 
> lock too long.
> --
>
> Key: HDFS-15177
> URL: https://issues.apache.org/jira/browse/HDFS-15177
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: image-2020-02-18-22-39-00-642.png, 
> image-2020-02-18-22-51-28-624.png, image-2020-02-18-22-52-59-202.png, 
> image-2020-02-18-22-55-38-661.png
>
>
> In our cluster, the datanode receives delete commands covering too many 
> blocks when many blockpools share the same datanode and the datanode has 
> about 30 storage dirs; this causes the FsDatasetImpl lock to be held for 
> too long.
>  
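A hypothetical sketch of the splitting idea (illustrative only, not the attached patch; the names are made up): process the invalidated blocks in small batches and reacquire the dataset lock per batch, so other DataNode operations can interleave instead of waiting behind one long deletion.

{code:java}
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class BatchedInvalidate {
  private static final int BATCH_SIZE = 100;  // assumed tuning knob
  private final ReentrantLock datasetLock = new ReentrantLock();

  void invalidate(List<String> blockIds) {
    for (int i = 0; i < blockIds.size(); i += BATCH_SIZE) {
      List<String> batch =
          blockIds.subList(i, Math.min(i + BATCH_SIZE, blockIds.size()));
      datasetLock.lock();  // hold the lock for one batch only
      try {
        for (String blockId : batch) {
          deleteBlockFile(blockId);
        }
      } finally {
        datasetLock.unlock();
      }
    }
  }

  private void deleteBlockFile(String blockId) {
    // placeholder: the real code would remove the replica from the volume map
    // and delete (or asynchronously schedule deletion of) the block files
  }
}
{code}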



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-20 Thread Yao Guangdong (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040794#comment-17040794
 ] 

Yao Guangdong commented on HDFS-15186:
--

Added the initial patch. [~weichiu], PTAL.

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Priority: Critical
> Fix For: 3.3.0
>
> Attachments: HDFS-15186.001.patch
>
>
> I can find some parity blocks whose content is all 0 when I decommission 
> more than one DataNode from a cluster, and the probability is quite high 
> (parts per thousand). This is a serious problem: if we read data from a 
> zero parity block, or use a zero parity block to recover another block, we 
> end up using corrupt data without even knowing it.
> Some example cases are below:
> B: busy DataNode,
> D: decommissioning DataNode,
> others are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 
> 6(B,D), 7, 8(D)], the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (a field in the 
> class StripedReconstructionInfo) of length 2.
> A targets length of 2 means that, in the current code, the DataNode must 
> recover 2 internal blocks. But from liveIndices only 1 missing block can 
> be found, so the method StripedWriter#initTargetIndices uses 0 as the 
> default recovery index without checking whether index 0 is already among 
> the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
> recover target indices [6, 0]. Index 0 appears in both the source indices 
> and the target indices, and in that case the target buffer returned for 
> index 6 is always all 0. One could argue this is the EC algorithm's 
> problem, since it should be more fault tolerant, and I tried to fix it 
> there, but it is too hard because there are too many cases; the second 
> example above is another one (source indices [1, 2, 3, 4, 5, 7] used to 
> recover target indices [0, 6, 0]). So I changed my mind: invoke the EC 
> algorithm with correct parameters, which means removing the duplicate 
> target index 0 in this case. That is how I fixed it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-20 Thread Yao Guangdong (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yao Guangdong updated HDFS-15186:
-
Attachment: HDFS-15186.001.patch

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Priority: Critical
> Fix For: 3.3.0
>
> Attachments: HDFS-15186.001.patch
>
>
> I can find some parity blocks whose content is all 0 when I decommission 
> more than one DataNode from a cluster, and the probability is quite high 
> (parts per thousand). This is a serious problem: if we read data from a 
> zero parity block, or use a zero parity block to recover another block, we 
> end up using corrupt data without even knowing it.
> Some example cases are below:
> B: busy DataNode,
> D: decommissioning DataNode,
> others are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 
> 6(B,D), 7, 8(D)], the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (a field in the 
> class StripedReconstructionInfo) of length 2.
> A targets length of 2 means that, in the current code, the DataNode must 
> recover 2 internal blocks. But from liveIndices only 1 missing block can 
> be found, so the method StripedWriter#initTargetIndices uses 0 as the 
> default recovery index without checking whether index 0 is already among 
> the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
> recover target indices [6, 0]. Index 0 appears in both the source indices 
> and the target indices, and in that case the target buffer returned for 
> index 6 is always all 0. One could argue this is the EC algorithm's 
> problem, since it should be more fault tolerant, and I tried to fix it 
> there, but it is too hard because there are too many cases; the second 
> example above is another one (source indices [1, 2, 3, 4, 5, 7] used to 
> recover target indices [0, 6, 0]). So I changed my mind: invoke the EC 
> algorithm with correct parameters, which means removing the duplicate 
> target index 0 in this case. That is how I fixed it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-20 Thread HuangTao (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040767#comment-17040767
 ] 

HuangTao commented on HDFS-15186:
-

In our production cluster we hit the same issue, and I have discussed it 
with [~yaoguangdong] offline.

[~weichiu] PTAL

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Priority: Critical
> Fix For: 3.3.0
>
>
> I can find some parity blocks whose content is all 0 when I decommission 
> more than one DataNode from a cluster, and the probability is quite high 
> (parts per thousand). This is a serious problem: if we read data from a 
> zero parity block, or use a zero parity block to recover another block, we 
> end up using corrupt data without even knowing it.
> Some example cases are below:
> B: busy DataNode,
> D: decommissioning DataNode,
> others are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 
> 6(B,D), 7, 8(D)], the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (a field in the 
> class StripedReconstructionInfo) of length 2.
> A targets length of 2 means that, in the current code, the DataNode must 
> recover 2 internal blocks. But from liveIndices only 1 missing block can 
> be found, so the method StripedWriter#initTargetIndices uses 0 as the 
> default recovery index without checking whether index 0 is already among 
> the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
> recover target indices [6, 0]. Index 0 appears in both the source indices 
> and the target indices, and in that case the target buffer returned for 
> index 6 is always all 0. One could argue this is the EC algorithm's 
> problem, since it should be more fault tolerant, and I tried to fix it 
> there, but it is too hard because there are too many cases; the second 
> example above is another one (source indices [1, 2, 3, 4, 5, 7] used to 
> recover target indices [0, 6, 0]). So I changed my mind: invoke the EC 
> algorithm with correct parameters, which means removing the duplicate 
> target index 0 in this case. That is how I fixed it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-20 Thread Yao Guangdong (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yao Guangdong updated HDFS-15186:
-
Affects Version/s: (was: 3.0.2)
   3.0.3
   3.2.1
   3.1.3

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Priority: Critical
> Fix For: 3.3.0
>
>
> I can find some parity blocks whose content is all 0 when I decommission 
> more than one DataNode from a cluster, and the probability is quite high 
> (parts per thousand). This is a serious problem: if we read data from a 
> zero parity block, or use a zero parity block to recover another block, we 
> end up using corrupt data without even knowing it.
> Some example cases are below:
> B: busy DataNode,
> D: decommissioning DataNode,
> others are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 
> 6(B,D), 7, 8(D)], the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (a field in the 
> class StripedReconstructionInfo) of length 2.
> A targets length of 2 means that, in the current code, the DataNode must 
> recover 2 internal blocks. But from liveIndices only 1 missing block can 
> be found, so the method StripedWriter#initTargetIndices uses 0 as the 
> default recovery index without checking whether index 0 is already among 
> the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
> recover target indices [6, 0]. Index 0 appears in both the source indices 
> and the target indices, and in that case the target buffer returned for 
> index 6 is always all 0. One could argue this is the EC algorithm's 
> problem, since it should be more fault tolerant, and I tried to fix it 
> there, but it is too hard because there are too many cases; the second 
> example above is another one (source indices [1, 2, 3, 4, 5, 7] used to 
> recover target indices [0, 6, 0]). So I changed my mind: invoke the EC 
> algorithm with correct parameters, which means removing the duplicate 
> target index 0 in this case. That is how I fixed it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-20 Thread Yao Guangdong (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yao Guangdong updated HDFS-15186:
-
Fix Version/s: (was: 3.2.2)
   (was: 3.1.4)
   (was: 3.0.4)

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.2
>Reporter: Yao Guangdong
>Priority: Critical
> Fix For: 3.3.0
>
>
> I can find some parity blocks whose content is all 0 when I decommission 
> more than one DataNode from a cluster, and the probability is quite high 
> (parts per thousand). This is a serious problem: if we read data from a 
> zero parity block, or use a zero parity block to recover another block, we 
> end up using corrupt data without even knowing it.
> Some example cases are below:
> B: busy DataNode,
> D: decommissioning DataNode,
> others are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 
> 6(B,D), 7, 8(D)], the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (a field in the 
> class StripedReconstructionInfo) of length 2.
> A targets length of 2 means that, in the current code, the DataNode must 
> recover 2 internal blocks. But from liveIndices only 1 missing block can 
> be found, so the method StripedWriter#initTargetIndices uses 0 as the 
> default recovery index without checking whether index 0 is already among 
> the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
> recover target indices [6, 0]. Index 0 appears in both the source indices 
> and the target indices, and in that case the target buffer returned for 
> index 6 is always all 0. One could argue this is the EC algorithm's 
> problem, since it should be more fault tolerant, and I tried to fix it 
> there, but it is too hard because there are too many cases; the second 
> example above is another one (source indices [1, 2, 3, 4, 5, 7] used to 
> recover target indices [0, 6, 0]). So I changed my mind: invoke the EC 
> algorithm with correct parameters, which means removing the duplicate 
> target index 0 in this case. That is how I fixed it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-20 Thread Yao Guangdong (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yao Guangdong updated HDFS-15186:
-
Affects Version/s: (was: 3.1.3)
   (was: 3.2.1)
   (was: 3.0.3)
   3.0.2

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.2
>Reporter: Yao Guangdong
>Priority: Critical
> Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2
>
>
> I can find some parity blocks whose content is all 0 when I decommission 
> more than one DataNode from a cluster, and the probability is quite high 
> (parts per thousand). This is a serious problem: if we read data from a 
> zero parity block, or use a zero parity block to recover another block, we 
> end up using corrupt data without even knowing it.
> Some example cases are below:
> B: busy DataNode,
> D: decommissioning DataNode,
> others are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 
> 6(B,D), 7, 8(D)], the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (a field in the 
> class StripedReconstructionInfo) of length 2.
> A targets length of 2 means that, in the current code, the DataNode must 
> recover 2 internal blocks. But from liveIndices only 1 missing block can 
> be found, so the method StripedWriter#initTargetIndices uses 0 as the 
> default recovery index without checking whether index 0 is already among 
> the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
> recover target indices [6, 0]. Index 0 appears in both the source indices 
> and the target indices, and in that case the target buffer returned for 
> index 6 is always all 0. One could argue this is the EC algorithm's 
> problem, since it should be more fault tolerant, and I tried to fix it 
> there, but it is too hard because there are too many cases; the second 
> example above is another one (source indices [1, 2, 3, 4, 5, 7] used to 
> recover target indices [0, 6, 0]). So I changed my mind: invoke the EC 
> algorithm with correct parameters, which means removing the duplicate 
> target index 0 in this case. That is how I fixed it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-20 Thread Yao Guangdong (Jira)
Yao Guangdong created HDFS-15186:


 Summary: Erasure Coding: Decommission may generate the parity 
block's content with all 0 in some case
 Key: HDFS-15186
 URL: https://issues.apache.org/jira/browse/HDFS-15186
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, erasure-coding
Affects Versions: 3.1.3, 3.2.1, 3.0.3
Reporter: Yao Guangdong
 Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2


I can find some parity blocks whose content is all 0 when I decommission 
more than one DataNode from a cluster, and the probability is quite high 
(parts per thousand). This is a serious problem: if we read data from a 
zero parity block, or use a zero parity block to recover another block, we 
end up using corrupt data without even knowing it.

Some example cases are below:

B: busy DataNode,

D: decommissioning DataNode,

others are normal.

1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].

2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].



In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 
6(B,D), 7, 8(D)], the DN may receive a reconstruct-block command with 
liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (a field in the 
class StripedReconstructionInfo) of length 2.

A targets length of 2 means that, in the current code, the DataNode must 
recover 2 internal blocks. But from liveIndices only 1 missing block can be 
found, so the method StripedWriter#initTargetIndices uses 0 as the default 
recovery index without checking whether index 0 is already among the source 
indices.

The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
recover target indices [6, 0]. Index 0 appears in both the source indices 
and the target indices, and in that case the target buffer returned for 
index 6 is always all 0. One could argue this is the EC algorithm's 
problem, since it should be more fault tolerant, and I tried to fix it 
there, but it is too hard because there are too many cases; the second 
example above is another one (source indices [1, 2, 3, 4, 5, 7] used to 
recover target indices [0, 6, 0]). So I changed my mind: invoke the EC 
algorithm with correct parameters, which means removing the duplicate 
target index 0 in this case. That is how I fixed it.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org