[jira] [Updated] (HDDS-2255) Improve Acl Handler Messages

2019-10-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-2255:
-
Labels: newbie pull-request-available  (was: newbie)

> Improve Acl Handler Messages
> 
>
> Key: HDDS-2255
> URL: https://issues.apache.org/jira/browse/HDDS-2255
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: om
>Reporter: Hanisha Koneru
>Assignee: YiSheng Lien
>Priority: Minor
>  Labels: newbie, pull-request-available
>
> In the Add/Remove/Set Acl Key/Bucket/Volume handlers, we print a message about 
> whether the operation was successful or not. If we try to add an ACL 
> that already exists, we only convey that the operation failed. 
> It would be better if the message conveyed more clearly why the operation 
> failed, i.e. that the ACL already exists. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work logged] (HDDS-2255) Improve Acl Handler Messages

2019-10-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2255?focusedWorklogId=334892&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-334892
 ]

ASF GitHub Bot logged work on HDDS-2255:


Author: ASF GitHub Bot
Created on: 28/Oct/19 11:16
Start Date: 28/Oct/19 11:16
Worklog Time Spent: 10m 
  Work Description: cxorm commented on pull request #94: HDDS-2255. Improve 
Acl Handler Messages
URL: https://github.com/apache/hadoop-ozone/pull/94
 
 
   ## What changes were proposed in this pull request?
   Add a ```checkAclExist()``` method in ```ObjectStore.java``` and have 
```addAclHandler``` call it so that the proper message is shown.
   
   ## What is the link to the Apache JIRA
   https://issues.apache.org/jira/browse/HDDS-2255
   
   ## How was this patch tested?
   Build and check the message.
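   For illustration, a minimal self-contained sketch (hypothetical names, not the actual PR code) of how an existence check lets the handler explain why an add is a no-op:
   ```java
   // Hypothetical, self-contained sketch (not the actual PR code): check whether
   // an ACL already exists before adding it, so the handler can say *why* the
   // add did not change anything.
   import java.util.ArrayList;
   import java.util.List;

   public class AclMessageSketch {

     // Stand-in for a checkAclExist()-style helper.
     static boolean checkAclExist(List<String> currentAcls, String acl) {
       return currentAcls.contains(acl);
     }

     // Stand-in for the add-ACL handler: prints a message explaining the outcome.
     static void addAclHandler(List<String> currentAcls, String acl) {
       if (checkAclExist(currentAcls, acl)) {
         System.out.println("ACL not added: " + acl + " already exists.");
         return;
       }
       currentAcls.add(acl);
       System.out.println("ACL " + acl + " added successfully.");
     }

     public static void main(String[] args) {
       List<String> acls = new ArrayList<>();
       addAclHandler(acls, "user:hadoop:rw");  // added successfully
       addAclHandler(acls, "user:hadoop:rw");  // already exists
     }
   }
   ```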
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 334892)
Remaining Estimate: 0h
Time Spent: 10m

> Improve Acl Handler Messages
> 
>
> Key: HDDS-2255
> URL: https://issues.apache.org/jira/browse/HDDS-2255
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: om
>Reporter: Hanisha Koneru
>Assignee: YiSheng Lien
>Priority: Minor
>  Labels: newbie, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In the Add/Remove/Set Acl Key/Bucket/Volume handlers, we print a message about 
> whether the operation was successful or not. If we try to add an ACL 
> that already exists, we only convey that the operation failed. 
> It would be better if the message conveyed more clearly why the operation 
> failed, i.e. that the ACL already exists. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-2371) Print out the ozone version during the startup instead of hadoop version

2019-10-28 Thread YiSheng Lien (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

YiSheng Lien reassigned HDDS-2371:
--

Assignee: YiSheng Lien

> Print out the ozone version during the startup instead of hadoop version
> 
>
> Key: HDDS-2371
> URL: https://issues.apache.org/jira/browse/HDDS-2371
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Marton Elek
>Assignee: YiSheng Lien
>Priority: Major
>  Labels: newbie
>
> Ozone components print out the current version during startup:
>  
> {code:java}
> STARTUP_MSG: Starting StorageContainerManager
> STARTUP_MSG:   host = om/10.8.0.145
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 3.2.0
> STARTUP_MSG:   build = https://github.com/apache/hadoop.git -r 
> e97acb3bd8f3befd27418996fa5d4b50bf2e17bf; compiled by 'sunilg' on 
> 2019-01-{code}
> But as visible above, the build/compile information is about hadoop, not 
> about hadoop-ozone.
> (And personally I prefer to use a GitHub-compatible URL instead of the SVN 
> style -r. Something like:
> {code:java}
> STARTUP_MSG: build =  
> https://github.com/apache/hadoop-ozone/commit/8541c5694efebb58f53cf4665d3e4e6e4a12845c
>  ; compiled by '' on ...{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14937) [SBN read] ObserverReadProxyProvider should throw InterruptException

2019-10-28 Thread xuzq (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuzq updated HDFS-14937:

Attachment: HDFS-14937-trunk-001.patch
Status: Patch Available  (was: Open)

> [SBN read] ObserverReadProxyProvider should throw InterruptException
> 
>
> Key: HDFS-14937
> URL: https://issues.apache.org/jira/browse/HDFS-14937
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: xuzq
>Assignee: xuzq
>Priority: Major
> Attachments: HDFS-14937-trunk-001.patch
>
>
> ObserverReadProxyProvider should throw the InterruptException immediately if an 
> Observer catches an InterruptException while invoking.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-10-28 Thread Marton Elek (Jira)
Marton Elek created HDDS-2372:
-

 Summary: Datanode pipeline is failing with NoSuchFileException
 Key: HDDS-2372
 URL: https://issues.apache.org/jira/browse/HDDS-2372
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
Reporter: Marton Elek


Found it on a k8s-based test cluster using a simple 3-node cluster and the 
HDDS-2327 freon test. After a while the StateMachine becomes unhealthy after 
this error:
{code:java}
datanode-0 datanode java.util.concurrent.ExecutionException: 
java.util.concurrent.ExecutionException: 
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: 
java.nio.file.NoSuchFileException: 
/data/storage/hdds/2a77fab9-9dc5-4f73-9501-b5347ac6145c/current/containerDir0/1/chunks/gGYYgiTTeg_testdata_chunk_13931.tmp.2.20830
 {code}
Can be reproduced.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14937) [SBN read] ObserverReadProxyProvider should throw InterruptException

2019-10-28 Thread xuzq (Jira)
xuzq created HDFS-14937:
---

 Summary: [SBN read] ObserverReadProxyProvider should throw 
InterruptException
 Key: HDFS-14937
 URL: https://issues.apache.org/jira/browse/HDFS-14937
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: xuzq
Assignee: xuzq


ObserverReadProxyProvider should throw the InterruptException immediately if an 
Observer catches an InterruptException while invoking.
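For illustration, a simplified, self-contained sketch (not the actual patch; the proxy iteration is reduced to a plain loop over callables) of rethrowing the interruption immediately instead of falling through to the next observer:
{code:java}
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical sketch (not the actual patch): when iterating over observer
// proxies, an InterruptedException should be rethrown immediately instead of
// being swallowed and treated like any other per-observer failure.
public class ObserverInvokeSketch {

  static <T> T invokeOnObservers(List<Callable<T>> observers) throws Exception {
    for (Callable<T> observer : observers) {
      try {
        return observer.call();
      } catch (InterruptedException e) {
        // Propagate interruption right away; do not fall through to the next observer.
        Thread.currentThread().interrupt();
        throw e;
      } catch (Exception e) {
        // Any other failure: log and try the next observer.
        System.err.println("Observer failed, trying next: " + e);
      }
    }
    throw new Exception("All observers failed");
  }
}
{code}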



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2371) Print out the ozone version during the startup instead of hadoop version

2019-10-28 Thread Marton Elek (Jira)
Marton Elek created HDDS-2371:
-

 Summary: Print out the ozone version during the startup instead of 
hadoop version
 Key: HDDS-2371
 URL: https://issues.apache.org/jira/browse/HDDS-2371
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
Reporter: Marton Elek


Ozone components print out the current version during startup:

 
{code:java}
STARTUP_MSG: Starting StorageContainerManager
STARTUP_MSG:   host = om/10.8.0.145
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 3.2.0
STARTUP_MSG:   build = https://github.com/apache/hadoop.git -r 
e97acb3bd8f3befd27418996fa5d4b50bf2e17bf; compiled by 'sunilg' on 2019-01-{code}
But as visible above, the build/compile information is about hadoop, not about 
hadoop-ozone.

(And personally I prefer to use a GitHub-compatible URL instead of the SVN 
style -r. Something like:
{code:java}
STARTUP_MSG: build =  
https://github.com/apache/hadoop-ozone/commit/8541c5694efebb58f53cf4665d3e4e6e4a12845c
 ; compiled by '' on ...{code}
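For illustration, a minimal standalone sketch (not the actual Ozone code; the metadata constants are placeholders that would normally come from a generated version-info file) of a startup banner that reports the Ozone build with a GitHub-style commit URL:
{code:java}
// Hypothetical sketch: build a startup banner from build-time metadata,
// using a GitHub commit URL instead of the SVN-style "-r".
public class StartupBanner {
  // Placeholder values; in practice these would come from generated build metadata.
  static final String COMPONENT = "StorageContainerManager";
  static final String VERSION   = "0.5.0-SNAPSHOT";
  static final String COMMIT    = "8541c5694efebb58f53cf4665d3e4e6e4a12845c";
  static final String BUILT_BY  = "builder";

  public static void main(String[] args) {
    System.out.println("STARTUP_MSG: Starting " + COMPONENT);
    System.out.println("STARTUP_MSG:   version = " + VERSION);
    System.out.println("STARTUP_MSG:   build = "
        + "https://github.com/apache/hadoop-ozone/commit/" + COMMIT
        + " ; compiled by '" + BUILT_BY + "' on " + java.time.LocalDate.now());
  }
}
{code}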
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14768) EC : Busy DN replica should be consider in live replica check.

2019-10-28 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960879#comment-16960879
 ] 

Hadoop QA commented on HDFS-14768:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
46s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 2 new or modified test 
files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 
42s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
59s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
48s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
6s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m 30s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
12s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
41s{color} | {color:green} hadoop-hdfs-project/hadoop-hdfs: The patch generated 
0 new + 201 unchanged - 1 fixed = 201 total (was 202) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
13m 27s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
19s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
13s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}103m 39s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
33s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}165m 59s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover |
|   | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | HDFS-14768 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12984142/HDFS-14768.010.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 50281e41fa8c 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 
05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 7be5508 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28188/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28188/testReport/ |
| Max. process+thread count | 2669 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
| 

[jira] [Reopened] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-28 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng reopened HDDS-2356:

  Assignee: (was: Bharat Viswanadham)

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a FUSE and enable the ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from the VM0 local disk and writing to the mount path. The 
> dataset has files of various sizes from 0 bytes to GB-level and contains 
> around 50,000 files. 
> The writing is slow (1GB in ~10 mins) and it stops after around 4GB. As I 
> look at the hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related to Multipart upload. This error eventually causes the writing to 
> terminate and OM to be closed. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:
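For context, a standalone sketch (plain JDK code, not Ozone code) showing how a TreeMap.forEach in one thread can fail with ConcurrentModificationException when another thread mutates the same map, which is consistent with the stack trace above:
{code:java}
import java.util.TreeMap;
import java.util.concurrent.ConcurrentModificationException;

// Standalone sketch (plain JDK, not Ozone code): TreeMap is not thread-safe,
// so iterating it with forEach() while another thread mutates it typically
// fails fast with ConcurrentModificationException.
public class TreeMapCmeSketch {
  public static void main(String[] args) throws Exception {
    TreeMap<Integer, String> parts = new TreeMap<>();
    for (int i = 0; i < 1_000; i++) {
      parts.put(i, "part-" + i);
    }

    Thread writer = new Thread(() -> {
      for (int i = 1_000; i < 200_000; i++) {
        parts.put(i, "part-" + i);  // concurrent mutation
      }
    });
    writer.start();

    try {
      while (writer.isAlive()) {
        // Read-only iteration, similar in shape to serializing a part map entry by entry.
        parts.forEach((partNumber, partName) -> { });
      }
      System.out.println("No conflict observed this run (timing dependent).");
    } catch (ConcurrentModificationException e) {
      System.out.println("Concurrent mutation detected: " + e);
    }
    writer.join();
  }
}
{code}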



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-28 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960870#comment-16960870
 ] 

Li Cheng commented on HDDS-2356:


I am taking this Jira to track issues that seem to be related to Multipart upload 
in my testing. Reopening it. 

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a FUSE and enable the ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from the VM0 local disk and writing to the mount path. The 
> dataset has files of various sizes from 0 bytes to GB-level and contains 
> around 50,000 files. 
> The writing is slow (1GB in ~10 mins) and it stops after around 4GB. As I 
> look at the hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related to Multipart upload. This error eventually causes the writing to 
> terminate and OM to be closed. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-28 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960866#comment-16960866
 ] 

Li Cheng commented on HDDS-2356:


MISMATCH_MULTIPART_LIST seems to be a recurring error. The write is never able to 
finish. 

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a FUSE and enable the ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from the VM0 local disk and writing to the mount path. The 
> dataset has files of various sizes from 0 bytes to GB-level and contains 
> around 50,000 files. 
> The writing is slow (1GB in ~10 mins) and it stops after around 4GB. As I 
> look at the hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related to Multipart upload. This error eventually causes the writing to 
> terminate and OM to be closed. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14936) Add getNumOfChildren() for interface InnerNode

2019-10-28 Thread Lisheng Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lisheng Sun updated HDFS-14936:
---
Attachment: HDFS-14936.001.patch
Status: Patch Available  (was: Open)

> Add getNumOfChildren() for interface InnerNode
> --
>
> Key: HDFS-14936
> URL: https://issues.apache.org/jira/browse/HDFS-14936
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Lisheng Sun
>Priority: Minor
> Attachments: HDFS-14936.001.patch
>
>
> In the current code, the InnerNode subclasses InnerNodeImpl and DFSTopologyNodeImpl 
> both have getNumOfChildren(). 
> So add getNumOfChildren() to the InnerNode interface and remove the unnecessary 
> getNumOfChildren() in DFSTopologyNodeImpl.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14936) Add getNumOfChildren() for interface InnerNode

2019-10-28 Thread Lisheng Sun (Jira)
Lisheng Sun created HDFS-14936:
--

 Summary: Add getNumOfChildren() for interface InnerNode
 Key: HDFS-14936
 URL: https://issues.apache.org/jira/browse/HDFS-14936
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Lisheng Sun


In the current code, the InnerNode subclasses InnerNodeImpl and DFSTopologyNodeImpl both 
have getNumOfChildren(). 

So add getNumOfChildren() to the InnerNode interface and remove the unnecessary 
getNumOfChildren() in DFSTopologyNodeImpl.
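For illustration, a simplified sketch (hypothetical shapes, not the actual HDFS classes) of moving getNumOfChildren() onto the interface so the subclass override becomes unnecessary:
{code:java}
import java.util.ArrayList;
import java.util.List;

// Simplified sketch (hypothetical shapes, not the actual HDFS classes):
// declaring getNumOfChildren() on the interface lets callers use it
// polymorphically and removes the duplicate declaration in the subclass.
interface InnerNode {
  int getNumOfChildren();
}

class InnerNodeImpl implements InnerNode {
  private final List<InnerNode> children = new ArrayList<>();

  @Override
  public int getNumOfChildren() {
    return children.size();
  }
}

class DFSTopologyNodeImpl extends InnerNodeImpl {
  // No separate getNumOfChildren() needed: it is inherited from InnerNodeImpl
  // and is already part of the InnerNode interface contract.
}
{code}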



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14768) EC : Busy DN replica should be consider in live replica check.

2019-10-28 Thread guojh (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960850#comment-16960850
 ] 

guojh commented on HDFS-14768:
--

[~surendrasingh] Thanks for your UT. I updated the code and fixed the checkstyle 
error. Please review it. Thank you very much!

> EC : Busy DN replica should be consider in live replica check.
> --
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.006.patch, HDFS-14768.007.patch, HDFS-14768.008.patch, 
> HDFS-14768.009.patch, HDFS-14768.010.patch, HDFS-14768.jpg, 
> guojh_UT_after_deomission.txt, guojh_UT_before_deomission.txt, 
> zhaoyiming_UT_after_deomission.txt, zhaoyiming_UT_beofre_deomission.txt
>
>
> The policy is RS-6-3-1024K, and the version is hadoop 3.0.2.
> Suppose a file's block indices are [0,1,2,3,4,5,6,7,8], we decommission 
> indices [3,4], and we increase the index-6 datanode's
> pendingReplicationWithoutTargets so that it is larger than 
> replicationStreamsHardLimit (we set 14). Then, after the method 
> chooseSourceDatanodes of BlockManager, the liveBlockIndices is 
> [0,1,2,3,4,5,7,8], and the block counters are Live: 7, Decommission: 2. 
> In the method scheduleReconstruction of BlockManager, the additionalReplRequired 
> is 9 - 7 = 2. After the Namenode chooses two target Datanodes, it will assign an 
> erasureCode task to the target datanode.
> When the datanode gets the task, it will build targetIndices from liveBlockIndices 
> and the target length. The code is below.
> {code:java}
> // code placeholder
> targetIndices = new short[targets.length];
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0;
>   hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0]=6, and targetIndices[1] is always 0 from its initial value.
> The StripedReader always creates readers from the first 6 index blocks, i.e. 
> [0,1,2,3,4,5].
> Using the indices [0,1,2,3,4,5] to build the target indices [6,0] will trigger the ISA-L 
> bug: the data of the block at index 6 is corrupted (all data is zero).
> I wrote a unit test that can stably reproduce it.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List<DatanodeInfo> decommisionNodes = new ArrayList<>();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   

[jira] [Comment Edited] (HDFS-14907) [Dynamometer] DataNode can't find junit jar when using Hadoop-3 binary

2019-10-28 Thread Takanobu Asanuma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960845#comment-16960845
 ] 

Takanobu Asanuma edited comment on HDFS-14907 at 10/28/19 8:07 AM:
---

Thanks for your advice, [~xkrogen].

start-component.sh looks like a good place. I sent a PR. I've confirmed that 
the dyno-datanodes run successfully with the PR.


was (Author: tasanuma0829):
Thanks for your advice, [~xkrogen].

start-component.sh looks the good place. I sent a PR.

> [Dynamometer] DataNode can't find junit jar when using Hadoop-3 binary
> --
>
> Key: HDFS-14907
> URL: https://issues.apache.org/jira/browse/HDFS-14907
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Major
>
> When executing {{start-dynamometer-cluster.sh}} with Hadoop-3 binary, 
> datanodes fail to run with the following log and 
> {{start-dynamometer-cluster.sh}} fails.
> {noformat}
> LogType:stderr
> LogLastModifiedTime:Wed Oct 09 15:03:09 +0900 2019
> LogLength:1386
> LogContents:
> Exception in thread "main" java.lang.NoClassDefFoundError: org/junit/Assert
> at 
> org.apache.hadoop.test.GenericTestUtils.assertExists(GenericTestUtils.java:299)
> at 
> org.apache.hadoop.test.GenericTestUtils.getTestDir(GenericTestUtils.java:243)
> at 
> org.apache.hadoop.test.GenericTestUtils.getTestDir(GenericTestUtils.java:252)
> at 
> org.apache.hadoop.hdfs.MiniDFSCluster.getBaseDirectory(MiniDFSCluster.java:2982)
> at 
> org.apache.hadoop.hdfs.MiniDFSCluster.determineDfsBaseDir(MiniDFSCluster.java:2972)
> at 
> org.apache.hadoop.hdfs.MiniDFSCluster.formatDataNodeDirs(MiniDFSCluster.java:2834)
> at 
> org.apache.hadoop.tools.dynamometer.SimulatedDataNodes.run(SimulatedDataNodes.java:123)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at 
> org.apache.hadoop.tools.dynamometer.SimulatedDataNodes.main(SimulatedDataNodes.java:88)
> Caused by: java.lang.ClassNotFoundException: org.junit.Assert
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 9 more
> ./start-component.sh: line 317: kill: (2261) - No such process
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14907) [Dynamometer] DataNode can't find junit jar when using Hadoop-3 binary

2019-10-28 Thread Takanobu Asanuma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takanobu Asanuma updated HDFS-14907:

Status: Patch Available  (was: Open)

> [Dynamometer] DataNode can't find junit jar when using Hadoop-3 binary
> --
>
> Key: HDFS-14907
> URL: https://issues.apache.org/jira/browse/HDFS-14907
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Major
>
> When executing {{start-dynamometer-cluster.sh}} with Hadoop-3 binary, 
> datanodes fail to run with the following log and 
> {{start-dynamometer-cluster.sh}} fails.
> {noformat}
> LogType:stderr
> LogLastModifiedTime:Wed Oct 09 15:03:09 +0900 2019
> LogLength:1386
> LogContents:
> Exception in thread "main" java.lang.NoClassDefFoundError: org/junit/Assert
> at 
> org.apache.hadoop.test.GenericTestUtils.assertExists(GenericTestUtils.java:299)
> at 
> org.apache.hadoop.test.GenericTestUtils.getTestDir(GenericTestUtils.java:243)
> at 
> org.apache.hadoop.test.GenericTestUtils.getTestDir(GenericTestUtils.java:252)
> at 
> org.apache.hadoop.hdfs.MiniDFSCluster.getBaseDirectory(MiniDFSCluster.java:2982)
> at 
> org.apache.hadoop.hdfs.MiniDFSCluster.determineDfsBaseDir(MiniDFSCluster.java:2972)
> at 
> org.apache.hadoop.hdfs.MiniDFSCluster.formatDataNodeDirs(MiniDFSCluster.java:2834)
> at 
> org.apache.hadoop.tools.dynamometer.SimulatedDataNodes.run(SimulatedDataNodes.java:123)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at 
> org.apache.hadoop.tools.dynamometer.SimulatedDataNodes.main(SimulatedDataNodes.java:88)
> Caused by: java.lang.ClassNotFoundException: org.junit.Assert
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 9 more
> ./start-component.sh: line 317: kill: (2261) - No such process
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14907) [Dynamometer] DataNode can't find junit jar when using Hadoop-3 binary

2019-10-28 Thread Takanobu Asanuma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960845#comment-16960845
 ] 

Takanobu Asanuma commented on HDFS-14907:
-

Thanks for your advice, [~xkrogen].

start-component.sh looks like a good place. I sent a PR.

> [Dynamometer] DataNode can't find junit jar when using Hadoop-3 binary
> --
>
> Key: HDFS-14907
> URL: https://issues.apache.org/jira/browse/HDFS-14907
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Major
>
> When executing {{start-dynamometer-cluster.sh}} with Hadoop-3 binary, 
> datanodes fail to run with the following log and 
> {{start-dynamometer-cluster.sh}} fails.
> {noformat}
> LogType:stderr
> LogLastModifiedTime:Wed Oct 09 15:03:09 +0900 2019
> LogLength:1386
> LogContents:
> Exception in thread "main" java.lang.NoClassDefFoundError: org/junit/Assert
> at 
> org.apache.hadoop.test.GenericTestUtils.assertExists(GenericTestUtils.java:299)
> at 
> org.apache.hadoop.test.GenericTestUtils.getTestDir(GenericTestUtils.java:243)
> at 
> org.apache.hadoop.test.GenericTestUtils.getTestDir(GenericTestUtils.java:252)
> at 
> org.apache.hadoop.hdfs.MiniDFSCluster.getBaseDirectory(MiniDFSCluster.java:2982)
> at 
> org.apache.hadoop.hdfs.MiniDFSCluster.determineDfsBaseDir(MiniDFSCluster.java:2972)
> at 
> org.apache.hadoop.hdfs.MiniDFSCluster.formatDataNodeDirs(MiniDFSCluster.java:2834)
> at 
> org.apache.hadoop.tools.dynamometer.SimulatedDataNodes.run(SimulatedDataNodes.java:123)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at 
> org.apache.hadoop.tools.dynamometer.SimulatedDataNodes.main(SimulatedDataNodes.java:88)
> Caused by: java.lang.ClassNotFoundException: org.junit.Assert
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 9 more
> ./start-component.sh: line 317: kill: (2261) - No such process
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2370) Remove classpath in RunningWithHDFS.md ozone-hdfs/docker-compose as dir 'ozoneplugin' is not exist anymore

2019-10-28 Thread luhuachao (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960844#comment-16960844
 ] 

luhuachao commented on HDDS-2370:
-

[~adoroszlai] Thanks for the reply, I would like to work on this.

> Remove classpath in RunningWithHDFS.md ozone-hdfs/docker-compose as dir 
> 'ozoneplugin' is not exist anymore
> --
>
> Key: HDDS-2370
> URL: https://issues.apache.org/jira/browse/HDDS-2370
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>  Components: documentation
>Reporter: luhuachao
>Priority: Major
> Attachments: HDDS-2370.1.patch
>
>
> In RunningWithHDFS.md 
> {code:java}
> export 
> HADOOP_CLASSPATH=/opt/ozone/share/hadoop/ozoneplugin/hadoop-ozone-datanode-plugin.jar{code}
> ozone-hdfs/docker-compose.yaml
>  
> {code:java}
>   environment:
>  HADOOP_CLASSPATH: /opt/ozone/share/hadoop/ozoneplugin/*.jar
> {code}
> When I run HddsDatanodeService as a plugin in the HDFS datanode, it fails with 
> the error below; there is no constructor without parameters.
>  
>  
> {code:java}
> 2019-10-21 21:38:56,391 ERROR datanode.DataNode 
> (DataNode.java:startPlugins(972)) - Unable to load DataNode plugins. 
> Specified list of plugins: org.apache.hadoop.ozone.HddsDatanodeService
> java.lang.RuntimeException: java.lang.NoSuchMethodException: 
> org.apache.hadoop.ozone.HddsDatanodeService.<init>()
> {code}
> What I suspect is that ozone-0.5 does not support running as a plugin in the hdfs 
> datanode now? If so, 
> why don't we remove the doc RunningWithHDFS.md? 
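For reference, a standalone sketch (plain JDK reflection, not the actual DataNode plugin loader) of why a class without a no-argument constructor produces a NoSuchMethodException like the one above:
{code:java}
// Standalone sketch (plain JDK reflection, not the actual DataNode code):
// reflective instantiation via the no-arg constructor fails with
// NoSuchMethodException when the plugin class only has parameterized
// constructors, which matches the HddsDatanodeService.<init>() error above.
public class PluginLoadSketch {

  static class NeedsConfig {            // hypothetical plugin with no no-arg constructor
    NeedsConfig(String conf) { }
  }

  static Object newPlugin(Class<?> clazz) throws Exception {
    return clazz.getDeclaredConstructor().newInstance();
  }

  public static void main(String[] args) throws Exception {
    System.out.println(newPlugin(java.util.ArrayList.class));  // works: has a no-arg constructor
    try {
      newPlugin(NeedsConfig.class);
    } catch (NoSuchMethodException e) {
      System.out.println("No no-arg constructor: " + e);
    }
  }
}
{code}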



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-14907) [Dynamometer] DataNode can't find junit jar when using Hadoop-3 binary

2019-10-28 Thread Takanobu Asanuma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takanobu Asanuma reassigned HDFS-14907:
---

Assignee: Takanobu Asanuma

> [Dynamometer] DataNode can't find junit jar when using Hadoop-3 binary
> --
>
> Key: HDFS-14907
> URL: https://issues.apache.org/jira/browse/HDFS-14907
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Takanobu Asanuma
>Assignee: Takanobu Asanuma
>Priority: Major
>
> When executing {{start-dynamometer-cluster.sh}} with Hadoop-3 binary, 
> datanodes fail to run with the following log and 
> {{start-dynamometer-cluster.sh}} fails.
> {noformat}
> LogType:stderr
> LogLastModifiedTime:Wed Oct 09 15:03:09 +0900 2019
> LogLength:1386
> LogContents:
> Exception in thread "main" java.lang.NoClassDefFoundError: org/junit/Assert
> at 
> org.apache.hadoop.test.GenericTestUtils.assertExists(GenericTestUtils.java:299)
> at 
> org.apache.hadoop.test.GenericTestUtils.getTestDir(GenericTestUtils.java:243)
> at 
> org.apache.hadoop.test.GenericTestUtils.getTestDir(GenericTestUtils.java:252)
> at 
> org.apache.hadoop.hdfs.MiniDFSCluster.getBaseDirectory(MiniDFSCluster.java:2982)
> at 
> org.apache.hadoop.hdfs.MiniDFSCluster.determineDfsBaseDir(MiniDFSCluster.java:2972)
> at 
> org.apache.hadoop.hdfs.MiniDFSCluster.formatDataNodeDirs(MiniDFSCluster.java:2834)
> at 
> org.apache.hadoop.tools.dynamometer.SimulatedDataNodes.run(SimulatedDataNodes.java:123)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> at 
> org.apache.hadoop.tools.dynamometer.SimulatedDataNodes.main(SimulatedDataNodes.java:88)
> Caused by: java.lang.ClassNotFoundException: org.junit.Assert
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 9 more
> ./start-component.sh: line 317: kill: (2261) - No such process
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission

2019-10-28 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960829#comment-16960829
 ] 

Fei Hui edited comment on HDFS-14920 at 10/28/19 7:58 AM:
--

{quote}
other storages contains this internal block should be decommissioning
{quote}
This comment is wrong; I have modified it.
The function *countReplicasForStripedBlock* is used for recomputing the LIVE 
replicas for the same internal block.
One case that uses it:
{code}
// Count replicas on decommissioning nodes, as these will not be
// decommissioned unless recovery/completing last block has finished
NumberReplicas numReplicas = countNodes(lastBlock);
int numUsableReplicas = numReplicas.liveReplicas() +
numReplicas.decommissioning() +
numReplicas.liveEnteringMaintenanceReplicas();
{code}
I think that if the same internal block is counted in liveReplicas and is also 
counted in decommissioning replicas, then 
numReplicas.liveReplicas() + numReplicas.decommissioning() will not make sense.
So I think the same internal block should be either in liveReplicas or in 
decommissioning replicas, but not both.


was (Author: ferhui):
{quote}
other storages contains this internal block should be decommissioning
{quote}
This comment is error, have modified it.
The function *countReplicasForStripedBlock* is used for recomputing the LIVE 
replica for the same internal block.
One case use it.
{code}
// Count replicas on decommissioning nodes, as these will not be
// decommissioned unless recovery/completing last block has finished
NumberReplicas numReplicas = countNodes(lastBlock);
int numUsableReplicas = numReplicas.liveReplicas() +
numReplicas.decommissioning() +
numReplicas.liveEnteringMaintenanceReplicas();
{code}
I think if the same internal block is contains liveReplicas, and it is also 
contains decommissioning replicas. 
numReplicas.liveReplicas() + numReplicas.decommissioning() will not make sense.
So I think the same internal block is ether in liveReplicas or in 
decommissioning replicas, but not both.

> Erasure Coding: Decommission may hang If one or more datanodes are out of 
> service during decommission  
> ---
>
> Key: HDFS-14920
> URL: https://issues.apache.org/jira/browse/HDFS-14920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch
>
>
> The decommission test hangs in our clusters.
> We have seen messages like the following:
> {quote}
> 2019-10-22 15:58:51,514 TRACE 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block 
> blk_-9223372035600425840_372987973 numExpected=9, numLive=5
> 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: 
> blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, 
> corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, 
> maintenance replicas: 0, live entering maintenance replicas: 0, excess 
> replicas: 0, Is Open File: false, Datanodes having this block: 
> 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 
> 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 
> 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current 
> datanode decommissioning: true, Is current datanode entering maintenance: 
> false
> 2019-10-22 15:58:51,514 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node 
> 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate 
> to finish Decommission In Progress
> {quote}
> After digging into the source code and the cluster log, I guess it happens in the 
> following steps.
> # The storage strategy is RS-6-3-1024k.
> # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8; b0 is from 
> datanode dn0, b1 is from datanode dn1, ...etc.
> # At the beginning dn0 is in decommission progress, b0 is replicated 
> successfully, and dn0 is still in decommission progress.
> # Later b1, b2, b3 are in decommission progress, and dn4 containing b4 is out of 
> service, so reconstruction is needed and an ErasureCodingWork is created to do it; in 
> the ErasureCodingWork, additionalReplRequired is 4.
> # Because hasAllInternalBlocks is false, it will call 
> ErasureCodingWork#addTaskToDatanode -> 
> DatanodeDescriptor#addBlockToBeErasureCoded and send a 
> BlockECReconstructionInfo task to the Datanode.
> # The DataNode cannot reconstruct the block because targets is 4, greater 
> than 3 (the parity number).
> There is a problem as follows, from BlockManager.java#scheduleReconstruction:
> {code}
>   // should reconstruct all the internal blocks before scheduling
>   // 

[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission

2019-10-28 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960829#comment-16960829
 ] 

Fei Hui commented on HDFS-14920:


{quote}
other storages contains this internal block should be decommissioning
{quote}
This comment is wrong; I have modified it.
The function *countReplicasForStripedBlock* is used for recomputing the LIVE 
replicas for the same internal block.
One case that uses it:
{code}
// Count replicas on decommissioning nodes, as these will not be
// decommissioned unless recovery/completing last block has finished
NumberReplicas numReplicas = countNodes(lastBlock);
int numUsableReplicas = numReplicas.liveReplicas() +
numReplicas.decommissioning() +
numReplicas.liveEnteringMaintenanceReplicas();
{code}
I think that if the same internal block is counted in liveReplicas and is also 
counted in decommissioning replicas, then 
numReplicas.liveReplicas() + numReplicas.decommissioning() will not make sense.
So I think the same internal block should be either in liveReplicas or in 
decommissioning replicas, but not both.

> Erasure Coding: Decommission may hang If one or more datanodes are out of 
> service during decommission  
> ---
>
> Key: HDFS-14920
> URL: https://issues.apache.org/jira/browse/HDFS-14920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch
>
>
> The decommission test hangs in our clusters.
> We have seen messages like the following:
> {quote}
> 2019-10-22 15:58:51,514 TRACE 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block 
> blk_-9223372035600425840_372987973 numExpected=9, numLive=5
> 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: 
> blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, 
> corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, 
> maintenance replicas: 0, live entering maintenance replicas: 0, excess 
> replicas: 0, Is Open File: false, Datanodes having this block: 
> 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 
> 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 
> 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current 
> datanode decommissioning: true, Is current datanode entering maintenance: 
> false
> 2019-10-22 15:58:51,514 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node 
> 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate 
> to finish Decommission In Progress
> {quote}
> After digging into the source code and the cluster log, I guess it happens in the 
> following steps.
> # The storage strategy is RS-6-3-1024k.
> # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8; b0 is from 
> datanode dn0, b1 is from datanode dn1, ...etc.
> # At the beginning dn0 is in decommission progress, b0 is replicated 
> successfully, and dn0 is still in decommission progress.
> # Later b1, b2, b3 are in decommission progress, and dn4 containing b4 is out of 
> service, so reconstruction is needed and an ErasureCodingWork is created to do it; in 
> the ErasureCodingWork, additionalReplRequired is 4.
> # Because hasAllInternalBlocks is false, it will call 
> ErasureCodingWork#addTaskToDatanode -> 
> DatanodeDescriptor#addBlockToBeErasureCoded and send a 
> BlockECReconstructionInfo task to the Datanode.
> # The DataNode cannot reconstruct the block because targets is 4, greater 
> than 3 (the parity number).
> There is a problem as follows, from BlockManager.java#scheduleReconstruction:
> {code}
>   // should reconstruct all the internal blocks before scheduling
>   // replication task for decommissioning node(s).
>   if (additionalReplRequired - numReplicas.decommissioning() -
>   numReplicas.liveEnteringMaintenanceReplicas() > 0) {
> additionalReplRequired = additionalReplRequired -
> numReplicas.decommissioning() -
> numReplicas.liveEnteringMaintenanceReplicas();
>   }
> {code}
> We should reconstruct first and then replicate for decommissioning. Because 
> numReplicas.decommissioning() is 4 and additionalReplRequired is 4, that's 
> wrong:
> numReplicas.decommissioning() should be 3; it should exclude the live replica. 
> If so, additionalReplRequired will be 1 and reconstruction will be scheduled as 
> expected. After that, decommission goes on.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: 

[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission

2019-10-28 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960836#comment-16960836
 ] 

Fei Hui commented on HDFS-14920:


[~gjhkael] Yes. You got it!

> Erasure Coding: Decommission may hang If one or more datanodes are out of 
> service during decommission  
> ---
>
> Key: HDFS-14920
> URL: https://issues.apache.org/jira/browse/HDFS-14920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch
>
>
> The decommission test hangs in our clusters.
> We have seen messages like the following:
> {quote}
> 2019-10-22 15:58:51,514 TRACE 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block 
> blk_-9223372035600425840_372987973 numExpected=9, numLive=5
> 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: 
> blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, 
> corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, 
> maintenance replicas: 0, live entering maintenance replicas: 0, excess 
> replicas: 0, Is Open File: false, Datanodes having this block: 
> 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 
> 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 
> 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current 
> datanode decommissioning: true, Is current datanode entering maintenance: 
> false
> 2019-10-22 15:58:51,514 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node 
> 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate 
> to finish Decommission In Progress
> {quote}
> After digging into the source code and the cluster log, I guess it happens in the 
> following steps.
> # The storage strategy is RS-6-3-1024k.
> # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8; b0 is from 
> datanode dn0, b1 is from datanode dn1, ...etc.
> # At the beginning dn0 is in decommission progress, b0 is replicated 
> successfully, and dn0 is still in decommission progress.
> # Later b1, b2, b3 are in decommission progress, and dn4 containing b4 is out of 
> service, so reconstruction is needed and an ErasureCodingWork is created to do it; in 
> the ErasureCodingWork, additionalReplRequired is 4.
> # Because hasAllInternalBlocks is false, it will call 
> ErasureCodingWork#addTaskToDatanode -> 
> DatanodeDescriptor#addBlockToBeErasureCoded and send a 
> BlockECReconstructionInfo task to the Datanode.
> # The DataNode cannot reconstruct the block because targets is 4, greater 
> than 3 (the parity number).
> There is a problem as follows, from BlockManager.java#scheduleReconstruction:
> {code}
>   // should reconstruct all the internal blocks before scheduling
>   // replication task for decommissioning node(s).
>   if (additionalReplRequired - numReplicas.decommissioning() -
>   numReplicas.liveEnteringMaintenanceReplicas() > 0) {
> additionalReplRequired = additionalReplRequired -
> numReplicas.decommissioning() -
> numReplicas.liveEnteringMaintenanceReplicas();
>   }
> {code}
> We should reconstruct first and then replicate for decommissioning. Because 
> numReplicas.decommissioning() is 4 and additionalReplRequired is 4, that's 
> wrong:
> numReplicas.decommissioning() should be 3; it should exclude the live replica. 
> If so, additionalReplRequired will be 1 and reconstruction will be scheduled as 
> expected. After that, decommission goes on.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission

2019-10-28 Thread guojh (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960817#comment-16960817
 ] 

guojh commented on HDFS-14920:
--

Thanks [~ferhui], I am clear now. You just want to correct the decommissioning 
counter: if indices [0, 1, 2, 3] are decommissioning, and another replica with index 
0 is replicated successfully, then the decommissioning counter should be 3, not 4. 
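For illustration, a tiny standalone sketch (hypothetical data, not the actual patch) of counting decommissioning replicas per internal block index while skipping indices that already have a live replica, which yields 3 rather than 4 in the example above:
{code:java}
import java.util.BitSet;

// Standalone sketch (hypothetical data, not the actual patch): an internal
// block index that already has a LIVE replica should not also be counted as
// DECOMMISSIONING when computing the replica counters.
public class StripedReplicaCountSketch {

  public static void main(String[] args) {
    // (blockIndex, isLive) pairs: index 0 has both a live replica and a
    // decommissioning replica; indices 1, 2, 3 are only decommissioning.
    int[][] reported = { {0, 1}, {0, 0}, {1, 0}, {2, 0}, {3, 0} };

    BitSet liveBitSet = new BitSet();
    int live = 0;
    for (int[] r : reported) {
      if (r[1] == 1) {
        liveBitSet.set(r[0]);
        live++;
      }
    }

    int decommissioning = 0;
    for (int[] r : reported) {
      if (r[1] == 0 && !liveBitSet.get(r[0])) {
        decommissioning++;  // count only indices with no live replica
      }
    }

    // Prints live=1, decommissioning=3 (not 4).
    System.out.println("live=" + live + ", decommissioning=" + decommissioning);
  }
}
{code}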

> Erasure Coding: Decommission may hang If one or more datanodes are out of 
> service during decommission  
> ---
>
> Key: HDFS-14920
> URL: https://issues.apache.org/jira/browse/HDFS-14920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch
>
>
> The decommission test hangs in our clusters.
> We have seen messages like the following:
> {quote}
> 2019-10-22 15:58:51,514 TRACE 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block 
> blk_-9223372035600425840_372987973 numExpected=9, numLive=5
> 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: 
> blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, 
> corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, 
> maintenance replicas: 0, live entering maintenance replicas: 0, excess 
> replicas: 0, Is Open File: false, Datanodes having this block: 
> 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 
> 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 
> 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current 
> datanode decommissioning: true, Is current datanode entering maintenance: 
> false
> 2019-10-22 15:58:51,514 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node 
> 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate 
> to finish Decommission In Progress
> {quote}
> After digging into the source code and the cluster logs, we guess it happens in the 
> following steps.
> # Storage strategy is RS-6-3-1024k.
> # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8, b0 is from 
> datanode dn0, b1 is from datanode dn1, ...etc
> # At the beginning dn0 is in decommission progress, b0 is replicated 
> successfully, and dn0 is still in decommission progress.
> # Later b1, b2, b3 in decommission progress, and dn4 containing b4 is out of 
> service, so need to reconstruct, and create ErasureCodingWork to do it, in 
> the ErasureCodingWork, additionalReplRequired is 4
> # Because hasAllInternalBlocks is false, Will call 
> ErasureCodingWork#addTaskToDatanode -> 
> DatanodeDescriptor#addBlockToBeErasureCoded, and send 
> BlockECReconstructionInfo task to Datanode
> # DataNode cannot reconstruct the block because targets is 4, greater 
> than 3 (the parity number).
> The problem is as follows, in BlockManager.java#scheduleReconstruction
> {code}
>   // should reconstruct all the internal blocks before scheduling
>   // replication task for decommissioning node(s).
>   if (additionalReplRequired - numReplicas.decommissioning() -
>   numReplicas.liveEnteringMaintenanceReplicas() > 0) {
> additionalReplRequired = additionalReplRequired -
> numReplicas.decommissioning() -
> numReplicas.liveEnteringMaintenanceReplicas();
>   }
> {code}
> Reconstruction should happen first, and then replication for decommissioning. Since 
> numReplicas.decommissioning() is 4 and additionalReplRequired is 4, that is 
> wrong:
> numReplicas.decommissioning() should be 3; it should exclude the live replica. 
> If so, additionalReplRequired will be 1 and reconstruction will be scheduled as 
> expected. After that, decommission goes on.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission

2019-10-28 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960795#comment-16960795
 ] 

Fei Hui edited comment on HDFS-14920 at 10/28/19 7:23 AM:
--

[~ayushtkn] Thanks for your review
{code}
// Sub decommissioning because the index replica is live.
if (decommissioningBitSet.get(blockIndex)) {
  counters.subtract(StoredReplicaState.DECOMMISSIONING, 1);
} else {
  decommissioningBitSet.set(blockIndex);
}
{code}
We set the bit for the *blockIndex* internal block because we have entered the if clause below
{code}
 if (state == StoredReplicaState.LIVE) {
{code}
If the *blockIndex* internal block is in the live state, this block on other 
storages should not be counted as decommissioning while we compute live and 
decommissioning replicas. The *blockIndex* internal block will be either live or 
decommissioning; it cannot be both live and decommissioning.
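A compilable sketch of this counting idea follows; the class, the Replica holder and all names here are hypothetical stand-ins for illustration, not the actual BlockManager code.

{code:java}
import java.util.BitSet;
import java.util.List;

public class StripedReplicaCounting {
  enum State { LIVE, DECOMMISSIONING }

  static final class Replica {
    final int blockIndex;
    final State state;
    Replica(int blockIndex, State state) { this.blockIndex = blockIndex; this.state = state; }
  }

  static int countDecommissioning(List<Replica> replicas, int totalBlockNum) {
    BitSet liveBitSet = new BitSet(totalBlockNum);
    BitSet decommissioningBitSet = new BitSet(totalBlockNum);
    int decommissioning = 0;
    for (Replica r : replicas) {
      if (r.state == State.DECOMMISSIONING && !decommissioningBitSet.get(r.blockIndex)) {
        // First decommissioning copy of this index; it is skipped later if the
        // index was already pre-marked by a live copy.
        decommissioningBitSet.set(r.blockIndex);
        decommissioning++;
      } else if (r.state == State.LIVE && !liveBitSet.get(r.blockIndex)) {
        liveBitSet.set(r.blockIndex);
        if (decommissioningBitSet.get(r.blockIndex)) {
          // The index was already counted as decommissioning; take it back out.
          decommissioning--;
        } else {
          // Pre-mark the index so a later decommissioning copy is not counted.
          decommissioningBitSet.set(r.blockIndex);
        }
      }
    }
    // For indices 0-3 decommissioning with index 0 also live elsewhere, this returns 3.
    return decommissioning;
  }
}
{code}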


was (Author: ferhui):
[~ayushtkn] Thanks for your review
{code}
// Sub decommissioning because the index replica is live.
if (decommissioningBitSet.get(blockIndex)) {
  counters.subtract(StoredReplicaState.DECOMMISSIONING, 1);
} else {
  decommissioningBitSet.set(blockIndex);
}
{code}
We set the *blockIndex* internal block. Because having enter if clause as bellow
{code}
 if (state == StoredReplicaState.LIVE) {
{code}
If the *blockIndex* internal block is in live state, other storages contains 
this internal block should be decommissioning while we compute live and 
decommissioning replicas. The *blockIndex* internal block will be live or 
decommissioning, it could not be both live and decommissioning.

> Erasure Coding: Decommission may hang If one or more datanodes are out of 
> service during decommission  
> ---
>
> Key: HDFS-14920
> URL: https://issues.apache.org/jira/browse/HDFS-14920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch
>
>
> Decommission test hangs in our clusters.
> We have seen messages like the following
> {quote}
> 2019-10-22 15:58:51,514 TRACE 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block 
> blk_-9223372035600425840_372987973 numExpected=9, numLive=5
> 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: 
> blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, 
> corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, 
> maintenance replicas: 0, live entering maintenance replicas: 0, excess 
> replicas: 0, Is Open File: false, Datanodes having this block: 
> 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 
> 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 
> 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current 
> datanode decommissioning: true, Is current datanode entering maintenance: 
> false
> 2019-10-22 15:58:51,514 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node 
> 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate 
> to finish Decommission In Progress
> {quote}
> After digging into the source code and the cluster logs, we guess it happens in the 
> following steps.
> # Storage strategy is RS-6-3-1024k.
> # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8, b0 is from 
> datanode dn0, b1 is from datanode dn1, ...etc
> # At the beginning dn0 is in decommission progress, b0 is replicated 
> successfully, and dn0 is still in decommission progress.
> # Later b1, b2, b3 in decommission progress, and dn4 containing b4 is out of 
> service, so need to reconstruct, and create ErasureCodingWork to do it, in 
> the ErasureCodingWork, additionalReplRequired is 4
> # Because hasAllInternalBlocks is false, Will call 
> ErasureCodingWork#addTaskToDatanode -> 
> DatanodeDescriptor#addBlockToBeErasureCoded, and send 
> BlockECReconstructionInfo task to Datanode
> # DataNode cannot reconstruct the block because targets is 4, greater 
> than 3 (the parity number).
> The problem is as follows, in BlockManager.java#scheduleReconstruction
> {code}
>   // should reconstruct all the internal blocks before scheduling
>   // replication task for decommissioning node(s).
>   if (additionalReplRequired - numReplicas.decommissioning() -
>   numReplicas.liveEnteringMaintenanceReplicas() > 0) {
> additionalReplRequired = additionalReplRequired -
> numReplicas.decommissioning() -
> numReplicas.liveEnteringMaintenanceReplicas();
>   }
> {code}
> Should 

[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission

2019-10-28 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960813#comment-16960813
 ] 

Ayush Saxena commented on HDFS-14920:
-

Where are you using this {{decommissioningBitSet.set(blockIndex);}} from the else 
part?
{quote}other storages contains this internal block should be decommissioning
{quote}
There is a *"should be"*, we aren't sure of all the cases, may be theoretically 
it should be, but may be a bug, it might not and may lead to some very abnormal 
scenarios, I don't think putting it in decommissioningBit, without finding it 
be in the decommissioning state is a good idea.

[~gjhkael] had some questions too, can you clarify his doubts?

> Erasure Coding: Decommission may hang If one or more datanodes are out of 
> service during decommission  
> ---
>
> Key: HDFS-14920
> URL: https://issues.apache.org/jira/browse/HDFS-14920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch
>
>
> Decommission test hangs in our clusters.
> We have seen messages like the following
> {quote}
> 2019-10-22 15:58:51,514 TRACE 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block 
> blk_-9223372035600425840_372987973 numExpected=9, numLive=5
> 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: 
> blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, 
> corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, 
> maintenance replicas: 0, live entering maintenance replicas: 0, excess 
> replicas: 0, Is Open File: false, Datanodes having this block: 
> 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 
> 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 
> 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current 
> datanode decommissioning: true, Is current datanode entering maintenance: 
> false
> 2019-10-22 15:58:51,514 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node 
> 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate 
> to finish Decommission In Progress
> {quote}
> After digging into the source code and the cluster logs, we guess it happens in the 
> following steps.
> # Storage strategy is RS-6-3-1024k.
> # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8, b0 is from 
> datanode dn0, b1 is from datanode dn1, ...etc
> # At the beginning dn0 is in decommission progress, b0 is replicated 
> successfully, and dn0 is still in decommission progress.
> # Later b1, b2, b3 in decommission progress, and dn4 containing b4 is out of 
> service, so need to reconstruct, and create ErasureCodingWork to do it, in 
> the ErasureCodingWork, additionalReplRequired is 4
> # Because hasAllInternalBlocks is false, Will call 
> ErasureCodingWork#addTaskToDatanode -> 
> DatanodeDescriptor#addBlockToBeErasureCoded, and send 
> BlockECReconstructionInfo task to Datanode
> # DataNode cannot reconstruct the block because targets is 4, greater 
> than 3 (the parity number).
> The problem is as follows, in BlockManager.java#scheduleReconstruction
> {code}
>   // should reconstruct all the internal blocks before scheduling
>   // replication task for decommissioning node(s).
>   if (additionalReplRequired - numReplicas.decommissioning() -
>   numReplicas.liveEnteringMaintenanceReplicas() > 0) {
> additionalReplRequired = additionalReplRequired -
> numReplicas.decommissioning() -
> numReplicas.liveEnteringMaintenanceReplicas();
>   }
> {code}
> Reconstruction should happen first, and then replication for decommissioning. Since 
> numReplicas.decommissioning() is 4 and additionalReplRequired is 4, that is 
> wrong:
> numReplicas.decommissioning() should be 3; it should exclude the live replica. 
> If so, additionalReplRequired will be 1 and reconstruction will be scheduled as 
> expected. After that, decommission goes on.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2370) Remove classpath in RunningWithHDFS.md ozone-hdfs/docker-compose as dir 'ozoneplugin' is not exist anymore

2019-10-28 Thread Attila Doroszlai (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960809#comment-16960809
 ] 

Attila Doroszlai commented on HDDS-2370:


Thanks for reporting this problem [~Huachao].  I think datanode plugin behavior 
should be fixed instead of removing it from the documentation.  Would you like 
to work on it, or may I?

> Remove classpath in RunningWithHDFS.md ozone-hdfs/docker-compose as dir 
> 'ozoneplugin' is not exist anymore
> --
>
> Key: HDDS-2370
> URL: https://issues.apache.org/jira/browse/HDDS-2370
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>  Components: documentation
>Reporter: luhuachao
>Priority: Major
> Attachments: HDDS-2370.1.patch
>
>
> In RunningWithHDFS.md 
> {code:java}
> export 
> HADOOP_CLASSPATH=/opt/ozone/share/hadoop/ozoneplugin/hadoop-ozone-datanode-plugin.jar{code}
> ozone-hdfs/docker-compose.yaml
>  
> {code:java}
>   environment:
>  HADOOP_CLASSPATH: /opt/ozone/share/hadoop/ozoneplugin/*.jar
> {code}
> When I run HddsDatanodeService as a plugin in the HDFS datanode, it fails with 
> the error below; there is no constructor without parameters.
>  
>  
> {code:java}
> 2019-10-21 21:38:56,391 ERROR datanode.DataNode 
> (DataNode.java:startPlugins(972)) - Unable to load DataNode plugins. 
> Specified list of plugins: org.apache.hadoop.ozone.HddsDatanodeService
> java.lang.RuntimeException: java.lang.NoSuchMethodException: 
> org.apache.hadoop.ozone.HddsDatanodeService.()
> {code}
> What I doubt is whether ozone-0.5 no longer supports running as a plugin in the HDFS 
> datanode. If so, 
> why don't we remove the doc RunningWithHDFS.md? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission

2019-10-28 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960803#comment-16960803
 ] 

Fei Hui commented on HDFS-14920:


[~gjhkael] Thanks for the review.
We should consider this issue as a whole, find the root cause and then fix it.
I think the uploaded patch is also simple, and I have considered your approach. 
Maybe we should follow the comment below.
{quote}
  // should reconstruct all the internal blocks before scheduling
  // replication task for decommissioning node(s).
{quote}
The patch can resolve this issue. With the patch, targets is 1 rather than 4, and 
reconstruction will succeed.


> Erasure Coding: Decommission may hang If one or more datanodes are out of 
> service during decommission  
> ---
>
> Key: HDFS-14920
> URL: https://issues.apache.org/jira/browse/HDFS-14920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch
>
>
> Decommission test hangs in our clusters.
> We have seen messages like the following
> {quote}
> 2019-10-22 15:58:51,514 TRACE 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block 
> blk_-9223372035600425840_372987973 numExpected=9, numLive=5
> 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: 
> blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, 
> corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, 
> maintenance replicas: 0, live entering maintenance replicas: 0, excess 
> replicas: 0, Is Open File: false, Datanodes having this block: 
> 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 
> 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 
> 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current 
> datanode decommissioning: true, Is current datanode entering maintenance: 
> false
> 2019-10-22 15:58:51,514 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node 
> 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate 
> to finish Decommission In Progress
> {quote}
> After digging into the source code and the cluster logs, we guess it happens in the 
> following steps.
> # Storage strategy is RS-6-3-1024k.
> # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8, b0 is from 
> datanode dn0, b1 is from datanode dn1, ...etc
> # At the beginning dn0 is in decommission progress, b0 is replicated 
> successfully, and dn0 is still in decommission progress.
> # Later b1, b2, b3 in decommission progress, and dn4 containing b4 is out of 
> service, so need to reconstruct, and create ErasureCodingWork to do it, in 
> the ErasureCodingWork, additionalReplRequired is 4
> # Because hasAllInternalBlocks is false, Will call 
> ErasureCodingWork#addTaskToDatanode -> 
> DatanodeDescriptor#addBlockToBeErasureCoded, and send 
> BlockECReconstructionInfo task to Datanode
> # DataNode cannot reconstruct the block because targets is 4, greater 
> than 3 (the parity number).
> The problem is as follows, in BlockManager.java#scheduleReconstruction
> {code}
>   // should reconstruct all the internal blocks before scheduling
>   // replication task for decommissioning node(s).
>   if (additionalReplRequired - numReplicas.decommissioning() -
>   numReplicas.liveEnteringMaintenanceReplicas() > 0) {
> additionalReplRequired = additionalReplRequired -
> numReplicas.decommissioning() -
> numReplicas.liveEnteringMaintenanceReplicas();
>   }
> {code}
> Reconstruction should happen first, and then replication for decommissioning. Since 
> numReplicas.decommissioning() is 4 and additionalReplRequired is 4, that is 
> wrong:
> numReplicas.decommissioning() should be 3; it should exclude the live replica. 
> If so, additionalReplRequired will be 1 and reconstruction will be scheduled as 
> expected. After that, decommission goes on.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2322) DoubleBuffer flush termination and OM shutdown's after that.

2019-10-28 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960802#comment-16960802
 ] 

Li Cheng commented on HDDS-2322:


https://issues.apache.org/jira/browse/HDDS-2356 is still having issues. Do you 
mean to track it here? [~bharat]

> DoubleBuffer flush termination and OM shutdown's after that.
> 
>
> Key: HDDS-2322
> URL: https://issues.apache.org/jira/browse/HDDS-2322
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> om1_1       | 2019-10-18 00:34:45,317 [OMDoubleBufferFlushThread] ERROR      
> - Terminating with exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> om1_1       | java.util.ConcurrentModificationException
> om1_1       | at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1660)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> om1_1       | at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> om1_1       | at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
> om1_1       | at 
> org.apache.hadoop.ozone.om.helpers.OmKeyLocationInfoGroup.getProtobuf(OmKeyLocationInfoGroup.java:65)
> om1_1       | at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> om1_1       | at 
> java.base/java.util.Collections$2.tryAdvance(Collections.java:4745)
> om1_1       | at 
> java.base/java.util.Collections$2.forEachRemaining(Collections.java:4753)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> om1_1       | at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> om1_1       | at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
> om1_1       | at 
> org.apache.hadoop.ozone.om.helpers.OmKeyInfo.getProtobuf(OmKeyInfo.java:362)
> om1_1       | at 
> org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.toPersistedFormat(OmKeyInfoCodec.java:37)
> om1_1       | at 
> org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.toPersistedFormat(OmKeyInfoCodec.java:31)
> om1_1       | at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
> om1_1       | at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
> om1_1       | at 
> org.apache.hadoop.ozone.om.response.key.OMKeyCreateResponse.addToDBBatch(OMKeyCreateResponse.java:58)
> om1_1       | at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:139)
> om1_1       | at 
> java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
> om1_1       | at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:137)
> om1_1       | at java.base/java.lang.Thread.run(Thread.java:834)
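The root of the stack trace above is a collection being structurally modified while a stream over it is still being traversed. A minimal, standalone illustration of that failure mode (plain JDK code, unrelated to the OM classes themselves):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class CmeDemo {
  public static void main(String[] args) {
    List<Integer> locations = new ArrayList<>(List.of(1, 2, 3));
    try {
      // Mutating the backing list while a stream is still consuming it trips
      // ArrayList's fail-fast check, just like the getProtobuf() call in the trace.
      locations.stream()
          .map(i -> { locations.add(99); return i; })
          .collect(Collectors.toList());
    } catch (java.util.ConcurrentModificationException e) {
      System.out.println("CME, as in the OM double-buffer flush: " + e);
    }
  }
}
{code}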



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-28 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960799#comment-16960799
 ] 

Li Cheng commented on HDDS-2356:


[~bharat] In terms of reproduction, I have a dataset which includes small files 
as well as big files. I'm using the S3 gateway from Ozone and mount the Ozone 
cluster to a local path with goofys. All the data are recursively written to the 
mount path, which essentially leads to the Ozone cluster. The Ozone cluster is 
deployed on a 3-node VM environment and each VM has only 1 disk for Ozone data 
writing. I think it's a pretty simple scenario to reproduce. The only 
operation is writing to the Ozone cluster through fuse. 
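For reference, the same multipart-upload path can also be exercised directly against the S3 gateway, without the goofys/fuse layer. A rough sketch with the AWS SDK for Java v1; the endpoint, bucket, key and credentials are placeholders rather than values from this report, and reproducing the race would still need many concurrent uploads:

{code:java}
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.*;

import java.io.ByteArrayInputStream;
import java.util.Collections;

public class MultipartUploadSketch {
  public static void main(String[] args) {
    AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        .withEndpointConfiguration(
            new AwsClientBuilder.EndpointConfiguration("http://S3G_HOST:9878", "us-east-1"))
        .withPathStyleAccessEnabled(true)
        .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials("any", "any")))
        .build();

    String bucket = "testbucket", key = "bigfile";
    InitiateMultipartUploadResult init =
        s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key));

    byte[] part = new byte[5 * 1024 * 1024];  // 5 MB minimum part size
    UploadPartResult res = s3.uploadPart(new UploadPartRequest()
        .withBucketName(bucket).withKey(key)
        .withUploadId(init.getUploadId())
        .withPartNumber(1)
        .withInputStream(new ByteArrayInputStream(part))
        .withPartSize(part.length));

    s3.completeMultipartUpload(new CompleteMultipartUploadRequest(
        bucket, key, init.getUploadId(), Collections.singletonList(res.getPartETag())));
  }
}
{code}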

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13736) BlockPlacementPolicyDefault can not choose favored nodes when 'dfs.namenode.block-placement-policy.default.prefer-local-node' set to false

2019-10-28 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960798#comment-16960798
 ] 

Ayush Saxena commented on HDFS-13736:
-

Thanx [~xiaodong.hu] for the patch.  Seems fair enough.
There is a checkstyle warning; I think we can't do anything about it, so we can live 
with it.
[~hexiaoqiao] if you get a chance, can you also give it a check once? I plan to push 
this maybe by tomorrow, if there are no comments!!!

> BlockPlacementPolicyDefault can not choose favored nodes when 
> 'dfs.namenode.block-placement-policy.default.prefer-local-node' set to false
> --
>
> Key: HDFS-13736
> URL: https://issues.apache.org/jira/browse/HDFS-13736
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.0
>Reporter: hu xiaodong
>Assignee: hu xiaodong
>Priority: Major
> Attachments: HDFS-13736.001.patch, HDFS-13736.002.patch, 
> HDFS-13736.003.patch, HDFS-13736.004.patch, HDFS-13736.005.patch, 
> HDFS-13736.006.patch
>
>
> BlockPlacementPolicyDefault can not choose favored nodes when 
> 'dfs.namenode.block-placement-policy.default.prefer-local-node' set to false. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-28 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960796#comment-16960796
 ] 

Li Cheng commented on HDDS-2356:


Also it prints the same pipeline id in the s3g logs like crazy. Wonder if that's 
expected. [~bharat]

2019-10-28 11:43:08,912 [qtp1383524016-24] INFO - Allocating block with 
ExcludeList \{datanodes = [], containerIds = [], pipelineIds = []}
...skipping...
eID=3c94d3f5-3c0e-4994-9c63-dc487071be1a, 
PipelineID=3c94d3f5-3c0e-4994-9c63-dc487071be1a, 
PipelineID=3c94d3f5-3c0e-4994-9c63-dc487071be1a, 
... (the same PipelineID repeated many more times)

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>

[jira] [Comment Edited] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission

2019-10-28 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960795#comment-16960795
 ] 

Fei Hui edited comment on HDFS-14920 at 10/28/19 6:39 AM:
--

[~ayushtkn] Thanks for your review
{code}
// Sub decommissioning because the index replica is live.
if (decommissioningBitSet.get(blockIndex)) {
  counters.subtract(StoredReplicaState.DECOMMISSIONING, 1);
} else {
  decommissioningBitSet.set(blockIndex);
}
{code}
We set the *blockIndex* internal block. Because having enter if clause as bellow
{code}
 if (state == StoredReplicaState.LIVE) {
{code}
If the *blockIndex* internal block is in live state, other storages contains 
this internal block should be decommissioning while we compute live and 
decommissioning replicas. The *blockIndex* internal block will be live or 
decommissioning, it could not be both live and decommissioning.


was (Author: ferhui):
[~ayushtkn] Thanks for your review
{code}
// Sub decommissioning because the index replica is live.
if (decommissioningBitSet.get(blockIndex)) {
  counters.subtract(StoredReplicaState.DECOMMISSIONING, 1);
} else {
  decommissioningBitSet.set(blockIndex);
}
{code}
We set the *blockIndex* internal block. Because having enter if clause as bellow
{code}
 if (state == StoredReplicaState.LIVE) {
{code}
It the *blockIndex* internal block is in live state, other storages contains 
this internal block should be decommissioning while we compute live and 
decommissioning replicas. The *blockIndex* internal block will be live or 
decommissioning, it could not be both live and decommissioning.

> Erasure Coding: Decommission may hang If one or more datanodes are out of 
> service during decommission  
> ---
>
> Key: HDFS-14920
> URL: https://issues.apache.org/jira/browse/HDFS-14920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch
>
>
> Decommission test hangs in our clusters.
> We have seen messages like the following
> {quote}
> 2019-10-22 15:58:51,514 TRACE 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block 
> blk_-9223372035600425840_372987973 numExpected=9, numLive=5
> 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: 
> blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, 
> corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, 
> maintenance replicas: 0, live entering maintenance replicas: 0, excess 
> replicas: 0, Is Open File: false, Datanodes having this block: 
> 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 
> 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 
> 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current 
> datanode decommissioning: true, Is current datanode entering maintenance: 
> false
> 2019-10-22 15:58:51,514 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node 
> 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate 
> to finish Decommission In Progress
> {quote}
> After digging into the source code and the cluster logs, we guess it happens in the 
> following steps.
> # Storage strategy is RS-6-3-1024k.
> # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8, b0 is from 
> datanode dn0, b1 is from datanode dn1, ...etc
> # At the beginning dn0 is in decommission progress, b0 is replicated 
> successfully, and dn0 is still in decommission progress.
> # Later b1, b2, b3 in decommission progress, and dn4 containing b4 is out of 
> service, so need to reconstruct, and create ErasureCodingWork to do it, in 
> the ErasureCodingWork, additionalReplRequired is 4
> # Because hasAllInternalBlocks is false, Will call 
> ErasureCodingWork#addTaskToDatanode -> 
> DatanodeDescriptor#addBlockToBeErasureCoded, and send 
> BlockECReconstructionInfo task to Datanode
> # DataNode cannot reconstruct the block because targets is 4, greater 
> than 3 (the parity number).
> The problem is as follows, in BlockManager.java#scheduleReconstruction
> {code}
>   // should reconstruct all the internal blocks before scheduling
>   // replication task for decommissioning node(s).
>   if (additionalReplRequired - numReplicas.decommissioning() -
>   numReplicas.liveEnteringMaintenanceReplicas() > 0) {
> additionalReplRequired = additionalReplRequired -
> numReplicas.decommissioning() -
> numReplicas.liveEnteringMaintenanceReplicas();
>   }
> {code}
> Should 

[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission

2019-10-28 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960795#comment-16960795
 ] 

Fei Hui commented on HDFS-14920:


[~ayushtkn] Thanks for your review
{code}
// Sub decommissioning because the index replica is live.
if (decommissioningBitSet.get(blockIndex)) {
  counters.subtract(StoredReplicaState.DECOMMISSIONING, 1);
} else {
  decommissioningBitSet.set(blockIndex);
}
{code}
We set the *blockIndex* internal block. Because having enter if clause as bellow
{code}
 if (state == StoredReplicaState.LIVE) {
{code}
It the *blockIndex* internal block is in live state, other storages contains 
this internal block should be decommissioning while we compute live and 
decommissioning replicas. The *blockIndex* internal block will be live or 
decommissioning, it could not be both live and decommissioning.

> Erasure Coding: Decommission may hang If one or more datanodes are out of 
> service during decommission  
> ---
>
> Key: HDFS-14920
> URL: https://issues.apache.org/jira/browse/HDFS-14920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch
>
>
> Decommission test hangs in our clusters.
> We have seen messages like the following
> {quote}
> 2019-10-22 15:58:51,514 TRACE 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block 
> blk_-9223372035600425840_372987973 numExpected=9, numLive=5
> 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: 
> blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, 
> corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, 
> maintenance replicas: 0, live entering maintenance replicas: 0, excess 
> replicas: 0, Is Open File: false, Datanodes having this block: 
> 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 
> 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 
> 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current 
> datanode decommissioning: true, Is current datanode entering maintenance: 
> false
> 2019-10-22 15:58:51,514 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node 
> 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate 
> to finish Decommission In Progress
> {quote}
> After digging into the source code and the cluster logs, we guess it happens in the 
> following steps.
> # Storage strategy is RS-6-3-1024k.
> # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8, b0 is from 
> datanode dn0, b1 is from datanode dn1, ...etc
> # At the beginning dn0 is in decommission progress, b0 is replicated 
> successfully, and dn0 is still in decommission progress.
> # Later b1, b2, b3 in decommission progress, and dn4 containing b4 is out of 
> service, so need to reconstruct, and create ErasureCodingWork to do it, in 
> the ErasureCodingWork, additionalReplRequired is 4
> # Because hasAllInternalBlocks is false, Will call 
> ErasureCodingWork#addTaskToDatanode -> 
> DatanodeDescriptor#addBlockToBeErasureCoded, and send 
> BlockECReconstructionInfo task to Datanode
> # DataNode cannot reconstruct the block because targets is 4, greater 
> than 3 (the parity number).
> The problem is as follows, in BlockManager.java#scheduleReconstruction
> {code}
>   // should reconstruct all the internal blocks before scheduling
>   // replication task for decommissioning node(s).
>   if (additionalReplRequired - numReplicas.decommissioning() -
>   numReplicas.liveEnteringMaintenanceReplicas() > 0) {
> additionalReplRequired = additionalReplRequired -
> numReplicas.decommissioning() -
> numReplicas.liveEnteringMaintenanceReplicas();
>   }
> {code}
> Reconstruction should happen first, and then replication for decommissioning. Since 
> numReplicas.decommissioning() is 4 and additionalReplRequired is 4, that is 
> wrong:
> numReplicas.decommissioning() should be 3; it should exclude the live replica. 
> If so, additionalReplRequired will be 1 and reconstruction will be scheduled as 
> expected. After that, decommission goes on.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14284) RBF: Log Router identifier when reporting exceptions

2019-10-28 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960790#comment-16960790
 ] 

Ayush Saxena commented on HDFS-14284:
-

Thanx [~hemanthboyina] for the patch. Couple of comments :

{code:java}
-throw new NoNamenodesAvailableException(nsId, ioe);
+throw new NoNamenodesAvailableException(
+nsId + " from router " + router.getRouterId(), ioe);
{code}

No need to append the text here in the nsId variable, Doesn't make sense to 
have message for a variable which intends to store NsId, Add a param for Router 
ID, and do the message appending part and all inside the Exception method, To 
make the actual code flow look clean.


{code:java}
-  throw new IOException("No namenodes to invoke " + method.getName() +
-  " with params " + Arrays.deepToString(params) + " from "
-  + router.getRouterId());
+  throw new RouterIOException("No namenodes to invoke " + method.getName()
+  + " with params " + Arrays.deepToString(params),
+  router.getRouterId());
{code}

If I see earlier the text was from ROUTERID not from router ROUTERID

{code:java}
.append(" from router ")
{code}
So, better we keep the text same, don't add router here, Somebody parsing the 
string would fail, if we tweak the text.

For the test :

* Derive the RouterIOException from the RemoteException, and check that 
{{getMessage}} and {{getRouterID}} are giving the correct values.
* No need to remove the NoNamenodeException test; we are changing that too, so 
better keep that too. If both exceptions use the same code flow, put them 
in the same test and name the test a little generic; otherwise separate them, but 
try to reuse the code if possible, by refactoring into a method, if you can't 
keep them in one test.
* 
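A minimal sketch of the exception shape suggested above; class and accessor names are illustrative only, not the actual HDFS-14284 patch. The router id is kept in a field and the message formatting happens inside the constructor, preserving the existing "from <routerId>" text.

{code:java}
import java.io.IOException;

/** Illustrative only; not the actual HDFS-14284 patch. */
public class RouterIOException extends IOException {
  private final String routerId;

  public RouterIOException(String message, String routerId) {
    // Keep the original " from <routerId>" wording so existing log parsers still match.
    super(message + " from " + routerId);
    this.routerId = routerId;
  }

  public String getRouterId() {
    return routerId;
  }
}
{code}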



> RBF: Log Router identifier when reporting exceptions
> 
>
> Key: HDFS-14284
> URL: https://issues.apache.org/jira/browse/HDFS-14284
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-14284.001.patch, HDFS-14284.002.patch, 
> HDFS-14284.003.patch, HDFS-14284.004.patch, HDFS-14284.005.patch, 
> HDFS-14284.006.patch, HDFS-14284.007.patch
>
>
> The typical setup is to use multiple Routers through 
> ConfiguredFailoverProxyProvider.
> In a regular HA Namenode setup, it is easy to know which NN was used.
> However, in RBF, any Router can be the one reporting the exception and it is 
> hard to know which was the one.
> We should have a way to identify which Router/Namenode was the one triggering 
> the exception.
> This would also apply with Observer Namenodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14284) RBF: Log Router identifier when reporting exceptions

2019-10-28 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960790#comment-16960790
 ] 

Ayush Saxena edited comment on HDFS-14284 at 10/28/19 6:30 AM:
---

Thanx [~hemanthboyina] for the patch. Couple of comments :

{code:java}
-throw new NoNamenodesAvailableException(nsId, ioe);
+throw new NoNamenodesAvailableException(
+nsId + " from router " + router.getRouterId(), ioe);
{code}

No need to append the text here in the nsId variable, Doesn't make sense to 
have message for a variable which intends to store NsId, Add a param for Router 
ID, and do the message appending part and all inside the Exception method, To 
make the actual code flow look clean.


{code:java}
-  throw new IOException("No namenodes to invoke " + method.getName() +
-  " with params " + Arrays.deepToString(params) + " from "
-  + router.getRouterId());
+  throw new RouterIOException("No namenodes to invoke " + method.getName()
+  + " with params " + Arrays.deepToString(params),
+  router.getRouterId());
{code}

If I see earlier the text was from ROUTERID not from router ROUTERID

{code:java}
.append(" from router ")
{code}
So, better we keep the text same, don't add router here, Somebody parsing the 
string would fail, if we tweak the text.

For the test :

* Derrive the RouterIOException from the RemoteException, And check the 
{{getMessage}} and {{getRouterID}} are giving correct stuff.
* No need to remove the NoNamenodeException Test, We are changing that too, 
Better keep that too, If the both exceptions are using some code flow, put them 
in same test and name the test a little genreic, otherwise sepaerate them, But 
try to reuse the code if possible, by refactoring into a method, if you can't 
keep in one test.




was (Author: ayushtkn):
Thanx [~hemanthboyina] for the patch. Couple of comments :

{code:java}
-throw new NoNamenodesAvailableException(nsId, ioe);
+throw new NoNamenodesAvailableException(
+nsId + " from router " + router.getRouterId(), ioe);
{code}

No need to append the text here in the nsId variable, Doesn't make sense to 
have message for a variable which intends to store NsId, Add a param for Router 
ID, and do the message appending part and all inside the Exception method, To 
make the actual code flow look clean.


{code:java}
-  throw new IOException("No namenodes to invoke " + method.getName() +
-  " with params " + Arrays.deepToString(params) + " from "
-  + router.getRouterId());
+  throw new RouterIOException("No namenodes to invoke " + method.getName()
+  + " with params " + Arrays.deepToString(params),
+  router.getRouterId());
{code}

If I see earlier the text was from ROUTERID not from router ROUTERID

{code:java}
.append(" from router ")
{code}
So, better we keep the text same, don't add router here, Somebody parsing the 
string would fail, if we tweak the text.

For the test :

* Derrive the RouterIOException from the RemoteException, And check the 
{{getMessage}} and {{getRouterID}} are giving correct stuff.
* No need to remove the NoNamenodeException Test, We are changing that too, 
Better keep that too, If the both exceptions are using some code flow, put them 
in same test and name the test a little genreic, otherwise sepaerate them, But 
try to reuse the code if possible, by refactoring into a method, if you can't 
keep in one test.
* 



> RBF: Log Router identifier when reporting exceptions
> 
>
> Key: HDFS-14284
> URL: https://issues.apache.org/jira/browse/HDFS-14284
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Íñigo Goiri
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-14284.001.patch, HDFS-14284.002.patch, 
> HDFS-14284.003.patch, HDFS-14284.004.patch, HDFS-14284.005.patch, 
> HDFS-14284.006.patch, HDFS-14284.007.patch
>
>
> The typical setup is to use multiple Routers through 
> ConfiguredFailoverProxyProvider.
> In a regular HA Namenode setup, it is easy to know which NN was used.
> However, in RBF, any Router can be the one reporting the exception and it is 
> hard to know which was the one.
> We should have a way to identify which Router/Namenode was the one triggering 
> the exception.
> This would also apply with Observer Namenodes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14768) EC : Busy DN replica should be consider in live replica check.

2019-10-28 Thread guojh (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guojh updated HDFS-14768:
-
Attachment: HDFS-14768.010.patch

> EC : Busy DN replica should be consider in live replica check.
> --
>
> Key: HDFS-14768
> URL: https://issues.apache.org/jira/browse/HDFS-14768
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding, hdfs, namenode
>Affects Versions: 3.0.2
>Reporter: guojh
>Assignee: guojh
>Priority: Major
>  Labels: patch
> Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, 
> HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, 
> HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, 
> HDFS-14768.006.patch, HDFS-14768.007.patch, HDFS-14768.008.patch, 
> HDFS-14768.009.patch, HDFS-14768.010.patch, HDFS-14768.jpg, 
> guojh_UT_after_deomission.txt, guojh_UT_before_deomission.txt, 
> zhaoyiming_UT_after_deomission.txt, zhaoyiming_UT_beofre_deomission.txt
>
>
> Policy is RS-6-3-1024K, version is hadoop 3.0.2;
> We suppose a file's block indices are [0,1,2,3,4,5,6,7,8], we decommission 
> indices [3,4], and we increase the index 6 datanode's
> pendingReplicationWithoutTargets so that it becomes larger than 
> replicationStreamsHardLimit (we set 14). Then, after the method 
> chooseSourceDatanodes of BlockManager, the liveBlockIndices is 
> [0,1,2,3,4,5,7,8], and the block counters are Live:7, Decommission:2. 
> In the method scheduleReconstruction of BlockManager, additionalReplRequired 
> is 9 - 7 = 2. After the Namenode chooses two target Datanodes, it will assign an 
> erasure coding task to the target datanodes.
> When a datanode gets the task, it will build targetIndices from liveBlockIndices 
> and the target length. The code is below.
> {code:java}
> // code placeholder
> targetIndices = new short[targets.length];
> private void initTargetIndices() {
>   BitSet bitset = reconstructor.getLiveBitSet();
>   int m = 0; hasValidTargets = false;
>   for (int i = 0; i < dataBlkNum + parityBlkNum; i++) {
>     if (!bitset.get(i)) {
>       if (reconstructor.getBlockLen(i) > 0) {
>         if (m < targets.length) {
>           targetIndices[m++] = (short) i;
>           hasValidTargets = true;
>         }
>       }
>     }
>   }
> }
> {code}
> targetIndices[0]=6, and targetIndices[1] is always 0 from its initial value.
> The StripedReader always creates readers from the first 6 index blocks, i.e. 
> [0,1,2,3,4,5].
> Using the indices [0,1,2,3,4,5] to build target indices [6,0] will trigger the ISA-L 
> bug: the block index 6's data is corrupted (all data is zero).
> I wrote a unit test that can reproduce this stably.
> {code:java}
> // code placeholder
> private int replicationStreamsHardLimit = 
> DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT;
> numDNs = dataBlocks + parityBlocks + 10;
> @Test(timeout = 24)
> public void testFileDecommission() throws Exception {
>   LOG.info("Starting test testFileDecommission");
>   final Path ecFile = new Path(ecDir, "testFileDecommission");
>   int writeBytes = cellSize * dataBlocks;
>   writeStripedFile(dfs, ecFile, writeBytes);
>   Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks());
>   FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes);
>   final INodeFile fileNode = cluster.getNamesystem().getFSDirectory()
>   .getINode4Write(ecFile.toString()).asFile();
>   LocatedBlocks locatedBlocks =
>   StripedFileTestUtil.getLocatedBlocks(ecFile, dfs);
>   LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0)
>   .get(0);
>   DatanodeInfo[] dnLocs = lb.getLocations();
>   LocatedStripedBlock lastBlock =
>   (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock();
>   DatanodeInfo[] storageInfos = lastBlock.getLocations();
>   //
>   DatanodeDescriptor datanodeDescriptor = 
> cluster.getNameNode().getNamesystem()
>   
> .getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid());
>   BlockInfo firstBlock = fileNode.getBlocks()[0];
>   DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock);
>   // the first heartbeat will consume 3 replica tasks
>   for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) {
> BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new 
> Block(i),
> new DatanodeStorageInfo[]{dStorageInfos[0]});
>   }
>   assertEquals(dataBlocks + parityBlocks, dnLocs.length);
>   int[] decommNodeIndex = {3, 4};
>   final List decommisionNodes = new ArrayList();
>   // add the node which will be decommissioning
>   decommisionNodes.add(dnLocs[decommNodeIndex[0]]);
>   decommisionNodes.add(dnLocs[decommNodeIndex[1]]);
>   decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED);
>   assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes());
> 

[jira] [Commented] (HDFS-14768) EC : Busy DN replica should be consider in live replica check.

2019-10-28 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960786#comment-16960786
 ] 

Hadoop QA commented on HDFS-14768:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m  4s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 56s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 45s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 39s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 35s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 20s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 15s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 15s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  0m 47s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 201 unchanged - 1 fixed = 203 total (was 202) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m 14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 34s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 45s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 18s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}115m 53s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 35s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}185m 24s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestCrcCorruption |
|   | hadoop.hdfs.server.namenode.TestNameNodeMXBean |
|   | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy |
|   | hadoop.hdfs.TestErasureCodingPoliciesWithRandomECPolicy |
|   | hadoop.hdfs.qjournal.server.TestJournalNodeSync |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 |
| JIRA Issue | HDFS-14768 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12984136/HDFS-14768.009.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux f3e1018a1ea1 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 7be5508 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_222 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28187/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| unit | 

[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission

2019-10-28 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960783#comment-16960783
 ] 

Ayush Saxena commented on HDFS-14920:
-

Thanks [~ferhui] for the patch.

A couple of comments:
{code:java}
+BitSet liveBitSet = isStriped ?
+new BitSet(((BlockInfoStriped) block).getTotalBlockNum()) : null;
+BitSet decommissioningBitSet = isStriped ?
 new BitSet(((BlockInfoStriped) block).getTotalBlockNum()) : null;
{code}

There is no need to repeat the same {{isStriped}} check and the same construction for both 
{{liveBitSet}} and {{decommissioningBitSet}}: either assign {{liveBitSet}} directly to 
{{decommissioningBitSet}}, or pull a common if check above. I prefer the first one, though; 
if you think readability is getting compromised, maybe put a one-line comment if required.
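
For reference, a minimal sketch of the second option (a single common check); the type and variable names are taken from the snippet above, everything else is an assumption:

{code:java}
// Sketch only: size both BitSets once when the block is striped, instead of
// repeating the isStriped check for each of them.
BitSet liveBitSet = null;
BitSet decommissioningBitSet = null;
if (isStriped) {
  int totalBlockNum = ((BlockInfoStriped) block).getTotalBlockNum();
  liveBitSet = new BitSet(totalBlockNum);
  decommissioningBitSet = new BitSet(totalBlockNum);
}
{code}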

I have a doubt here:


{code:java}
if (state == StoredReplicaState.LIVE) {
  if (!liveBitSet.get(blockIndex)) {
liveBitSet.set(blockIndex);
// Sub decommissioning because the index replica is live.
if (decommissioningBitSet.get(blockIndex)) {
  counters.subtract(StoredReplicaState.DECOMMISSIONING, 1);
} else {
  decommissioningBitSet.set(blockIndex);
}
  } else {
counters.subtract(StoredReplicaState.LIVE, 1);
counters.add(StoredReplicaState.REDUNDANT, 1);
  }
}
{code}

When the state is LIVE and the block index is not yet in {{liveBitSet}}, you set it there 
because the replica is live, fair enough. You then check whether the same index is already 
in {{decommissioningBitSet}} and, if it is, you subtract from the DECOMMISSIONING counter, 
which is OK. But in the else part, when the index is not in {{decommissioningBitSet}}, why 
are you adding it to {{decommissioningBitSet}}? How do you say it is decommissioning?
Here:
{code:java}
// Sub decommissioning because the index replica is live.
if (decommissioningBitSet.get(blockIndex)) {
  counters.subtract(StoredReplicaState.DECOMMISSIONING, 1);
} else {
  decommissioningBitSet.set(blockIndex);
}
{code}
Can you explain?

In the tests, you can use lambdas, something like this:

{code:java}
GenericTestUtils.waitFor(new Supplier<Boolean>() {
  @Override
  public Boolean get() {
return bm.countNodes(blockInfo).decommissioning() == numDecommission;
  }
}, 100, 1);
{code}

To:

{code:java}
GenericTestUtils.waitFor(
() -> bm.countNodes(blockInfo).decommissioning() == numDecommission,
100, 1);

{code}

Similarly for others.



> Erasure Coding: Decommission may hang If one or more datanodes are out of 
> service during decommission  
> ---
>
> Key: HDFS-14920
> URL: https://issues.apache.org/jira/browse/HDFS-14920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch
>
>
> Decommission test hangs in our clusters.
> We have seen messages like the following:
> {quote}
> 2019-10-22 15:58:51,514 TRACE 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block 
> blk_-9223372035600425840_372987973 numExpected=9, numLive=5
> 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: 
> blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, 
> corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, 
> maintenance replicas: 0, live entering maintenance replicas: 0, excess 
> replicas: 0, Is Open File: false, Datanodes having this block: 
> 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 
> 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 
> 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current 
> datanode decommissioning: true, Is current datanode entering maintenance: 
> false
> 2019-10-22 15:58:51,514 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node 
> 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate 
> to finish Decommission In Progress
> {quote}
> After digging into the source code and the cluster logs, I guess it happens in the 
> following steps.
> # Storage strategy is RS-6-3-1024k.
> # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8, b0 is from 
> datanode dn0, b1 is from datanode dn1, ...etc
> # At the beginning dn0 is in decommission progress, b0 is replicated 
> successfully, and dn0 is still in decommission progress.
> # Later b1, b2, b3 are in decommission progress, and dn4 containing b4 is out of 
> service, so b4 needs to be reconstructed, and an ErasureCodingWork is created to do it; in 
> the ErasureCodingWork, additionalReplRequired is 4
> # Because 

[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission

2019-10-28 Thread guojh (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960772#comment-16960772
 ] 

guojh commented on HDFS-14920:
--

[~ferhui] If the block group does not have all internal blocks, additionalReplRequired 
should be computed from liveBlockIndicies, that is, according to the block's real total 
block count and the live block indices. Is that simpler? Also, your patch does not seem 
to solve this issue. Have I missed anything?
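
A rough illustration of that idea, with assumed names (this is only a sketch, not the actual patch):

{code:java}
// Hypothetical sketch: derive the extra replicas required from the striped
// block's real width and the indices that are actually live.
int totalBlockNum = ((BlockInfoStriped) block).getTotalBlockNum();
BitSet live = new BitSet(totalBlockNum);
for (byte idx : liveBlockIndices) {  // assumed byte[] of live internal block indices
  live.set(idx);
}
int additionalReplRequired = totalBlockNum - live.cardinality();
{code}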

> Erasure Coding: Decommission may hang If one or more datanodes are out of 
> service during decommission  
> ---
>
> Key: HDFS-14920
> URL: https://issues.apache.org/jira/browse/HDFS-14920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Fei Hui
>Assignee: Fei Hui
>Priority: Major
> Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch
>
>
> Decommission test hangs in our clusters.
> We have seen messages like the following:
> {quote}
> 2019-10-22 15:58:51,514 TRACE 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block 
> blk_-9223372035600425840_372987973 numExpected=9, numLive=5
> 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: 
> blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, 
> corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, 
> maintenance replicas: 0, live entering maintenance replicas: 0, excess 
> replicas: 0, Is Open File: false, Datanodes having this block: 
> 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 
> 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 
> 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current 
> datanode decommissioning: true, Is current datanode entering maintenance: 
> false
> 2019-10-22 15:58:51,514 DEBUG 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node 
> 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate 
> to finish Decommission In Progress
> {quote}
> After digging into the source code and the cluster logs, I guess it happens in the 
> following steps.
> # Storage strategy is RS-6-3-1024k.
> # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8, b0 is from 
> datanode dn0, b1 is from datanode dn1, ...etc
> # At the beginning dn0 is in decommission progress, b0 is replicated 
> successfully, and dn0 is still in decommission progress.
> # Later b1, b2, b3 are in decommission progress, and dn4 containing b4 is out of 
> service, so b4 needs to be reconstructed, and an ErasureCodingWork is created to do it; in 
> the ErasureCodingWork, additionalReplRequired is 4
> # Because hasAllInternalBlocks is false, it will call 
> ErasureCodingWork#addTaskToDatanode -> 
> DatanodeDescriptor#addBlockToBeErasureCoded, and send a 
> BlockECReconstructionInfo task to the DataNode
> # The DataNode cannot reconstruct the block because targets is 4, greater 
> than 3 (the parity number).
> There is a problem, as follows, from BlockManager.java#scheduleReconstruction:
> {code}
>   // should reconstruct all the internal blocks before scheduling
>   // replication task for decommissioning node(s).
>   if (additionalReplRequired - numReplicas.decommissioning() -
>   numReplicas.liveEnteringMaintenanceReplicas() > 0) {
> additionalReplRequired = additionalReplRequired -
> numReplicas.decommissioning() -
> numReplicas.liveEnteringMaintenanceReplicas();
>   }
> {code}
> Reconstruction should be done first, and then replication for decommissioning. Currently 
> numReplicas.decommissioning() is 4 and additionalReplRequired is 4, which is 
> wrong:
> numReplicas.decommissioning() should be 3, since it should exclude the live replica. 
> If so, additionalReplRequired will be 1 and reconstruction will be scheduled as 
> expected. After that, decommission goes on.
>  
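
For illustration, a hedged sketch of the adjustment argued for in the quoted description; the bit sets, counters, and names are assumptions rather than the actual patch:

{code:java}
// Hypothetical sketch: subtract only the decommissioning replicas whose block
// index is not already covered by a live replica, so reconstruction is still
// scheduled before replication for decommissioning nodes.
int decommissioningWithoutLive = 0;
for (int idx = 0; idx < totalBlockNum; idx++) {
  if (decommissioningBitSet.get(idx) && !liveBitSet.get(idx)) {
    decommissioningWithoutLive++;
  }
}
if (additionalReplRequired - decommissioningWithoutLive > 0) {
  additionalReplRequired -= decommissioningWithoutLive;
}
{code}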



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


