[jira] [Comment Edited] (HDFS-15500) In-order deletion of snapshots: Diff lists must be updated only in the last snapshot

2020-08-03 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169022#comment-17169022
 ] 

Jitendra Nath Pandey edited comment on HDFS-15500 at 8/3/20, 5:41 PM:
--

Let's add an assertion to ensure only the latest snapshot is being updated.
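
A minimal sketch of what such a guard could look like (the helper names are assumptions for illustration, not the actual HDFS internals):
{code:java}
// Sketch: with ordered snapshot deletion enabled, a diff-list update is
// only legal against the latest snapshot of the directory.
// isOrderedSnapshotDeletionEnabled() and getLatestSnapshotId() are
// assumed helpers, not the exact HDFS method names.
void updateDiffList(INodeDirectory dir, int snapshotId) {
  if (isOrderedSnapshotDeletionEnabled()) {
    Preconditions.checkState(snapshotId == getLatestSnapshotId(dir),
        "Diff list of non-latest snapshot %s must not be modified", snapshotId);
  }
  // ... perform the actual diff-list update ...
}
{code}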

 


was (Author: jnp):
Let's add an assertion to ensure.

 

> In-order deletion of snapshots: Diff lists must be updated only in the last 
> snapshot
> ---
>
> Key: HDFS-15500
> URL: https://issues.apache.org/jira/browse/HDFS-15500
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Mukul Kumar Singh
>Assignee: Tsz-wo Sze
>Priority: Major
>
> With ordered deletions the diff lists of the snapshots should become 
> immutable except the latest one. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Comment Edited] (HDFS-15500) In-order deletion of snapshots: Diff lists must be updated only in the last snapshot

2020-08-03 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169022#comment-17169022
 ] 

Jitendra Nath Pandey edited comment on HDFS-15500 at 8/3/20, 5:38 PM:
--

Let's add an assertion to ensure.

 


was (Author: jnp):
Another useful check:
 * With ordered deletions the diff lists of the snapshots should become 
immutable except the latest one.  Can we add an assertion / validation for this?

 

> In-order deletion of snapshots: Diff lists must be updated only in the last 
> snapshot
> ---
>
> Key: HDFS-15500
> URL: https://issues.apache.org/jira/browse/HDFS-15500
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Mukul Kumar Singh
>Assignee: Tsz-wo Sze
>Priority: Major
>
> With ordered deletions the diff lists of the snapshots should become 
> immutable except the latest one. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Assigned] (HDFS-15500) In-order deletion of snapshots: Diff lists must be updated only in the last snapshot

2020-08-03 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey reassigned HDFS-15500:
---

Assignee: Tsz-wo Sze

> In-order deletion of snapshots: Diff lists must be updated only in the last 
> snapshot
> ---
>
> Key: HDFS-15500
> URL: https://issues.apache.org/jira/browse/HDFS-15500
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Mukul Kumar Singh
>Assignee: Tsz-wo Sze
>Priority: Major
>
> The jira proposes to add new assertions; one of the assertions to start with is:
> a) Add an assertion that, with the ordered snapshot deletion flag true, the 
> prior snapshot in cleanSubtree is null
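>
> A minimal sketch of assertion (a), with helper names assumed for illustration:
> {code:java}
> // With ordered deletion, only the earliest snapshot is actually removed,
> // so cleanSubtree should never see a prior snapshot.
> if (isOrderedSnapshotDeletionEnabled()) {   // assumed config check
>   Preconditions.checkState(prior == Snapshot.NO_SNAPSHOT_ID,
>       "Ordered snapshot deletion: prior snapshot must be null, but was %s",
>       prior);
> }
> {code}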



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Updated] (HDFS-15500) In-order deletion of snapshots: Diff lists must be updated only in the last snapshot

2020-08-03 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDFS-15500:

Description: With ordered deletions the diff lists of the snapshots should 
become immutable except the latest one.  (was: The jira proposes to add new 
assertions; one of the assertions to start with is:
a) Add an assertion that, with the ordered snapshot deletion flag true, the prior 
snapshot in cleanSubtree is null)

> In-order deletion of snapshots: Diff lists must be updated only in the last 
> snapshot
> ---
>
> Key: HDFS-15500
> URL: https://issues.apache.org/jira/browse/HDFS-15500
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Mukul Kumar Singh
>Assignee: Tsz-wo Sze
>Priority: Major
>
> With ordered deletions the diff lists of the snapshots should become 
> immutable except the latest one. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Updated] (HDFS-15500) In-order deletion of snapshots: Diff lists must be updated only in the last snapshot

2020-08-03 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDFS-15500:

Summary: In-order deletion of snapshots: Diff lists must be updated only in 
the last snapshot  (was: Add more assertions about ordered deletion of snapshot)

> In-order deletion of snapshots: Diff lists must be updated only in the last 
> snapshot
> ---
>
> Key: HDFS-15500
> URL: https://issues.apache.org/jira/browse/HDFS-15500
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Mukul Kumar Singh
>Priority: Major
>
> The jira proposes to add new assertions; one of the assertions to start with is:
> a) Add an assertion that, with the ordered snapshot deletion flag true, the 
> prior snapshot in cleanSubtree is null



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (HDFS-15500) Add more assertions about ordered deletion of snapshot

2020-07-31 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17169022#comment-17169022
 ] 

Jitendra Nath Pandey commented on HDFS-15500:
-

Another useful check:
 * With ordered deletions the diff lists of the snapshots should become 
immutable except the latest one.  Can we add an assertion / validation for this?

 

> Add more assertions about ordered deletion of snapshot
> --
>
> Key: HDFS-15500
> URL: https://issues.apache.org/jira/browse/HDFS-15500
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Mukul Kumar Singh
>Priority: Major
>
> The jira proposes to add new assertions; one of the assertions to start with is:
> a) Add an assertion that, with the ordered snapshot deletion flag true, the 
> prior snapshot in cleanSubtree is null



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Comment Edited] (HDFS-15482) Ordered snapshot deletion: hide the deleted snapshots from users

2020-07-30 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168103#comment-17168103
 ] 

Jitendra Nath Pandey edited comment on HDFS-15482 at 7/30/20, 5:37 PM:
---

We will need to consider a few cases here.
 # Do we allow creating a snapshot with the same name once an existing snapshot 
is marked for deletion but not actually deleted? If deleted snapshots are no 
longer visible, a user might want to create a snapshot with the same name and be 
surprised if it fails. On the other hand, if we allow it, the system ends up 
with two snapshots of the same name (a possible check is sketched below).
 # If a snapshot is deleted and hidden, can the user force an immediate 
delete (when it is next in order)? It makes sense to let users delete 
immediately if they are following the order, but a hidden snapshot will no 
longer be accessible. This gets more complicated if the user creates a snapshot 
of the same name.
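
For case 1, a rough sketch of the kind of name check snapshot creation could make (all names here are hypothetical, only to make the trade-off concrete):
{code:java}
// Hypothetical: reject reuse of a name held by a snapshot that is marked
// for deletion but whose space has not been reclaimed yet.
void checkSnapshotName(String name) throws IOException {
  Snapshot existing = findSnapshotIncludingDeleted(name);  // assumed lookup
  if (existing != null && existing.isMarkedForDeletion()) {
    throw new IOException("Snapshot name " + name
        + " is held by a snapshot pending deletion");
  }
}
{code}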


was (Author: jnp):
We will need to consider a few cases here.
1) Do we allow creating a snapshot with the same name once an existing snapshot 
is marked for deletion but not actually deleted? If deleted snapshots are no 
longer visible, a user might want to create a snapshot with the same name and be 
surprised if it fails. On the other hand, if we allow it, the system ends up 
with two snapshots of the same name. 
2) If a snapshot is deleted and hidden, can the user force an immediate 
delete (when it is next in order)? It makes sense to let users delete 
immediately if they are following the order, but a hidden snapshot will no 
longer be accessible. This gets more complicated if the user creates a snapshot 
of the same name.






> Ordered snapshot deletion: hide the deleted snapshots from users
> 
>
> Key: HDFS-15482
> URL: https://issues.apache.org/jira/browse/HDFS-15482
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: snapshots
>Reporter: Tsz-wo Sze
>Assignee: Shashikant Banerjee
>Priority: Major
>
> In HDFS-15480, the behavior of deleting the non-earliest snapshots is 
> changed to marking them as deleted in an XAttr but not actually deleting them.  
> Users are still able to access these snapshots as usual.
> In this JIRA, the marked-for-deletion snapshots are hidden so that they 
> become inaccessible to users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (HDFS-15482) Ordered snapshot deletion: hide the deleted snapshots from users

2020-07-30 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168103#comment-17168103
 ] 

Jitendra Nath Pandey commented on HDFS-15482:
-

We will need to consider a few cases here.
1) Do we allow creating a snapshot with the same name once an existing snapshot 
is marked for deletion but not actually deleted? If deleted snapshots are no 
longer visible, a user might want to create a snapshot with the same name and be 
surprised if it fails. On the other hand, if we allow it, the system ends up 
with two snapshots of the same name. 
2) If a snapshot is deleted and hidden, can the user force an immediate 
delete (when it is next in order)? It makes sense to let users delete 
immediately if they are following the order, but a hidden snapshot will no 
longer be accessible. This gets more complicated if the user creates a snapshot 
of the same name.






> Ordered snapshot deletion: hide the deleted snapshots from users
> 
>
> Key: HDFS-15482
> URL: https://issues.apache.org/jira/browse/HDFS-15482
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: snapshots
>Reporter: Tsz-wo Sze
>Assignee: Shashikant Banerjee
>Priority: Major
>
> In HDFS-15480, the behavior of deleting the non-earliest snapshots is 
> changed to marking them as deleted in an XAttr but not actually deleting them.  
> Users are still able to access these snapshots as usual.
> In this JIRA, the marked-for-deletion snapshots are hidden so that they 
> become inaccessible to users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (HDFS-15470) Added more unit tests to validate rename behaviour across snapshots

2020-07-20 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161466#comment-17161466
 ] 

Jitendra Nath Pandey commented on HDFS-15470:
-

+1 for the patch.

> Added more unit tests to validate rename behaviour across snapshots
> ---
>
> Key: HDFS-15470
> URL: https://issues.apache.org/jira/browse/HDFS-15470
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: snapshots
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
> Fix For: 3.0.4
>
> Attachments: HDFS-15470.000.patch, HDFS-15470.001.patch, 
> HDFS-15470.002.patch
>
>
> HDFS-15313 fixes a critical issue where a sequence of snapshot deletes could 
> delete data in the active fs. The idea is to add more tests to 
> verify the behaviour.
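>
> A hedged sketch of the shape such a test could take (cluster setup elided;
> {{dfs}} is assumed to be a {{DistributedFileSystem}} against a mini cluster):
> {code:java}
> @Test
> public void testRenameAcrossSnapshotsThenDelete() throws Exception {
>   Path dir = new Path("/snapdir");
>   dfs.mkdirs(dir);
>   dfs.allowSnapshot(dir);
>   dfs.create(new Path(dir, "a")).close();
>   dfs.createSnapshot(dir, "s1");
>   // rename between the two snapshots
>   dfs.rename(new Path(dir, "a"), new Path(dir, "b"));
>   dfs.createSnapshot(dir, "s2");
>   dfs.deleteSnapshot(dir, "s1");
>   dfs.deleteSnapshot(dir, "s2");
>   // data in the active fs must survive the sequence of snapshot deletes
>   Assert.assertTrue(dfs.exists(new Path(dir, "b")));
> }
> {code}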



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Resolved] (HDDS-2380) Use the Table.isExist API instead of get() call while checking for presence of key.

2019-11-05 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey resolved HDDS-2380.

Resolution: Fixed

> Use the Table.isExist API instead of get() call while checking for presence 
> of key.
> ---
>
> Key: HDDS-2380
> URL: https://issues.apache.org/jira/browse/HDDS-2380
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Aravindan Vijayan
>Assignee: Aravindan Vijayan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, when OM creates a file/directory, it checks the absence of all 
> prefix paths of the key in its RocksDB. Since we don't care about deserializing 
> the actual value, we should use the isExist API added in 
> org.apache.hadoop.hdds.utils.db.Table, which internally uses the more 
> performant keyMayExist API of RocksDB.
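>
> A short sketch of the intended change at a call site (table and key names
> are illustrative):
> {code:java}
> // Before: get() deserializes the value only to test for presence.
> boolean existsViaGet = omMetadataManager.getKeyTable().get(dbKeyName) != null;
>
> // After: isExist() skips deserialization and lets RocksDB take the
> // cheaper keyMayExist fast path internally.
> boolean exists = omMetadataManager.getKeyTable().isExist(dbKeyName);
> {code}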



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (HDDS-2331) Client OOME due to buffer retention

2019-10-29 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962669#comment-16962669
 ] 

Jitendra Nath Pandey commented on HDDS-2331:


Shall we resolve it, given RATIS-726 is committed, and HDDS-2375 tracks the 
changes to optimize buffer allocation?

> Client OOME due to buffer retention
> ---
>
> Key: HDDS-2331
> URL: https://issues.apache.org/jira/browse/HDDS-2331
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Client
>Affects Versions: 0.5.0
>Reporter: Attila Doroszlai
>Assignee: Shashikant Banerjee
>Priority: Critical
> Attachments: profiler.png
>
>
> Freon's random key generator exhausts the default heap after just a few hundred 
> 1MB keys.  A heap dump on OOME reveals 150+ instances of 
> {{ContainerCommandRequestMessage}}, each holding a 16MB {{byte[]}}.
> Steps to reproduce:
> # Start Ozone cluster with 1 datanode
> # Start Freon (5K keys of size 1MB)
> Result: OOME after a few hundred keys
> {noformat}
> $ cd hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/compose/ozone
> $ docker-compose up -d
> $ docker-compose exec scm bash
> $ export HADOOP_OPTS='-XX:+HeapDumpOnOutOfMemoryError'
> $ ozone freon rk --numOfThreads 1 --numOfVolumes 1 --numOfBuckets 1 
> --replicationType RATIS --factor ONE --keySize 1048576 --numOfKeys 5120 
> --bufferSize 65536
> ...
> java.lang.OutOfMemoryError: Java heap space
> Dumping heap to java_pid289.hprof ...
> Heap dump file created [1456141975 bytes in 7.760 secs]
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Assigned] (HDDS-2376) Fail to read data through XceiverClientGrpc

2019-10-29 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey reassigned HDDS-2376:
--

Assignee: Hanisha Koneru

> Fail to read data through XceiverClientGrpc
> ---
>
> Key: HDDS-2376
> URL: https://issues.apache.org/jira/browse/HDDS-2376
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Sammi Chen
>Assignee: Hanisha Koneru
>Priority: Blocker
>
> Run teragen; the application failed with the following stack trace:
> 19/10/29 14:35:42 INFO mapreduce.Job: Running job: job_1567133159094_0048
> 19/10/29 14:35:59 INFO mapreduce.Job: Job job_1567133159094_0048 running in 
> uber mode : false
> 19/10/29 14:35:59 INFO mapreduce.Job:  map 0% reduce 0%
> 19/10/29 14:35:59 INFO mapreduce.Job: Job job_1567133159094_0048 failed with 
> state FAILED due to: Application application_1567133159094_0048 failed 2 
> times due to AM Container for appattempt_1567133159094_0048_02 exited 
> with  exitCode: -1000
> For more detailed output, check application tracking 
> page:http://host183:8088/cluster/app/application_1567133159094_0048Then, 
> click on links to logs of each attempt.
> Diagnostics: Unexpected OzoneException: 
> org.apache.hadoop.ozone.common.OzoneChecksumException: Checksum mismatch at 
> index 0
> java.io.IOException: Unexpected OzoneException: 
> org.apache.hadoop.ozone.common.OzoneChecksumException: Checksum mismatch at 
> index 0
>   at 
> org.apache.hadoop.hdds.scm.storage.ChunkInputStream.readChunk(ChunkInputStream.java:342)
>   at 
> org.apache.hadoop.hdds.scm.storage.ChunkInputStream.readChunkFromContainer(ChunkInputStream.java:307)
>   at 
> org.apache.hadoop.hdds.scm.storage.ChunkInputStream.prepareRead(ChunkInputStream.java:259)
>   at 
> org.apache.hadoop.hdds.scm.storage.ChunkInputStream.read(ChunkInputStream.java:144)
>   at 
> org.apache.hadoop.hdds.scm.storage.BlockInputStream.read(BlockInputStream.java:239)
>   at 
> org.apache.hadoop.ozone.client.io.KeyInputStream.read(KeyInputStream.java:171)
>   at 
> org.apache.hadoop.fs.ozone.OzoneFSInputStream.read(OzoneFSInputStream.java:52)
>   at java.io.DataInputStream.read(DataInputStream.java:100)
>   at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:86)
>   at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:60)
>   at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:120)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366)
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:267)
>   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:62)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.ozone.common.OzoneChecksumException: Checksum 
> mismatch at index 0
>   at 
> org.apache.hadoop.ozone.common.ChecksumData.verifyChecksumDataMatches(ChecksumData.java:148)
>   at 
> org.apache.hadoop.ozone.common.Checksum.verifyChecksum(Checksum.java:275)
>   at 
> org.apache.hadoop.ozone.common.Checksum.verifyChecksum(Checksum.java:238)
>   at 
> org.apache.hadoop.hdds.scm.storage.ChunkInputStream.lambda$new$0(ChunkInputStream.java:375)
>   at 
> org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithRetry(XceiverClientGrpc.java:287)
>   at 
> org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommandWithTraceIDAndRetry(XceiverClientGrpc.java:250)
>   at 
> org.apache.hadoop.hdds.scm.XceiverClientGrpc.sendCommand(XceiverClientGrpc.java:233)
>   at 
> org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.readChunk(ContainerProtocolCalls.java:245)
>   at 
> org.apache.hadoop.hdds.scm.storage.ChunkInputStream.readChunk(ChunkInputStream.java:335)
>   ... 26 more
> Caused by: Checksum mismatch at index 0
> org.apache.hadoop.ozone.common.OzoneChecksumException: Checksum mismatch at 
> index 0
>   

[jira] [Updated] (HDDS-2041) Don't depend on DFSUtil to check HTTP policy

2019-10-28 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-2041:
---
Priority: Blocker  (was: Major)

> Don't depend on DFSUtil to check HTTP policy
> 
>
> Key: HDDS-2041
> URL: https://issues.apache.org/jira/browse/HDDS-2041
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>  Components: website
>Affects Versions: 0.4.1
>Reporter: Vivek Ratnavel Subramanian
>Assignee: Xiaoyu Yao
>Priority: Blocker
>
> Currently, BaseHttpServer uses DFSUtil to get the HTTP policy. As a result, when 
> the HTTP policy is set to HTTPS in hdfs-site.xml, Ozone HTTP servers try to come 
> up with HTTPS and fail if SSL certificates are not present in the required 
> location.
> Ozone web UIs should not depend on HDFS config to determine the HTTP policy. 
> Instead, they should have their own config to determine the policy. 
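>
> A rough sketch of an Ozone-owned lookup (the config key shown is
> illustrative; the actual key name is up to the patch):
> {code:java}
> // Resolve the HTTP policy from an Ozone-specific key instead of
> // dfs.http.policy, defaulting to HTTP_ONLY.
> String policyStr = conf.get("ozone.http.policy",
>     HttpConfig.Policy.HTTP_ONLY.name());
> HttpConfig.Policy policy = HttpConfig.Policy.fromString(policyStr);
> if (policy == null) {
>   throw new IllegalArgumentException(
>       "Unrecognized value for ozone.http.policy: " + policyStr);
> }
> {code}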



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Assigned] (HDDS-2041) Don't depend on DFSUtil to check HTTP policy

2019-10-28 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey reassigned HDDS-2041:
--

Assignee: Xiaoyu Yao  (was: Vivek Ratnavel Subramanian)

> Don't depend on DFSUtil to check HTTP policy
> 
>
> Key: HDDS-2041
> URL: https://issues.apache.org/jira/browse/HDDS-2041
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>  Components: website
>Affects Versions: 0.4.1
>Reporter: Vivek Ratnavel Subramanian
>Assignee: Xiaoyu Yao
>Priority: Major
>
> Currently, BaseHttpServer uses DFSUtil to get the HTTP policy. As a result, when 
> the HTTP policy is set to HTTPS in hdfs-site.xml, Ozone HTTP servers try to come 
> up with HTTPS and fail if SSL certificates are not present in the required 
> location.
> Ozone web UIs should not depend on HDFS config to determine the HTTP policy. 
> Instead, they should have their own config to determine the policy. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Updated] (HDDS-2283) Container creation on datanodes take time because of Rocksdb option creation.

2019-10-24 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-2283:
---
Fix Version/s: 0.5.0

> Container creation on datanodes take time because of Rocksdb option creation.
> -
>
> Key: HDDS-2283
> URL: https://issues.apache.org/jira/browse/HDDS-2283
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Reporter: Mukul Kumar Singh
>Assignee: Siddharth Wagle
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
> Attachments: HDDS-2283.00.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Container creation on datanodes takes around 300ms due to RocksDB creation. 
> RocksDB creation takes considerable time, and this needs to be optimized.
> Creating one RocksDB per disk should be enough, and each container can be a 
> table inside that RocksDB.
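>
> A minimal sketch of the proposed layout with the RocksDB Java API (paths and
> names illustrative; exception handling omitted):
> {code:java}
> // One RocksDB per data volume; each container becomes a column family
> // ("table") inside it, so creating a container is a cheap
> // createColumnFamily call instead of opening a whole new DB.
> List<ColumnFamilyDescriptor> cfds = new ArrayList<>();
> cfds.add(new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY));
> List<ColumnFamilyHandle> handles = new ArrayList<>();
> RocksDB perDiskDb = RocksDB.open(
>     new DBOptions().setCreateIfMissing(true),
>     "/data/disk1/containerdb", cfds, handles);
>
> // creating container 125 then becomes:
> ColumnFamilyHandle container125 = perDiskDb.createColumnFamily(
>     new ColumnFamilyDescriptor("container-125".getBytes(StandardCharsets.UTF_8)));
> {code}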



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-24 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16959180#comment-16959180
 ] 

Jitendra Nath Pandey commented on HDDS-2356:


This seems very similar to HDDS-2355.

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone to 
> a path on VM0, reading data from VM0's local disk and writing to the mount 
> path. The dataset has files of various sizes, from 0 bytes to GB-level, and 
> contains ~50,000 files. 
> Writing is slow (1GB in ~10 minutes) and it stops after around 4GB. Looking at 
> the hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related to multipart upload. This error eventually causes the writing to 
> terminate and OM to shut down. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Reopened] (HDDS-2181) Ozone Manager should send correct ACL type in ACL requests to Authorizer

2019-10-16 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey reopened HDDS-2181:


The pull request is still open.

> Ozone Manager should send correct ACL type in ACL requests to Authorizer
> 
>
> Key: HDDS-2181
> URL: https://issues.apache.org/jira/browse/HDDS-2181
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
>Reporter: Vivek Ratnavel Subramanian
>Assignee: Vivek Ratnavel Subramanian
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 10h 50m
>  Remaining Estimate: 0h
>
> Currently, Ozone Manager sends "WRITE" as the ACLType for key create, key delete 
> and bucket create operations. Fix the ACL type in all requests to the 
> authorizer, as sketched below.
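>
> A hedged sketch of the intended mapping (assuming the request handlers call a
> {{checkAcls}} helper; exact signatures live in the OM request classes):
> {code:java}
> // Each operation should present its own right to the authorizer
> // instead of a blanket WRITE:
> //   key create    -> IAccessAuthorizer.ACLType.CREATE
> //   key delete    -> IAccessAuthorizer.ACLType.DELETE
> //   bucket create -> IAccessAuthorizer.ACLType.CREATE
> checkAcls(ozoneManager, OzoneObj.ResourceType.KEY,
>     OzoneObj.StoreType.OZONE, IAccessAuthorizer.ACLType.CREATE,
>     volumeName, bucketName, keyName);
> {code}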



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (HDDS-2152) Ozone client fails with OOM while writing a large (~300MB) key.

2019-09-23 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935859#comment-16935859
 ] 

Jitendra Nath Pandey commented on HDDS-2152:


[~shashikant], is this related to RATIS-688?

> Ozone client fails with OOM while writing a large (~300MB) key.
> ---
>
> Key: HDDS-2152
> URL: https://issues.apache.org/jira/browse/HDDS-2152
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Client
>Reporter: Aravindan Vijayan
>Assignee: YiSheng Lien
>Priority: Major
> Attachments: largekey.png
>
>
> {code}
> dd if=/dev/zero of=testfile bs=1024 count=307200
> ozone sh key put /vol1/bucket1/key testfile
> {code}
> {code}
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at 
> java.nio.HeapByteBuffer.(HeapByteBuffer.java:57) at 
> java.nio.ByteBuffer.allocate(ByteBuffer.java:335) at 
> org.apache.hadoop.hdds.scm.storage.BufferPool.allocateBufferIfNeeded(BufferPool.java:66)
>  at 
> org.apache.hadoop.hdds.scm.storage.BlockOutputStream.write(BlockOutputStream.java:234)
>  at 
> org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.write(BlockOutputStreamEntry.java:129)
>  at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:211)
>  at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.write(KeyOutputStream.java:193)
>  at 
> org.apache.hadoop.ozone.client.io.OzoneOutputStream.write(OzoneOutputStream.java:49)
>  at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:96) at 
> org.apache.hadoop.ozone.web.ozShell.keys.PutKeyHandler.call(PutKeyHandler.java:117)
>  at 
> org.apache.hadoop.ozone.web.ozShell.keys.PutKeyHandler.call(PutKeyHandler.java:55)
>  at picocli.CommandLine.execute(CommandLine.java:1173) at 
> picocli.CommandLine.access$800(CommandLine.java:141)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Updated] (HDDS-2019) Handle Set DtService of token in S3Gateway for OM HA

2019-09-11 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-2019:
---
Priority: Critical  (was: Major)

> Handle Set DtService of token in S3Gateway for OM HA
> 
>
> Key: HDDS-2019
> URL: https://issues.apache.org/jira/browse/HDDS-2019
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Critical
>
> When OM HA is enabled and tokens are generated, the service name should be 
> set with the addresses of all OMs.
>  
> Currently, without HA, it is set with the OM RpcAddress string. This Jira is to 
> handle:
>  # Setting dtService with all OM addresses. Right now in OMClientProducer, the 
> UGI is created with the S3 token, and the token's serviceName is set with the 
> OM address; for the HA case, this should be set with all OM RPC addresses.
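>
> A minimal sketch of the idea (host names are made up; the real patch may
> reuse existing HA address utilities):
> {code:java}
> // Join every OM RPC address into the token's service field so that any
> // OM in the HA ring matches the token.
> List<String> omRpcAddresses = Arrays.asList(
>     "om1.example.com:9862", "om2.example.com:9862", "om3.example.com:9862");
> token.setService(new Text(String.join(",", omRpcAddresses)));
> {code}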



--
This message was sent by Atlassian Jira
(v8.3.2#803003)




[jira] [Updated] (HDDS-2100) Ozone TokenRenewer provider is incorrectly configured

2019-09-06 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-2100:
---
Status: Patch Available  (was: Open)

> Ozone TokenRenewer provider is incorrectly configured
> -
>
> Key: HDDS-2100
> URL: https://issues.apache.org/jira/browse/HDDS-2100
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Jitendra Nath Pandey
>Priority: Blocker
> Attachments: HDDS-2100.1.patch
>
>
> {{hadoop-ozone/ozonefs/src/main/resources/META-INF/services/org.apache.hadoop.security.token.TokenRenewer}}
>  contains {{org.apache.hadoop.fs.ozone.OzoneClientAdapterImpl$Renewer}}.
> The right renewer class is 
> {{org.apache.hadoop.fs.ozone.BasicOzoneClientAdapterImpl$Renewer}}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)




[jira] [Updated] (HDDS-2100) Ozone TokenRenewer provider is incorrectly configured

2019-09-06 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-2100:
---
Attachment: HDDS-2100.1.patch

> Ozone TokenRenewer provider is incorrectly configured
> -
>
> Key: HDDS-2100
> URL: https://issues.apache.org/jira/browse/HDDS-2100
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Jitendra Nath Pandey
>Priority: Blocker
> Attachments: HDDS-2100.1.patch
>
>
> {{hadoop-ozone/ozonefs/src/main/resources/META-INF/services/org.apache.hadoop.security.token.TokenRenewer}}
>  contains {{org.apache.hadoop.fs.ozone.OzoneClientAdapterImpl$Renewer}}.
> The right renewer class is 
> {{org.apache.hadoop.fs.ozone.BasicOzoneClientAdapterImpl$Renewer}}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)




[jira] [Created] (HDDS-2101) Ozone filesystem provider doesn't exist

2019-09-06 Thread Jitendra Nath Pandey (Jira)
Jitendra Nath Pandey created HDDS-2101:
--

 Summary: Ozone filesystem provider doesn't exist
 Key: HDDS-2101
 URL: https://issues.apache.org/jira/browse/HDDS-2101
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
  Components: Ozone Filesystem
Reporter: Jitendra Nath Pandey


We don't have a filesystem provider in META-INF, 
i.e. the following file doesn't exist:
{{hadoop-ozone/ozonefs/src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem}}

See, for example,
{{hadoop-tools/hadoop-aws/src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem}}
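
For reference, such a provider file is a single line naming the implementation class; assuming the ozonefs implementation class is {{org.apache.hadoop.fs.ozone.OzoneFileSystem}}, it would look like:
{noformat}
# META-INF/services/org.apache.hadoop.fs.FileSystem
org.apache.hadoop.fs.ozone.OzoneFileSystem
{noformat}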



--
This message was sent by Atlassian Jira
(v8.3.2#803003)




[jira] [Updated] (HDDS-2100) Ozone TokenRenewer provider is incorrectly configured

2019-09-06 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-2100:
---
Target Version/s: 0.4.1

> Ozone TokenRenewer provider is incorrectly configured
> -
>
> Key: HDDS-2100
> URL: https://issues.apache.org/jira/browse/HDDS-2100
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Jitendra Nath Pandey
>Priority: Blocker
>
> {{hadoop-ozone/ozonefs/src/main/resources/META-INF/services/org.apache.hadoop.security.token.TokenRenewer}}
>  contains {{org.apache.hadoop.fs.ozone.OzoneClientAdapterImpl$Renewer}}.
> The right renewer class is 
> {{org.apache.hadoop.fs.ozone.BasicOzoneClientAdapterImpl$Renewer}}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)




[jira] [Updated] (HDDS-2100) Ozone TokenRenewer provider is incorrectly configured

2019-09-06 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-2100:
---
Affects Version/s: 0.4.0

> Ozone TokenRenewer provider is incorrectly configured
> -
>
> Key: HDDS-2100
> URL: https://issues.apache.org/jira/browse/HDDS-2100
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Jitendra Nath Pandey
>Priority: Blocker
>
> {{hadoop-ozone/ozonefs/src/main/resources/META-INF/services/org.apache.hadoop.security.token.TokenRenewer}}
>  contains {{org.apache.hadoop.fs.ozone.OzoneClientAdapterImpl$Renewer}}.
> The right renewer class is 
> {{org.apache.hadoop.fs.ozone.BasicOzoneClientAdapterImpl$Renewer}}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)




[jira] [Updated] (HDDS-2100) Ozone TokenRenewer provider is incorrectly configured

2019-09-06 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-2100:
---
Description: 
{{hadoop-ozone/ozonefs/src/main/resources/META-INF/services/org.apache.hadoop.security.token.TokenRenewer}}
 contains {{org.apache.hadoop.fs.ozone.OzoneClientAdapterImpl$Renewer}}.

The right renewer class is 
{{org.apache.hadoop.fs.ozone.BasicOzoneClientAdapterImpl$Renewer}}

> Ozone TokenRenewer provider is incorrectly configured
> -
>
> Key: HDDS-2100
> URL: https://issues.apache.org/jira/browse/HDDS-2100
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Jitendra Nath Pandey
>Priority: Blocker
>
> {{hadoop-ozone/ozonefs/src/main/resources/META-INF/services/org.apache.hadoop.security.token.TokenRenewer}}
>  contains {{org.apache.hadoop.fs.ozone.OzoneClientAdapterImpl$Renewer}}.
> The right renewer class is 
> {{org.apache.hadoop.fs.ozone.BasicOzoneClientAdapterImpl$Renewer}}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)




[jira] [Created] (HDDS-2100) Ozone TokenRenewer provider is incorrectly configured

2019-09-06 Thread Jitendra Nath Pandey (Jira)
Jitendra Nath Pandey created HDDS-2100:
--

 Summary: Ozone TokenRenewer provider is incorrectly configured
 Key: HDDS-2100
 URL: https://issues.apache.org/jira/browse/HDDS-2100
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
Reporter: Jitendra Nath Pandey






--
This message was sent by Atlassian Jira
(v8.3.2#803003)




[jira] [Commented] (HDDS-1751) replication of underReplicated container fails with SCMContainerPlacementRackAware policy

2019-07-21 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889925#comment-16889925
 ] 

Jitendra Nath Pandey commented on HDDS-1751:


Is it related to HDDS-1713.

> replication of underReplicated container fails with 
> SCMContainerPlacementRackAware policy
> -
>
> Key: HDDS-1751
> URL: https://issues.apache.org/jira/browse/HDDS-1751
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Affects Versions: 0.4.0
>Reporter: Mukul Kumar Singh
>Priority: Major
>  Labels: MiniOzoneChaosCluster
>
> SCM container replication fails with
> {code}
> 2019-07-02 18:26:41,564 WARN  container.ReplicationManager 
> (ReplicationManager.java:handleUnderReplicatedContainer(501)) - Exception 
> while replicating container 18.
> org.apache.hadoop.hdds.scm.exceptions.SCMException: No enough datanodes to 
> choose.
> at 
> org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:100)
> at 
> org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:487)
> at 
> org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:293)
> at 
> java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649)
> at 
> java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
> at 
> org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:205)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)




[jira] [Comment Edited] (HDDS-1751) replication of underReplicated container fails with SCMContainerPlacementRackAware policy

2019-07-21 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889925#comment-16889925
 ] 

Jitendra Nath Pandey edited comment on HDDS-1751 at 7/22/19 5:18 AM:
-

Is it related to HDDS-1713?


was (Author: jnp):
Is it related to HDDS-1713.

> replication of underReplicated container fails with 
> SCMContainerPlacementRackAware policy
> -
>
> Key: HDDS-1751
> URL: https://issues.apache.org/jira/browse/HDDS-1751
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Affects Versions: 0.4.0
>Reporter: Mukul Kumar Singh
>Priority: Major
>  Labels: MiniOzoneChaosCluster
>
> SCM container replication fails with
> {code}
> 2019-07-02 18:26:41,564 WARN  container.ReplicationManager 
> (ReplicationManager.java:handleUnderReplicatedContainer(501)) - Exception 
> while replicating container 18.
> org.apache.hadoop.hdds.scm.exceptions.SCMException: No enough datanodes to 
> choose.
> at 
> org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodes(SCMContainerPlacementRackAware.java:100)
> at 
> org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:487)
> at 
> org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:293)
> at 
> java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649)
> at 
> java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1080)
> at 
> org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:205)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)




[jira] [Assigned] (HDDS-1586) Allow Ozone RPC client to read with topology awareness

2019-07-03 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey reassigned HDDS-1586:
--

Assignee: Sammi Chen

> Allow Ozone RPC client to read with topology awareness
> --
>
> Key: HDDS-1586
> URL: https://issues.apache.org/jira/browse/HDDS-1586
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Xiaoyu Yao
>Assignee: Sammi Chen
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> The idea is to leverage the node location from the block locations and prefer 
> reading from closer block replicas. 
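>
> A rough sketch of the client-side idea (the topology helper shown is an
> assumption for illustration):
> {code:java}
> // Order the pipeline's datanodes so the closest replica is tried first.
> List<DatanodeDetails> nodes = new ArrayList<>(pipeline.getNodes());
> nodes.sort(Comparator.comparingInt(
>     dn -> clusterMap.getDistanceCost(clientNode, dn)));
> // read from nodes.get(0) first, falling back down the list on failure
> {code}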



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Commented] (HDDS-1532) Ozone: Freon: Improve the concurrent testing framework.

2019-07-01 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876651#comment-16876651
 ] 

Jitendra Nath Pandey commented on HDDS-1532:


Awesome! Great work [~xudongcao].

> Ozone: Freon: Improve the concurrent testing framework.
> ---
>
> Key: HDDS-1532
> URL: https://issues.apache.org/jira/browse/HDDS-1532
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.4.0
>Reporter: Xudong Cao
>Assignee: Xudong Cao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.4.1
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently, Freon's concurrency framework works only at the volume level, but in 
> actual testing, users are likely to provide a small volume count (typically 
> 1) and larger bucket and key counts, in which case the existing 
> concurrency framework cannot make good use of the thread pool.
> We need to improve the concurrency policy so that the volume creation, 
> bucket creation, and key creation tasks can all be submitted equally to 
> the thread pool as general tasks, as sketched below. 
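>
> A hedged sketch of the improved policy (the create* helpers are placeholders):
> {code:java}
> ExecutorService pool = Executors.newFixedThreadPool(numOfThreads);
> for (int v = 0; v < numOfVolumes; v++) {
>   final int vol = v;
>   pool.submit(() -> {
>     createVolume(vol);
>     for (int b = 0; b < numOfBuckets; b++) {
>       final int bkt = b;
>       // bucket and key creation are re-submitted as general tasks
>       // instead of running inside the volume task
>       pool.submit(() -> {
>         createBucket(vol, bkt);
>         for (int k = 0; k < numOfKeys; k++) {
>           final int key = k;
>           pool.submit(() -> createKey(vol, bkt, key));
>         }
>       });
>     }
>   });
> }
> {code}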



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Updated] (HDDS-1496) Support partial chunk reads and checksum verification

2019-06-21 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1496:
---
Fix Version/s: 0.5.0

> Support partial chunk reads and checksum verification
> -
>
> Key: HDDS-1496
> URL: https://issues.apache.org/jira/browse/HDDS-1496
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Hanisha Koneru
>Assignee: Hanisha Koneru
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> BlockInputStream#readChunkFromContainer() reads the whole chunk from disk 
> even if we need to read only a part of the chunk.
> This Jira aims to improve readChunkFromContainer so that only the part of 
> the chunk file needed by the client is read, plus the part of the chunk file 
> required to verify the checksum.
> For example, let's say the client is reading from index 120 to 450 in the 
> chunk, and checksums are stored for every 100 bytes in the chunk, i.e. 
> the first checksum is for bytes from index 0 to 99, the next for bytes from 
> index 100 to 199, and so on. To verify bytes from 120 to 450, we would need to 
> read bytes 100 to 499 so that checksum verification can be done.
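>
> The rounding in that example works out as below (a small sketch with
> bytesPerChecksum = 100):
> {code:java}
> int bytesPerChecksum = 100;
> long readStart = 120, readEnd = 450;  // inclusive indices requested
> // round outward to checksum boundaries
> long chunkStart = (readStart / bytesPerChecksum) * bytesPerChecksum;        // 100
> long chunkEnd = ((readEnd / bytesPerChecksum) + 1) * bytesPerChecksum - 1;  // 499
> // bytes 100..499 are read and verified; 120..450 are returned to the client
> {code}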



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Updated] (HDDS-1589) CloseContainer transaction on unhealthy replica should fail with CONTAINER_UNHEALTHY exception

2019-06-03 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1589:
---
Fix Version/s: 0.5.0

> CloseContainer transaction on unhealthy replica should fail with 
> CONTAINER_UNHEALTHY exception
> --
>
> Key: HDDS-1589
> URL: https://issues.apache.org/jira/browse/HDDS-1589
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
> Fix For: 0.5.0
>
>
> Currently, while trying to close an unhealthy container over Ratis, it fails 
> with INTERNAL_ERROR, which leads to an exception as follows:
> {code:java}
> 2019-05-19 22:00:48,386 ERROR commandhandler.CloseContainerCommandHandler 
> (CloseContainerCommandHandler.java:handle(124)) - Can't close container #125
> org.apache.ratis.protocol.StateMachineException: 
> java.util.concurrent.CompletionException from Server 
> faea26b0-9c60-4b4c-a0df-bf7c67cc5b48: java.lang.IllegalStateException
> at 
> org.apache.ratis.server.impl.RaftServerImpl.lambda$replyPendingRequest$24(RaftServerImpl.java:1221)
> at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.concurrent.CompletionException: 
> java.lang.IllegalStateException
> at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
> at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1592)
> ... 3 more
> Caused by: java.lang.IllegalStateException
> at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:129)
> at 
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:300)
> at 
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:149)
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(ContainerStateMachine.java:347)
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.runCommand(ContainerStateMachine.java:354)
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$applyTransaction$5(ContainerStateMachine.java:613)
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> {code}
> This happens because, after the transaction has failed, it tries to mark the 
> container unhealthy, and that path expects the container to be in the OPEN or 
> CLOSING state and hence asserts. It should ideally fail with 
> CONTAINER_UNHEALTHY so that the request is not retried and the state is not 
> changed to UNHEALTHY, as sketched below.
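>
> A hedged sketch of the intended dispatcher behavior (names approximate the
> HddsDispatcher code path and are simplified):
> {code:java}
> // If the replica is already UNHEALTHY, answer the close request with
> // CONTAINER_UNHEALTHY instead of tripping the OPEN/CLOSING state check.
> if (container.getContainerState() == State.UNHEALTHY) {
>   return ContainerUtils.logAndReturnError(LOG,
>       new StorageContainerException("Container is in UNHEALTHY state",
>           ContainerProtos.Result.CONTAINER_UNHEALTHY),
>       msg);
> }
> {code}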



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Resolved] (HDDS-1589) CloseContainer transaction on unhealthy replica should fail with CONTAINER_UNHEALTHY exception

2019-06-03 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey resolved HDDS-1589.

Resolution: Fixed

> CloseContainer transaction on unhealthy replica should fail with 
> CONTAINER_UNHEALTHY exception
> --
>
> Key: HDDS-1589
> URL: https://issues.apache.org/jira/browse/HDDS-1589
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
>
> Currently, while trying to close an unhealthy container over Ratis, it fails 
> with INTERNAL_ERROR, which leads to an exception as follows:
> {code:java}
> 2019-05-19 22:00:48,386 ERROR commandhandler.CloseContainerCommandHandler 
> (CloseContainerCommandHandler.java:handle(124)) - Can't close container #125
> org.apache.ratis.protocol.StateMachineException: 
> java.util.concurrent.CompletionException from Server 
> faea26b0-9c60-4b4c-a0df-bf7c67cc5b48: java.lang.IllegalStateException
> at 
> org.apache.ratis.server.impl.RaftServerImpl.lambda$replyPendingRequest$24(RaftServerImpl.java:1221)
> at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
> at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.concurrent.CompletionException: 
> java.lang.IllegalStateException
> at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
> at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1592)
> ... 3 more
> Caused by: java.lang.IllegalStateException
> at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:129)
> at 
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:300)
> at 
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:149)
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(ContainerStateMachine.java:347)
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.runCommand(ContainerStateMachine.java:354)
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$applyTransaction$5(ContainerStateMachine.java:613)
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
> {code}
> This happens because, after the transaction has failed, it tries to mark the 
> container unhealthy, and that path expects the container to be in the OPEN or 
> CLOSING state and hence asserts. It should ideally fail with 
> CONTAINER_UNHEALTHY so that the request is not retried and the state is not 
> changed to UNHEALTHY.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Updated] (HDDS-1580) Obtain Handler reference in ContainerScrubber

2019-05-28 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1580:
---
Fix Version/s: 0.5.0

> Obtain Handler reference in ContainerScrubber
> -
>
> Key: HDDS-1580
> URL: https://issues.apache.org/jira/browse/HDDS-1580
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Affects Versions: 0.5.0
>Reporter: Shweta
>Assignee: Shweta
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Obtain reference to Handler based on containerType in scrub() in 
> ContainerScrubber.java



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)




[jira] [Assigned] (HDDS-1557) Datanode exits because Ratis fails to shut down the Ratis server

2019-05-22 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey reassigned HDDS-1557:
--

Assignee: Aravindan Vijayan

> Datanode exits because Ratis fails to shut down the Ratis server 
> 
>
> Key: HDDS-1557
> URL: https://issues.apache.org/jira/browse/HDDS-1557
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Aravindan Vijayan
>Priority: Major
>  Labels: MiniOzoneChaosCluster
>
> Datanode exits because Ratis fails to shut down the Ratis server 
> {code}
> 2019-05-19 12:07:19,276 INFO  impl.RaftServerImpl 
> (RaftServerImpl.java:checkInconsistentAppendEntries(965)) - 
> 80747533-f47c-43de-85b8-e70db448c63f: inconsistency entries. 
> Reply:99930d0a-72ab-4795-a3ac-f3c
> fb61ca1bb<-80747533-f47c-43de-85b8-e70db448c63f#3132:FAIL,INCONSISTENCY,nextIndex:9057,term:33,followerCommit:9057
> 2019-05-19 12:07:19,276 WARN  impl.RaftServerProxy 
> (RaftServerProxy.java:lambda$close$4(320)) - 
> e143b976-ab35-4555-a800-7f05a2b1b738: Failed to close GRPC server
> java.io.InterruptedIOException: e143b976-ab35-4555-a800-7f05a2b1b738: 
> shutdown server with port 64605 failed
> at 
> org.apache.ratis.util.IOUtils.toInterruptedIOException(IOUtils.java:48)
> at 
> org.apache.ratis.grpc.server.GrpcService.closeImpl(GrpcService.java:160)
> at 
> org.apache.ratis.server.impl.RaftServerRpcWithProxy.lambda$close$2(RaftServerRpcWithProxy.java:76)
> at 
> org.apache.ratis.util.LifeCycle.lambda$checkStateAndClose$2(LifeCycle.java:231)
> at 
> org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:251)
> at 
> org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:229)
> at 
> org.apache.ratis.server.impl.RaftServerRpcWithProxy.close(RaftServerRpcWithProxy.java:76)
> at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$close$4(RaftServerProxy.java:318)
> at 
> org.apache.ratis.util.LifeCycle.lambda$checkStateAndClose$2(LifeCycle.java:231)
> at 
> org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:251)
> at 
> org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:229)
> at 
> org.apache.ratis.server.impl.RaftServerProxy.close(RaftServerProxy.java:313)
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.stop(XceiverServerRatis.java:432)
> at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.stop(OzoneContainer.java:201)
> at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.close(DatanodeStateMachine.java:270)
> at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.stopDaemon(DatanodeStateMachine.java:394)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.stop(HddsDatanodeService.java:449)
> at 
> org.apache.hadoop.ozone.HddsDatanodeService.terminateDatanode(HddsDatanodeService.java:429)
> at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.start(DatanodeStateMachine.java:208)
> at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$startDaemon$0(DatanodeStateMachine.java:349)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at 
> org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl.awaitTermination(ServerImpl.java:282)
> at 
> org.apache.ratis.grpc.server.GrpcService.closeImpl(GrpcService.java:158)
> ... 19 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1517) AllocateBlock call fails with ContainerNotFoundException

2019-05-22 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845578#comment-16845578
 ] 

Jitendra Nath Pandey commented on HDDS-1517:


+1 for the latest patch in the PR.

> AllocateBlock call fails with ContainerNotFoundException
> 
>
> Key: HDDS-1517
> URL: https://issues.apache.org/jira/browse/HDDS-1517
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Affects Versions: 0.5.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.4.1
>
> Attachments: HDDS-1517.000.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> In the allocateContainer call, the container is first added to the 
> pipelineStateMap and then to the container cache. If two allocateBlock calls 
> execute concurrently, one may find the container in the pipelineStateMap 
> while it is yet to be added to the container cache, and hence fail with a 
> CONTAINER_NOT_FOUND exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1568) Add RocksDB metrics to OM

2019-05-21 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845268#comment-16845268
 ] 

Jitendra Nath Pandey commented on HDDS-1568:


We should have a generic metrics2 adapter for rocksdb metrics that can be used 
in SCM, OM and Datanodes.
Additionally, a Datanode will have multiple instances of rocksdb, one for each 
container, so it should be possible to prefix the metrics with the container-id. 
Please also note that a DN will have thousands of containers and hence thousands 
of rocksdb instances, so we might have scale concerns as well.
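A minimal sketch of such an adapter, assuming the standard hadoop-metrics2 MetricsSource API and the RocksDB Java Statistics class; the class name, record naming scheme and the chosen tickers are illustrative, not a settled design:

{code}
import org.apache.hadoop.metrics2.MetricsCollector;
import org.apache.hadoop.metrics2.MetricsRecordBuilder;
import org.apache.hadoop.metrics2.MetricsSource;
import org.apache.hadoop.metrics2.lib.Interns;
import org.rocksdb.Statistics;
import org.rocksdb.TickerType;

/** Generic adapter publishing RocksDB tickers as metrics2 counters. */
public class RocksDbMetricsSource implements MetricsSource {
  // A small explicit set of tickers; a real adapter could cover more.
  private static final TickerType[] TICKERS = {
      TickerType.BYTES_WRITTEN, TickerType.BYTES_READ
  };

  private final String recordName;
  private final Statistics stats;

  public RocksDbMetricsSource(String prefix, Statistics stats) {
    // Datanodes could pass "container-<id>" as the prefix; OM/SCM pass "".
    this.recordName = prefix.isEmpty() ? "RocksDb" : prefix + ".RocksDb";
    this.stats = stats;
  }

  @Override
  public void getMetrics(MetricsCollector collector, boolean all) {
    MetricsRecordBuilder rb = collector.addRecord(recordName);
    for (TickerType ticker : TICKERS) {
      rb.addCounter(Interns.info(ticker.name(), "RocksDB ticker " + ticker),
          stats.getTickerCount(ticker));
    }
  }
}
{code}

Given the thousands of per-container rocksdb instances on a DN noted above, a real datanode implementation would likely aggregate across containers rather than register one source per container.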

> Add RocksDB metrics to OM
> -
>
> Key: HDDS-1568
> URL: https://issues.apache.org/jira/browse/HDDS-1568
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Manager
>Reporter: Siddharth Wagle
>Assignee: Aravindan Vijayan
>Priority: Major
>
> RocksDB statistics need to be sunk to hadoop-metrics2 for Ozone Manager, so 
> that we can understand how the OM behaves under heavy load.
> Example: "rocksdb.bytes.written"
> https://github.com/facebook/rocksdb/wiki/Statistics



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDDS-1568) Add RocksDB metrics to OM

2019-05-21 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16845268#comment-16845268
 ] 

Jitendra Nath Pandey edited comment on HDDS-1568 at 5/21/19 9:24 PM:
-

We should have a generic metrics2 adapter for rocksdb metrics that can be used 
in SCM, OM and Datanodes.
Additionally, a Datanode will have multiple instances of rocksdb, one for each 
container, so it should be possible to prefix the metrics with the container-id. 
Please also note that a DN will have thousands of containers and hence thousands 
of rocksdb instances, so we might have scale concerns as well. We should 
consider datanode rocksdb metrics in a separate jira.


was (Author: jnp):
We should have a generic metrics2 adapter for rocksdb metrics that can be used 
in SCM, OM and Datanodes.
Additionally, a Datanode will have multiple instances of rocksdb one for each 
container, it should be possible to prefix the metrics with container-id. 
Please also note that DN will have thousands of containers and hence thousands 
of rocksdb instances, so we might have scale concerns as well.

> Add RocksDB metrics to OM
> -
>
> Key: HDDS-1568
> URL: https://issues.apache.org/jira/browse/HDDS-1568
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Manager
>Reporter: Siddharth Wagle
>Assignee: Aravindan Vijayan
>Priority: Major
>
> RocksDB statistics need to be sunk to hadoop-metrics2 for Ozone Manager, so 
> that we can understand how the OM behaves under heavy load.
> Example: "rocksdb.bytes.written"
> https://github.com/facebook/rocksdb/wiki/Statistics



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1530) Ozone: Freon: Support big files larger than 2GB and add "--bufferSize" and "--validateWrites" options.

2019-05-16 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841921#comment-16841921
 ] 

Jitendra Nath Pandey commented on HDDS-1530:


Cancelled and resubmitted the patch to trigger pre-commit again.

> Ozone: Freon: Support big files larger than 2GB and add "--bufferSize" and 
> "--validateWrites" options.
> --
>
> Key: HDDS-1530
> URL: https://issues.apache.org/jira/browse/HDDS-1530
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: test
>Reporter: Xudong Cao
>Assignee: Xudong Cao
>Priority: Major
> Attachments: HDDS-1530.001.patch, HDDS-1530.002.patch
>
>
> *Current problems:*
>  1. Freon does not support big files larger than 2GB because it uses an int 
> for both the "keySize" parameter and the "keyValue" buffer size.
>  2. Freon allocates an entire buffer for each key at once, so if the key size 
> is large and the concurrency is high, Freon frequently reports OOM 
> exceptions.
>  3. Freon lacks options such as "--validateWrites", so users cannot manually 
> specify that verification is required after writing.
> *Some solutions:*
>  1. Use a long type for the "keySize" parameter, so that Freon can support 
> big files larger than 2GB.
>  2. Reuse a small buffer rather than allocating the entire key-size buffer at 
> once; the default buffer size is 4K and can be configured by the 
> "--bufferSize" parameter.
>  3. Add a "--validateWrites" option to the Freon command line; users can 
> provide this option to indicate that validation is required after each write.
>  
>  
>  
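A minimal sketch of solution (2), assuming a hypothetical BufferedKeyWriter helper (not the actual Freon class): the key size is a long, so keys larger than 2GB work, and the payload is streamed through one small reusable buffer instead of a single key-sized allocation.

{code}
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

public final class BufferedKeyWriter {
  public static void writeKey(OutputStream out, long keySize, int bufferSize)
      throws IOException {
    byte[] buffer = new byte[bufferSize];   // e.g. 4096, the proposed default
    Arrays.fill(buffer, (byte) 'x');        // payload pattern; real Freon may randomize
    long remaining = keySize;               // long, so keys larger than 2GB work
    while (remaining > 0) {
      int chunk = (int) Math.min(remaining, buffer.length);
      out.write(buffer, 0, chunk);
      remaining -= chunk;
    }
  }
}
{code}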



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1530) Ozone: Freon: Support big files larger than 2GB and add "--bufferSize" and "--validateWrites" options.

2019-05-16 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1530:
---
Status: Patch Available  (was: Open)

> Ozone: Freon: Support big files larger than 2GB and add "--bufferSize" and 
> "--validateWrites" options.
> --
>
> Key: HDDS-1530
> URL: https://issues.apache.org/jira/browse/HDDS-1530
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: test
>Reporter: Xudong Cao
>Assignee: Xudong Cao
>Priority: Major
> Attachments: HDDS-1530.001.patch, HDDS-1530.002.patch
>
>
> *Current problems:*
>  1. Freon does not support big files larger than 2GB because it uses an int 
> for both the "keySize" parameter and the "keyValue" buffer size.
>  2. Freon allocates an entire buffer for each key at once, so if the key size 
> is large and the concurrency is high, Freon frequently reports OOM 
> exceptions.
>  3. Freon lacks options such as "--validateWrites", so users cannot manually 
> specify that verification is required after writing.
> *Some solutions:*
>  1. Use a long type for the "keySize" parameter, so that Freon can support 
> big files larger than 2GB.
>  2. Reuse a small buffer rather than allocating the entire key-size buffer at 
> once; the default buffer size is 4K and can be configured by the 
> "--bufferSize" parameter.
>  3. Add a "--validateWrites" option to the Freon command line; users can 
> provide this option to indicate that validation is required after each write.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1530) Ozone: Freon: Support big files larger than 2GB and add "--bufferSize" and "--validateWrites" options.

2019-05-16 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1530:
---
Status: Open  (was: Patch Available)

> Ozone: Freon: Support big files larger than 2GB and add "--bufferSize" and 
> "--validateWrites" options.
> --
>
> Key: HDDS-1530
> URL: https://issues.apache.org/jira/browse/HDDS-1530
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: test
>Reporter: Xudong Cao
>Assignee: Xudong Cao
>Priority: Major
> Attachments: HDDS-1530.001.patch, HDDS-1530.002.patch
>
>
> *Current problems:*
>  1. Freon does not support big files larger than 2GB because it uses an int 
> for both the "keySize" parameter and the "keyValue" buffer size.
>  2. Freon allocates an entire buffer for each key at once, so if the key size 
> is large and the concurrency is high, Freon frequently reports OOM 
> exceptions.
>  3. Freon lacks options such as "--validateWrites", so users cannot manually 
> specify that verification is required after writing.
> *Some solutions:*
>  1. Use a long type for the "keySize" parameter, so that Freon can support 
> big files larger than 2GB.
>  2. Reuse a small buffer rather than allocating the entire key-size buffer at 
> once; the default buffer size is 4K and can be configured by the 
> "--bufferSize" parameter.
>  3. Add a "--validateWrites" option to the Freon command line; users can 
> provide this option to indicate that validation is required after each write.
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1517) AllocateBlock call fails with ContainerNotFoundException

2019-05-16 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1517:
---
Status: Patch Available  (was: Open)

> AllocateBlock call fails with ContainerNotFoundException
> 
>
> Key: HDDS-1517
> URL: https://issues.apache.org/jira/browse/HDDS-1517
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Affects Versions: 0.5.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.4.1
>
> Attachments: HDDS-1517.000.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In the allocateContainer call, the container is first added to the 
> pipelineStateMap and then to the container cache. If two allocateBlock calls 
> execute concurrently, one may find the container in the pipelineStateMap 
> while it is yet to be added to the container cache, and hence fail with a 
> CONTAINER_NOT_FOUND exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1517) AllocateBlock call fails with ContainerNotFoundException

2019-05-16 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1517:
---
Status: Open  (was: Patch Available)

> AllocateBlock call fails with ContainerNotFoundException
> 
>
> Key: HDDS-1517
> URL: https://issues.apache.org/jira/browse/HDDS-1517
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Affects Versions: 0.5.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.4.1
>
> Attachments: HDDS-1517.000.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In the allocateContainer call, the container is first added to the 
> pipelineStateMap and then to the container cache. If two allocateBlock calls 
> execute concurrently, one may find the container in the pipelineStateMap 
> while it is yet to be added to the container cache, and hence fail with a 
> CONTAINER_NOT_FOUND exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-14323) Distcp fails in Hadoop 3.x when 2.x source webhdfs url has special characters in hdfs file path

2019-05-16 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey reassigned HDFS-14323:
---

Assignee: Srinivasu Majeti

> Distcp fails in Hadoop 3.x when 2.x source webhdfs url has special characters 
> in hdfs file path
> ---
>
> Key: HDFS-14323
> URL: https://issues.apache.org/jira/browse/HDFS-14323
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Affects Versions: 3.2.0
>Reporter: Srinivasu Majeti
>Assignee: Srinivasu Majeti
>Priority: Major
> Attachments: HDFS-14323v0.patch
>
>
> There was an enhancement to allow semicolons in source/target URLs for the 
> distcp use case as part of HDFS-13176, and a backward compatibility fix as 
> part of HDFS-13582. Still, there seems to be an issue when triggering distcp 
> from a 3.x cluster to pull webhdfs data from a 2.x hadoop cluster. We might 
> need to adjust the existing fix as described below by checking whether the 
> url is already encoded or not. That fixes it. 
> diff --git 
> a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/web/WebHdfsFileSystem.java
>  
> b/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/web/WebHdfsFileSystem.java
> index 5936603c34a..dc790286aff 100644
> --- 
> a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/web/WebHdfsFileSystem.java
> +++ 
> b/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/web/WebHdfsFileSystem.java
> @@ -609,7 +609,10 @@ URL toUrl(final HttpOpParam.Op op, final Path fspath,
>  boolean pathAlreadyEncoded = false;
>  try {
>  fspathUriDecoded = URLDecoder.decode(fspathUri.getPath(), "UTF-8");
> - pathAlreadyEncoded = true;
> + if(!fspathUri.getPath().equals(fspathUriDecoded))
> + {
> + pathAlreadyEncoded = true;
> + }
>  } catch (IllegalArgumentException ex) {
>  LOG.trace("Cannot decode URL encoded file", ex);
>  }
>  
>  
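The essence of the fix above, as a small self-contained sketch (the helper class and method names are hypothetical): a path counts as already URL-encoded only when decoding actually changes it, whereas before the fix any path that decoded without error, including an unencoded one, set the flag.

{code}
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public final class EncodedPathCheck {
  public static boolean isAlreadyEncoded(String path) {
    try {
      String decoded = URLDecoder.decode(path, "UTF-8");
      return !path.equals(decoded);      // changed by decoding => it was encoded
    } catch (IllegalArgumentException e) {
      return false;                      // malformed escape => treat as unencoded
    } catch (UnsupportedEncodingException e) {
      throw new AssertionError("UTF-8 is always supported", e);
    }
  }
}
{code}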



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDDS-1517) AllocateBlock call fails with ContainerNotFoundException

2019-05-16 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841586#comment-16841586
 ] 

Jitendra Nath Pandey edited comment on HDDS-1517 at 5/16/19 5:56 PM:
-

The patch moves the addition of the container to pipelineStateMap after its 
addition to the container cache. Now a thread may first find the container in 
the cache but not in pipelineStateMap. How is the race condition addressed? Do 
we guarantee that a thread will always look in a certain order?


was (Author: jnp):
The patch moves addition of container to pipelineStateMap after its addition to 
container cache. Now a thread may first find the container in the cache but not 
in pipelineStateMap. How is the race condition addressed? Do we guarantee that 
a thread will never look in pipelineStateMap before it looks in container cache?

> AllocateBlock call fails with ContainerNotFoundException
> 
>
> Key: HDDS-1517
> URL: https://issues.apache.org/jira/browse/HDDS-1517
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Affects Versions: 0.5.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.4.1
>
> Attachments: HDDS-1517.000.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In the allocateContainer call, the container is first added to the 
> pipelineStateMap and then to the container cache. If two allocateBlock calls 
> execute concurrently, one may find the container in the pipelineStateMap 
> while it is yet to be added to the container cache, and hence fail with a 
> CONTAINER_NOT_FOUND exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1517) AllocateBlock call fails with ContainerNotFoundException

2019-05-16 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841586#comment-16841586
 ] 

Jitendra Nath Pandey commented on HDDS-1517:


The patch moves the addition of the container to pipelineStateMap after its 
addition to the container cache. Now a thread may first find the container in 
the cache but not in pipelineStateMap. How is the race condition addressed? Do 
we guarantee that a thread will never look in pipelineStateMap before it looks 
in the container cache?
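To make the ordering question concrete, here is a sketch under the patched write order (all names are illustrative stand-ins for the SCM classes): if readers consult pipelineStateMap first and the cache second, mirroring the writes in reverse, then a container visible in pipelineStateMap is guaranteed to be in the cache as well.

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class AllocationOrderSketch {
  private final Map<Long, String> containerCache = new ConcurrentHashMap<>();
  private final Map<Long, String> pipelineStateMap = new ConcurrentHashMap<>();

  void allocateContainer(long id, String container) {
    containerCache.put(id, container);    // 1. cache first (the patched order)
    pipelineStateMap.put(id, container);  // 2. then pipeline state
  }

  String allocateBlock(long id) {
    // Seeing the entry in pipelineStateMap happens-after the writer's cache
    // put, so the cache lookup below cannot miss under this reader order.
    if (pipelineStateMap.containsKey(id)) {
      return containerCache.get(id);
    }
    throw new IllegalStateException("CONTAINER_NOT_FOUND: " + id);
  }
}
{code}

If some code path looks in the cache first, the race merely moves rather than disappears, which is the concern raised above.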

> AllocateBlock call fails with ContainerNotFoundException
> 
>
> Key: HDDS-1517
> URL: https://issues.apache.org/jira/browse/HDDS-1517
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Affects Versions: 0.5.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.4.1
>
> Attachments: HDDS-1517.000.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In the allocateContainer call, the container is first added to the 
> pipelineStateMap and then to the container cache. If two allocateBlock calls 
> execute concurrently, one may find the container in the pipelineStateMap 
> while it is yet to be added to the container cache, and hence fail with a 
> CONTAINER_NOT_FOUND exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDDS-1491) Ozone KeyInputStream seek() should not read the chunk file

2019-05-13 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey resolved HDDS-1491.

   Resolution: Fixed
Fix Version/s: 0.5.0

> Ozone KeyInputStream seek() should not read the chunk file
> --
>
> Key: HDDS-1491
> URL: https://issues.apache.org/jira/browse/HDDS-1491
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Hanisha Koneru
>Assignee: Hanisha Koneru
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> KeyInputStream#seek() calls BlockInputStream#seek() to adjust the buffer 
> position to the seeked position. As part of the seek operation, the whole 
> chunk is read from the container and stored in the buffer so that the buffer 
> position can be advanced to the seeked position. 
> We should not read from disk on a seek() operation. Instead, when the chunk 
> file is read and put in the buffer during a subsequent read operation, we 
> can advance the buffer position to the previously seeked position.
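A sketch of this lazy-seek pattern (abstract and simplified, not the real BlockInputStream; backward seeks within the buffered range are ignored for brevity): seek() only records the target position, and the chunk is fetched when the next read() arrives.

{code}
import java.io.IOException;

public abstract class LazySeekStream {
  private long position;           // position requested by the caller
  private long bufferedUpTo = -1;  // exclusive end of currently buffered data

  public void seek(long pos) {
    position = pos;                // no disk read here
  }

  public int read(byte[] b, int off, int len) throws IOException {
    if (position >= bufferedUpTo) {
      // Only now is the chunk read, and the buffer lands at the seeked spot.
      bufferedUpTo = fetchChunkContaining(position);
    }
    int n = readFromBuffer(position, b, off, len);
    if (n > 0) {
      position += n;
    }
    return n;
  }

  /** Reads the chunk containing pos; returns the exclusive end of the buffer. */
  protected abstract long fetchChunkContaining(long pos) throws IOException;

  /** Copies up to len buffered bytes starting at pos; returns the count. */
  protected abstract int readFromBuffer(long pos, byte[] b, int off, int len);
}
{code}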



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12735) Make ContainerStateMachine#applyTransaction async

2019-05-10 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDFS-12735:

Status: Open  (was: Patch Available)

> Make ContainerStateMachine#applyTransaction async
> -
>
> Key: HDFS-12735
> URL: https://issues.apache.org/jira/browse/HDFS-12735
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: performance
> Attachments: HDFS-12735-HDFS-7240.000.patch, 
> HDFS-12735-HDFS-7240.001.patch, HDFS-12735-HDFS-7240.002.patch
>
>
> Currently ContainerStateMachine#applyTransaction makes a synchronous call to 
> dispatch client requests. The idea is to have a thread pool which dispatches 
> client requests and returns a CompletableFuture.
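The pattern described amounts to something like the following sketch (dispatch() stands in for the real HddsDispatcher call, and the pool size is arbitrary): the state machine returns a future immediately while the dispatch runs on a dedicated executor.

{code}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncApplySketch {
  private final ExecutorService executor = Executors.newFixedThreadPool(8);

  public CompletableFuture<String> applyTransaction(String request) {
    // Ratis gets a future immediately; the dispatch runs on the pool.
    return CompletableFuture.supplyAsync(() -> dispatch(request), executor);
  }

  private String dispatch(String request) {
    return "response-to-" + request;  // placeholder for container dispatch
  }
}
{code}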



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12735) Make ContainerStateMachine#applyTransaction async

2019-05-10 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16837617#comment-16837617
 ] 

Jitendra Nath Pandey commented on HDFS-12735:
-

I think this has already been addressed.
[~ljain] [~msingh], please confirm and close this.

> Make ContainerStateMachine#applyTransaction async
> -
>
> Key: HDFS-12735
> URL: https://issues.apache.org/jira/browse/HDFS-12735
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: performance
> Attachments: HDFS-12735-HDFS-7240.000.patch, 
> HDFS-12735-HDFS-7240.001.patch, HDFS-12735-HDFS-7240.002.patch
>
>
> Currently ContainerStateMachine#applyTransaction makes a synchronous call to 
> dispatch client requests. The idea is to have a thread pool which dispatches 
> client requests and returns a CompletableFuture.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1517) AllocateBlock call fails with ContainerNotFoundException

2019-05-10 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16837488#comment-16837488
 ] 

Jitendra Nath Pandey commented on HDDS-1517:


Is it related to HDDS-1374?

> AllocateBlock call fails with ContainerNotFoundException
> 
>
> Key: HDDS-1517
> URL: https://issues.apache.org/jira/browse/HDDS-1517
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Affects Versions: 0.5.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
> Fix For: 0.5.0
>
>
> In the allocateContainer call, the container is first added to the 
> pipelineStateMap and then to the container cache. If two allocateBlock calls 
> execute concurrently, one may find the container in the pipelineStateMap 
> while it is yet to be added to the container cache, and hence fail with a 
> CONTAINER_NOT_FOUND exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1445) Add handling of NotReplicatedException in OzoneClient

2019-05-07 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1445:
---
Fix Version/s: 0.5.0

> Add handling of NotReplicatedException in OzoneClient
> -
>
> Key: HDDS-1445
> URL: https://issues.apache.org/jira/browse/HDDS-1445
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: MiniOzoneChaosCluster
> Fix For: 0.5.0
>
>
> In MiniOzoneChaosCluster, some of the calls fail with NotReplicatedException. 
> This exception needs to be handled in the OzoneClient.
> {code}
> 2019-04-17 10:13:47,254 INFO  client.GrpcClientProtocolService 
> (GrpcClientProtocolService.java:lambda$processClientRequest$0(264)) - Failed 
> RaftClientRequest:client-43B95E0E3BE0->1ebec547-8cf8-4466-bf43-ea9f19fb546b@group-1B28E0BF6CBC,
>  cid=800, seq=0, Watch-ALL_COMMITTED(234), Message:, 
> reply=RaftClientReply:client-43B95E0E3BE0->1ebec547-8cf8-4466-bf43-ea9f19fb546b@group-1B28E0BF6CBC,
>  cid=800, FAILED org.apache.ratis.protocol.NotReplicatedException: Request 
> with call Id 800 and log index 234 is not yet replicated to ALL_COMMITTED, 
> logIndex=234, commits[1ebec547-8cf8-4466-bf43-ea9f19fb546b:c267, 
> 7b200ef5-7711-437d-a9bc-ad0e18fdf6bb:c267, 
> ffbfb65f-a622-466d-b6e8-47038cc15e0b:c226]
> {code}
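One plausible shape for the client-side handling, as a hedged sketch (the exception class below is a local stand-in for org.apache.ratis.protocol.NotReplicatedException, and the retry policy is illustrative): a watch that is not yet replicated to ALL_COMMITTED is retryable, so back off and retry a bounded number of times instead of surfacing the failure.

{code}
public final class ReplicationRetrySketch {
  static class NotReplicatedException extends Exception { }

  interface Call<T> {
    T run() throws NotReplicatedException;
  }

  static <T> T withRetry(Call<T> call, int maxAttempts, long backoffMillis)
      throws NotReplicatedException, InterruptedException {
    for (int attempt = 1; ; attempt++) {
      try {
        return call.run();
      } catch (NotReplicatedException e) {
        if (attempt >= maxAttempts) {
          throw e;                              // exhausted: surface to caller
        }
        Thread.sleep(backoffMillis * attempt);  // linear backoff between tries
      }
    }
  }
}
{code}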



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDDS-1445) Add handling of NotReplicatedException in OzoneClient

2019-05-07 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey resolved HDDS-1445.

Resolution: Fixed

> Add handling of NotReplicatedException in OzoneClient
> -
>
> Key: HDDS-1445
> URL: https://issues.apache.org/jira/browse/HDDS-1445
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: MiniOzoneChaosCluster
>
> In MiniOzoneChaosCluster, some of the calls fail with NotReplicatedException. 
> This exception needs to be handled in the OzoneClient.
> {code}
> 2019-04-17 10:13:47,254 INFO  client.GrpcClientProtocolService 
> (GrpcClientProtocolService.java:lambda$processClientRequest$0(264)) - Failed 
> RaftClientRequest:client-43B95E0E3BE0->1ebec547-8cf8-4466-bf43-ea9f19fb546b@group-1B28E0BF6CBC,
>  cid=800, seq=0, Watch-ALL_COMMITTED(234), Message:, 
> reply=RaftClientReply:client-43B95E0E3BE0->1ebec547-8cf8-4466-bf43-ea9f19fb546b@group-1B28E0BF6CBC,
>  cid=800, FAILED org.apache.ratis.protocol.NotReplicatedException: Request 
> with call Id 800 and log index 234 is not yet replicated to ALL_COMMITTED, 
> logIndex=234, commits[1ebec547-8cf8-4466-bf43-ea9f19fb546b:c267, 
> 7b200ef5-7711-437d-a9bc-ad0e18fdf6bb:c267, 
> ffbfb65f-a622-466d-b6e8-47038cc15e0b:c226]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Reopened] (HDDS-1384) TestBlockOutputStreamWithFailures is failing

2019-05-07 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey reopened HDDS-1384:


> TestBlockOutputStreamWithFailures is failing
> 
>
> Key: HDDS-1384
> URL: https://issues.apache.org/jira/browse/HDDS-1384
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: test
>Reporter: Nanda kumar
>Assignee: Elek, Marton
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> TestBlockOutputStreamWithFailures is failing with the following error
> {noformat}
> 2019-04-04 18:52:43,240 INFO  volume.ThrottledAsyncChecker 
> (ThrottledAsyncChecker.java:schedule(140)) - Scheduling a check for 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume@1f6c0e8a
> 2019-04-04 18:52:43,240 INFO  volume.HddsVolumeChecker 
> (HddsVolumeChecker.java:checkAllVolumes(203)) - Scheduled health check for 
> volume org.apache.hadoop.ozone.container.common.volume.HddsVolume@1f6c0e8a
> 2019-04-04 18:52:43,241 ERROR server.GrpcService 
> (ExitUtils.java:terminate(133)) - Terminating with exit status 1: Failed to 
> start Grpc server
> java.io.IOException: Failed to bind
>   at 
> org.apache.ratis.thirdparty.io.grpc.netty.NettyServer.start(NettyServer.java:253)
>   at 
> org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl.start(ServerImpl.java:166)
>   at 
> org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl.start(ServerImpl.java:81)
>   at org.apache.ratis.grpc.server.GrpcService.startImpl(GrpcService.java:144)
>   at org.apache.ratis.util.LifeCycle.startAndTransition(LifeCycle.java:202)
>   at 
> org.apache.ratis.server.impl.RaftServerRpcWithProxy.start(RaftServerRpcWithProxy.java:69)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$start$3(RaftServerProxy.java:300)
>   at org.apache.ratis.util.LifeCycle.startAndTransition(LifeCycle.java:202)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:298)
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:419)
>   at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:186)
>   at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.start(DatanodeStateMachine.java:169)
>   at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$startDaemon$0(DatanodeStateMachine.java:338)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.BindException: Address already in use
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:433)
>   at sun.nio.ch.Net.bind(Net.java:425)
>   at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>   at 
> org.apache.ratis.thirdparty.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:130)
>   at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:558)
>   at 
> org.apache.ratis.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1358)
>   at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:501)
>   at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:486)
>   at 
> org.apache.ratis.thirdparty.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:1019)
>   at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:254)
>   at 
> org.apache.ratis.thirdparty.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:366)
>   at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
>   at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
>   at 
> org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:462)
>   at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
>   at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   ... 1 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1424) Support multi-container robot test execution

2019-05-07 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1424:
---
Fix Version/s: 0.5.0

> Support multi-container robot test execution
> 
>
> Key: HDDS-1424
> URL: https://issues.apache.org/jira/browse/HDDS-1424
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Elek, Marton
>Assignee: Elek, Marton
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> The ./smoketest folder in the distribution package contains robotframework 
> based test scripts to test the main behaviour of Ozone.
> The tests have two layers:
> 1. robot test definitions to execute commands and assert the results (on a 
> given host machine)
> 2. ./smoketest/test.sh which starts/stops the docker-compose based 
> environments AND execute the selected robot tests inside the right hosts
> The second one (test.sh) has some serious limitations:
> 1. all the tests are executed inside the same container (om):
> https://github.com/apache/hadoop/blob/5f951ea2e39ae4dfe554942baeec05849cd7d3c2/hadoop-ozone/dist/src/main/smoketest/test.sh#L89
> Some of the tests (ozonesecure-mr, ozonefs) may require the flexibility to 
> execute different robot tests in different containers.
> 2. The definition of the global test set is complex and hard to understand. 
> The current code is:
> {code}
>TESTS=("basic")
>execute_tests ozone "${TESTS[@]}"
>TESTS=("auditparser")
>execute_tests ozone "${TESTS[@]}"
>TESTS=("ozonefs")
>execute_tests ozonefs "${TESTS[@]}"
>TESTS=("basic")
>execute_tests ozone-hdfs "${TESTS[@]}"
>TESTS=("s3")
>execute_tests ozones3 "${TESTS[@]}"
>TESTS=("security")
>execute_tests ozonesecure .
> {code} 
> For example, for ozonesecure the TESTS variable is not used, and the usage of 
> bash lists requires additional complexity in the execute_tests function.
> I propose here a very lightweight refactor. Instead of including both the 
> test definitions AND the helper methods in test.sh I would separate them.
> Let's put a test.sh to each of the compose directories. The separated test.sh 
> can include common methods from a main shell script. For example:
> {code}
> source "$COMPOSE_DIR/../testlib.sh"
> start_docker_env
> execute_robot_test scm basic/basic.robot
> execute_robot_test scm s3
> stop_docker_env
> generate_report
> {code}
> This is a cleaner and more flexible definition. It's easy to execute just 
> this test, as it's saved to the compose/ozones3 directory.
> Other example, where multiple containers are used to execute tests:
> {code}
> source "$COMPOSE_DIR/../testlib.sh"
> start_docker_env
> execute_robot_test scm ozonefs/ozonefs.robot
> export OZONE_HOME=/opt/ozone
> execute_robot_test hadoop32 ozonefs/hadoopo3fs.robot
> execute_robot_test hadoop31 ozonefs/hadoopo3fs.robot
> stop_docker_env
> generate_report
> {code}
> With this separation, the definition of the helper methods (e.g. 
> execute_robot_test or stop_docker_env) would also be simplified.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-1493) Download and Import Container replicator fails.

2019-05-06 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey reassigned HDDS-1493:
--

Assignee: Nanda kumar

> Download and Import Container replicator fails.
> ---
>
> Key: HDDS-1493
> URL: https://issues.apache.org/jira/browse/HDDS-1493
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Aravindan Vijayan
>Assignee: Nanda kumar
>Priority: Major
>
> While running batch jobs (16 threads writing a lot of 10MB+ files), the 
> following error is seen in the SCM logs.
> {code}
> ERROR  - Can't import the downloaded container data id=317
> {code}
> It is unclear from the logs why this happens. Needs more investigation to 
> find the root cause.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1492) Generated chunk size name too long.

2019-05-06 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1492:
---
Priority: Critical  (was: Major)

> Generated chunk size name too long.
> ---
>
> Key: HDDS-1492
> URL: https://issues.apache.org/jira/browse/HDDS-1492
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Aravindan Vijayan
>Priority: Critical
>
> The following exception is seen in the SCM logs intermittently. 
> {code}
> java.lang.RuntimeException: file name 
> 'chunks/2a54b2a153f4a9c5da5f44e2c6f97c60_stream_9c6ac565-e2d4-469c-bd5c-47922a35e798_chunk_10.tmp.2.23115'
>  is too long ( > 100 bytes)
> {code}
> We may have to limit the name of the chunk to 100 bytes.
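One possible way to enforce such a limit, as a sketch (the helper class is hypothetical; the real fix may instead shorten the generated name itself): keep short names as-is and replace over-long ones with a fixed-width digest that always fits in 100 bytes.

{code}
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class ChunkNameLimiter {
  private static final int MAX_BYTES = 100;

  public static String limit(String name) {
    if (name.getBytes(StandardCharsets.UTF_8).length <= MAX_BYTES) {
      return name;                       // short enough, keep it readable
    }
    try {
      byte[] digest = MessageDigest.getInstance("SHA-256")
          .digest(name.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder("chunk_");
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();             // 6 + 64 chars, well under 100 bytes
    } catch (NoSuchAlgorithmException e) {
      throw new AssertionError("SHA-256 is always available", e);
    }
  }
}
{code}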



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1492) Generated chunk size name too long.

2019-05-06 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1492:
---
Target Version/s: 0.5.0

> Generated chunk size name too long.
> ---
>
> Key: HDDS-1492
> URL: https://issues.apache.org/jira/browse/HDDS-1492
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Aravindan Vijayan
>Priority: Critical
>
> The following exception is seen in the SCM logs intermittently. 
> {code}
> java.lang.RuntimeException: file name 
> 'chunks/2a54b2a153f4a9c5da5f44e2c6f97c60_stream_9c6ac565-e2d4-469c-bd5c-47922a35e798_chunk_10.tmp.2.23115'
>  is too long ( > 100 bytes)
> {code}
> We may have to limit the name of the chunk to 100 bytes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1485) Ozone writes fail when single threaded client writes 100MB files repeatedly.

2019-05-02 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1485:
---
Priority: Blocker  (was: Major)

> Ozone writes fail when single threaded client writes 100MB files repeatedly. 
> -
>
> Key: HDDS-1485
> URL: https://issues.apache.org/jira/browse/HDDS-1485
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Aravindan Vijayan
>Priority: Blocker
>
> *Environment*
> 26 node physical cluster.
> All Datanodes are up and running.
> Client attempting to write 1600 x 100MB files using the FsStress utility 
> (https://github.com/arp7/FsPerfTest) fails with the following error. 
> {code}
> 19/05/02 09:58:49 ERROR storage.BlockOutputStream: Unexpected Storage 
> Container Exception:
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
>  ContainerID 424 does not exist
> at 
> org.apache.hadoop.hdds.scm.storage.ContainerProtocolCalls.validateContainerResponse(ContainerProtocolCalls.java:573)
> at 
> org.apache.hadoop.hdds.scm.storage.BlockOutputStream.validateResponse(BlockOutputStream.java:539)
> at 
> org.apache.hadoop.hdds.scm.storage.BlockOutputStream.lambda$writeChunkToContainer$2(BlockOutputStream.java:616)
> at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
> at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> It looks like a corruption in the container metadata. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1384) TestBlockOutputStreamWithFailures is failing

2019-04-30 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1384:
---
Fix Version/s: 0.5.0

> TestBlockOutputStreamWithFailures is failing
> 
>
> Key: HDDS-1384
> URL: https://issues.apache.org/jira/browse/HDDS-1384
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: test
>Reporter: Nanda kumar
>Assignee: Elek, Marton
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> TestBlockOutputStreamWithFailures is failing with the following error
> {noformat}
> 2019-04-04 18:52:43,240 INFO  volume.ThrottledAsyncChecker 
> (ThrottledAsyncChecker.java:schedule(140)) - Scheduling a check for 
> org.apache.hadoop.ozone.container.common.volume.HddsVolume@1f6c0e8a
> 2019-04-04 18:52:43,240 INFO  volume.HddsVolumeChecker 
> (HddsVolumeChecker.java:checkAllVolumes(203)) - Scheduled health check for 
> volume org.apache.hadoop.ozone.container.common.volume.HddsVolume@1f6c0e8a
> 2019-04-04 18:52:43,241 ERROR server.GrpcService 
> (ExitUtils.java:terminate(133)) - Terminating with exit status 1: Failed to 
> start Grpc server
> java.io.IOException: Failed to bind
>   at 
> org.apache.ratis.thirdparty.io.grpc.netty.NettyServer.start(NettyServer.java:253)
>   at 
> org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl.start(ServerImpl.java:166)
>   at 
> org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl.start(ServerImpl.java:81)
>   at org.apache.ratis.grpc.server.GrpcService.startImpl(GrpcService.java:144)
>   at org.apache.ratis.util.LifeCycle.startAndTransition(LifeCycle.java:202)
>   at 
> org.apache.ratis.server.impl.RaftServerRpcWithProxy.start(RaftServerRpcWithProxy.java:69)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$start$3(RaftServerProxy.java:300)
>   at org.apache.ratis.util.LifeCycle.startAndTransition(LifeCycle.java:202)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:298)
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:419)
>   at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:186)
>   at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.start(DatanodeStateMachine.java:169)
>   at 
> org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$startDaemon$0(DatanodeStateMachine.java:338)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.BindException: Address already in use
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:433)
>   at sun.nio.ch.Net.bind(Net.java:425)
>   at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>   at 
> org.apache.ratis.thirdparty.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:130)
>   at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:558)
>   at 
> org.apache.ratis.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1358)
>   at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:501)
>   at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:486)
>   at 
> org.apache.ratis.thirdparty.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:1019)
>   at 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:254)
>   at 
> org.apache.ratis.thirdparty.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:366)
>   at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
>   at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
>   at 
> org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:462)
>   at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
>   at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   ... 1 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1452) All chunk writes should happen to a single file for a block in datanode

2019-04-25 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825801#comment-16825801
 ] 

Jitendra Nath Pandey commented on HDDS-1452:


There are two very real problems to solve here:
 1) Ability to write smaller chunks. Each chunk is a separate file, and 
therefore really small chunks bloat the number of files. Large chunks make the 
IO bursty, and we don't get effective pipelining in the IO path. This hurts the 
performance for large file sizes when compared to HDFS. So, we need the ability 
to stream smaller chunks without creating lots of small files.
 2) For small files, we end up with small files in the datanode irrespective 
of the chunk sizes. This bloats the number of individual files. Therefore, it 
is desirable to pack multiple blocks into a single file. This leads to some 
additional considerations:
 * Even if multiple blocks share the same file, and we write small chunks, we 
need blocks to be contiguously allocated, so that we get a decent scan speed.
 * When deleting blocks, the compaction logic for a container will need to 
re-write a lot more data and metadata.

That said, these problems are not orthogonal. A solution for 1 can be a 
stepping stone for 2, given we know the direction.
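A sketch of direction (1), under the assumption of a fixed chunk size per block (the class and layout are illustrative, not the datanode implementation): every chunk of a block is written into one per-block file at offset chunkIndex * chunkSize, so small chunks can be streamed without creating one file per chunk.

{code}
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public final class SingleFileChunkWriter implements AutoCloseable {
  private final FileChannel channel;
  private final int chunkSize;

  public SingleFileChunkWriter(Path blockFile, int chunkSize) throws IOException {
    this.channel = FileChannel.open(blockFile,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    this.chunkSize = chunkSize;
  }

  // Assumes data's position starts at 0 for the chunk being written.
  public void writeChunk(long chunkIndex, ByteBuffer data) throws IOException {
    long offset = chunkIndex * (long) chunkSize;  // contiguous per-block layout
    while (data.hasRemaining()) {
      // Positional write: offset plus however much has been written so far.
      channel.write(data, offset + data.position());
    }
  }

  @Override
  public void close() throws IOException {
    channel.close();
  }
}
{code}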

 

 

> All chunk writes should happen to a single file for a block in datanode
> ---
>
> Key: HDDS-1452
> URL: https://issues.apache.org/jira/browse/HDDS-1452
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Affects Versions: 0.5.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
> Fix For: 0.5.0
>
>
> Currently, all chunks of a block are written to individual chunk files in 
> the datanode. The idea here is to write all individual chunks to a single 
> file in the datanode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1445) Add handling of NotReplicatedException in OzoneClient

2019-04-24 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1445:
---
Labels: MiniOzoneChaosCluster  (was: )

> Add handling of NotReplicatedException in OzoneClient
> -
>
> Key: HDDS-1445
> URL: https://issues.apache.org/jira/browse/HDDS-1445
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: MiniOzoneChaosCluster
>
> In MiniOzoneChaosCluster, some of the calls fail with NotReplicatedException. 
> This exception needs to be handled in the OzoneClient.
> {code}
> 2019-04-17 10:13:47,254 INFO  client.GrpcClientProtocolService 
> (GrpcClientProtocolService.java:lambda$processClientRequest$0(264)) - Failed 
> RaftClientRequest:client-43B95E0E3BE0->1ebec547-8cf8-4466-bf43-ea9f19fb546b@group-1B28E0BF6CBC,
>  cid=800, seq=0, Watch-ALL_COMMITTED(234), Message:, 
> reply=RaftClientReply:client-43B95E0E3BE0->1ebec547-8cf8-4466-bf43-ea9f19fb546b@group-1B28E0BF6CBC,
>  cid=800, FAILED org.apache.ratis.protocol.NotReplicatedException: Request 
> with call Id 800 and log index 234 is not yet replicated to ALL_COMMITTED, 
> logIndex=234, commits[1ebec547-8cf8-4466-bf43-ea9f19fb546b:c267, 
> 7b200ef5-7711-437d-a9bc-ad0e18fdf6bb:c267, 
> ffbfb65f-a622-466d-b6e8-47038cc15e0b:c226]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1301) Optimize recursive ozone filesystem apis

2019-04-23 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1301:
---
Status: Open  (was: Patch Available)

> Optimize recursive ozone filesystem apis
> 
>
> Key: HDDS-1301
> URL: https://issues.apache.org/jira/browse/HDDS-1301
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
> Attachments: HDDS-1301.001.patch
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> This Jira aims to optimise recursive apis in ozone file system. These are the 
> apis which have a recursive flag which requires an operation to be performed 
> on all the children of the directory. The Jira would add support for 
> recursive apis in Ozone manager in order to reduce the number of rpc calls to 
> Ozone Manager. Also currently these operations are not atomic. This Jira 
> would make all the operations in ozone filesystem atomic.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDDS-1448) RatisPipelineProvider should only consider open pipeline while excluding dn for pipeline allocation

2019-04-23 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824825#comment-16824825
 ] 

Jitendra Nath Pandey edited comment on HDDS-1448 at 4/24/19 5:23 AM:
-

This is related to multi-raft support, and should be designed with that in 
context.
cc [~swagle]


was (Author: jnp):
This is related to multi-raft support, and should be designed with that in 
context.

> RatisPipelineProvider should only consider open pipeline while excluding dn 
> for pipeline allocation
> ---
>
> Key: HDDS-1448
> URL: https://issues.apache.org/jira/browse/HDDS-1448
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Aravindan Vijayan
>Priority: Major
>  Labels: MiniOzoneChaosCluster
>
> While allocating pipelines, the Ratis pipeline provider considers all the 
> pipelines irrespective of their state. This can lead to a case where all the 
> datanodes are up but the pipelines are in closing state in SCM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1448) RatisPipelineProvider should only consider open pipeline while excluding dn for pipeline allocation

2019-04-23 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824825#comment-16824825
 ] 

Jitendra Nath Pandey commented on HDDS-1448:


This is related to multi-raft support, and should be designed with that in 
context.

> RatisPipelineProvider should only consider open pipeline while excluding dn 
> for pipeline allocation
> ---
>
> Key: HDDS-1448
> URL: https://issues.apache.org/jira/browse/HDDS-1448
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Aravindan Vijayan
>Priority: Major
>  Labels: MiniOzoneChaosCluster
>
> While allocating pipelines, the Ratis pipeline provider considers all the 
> pipelines irrespective of their state. This can lead to a case where all the 
> datanodes are up but the pipelines are in closing state in SCM.
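A sketch of the proposed filter (the types below are illustrative stand-ins for the SCM classes): only pipelines still in the OPEN state contribute their datanodes to the exclusion list, so nodes whose pipelines are closing or closed become eligible for new pipelines again.

{code}
import java.util.List;
import java.util.stream.Collectors;

public final class OpenPipelineFilter {
  enum PipelineState { OPEN, CLOSING, CLOSED }

  static class Pipeline {
    final PipelineState state;
    final List<String> datanodes;

    Pipeline(PipelineState state, List<String> datanodes) {
      this.state = state;
      this.datanodes = datanodes;
    }
  }

  static List<String> excludedDatanodes(List<Pipeline> pipelines) {
    return pipelines.stream()
        .filter(p -> p.state == PipelineState.OPEN)  // skip CLOSING/CLOSED
        .flatMap(p -> p.datanodes.stream())
        .distinct()
        .collect(Collectors.toList());
  }
}
{code}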



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-1458) Create a maven profile to run fault injection tests

2019-04-23 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey reassigned HDDS-1458:
--

Assignee: Eric Yang

> Create a maven profile to run fault injection tests
> ---
>
> Key: HDDS-1458
> URL: https://issues.apache.org/jira/browse/HDDS-1458
> Project: Hadoop Distributed Data Store
>  Issue Type: Test
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
>
> Some fault injection tests have been written using blockade.  It would be 
> nice to have the ability to start docker compose, exercise the blockade test 
> cases against Ozone docker containers, and generate reports.  These are 
> optional integration tests to catch race conditions and fault-tolerance 
> defects. 
> We can introduce a profile with id: it (short for integration tests).  This 
> will launch docker compose via maven-exec-plugin and run blockade to simulate 
> container failures and timeouts.
> Usage command:
> {code}
> mvn clean verify -Pit
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1447) Fix CheckStyle warnings

2019-04-23 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1447:
---
Fix Version/s: 0.5.0

> Fix CheckStyle warnings 
> 
>
> Key: HDDS-1447
> URL: https://issues.apache.org/jira/browse/HDDS-1447
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Wanqiang Ji
>Assignee: Wanqiang Ji
>Priority: Major
> Fix For: 0.5.0
>
> Attachments: HDDS-1447.001.patch
>
>
> We had a full acceptance test + unit test build for 
> [HDDS-1433|https://issues.apache.org/jira/browse/HDDS-1433]: 
> [https://ci.anzix.net/job/ozone/16677/] gave 3 warnings belonging to Ozone.
> *Modules:*
>  * [Apache Hadoop Ozone 
> Client|https://ci.anzix.net/job/ozone/16677/checkstyle/new/moduleName.1350159737/]
>  ** KeyOutputStream.java:319
>  ** KeyOutputStream.java:622
>  * [Apache Hadoop Ozone Integration 
> Tests|https://ci.anzix.net/job/ozone/16677/checkstyle/new/moduleName.-1713756601/]
>  ** ContainerTestHelper.java:731



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1449) JVM Exit in datanode while committing a key

2019-04-23 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824417#comment-16824417
 ] 

Jitendra Nath Pandey commented on HDDS-1449:


It could be related to https://github.com/facebook/rocksdb/issues/688.
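
One classic cause of a JVM exit in the RocksDB write path is a Java-side race 
with a closed database handle. A minimal, hypothetical guard (not the actual 
Ozone code) that turns such a race into a Java exception instead of a native 
crash:

{code:java}
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

public class GuardedStore {
  private final Object lock = new Object();
  private RocksDB db;   // opened elsewhere

  public void writeBatch(WriteOptions opts, WriteBatch batch)
      throws RocksDBException {
    synchronized (lock) {
      if (db == null || !db.isOwningHandle()) {
        // Fail in Java rather than crashing inside librocksdbjni.
        throw new RocksDBException("DB handle is already closed");
      }
      db.write(opts, batch);
    }
  }

  public void close() {
    synchronized (lock) {
      if (db != null) {
        db.close();
        db = null;
      }
    }
  }
}
{code}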

> JVM Exit in datanode while committing a key
> ---
>
> Key: HDDS-1449
> URL: https://issues.apache.org/jira/browse/HDDS-1449
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: MiniOzoneChaosCluster
> Attachments: 2019-04-22--20-23-56-IST.MiniOzoneChaosCluster.log, 
> hs_err_pid67466.log
>
>
> Saw the following trace in a MiniOzoneChaosCluster run.
> {code}
> C  [librocksdbjni17271331491728127.jnilib+0x9755c]  
> Java_org_rocksdb_RocksDB_write0+0x1c
> J 13917  org.rocksdb.RocksDB.write0(JJJ)V (0 bytes) @ 0x0001102ff62e 
> [0x0001102ff580+0xae]
> J 17167 C2 
> org.apache.hadoop.utils.RocksDBStore.writeBatch(Lorg/apache/hadoop/utils/BatchOperation;)V
>  (260 bytes) @ 0x000111bbd01c [0x000111bbcde0+0x23c]
> J 20434 C1 
> org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl.putBlock(Lorg/apache/hadoop/ozone/container/common/interfaces/Container;Lorg/apache/hadoop/ozone/container/common/helpers/BlockData;)J
>  (261 bytes) @ 0x000111c267ac [0x000111c25640+0x116c]
> J 19262 C2 
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandResponseProto;
>  (866 bytes) @ 0x0001125c5aa0 [0x0001125c1560+0x4540]
> J 15095 C2 
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandResponseProto;
>  (142 bytes) @ 0x000110ffc940 [0x000110ffc0c0+0x880]
> J 19301 C2 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandResponseProto;
>  (146 bytes) @ 0x000111396144 [0x000111395e60+0x2e4]
> J 15997 C2 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine$$Lambda$776.get()Ljava/lang/Object;
>  (16 bytes) @ 0x000110138e54 [0x000110138d80+0xd4]
> J 15970 C2 java.util.concurrent.CompletableFuture$AsyncSupply.run()V (61 
> bytes) @ 0x00010fc80094 [0x00010fc8+0x94]
> J 17368 C2 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
>  (225 bytes) @ 0x000110b0a7a0 [0x000110b0a5a0+0x200]
> J 7389 C1 java.util.concurrent.ThreadPoolExecutor$Worker.run()V (9 bytes) @ 
> 0x00011012a004 [0x000110129f00+0x104]
> J 6837 C1 java.lang.Thread.run()V (17 bytes) @ 0x00011002b144 
> [0x00011002b000+0x144]
> v  ~StubRoutines::call_stub
> V  [libjvm.dylib+0x2ef1f6]  JavaCalls::call_helper(JavaValue*, methodHandle*, 
> JavaCallArguments*, Thread*)+0x6ae
> V  [libjvm.dylib+0x2ef99a]  JavaCalls::call_virtual(JavaValue*, KlassHandle, 
> Symbol*, Symbol*, JavaCallArguments*, Thread*)+0x164
> V  [libjvm.dylib+0x2efb46]  JavaCalls::call_virtual(JavaValue*, Handle, 
> KlassHandle, Symbol*, Symbol*, Thread*)+0x4a
> V  [libjvm.dylib+0x34a46d]  thread_entry(JavaThread*, Thread*)+0x7c
> V  [libjvm.dylib+0x56eb0f]  JavaThread::thread_main_inner()+0x9b
> V  [libjvm.dylib+0x57020a]  JavaThread::run()+0x1c2
> V  [libjvm.dylib+0x48d4a6]  java_start(Thread*)+0xf6
> C  [libsystem_pthread.dylib+0x3305]  _pthread_body+0x7e
> C  [libsystem_pthread.dylib+0x626f]  _pthread_start+0x46
> C  [libsystem_pthread.dylib+0x2415]  thread_start+0xd
> C  0x
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1455) Inconsistent naming convention with Ozone Kerberos configuration

2019-04-23 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824414#comment-16824414
 ] 

Jitendra Nath Pandey commented on HDDS-1455:


cc [~anu]

> Inconsistent naming convention with Ozone Kerberos configuration
> 
>
> Key: HDDS-1455
> URL: https://issues.apache.org/jira/browse/HDDS-1455
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Eric Yang
>Priority: Major
>
> In SetupSecureOzone.md, the naming convention for keytab files is different 
> from the code.
> {code}
> hdds.scm.http.kerberos.keytab
> ozone.om.http.kerberos.keytab
> {code}
> In ozone-default.xml, it is looking for:
> {code}
> hdds.scm.http.kerberos.keytab
> ozone.om.http.kerberos.keytab.file
> {code}
> For the non http version of keytab, they are branded as:
> {code}
> hdds.scm.kerberos.keytab.file
> ozone.om.kerberos.keytab.file
> {code}
> It is best to shorten the names by removing the .file suffix in the code, to 
> be consistent with the Hadoop naming convention.  The second nitpick is the 
> hdds and ozone prefixes.  Is there a good reason to have distinct prefixes for 
> two components that work so closely together?  How about a hadoop.ozone 
> prefix?  From a usability point of view, the current prefixes are very 
> confusing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1454) GC or other system pause events can trigger pipeline destroy for all the nodes in the cluster

2019-04-23 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16824305#comment-16824305
 ] 

Jitendra Nath Pandey commented on HDDS-1454:


HADOOP-9618 added a JVM pause monitor into the NN. This could be useful here.
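
For reference, a minimal sketch of the pause-detection idea (illustrative only; 
the real implementation is org.apache.hadoop.util.JvmPauseMonitor from 
HADOOP-9618):

{code:java}
public class PauseMonitorSketch implements Runnable {
  private static final long SLEEP_MS = 500;
  private static final long WARN_MS = 10_000;  // a pause long enough to look like dead nodes

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      long start = System.nanoTime();
      try {
        Thread.sleep(SLEEP_MS);
      } catch (InterruptedException e) {
        return;
      }
      long pauseMs = (System.nanoTime() - start) / 1_000_000 - SLEEP_MS;
      if (pauseMs > WARN_MS) {
        // SCM could use this signal to avoid mass-staling datanodes right
        // after its own pause, instead of trusting the stale heartbeat view.
        System.err.println("Detected JVM/system pause of ~" + pauseMs + " ms");
      }
    }
  }
}
{code}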

> GC or other system pause events can trigger pipeline destroy for all the nodes 
> in the cluster
> --
>
> Key: HDDS-1454
> URL: https://issues.apache.org/jira/browse/HDDS-1454
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Priority: Major
>  Labels: MiniOzoneChaosCluster
>
> In a MiniOzoneChaosCluster run it was observed that events like GC pauses or 
> any other pauses in SCM can mark all the datanodes as stale in SCM. This will 
> trigger multiple pipeline destroys and render the system unusable.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1305) Robot test containers: hadoop client can't access o3fs

2019-04-19 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822347#comment-16822347
 ] 

Jitendra Nath Pandey commented on HDDS-1305:


[~Sandeep Nemuri], what is the 'hadoop' version being used? It could be an 
older version getting linked at runtime.

> Robot test containers: hadoop client can't access o3fs
> --
>
> Key: HDDS-1305
> URL: https://issues.apache.org/jira/browse/HDDS-1305
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Sandeep Nemuri
>Assignee: Anu Engineer
>Priority: Major
> Attachments: run.log
>
>
> Run the robot test using:
> {code:java}
> ./test.sh --keep --env ozonefs
> {code}
> login to OM container and check if we have desired volume/bucket/key got 
> created with robot tests.
> {code:java}
> [root@o3new ~]$ docker exec -it ozonefs_om_1 /bin/bash
> bash-4.2$ ozone fs -ls o3fs://bucket1.fstest/
> Found 3 items
> -rw-rw-rw-   1 hadoop hadoop  22990 2019-03-15 17:28 
> o3fs://bucket1.fstest/KEY.txt
> drwxrwxrwx   - hadoop hadoop  0 1970-01-01 00:00 
> o3fs://bucket1.fstest/testdir
> drwxrwxrwx   - hadoop hadoop  0 2019-03-15 17:27 
> o3fs://bucket1.fstest/testdir1
> {code}
> {code:java}
> [root@o3new ~]$ docker exec -it ozonefs_hadoop3_1 /bin/bash
> bash-4.4$ hadoop classpath
> /opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/*:/opt/hadoop/share/hadoop/common/*:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/*:/opt/hadoop/share/hadoop/hdfs/*:/opt/hadoop/share/hadoop/mapreduce/*:/opt/hadoop/share/hadoop/yarn:/opt/hadoop/share/hadoop/yarn/lib/*:/opt/hadoop/share/hadoop/yarn/*:/opt/ozone/share/ozone/lib/hadoop-ozone-filesystem-lib-current-0.5.0-SNAPSHOT.jar
> bash-4.4$ hadoop fs -ls o3fs://bucket1.fstest/
> 2019-03-18 19:12:42 INFO  Configuration:3204 - Removed undeclared tags:
> 2019-03-18 19:12:42 ERROR OzoneClientFactory:294 - Couldn't create protocol 
> class org.apache.hadoop.ozone.client.rpc.RpcClient exception:
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.apache.hadoop.ozone.client.OzoneClientFactory.getClientProtocol(OzoneClientFactory.java:291)
>   at 
> org.apache.hadoop.ozone.client.OzoneClientFactory.getRpcClient(OzoneClientFactory.java:169)
>   at 
> org.apache.hadoop.fs.ozone.OzoneClientAdapterImpl.(OzoneClientAdapterImpl.java:127)
>   at 
> org.apache.hadoop.fs.ozone.OzoneFileSystem.initialize(OzoneFileSystem.java:189)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3354)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3403)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3371)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:477)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
>   at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:325)
>   at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:249)
>   at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:232)
>   at 
> org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:104)
>   at org.apache.hadoop.fs.shell.Command.run(Command.java:176)
>   at org.apache.hadoop.fs.FsShell.run(FsShell.java:328)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
>   at org.apache.hadoop.fs.FsShell.main(FsShell.java:391)
> Caused by: java.lang.VerifyError: Cannot inherit from final class
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at 

[jira] [Updated] (HDDS-1376) Datanode exits while executing client command when scmId is null

2019-04-16 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1376:
---
Fix Version/s: 0.5.0

> Datanode exits while executing client command when scmId is null
> 
>
> Key: HDDS-1376
> URL: https://issues.apache.org/jira/browse/HDDS-1376
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Affects Versions: 0.4.0
>Reporter: Mukul Kumar Singh
>Assignee: Hanisha Koneru
>Priority: Major
>  Labels: MiniOzoneChaosCluster, pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Ozone Datanode exits with the following error. This happens because the DN 
> hasn't received a scmID from the SCM after registration but is already 
> processing a client command.
> {code}
> 2019-04-03 17:02:10,958 ERROR storage.RaftLogWorker 
> (ExitUtils.java:terminate(133)) - Terminating with exit status 1: 
> df6b578e-8d35-44f5-9b21-db7184dcc54e-RaftLogWorker failed.
> java.io.IOException: java.lang.NullPointerException: scmId cannot be null
> at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
> at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
> at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:83)
> at 
> org.apache.ratis.server.storage.RaftLogWorker$StateMachineDataPolicy.getFromFuture(RaftLogWorker.java:76)
> at 
> org.apache.ratis.server.storage.RaftLogWorker$WriteLog.execute(RaftLogWorker.java:354)
> at 
> org.apache.ratis.server.storage.RaftLogWorker.run(RaftLogWorker.java:219)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException: scmId cannot be null
> at 
> com.google.common.base.Preconditions.checkNotNull(Preconditions.java:204)
> at 
> org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.create(KeyValueContainer.java:110)
> at 
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handleCreateContainer(KeyValueHandler.java:243)
> at 
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handle(KeyValueHandler.java:165)
> at 
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.createContainer(HddsDispatcher.java:350)
> at 
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatchRequest(HddsDispatcher.java:224)
> at 
> org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:149)
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.dispatchCommand(ContainerStateMachine.java:347)
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.runCommand(ContainerStateMachine.java:354)
> at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.lambda$handleWriteChunk$0(ContainerStateMachine.java:385)
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run$$$capture(CompletableFuture.java:1590)
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> ... 1 more
> {code}
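
A minimal sketch of a guard (hypothetical; the error-response helper is 
illustrative) that would fail the request instead of terminating the Ratis log 
worker:

{code:java}
// Early in the dispatch path, before any container create:
if (scmId == null) {
  // The datanode has not completed SCM registration yet; reject the
  // command instead of letting Preconditions.checkNotNull throw.
  return buildErrorResponse(request, "Datanode not yet registered with SCM");
}
{code}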



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1102) Confusing error log when datanode tries to connect to a destroyed pipeline

2019-04-15 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1102:
---
Labels: newbie pushed-to-craterlake test-badlands  (was: 
pushed-to-craterlake test-badlands)

> Confusing error log when datanode tries to connect to a destroyed pipeline
> --
>
> Key: HDDS-1102
> URL: https://issues.apache.org/jira/browse/HDDS-1102
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Reporter: Nilotpal Nandi
>Assignee: Shashikant Banerjee
>Priority: Critical
>  Labels: newbie, pushed-to-craterlake, test-badlands
> Attachments: allnode.log, datanode.log
>
>
> steps taken:
> 
>  # created a 5 datanode cluster.
>  # shut down 2 datanodes.
>  # started the datanodes again.
> One of the datanodes was shut down.
> exception seen:
>  
> {noformat}
> 2019-02-14 07:37:26 INFO LeaderElection:230 - 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8 got exception when requesting votes: {}
> java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 
> a3d1dd2d-554e-4e87-a2cf-076a229af352: group-FD6FA533F1FB not found.
>  at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>  at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>  at 
> org.apache.ratis.server.impl.LeaderElection.waitForResults(LeaderElection.java:214)
>  at 
> org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:146)
>  at org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:102)
> Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> INTERNAL: a3d1dd2d-554e-4e87-a2cf-076a229af352: group-FD6FA533F1FB not found.
>  at 
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:233)
>  at 
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:214)
>  at 
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:139)
>  at 
> org.apache.ratis.proto.grpc.RaftServerProtocolServiceGrpc$RaftServerProtocolServiceBlockingStub.requestVote(RaftServerProtocolServiceGrpc.java:265)
>  at 
> org.apache.ratis.grpc.server.GrpcServerProtocolClient.requestVote(GrpcServerProtocolClient.java:83)
>  at org.apache.ratis.grpc.server.GrpcService.requestVote(GrpcService.java:187)
>  at 
> org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:188)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> 2019-02-14 07:37:26 INFO LeaderElection:46 - 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8: Election PASSED; received 1 response(s) 
> [6a0522ba-019e-4b77-ac1f-a9322cd525b8<-61ad3bf3-e9b1-48e5-90e3-3b78c8b5bba5#0:OK-t7]
>  and 1 exception(s); 6a0522ba-019e-4b77-ac1f-a9322cd525b8:t7, leader=null, 
> voted=6a0522ba-019e-4b77-ac1f-a9322cd525b8, 
> raftlog=6a0522ba-019e-4b77-ac1f-a9322cd525b8-SegmentedRaftLog:OPENED, conf=3: 
> [61ad3bf3-e9b1-48e5-90e3-3b78c8b5bba5:172.20.0.8:9858, 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8:172.20.0.6:9858, 
> 0f377918-aafa-4d8a-972a-6ead54048fba:172.20.0.3:9858], old=null
> 2019-02-14 07:37:26 INFO LeaderElection:52 - 0: 
> java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 
> a3d1dd2d-554e-4e87-a2cf-076a229af352: group-FD6FA533F1FB not found.
> 2019-02-14 07:37:26 INFO RoleInfo:130 - 6a0522ba-019e-4b77-ac1f-a9322cd525b8: 
> shutdown LeaderElection
> 2019-02-14 07:37:26 INFO RaftServerImpl:161 - 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8 changes role from CANDIDATE to LEADER at 
> term 7 for changeToLeader
> 2019-02-14 07:37:26 INFO RaftServerImpl:258 - 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8: change Leader from null to 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8 at term 7 for becomeLeader, leader 
> elected after 1066ms
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - 
> raft.server.staging.catchup.gap = 1000 (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - raft.server.rpc.sleep.time 
> = 25ms (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - raft.server.watch.timeout 
> = 10s (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - 
> raft.server.watch.timeout.denomination = 1s (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - 
> 

[jira] [Updated] (HDDS-1439) ozone spark job failing with class not found error for hadoop 2

2019-04-15 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1439:
---
Target Version/s: 0.5.0  (was: 0.4.0)

> ozone spark job failing with class not found error for hadoop 2
> ---
>
> Key: HDDS-1439
> URL: https://issues.apache.org/jira/browse/HDDS-1439
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Affects Versions: 0.4.0
>Reporter: Ajay Kumar
>Assignee: Ajay Kumar
>Priority: Major
>
> spark job fails to run.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1439) ozone spark job failing with class not found error for hadoop 2

2019-04-15 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818203#comment-16818203
 ] 

Jitendra Nath Pandey commented on HDDS-1439:


I think this should not be a release blocker. IMO, Ozone working on Hadoop 2 is 
nice to have, but not a hard requirement. However, I agree this should be fixed.

> ozone spark job failing with class not found error for hadoop 2
> ---
>
> Key: HDDS-1439
> URL: https://issues.apache.org/jira/browse/HDDS-1439
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Affects Versions: 0.4.0
>Reporter: Ajay Kumar
>Assignee: Ajay Kumar
>Priority: Major
>
> spark job fails to run.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1294) ExcludeList should be an RPC Client config so that multiple streams can avoid the same error.

2019-04-10 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814756#comment-16814756
 ] 

Jitendra Nath Pandey commented on HDDS-1294:


# 
{quote}DistributedStorageHandler is a long lived object. The exclude list from 
this will never get cleaned up.
{quote}
The storage handler is closed only when {{OzoneHddsDatanodeService}} is 
stopped. The problem, therefore, is that over time all datanodes in the cluster 
may get added to the exclude list, after which this datanode will not be able 
to serve any request. The exclusion should be scoped to a single session or 
should decay out.

 # It is odd that the get methods are synchronized on the list objects, while 
the add methods are synchronized on the class instance. A get can therefore 
execute while a list is being modified, causing a 
ConcurrentModificationException. It is fine to synchronize the get methods on 
the class instance as well.
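
A minimal sketch combining both suggestions (a hypothetical shape, not the 
actual ExcludeList class): entries decay out via a time-bounded Guava cache, 
and reads and writes synchronize on the same monitor:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class DecayingExcludeList {
  // The cache acts as a time-expiring set; the values are unused.
  private final Cache<String, Boolean> excludedDatanodes =
      CacheBuilder.newBuilder()
          .expireAfterWrite(10, TimeUnit.MINUTES)   // exclusions decay out
          .build();

  public synchronized void addDatanode(String dnUuid) {
    excludedDatanodes.put(dnUuid, Boolean.TRUE);
  }

  // Same monitor as the add path, so a reader can never observe the
  // underlying collection mid-modification.
  public synchronized List<String> getDatanodes() {
    return new ArrayList<>(excludedDatanodes.asMap().keySet());
  }
}
{code}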

> ExcludeList should be an RPC Client config so that multiple streams can avoid 
> the same error.
> ---
>
> Key: HDDS-1294
> URL: https://issues.apache.org/jira/browse/HDDS-1294
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Client
>Affects Versions: 0.4.0
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: MiniOzoneChaosCluster
> Attachments: HDDS-1294.000.patch, HDDS-1294.001.patch, 
> HDDS-1294.002.patch
>
>
> ExcludeList right now is a per-BlockOutputStream value; this can result in 
> multiple keys created from the same client running into the same exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDDS-1294) ExcludeList should be an RPC Client config so that multiple streams can avoid the same error.

2019-04-10 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814756#comment-16814756
 ] 

Jitendra Nath Pandey edited comment on HDDS-1294 at 4/10/19 6:31 PM:
-

# 
{quote}DistributedStorageHandler is a long lived object. The exclude list from 
this will never get cleaned up.
{quote}
The storage handler is closed only when {{OzoneHddsDatanodeService}} is 
stopped. The problem, therefore, is that over time all datanodes in the cluster 
may get added to the exclude list, after which this datanode will not be able 
to serve any request. The exclusion should be scoped to a single session or 
should decay out.
 # It is odd that the get methods are synchronized on the list objects, while 
the add methods are synchronized on the class instance. A get can therefore 
execute while a list is being modified, causing a 
ConcurrentModificationException. It is fine to synchronize the get methods on 
the class instance as well.


was (Author: jnp):
# 
{quote}DistributedStorageHandler is a long lived object. The exclude list from 
this will never get cleaned up.
{quote}
The storage handler is closed only when {{OzoneHddsDatanodeService}} is 
stopped. The problem, therefore, is that over time all datanodes in the cluster 
may get added to the exclude list, after which this datanode will not be able 
to serve any request. The exclusion should be scoped to a single session or 
should decay out.

 # It is odd that the get methods are synchronized on the list objects, while 
the add methods are synchronized on the class instance. A get can therefore 
execute while a list is being modified, causing a 
ConcurrentModificationException. It is fine to synchronize the get methods on 
the class instance as well.

> ExcludeList should be an RPC Client config so that multiple streams can avoid 
> the same error.
> ---
>
> Key: HDDS-1294
> URL: https://issues.apache.org/jira/browse/HDDS-1294
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Client
>Affects Versions: 0.4.0
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: MiniOzoneChaosCluster
> Attachments: HDDS-1294.000.patch, HDDS-1294.001.patch, 
> HDDS-1294.002.patch
>
>
> ExcludeList right now is a per-BlockOutputStream value; this can result in 
> multiple keys created from the same client running into the same exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1348) Refactor BlockOutputStream Class

2019-04-10 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814725#comment-16814725
 ] 

Jitendra Nath Pandey commented on HDDS-1348:


{quote}Already exposed for testing by 
BlockOutputStream#getCommitWatcher#getCommitIndex2flushedDataMap()
{quote}
Let's annotate {{CommitWatcher#getCommitIndex2flushedDataMap}} as well with 
{{VisibleForTesting}}. This can be done while committing.

+1 for the patch, if the test failures are not related.

> Refactor BlockOutputStream Class
> ---
>
> Key: HDDS-1348
> URL: https://issues.apache.org/jira/browse/HDDS-1348
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Client
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
> Fix For: 0.5.0
>
> Attachments: HDDS-1348.000.patch, HDDS-1348.001.patch
>
>
> BlockOutputStream contains functionality for handling write, flush and 
> close, as well as tracking commitIndexes. The idea is to move all 
> commitIndex tracking and management code out of the BlockOutputStream class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1395) Key write fails with "BlockOutputStream has been closed"

2019-04-09 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813915#comment-16813915
 ] 

Jitendra Nath Pandey commented on HDDS-1395:


#  {code:java}
} catch (Exception e) {
  Throwable t = HddsClientUtils.checkForException(e);
  LOG.warn("3 way commit failed ", e);
  if (t instanceof GroupMismatchException) {
    throw e;
  }
{code} Why don't we try to catch GroupMismatchException directly (see the 
sketch after this list)? Also, I see that other exceptions are just ignored.
# It seems {{currentStreamIndex}} points to the latest entry in 
{{streamEntries}}. If true, why is it tracked as a separate variable? In 
general these seem to be highly correlated; can we encapsulate them?
# Why are we incrementing {{currentStreamIndex}} if there is no data in the 
buffer?
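
On point 1, a minimal sketch of what catching the exception directly could look 
like ({{watchForCommit}} is a hypothetical stand-in for the 3-way-commit call; 
note that if the exception arrives wrapped, e.g. inside an ExecutionException, 
an unwrap step like {{HddsClientUtils.checkForException}} is still needed 
first):

{code:java}
try {
  watchForCommit(index);   // hypothetical call that can see a group mismatch
} catch (GroupMismatchException gme) {
  // The pipeline no longer exists on the server; propagate so the caller
  // can exclude this pipeline and retry the write on another block.
  throw gme;
} catch (Exception e) {
  // Deliberately tolerated here -- worth an explicit comment in the real
  // code, since silently ignored exceptions are easy to misread.
  LOG.warn("3 way commit failed", e);
}
{code}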

> Key write fails with "BlockOutputStream has been closed"
> 
>
> Key: HDDS-1395
> URL: https://issues.apache.org/jira/browse/HDDS-1395
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Client
>Affects Versions: 0.4.0
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: MiniOzoneChaosCluster
> Attachments: HDDS-1395.000.patch
>
>
> Key write fails with BlockOutputStream has been closed
> {code}
> 2019-04-05 11:24:47,770 ERROR ozone.MiniOzoneLoadGenerator 
> (MiniOzoneLoadGenerator.java:load(102)) - LOADGEN: Create 
> key:pool-431-thread-9-2092651262 failed with exception, but skipping
> java.io.IOException: BlockOutputStream has been closed.
> at 
> org.apache.hadoop.hdds.scm.storage.BlockOutputStream.checkOpen(BlockOutputStream.java:662)
> at 
> org.apache.hadoop.hdds.scm.storage.BlockOutputStream.write(BlockOutputStream.java:245)
> at 
> org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.write(BlockOutputStreamEntry.java:131)
> at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:325)
> at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.write(KeyOutputStream.java:287)
> at 
> org.apache.hadoop.ozone.client.io.OzoneOutputStream.write(OzoneOutputStream.java:49)
> at java.io.OutputStream.write(OutputStream.java:75)
> at 
> org.apache.hadoop.ozone.MiniOzoneLoadGenerator.load(MiniOzoneLoadGenerator.java:100)
> at 
> org.apache.hadoop.ozone.MiniOzoneLoadGenerator.lambda$startIO$0(MiniOzoneLoadGenerator.java:143)
> at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDDS-1395) Key write fails with "BlockOutputStream has been closed"

2019-04-09 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813612#comment-16813612
 ] 

Jitendra Nath Pandey edited comment on HDDS-1395 at 4/9/19 4:55 PM:


Thanks for the patch [~shashikant]. It would be good to write a few sentences 
in the Jira comments about the root cause and how the patch addresses it, so 
that reviewers have some context around the issue before looking at the 
patch.


was (Author: jnp):
Thanks for the patch [~shashikant]. It would be good to write a few sentences 
about the root cause and how the patch addresses it, so that reviewers have 
some context around the issue before looking at the patch.

> Key write fails with "BlockOutputStream has been closed"
> 
>
> Key: HDDS-1395
> URL: https://issues.apache.org/jira/browse/HDDS-1395
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Client
>Affects Versions: 0.4.0
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: MiniOzoneChaosCluster
> Attachments: HDDS-1395.000.patch
>
>
> Key write fails with BlockOutputStream has been closed
> {code}
> 2019-04-05 11:24:47,770 ERROR ozone.MiniOzoneLoadGenerator 
> (MiniOzoneLoadGenerator.java:load(102)) - LOADGEN: Create 
> key:pool-431-thread-9-2092651262 failed with exception, but skipping
> java.io.IOException: BlockOutputStream has been closed.
> at 
> org.apache.hadoop.hdds.scm.storage.BlockOutputStream.checkOpen(BlockOutputStream.java:662)
> at 
> org.apache.hadoop.hdds.scm.storage.BlockOutputStream.write(BlockOutputStream.java:245)
> at 
> org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.write(BlockOutputStreamEntry.java:131)
> at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:325)
> at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.write(KeyOutputStream.java:287)
> at 
> org.apache.hadoop.ozone.client.io.OzoneOutputStream.write(OzoneOutputStream.java:49)
> at java.io.OutputStream.write(OutputStream.java:75)
> at 
> org.apache.hadoop.ozone.MiniOzoneLoadGenerator.load(MiniOzoneLoadGenerator.java:100)
> at 
> org.apache.hadoop.ozone.MiniOzoneLoadGenerator.lambda$startIO$0(MiniOzoneLoadGenerator.java:143)
> at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1395) Key write fails with "BlockOutputStream has been closed"

2019-04-09 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813612#comment-16813612
 ] 

Jitendra Nath Pandey commented on HDDS-1395:


Thanks for the patch [~shashikant]. It would be good to write a few sentences 
about the root cause and how the patch addresses it, so that reviewers have 
some context around the issue before looking at the patch.

> Key write fails with "BlockOutputStream has been closed"
> 
>
> Key: HDDS-1395
> URL: https://issues.apache.org/jira/browse/HDDS-1395
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Client
>Affects Versions: 0.4.0
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: MiniOzoneChaosCluster
> Attachments: HDDS-1395.000.patch
>
>
> Key write fails with BlockOutputStream has been closed
> {code}
> 2019-04-05 11:24:47,770 ERROR ozone.MiniOzoneLoadGenerator 
> (MiniOzoneLoadGenerator.java:load(102)) - LOADGEN: Create 
> key:pool-431-thread-9-2092651262 failed with exception, but skipping
> java.io.IOException: BlockOutputStream has been closed.
> at 
> org.apache.hadoop.hdds.scm.storage.BlockOutputStream.checkOpen(BlockOutputStream.java:662)
> at 
> org.apache.hadoop.hdds.scm.storage.BlockOutputStream.write(BlockOutputStream.java:245)
> at 
> org.apache.hadoop.ozone.client.io.BlockOutputStreamEntry.write(BlockOutputStreamEntry.java:131)
> at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.handleWrite(KeyOutputStream.java:325)
> at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.write(KeyOutputStream.java:287)
> at 
> org.apache.hadoop.ozone.client.io.OzoneOutputStream.write(OzoneOutputStream.java:49)
> at java.io.OutputStream.write(OutputStream.java:75)
> at 
> org.apache.hadoop.ozone.MiniOzoneLoadGenerator.load(MiniOzoneLoadGenerator.java:100)
> at 
> org.apache.hadoop.ozone.MiniOzoneLoadGenerator.lambda$startIO$0(MiniOzoneLoadGenerator.java:143)
> at 
> java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1294) ExcludeList should be an RPC Client config so that multiple streams can avoid the same error.

2019-04-09 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813081#comment-16813081
 ] 

Jitendra Nath Pandey commented on HDDS-1294:


# DistributedStorageHandler is a long-lived object. The exclude list from this 
will never get cleaned up.
# {code:java}
public List getPipelineIds() {
  return Collections.synchronizedList(pipelineIds);
}{code}
What is the need for returning a synchronizedList? I think the problem is that 
the same class is used at the client to update and at the SCM to iterate. 
# In {{TestCloseContainerHandlingByClient}}, since we are trying to detect 
issues with potential race conditions, you might want to create 10 threads in a 
loop to increase the concurrency.
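
A minimal sketch of the suggested test shape ({{writeKey}} is a hypothetical 
stand-in for the key-write path that exercises the shared exclude list):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentWriteSketch {
  static void writeKey(String key) {
    // hypothetical: create the key, write data, close the stream
  }

  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(10);
    List<Future<?>> futures = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      final int id = i;
      futures.add(pool.submit(() -> writeKey("key-" + id)));
    }
    for (Future<?> f : futures) {
      f.get();   // a ConcurrentModificationException would surface here
    }
    pool.shutdown();
  }
}
{code}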



> ExcludeList should be an RPC Client config so that multiple streams can avoid 
> the same error.
> ---
>
> Key: HDDS-1294
> URL: https://issues.apache.org/jira/browse/HDDS-1294
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Client
>Affects Versions: 0.4.0
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Major
>  Labels: MiniOzoneChaosCluster
> Attachments: HDDS-1294.000.patch, HDDS-1294.001.patch
>
>
> ExcludeList right now is a per-BlockOutputStream value; this can result in 
> multiple keys created from the same client running into the same exception.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1348) Refactor BlockOutputStream Class

2019-04-09 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813042#comment-16813042
 ] 

Jitendra Nath Pandey commented on HDDS-1348:


A few comments:
 # Why don't we always call adjustBuffers on 
{{xceiverClient.getReplicatedMinCommitIndex()}}? Isn't the 
replicated-min-commit-index the source of truth in all cases?
 # I think {{totalAckDataLength}} should be updated as soon as we get an ack, 
instead of in adjustBuffer. We are coupling the buffer management with the 
tracking of acks.
 # {{watchOnCommitIndex(boolean first)}}: It would be better to have two 
different functions, one for the buffer-full case and another for flush. A 
boolean flag can be confusing: for example, when we call flush and the buffer 
is full as well, the code becomes tricky to understand.
 # Please add unit tests for the CommitWatcher class alone. The tests should 
mock xceiverClient, and verify that buffers are appropriately released, for 
exception as well as non-exception cases (see the sketch below).
 # Let's make {{CommitWatcher#getCommitIndex2flushedDataMap}} visible for 
testing.
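
On point 4, a minimal sketch of the test shape (Mockito; the CommitWatcher 
constructor and method names here are hypothetical stand-ins):

{code:java}
XceiverClientSpi client = Mockito.mock(XceiverClientSpi.class);
Mockito.when(client.getReplicatedMinCommitIndex()).thenReturn(100L);

CommitWatcher watcher = new CommitWatcher(bufferPool, client); // hypothetical ctor
watcher.updateCommitInfoMap(100L, flushedBuffers);             // hypothetical
watcher.adjustBuffers(client.getReplicatedMinCommitIndex());

// Buffers up to the replicated index must have been released.
Assert.assertTrue(watcher.getCommitIndex2flushedDataMap().isEmpty());
{code}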

> Refactor BlockOutputStream Class
> ---
>
> Key: HDDS-1348
> URL: https://issues.apache.org/jira/browse/HDDS-1348
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Client
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
> Fix For: 0.5.0
>
> Attachments: HDDS-1348.000.patch
>
>
> BlockOutputStream contains functionality for handling write, flush and 
> close, as well as tracking commitIndexes. The idea is to move all 
> commitIndex tracking and management code out of the BlockOutputStream class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1401) Key Read fails with Unable to find the block, after reducing the size of container cache

2019-04-07 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1401:
---
Priority: Blocker  (was: Major)

> Key Read fails with Unable to find the block, after reducing the size of 
> container cache
> 
>
> Key: HDDS-1401
> URL: https://issues.apache.org/jira/browse/HDDS-1401
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Affects Versions: 0.3.0
>Reporter: Mukul Kumar Singh
>Assignee: Shashikant Banerjee
>Priority: Blocker
>
> Key Read fails with Unable to find the block (NO_SUCH_BLOCK) after reducing 
> the value of OZONE_CONTAINER_CACHE_SIZE.
> The reads are retried on the other datanodes, but they fail on all 3 
> datanodes.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1337) Handle GroupMismatchException in OzoneClient

2019-03-29 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1337:
---
Priority: Blocker  (was: Major)

> Handle GroupMismatchException in OzoneClient
> ---
>
> Key: HDDS-1337
> URL: https://issues.apache.org/jira/browse/HDDS-1337
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Client
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Blocker
>  Labels: Blocker
> Fix For: 0.4.0
>
> Attachments: HDDS-1337.000.patch, HDDS-1337.001.patch
>
>
> If a pipeline gets destroyed, the ozone client may hit a 
> GroupMismatchException from Ratis. In such cases, the client should exclude 
> the pipeline and retry the write on a different block.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1067) freon run on client gets hung when two of the datanodes are down in 3 datanode cluster

2019-03-28 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1067:
---
Target Version/s: 0.5.0

> freon run on client gets hung when two of the datanodes are down in 3 
> datanode cluster
> --
>
> Key: HDDS-1067
> URL: https://issues.apache.org/jira/browse/HDDS-1067
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Client
>Reporter: Nilotpal Nandi
>Assignee: Shashikant Banerjee
>Priority: Major
> Attachments: stack_file.txt
>
>
> steps taken :
> 
>  # created a 3 node docker cluster.
>  # wrote a key.
>  # created a partition such that 2 out of 3 datanodes cannot communicate with 
> any other node.
>  # The third datanode can communicate with scm, om and the client.
>  # ran freon to write a key.
> Observation :
> -
> The freon run is hung. There is no timeout.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-1102) Confusing error log when datanode tries to connect to a destroyed pipeline

2019-03-25 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey reassigned HDDS-1102:
--

Assignee: Shashikant Banerjee

> Confusing error log when datanode tries to connect to a destroyed pipeline
> --
>
> Key: HDDS-1102
> URL: https://issues.apache.org/jira/browse/HDDS-1102
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Reporter: Nilotpal Nandi
>Assignee: Shashikant Banerjee
>Priority: Critical
>  Labels: pushed-to-craterlake, test-badlands
> Attachments: allnode.log, datanode.log
>
>
> steps taken:
> 
>  # created a 5 datanode cluster.
>  # shut down 2 datanodes.
>  # started the datanodes again.
> One of the datanodes was shut down.
> exception seen:
>  
> {noformat}
> 2019-02-14 07:37:26 INFO LeaderElection:230 - 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8 got exception when requesting votes: {}
> java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 
> a3d1dd2d-554e-4e87-a2cf-076a229af352: group-FD6FA533F1FB not found.
>  at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>  at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>  at 
> org.apache.ratis.server.impl.LeaderElection.waitForResults(LeaderElection.java:214)
>  at 
> org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:146)
>  at org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:102)
> Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> INTERNAL: a3d1dd2d-554e-4e87-a2cf-076a229af352: group-FD6FA533F1FB not found.
>  at 
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:233)
>  at 
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:214)
>  at 
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:139)
>  at 
> org.apache.ratis.proto.grpc.RaftServerProtocolServiceGrpc$RaftServerProtocolServiceBlockingStub.requestVote(RaftServerProtocolServiceGrpc.java:265)
>  at 
> org.apache.ratis.grpc.server.GrpcServerProtocolClient.requestVote(GrpcServerProtocolClient.java:83)
>  at org.apache.ratis.grpc.server.GrpcService.requestVote(GrpcService.java:187)
>  at 
> org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:188)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> 2019-02-14 07:37:26 INFO LeaderElection:46 - 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8: Election PASSED; received 1 response(s) 
> [6a0522ba-019e-4b77-ac1f-a9322cd525b8<-61ad3bf3-e9b1-48e5-90e3-3b78c8b5bba5#0:OK-t7]
>  and 1 exception(s); 6a0522ba-019e-4b77-ac1f-a9322cd525b8:t7, leader=null, 
> voted=6a0522ba-019e-4b77-ac1f-a9322cd525b8, 
> raftlog=6a0522ba-019e-4b77-ac1f-a9322cd525b8-SegmentedRaftLog:OPENED, conf=3: 
> [61ad3bf3-e9b1-48e5-90e3-3b78c8b5bba5:172.20.0.8:9858, 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8:172.20.0.6:9858, 
> 0f377918-aafa-4d8a-972a-6ead54048fba:172.20.0.3:9858], old=null
> 2019-02-14 07:37:26 INFO LeaderElection:52 - 0: 
> java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 
> a3d1dd2d-554e-4e87-a2cf-076a229af352: group-FD6FA533F1FB not found.
> 2019-02-14 07:37:26 INFO RoleInfo:130 - 6a0522ba-019e-4b77-ac1f-a9322cd525b8: 
> shutdown LeaderElection
> 2019-02-14 07:37:26 INFO RaftServerImpl:161 - 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8 changes role from CANDIDATE to LEADER at 
> term 7 for changeToLeader
> 2019-02-14 07:37:26 INFO RaftServerImpl:258 - 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8: change Leader from null to 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8 at term 7 for becomeLeader, leader 
> elected after 1066ms
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - 
> raft.server.staging.catchup.gap = 1000 (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - raft.server.rpc.sleep.time 
> = 25ms (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - raft.server.watch.timeout 
> = 10s (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - 
> raft.server.watch.timeout.denomination = 1s (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - 
> raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default)
> 2019-02-14 07:37:26 INFO 

[jira] [Updated] (HDDS-1102) Confusing error log when datanode tries to connect to a destroyed pipeline

2019-03-25 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1102:
---
Priority: Critical  (was: Major)

> Confusing error log when datanode tries to connect to a destroyed pipeline
> --
>
> Key: HDDS-1102
> URL: https://issues.apache.org/jira/browse/HDDS-1102
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Reporter: Nilotpal Nandi
>Priority: Critical
>  Labels: pushed-to-craterlake, test-badlands
> Attachments: allnode.log, datanode.log
>
>
> steps taken:
> 
>  # created a 5 datanode cluster.
>  # shut down 2 datanodes.
>  # started the datanodes again.
> One of the datanodes was shut down.
> exception seen:
>  
> {noformat}
> 2019-02-14 07:37:26 INFO LeaderElection:230 - 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8 got exception when requesting votes: {}
> java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 
> a3d1dd2d-554e-4e87-a2cf-076a229af352: group-FD6FA533F1FB not found.
>  at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>  at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>  at 
> org.apache.ratis.server.impl.LeaderElection.waitForResults(LeaderElection.java:214)
>  at 
> org.apache.ratis.server.impl.LeaderElection.askForVotes(LeaderElection.java:146)
>  at org.apache.ratis.server.impl.LeaderElection.run(LeaderElection.java:102)
> Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> INTERNAL: a3d1dd2d-554e-4e87-a2cf-076a229af352: group-FD6FA533F1FB not found.
>  at 
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:233)
>  at 
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:214)
>  at 
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:139)
>  at 
> org.apache.ratis.proto.grpc.RaftServerProtocolServiceGrpc$RaftServerProtocolServiceBlockingStub.requestVote(RaftServerProtocolServiceGrpc.java:265)
>  at 
> org.apache.ratis.grpc.server.GrpcServerProtocolClient.requestVote(GrpcServerProtocolClient.java:83)
>  at org.apache.ratis.grpc.server.GrpcService.requestVote(GrpcService.java:187)
>  at 
> org.apache.ratis.server.impl.LeaderElection.lambda$submitRequests$0(LeaderElection.java:188)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> 2019-02-14 07:37:26 INFO LeaderElection:46 - 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8: Election PASSED; received 1 response(s) 
> [6a0522ba-019e-4b77-ac1f-a9322cd525b8<-61ad3bf3-e9b1-48e5-90e3-3b78c8b5bba5#0:OK-t7]
>  and 1 exception(s); 6a0522ba-019e-4b77-ac1f-a9322cd525b8:t7, leader=null, 
> voted=6a0522ba-019e-4b77-ac1f-a9322cd525b8, 
> raftlog=6a0522ba-019e-4b77-ac1f-a9322cd525b8-SegmentedRaftLog:OPENED, conf=3: 
> [61ad3bf3-e9b1-48e5-90e3-3b78c8b5bba5:172.20.0.8:9858, 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8:172.20.0.6:9858, 
> 0f377918-aafa-4d8a-972a-6ead54048fba:172.20.0.3:9858], old=null
> 2019-02-14 07:37:26 INFO LeaderElection:52 - 0: 
> java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 
> a3d1dd2d-554e-4e87-a2cf-076a229af352: group-FD6FA533F1FB not found.
> 2019-02-14 07:37:26 INFO RoleInfo:130 - 6a0522ba-019e-4b77-ac1f-a9322cd525b8: 
> shutdown LeaderElection
> 2019-02-14 07:37:26 INFO RaftServerImpl:161 - 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8 changes role from CANDIDATE to LEADER at 
> term 7 for changeToLeader
> 2019-02-14 07:37:26 INFO RaftServerImpl:258 - 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8: change Leader from null to 
> 6a0522ba-019e-4b77-ac1f-a9322cd525b8 at term 7 for becomeLeader, leader 
> elected after 1066ms
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - 
> raft.server.staging.catchup.gap = 1000 (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - raft.server.rpc.sleep.time 
> = 25ms (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - raft.server.watch.timeout 
> = 10s (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - 
> raft.server.watch.timeout.denomination = 1s (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - 
> raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default)
> 2019-02-14 07:37:26 INFO RaftServerConfigKeys:43 - 
> 

[jira] [Updated] (HDDS-1304) Ozone HA breaks service discovery

2019-03-25 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1304:
---
Target Version/s: 0.4.0

> Ozone HA breaks service discovery
> -
>
> Key: HDDS-1304
> URL: https://issues.apache.org/jira/browse/HDDS-1304
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.4.0
>Reporter: Ajay Kumar
>Assignee: Nanda kumar
>Priority: Blocker
>
> Ozone HA breaks service discovery



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1112) Add OzoneFileSystem-related APIs to OzoneManager to reduce redundant lookups

2019-03-24 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1112:
---
Target Version/s: 0.5.0  (was: 0.3.0)

> Add OzoneFileSystem-related APIs to OzoneManager to reduce redundant 
> lookups
> ---
>
> Key: HDDS-1112
> URL: https://issues.apache.org/jira/browse/HDDS-1112
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Filesystem
>Affects Versions: 0.4.0
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Critical
> Fix For: 0.4.0
>
>
> With the current OzoneFileSystem design, most lookups during create go 
> through the getFileStatus API, which in turn does a getKey or a listKey 
> for the keys in the Ozone bucket. 
> In most cases, the file does not exist before creation, so these lookups 
> are wasted work. This jira proposes to optimize the "create" and 
> "getFileStatus" APIs in OzoneFileSystem by introducing 
> OzoneFileSystem-friendly APIs in OM.
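> For illustration, a self-contained sketch of the pattern described above (a 
> Map stands in for the OM key table; all names are hypothetical, not the 
> actual OzoneFileSystem or OM APIs):
> {code:java}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> 
> // Toy model of the create path; the map stands in for the OM key table.
> class CreatePathSketch {
>   private final Map<String, byte[]> omKeyTable = new ConcurrentHashMap<>();
> 
>   // Today's pattern: a getFileStatus-style existence lookup, then the
>   // create. Two OM round trips, and the lookup almost always misses.
>   boolean createWithLookup(String key, byte[] data) {
>     if (omKeyTable.containsKey(key)) { // getKey/listKey round trip
>       return false;
>     }
>     omKeyTable.put(key, data);         // create round trip
>     return true;
>   }
> 
>   // Proposed shape: one atomic OM-side create-if-absent call.
>   boolean createAtomic(String key, byte[] data) {
>     return omKeyTable.putIfAbsent(key, data) == null;
>   }
> }
> {code}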



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1112) Add OzoneFileSystem-related APIs to OzoneManager to reduce redundant lookups

2019-03-24 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1112:
---
Fix Version/s: (was: 0.4.0)

> Add OzoneFileSystem-related APIs to OzoneManager to reduce redundant 
> lookups
> ---
>
> Key: HDDS-1112
> URL: https://issues.apache.org/jira/browse/HDDS-1112
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Filesystem
>Affects Versions: 0.4.0
>Reporter: Mukul Kumar Singh
>Assignee: Mukul Kumar Singh
>Priority: Critical
>
> With the current OzoneFileSystem design, most lookups during create go 
> through the getFileStatus API, which in turn does a getKey or a listKey 
> for the keys in the Ozone bucket. 
> In most cases, the file does not exist before creation, so these lookups 
> are wasted work. This jira proposes to optimize the "create" and 
> "getFileStatus" APIs in OzoneFileSystem by introducing 
> OzoneFileSystem-friendly APIs in OM.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1312) Add more unit tests to verify BlockOutputStream functionalities

2019-03-21 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1312:
---
Target Version/s: 0.4.0  (was: 0.5.0)

> Add more unit tests to verify BlockOutputStream functionalities
> ---
>
> Key: HDDS-1312
> URL: https://issues.apache.org/jira/browse/HDDS-1312
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Affects Versions: 0.4.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Blocker
> Attachments: HDDS-1312.000.patch
>
>
> This jira aims to add more unit test coverage for BlockOutputStream 
> functionalities.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDDS-1312) Add more unit tests to verify BlockOutputStream functionalities

2019-03-21 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797871#comment-16797871
 ] 

Jitendra Nath Pandey edited comment on HDDS-1312 at 3/21/19 6:02 PM:
-

I have yet to complete the review, but here are some comments.
 # 
{code:java}
ioException = new IOException(
"Unexpected Storage Container Exception: " + e.toString(), e);
adjustBuffersOnException();
throw ioException;{code}
What is the reason for this change?

 # 
{code:java}
if (currentBufferIndex >= 0) {
  currentBufferIndex--;
}
{code}
This doesn't look right, because it will hide a bug elsewhere. If releaseBuffer 
is called, a buffer must have been allocated somewhere. We should instead add a 
precondition check (see the sketch after this list).

 # 
{code:java}
public Throwable checkForException
public ConcurrentHashMap getCommitInfoMap(){code}
These should be annotated with {{VisibleForTesting}}

 # In {{testBufferCaching}}, some comments don't look right; e.g., after the 
flush call, data is flushed but may not be acked.
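For items 2 and 3, a minimal sketch of what I mean (class and field names are 
illustrative, not the actual BlockOutputStream code):
{code:java}
import com.google.common.annotations.VisibleForTesting;
import com.google.common.base.Preconditions;

import java.util.concurrent.ConcurrentHashMap;

class BufferPoolSketch {
  private int currentBufferIndex = -1;
  private final ConcurrentHashMap<Long, Long> commitInfoMap =
      new ConcurrentHashMap<>();

  void releaseBuffer() {
    // Fail fast instead of silently skipping the decrement: if no buffer
    // was ever allocated, the caller has a bug that should surface here.
    Preconditions.checkState(currentBufferIndex >= 0,
        "releaseBuffer() called but no buffer was allocated");
    currentBufferIndex--;
  }

  // Exposed only so tests can inspect internal state.
  @VisibleForTesting
  ConcurrentHashMap<Long, Long> getCommitInfoMap() {
    return commitInfoMap;
  }
}
{code}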

 

 


was (Author: jnp):
I have yet to complete the review, but here are some comments.
# {code:java}
ioException = new IOException(
"Unexpected Storage Container Exception: " + e.toString(), e);
adjustBuffersOnException();
throw ioException;{code}
What is the reason for this change?
# {code:java}
if (currentBufferIndex >= 0) {
  currentBufferIndex--;
}
{code}
This doesn't look right, because it will hide a bug elsewhere. If releaseBuffer 
is called, a buffer must have been allocated somewhere. We should instead add a 
precondition check.
# {code:java}
public Throwable checkForException
public ConcurrentHashMap getCommitInfoMap(){code}
These should be annotated with {{VisibleForTesting}}
# In {{testBufferCaching}}, some comments don't look right; e.g., after the 
flush call, data is flushed but may not be acked.

 

 

> Add more unit tests to verify BlockOutputStream functionalities
> ---
>
> Key: HDDS-1312
> URL: https://issues.apache.org/jira/browse/HDDS-1312
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Affects Versions: 0.5.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Blocker
> Attachments: HDDS-1312.000.patch
>
>
> This jira aims to add more unit test coverage for BlockOutputStream 
> functionalities.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1312) Add more unit tests to verify BlockOutputStream functionalities

2019-03-21 Thread Jitendra Nath Pandey (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16797871#comment-16797871
 ] 

Jitendra Nath Pandey commented on HDDS-1312:


I have yet to complete the review, but here are some comments.
# {code:java}
ioException = new IOException(
"Unexpected Storage Container Exception: " + e.toString(), e);
adjustBuffersOnException();
throw ioException;{code}
What is the reason for this change?
# {code:java}
if (currentBufferIndex >= 0) {
  currentBufferIndex--;
}
{code}
This doesn't look right, because it will hide a bug elsewhere. If releaseBuffer 
is called, a buffer must have been allocated somewhere. We should instead add a 
precondition check.
# {code:java}
public Throwable checkForException
public ConcurrentHashMap getCommitInfoMap(){code}
These should be annotated with {{VisibleForTesting}}
# In {{testBufferCaching}}, some comments don't look right; e.g., after the 
flush call, data is flushed but may not be acked.

 

 

> Add more unit tests to verify BlockOutputStream functionalities
> ---
>
> Key: HDDS-1312
> URL: https://issues.apache.org/jira/browse/HDDS-1312
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Affects Versions: 0.5.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Blocker
> Attachments: HDDS-1312.000.patch
>
>
> This jira aims to add more unit test coverage for BlockOutputStream 
> functionalities.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1088) Add blockade Tests to test Replica Manager

2019-03-19 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1088:
---
Fix Version/s: 0.5.0

> Add blockade Tests to test Replica Manager
> --
>
> Key: HDDS-1088
> URL: https://issues.apache.org/jira/browse/HDDS-1088
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Nilotpal Nandi
>Assignee: Nilotpal Nandi
>Priority: Major
> Fix For: 0.5.0
>
> Attachments: HDDS-1088.001.patch, HDDS-1088.002.patch, 
> HDDS-1088.003.patch, HDDS-1088.004.patch, HDDS-1088.005.patch, 
> HDDS-1088.006.patch, HDDS-1088.007.patch, HDDS-1088.008.patch
>
>
> We need to add tests for the Replica Manager covering scenarios such as loss 
> of a node, addition of new nodes, and under-replicated containers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1312) Add more unit tests to verify BlockOutputStream functionalities

2019-03-19 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1312:
---
Priority: Blocker  (was: Major)

> Add more unit tests to verify BlockOutputStream functionalities
> ---
>
> Key: HDDS-1312
> URL: https://issues.apache.org/jira/browse/HDDS-1312
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Affects Versions: 0.5.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Blocker
> Attachments: HDDS-1312.000.patch
>
>
> This jira aims to add more unit test coverage for BlockOutputStream 
> functionalities.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1312) Add more unit tests to verify BlockOutputStream functionalities

2019-03-19 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1312:
---
Fix Version/s: (was: 0.5.0)

> Add more unit tests to verify BlockOutputStream functionalities
> ---
>
> Key: HDDS-1312
> URL: https://issues.apache.org/jira/browse/HDDS-1312
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Affects Versions: 0.5.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
> Attachments: HDDS-1312.000.patch
>
>
> This jira aims to add more unit test coverage for BlockOutputStream 
> functionalities.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1312) Add more unit tests to verify BlockOutputStream functionalities

2019-03-19 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1312:
---
Target Version/s: 0.4.0  (was: 0.5.0)

> Add more unit tests to verify BlockOutputStream functionalities
> ---
>
> Key: HDDS-1312
> URL: https://issues.apache.org/jira/browse/HDDS-1312
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Affects Versions: 0.5.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
> Attachments: HDDS-1312.000.patch
>
>
> This jira aims to add more unit test coverage for BlockOutputStream 
> functionalities.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1312) Add more unit tests to verify BlockOutputStream functionalities

2019-03-19 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1312:
---
Status: Patch Available  (was: Open)

> Add more unit tests to verify BlockOutputStream functionalities
> ---
>
> Key: HDDS-1312
> URL: https://issues.apache.org/jira/browse/HDDS-1312
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Affects Versions: 0.5.0
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
> Fix For: 0.5.0
>
> Attachments: HDDS-1312.000.patch
>
>
> This jira aims to add more unit test coverage for BlockOutputStream 
> functionalities.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1310) In a datanode, once a container becomes unhealthy, datanode restart fails.

2019-03-19 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1310:
---
Priority: Blocker  (was: Major)

> In a datanode, once a container becomes unhealthy, datanode restart fails.
> ---
>
> Key: HDDS-1310
> URL: https://issues.apache.org/jira/browse/HDDS-1310
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Affects Versions: 0.3.0
>Reporter: Sandeep Nemuri
>Assignee: Sandeep Nemuri
>Priority: Blocker
>
> When a container is marked {{UNHEALTHY}} on a datanode, a subsequent restart 
> of that datanode fails because it can no longer generate ContainerReports: 
> the unhealthy state of a container is not handled in ContainerReport 
> generation inside the datanode.
> We get the exception below when a datanode tries to generate a 
> ContainerReport that contains unhealthy container(s):
> {noformat}
> 2019-03-19 13:51:13,646 [Datanode State Machine Thread - 0] ERROR  - 
> Unable to communicate to SCM server at x.x.xxx:9861 for past 3300 
> seconds.
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
>  Invalid Container state found: 86
> at 
> org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.getHddsState(KeyValueContainer.java:623)
> at 
> org.apache.hadoop.ozone.container.keyvalue.KeyValueContainer.getContainerReport(KeyValueContainer.java:593)
> at 
> org.apache.hadoop.ozone.container.common.impl.ContainerSet.getContainerReport(ContainerSet.java:204)
> at 
> org.apache.hadoop.ozone.container.ozoneimpl.ContainerController.getContainerReport(ContainerController.java:82)
> at 
> org.apache.hadoop.ozone.container.common.states.endpoint.RegisterEndpointTask.call(RegisterEndpointTask.java:114)
> at 
> org.apache.hadoop.ozone.container.common.states.endpoint.RegisterEndpointTask.call(RegisterEndpointTask.java:47)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
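> For illustration, a self-contained sketch of the kind of handling that 
> appears to be missing (enum and method names are hypothetical stand-ins, 
> not the actual KeyValueContainer code): map every persisted state, 
> including UNHEALTHY, to a reportable state instead of throwing.
> {code:java}
> // Hypothetical stand-ins for the datanode-local and report-side states.
> enum LocalState { OPEN, CLOSING, QUASI_CLOSED, CLOSED, UNHEALTHY }
> enum ReportState { OPEN, CLOSING, QUASI_CLOSED, CLOSED, UNHEALTHY }
> 
> class ContainerReportSketch {
>   ReportState toReportState(LocalState state) {
>     switch (state) {
>       case OPEN:         return ReportState.OPEN;
>       case CLOSING:      return ReportState.CLOSING;
>       case QUASI_CLOSED: return ReportState.QUASI_CLOSED;
>       case CLOSED:       return ReportState.CLOSED;
>       // The missing case: report UNHEALTHY rather than failing the whole
>       // report, so a datanode with one bad container can still register.
>       case UNHEALTHY:    return ReportState.UNHEALTHY;
>       default:
>         throw new IllegalStateException(
>             "Invalid Container state found: " + state);
>     }
>   }
> }
> {code}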



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-1289) get Key failed on SCM restart

2019-03-15 Thread Jitendra Nath Pandey (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-1289:
---
Priority: Blocker  (was: Critical)

> get Key failed on SCM restart
> -
>
> Key: HDDS-1289
> URL: https://issues.apache.org/jira/browse/HDDS-1289
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Reporter: Nilotpal Nandi
>Assignee: Shashikant Banerjee
>Priority: Blocker
> Attachments: 
> hadoop-hdfs-scm-ctr-e139-1542663976389-86524-01-03.log
>
>
> Seeing ContainerNotFoundException in the SCM log when a get-key operation is 
> attempted after SCM restart.
> scm.log:
> [^hadoop-hdfs-scm-ctr-e139-1542663976389-86524-01-03.log]
>  
> {noformat}
>  
>  
> ozone version :
> 
> Source code repository g...@github.com:hortonworks/ozone.git -r 
> 67b7c4fd071b3f557bdb54be2a266b8a611cbce6
> Compiled by jenkins on 2019-03-06T22:02Z
> Compiled with protoc 2.5.0
> From source with checksum 65be9a337d178cd3855f5c5a2f111
> Using HDDS 0.4.0.3.0.100.0-348
> Source code repository g...@github.com:hortonworks/ozone.git -r 
> 67b7c4fd071b3f557bdb54be2a266b8a611cbce6
> Compiled by jenkins on 2019-03-06T22:01Z
> Compiled with protoc 2.5.0
> From source with checksum 324109cb3e8b188c1b89dc0b328c3a
> root@ctr-e139-1542663976389-86524-01-06 hdfs# hadoop version
> Hadoop 3.1.1.3.0.100.0-348
> Source code repository g...@github.com:hortonworks/hadoop.git -r 
> 484434b1c2480bdc9314a7ee1ade8a0f4db1758f
> Compiled by jenkins on 2019-03-06T22:14Z
> Compiled with protoc 2.5.0
> From source with checksum ba6aad94c14256ef3ad8634e3b5086
> This command was run using 
> /usr/hdp/3.0.100.0-348/hadoop/hadoop-common-3.1.1.3.0.100.0-348.jar
> {noformat}
>  
>  
>  
> {noformat}
> 2019-03-13 17:00:54,348 ERROR container.ContainerReportHandler 
> (ContainerReportHandler.java:processContainerReplicas(173)) - Received 
> container report for an unknown container 22 from datanode 
> 80f046cb-6fe2-4a05-bb67-9bf46f48723b{ip: 172.27.69.155, host: 
> ctr-e139-1542663976389-86524-01-05.hwx.site} {} 
> org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: #22 at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:543)
>  at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.updateContainerReplica(ContainerStateMap.java:230)
>  at 
> org.apache.hadoop.hdds.scm.container.ContainerStateManager.updateContainerReplica(ContainerStateManager.java:565)
>  at 
> org.apache.hadoop.hdds.scm.container.SCMContainerManager.updateContainerReplica(SCMContainerManager.java:393)
>  at 
> org.apache.hadoop.hdds.scm.container.ReportHandlerHelper.processContainerReplica(ReportHandlerHelper.java:74)
>  at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:159)
>  at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:110)
>  at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:51)
>  at 
> org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:85)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) 2019-03-13 17:00:54,349 ERROR 
> container.ContainerReportHandler 
> (ContainerReportHandler.java:processContainerReplicas(173)) - Received 
> container report for an unknown container 23 from datanode 
> 80f046cb-6fe2-4a05-bb67-9bf46f48723b{ip: 172.27.69.155, host: 
> ctr-e139-1542663976389-86524-01-05.hwx.site} {} 
> org.apache.hadoop.hdds.scm.container.ContainerNotFoundException: #23 at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.checkIfContainerExist(ContainerStateMap.java:543)
>  at 
> org.apache.hadoop.hdds.scm.container.states.ContainerStateMap.updateContainerReplica(ContainerStateMap.java:230)
>  at 
> org.apache.hadoop.hdds.scm.container.ContainerStateManager.updateContainerReplica(ContainerStateManager.java:565)
>  at 
> org.apache.hadoop.hdds.scm.container.SCMContainerManager.updateContainerReplica(SCMContainerManager.java:393)
>  at 
> org.apache.hadoop.hdds.scm.container.ReportHandlerHelper.processContainerReplica(ReportHandlerHelper.java:74)
>  at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.processContainerReplicas(ContainerReportHandler.java:159)
>  at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:110)
>  at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:51)
>  at 
> 

  1   2   3   4   5   6   7   8   9   10   >