[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-11-14 Thread Anu Engineer (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974483#comment-16974483
 ] 

Anu Engineer commented on HDDS-2372:


bq. It's possible to remove the usage of the tmp files but only if we allow 
overwrite for all the chunk files (in case of a leader failure the next attempt 
to write may find the previous chunk file in place). It may be accepted but 
it's a change with more risk.

Why is this an enforced constraint? It is an artifact of our code. It should 
be trivial to check whether the file exists and write chunk_file_v1, 
chunk_file_v2, etc. Anyway, as you mentioned, we will rewrite this whole path 
anyway, so it is probably OK to do what you think works now.
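The versioned-filename idea could be sketched like this (illustrative Java with a hypothetical helper, not the actual datanode code): instead of overwriting an existing chunk file after a leader failure, the writer picks the first unused version suffix.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class VersionedChunkName {
    // Hypothetical sketch of the "chunk_file_v1, chunk_file_v2" idea:
    // instead of overwriting a leftover chunk file, pick the next version
    // suffix that is not yet taken on disk.
    static Path nextVersion(Path dir, String chunkName) {
        for (int v = 1; ; v++) {
            Path candidate = dir.resolve(chunkName + "_v" + v);
            if (!Files.exists(candidate)) {
                return candidate; // first unused version wins
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("chunks");
        Path first = nextVersion(dir, "chunk_file");
        Files.write(first, new byte[]{1});          // simulate a partial write left behind
        Path second = nextVersion(dir, "chunk_file");
        System.out.println(first.getFileName());    // chunk_file_v1
        System.out.println(second.getFileName());   // chunk_file_v2
    }
}
```

The linear probe is only for illustration; a real implementation would have to persist or derive the current version so readers know which file is authoritative.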

> Datanode pipeline is failing with NoSuchFileException
> -
>
> Key: HDDS-2372
> URL: https://issues.apache.org/jira/browse/HDDS-2372
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Marton Elek
>Assignee: Marton Elek
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Found it on a k8s based test cluster using a simple 3 node cluster and the 
> HDDS-2327 freon test. After a while the StateMachine becomes unhealthy after 
> this error:
> {code:java}
> datanode-0 datanode java.util.concurrent.ExecutionException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
>  java.nio.file.NoSuchFileException: 
> /data/storage/hdds/2a77fab9-9dc5-4f73-9501-b5347ac6145c/current/containerDir0/1/chunks/gGYYgiTTeg_testdata_chunk_13931.tmp.2.20830
>  {code}
> Can be reproduced.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-11-14 Thread Marton Elek (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974161#comment-16974161
 ] 

Marton Elek commented on HDDS-2372:
---

We had a long discussion with [~shashikant]. Here is the summary:
 # It's possible to remove the usage of the tmp files, but only if we allow 
overwrite for all the chunk files (in case of a leader failure, the next attempt 
to write may find the previous chunk file in place). It may be acceptable, but 
it's a change with more risk.
 # The proper solution is to use the same file to write multiple chunks. It's a 
bigger change and requires time, but it will enable removing the tmp files 
anyway.
 # It seems to be a safe option to keep the usage of the tmp file (but with a 
triple FileNotFound check based on exceptions) and remove it only as part of 
the bigger change (2), which should be done very soon anyway.

I uploaded the initial patch (including a fix for a problem found by 
[~shashikant] during an IRL code review; thanks for that).

For now, I have started testing it in my cluster with the ChunkWriter freon test.

 




[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-11-07 Thread Anu Engineer (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969470#comment-16969470
 ] 

Anu Engineer commented on HDDS-2372:


> Thanks Anu Engineer for the suggestion. Writing to the actual chunk file may 
> lead to handling truncation log entries in Ratis inside Ozone which we don't 
> need to handle right now as we always write to tmp chunk files

That is correct. That is one of the reasons why we went the tmp-file way. But at 
that time we did not have the Data Scrubber thread. Now we do have a data 
scrubber thread, so it is trivial for the chunk file to be detected as junk and 
cleaned up by that thread.




[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-11-07 Thread Shashikant Banerjee (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969379#comment-16969379
 ] 

Shashikant Banerjee commented on HDDS-2372:
---

In Ratis, raft log entries can get truncated after a leader election happens. 
The data write actually happens as part of appending the log entry itself. 
Currently, if the raft log gets truncated, we don't do any handling for those 
entries, i.e., we don't delete/validate the chunk files written as part of the 
log entries, because the data always exists in the tmp files, which are stamped 
with the term and log index. These tmp files are not visible and will remain as 
garbage even if the corresponding log entries in the raft log have been 
truncated.

If we write to the actual chunk file, which happens as part of writing the log 
itself, then correspondingly, if those log entries get truncated, we might need 
to handle this inside Ozone by deleting the corresponding chunk files as well 
to maintain consistency, or we would have to validate the data while updating 
the RocksDB entries.




[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-11-07 Thread Marton Elek (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969286#comment-16969286
 ] 

Marton Elek commented on HDDS-2372:
---

{quote}Writing to the actual chunk file may lead to handling truncation log 
entries in Ratis inside Ozone which we don't need to handle right now as we 
always write to tmp chunk files. Even if log entries get truncated inside Ratis 
, tmp files are left behind as garbage.
{quote}
Sorry, it's not clear to me what this means. Can you please give more details 
about this scenario? We may have garbage tmp files anyway.

The suggestion from [~aengineer] would have a big benefit. The code would be 
simplified a lot, as we wouldn't need to write anything during the commit 
phase. The current code is a little tricky: we have the same writeChunk method 
for both commit and write, and a flag (inside DispatcherContext) shows which 
function is called.

 




[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-11-06 Thread Shashikant Banerjee (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968960#comment-16968960
 ] 

Shashikant Banerjee commented on HDDS-2372:
---

Thanks [~aengineer] for the suggestion. Writing to the actual chunk file may 
lead to handling truncated log entries in Ratis inside Ozone, which we don't 
need to handle right now, as we always write to tmp chunk files. Even if log 
entries get truncated inside Ratis, tmp files are left behind as garbage.




[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-11-06 Thread Anu Engineer (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968767#comment-16968767
 ] 

Anu Engineer commented on HDDS-2372:



In the chunk write path, we write chunks to a temp file and then rename them to 
the final file.

However, until we commit a block, any chunk file is effectively a temp file, 
since no one can see the chunk file name until we commit the ChunkInfo into 
RocksDB.

So if we remove the tmpChunkFile and always write to the real chunk file, this 
race condition will go away.
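The write path described above (write to a tmp file, then rename on commit) can be sketched as follows; the names are illustrative, not the actual Ozone code. The race this issue hits sits between a reader resolving the tmp name and the rename removing it.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TmpThenRename {
    // Illustrative sketch of the current write path: write the chunk data to
    // a tmp file, then atomically rename it to the final name at commit time.
    // A reader that resolved the tmp name just before the rename will hit
    // NoSuchFileException -- the race discussed in this issue.
    static void writeChunk(Path tmp, Path finalFile, byte[] data) throws Exception {
        Files.write(tmp, data);                                     // write phase
        Files.move(tmp, finalFile, StandardCopyOption.ATOMIC_MOVE); // commit phase
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("chunks");
        Path tmp = dir.resolve("chunk_1.tmp");
        Path fin = dir.resolve("chunk_1");
        writeChunk(tmp, fin, "payload".getBytes());
        System.out.println(Files.exists(tmp)); // false: tmp is gone after the rename
        System.out.println(Files.exists(fin)); // true
    }
}
```

Writing directly to the final name, as suggested, collapses the two phases and removes the window in which neither name reliably resolves.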




[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-11-06 Thread Marton Elek (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968264#comment-16968264
 ] 

Marton Elek commented on HDDS-2372:
---

I had a long conversation with [~shashikant] and he helped me a lot to 
understand the problem (thanks again). Here are our proposals:

 

[Problem 1]: race condition between read (reading the statemachine data to send 
it to the followers) and commit.

This can be solved using a second read attempt after catching the exception.

[Problem 2]: race condition between writeStateMachineData and 
readStateMachineData (the statemachine data write might not be finished when we 
start to read back the data, in case of a missing cache entry).


This can be fixed by checking the size of the data and comparing it with the 
length which is part of the chunk write request.

[Problem 3]: race condition between close container / write chunk / read chunk: 
a write chunk may be declined because the container is closed; in this case the 
read chunk error should be ignored silently instead of throwing an exception 
to Ratis. This can be done using the bcsid: if it's newer than the 
term/index of the close container, the request can be safely ignored.
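The check proposed for problem 2 amounts to comparing the on-disk size with the length carried by the write request. A minimal sketch of that idea, with a hypothetical helper name rather than the actual Ozone code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ChunkWriteCheck {
    // Sketch of the problem-2 check: a chunk write is only treated as
    // finished once the file holds at least the length declared in the
    // chunk write request; a shorter file means the write is still racing
    // with the read and the reader should wait/retry.
    static boolean writeFinished(Path chunkFile, long declaredLen) throws IOException {
        return Files.exists(chunkFile) && Files.size(chunkFile) >= declaredLen;
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("chunk", ".tmp");
        Files.write(f, new byte[512]);              // half-written chunk
        System.out.println(writeFinished(f, 1024)); // false: write not finished
        Files.write(f, new byte[1024]);             // full chunk on disk
        System.out.println(writeFinished(f, 1024)); // true
    }
}
```

A size check alone cannot detect torn writes within the declared length, which is presumably why it is paired with the retry of problem 1 rather than used on its own.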




[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-11-06 Thread Marton Elek (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968213#comment-16968213
 ] 

Marton Elek commented on HDDS-2372:
---

Yes, I tried this, but it doesn't work. Assume the algorithm is the following:

 
 # file := finalPath
 # if (!file.exists()) file := tmpPath
 # if (!file.exists()) file := finalPath

If the move happens between 2 and 3, the file value will be tmpPath instead of 
finalPath. One option is to catch the FileNotFoundException and retry.
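A catch-and-retry read along those lines could look like this (an illustrative sketch with made-up method and path names, not the actual ChunkManager code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class RetryingChunkRead {
    // Sketch of the catch-and-retry idea: prefer the committed (final) file,
    // fall back to the tmp file, and if the tmp file vanishes underneath us
    // because a concurrent commit renamed it, retry against the final path.
    static byte[] readChunk(Path finalPath, Path tmpPath) throws IOException {
        Path candidate = Files.exists(finalPath) ? finalPath : tmpPath;
        try {
            return Files.readAllBytes(candidate);
        } catch (NoSuchFileException e) {
            // The tmp -> final rename raced with the existence check; the
            // data must now be at the final location, so one retry suffices.
            return Files.readAllBytes(finalPath);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("chunks");
        Path tmp = dir.resolve("chunk_1.tmp");
        Path fin = dir.resolve("chunk_1");
        Files.write(tmp, "data".getBytes());
        System.out.println(new String(readChunk(fin, tmp))); // served from tmp
        Files.move(tmp, fin, StandardCopyOption.ATOMIC_MOVE); // simulate the commit
        System.out.println(new String(readChunk(fin, tmp))); // served from final
    }
}
```

The key point is that the retry targets the final path: after a failed tmp read, the rename is the only way the file could have disappeared, so the data is guaranteed to be at the final location.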

 

But there is another (slightly different) question:

What about having a read and a write at the same time? How is it guaranteed 
that writeStateMachineData is finished before the next readStateMachineData is 
started? Is that guaranteed by Ratis?




[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-11-05 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967873#comment-16967873
 ] 

Tsz-wo Sze commented on HDDS-2372:
--

It makes sense to check the chunk file again after a temporary chunk file 
failure to avoid the problem here. This solution is simple, and no 
synchronization is needed.




[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-10-31 Thread Shashikant Banerjee (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963815#comment-16963815
 ] 

Shashikant Banerjee commented on HDDS-2372:
---

[~szetszwo], to answer your question precisely: while reading the data from the 
state machine, it first checks whether the chunk file exists. If it exists, it 
reads from the actual chunk file; if it does not exist, it reads from the 
temporary chunk file.




[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-10-31 Thread Shashikant Banerjee (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963813#comment-16963813
 ] 

Shashikant Banerjee commented on HDDS-2372:
---

Thanks [~elek]. I do agree that there is no synchronisation between 
readStateMachineData and applyTransaction, which may lead to a NoSuchFile 
exception as you suggested, but the appendRequest will be retried in the leader 
and the system should recover thereafter, once the commit of the writeChunk 
completes.

In teragen testing I ran into the same issue as well, but my test did complete. 
Can you share the logs/test to reproduce this?




[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-10-31 Thread Marton Elek (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963796#comment-16963796
 ] 

Marton Elek commented on HDDS-2372:
---

cc: [~msingh]




[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-10-29 Thread Marton Elek (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962159#comment-16962159
 ] 

Marton Elek commented on HDDS-2372:
---

I tried to reproduce it locally with docker-compose. In the container state 
machine I reduced the capacity of the cache:
{code:java}
stateMachineDataCache = CacheBuilder.newBuilder()
.expireAfterAccess(500, TimeUnit.MILLISECONDS)
// set the limit on no of cached entries equal to no of max threads
// executing writeStateMachineData
.maximumSize(10).build();

{code}
And I added a random wait to readStateMachineData:
{code:java}
private ByteString readStateMachineData(
ContainerCommandRequestProto requestProto, long term, long index)
throws IOException {
  if (Math.random() > 0.7) {
try {
  Thread.sleep(100);
} catch (InterruptedException e) {
  e.printStackTrace();
}
  } {code}
I got a similar, but different, error:
{code:java}
-SegmentedRaftLogWorker: created new log segment 
/data/metadata/ratis/68c226d2-356c-4eb0-aee2-ce458d4b0095/current/log_inprogress_6872
ESC[32mdatanode_3|ESC[0m 2019-10-29 15:54:10,084 [pool-7-thread-38] ERROR   
   - Unable to find the chunk file. chunk info : 
ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024}
ESC[32mdatanode_3|ESC[0m 2019-10-29 15:54:10,085 [pool-7-thread-38] INFO
   - Operation: ReadChunk : Trace ID: 
b93bcdcdd7fd37c:a3bed642046e9e09:b93bcdcdd7fd37c:1 : Message: Unable to find 
the chunk file. chunk info ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, 
offset=0, len=1024} : Result: UNABLE_TO_FIND_CHUNK
ESC[32mdatanode_3|ESC[0m 2019-10-29 15:54:10,085 [pool-7-thread-38] ERROR   
   - gid group-CE458D4B0095 : ReadStateMachine failed. cmd ReadChunk logIndex 
8773 msg : Unable to find the chunk file. chunk info 
ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024} 
Container Result: UNABLE_TO_FIND_CHUNK
ESC[32mdatanode_3|ESC[0m 2019-10-29 15:54:10,086 ERROR raftlog.RaftLog: 
06f4231d-30a8-42fd-839e-aeaea7b1aa72@group-CE458D4B0095-SegmentedRaftLog: 
Failed readStateMachineData for (t:2, i:8773), STATEMACHINELOGENTRY, 
client-BCA58E609475, cid=4367
ESC[32mdatanode_3|ESC[0m java.util.concurrent.ExecutionException: 
java.util.concurrent.ExecutionException: 
org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: 
Unable to find the chunk file. chunk info 
ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024}
ESC[32mdatanode_3|ESC[0mat 
java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
ESC[32mdatanode_3|ESC[0mat 
java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2022)
ESC[32mdatanode_3|ESC[0mat 
org.apache.ratis.server.raftlog.RaftLog$EntryWithData.getEntry(RaftLog.java:472)
ESC[32mdatanode_3|ESC[0mat 
org.apache.ratis.util.DataQueue.pollList(DataQueue.java:134)
ESC[32mdatanode_3|ESC[0mat 
org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:220)
ESC[32mdatanode_3|ESC[0mat 
org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:178)
ESC[32mdatanode_3|ESC[0mat 
org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:121)
ESC[32mdatanode_3|ESC[0mat 
org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:76)
ESC[32mdatanode_3|ESC[0mat 
java.base/java.lang.Thread.run(Thread.java:834) {code}
And the cluster got stuck in a bad state (couldn't write any more chunks, ever):
{code:java}
datanode_1| 2019-10-29 15:54:10,099 INFO impl.RaftServerImpl: 
6b9ca1af-467f-40c7-a21d-118cb34080b1@group-CE458D4B0095: inconsistency entries. 
Reply:06f4231d-30a8-42fd-839e-aeaea7b1aa72<-6b9ca1af-467f-40c7-a21d-118cb34080b1#0:FAIL,INCONSISTENCY,nextIndex:8773,term:2,followerCommit:8768
 {code}
Correct me if I am wrong, but:
 * I think the write path should work even if the cache is limited or there are 
unexpected sleeps.
 * If there are some inconsistencies, the raft ring should be healed, or closed 
and reopened (but that's an independent issue).


[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-10-29 Thread Marton Elek (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962153#comment-16962153
 ] 

Marton Elek commented on HDDS-2372:
---

Thanks for the help, [~szetszwo].
 # I found it only on one datanode. But it's hard to reproduce; usually I need 
to write a lot of chunks.
 # Yes, the test writes chunks to one Ratis pipeline without using any real 
block id / container id. It's uploaded in HDDS-2327 (use the patch + ozone freon 
dcg -n 10).
 # Yes, this is the logic in ChunkManagerImpl.readChunk, but I can't see any 
lock / sync between checking the files. The chunk can be committed in the middle 
of the read (IMHO).

{code:java}

if (containerData.getLayOutVersion() == ChunkLayOutVersion
.getLatestVersion().getVersion()) {
  File chunkFile = ChunkUtils.getChunkFile(containerData, info);

  // In case the chunk file does not exist but tmp chunk file exist,
  // read from tmp chunk file if readFromTmpFile is set to true
  if (!chunkFile.exists() && dispatcherContext != null
  && dispatcherContext.isReadFromTmpFile()) {

 //WHAT IF CHUNK IS COMMITTED AT THIS POINT?

chunkFile = getTmpChunkFile(chunkFile, dispatcherContext);
  }
  data = ChunkUtils.readData(chunkFile, info, volumeIOStats); {code}
 




[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-10-29 Thread Tsz-wo Sze (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962112#comment-16962112
 ] 

Tsz-wo Sze commented on HDDS-2372:
--

Some questions (sorry that I don't understand the test):
- Did the NoSuchFileException happen on all three datanodes, or just one?
- What did the test do? Write a lot of chunks to one Ratis pipeline?
- Did the read in B.3 fail? It sounds like yes, according to "the chunk can't 
be read any more from the tmp file." Was the tmp file moved to another 
location? If yes, the read should also try reading from there.

Since this can be reproduced, we should add more log messages to trace back 
when the tmp file got created, moved, or deleted.
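A minimal sketch of that tracing idea, assuming the commit is a plain rename: wrap each tmp-file transition in a log line so the file's lifecycle can be reconstructed from the datanode log after a failure. The commitChunk helper and paths are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.logging.Logger;

public class ChunkFileAudit {
  private static final Logger LOG =
      Logger.getLogger(ChunkFileAudit.class.getName());

  // Log both sides of the rename so create/move/delete events of the tmp
  // file show up in order in the log.
  static void commitChunk(Path tmp, Path fin) throws IOException {
    LOG.info(() -> "Committing chunk: renaming " + tmp + " -> " + fin);
    Files.move(tmp, fin, StandardCopyOption.ATOMIC_MOVE);
    LOG.info(() -> "Committed chunk file " + fin);
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("audit");
    Path tmp = dir.resolve("chunk_1.tmp");
    Files.write(tmp, "data".getBytes());
    commitChunk(tmp, dir.resolve("chunk_1"));
    System.out.println(Files.exists(dir.resolve("chunk_1")));
  }
}
```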



[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-10-29 Thread Marton Elek (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961939#comment-16961939
 ] 

Marton Elek commented on HDDS-2372:
---

[~szetszwo] Do you think this is a possible explanation?



[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException

2019-10-29 Thread Marton Elek (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961938#comment-16961938
 ] 

Marton Elek commented on HDDS-2372:
---

Let's say I am writing chunks. Imagine the following timing.

 

Flow A
 # The Leader receives the write chunk request.
 # The chunk is written to the disk (WRITE_DATA) and saved to the cache.
 # The WriteChunk is sent to Follower1 with the next heartbeat (HB).
 # As the WriteChunk has been added to Follower1 and the Leader, it can be 
committed.
 # The write chunk commit is called (COMMIT_DATA) and the tmp file is renamed 
to the final name.

 

Flow B
 # An HB should be sent to Follower2.
 # For some reason the cache is empty (too many other requests?), so the write 
chunk has to be read from the disk.
 # A new ReadChunk request is executed by the HddsDispatcher and the chunk data 
is read (from another thread, it's *async*).
 # The HB with the read chunk data is then sent.

 

As B.3 is an async operation, it's possible that during B.3 the write chunk 
is committed (A.5), and the chunk can't be read from the tmp file any more.
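The interleaving above can be reproduced deterministically in a few lines: the async read of B.3 is collapsed into a single thread that first checks which file exists and only then reads, with the A.5 rename squeezed in between. Paths and class names are made up.

```java
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TmpChunkRace {
  public static void main(String[] args) throws Exception {
    Path dir = Files.createTempDirectory("race");
    Path tmp = dir.resolve("chunk_1.tmp");
    Path fin = dir.resolve("chunk_1");
    Files.write(tmp, "data".getBytes());

    // B.3: the async reader sees only the tmp file and decides to read it...
    boolean readFromTmp = !Files.exists(fin) && Files.exists(tmp);

    // ...but A.5 (COMMIT_DATA) renames tmp -> final before the read runs.
    Files.move(tmp, fin, StandardCopyOption.ATOMIC_MOVE);

    try {
      Files.readAllBytes(readFromTmp ? tmp : fin);
      System.out.println("read ok");
    } catch (NoSuchFileException e) {
      // The same failure mode as in the datanode stack trace.
      System.out.println("NoSuchFileException on tmp file");
    }
  }
}
```

Without the rename between the check and the read, the same code prints "read ok", which is why the failure only shows up under load.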

 
