[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974483#comment-16974483 ] Anu Engineer commented on HDDS-2372: bq. It's possible to remove the usage of the tmp files but only if we allow overwrite for all the chunk files (in case of a leader failure the next attempt to write may find the previous chunk file in place). It may be accepted but it's a change with more risk. Why is this an enforced constraint? It is an artifact of our code. It should be trivial to check if the file exists and write chunk_file_v1, chunk_file_v2, etc. Anyway, as you mentioned, we will rewrite this whole path anyway. So it is probably ok to do what you think works now. > Datanode pipeline is failing with NoSuchFileException > - > > Key: HDDS-2372 > URL: https://issues.apache.org/jira/browse/HDDS-2372 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Critical > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Found it on a k8s based test cluster using a simple 3 node cluster and > HDDS-2327 freon test. After a while the StateMachine became unhealthy after > this error: > {code:java} > datanode-0 datanode java.util.concurrent.ExecutionException: > java.util.concurrent.ExecutionException: > org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: > java.nio.file.NoSuchFileException: > /data/storage/hdds/2a77fab9-9dc5-4f73-9501-b5347ac6145c/current/containerDir0/1/chunks/gGYYgiTTeg_testdata_chunk_13931.tmp.2.20830 > {code} > Can be reproduced. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
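The "chunk_file_v1, chunk_file_v2" idea above could be sketched as follows. This is a hypothetical illustration only (the class and method names are mine, not Ozone code): probe for the first version suffix that does not yet exist, so a chunk file left behind by a failed leader never forces an overwrite.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of versioned chunk-file names: instead of
// overwriting an existing chunk file, pick the first free version suffix.
public class VersionedChunkFile {

  // Returns the first non-existing versioned path for a chunk base name.
  static Path nextVersion(Path dir, String chunkName) {
    int v = 1;
    Path candidate;
    do {
      candidate = dir.resolve(chunkName + "_v" + v);
      v++;
    } while (Files.exists(candidate));
    return candidate;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("chunks");
    Path first = nextVersion(dir, "chunk_file");
    Files.write(first, new byte[]{1, 2, 3});
    Path second = nextVersion(dir, "chunk_file");
    System.out.println(first.getFileName());  // chunk_file_v1
    System.out.println(second.getFileName()); // chunk_file_v2
  }
}
```

A reader deciding which version is authoritative would still need the committed ChunkInfo, which is why the comment treats this as a stopgap until the write path is rewritten.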
[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974161#comment-16974161 ] Marton Elek commented on HDDS-2372: --- We had a long discussion with [~shashikant]. Here is the summary: # It's possible to remove the usage of the tmp files, but only if we allow overwrite for all the chunk files (in case of a leader failure the next attempt to write may find the previous chunk file in place). It may be acceptable, but it's a change with more risk. # The proper solution is to use the same file to write multiple chunks. It's a bigger change, requires time, and will enable removing the usage of tmp files anyway. # It seems to be a safe option to keep the usage of the tmp file (but with a triple FileNotFound check based on exceptions) and remove it only as part of the bigger change (2), which should be done very soon anyway. I uploaded the initial patch (including a fix for a problem found by [~shashikant] during an IRL code review. Thanks for that). I have now started to test it in my cluster with the ChunkWriter freon test.
[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969470#comment-16969470 ] Anu Engineer commented on HDDS-2372: > Thanks Anu Engineer for the suggestion. Writing to the actual chunk file may > lead to handling truncation log entries in Ratis inside Ozone which we don't > need to handle right now as we always write to tmp chunk files That is correct. That is one of the reasons why we did it the tmp way. But at that time we did not have the Data Scrubber thread. Now we do have a data scrubber thread, so it is trivial for the chunk file to be detected as junk and cleaned up by this thread.
[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969379#comment-16969379 ] Shashikant Banerjee commented on HDDS-2372: --- In Ratis, raft log entries can get truncated after a leader election happens. The data write actually happens as part of appending the log entry itself. Currently, if the raft log gets truncated, we don't do any handling for those entries, i.e., we don't delete/validate the chunk files written as part of a log entry, because the data always exists in the tmp files, which are stamped with the term and log index, are not visible, and will remain as garbage even if the corresponding entries in the raft log have been truncated. If we write to the actual chunk file, which happens as part of writing the log itself, then, correspondingly, if those log entries get truncated, we might need to handle this inside Ozone by deleting the corresponding chunk files as well to maintain consistency, or we would have to validate the data while updating the RocksDB entries as well.
[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16969286#comment-16969286 ] Marton Elek commented on HDDS-2372: --- {quote}Writing to the actual chunk file may lead to handling truncation log entries in Ratis inside Ozone which we don't need to handle right now as we always write to tmp chunk files. Even if log entries get truncated inside Ratis, tmp files are left behind as garbage. {quote} Sorry, it's not clear to me what this means. Can you please give more details about this scenario? We may have garbage tmp files anyway. The suggestion from [~aengineer] would have a big benefit: the code would be simplified a lot, as we wouldn't need to write anything during the commit phase. The current code is a little tricky: we have the same writeChunk method for both commit and write, and we have a flag (inside DispatcherContext) which shows which function is being called.
[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968960#comment-16968960 ] Shashikant Banerjee commented on HDDS-2372: --- Thanks [~aengineer] for the suggestion. Writing to the actual chunk file may lead to handling truncation log entries in Ratis inside Ozone, which we don't need to handle right now as we always write to tmp chunk files. Even if log entries get truncated inside Ratis, tmp files are left behind as garbage.
[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968767#comment-16968767 ] Anu Engineer commented on HDDS-2372: In the chunk write path, we write chunks to a temp file and then rename them to the final file. However, until we commit a block, any chunk file is effectively a temp file, since no one can see the chunk file name until we commit the ChunkInfo into RocksDB. So if we remove the tmpChunkFile and always write to the real chunk file, this race condition will go away.
[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968264#comment-16968264 ] Marton Elek commented on HDDS-2372: --- I had a long conversation with [~shashikant] and he helped me a lot to understand the problem (thanks here, again). Here are our proposals: [Problem 1]: race condition between read (reading the statemachine data to send it to the followers) and commit. This can be solved by using a second read attempt after the exception is thrown. [Problem 2]: race condition between writeStateMachineData and readStateMachineData (the statemachine data write might not be finished when we start to read back the data, in case of a missing cache entry). This can be fixed by checking the size of the data and comparing it with the length which is part of the chunk write request. [Problem 3]: race condition between close container / write chunk / read chunk: a write chunk may be declined because the container is closed; in this case the read chunk error should be ignored silently instead of throwing an exception back to Ratis. This can be done using the bcsid. If it's newer than the term/index of the close container, the request can be safely ignored.
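The size check proposed for Problem 2 could look roughly like this. A minimal sketch under stated assumptions: the class and method names are mine, and the expected length is taken from the write-chunk request as described above; a shorter file on disk simply means the write has not finished yet.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the Problem 2 mitigation: before trusting data read back from
// disk in readStateMachineData, verify the file has reached the length
// recorded in the write-chunk request.
public class ChunkLengthCheck {

  // True only when the file exists and holds at least expectedLen bytes.
  static boolean isWriteComplete(Path chunkFile, long expectedLen)
      throws IOException {
    return Files.exists(chunkFile) && Files.size(chunkFile) >= expectedLen;
  }

  public static void main(String[] args) throws IOException {
    Path f = Files.createTempFile("chunk", ".tmp");
    Files.write(f, new byte[512]);
    System.out.println(isWriteComplete(f, 1024)); // false: write unfinished
    Files.write(f, new byte[1024]);
    System.out.println(isWriteComplete(f, 1024)); // true
  }
}
```

A caller that sees `false` would retry or wait rather than propagate an exception up to the Ratis log appender.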
[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16968213#comment-16968213 ] Marton Elek commented on HDDS-2372: --- Yes, I tried this, but it doesn't work. Assuming the algorithm is the following: # file := finalPath # if (!file.exists()) file := tmpPath # if (!file.exists()) file := finalPath If the move happens between 2 and 3, the file value will be tmpPath instead of finalPath. One option is to catch the FileNotFound exception and retry. But there is another (slightly different) question: What about having a read and a write at the same time? How is it guaranteed that the writeStateMachineData is finished before the next readStateMachineData is started? Is it guaranteed by Ratis?
[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967873#comment-16967873 ] Tsz-wo Sze commented on HDDS-2372: -- It makes sense to check the chunk file again after a temporary chunk file failure to avoid the problem here. This solution is simple and no synchronization is needed.
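The "check again after failure" approach suggested above could be sketched as follows. This is an illustrative sketch, not the actual ChunkManagerImpl code; the names are mine. The key point is that a NoSuchFileException raised by the tmp-file read is caught and answered with one more attempt on the final file, which must exist by then if a commit rename caused the failure.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Sketch: prefer the committed chunk file, fall back to the tmp file,
// and retry on the final file if the tmp file vanished under us.
public class RetryingChunkRead {

  static byte[] readChunk(Path finalFile, Path tmpFile) throws IOException {
    try {
      // Prefer the committed file; fall back to the tmp file if absent.
      Path target = Files.exists(finalFile) ? finalFile : tmpFile;
      return Files.readAllBytes(target);
    } catch (NoSuchFileException e) {
      // Race: the tmp file was renamed to the final name between the
      // existence check and the read. One more attempt on the final file
      // resolves it without any locking.
      return Files.readAllBytes(finalFile);
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("chunks");
    Path fin = dir.resolve("chunk_1");
    Path tmp = dir.resolve("chunk_1.tmp");
    Files.write(fin, new byte[]{42});
    System.out.println(readChunk(fin, tmp).length); // 1
  }
}
```

If the retry on the final file also fails, the chunk genuinely does not exist and the original exception should propagate.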
[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963815#comment-16963815 ] Shashikant Banerjee commented on HDDS-2372: --- [~szetszwo], to answer your question precisely: while reading the data from the stateMachine, it first checks whether the chunk file exists. If it exists, it reads from the actual chunk file; if it does not exist, it reads from the temporary chunk file.
[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963813#comment-16963813 ] Shashikant Banerjee commented on HDDS-2372: --- Thanks [~elek]. I do agree that there is no synchronisation between readStateMachineData and applyTransaction, which may lead to a NoSuchFile exception as you suggested, but the appendRequest will be retried on the leader and the system should recover thereafter once the commit of writeChunk completes. In teragen testing as well, I ran into the same issue but my test did complete. Can you share the logs/test to reproduce this?
[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963796#comment-16963796 ] Marton Elek commented on HDDS-2372: --- cc: [~msingh]
[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962159#comment-16962159 ] Marton Elek commented on HDDS-2372: --- I tried to reproduce it locally with docker-compose. In the container state machine I reduced the capacity of the cache:
{code:java}
stateMachineDataCache = CacheBuilder.newBuilder()
    .expireAfterAccess(500, TimeUnit.MILLISECONDS)
    // set the limit on no of cached entries equal to no of max threads
    // executing writeStateMachineData
    .maximumSize(10).build();
{code}
And added a random wait to readStateMachineData:
{code:java}
private ByteString readStateMachineData(
    ContainerCommandRequestProto requestProto, long term, long index)
    throws IOException {
  if (Math.random() > 0.7) {
    try {
      Thread.sleep(100);
    } catch (InterruptedException e) {
      e.printStackTrace();
    }
  }
{code}
I got a similar, but different error:
{code:java}
-SegmentedRaftLogWorker: created new log segment /data/metadata/ratis/68c226d2-356c-4eb0-aee2-ce458d4b0095/current/log_inprogress_6872
datanode_3 | 2019-10-29 15:54:10,084 [pool-7-thread-38] ERROR - Unable to find the chunk file. chunk info : ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024}
datanode_3 | 2019-10-29 15:54:10,085 [pool-7-thread-38] INFO - Operation: ReadChunk : Trace ID: b93bcdcdd7fd37c:a3bed642046e9e09:b93bcdcdd7fd37c:1 : Message: Unable to find the chunk file. chunk info ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024} : Result: UNABLE_TO_FIND_CHUNK
datanode_3 | 2019-10-29 15:54:10,085 [pool-7-thread-38] ERROR - gid group-CE458D4B0095 : ReadStateMachine failed. cmd ReadChunk logIndex 8773 msg : Unable to find the chunk file. chunk info ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024} Container Result: UNABLE_TO_FIND_CHUNK
datanode_3 | 2019-10-29 15:54:10,086 ERROR raftlog.RaftLog: 06f4231d-30a8-42fd-839e-aeaea7b1aa72@group-CE458D4B0095-SegmentedRaftLog: Failed readStateMachineData for (t:2, i:8773), STATEMACHINELOGENTRY, client-BCA58E609475, cid=4367
datanode_3 | java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: Unable to find the chunk file. chunk info ChunkInfo{chunkName='3D4nM8ycqh_testdata_chunk_4366, offset=0, len=1024}
datanode_3 |    at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
datanode_3 |    at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2022)
datanode_3 |    at org.apache.ratis.server.raftlog.RaftLog$EntryWithData.getEntry(RaftLog.java:472)
datanode_3 |    at org.apache.ratis.util.DataQueue.pollList(DataQueue.java:134)
datanode_3 |    at org.apache.ratis.server.impl.LogAppender.createRequest(LogAppender.java:220)
datanode_3 |    at org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:178)
datanode_3 |    at org.apache.ratis.grpc.server.GrpcLogAppender.runAppenderImpl(GrpcLogAppender.java:121)
datanode_3 |    at org.apache.ratis.server.impl.LogAppender$AppenderDaemon.run(LogAppender.java:76)
datanode_3 |    at java.base/java.lang.Thread.run(Thread.java:834)
{code}
And the cluster is stuck in a bad state (couldn't write any more chunks, ever):
{code:java}
datanode_1 | 2019-10-29 15:54:10,099 INFO impl.RaftServerImpl: 6b9ca1af-467f-40c7-a21d-118cb34080b1@group-CE458D4B0095: inconsistency entries. Reply:06f4231d-30a8-42fd-839e-aeaea7b1aa72<-6b9ca1af-467f-40c7-a21d-118cb34080b1#0:FAIL,INCONSISTENCY,nextIndex:8773,term:2,followerCommit:8768
{code}
Fix me if I am wrong, but: * I think the write path should work even if the cache is limited or there are unexpected sleeps * If there are some inconsistencies, the raft ring should be healed, or closed and reopened (but it's an independent issue)
[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962153#comment-16962153 ] Marton Elek commented on HDDS-2372: --- Thanks for the help [~szetszwo] # I found it only on one datanode. But it's hard to reproduce; usually I need to write a lot of chunks # Yes, the test writes chunks to one Ratis pipeline without using any real block id / container id. It's uploaded in HDDS-2327 (use the patch + ozone freon dcg -n 10) # Yes, this is the logic in ChunkManagerImpl.readChunk, but I can't see any lock / sync between checking the files. The chunk can be committed in the middle of the read / tests (IMHO)
{code:java}
if (containerData.getLayOutVersion() == ChunkLayOutVersion
    .getLatestVersion().getVersion()) {
  File chunkFile = ChunkUtils.getChunkFile(containerData, info);
  // In case the chunk file does not exist but tmp chunk file exist,
  // read from tmp chunk file if readFromTmpFile is set to true
  if (!chunkFile.exists() && dispatcherContext != null
      && dispatcherContext.isReadFromTmpFile()) {
    // WHAT IF CHUNK IS COMMITTED AT THIS POINT?
    chunkFile = getTmpChunkFile(chunkFile, dispatcherContext);
  }
  data = ChunkUtils.readData(chunkFile, info, volumeIOStats);
{code}
[jira] [Commented] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962112#comment-16962112 ] Tsz-wo Sze commented on HDDS-2372: -- Some questions (sorry that I don't understand the test): - Did the NoSuchFileException happen on all three datanodes? Or just one? - What did the test do? Write a lot of chunks to one Ratis pipeline? - Did the read in B.3 fail? It sounds like yes, according to "the chunk can't be read any more from the tmp file." Was the tmp file moved to another location? If yes, the read should also try reading from there. Since this can be reproduced, we should add more log messages to trace back when the tmp file got created and moved/deleted.
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961939#comment-16961939 ] Marton Elek commented on HDDS-2372:
@szetszwo Do you think this is a possible explanation?
[ https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961938#comment-16961938 ] Marton Elek commented on HDDS-2372:
Let's say I am writing chunks. Imagine the following timing.

Flow A
# Leader receives the write chunk request
# The chunk is written to disk (WRITE_DATA stage) and saved to the cache
# The WriteChunk is sent to Follower1 with the next HB
# As the WriteChunk has been added to Follower1 and the Leader, it can be committed
# Write chunk commit is called (COMMIT_DATA stage): the tmp file is renamed to the final name

Flow B
# An HB should be sent to Follower2
# For some reason the cache is empty (too many other requests?), so the write chunk data must be read back from disk
# A new ReadChunk request is executed by the HddsDispatcher and the chunk data is read (from another thread: it's *async*)
# The HB with the read data is sent by the leader

As B.3 is an async operation, it is possible that during B.3 the write chunk is committed (A.5), so the chunk can no longer be read from the tmp file.
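The race between A.5 and B.3 can be sketched with plain NIO calls. This is an illustration only, not the actual ChunkManager code: the `writeData`/`commitData` method names and the `.tmp` suffix convention are assumptions modeled on the stages named above (WRITE_DATA writes the tmp file, COMMIT_DATA renames it), and the "reader" simply holds a stale reference to the tmp path, as the async B.3 read would.

```java
import java.io.IOException;
import java.nio.file.*;

public class ChunkCommitRace {

    // WRITE_DATA stage (A.2, hypothetical name): write the chunk to a tmp file.
    static Path writeData(Path dir, String chunkName, byte[] data) throws IOException {
        Path tmp = dir.resolve(chunkName + ".tmp");
        Files.write(tmp, data);
        return tmp;
    }

    // COMMIT_DATA stage (A.5, hypothetical name): rename tmp to the final name.
    static void commitData(Path dir, String chunkName) throws IOException {
        Path tmp = dir.resolve(chunkName + ".tmp");
        Files.move(tmp, dir.resolve(chunkName), StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("chunks");
        writeData(dir, "chunk_1", "payload".getBytes());

        // The async reader (B.3) resolved the tmp path before the commit ran.
        Path stalePath = dir.resolve("chunk_1.tmp");

        commitData(dir, "chunk_1");  // A.5 happens while the read is in flight

        try {
            // The read now fails: the tmp file was renamed away underneath it.
            Files.readAllBytes(stalePath);
        } catch (NoSuchFileException e) {
            System.out.println("NoSuchFileException: " + e.getFile());
        }

        // The data itself is intact under the committed (final) name.
        System.out.println(new String(Files.readAllBytes(dir.resolve("chunk_1"))));
    }
}
```

A read that falls back to the final name after a NoSuchFileException on the tmp path would succeed, which matches the suggestion above that the read "should also try reading from there."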