[ 
https://issues.apache.org/jira/browse/HDDS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974161#comment-16974161
 ] 

Marton Elek commented on HDDS-2372:
-----------------------------------

We had a long discussion with [~shashikant]. Here is the summary:
 # It's possible to remove the tmp files, but only if we allow overwriting 
any chunk file (after a leader failure, the next write attempt may find the 
chunk file from the previous attempt already in place). This may be 
acceptable, but it's a riskier change.
 # The proper solution is to write multiple chunks to the same file. It's a 
bigger change that needs more time, and it will make it possible to remove 
the tmp files anyway.
 # The safe option seems to be to keep the tmp file (but with a triple 
FileNotFound check based on exceptions) and to remove it only as part of 
the bigger change (2), which should be done very soon anyway.
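
To make option 3 more concrete, here is a minimal sketch of a tmp-file commit that relies on the exception instead of an up-front existence check. It is not the uploaded patch: the class and method names (TmpChunkCommitSketch, writeTmp, commit) are hypothetical, and it covers only one possible reading of the exception-based check, namely the case where the tmp file is already gone because an earlier attempt committed it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper for illustration only; not the actual Ozone code. */
final class TmpChunkCommitSketch {

  private TmpChunkCommitSketch() { }

  /** Write stage: data goes to the tmp file; a retry may overwrite the tmp file. */
  static void writeTmp(Path tmpFile, byte[] data) throws IOException {
    Files.write(tmpFile, data,
        StandardOpenOption.CREATE,
        StandardOpenOption.TRUNCATE_EXISTING,
        StandardOpenOption.WRITE);
  }

  /**
   * Commit stage: rename the tmp file to the final chunk file.
   *
   * @return true if this call moved the tmp file, false if the tmp file was
   *         already gone but the final chunk file is in place (assumed to be
   *         the result of an earlier commit attempt).
   */
  static boolean commit(Path tmpFile, Path chunkFile) throws IOException {
    try {
      // No exists() check before the move: we react to the exception instead.
      Files.move(tmpFile, chunkFile);
      return true;
    } catch (NoSuchFileException e) {
      if (Files.exists(chunkFile)) {
        // Assumption: a previous attempt already committed this chunk.
        return false;
      }
      // Neither the tmp file nor the chunk file exists: a real error.
      throw e;
    }
  }
}
```

The sketch deliberately does not pass REPLACE_EXISTING to Files.move, so it does not overwrite an existing chunk file; that would be the riskier behaviour described in option 1.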

I uploaded the initial patch, including a fix for a problem found by 
[~shashikant] during an IRL code review (thanks for that).

I have now started testing it on my cluster with the ChunkWriter freon test.

 

> Datanode pipeline is failing with NoSuchFileException
> -----------------------------------------------------
>
>                 Key: HDDS-2372
>                 URL: https://issues.apache.org/jira/browse/HDDS-2372
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: Marton Elek
>            Assignee: Shashikant Banerjee
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Found on a k8s-based test cluster (a simple 3-node cluster) running the 
> HDDS-2327 freon test. After a while the StateMachine becomes unhealthy after 
> this error:
> {code:java}
> datanode-0 datanode java.util.concurrent.ExecutionException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException:
>  java.nio.file.NoSuchFileException: 
> /data/storage/hdds/2a77fab9-9dc5-4f73-9501-b5347ac6145c/current/containerDir0/1/chunks/gGYYgiTTeg_testdata_chunk_13931.tmp.2.20830
>  {code}
> Can be reproduced.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
