[ 
https://issues.apache.org/jira/browse/FLINK-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746125#comment-15746125
 ] 

ASF GitHub Bot commented on FLINK-5332:
---------------------------------------

GitHub user StephanEwen opened a pull request:

    https://github.com/apache/flink/pull/2999

    [FLINK-5332] [core] Synchronize FileSystem::initOutPathLocalFS() to prevent 
lost files/directories when called concurrently

    This is mainly relevant to tests and Local Mini Cluster executions.
    
    The `FileOutputFormat` and its subclasses rely on 
`FileSystem::initOutPathLocalFS()` to prepare the output directory. When 
multiple parallel output writers call that method, there is a slim chance that 
one parallel threads deletes the others directory. The checks that the method 
has are not bullet proof.
    
    I believe that this is the cause for many Travis test instabilities that we 
observed over time.
    
    Simply synchronizing that method per process should do the trick. Since it 
is a rare initialization method, and only relevant in tests & local mini 
cluster executions, it should be a price that is okay to pay. I see no other 
way, as we do not have simple access to an atomic "check and delete and 
recreate" file operation.
    The synchronization also makes many "re-try" code paths obsolete (there 
should be no re-tries needed on proper file systems).
    
    ### Tests
    
    This is tricky to test. The test in `InitOutputPathTest.java` uses a series 
of latch to re-produce the problematic thread execution interleaving to 
validate the problem. The properly fixed variant cannot use that interleaving 
(because it fixes the problem, duh), but pushes the thread interleaving 
best-effort towards the case where the problem would occur, were the method not 
properly synchronized. Sounds weird, I know.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/StephanEwen/incubator-flink fs_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2999.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2999
    
----
commit 7a4d5b2f53e3b8c27a5224e405cf1b671f474a58
Author: Stephan Ewen <se...@apache.org>
Date:   2016-12-13T18:15:11Z

    [tests] Add 'CheckedThread' as a common test utility

commit e25b436f1de18be0dc2c3b02b82bf0b8203f0b44
Author: Stephan Ewen <se...@apache.org>
Date:   2016-12-13T18:12:12Z

    [FLINK-5332] [core] Synchronize FileSystem::initOutPathLocalFS() to prevent 
lost files when called concurrently.

----


> Non-thread safe FileSystem::initOutPathLocalFS() can cause lost 
> files/directories in local execution
> ----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-5332
>                 URL: https://issues.apache.org/jira/browse/FLINK-5332
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.2.0
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>            Priority: Critical
>             Fix For: 1.2.0
>
>
> This is mainly relevant to tests and Local Mini Cluster executions.
> The {{FileOutputFormat}} and its subclasses rely on 
> {{FileSystem::initOutPathLocalFS()}} to prepare the output directory. When 
> multiple parallel output writers call that method, there is a slim chance 
> that one parallel threads deletes the others directory. The checks that the 
> method has are not bullet proof.
> I believe that this is the cause for many Travis test instabilities that we 
> observed over time.
> Simply synchronizing that method per process should do the trick. Since it is 
> a rare initialization method, and only relevant in tests & local mini cluster 
> executions, it should be a price that is okay to pay. I see no other way, as 
> we do not have simple access to an atomic "check and delete and recreate" 
> file operation.
> The synchronization also makes many "re-try" code paths obsolete (there 
> should be no re-tries needed on proper file systems).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to