GitHub user StephanEwen opened a pull request:

    https://github.com/apache/flink/pull/2999

    [FLINK-5332] [core] Synchronize FileSystem::initOutPathLocalFS() to prevent 
lost files/directories when called concurrently

    This is mainly relevant to tests and Local Mini Cluster executions.
    
    The `FileOutputFormat` and its subclasses rely on 
`FileSystem::initOutPathLocalFS()` to prepare the output directory. When 
multiple parallel output writers call that method, there is a slim chance that 
one parallel threads deletes the others directory. The checks that the method 
has are not bullet proof.
    
    I believe that this is the cause for many Travis test instabilities that we 
observed over time.
    
    Simply synchronizing that method per process should do the trick. Since it 
is a rare initialization method, and only relevant in tests & local mini 
cluster executions, it should be a price that is okay to pay. I see no other 
way, as we do not have simple access to an atomic "check and delete and 
recreate" file operation.
    The synchronization also makes many "re-try" code paths obsolete (there 
should be no re-tries needed on proper file systems).
    
    ### Tests
    
    This is tricky to test. The test in `InitOutputPathTest.java` uses a series 
of latch to re-produce the problematic thread execution interleaving to 
validate the problem. The properly fixed variant cannot use that interleaving 
(because it fixes the problem, duh), but pushes the thread interleaving 
best-effort towards the case where the problem would occur, were the method not 
properly synchronized. Sounds weird, I know.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/StephanEwen/incubator-flink fs_fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2999.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2999
    
----
commit 7a4d5b2f53e3b8c27a5224e405cf1b671f474a58
Author: Stephan Ewen <se...@apache.org>
Date:   2016-12-13T18:15:11Z

    [tests] Add 'CheckedThread' as a common test utility

commit e25b436f1de18be0dc2c3b02b82bf0b8203f0b44
Author: Stephan Ewen <se...@apache.org>
Date:   2016-12-13T18:12:12Z

    [FLINK-5332] [core] Synchronize FileSystem::initOutPathLocalFS() to prevent 
lost files when called concurrently.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

Reply via email to