[ https://issues.apache.org/jira/browse/FLINK-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746125#comment-15746125 ]
ASF GitHub Bot commented on FLINK-5332: --------------------------------------- GitHub user StephanEwen opened a pull request: https://github.com/apache/flink/pull/2999 [FLINK-5332] [core] Synchronize FileSystem::initOutPathLocalFS() to prevent lost files/directories when called concurrently This is mainly relevant to tests and Local Mini Cluster executions. The `FileOutputFormat` and its subclasses rely on `FileSystem::initOutPathLocalFS()` to prepare the output directory. When multiple parallel output writers call that method, there is a slim chance that one parallel threads deletes the others directory. The checks that the method has are not bullet proof. I believe that this is the cause for many Travis test instabilities that we observed over time. Simply synchronizing that method per process should do the trick. Since it is a rare initialization method, and only relevant in tests & local mini cluster executions, it should be a price that is okay to pay. I see no other way, as we do not have simple access to an atomic "check and delete and recreate" file operation. The synchronization also makes many "re-try" code paths obsolete (there should be no re-tries needed on proper file systems). ### Tests This is tricky to test. The test in `InitOutputPathTest.java` uses a series of latch to re-produce the problematic thread execution interleaving to validate the problem. The properly fixed variant cannot use that interleaving (because it fixes the problem, duh), but pushes the thread interleaving best-effort towards the case where the problem would occur, were the method not properly synchronized. Sounds weird, I know. You can merge this pull request into a Git repository by running: $ git pull https://github.com/StephanEwen/incubator-flink fs_fix Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/2999.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2999 ---- commit 7a4d5b2f53e3b8c27a5224e405cf1b671f474a58 Author: Stephan Ewen <se...@apache.org> Date: 2016-12-13T18:15:11Z [tests] Add 'CheckedThread' as a common test utility commit e25b436f1de18be0dc2c3b02b82bf0b8203f0b44 Author: Stephan Ewen <se...@apache.org> Date: 2016-12-13T18:12:12Z [FLINK-5332] [core] Synchronize FileSystem::initOutPathLocalFS() to prevent lost files when called concurrently. ---- > Non-thread safe FileSystem::initOutPathLocalFS() can cause lost > files/directories in local execution > ---------------------------------------------------------------------------------------------------- > > Key: FLINK-5332 > URL: https://issues.apache.org/jira/browse/FLINK-5332 > Project: Flink > Issue Type: Bug > Components: Core > Affects Versions: 1.2.0 > Reporter: Stephan Ewen > Assignee: Stephan Ewen > Priority: Critical > Fix For: 1.2.0 > > > This is mainly relevant to tests and Local Mini Cluster executions. > The {{FileOutputFormat}} and its subclasses rely on > {{FileSystem::initOutPathLocalFS()}} to prepare the output directory. When > multiple parallel output writers call that method, there is a slim chance > that one parallel threads deletes the others directory. The checks that the > method has are not bullet proof. > I believe that this is the cause for many Travis test instabilities that we > observed over time. > Simply synchronizing that method per process should do the trick. Since it is > a rare initialization method, and only relevant in tests & local mini cluster > executions, it should be a price that is okay to pay. I see no other way, as > we do not have simple access to an atomic "check and delete and recreate" > file operation. > The synchronization also makes many "re-try" code paths obsolete (there > should be no re-tries needed on proper file systems). -- This message was sent by Atlassian JIRA (v6.3.4#6332)