[ 
https://issues.apache.org/jira/browse/BEAM-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354263#comment-16354263
 ] 

Kenneth Knowles commented on BEAM-3272:
---------------------------------------

This flake is worse under Gradle, perhaps due to parallelism and/or Gradle's
tighter management of directories that it considers its own.

> ParDoTranslatorTest: Error creating local cluster while creating checkpoint file
> --------------------------------------------------------------------------------
>
>                 Key: BEAM-3272
>                 URL: https://issues.apache.org/jira/browse/BEAM-3272
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-apex
>            Reporter: Eugene Kirpichov
>            Assignee: Kenneth Knowles
>            Priority: Critical
>              Labels: flake
>
> Failed build: 
> https://builds.apache.org/job/beam_PostCommit_Java_MavenInstall/org.apache.beam$beam-runners-apex/5330/console
> Key output:
> {code}
> 2017-11-29T01:21:26.956 [ERROR] testAssertionFailure(org.apache.beam.runners.apex.translation.ParDoTranslatorTest)  Time elapsed: 2.007 s  <<< ERROR!
> java.lang.RuntimeException: Error creating local cluster
>       at org.apache.apex.engine.EmbeddedAppLauncherImpl.getController(EmbeddedAppLauncherImpl.java:122)
>       at org.apache.apex.engine.EmbeddedAppLauncherImpl.launchApp(EmbeddedAppLauncherImpl.java:71)
>       at org.apache.apex.engine.EmbeddedAppLauncherImpl.launchApp(EmbeddedAppLauncherImpl.java:46)
>       at org.apache.beam.runners.apex.ApexRunner.run(ApexRunner.java:197)
>       at org.apache.beam.runners.apex.TestApexRunner.run(TestApexRunner.java:57)
>       at org.apache.beam.runners.apex.TestApexRunner.run(TestApexRunner.java:31)
>       at org.apache.beam.sdk.Pipeline.run(Pipeline.java:304)
>       at org.apache.beam.sdk.Pipeline.run(Pipeline.java:290)
>       at org.apache.beam.runners.apex.translation.ParDoTranslatorTest.runExpectingAssertionFailure(ParDoTranslatorTest.java:156)
> {code}
> ...
> {code}
> Caused by: ExitCodeException exitCode=1: chmod: cannot access ‘/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Java_MavenInstall/src/runners/apex/target/com.datatorrent.stram.StramLocalCluster/checkpoints/2/_tmp’: No such file or directory
>       at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
>       at org.apache.hadoop.util.Shell.run(Shell.java:479)
>       at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
>       at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
>       at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
>       at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
>       at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225)
>       at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)
>       at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:307)
>       at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:296)
>       at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328)
>       at org.apache.hadoop.fs.FileSystem.primitiveCreate(FileSystem.java:1017)
>       at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:99)
>       at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.<init>(ChecksumFs.java:352)
>       at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:399)
>       at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:584)
>       at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:686)
>       at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:682)
>       at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
>       at org.apache.hadoop.fs.FileContext.create(FileContext.java:688)
>       at com.datatorrent.common.util.AsyncFSStorageAgent.copyToHDFS(AsyncFSStorageAgent.java:119)
>       ... 50 more
> {code}
> From inspecting the code at these stack frames, it appears to be copying an
> operator's checkpoint "to HDFS" (which in this case is the local disk), but
> it fails while creating the target file of the copy: creation creates the
> file (successfully) and then chmods it writable (unsuccessfully). Barring
> something subtle (e.g. chmod not being allowed immediately after creating a
> FileOutputStream), this looks like the whole directory was deleted out from
> under the process. I don't know why that would happen, or how to debug it.
> Either way, the path being accessed is suspect, since it sits inside the
> build's target directory in the shared Jenkins workspace:
> /home/jenkins/jenkins-slave/workspace/beam_PostCommit_Java_MavenInstall/src/runners/apex/target/...
> I think it would be better if this test used a "@Rule TemporaryFolder" to
> store Apex checkpoints. I don't know whether the Apex runner allows that, but
> I can see how it could help reduce interference between tests and potentially
> resolve this issue.
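
For illustration only, the isolation idea suggested in the description could be sketched roughly as follows, using only the JDK rather than JUnit's TemporaryFolder rule. IsolatedCheckpointDir is a hypothetical helper name, not part of Beam or Apex, and whether the Apex runner can be pointed at such a directory is not established in this issue.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Hypothetical helper: a fresh directory per test, removed when the test ends. */
final class IsolatedCheckpointDir implements AutoCloseable {
  private final Path root;

  IsolatedCheckpointDir(String prefix) throws IOException {
    // createTempDirectory picks a unique name under java.io.tmpdir, so two
    // tests running in parallel never share (or delete) each other's path.
    this.root = Files.createTempDirectory(prefix);
  }

  Path path() {
    return root;
  }

  @Override
  public void close() throws IOException {
    // Walk in reverse order so children are deleted before their parents.
    try (Stream<Path> walk = Files.walk(root)) {
      walk.sorted(Comparator.reverseOrder())
          .forEach(p -> {
            try {
              Files.deleteIfExists(p);
            } catch (IOException e) {
              throw new UncheckedIOException(e);
            }
          });
    }
  }
}
```

A test would create one instance in its setup, hand path() to whatever stores the checkpoints (assuming that is configurable), and close it in teardown. JUnit's TemporaryFolder rule provides the same lifecycle automatically.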



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
