[ https://issues.apache.org/jira/browse/FLINK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066584#comment-17066584 ]
Yun Tang commented on FLINK-16770: ---------------------------------- I think this issue is the same root cause as FLINK-16561 , and cannot come to an idea why this could happen if we retain checkpoint and at least one checkpoint completed. If there any place to find the cluster logs in detail? It seems it has uploaded logs failed. > Resuming Externalized Checkpoint (rocks, incremental, scale up) end-to-end > test fails with no such file > ------------------------------------------------------------------------------------------------------- > > Key: FLINK-16770 > URL: https://issues.apache.org/jira/browse/FLINK-16770 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing, Tests > Reporter: Zhijiang > Priority: Major > Labels: test-stability > Fix For: 1.11.0 > > > The log : > [https://dev.azure.com/rmetzger/Flink/_build/results?buildId=6603&view=logs&j=c88eea3b-64a0-564d-0031-9fdcd7b8abee&t=1e2bbe5b-4657-50be-1f07-d84bfce5b1f5] > > There was also the similar problem in > https://issues.apache.org/jira/browse/FLINK-16561, but for the case of no > parallelism change. And this case is for scaling up. Not quite sure whether > the root cause is the same one. > {code:java} > 2020-03-25T06:50:31.3894841Z Running 'Resuming Externalized Checkpoint > (rocks, incremental, scale up) end-to-end test' > 2020-03-25T06:50:31.3895308Z > ============================================================================== > 2020-03-25T06:50:31.3907274Z TEST_DATA_DIR: > /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-31390197304 > 2020-03-25T06:50:31.5500274Z Flink dist directory: > /home/vsts/work/1/s/flink-dist/target/flink-1.11-SNAPSHOT-bin/flink-1.11-SNAPSHOT > 2020-03-25T06:50:31.6354639Z Starting cluster. > 2020-03-25T06:50:31.8871932Z Starting standalonesession daemon on host > fv-az655. > 2020-03-25T06:50:33.5021784Z Starting taskexecutor daemon on host fv-az655. > 2020-03-25T06:50:33.5152274Z Waiting for Dispatcher REST endpoint to come > up... > 2020-03-25T06:50:34.5498116Z Waiting for Dispatcher REST endpoint to come > up... > 2020-03-25T06:50:35.6031346Z Waiting for Dispatcher REST endpoint to come > up... > 2020-03-25T06:50:36.9848425Z Waiting for Dispatcher REST endpoint to come > up... > 2020-03-25T06:50:38.0283377Z Dispatcher REST endpoint is up. > 2020-03-25T06:50:38.0285490Z Running externalized checkpoints test, with > ORIGINAL_DOP=2 NEW_DOP=4 and STATE_BACKEND_TYPE=rocks > STATE_BACKEND_FILE_ASYNC=true STATE_BACKEND_ROCKSDB_INCREMENTAL=true > SIMULATE_FAILURE=false ... > 2020-03-25T06:50:46.1754645Z Job (b8cb04e4b1e730585bc616aa352866d0) is > running. > 2020-03-25T06:50:46.1758132Z Waiting for job > (b8cb04e4b1e730585bc616aa352866d0) to have at least 1 completed checkpoints > ... > 2020-03-25T06:50:46.3478276Z Waiting for job to process up to 200 records, > current progress: 173 records ... > 2020-03-25T06:50:49.6332988Z Cancelling job b8cb04e4b1e730585bc616aa352866d0. > 2020-03-25T06:50:50.4875673Z Cancelled job b8cb04e4b1e730585bc616aa352866d0. > 2020-03-25T06:50:50.5468230Z ls: cannot access > '/home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/temp-test-directory-31390197304/externalized-chckpt-e2e-backend-dir/b8cb04e4b1e730585bc616aa352866d0/chk-[1-9]*/_metadata': > No such file or directory > 2020-03-25T06:50:50.5606260Z Restoring job with externalized checkpoint at . > ... > 2020-03-25T06:50:58.4728245Z > 2020-03-25T06:50:58.4732663Z > ------------------------------------------------------------ > 2020-03-25T06:50:58.4735785Z The program finished with the following > exception: > 2020-03-25T06:50:58.4737759Z > 2020-03-25T06:50:58.4742666Z > org.apache.flink.client.program.ProgramInvocationException: The main method > caused an error: java.util.concurrent.ExecutionException: > org.apache.flink.runtime.client.JobSubmissionException: Failed to submit > JobGraph. > 2020-03-25T06:50:58.4746274Z at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335) > 2020-03-25T06:50:58.4749954Z at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205) > 2020-03-25T06:50:58.4752753Z at > org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:142) > 2020-03-25T06:50:58.4755400Z at > org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:659) > 2020-03-25T06:50:58.4757862Z at > org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:210) > 2020-03-25T06:50:58.4760282Z at > org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:890) > 2020-03-25T06:50:58.4763591Z at > org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:963) > 2020-03-25T06:50:58.4764274Z at > java.security.AccessController.doPrivileged(Native Method) > 2020-03-25T06:50:58.4764809Z at > javax.security.auth.Subject.doAs(Subject.java:422) > 2020-03-25T06:50:58.4765434Z at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836) > 2020-03-25T06:50:58.4766180Z at > org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) > 2020-03-25T06:50:58.4773549Z at > org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:963) > 2020-03-25T06:50:58.4774502Z Caused by: java.lang.RuntimeException: > java.util.concurrent.ExecutionException: > org.apache.flink.runtime.client.JobSubmissionException: Failed to submit > JobGraph. > 2020-03-25T06:50:58.4775382Z at > org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:276) > 2020-03-25T06:50:58.4776163Z at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1741) > 2020-03-25T06:50:58.4777706Z at > org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:90) > 2020-03-25T06:50:58.4778334Z at > org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:58) > 2020-03-25T06:50:58.4779007Z at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1620) > 2020-03-25T06:50:58.4779654Z at > org.apache.flink.streaming.tests.DataStreamAllroundTestProgram.main(DataStreamAllroundTestProgram.java:215) > 2020-03-25T06:50:58.4780371Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2020-03-25T06:50:58.4784367Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2020-03-25T06:50:58.4785063Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2020-03-25T06:50:58.4785557Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2020-03-25T06:50:58.4786204Z at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321) > 2020-03-25T06:50:58.4786547Z ... 11 more > 2020-03-25T06:50:58.4787007Z Caused by: > java.util.concurrent.ExecutionException: > org.apache.flink.runtime.client.JobSubmissionException: Failed to submit > JobGraph. > 2020-03-25T06:50:58.4787717Z at > java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) > 2020-03-25T06:50:58.4788203Z at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) > 2020-03-25T06:50:58.4788835Z at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1736) > 2020-03-25T06:50:58.4789362Z ... 20 more > 2020-03-25T06:50:58.4789720Z Caused by: > org.apache.flink.runtime.client.JobSubmissionException: Failed to submit > JobGraph. > 2020-03-25T06:50:58.4790467Z at > org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$7(RestClusterClient.java:359) > 2020-03-25T06:50:58.4791087Z at > java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) > 2020-03-25T06:50:58.4791650Z at > java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) > 2020-03-25T06:50:58.4792560Z at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > 2020-03-25T06:50:58.4793617Z at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > 2020-03-25T06:50:58.4794496Z at > org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:274) > 2020-03-25T06:50:58.4795255Z at > java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774) > 2020-03-25T06:50:58.4796264Z at > java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750) > 2020-03-25T06:50:58.4796867Z at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > 2020-03-25T06:50:58.4797439Z at > java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575) > 2020-03-25T06:50:58.4798000Z at > java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:943) > 2020-03-25T06:50:58.4798589Z at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) > 2020-03-25T06:50:58.4799162Z at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > 2020-03-25T06:50:58.4799727Z at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > 2020-03-25T06:50:58.4800210Z at java.lang.Thread.run(Thread.java:748) > 2020-03-25T06:50:58.4800767Z Caused by: > org.apache.flink.runtime.rest.util.RestClientException: [Internal server > error., <Exception on server side: > 2020-03-25T06:50:58.4801351Z > org.apache.flink.runtime.client.JobSubmissionException: Failed to submit job. > 2020-03-25T06:50:58.4801938Z at > org.apache.flink.runtime.dispatcher.Dispatcher.lambda$internalSubmitJob$3(Dispatcher.java:336) > 2020-03-25T06:50:58.4803660Z at > java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836) > 2020-03-25T06:50:58.4804555Z at > java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811) > 2020-03-25T06:50:58.4805235Z at > java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456) > 2020-03-25T06:50:58.4805839Z at > akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) > 2020-03-25T06:50:58.4806515Z at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) > 2020-03-25T06:50:58.4807184Z at > akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > 2020-03-25T06:50:58.4807807Z at > akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > 2020-03-25T06:50:58.4808417Z at > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > 2020-03-25T06:50:58.4809055Z at > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > 2020-03-25T06:50:58.4809783Z Caused by: java.lang.RuntimeException: > org.apache.flink.runtime.client.JobExecutionException: Could not set up > JobManager > 2020-03-25T06:50:58.4810756Z at > org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36) > 2020-03-25T06:50:58.4811444Z at > java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) > 2020-03-25T06:50:58.4811937Z ... 6 more > 2020-03-25T06:50:58.4812414Z Caused by: > org.apache.flink.runtime.client.JobExecutionException: Could not set up > JobManager > 2020-03-25T06:50:58.4813330Z at > org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:152) > 2020-03-25T06:50:58.4814154Z at > org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:84) > 2020-03-25T06:50:58.4814846Z at > org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$6(Dispatcher.java:379) > 2020-03-25T06:50:58.4815622Z at > org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34) > 2020-03-25T06:50:58.4816074Z ... 7 more > 2020-03-25T06:50:58.4816924Z Caused by: java.io.IOException: Cannot access > file system for checkpoint/savepoint path 'file://.'. > 2020-03-25T06:50:58.4817673Z at > org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpointPointer(AbstractFsCheckpointStorage.java:233) > 2020-03-25T06:50:58.4818450Z at > org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpoint(AbstractFsCheckpointStorage.java:110) > 2020-03-25T06:50:58.4819276Z at > org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1312) > 2020-03-25T06:50:58.4819943Z at > org.apache.flink.runtime.scheduler.SchedulerBase.tryRestoreExecutionGraphFromSavepoint(SchedulerBase.java:314) > 2020-03-25T06:50:58.4820633Z at > org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:247) > 2020-03-25T06:50:58.4821258Z at > org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:223) > 2020-03-25T06:50:58.4821862Z at > org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:118) > 2020-03-25T06:50:58.4822505Z at > org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:103) > 2020-03-25T06:50:58.4823115Z at > org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:282) > 2020-03-25T06:50:58.4823665Z at > org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:270) > 2020-03-25T06:50:58.4824485Z at > org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98) > 2020-03-25T06:50:58.4825597Z at > org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40) > 2020-03-25T06:50:58.4826400Z at > org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.<init>(JobManagerRunnerImpl.java:146) > 2020-03-25T06:50:58.4826919Z ... 10 more > 2020-03-25T06:50:58.4829018Z Caused by: java.io.IOException: Found local file > path with authority '.' in path 'file://.'. Hint: Did you forget a slash? > (correct path would be 'file:///.') > 2020-03-25T06:50:58.4829875Z at > org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:441) > 2020-03-25T06:50:58.4830364Z at > org.apache.flink.core.fs.FileSystem.get(FileSystem.java:389) > 2020-03-25T06:50:58.4830807Z at > org.apache.flink.core.fs.Path.getFileSystem(Path.java:292) > 2020-03-25T06:50:58.4831408Z at > org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpointPointer(AbstractFsCheckpointStorage.java:230) > 2020-03-25T06:50:58.4832021Z ... 22 more > 2020-03-25T06:50:58.4832151Z > 2020-03-25T06:50:58.4832356Z End of exception on server side>] > 2020-03-25T06:50:58.4832720Z at > org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:390) > 2020-03-25T06:50:58.4833238Z at > org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:374) > 2020-03-25T06:50:58.4833884Z at > java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:966) > 2020-03-25T06:50:58.4834376Z at > java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940) > 2020-03-25T06:50:58.4834724Z ... 4 more > 2020-03-25T06:50:58.5042321Z Resuming from externalized checkpoint job could > not be started. > 2020-03-25T06:50:58.5044210Z [FAIL] Test script contains errors. > 2020-03-25T06:50:58.5052826Z Checking of logs skipped. > 2020-03-25T06:50:58.5053164Z > 2020-03-25T06:50:58.5054116Z [FAIL] 'Resuming Externalized Checkpoint (rocks, > incremental, scale up) end-to-end test' failed after 0 minutes and 27 > seconds! Test exited with exit code 1 > 2020-03-25T06:50:58.5054639Z > 2020-03-25T06:50:58.8067813Z Stopping taskexecutor daemon (pid: 86888) on > host fv-az655. > 2020-03-25T06:50:59.0257270Z Stopping standalonesession daemon (pid: 86603) > on host fv-az655. > 2020-03-25T06:50:59.4920994Z > 2020-03-25T06:50:59.5000014Z ##[error]Bash exited with code '1'. > 2020-03-25T06:50:59.5015374Z ##[section]Finishing: Run e2e tests > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)