Patrick Lucas created FLINK-14973:
-------------------------------------
Summary: OSS filesystem does not relocate many dependencies
Key: FLINK-14973
URL: https://issues.apache.org/jira/browse/FLINK-14973
Project: Flink
Issue Type: Bug
Components: FileSystems
Affects Versions: 1.9.1
Reporter: Patrick Lucas
Whereas the Azure and S3 Hadoop filesystem jars relocate all of their
depdendencies:
{noformat}
$ jar tf opt/flink-azure-fs-hadoop-1.9.1.jar | grep -v '^org/apache/fs/shaded/'
| grep -v '^META-INF' | grep '/' | cut -f -2 -d / | sort | uniq
org/
org/apache
$ jar tf opt/flink-s3-fs-hadoop-1.9.1.jar | grep -v '^org/apache/fs/shaded/' |
grep -v '^META-INF' | grep '/' | cut -f -2 -d / | sort | uniq
org/
org/apache
{noformat}
The OSS Hadoop filesystem leaves many things un-relocated:
{noformat}
$ jar tf opt/flink-oss-fs-hadoop-1.9.1.jar | grep -v '^org/apache/fs/shaded/' |
grep -v '^META-INF' | grep '/' | cut -f -2 -d / | sort | uniq
assets/
assets/org
avro/
avro/shaded
com/
com/ctc
com/fasterxml
com/google
com/jcraft
com/nimbusds
com/sun
com/thoughtworks
javax/
javax/activation
javax/el
javax/servlet
javax/ws
javax/xml
jersey/
jersey/repackaged
licenses/
licenses/LICENSE.asm
licenses/LICENSE.cddlv1.0
licenses/LICENSE.cddlv1.1
licenses/LICENSE.jdom
licenses/LICENSE.jzlib
licenses/LICENSE.paranamer
licenses/LICENSE.protobuf
licenses/LICENSE.re2j
licenses/LICENSE.stax2api
net/
net/jcip
net/minidev
org/
org/apache
org/codehaus
org/eclipse
org/jdom
org/objectweb
org/tukaani
org/xerial
{noformat}
The first symptom of this I ran into was that Flink is unable to restore from a
savepoint if both the OSS and Azure Hadoop filesystems are on the classpath,
but I assume this has the potential to cause further problems, at least until
more progress is made on the module/classloading front.
h3. Steps to reproduce
# Copy both the Azure and OSS Hadoop filesystem JARs from opt/ into lib/
# Run a job that restores from a savepoint (the savepoint might need to be
stored on OSS)
# See a crash and traceback like:
{noformat}
2019-11-26 15:59:25,318 ERROR
org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error
occurred in the cluster entrypoint.
org.apache.flink.runtime.dispatcher.DispatcherException: Failed to take
leadership with session id 00000000-0000-0000-0000-000000000000.
at
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$null$30(Dispatcher.java:915)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
~[?:1.8.0_232]
at
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
~[?:1.8.0_232]
at
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_232]
at
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_232]
at
org.apache.flink.runtime.concurrent.FutureUtils$WaitingConjunctFuture.handleCompletedFuture(FutureUtils.java:691)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
~[?:1.8.0_232]
at
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
~[?:1.8.0_232]
at
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_232]
at
java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
~[?:1.8.0_232]
at
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:753)
~[?:1.8.0_232]
at
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
~[?:1.8.0_232]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at akka.actor.Actor.aroundReceive(Actor.scala:517)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
Caused by: java.lang.RuntimeException:
org.apache.flink.runtime.client.JobExecutionException: Could not set up
JobManager
at
org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
~[?:1.8.0_232]
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
... 4 more
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set
up JobManager
at
org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:152)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:83)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:375)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
~[?:1.8.0_232]
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
... 4 more
Caused by: java.lang.NoSuchMethodError:
org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
at
org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.open(AliyunOSSFileSystem.java:570)
~[flink-oss-fs-hadoop-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950)
~[flink-azure-fs-hadoop-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.open(HadoopFileSystem.java:120)
~[flink-oss-fs-hadoop-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.open(HadoopFileSystem.java:37)
~[flink-oss-fs-hadoop-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.checkpoint.Checkpoints.loadAndValidateCheckpoint(Checkpoints.java:141)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1132)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.scheduler.LegacyScheduler.tryRestoreExecutionGraphFromSavepoint(LegacyScheduler.java:237)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.scheduler.LegacyScheduler.createAndRestoreExecutionGraph(LegacyScheduler.java:196)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.scheduler.LegacyScheduler.<init>(LegacyScheduler.java:176)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.scheduler.LegacySchedulerFactory.createInstance(LegacySchedulerFactory.java:70)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:275)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:265)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:98)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.createJobMasterService(DefaultJobMasterServiceFactory.java:40)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:146)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.dispatcher.DefaultJobManagerRunnerFactory.createJobManagerRunner(DefaultJobManagerRunnerFactory.java:83)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:375)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
~[?:1.8.0_232]
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
~[flink-dist_2.12-1.9.1-stream2.jar:1.9.1-stream2]
... 4 more
{noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)