[
https://issues.apache.org/jira/browse/FLINK-32212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17841985#comment-17841985
]
Robert Metzger commented on FLINK-32212:
----------------------------------------
Thanks [~rickysaltzer]. Have you figured out what exactly is triggering this?
If I remember correctly, I was able to trigger it by regenerating the JobGraph
on the JobManager (while keeping the jobid the same) and resubmitting the job
(e.g. job would restart) while keeping TaskManagers running.
So some integrity check on the TaskManager probably noticed that the jars
registered for the jobid had changed, which caused this exception.
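To make that guess concrete, here is a minimal Java sketch of such a check. It is not Flink's actual code: the JobLibraryRegistry class, its register method, and the job and BLOB key strings are all hypothetical. Only the wording of the exception message is taken from the stack trace in the issue. The assumption is that the TaskManager remembers the set of library BLOB keys first registered for a jobid and rejects any later registration for the same jobid with a different set.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class LibraryCheckSketch {

    /** Hypothetical stand-in for a per-TaskManager library cache; not Flink's real class. */
    static final class JobLibraryRegistry {
        // Job id -> BLOB keys seen on the first registration for that job.
        private final Map<String, Set<String>> blobKeysByJobId = new HashMap<>();

        /** Records the first registration and rejects later ones that differ. */
        void register(String jobId, Set<String> blobKeys) {
            Set<String> previous = blobKeysByJobId.putIfAbsent(jobId, Set.copyOf(blobKeys));
            if (previous != null && !previous.equals(blobKeys)) {
                throw new IllegalStateException(
                        "The library registration references a different set of library BLOBs"
                                + " than previous registrations for this job: old:" + previous
                                + " new:" + blobKeys);
            }
        }
    }

    public static void main(String[] args) {
        JobLibraryRegistry registry = new JobLibraryRegistry();

        // First deployment of the job: the key set is simply recorded.
        registry.register("job-1", Set.of("blob-v1"));

        // Restart with an unchanged JobGraph: same keys, so the check passes.
        registry.register("job-1", Set.of("blob-v1"));

        // Regenerated JobGraph under the same jobid: the jar gets a new key.
        try {
            registry.register("job-1", Set.of("blob-v2"));
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

In this sketch the stale entry survives for as long as the registry object lives, which would match the observation that the failure only shows up while the TaskManagers keep running across the resubmission.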
For debugging this, I'd like to find an easy way of reproducing it. My approach
seems somewhat "illegal", because messing with the JobGraph is not what Flink
has been designed for (in my understanding).
So I'm asking: does anybody who has observed this issue have a reliable way to
reproduce it, ideally while using Flink in an expected / documented manner?
> Job restarting indefinitely after an IllegalStateException from
> BlobLibraryCacheManager
> ---------------------------------------------------------------------------------------
>
> Key: FLINK-32212
> URL: https://issues.apache.org/jira/browse/FLINK-32212
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.16.1
> Environment: Apache Flink Kubernetes Operator 1.4
> Reporter: Matheus Felisberto
> Priority: Major
>
> After running for a few hours the job starts to throw IllegalStateException
> and I can't figure out why. To restore the job, I need to manually delete the
> FlinkDeployment so that it is recreated, and then redeploy everything.
> The jar is built into the Docker image, so it is defined according to
> the Operator's documentation:
> {code:java}
> // jarURI: local:///opt/flink/usrlib/my-job.jar {code}
> I've tried to move it into /opt/flink/lib/my-job.jar but it didn't work
> either.
>
> {code:java}
> // Source: my-topic (1/2)#30587 (b82d2c7f9696449a2d9f4dc298c0a008_bc764cd8ddf7a0cff126f51c16239658_0_30587) switched from DEPLOYING to FAILED with failure cause:
> java.lang.IllegalStateException: The library registration references a different set of library BLOBs than previous registrations for this job:
> old:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-7237ecbb12b0b021934b0c81aef78396]
> new:[p-5d91888083d38a3ff0b6c350f05a3013632137c6-943737c6790a3ec6870cecd652b956c2]
>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.verifyClassLoader(BlobLibraryCacheManager.java:419)
>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$ResolvedClassLoader.access$500(BlobLibraryCacheManager.java:359)
>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.getOrResolveClassLoader(BlobLibraryCacheManager.java:235)
>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$LibraryCacheEntry.access$1100(BlobLibraryCacheManager.java:202)
>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager$DefaultClassLoaderLease.getOrResolveClassLoader(BlobLibraryCacheManager.java:336)
>     at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:1024)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:612)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:550)
>     at java.base/java.lang.Thread.run(Unknown Source) {code}
> If there is any other information that can help to identify the problem,
> please let me know.
>