nateab commented on PR #27579:
URL: https://github.com/apache/flink/pull/27579#issuecomment-4849496026
@peach12345 Thanks for confirming you're hitting this too, that helps make
the case for the fix.
A few options until it's merged:
1) Recovering a job that's stuck in the loop right now: the stale state
lives in the TaskManagers' cached classloaders (keyed per job), not the
JobManager. Recycling the TaskManagers (e.g. deleting the TM pods so they
come back fresh) clears the cached classloader that still holds the old
blob keys, so the next deployment resolves cleanly. A full stop + resubmit
works too, but restarting just the TMs is usually enough to break the loop.
2) Avoiding the trigger: the mismatch only happens when the job's JARs get
re-uploaded and produce new PermanentBlobKeys. The key includes a random
component, so identical content still yields a different key. That
re-upload typically happens on a JobManager failover, so:
- keep the JobManager off spot/preemptible nodes (fewer JM restarts), and
- make sure JobManager HA is enabled, so a failover recovers the persisted
  JobGraph (with its original blob keys) rather than resubmitting the job
  with freshly uploaded JARs.
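
To illustrate why a re-upload of identical JARs still changes the key, here
is a minimal standalone sketch. It is not Flink's PermanentBlobKey
implementation: the class BlobKeySketch, its Key type, the upload() helper,
the SHA-1 digest, and the 16-byte random part are all assumptions chosen
for illustration. It only models the idea that a key combines a content
hash with a random component.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Arrays;

public class BlobKeySketch {

    /** Hypothetical stand-in for a blob key: content hash plus a random part. */
    static final class Key {
        final byte[] contentHash;
        final byte[] random;

        Key(byte[] contentHash, byte[] random) {
            this.contentHash = contentHash;
            this.random = random;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key other = (Key) o;
            // Two keys match only if BOTH the hash and the random part match.
            return Arrays.equals(contentHash, other.contentHash)
                    && Arrays.equals(random, other.random);
        }

        @Override
        public int hashCode() {
            return 31 * Arrays.hashCode(contentHash) + Arrays.hashCode(random);
        }
    }

    private static final SecureRandom RNG = new SecureRandom();

    /** Simulates one upload: hash the content, then attach fresh random bytes. */
    static Key upload(byte[] content) {
        try {
            // Digest algorithm and random length are assumptions for this sketch.
            byte[] hash = MessageDigest.getInstance("SHA-1").digest(content);
            byte[] random = new byte[16];
            RNG.nextBytes(random);
            return new Key(hash, random);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] jar = "same jar bytes".getBytes(StandardCharsets.UTF_8);
        Key first = upload(jar);   // original submission
        Key second = upload(jar);  // re-upload of identical content

        System.out.println("same content hash: "
                + Arrays.equals(first.contentHash, second.contentHash));
        System.out.println("same key: " + first.equals(second));
    }
}
```

In this model the two uploads share a content hash but compare as different
keys, which is why anything still holding the first key (such as a cached
classloader) no longer matches after a re-upload.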