Zhile opened a new pull request, #4456:
URL: https://github.com/apache/flink-cdc/pull/4456
## What is the purpose of the change
Schema-evolution coordination responses are dropped when the failure's cause
is a class that only lives in the user classloader (e.g.
`com.mysql.cj.exceptions.ConnectionIsClosedException`). `flink-rpc-akka`
deserializes the coordination response with an isolated classloader and throws
`ClassNotFoundException`, so the response never reaches `SchemaOperator`, which
then blocks on `responseFuture.get(rpcTimeout)` and fails with a misleading
`TimeoutException` -- turning a transient error into a restart loop that only a
full job restart recovers from.
Observed in production (Flink CDC 3.6.0 on Flink 1.20.3):
```
ERROR org.apache.pekko.remote.Remoting -
com.mysql.cj.exceptions.ConnectionIsClosedException
java.lang.ClassNotFoundException:
com.mysql.cj.exceptions.ConnectionIsClosedException
at org.apache.pekko.util.ClassLoaderObjectInputStream.resolveClass(...)
at ...MiscMessageSerializer.deserializeStatusFailure(...)
```
JIRA: https://issues.apache.org/jira/browse/FLINK-40040
## Brief change log
- `SchemaRegistry.failJob` now wraps the cause into `SerializedThrowable`
(idempotent, not re-wrapped if already one) before calling
`handleUnrecoverableError` (which completes pending coordination responses
exceptionally) and `context.failJob`. `SerializedThrowable` carries the
original exception as bytes plus a stringified stack trace, so the receiving
side can deserialize it without the original class and still see the real cause.
- `SchemaRegistry.runInEventLoop`'s catch block now routes through `failJob`
instead of duplicating the fail path, so every exit shares the wrapping.
- Applies to both the regular and distributed `SchemaCoordinator`, since
both complete pending response futures via the shared `failJob` ->
`handleUnrecoverableError` path.
## Verifying this change
Added `SchemaRegistryFailJobSerializationTest`:
- `failJobWrapsExceptionIntoSerializedThrowable` asserts the failure
delivered to the coordinator context is a `SerializedThrowable`.
- `wrappedFailureSurvivesIsolatedClassloaderButRawOneDoesNot` shows a raw
exception whose cause is a classloader-isolated type fails to deserialize
(`ClassNotFoundException`), while the `SerializedThrowable`-wrapped one
deserializes and preserves the real cause text.
## Does this pull request potentially affect one of the following parts
- Dependencies (does it add or upgrade a dependency): no
- Public API / SDK: no
- Serializers / state: no (wraps only the failure carried over the
coordinator RPC channel; no state format change)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]