Zhile opened a new pull request, #4456:
URL: https://github.com/apache/flink-cdc/pull/4456

   ## What is the purpose of the change
   
   Schema-evolution coordination responses are dropped when the failure's cause 
is a class that only lives in the user classloader (e.g. 
`com.mysql.cj.exceptions.ConnectionIsClosedException`). `flink-rpc-akka` 
deserializes the coordination response with an isolated classloader and throws 
`ClassNotFoundException`, so the response never reaches `SchemaOperator`, which 
then blocks on `responseFuture.get(rpcTimeout)` and fails with a misleading 
`TimeoutException` -- turning a transient error into a restart loop that only a 
full job restart recovers from.
   
   Observed in production (Flink CDC 3.6.0 on Flink 1.20.3):
   
   ```
   ERROR org.apache.pekko.remote.Remoting - 
com.mysql.cj.exceptions.ConnectionIsClosedException
   java.lang.ClassNotFoundException: 
com.mysql.cj.exceptions.ConnectionIsClosedException
       at org.apache.pekko.util.ClassLoaderObjectInputStream.resolveClass(...)
       at ...MiscMessageSerializer.deserializeStatusFailure(...)
   ```
   
   JIRA: https://issues.apache.org/jira/browse/FLINK-40040
   
   ## Brief change log
   
   - `SchemaRegistry.failJob` now wraps the cause into `SerializedThrowable` 
(idempotent, not re-wrapped if already one) before calling 
`handleUnrecoverableError` (which completes pending coordination responses 
exceptionally) and `context.failJob`. `SerializedThrowable` carries the 
original exception as bytes plus a stringified stack trace, so the receiving 
side can deserialize it without the original class and still see the real cause.
   - `SchemaRegistry.runInEventLoop`'s catch block now routes through `failJob` 
instead of duplicating the fail path, so every exit shares the wrapping.
   - Applies to both the regular and distributed `SchemaCoordinator`, since 
both complete pending response futures via the shared `failJob` -> 
`handleUnrecoverableError` path.
   
   ## Verifying this change
   
   Added `SchemaRegistryFailJobSerializationTest`:
   
   - `failJobWrapsExceptionIntoSerializedThrowable` asserts the failure 
delivered to the coordinator context is a `SerializedThrowable`.
   - `wrappedFailureSurvivesIsolatedClassloaderButRawOneDoesNot` shows a raw 
exception whose cause is a classloader-isolated type fails to deserialize 
(`ClassNotFoundException`), while the `SerializedThrowable`-wrapped one 
deserializes and preserves the real cause text.
   
   ## Does this pull request potentially affect one of the following parts
   
   - Dependencies (does it add or upgrade a dependency): no
   - Public API / SDK: no
   - Serializers / state: no (wraps only the failure carried over the 
coordinator RPC channel; no state format change)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to