zjjiang created FLINK-36701:
-------------------------------
Summary: Pipeline failover again after handling a schema change
event as the first event after a failover
Key: FLINK-36701
URL: https://issues.apache.org/jira/browse/FLINK-36701
Project: Flink
Issue Type: Bug
Components: Flink CDC
Affects Versions: cdc-3.2.0, cdc-3.3.0
Reporter: zjjiang
Attachments: image-2024-11-13-16-29-30-065.png
Currently, if the first event the pipeline handles after a failover is a schema
change event (e.g. AddColumnEvent) and it is followed by a DataChangeEvent, the
job may fail again because the sink applies that schema change twice.
The cause of the problem can be explained as follows:
1. When SinkWriterOperator receives a schema change event other than a
CreateTableEvent and has no schema in its local cache, it requests the latest
schema from the schema manager.
2. The schema manager applies the schema change after confirming flush success.
3. Assume that the first event processed after a failover is a schema change
event, followed by a data change event.
On the schema manager side, the schema change event is applied to the cached
schema (i.e. evolvedSchema) once a successful flush has been confirmed.
On the SinkWriterOperator side, the processing flow is:
1) Handle the FlushEvent;
2) Handle the schema change event (in this step, the latest schema is fetched
from the schema manager and sent downstream, and then the schema change event
itself is emitted, so the sink applies a change that the fetched schema already
contains) -- note that this step does not report an error;
3) Handle the data change event -- here the failover occurs because the number
of columns in the data record does not match the schema.
!image-2024-11-13-16-29-30-065.png!
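The sequence above can be reproduced with a minimal, self-contained sketch. It
does not use any Flink CDC classes: DuplicateSchemaChangeSketch,
sinkSchemaAfterFailover and matches are hypothetical names, a schema is modelled
as a plain list of column names, and the sink is assumed to append the added
column a second time without raising an error.

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicateSchemaChangeSketch {

    /**
     * Simulates the flow after a failover and returns the column list the sink
     * ends up with. All names here are illustrative, not Flink CDC APIs.
     */
    static List<String> sinkSchemaAfterFailover(List<String> schemaBeforeChange, String addedColumn) {
        // Schema manager side: the flush was confirmed, so the change is
        // already applied to the cached (evolved) schema.
        List<String> evolvedSchema = new ArrayList<>(schemaBeforeChange);
        evolvedSchema.add(addedColumn);

        // Sink writer side: the local cache is empty, so the latest schema is
        // fetched from the schema manager and sent downstream first.
        List<String> sinkSchema = new ArrayList<>(evolvedSchema);

        // The schema change event itself is then emitted as well, so the same
        // change is applied a second time (assumption: silently, no error).
        sinkSchema.add(addedColumn);
        return sinkSchema;
    }

    /** A data record is only accepted if it has one value per schema column. */
    static boolean matches(List<String> schema, Object[] record) {
        return schema.size() == record.length;
    }

    public static void main(String[] args) {
        List<String> sinkSchema = sinkSchemaAfterFailover(List.of("id", "name"), "age");
        Object[] record = {1, "alice", 30}; // written against [id, name, age]

        System.out.println("sink schema: " + sinkSchema);
        System.out.println("record matches schema: " + matches(sinkSchema, record));
    }
}
```

In this sketch the sink ends up with four columns while the record carries three
values, which corresponds to the mismatch described in step 3).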
--
This message was sent by Atlassian Jira
(v8.20.10#820010)