[GitHub] [incubator-seatunnel] JesseAtSZ commented on issue #1461: [Feature][CDC] Support Flink CDC

GitBox Mon, 21 Mar 2022 04:28:42 -0700


JesseAtSZ commented on issue #1461:
URL: 
https://github.com/apache/incubator-seatunnel/issues/1461#issuecomment-1073782876



   > > > > @BenJFan Recently, I haven't started to write code, mainly to 
understand how to ensure strict consistency of transactions when using Flink 
CDC: [Can Flink CDC guarantee MySQL 
transactions](https://github.com/ververica/flink-cdc-connectors/issues/956)
   > > > > In addition, for the specific code level, I still have the above 
three problems to be solved.
   > > > 
   > > > 
   > > > Guaranteed transaction is not only one component that can be 
completed, the transaction that cdc can guarantee requires not only that the 
data source can be replayed (binlog can be replayed), but also the sink side to 
support transactions (traditional transaction or distributed transaction) or 
write idempotency
   > > 
   > > 
   > > Flink CDC supports binlog replay. The problem I want to solve is that 
the sink side can strictly guarantee the transactions on the source side, 
rather than simply inserting and modifying them line by line through SQL ( it 
only replays SQL, but it can not guarantee transactions. For example, if a sink 
side transaction suddenly goes down in the middle of execution, there is a 
problem with the data on the sink side at this time). I think there are several 
key points to this problem:
   > > 
   > > 1. The source side can obtain transaction information and ensure the 
order
   > > 2. The sink side can ensure the sequential insertion of transactions and 
the idempotency during fault recovery
   > > 
   > > I have some understanding of these two questions:
   > > 
   > > 1. I found that the changelog event in debezium contains transaction 
information, but the transaction information in Flink's SourceRecord is not 
complete. I'm considering whether to improve the transaction information of 
Flink CDC, then construct different queues through different transaction id, 
and finally submit in the order of gtids?
   > > 2. The idempotency of fault recovery is mainly reflected in ensuring 
that the transaction will not be executed repeatedly, so it may be necessary to 
introduce checkpoints to record the transaction id, which I haven't thought 
about yet.
   > 
   > 1. The order of transactions is determined by the transaction id. 
Idempotency needs to be supported by the design of data writing methods, and 
has nothing to do with fault recovery.
   > 2. CDC should already support checkpoint.
   
   The combination of Flink CDC and Flink JDBC has achieved idempotency. There 
are checkpoints on the Source side and upsert on the Sink side, however, this 
combination can only meet the final consistency, but can not meet the real-time 
consistency, (as I said above, Flink CDC and Flink JDBC will split the 
operations in a transaction into many SQL). The transaction order and 
checkpoint I mentioned here refer to the implementation under the condition of 
ensuring transactions.
   
   If we just want to ensure the final consistency, I don't think it's 
difficult to realize data synchronization. However, if strict consistency is 
required, there will be transaction problems, which depends on the transaction 
information provided by Flink CDC. However, at present, the transaction 
information is incomplete. I'm not sure to what extent we want to achieve, 
maybe we just need to ensure the final consistency.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [incubator-seatunnel] JesseAtSZ commented on issue #1461: [Feature][CDC] Support Flink CDC

Reply via email to