Hi, while building a CDC pipeline to capture data changes from MySQL using HoodieDeltaStreamer, I came across the following problem. We need to read the MySQL binlog to fetch all the modifications made to a particular table. However, in a production environment handling hundreds of transactions per second (TPS), the same table row can be modified multiple times within a single second.
This is where the MySQL binlog becomes a problem: its timestamp is a 32-bit value with only one-second resolution. If we build a CDC pipeline on top of a table with such a high TPS, ties between records with the same Hoodie key cannot be broken with a single source-ordering-field (set in HoodieDeltaStreamer.Config), which is the binlog timestamp in this case. Example: https://github.com/zendesk/maxwell/issues/925. Hence, as a Hudi improvement, the proposal is to add a secondary-source-ordering-field for breaking ties among incoming records in such cases. For example, we could use ingestion_timestamp or binlog_position as the secondary field. Please share your thoughts. I have raised the issue here: <https://issues.apache.org/jira/browse/HUDI-207>.
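To make the idea concrete, here is a rough sketch in Python of the tie-breaking behaviour I have in mind. It is not Hudi code and does not reflect how HoodieDeltaStreamer actually merges records; `ChangeRecord`, `latest_per_key`, and the sample values are hypothetical names made up for illustration. It only shows that comparing (binlog timestamp, binlog position) as a pair gives a stable winner when the timestamps collide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRecord:
    """Hypothetical CDC event; field names are assumptions for this sketch."""
    key: str              # record key (same row => same key)
    binlog_ts: int        # binlog timestamp, whole seconds only
    binlog_position: int  # proposed secondary ordering field
    value: str


def latest_per_key(records, use_secondary):
    """Keep one record per key, choosing the one that orders last.

    With use_secondary=False only binlog_ts is compared, so records that
    share a timestamp are tied and the first one seen is kept.
    With use_secondary=True the pair (binlog_ts, binlog_position) is
    compared, so the tie is broken by the binlog position.
    """
    latest = {}
    for rec in records:
        current = latest.get(rec.key)
        if current is None:
            latest[rec.key] = rec
            continue
        if use_secondary:
            newer = (rec.binlog_ts, rec.binlog_position) > (
                current.binlog_ts, current.binlog_position)
        else:
            newer = rec.binlog_ts > current.binlog_ts
        if newer:
            latest[rec.key] = rec
    return latest


# Three updates to the same row within the same second, arriving out of order.
records = [
    ChangeRecord("id-1", 1566000000, 120, "v1"),
    ChangeRecord("id-1", 1566000000, 480, "v3"),
    ChangeRecord("id-1", 1566000000, 300, "v2"),
]

primary_only = latest_per_key(records, use_secondary=False)
with_secondary = latest_per_key(records, use_secondary=True)
print(primary_only["id-1"].value, with_secondary["id-1"].value)
```

In this sketch the timestamp-only comparison keeps whichever update happened to arrive first, so the result changes with arrival order, while the two-field comparison always keeps the update with the highest binlog position.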