Hi Pratyaksh,
The usual way we support this is to use the
com.uber.hoodie.utilities.transform.Transformer plugin in HoodieDeltaStreamer.
You can implement your own Transformer to add a new derived field that
combines the timestamp and the binlog position. You can then configure this
new field as the source ordering field.
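As a rough illustration of what such a derived field could hold, here is a
minimal sketch in plain Java. It is not Hudi code: the class
OrderingKeySketch and the method compose are hypothetical names, and it
assumes the timestamp is in whole seconds and fits in 31 bits, and that the
binlog position fits in 32 bits.

```java
// Hypothetical helper, not part of Hudi: packs the binlog timestamp and the
// binlog position into one long that sorts in event order.
final class OrderingKeySketch {

    // Assumes timestampSeconds fits in 31 bits and binlogPosition fits in
    // 32 bits, so the packed value is always a non-negative long.
    static long compose(long timestampSeconds, long binlogPosition) {
        if (timestampSeconds < 0 || timestampSeconds > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("timestamp out of range: " + timestampSeconds);
        }
        if (binlogPosition < 0 || binlogPosition > 0xFFFFFFFFL) {
            throw new IllegalArgumentException("position out of range: " + binlogPosition);
        }
        // High 32 bits: seconds. Low 32 bits: position within the binlog file.
        return (timestampSeconds << 32) | binlogPosition;
    }

    public static void main(String[] args) {
        // Two changes to the same row within the same second.
        long first = compose(1566398140L, 4L);
        long second = compose(1566398140L, 1200L);
        // A change one second later, at a smaller position.
        long later = compose(1566398141L, 4L);

        System.out.println(first < second);  // position breaks the tie
        System.out.println(second < later);  // timestamp still dominates
    }
}
```

A custom Transformer could compute a value like this per row and expose it
as a new column, which is then set as the source ordering field. One caveat:
the binlog position restarts when MySQL rotates to a new binlog file, so two
events in the same second that straddle a rotation would need the file
sequence number folded into the key as well.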
Balaji.V
On Wednesday, August 21, 2019, 07:35:40 AM PDT, Pratyaksh Sharma
<[email protected]> wrote:
Hi,
While building a CDC pipeline for capturing data changes in MySQL using
HoodieDeltaStreamer, I came across the following problem. We need to read
MySQL's binlog file to fetch all the modifications made to a particular
table. However, in a production environment where we handle hundreds
of transactions per second (TPS), the same table row can be
modified multiple times within a second.
The problem is that the MySQL binlog carries a 32-bit timestamp with only
one-second resolution. If we build a CDC pipeline on top of such a table
with high TPS, then breaking ties between records with the same Hoodie key
will not be possible with a single source-ordering-field (mentioned in
HoodieDeltaStreamer.Config), which is the binlog timestamp in this case.
Example - https://github.com/zendesk/maxwell/issues/925.
Hence, as a Hudi improvement, I propose adding a
secondary-source-ordering-field for breaking ties among incoming records in
such cases. For example, we could have ingestion_timestamp or
binlog_position as the secondary field.
Please share your suggestions. I have raised the issue here
<https://issues.apache.org/jira/browse/HUDI-207>.