[
https://issues.apache.org/jira/browse/HUDI-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920656#comment-16920656
]
Pratyaksh Sharma commented on HUDI-207:
---------------------------------------
As discussed, if we use a combination of binlog position and timestamp, the
binlog position will get exhausted after some time. Hence, I am planning to
add a secondary ordering field (this is going to be ingestion_timestamp in
my case). Please let me know your thoughts on this. [[~vinoth]]
> Introduce secondary source ordering field for breaking ties while writing
> -------------------------------------------------------------------------
>
> Key: HUDI-207
> URL: https://issues.apache.org/jira/browse/HUDI-207
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: deltastreamer
> Reporter: Pratyaksh Sharma
> Assignee: Pratyaksh Sharma
> Priority: Major
> Labels: patch
>
> When building CDC pipelines for capturing data changes in SQL, we need to
> read the SQL binlog file to fetch all the modifications made to a
> particular table. However, in a production environment handling hundreds
> of transactions per second (TPS), the same table row can be modified
> multiple times within a single second.
> The problem is that the MySQL binlog uses a 32-bit timestamp with only
> one-second resolution. If we build a CDC pipeline on top of such a
> high-TPS table, breaking ties between records with the same Hoodie key is
> not possible with a single source-ordering-field (specified in
> HoodieDeltaStreamer.Config).
> Example - [https://github.com/zendesk/maxwell/issues/925]
> The proposal is to add a secondary-source-ordering-field for breaking ties
> among incoming records in such cases. For example, we could use
> ingestion_timestamp or binlog_position as the secondary field.
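
To make the proposal concrete, here is a minimal sketch of the tie-breaking idea in plain Java. It is not Hudi code: the ChangeEvent class, its field names, and the latestPerKey helper are all hypothetical, and it assumes the primary ordering field is the binlog timestamp in seconds and the secondary field is a binlog position that increases within that second.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SecondaryOrderingSketch {

    // Hypothetical change event; the field names are illustrative, not Hudi's.
    static final class ChangeEvent {
        final String key;                   // record key (the "Hoodie key")
        final long binlogTimestampSeconds;  // primary ordering field, 1 s resolution
        final long binlogPosition;          // assumed secondary ordering field
        final String payload;

        ChangeEvent(String key, long ts, long position, String payload) {
            this.key = key;
            this.binlogTimestampSeconds = ts;
            this.binlogPosition = position;
            this.payload = payload;
        }
    }

    // Compare on the primary field first; consult the secondary field only on a tie.
    static final Comparator<ChangeEvent> ORDERING =
        Comparator.comparingLong((ChangeEvent e) -> e.binlogTimestampSeconds)
                  .thenComparingLong(e -> e.binlogPosition);

    // Keep, for each key, the event that sorts last under ORDERING.
    // If two events tie on both fields, the one seen first is kept.
    static Map<String, ChangeEvent> latestPerKey(List<ChangeEvent> events) {
        Map<String, ChangeEvent> latest = new HashMap<>();
        for (ChangeEvent e : events) {
            latest.merge(e.key, e,
                (current, incoming) -> ORDERING.compare(incoming, current) > 0 ? incoming : current);
        }
        return latest;
    }

    public static void main(String[] args) {
        // Two updates to the same row within the same second, arriving out of order.
        List<ChangeEvent> events = Arrays.asList(
            new ChangeEvent("row-1", 1567500000L, 200L, "second update"),
            new ChangeEvent("row-1", 1567500000L, 100L, "first update"));

        System.out.println(latestPerKey(events).get("row-1").payload);
    }
}
```

With only the timestamp as the ordering field, the two events above would compare as equal and the winner would depend on arrival order. The secondary field makes the choice deterministic as long as it is itself monotonic for the tied records.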