Hi Balaji,

Sure, I can do that. However, after a considerable amount of time, the
binlog position will get exhausted. To handle this, we can use
ingestion_timestamp (the time at which I push the event to Kafka for
DeltaStreamer to consume) as the secondary ordering field, which will
always work.
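
A rough sketch of the tie-breaking I have in mind, written in plain Python
rather than against the Hudi API. The ChangeEvent type, its field names and
the latest_per_key helper are all hypothetical and only for illustration. It
also assumes the producer reads the binlog sequentially, so that
ingestion_timestamp preserves the order of changes within the same second.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class ChangeEvent(NamedTuple):
    # Hypothetical event shape; none of these names come from Hudi.
    record_key: str            # the Hoodie record key
    binlog_timestamp: int      # seconds resolution, taken from the binlog
    ingestion_timestamp: int   # milliseconds, stamped when pushing to Kafka
    payload: str


def ordering_value(event: ChangeEvent) -> Tuple[int, int]:
    # Primary ordering field first, secondary field second: tuples compare
    # element by element, so the second value only matters on a tie.
    return (event.binlog_timestamp, event.ingestion_timestamp)


def latest_per_key(events: Iterable[ChangeEvent]) -> Dict[str, ChangeEvent]:
    # Keep, for every record key, the event with the largest ordering value.
    latest: Dict[str, ChangeEvent] = {}
    for event in events:
        current = latest.get(event.record_key)
        if current is None or ordering_value(event) >= ordering_value(current):
            latest[event.record_key] = event
    return latest
```

With only binlog_timestamp as the ordering field, three updates to the same
row within one second would be indistinguishable; with the pair above, the
update that was ingested last is the one that survives.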

Please suggest.

On Thu, Aug 22, 2019 at 9:49 PM vbal...@apache.org <vbal...@apache.org>
wrote:

>  Hi Pratyaksh,
> The usual way we support this is to use the
> com.uber.hoodie.utilities.transform.Transformer plugin in
> HoodieDeltaStreamer.  You can implement your own Transformer to add a new
> derived field, which could be a combination of timestamp and
> binlog-position. You can then configure this new field to be used as the
> source ordering field.
> Balaji.V
>
>     On Wednesday, August 21, 2019, 07:35:40 AM PDT, Pratyaksh Sharma <
> pratyaks...@gmail.com> wrote:
>
>  Hi,
>
> While building a CDC pipeline for capturing data changes in SQL using
> HoodieDeltaStreamer, I came across the following problem. We need to read
> SQL's binlog file to fetch all the modifications made to a particular
> table. However, in a production environment where we are handling hundreds
> of transactions per second (TPS), the same table row can be modified
> multiple times within a second.
>
> The problem is that the MySQL binlog has a 32-bit timestamp with only
> seconds resolution. If we build a CDC pipeline on top of such a table
> with huge TPS, then breaking ties between records with the same Hoodie key
> will not be possible with a single source-ordering-field (mentioned in
> HoodieDeltaStreamer.Config), which is the binlog timestamp in this case.
>
> Example -  https://github.com/zendesk/maxwell/issues/925.
>
> Hence as a part of Hudi improvement, the proposal is to add one
> secondary-source-ordering-field for breaking ties among incoming records in
> such cases.  For example, we could have ingestion_timestamp or
> binlog_position as the secondary field.
>
> Please suggest. I have raised the issue here
> <https://issues.apache.org/jira/browse/HUDI-207>.
>
