Assigned to you, and also added you to the role for future tickets.

On Thu, Aug 29, 2019 at 11:57 PM Pratyaksh Sharma <[email protected]> wrote:
> Hi Vinoth,
>
> The jira is HUDI-207 <https://issues.apache.org/jira/browse/HUDI-207>.
>
> On Thu, Aug 29, 2019 at 10:17 PM Vinoth Chandar <[email protected]> wrote:
>
> > Hi,
> >
> > What's your JIRA id? If you could share that, I will add you to the
> > contributors role.
> >
> > On Thu, Aug 29, 2019 at 12:02 AM Pratyaksh Sharma <[email protected]>
> > wrote:
> >
> > > Sure Balaji,
> > >
> > > Please give me permissions so I can assign this jira
> > > <https://issues.apache.org/jira/browse/HUDI-207> to myself and start
> > > working on it.
> > >
> > > On Wed, Aug 28, 2019 at 7:23 PM [email protected] <[email protected]>
> > > wrote:
> > >
> > > > Sure Pratyaksh,
> > > > Whatever field works for your use case is good enough. You do have
> > > > the flexibility to generate a derived field or use one of the
> > > > source fields.
> > > > Balaji.V
> > > >
> > > > On Wednesday, August 28, 2019, 06:48:44 AM PDT, Pratyaksh Sharma
> > > > <[email protected]> wrote:
> > > >
> > > > Hi Balaji,
> > > >
> > > > Sure, I can do that. However, after a considerable amount of time,
> > > > the bin-log position will get exhausted. To handle this, we can
> > > > have the ingestion_timestamp (the time when I am pushing the event
> > > > to Kafka to be consumed by DeltaStreamer) as the secondary ordering
> > > > field, which will always work.
> > > >
> > > > Please suggest.
> > > >
> > > > On Thu, Aug 22, 2019 at 9:49 PM [email protected] <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Pratyaksh,
> > > > > The usual way we support this is to make use of the
> > > > > com.uber.hoodie.utilities.transform.Transformer plugin in
> > > > > HoodieDeltaStreamer. You can implement your own Transformer to
> > > > > add a new derived field, which could be a combination of
> > > > > timestamp and binlog-position. You can then configure this new
> > > > > field to be used as the source ordering field.
> > > > > Balaji.V
> > > > >
> > > > > On Wednesday, August 21, 2019, 07:35:40 AM PDT, Pratyaksh Sharma
> > > > > <[email protected]> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > While building a CDC pipeline for capturing data changes in SQL
> > > > > using HoodieDeltaStreamer, I came across the following problem.
> > > > > We need to read SQL's bin log file to fetch all the modifications
> > > > > made to a particular table. However, in a production environment
> > > > > where we are handling hundreds of transactions per second (TPS),
> > > > > it is possible for the same table row to be modified multiple
> > > > > times within a second.
> > > > >
> > > > > Here comes the problem with the MySQL binlog, as it has a 32-bit
> > > > > timestamp with up to seconds resolution. If we build a CDC
> > > > > pipeline on top of such a table with huge TPS, then breaking ties
> > > > > between records with the same Hoodie key will not be possible
> > > > > with a single source-ordering-field (mentioned in
> > > > > HoodieDeltaStreamer.Config), which is the binlog timestamp in
> > > > > this case.
> > > > >
> > > > > Example - https://github.com/zendesk/maxwell/issues/925.
> > > > >
> > > > > Hence, as a part of Hudi improvement, the proposal is to add one
> > > > > secondary-source-ordering-field for breaking ties among incoming
> > > > > records in such cases. For example, we could have
> > > > > ingestion_timestamp or binlog_position as the secondary field.
> > > > >
> > > > > Please suggest. I have raised the issue here
> > > > > <https://issues.apache.org/jira/browse/HUDI-207>.
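
For illustration, the derived ordering field suggested in the thread (a combination of timestamp and binlog position) could be sketched roughly as below. This is a minimal Python sketch and not Hudi code: the record field names (key, binlog_ts, binlog_pos), the 32-bit width assumed for the position, and both helper functions are hypothetical. In an actual pipeline the derivation would live in a Transformer implementation, whose API is not reproduced here.

```python
from typing import Dict, Iterable

# Assumption: binlog positions fit in 32 bits; widen the shift if they do not.
POS_BITS = 32


def derived_ordering_value(binlog_ts: int, binlog_pos: int) -> int:
    """Pack a seconds-resolution timestamp and a binlog position into one
    comparable integer: timestamp in the high bits, position in the low bits."""
    if not 0 <= binlog_pos < (1 << POS_BITS):
        raise ValueError("binlog_pos out of range for POS_BITS")
    return (binlog_ts << POS_BITS) | binlog_pos


def latest_per_key(records: Iterable[dict]) -> Dict[str, dict]:
    """Keep, for each key, the record with the largest derived ordering value."""
    latest: Dict[str, dict] = {}
    for rec in records:
        order = derived_ordering_value(rec["binlog_ts"], rec["binlog_pos"])
        current = latest.get(rec["key"])
        if current is None or order > current["ordering"]:
            # Store a copy of the record together with its ordering value.
            latest[rec["key"]] = {**rec, "ordering": order}
    return latest
```

Two updates to the same row within the same second then differ only in the low bits, so the one with the higher binlog position wins. As noted in the thread, the position alone may not stay usable forever, which is why ingestion_timestamp was also proposed as the secondary field; the same packing idea would apply with that value substituted.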
