Hi Yixue,

Thanks for starting this thread! I have actually been wondering whether we should just deprecate preCombine() and use combineAndGetUpdateValue() there as well. But it boiled down to implementation efficiency: having preCombine() operate on the opaque payload helps us keep the actual data serialized during shuffles (much more compact than shuffling Avro records, in my experience).
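For context, the interface under discussion looks roughly like the following. This is a simplified sketch, not a verbatim copy of the Hudi source, and the comments are annotations, not the original javadoc:

    import java.io.IOException;
    import java.io.Serializable;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.IndexedRecord;
    import org.apache.hudi.common.util.Option;

    public interface HoodieRecordPayload<T extends HoodieRecordPayload> extends Serializable {

      // Merges two incoming records with the same key before writing, e.g. while
      // deduplicating during the shuffle. No Schema is passed in, so payloads can
      // stay as compact serialized bytes end to end.
      T preCombine(T another);

      // Merges an incoming record against the value currently stored on disk.
      // The table Schema is available here, so the payload can be deserialized.
      Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema)
          throws IOException;

      // Produces the Avro record to insert for this payload.
      Option<IndexedRecord> getInsertValue(Schema schema) throws IOException;
    }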
I am fine with this addition overall. We can deprecate the existing preCombine and remove it over the next few releases. Let's wait for others to chime in as well.

Thanks,
Vinoth

On Thu, May 14, 2020 at 1:20 PM Yixue Zhu <[email protected]> wrote:

> We are working on Mongo oplog integration with Hudi, to stream Mongo
> updates into Hudi tables.
>
> There are four Mongo oplog operations we need to handle: CRUD (create,
> read, update, delete).
>
> Currently, Hudi handles create, read, and delete, but not update, with
> the existing preCombine API in the HoodieRecordPayload class. In
> particular, the update operation contains a "patch" field, which is
> extended JSON describing updates to dot-separated field paths.
>
> We need to pass the Avro schema to the preCombine API for it to work:
> even though the BaseAvroPayload constructor accepts a GenericRecord,
> which carries an Avro schema reference, it materializes the
> GenericRecord to bytes, to support serialization/deserialization by
> ExternalSpillableMap.
>
> Is there any concern/objection to this? In other words, have I
> overlooked something?
>
> I have created https://issues.apache.org/jira/browse/HUDI-898 to track
> it.
>
> Best,
> Yixue
>
> --
> Best Regards,
> yixue
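For concreteness, a minimal sketch of what the schema-aware preCombine proposed in HUDI-898 might look like. MongoOplogPayload and applyPatch are purely illustrative names, not existing Hudi APIs, and the import paths are assumptions based on the current Hudi package layout:

    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.generic.IndexedRecord;
    import org.apache.hudi.avro.HoodieAvroUtils;
    import org.apache.hudi.common.model.BaseAvroPayload;
    import org.apache.hudi.common.model.HoodieRecordPayload;
    import org.apache.hudi.common.util.Option;

    public class MongoOplogPayload extends BaseAvroPayload
        implements HoodieRecordPayload<MongoOplogPayload> {

      public MongoOplogPayload(GenericRecord record, Comparable orderingVal) {
        super(record, orderingVal); // BaseAvroPayload serializes the record to recordBytes
      }

      // Existing API: without the schema, recordBytes cannot be decoded here, so
      // all we can do is pick a winner by ordering value; the "patch" is lost.
      @Override
      public MongoOplogPayload preCombine(MongoOplogPayload another) {
        return another.orderingVal.compareTo(orderingVal) > 0 ? another : this;
      }

      // Proposed overload: with the schema we can deserialize both sides and merge
      // the dot-separated field paths described by the oplog "patch" document.
      public MongoOplogPayload preCombine(MongoOplogPayload another, Schema schema)
          throws IOException {
        GenericRecord newer = HoodieAvroUtils.bytesToAvro(recordBytes, schema);
        GenericRecord older = HoodieAvroUtils.bytesToAvro(another.recordBytes, schema);
        return new MongoOplogPayload(applyPatch(older, newer), orderingVal);
      }

      @Override
      public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema)
          throws IOException {
        GenericRecord incoming = HoodieAvroUtils.bytesToAvro(recordBytes, schema);
        return Option.of(applyPatch((GenericRecord) currentValue, incoming));
      }

      @Override
      public Option<IndexedRecord> getInsertValue(Schema schema) throws IOException {
        return Option.of(HoodieAvroUtils.bytesToAvro(recordBytes, schema));
      }

      // Hypothetical helper: apply the extended-JSON patch carried in `newer`
      // onto `older`, field path by field path.
      private static GenericRecord applyPatch(GenericRecord older, GenericRecord newer) {
        // ... merge logic elided ...
        return newer;
      }
    }

Note the efficiency trade-off raised earlier in the thread still applies: deserializing both sides on every preCombine call gives up the compactness of shuffling raw bytes, which is what has to be weighed before deprecating the schema-free preCombine() entirely.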
