Hi,
We have been evaluating Hudi and there is one use case we are trying to solve, 
where incremental datasets can have fewer columns than the ones that have 
already been persisted in Hudi format.

For example: in the initial batch, we have a total of 4 columns
    val initial = Seq(("id1", "col1", "col2", 123456)).toDF("pk", "col1", "col2", "ts")

and in the incremental batch, we have 3 columns
    val incremental = Seq(("id2", "col1", 123879)).toDF("pk", "col1", "ts")

We want the persisted schema to be a union of the initial and incremental 
schemas, such that col2 of id2 has some default value associated with it. But 
what we are seeing, when we persist the data (COW) and read it back through 
Spark, is the latest (incremental) schema for both records. The actual 
incremental datasets would be in Avro format, and we do not maintain their 
schemas.
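For context, this is roughly the write/read sequence we are testing with (a 
minimal sketch: the base path, table name, and the choice of "pk" as the 
partition path field are placeholders for this mail, not our real setup):

    import org.apache.spark.sql.SaveMode

    val basePath = "/tmp/hudi/schema_union_test"
    val hudiOptions = Map(
      "hoodie.table.name"                           -> "schema_union_test",
      "hoodie.datasource.write.recordkey.field"     -> "pk",
      "hoodie.datasource.write.partitionpath.field" -> "pk",  // placeholder partitioning
      "hoodie.datasource.write.precombine.field"    -> "ts",
      "hoodie.datasource.write.operation"           -> "upsert"
    )

    // Write the initial batch (4 columns), then the incremental batch (3 columns)
    initial.write.format("hudi").options(hudiOptions)
      .mode(SaveMode.Overwrite).save(basePath)
    incremental.write.format("hudi").options(hudiOptions)
      .mode(SaveMode.Append).save(basePath)

    // Reading back, both records come out with only the incremental schema
    // (col2 is gone), instead of the union of schemas with col2 defaulted for id2
    spark.read.format("hudi").load(basePath + "/*").show(false)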
I tried looking through the documentation to see if there is a specific 
configuration to achieve this, but couldn’t find any.
We would also want to achieve this via DeltaStreamer and then query these 
results from Presto.

Thanks,
Gautam



