Hi, I have a very basic question about how Hudi writes parquet files
when the daily feed data contains duplicates, updates, or deletes. Let's say
we have the following DataFrames:

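// Assumed record type (not shown in this message); field names are illustrative.
case class Data(key: String, value: String)
import spark.implicits._ // needed for toDF(); assumes a SparkSession named `spark`

val feedDay1DF = Seq(
  Data("a", "0"),
  Data("b", "1"),
  Data("c", "2"),
  Data("d", "3")
).toDF()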

Let's assume that when Hudi stores feedDay1DF above, it writes just one
parquet file containing 4 records with keys a, b, c, and d.

// The values for keys c and d have changed
val feedDay2DF = Seq(
  Data("a", "0"),
  Data("b", "1"),
  Data("c", "200"),
  Data("d", "300")
).toDF()

Now, when we store feedDay2DF, I assume Hudi will write one more parquet
file. My question is: will that file contain only the two updated records
(keys c and d), or will it contain all keys a, b, c, and d? Please
guide.
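
To make the two possibilities concrete, here is a small sketch in plain Scala collections (no Spark or Hudi involved). It only spells out what each outcome would contain; it does not claim which one Hudi actually produces. The names day1, day2, previous, onlyChanged, and fullSnapshot are made up for this illustration, and plain (key, value) pairs stand in for the Data records.

```scala
// Plain (key, value) pairs standing in for the two daily feeds.
val day1 = Seq("a" -> "0", "b" -> "1", "c" -> "2", "d" -> "3")
val day2 = Seq("a" -> "0", "b" -> "1", "c" -> "200", "d" -> "300")

// Index day 1 by key so each day-2 record can be compared with its previous value.
val previous: Map[String, String] = day1.toMap

// Possibility 1: the second file holds only the records whose value changed.
val onlyChanged: Seq[(String, String)] =
  day2.filter { case (k, v) => !previous.get(k).contains(v) }

// Possibility 2: the second file holds the full merged snapshot, one record per key,
// with day-2 values overriding day-1 values.
val fullSnapshot: Seq[(String, String)] =
  (previous ++ day2).toSeq.sortBy(_._1)
```

In other words, is the second file closer to onlyChanged (keys c and d) or to fullSnapshot (keys a, b, c, d with the new values for c and d)?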
