Hi Umesh,

Let me try to answer this. At a very high level, if this table is of type COPY_ON_WRITE, another version of the parquet file will be written containing all keys - a, b, c and d (with c and d carrying the updated values). However, if the table is of MERGE_ON_READ type, the updates are instead stored as avro log files and reconciled with the base parquet file at read time.
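Roughly, the writes for your example would look like the sketch below. This is not tested end to end; the path, table name and the assumption that your Data case class has fields named "key" and "value" are mine, partitioning config is omitted for brevity, and the exact option keys can vary a little between Hudi versions, so please double-check them against the docs for the release you are on:

  import org.apache.spark.sql.SaveMode

  // Common Hudi datasource options (illustrative values)
  val hudiOptions = Map(
    "hoodie.table.name"                        -> "my_table",   // assumed table name
    "hoodie.datasource.write.recordkey.field"  -> "key",        // assumes Data(key, value)
    "hoodie.datasource.write.precombine.field" -> "value",      // used to pick the winner among duplicate keys
    "hoodie.datasource.write.operation"        -> "upsert"
  )

  // Day 1: initial write of keys a, b, c, d
  feedDay1DF.write
    .format("org.apache.hudi")
    .options(hudiOptions)
    .mode(SaveMode.Overwrite)
    .save("/tmp/hudi/my_table")

  // Day 2: upsert of the same keys. On COPY_ON_WRITE, Hudi rewrites the affected
  // parquet file with all four keys (a, b unchanged; c, d updated). On
  // MERGE_ON_READ, it appends the updates to a log file and merges at read time.
  feedDay2DF.write
    .format("org.apache.hudi")
    .options(hudiOptions)
    .mode(SaveMode.Append)
    .save("/tmp/hudi/my_table")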
Depending on the number of entries in your input DataFrame and your file size configs, one or more parquet files could be produced instead of just one (the relevant config keys are in a short PS below your quoted mail). You can refer to the documentation on File Management here - https://hudi.apache.org/concepts.html#file-management and on the different storage types here - https://hudi.apache.org/concepts.html

Hope this helps.

Thanks,
Sudha

On Fri, Sep 27, 2019 at 11:38 PM Umesh Kacha <[email protected]> wrote:

> Hi, I have a very basic question regarding how Hudi writes parquet files
> when it finds duplicates/updates/deletes in the daily feed data. Let's say
> we have the following dataframes:
>
> val feedDay1DF = Seq(
>   Data("a", "0"),
>   Data("b", "1"),
>   Data("c", "2"),
>   Data("d", "3")
> ).toDF()
>
> I assume Hudi stores the above feedDay1DF as, let's assume, just one
> parquet file with 4 records with keys a, b, c, d.
>
> // c and d key values changed
> val feedDay2DF = Seq(
>   Data("a", "0"),
>   Data("b", "1"),
>   Data("c", "200"),
>   Data("d", "300")
> ).toDF()
>
> Now when we try to store feedDay2DF, assume it will again store one more
> parquet file. The question is: will it store only the two updated records
> (keys c and d), or will it store all keys a, b, c, d in a parquet file?
> Please guide.
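PS: the file sizing configs I referred to above are mainly these two; the values shown are the defaults as far as I remember, so please verify them against the configuration docs for your version:

  // Controls when Hudi rolls over to a new parquet file and when it packs
  // new inserts into existing small files.
  val sizingOptions = Map(
    "hoodie.parquet.max.file.size"    -> (120 * 1024 * 1024).toString, // target max size of a parquet file
    "hoodie.parquet.small.file.limit" -> (100 * 1024 * 1024).toString  // files below this absorb new inserts
  )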
