Hi Umesh,

Let me try to answer this. At a very high level, if the table is of type
COPY_ON_WRITE, another version of the parquet file will be written containing
all keys - a, b, c and d. However, if the table is of MERGE_ON_READ type, the
updates are stored as avro log files and reconciled (merged) with the base
parquet file at read time.
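
For illustration, here is a rough sketch of how such an upsert write could
look with the Spark datasource (untested; the table path, table name, and the
record key / precombine fields "key" and "value" are just assumptions based on
your Data class, and option names can differ slightly between Hudi versions -
older releases use hoodie.datasource.write.storage.type instead of
...table.type):

import org.apache.spark.sql.SaveMode

val basePath = "file:///tmp/hudi_demo"  // hypothetical location

// Day 1: initial write creates the first parquet file(s) with keys a, b, c, d
feedDay1DF.write.format("org.apache.hudi").
  option("hoodie.table.name", "demo_table").
  option("hoodie.datasource.write.recordkey.field", "key").
  option("hoodie.datasource.write.precombine.field", "value").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").  // or MERGE_ON_READ
  option("hoodie.datasource.write.operation", "upsert").
  mode(SaveMode.Overwrite).
  save(basePath)

// Day 2: upsert of the changed keys c and d. With COPY_ON_WRITE the affected
// file is rewritten with all keys a, b, c, d; with MERGE_ON_READ the changes
// go to avro log files and are merged with the parquet file at read time.
feedDay2DF.write.format("org.apache.hudi").
  option("hoodie.table.name", "demo_table").
  option("hoodie.datasource.write.recordkey.field", "key").
  option("hoodie.datasource.write.precombine.field", "value").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option("hoodie.datasource.write.operation", "upsert").
  mode(SaveMode.Append).
  save(basePath)

(In a real table you would also configure a partition path field or a
non-partitioned key generator; I left that out to keep the sketch short.)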

Depending on the number of entries in your inputDF and the file size configs,
one or more parquet files could be produced instead of just one parquet file.
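
To influence that, the relevant sizing knobs can be added to the write above,
roughly like this (a sketch; check the write configuration docs for the exact
defaults):

  .option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString)    // target max size of a parquet file
  .option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString) // files below this are "small" and receive new inserts first
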
You can refer to the documentation on File Management here -
https://hudi.apache.org/concepts.html#file-management and the different
storage types on the same page - https://hudi.apache.org/concepts.html

Hope this helps.

Thanks,
Sudha

On Fri, Sep 27, 2019 at 11:38 PM Umesh Kacha <[email protected]> wrote:

> Hi, I have a very basic question regarding how Hudi writes parquet files
> when it finds duplicates/updates/deletes in the daily feed data. Let's say
> we have the following dataframes:
>
> val feedDay1DF = Seq(
>   Data("a", "0"),
>   Data("b", "1"),
>   Data("c", "2"),
>   Data("d", "3")
> ).toDF()
>
> I assume that when Hudi stores the above feedDay1DF as parquet, it writes,
> let's say, just one parquet file with 4 records with keys a, b, c, d.
>
> // values of keys c and d changed
> val feedDay2DF = Seq(
>   Data("a", "0"),
>   Data("b", "1"),
>   Data("c", "200"),
>   Data("d", "300")
> ).toDF()
>
> Now when we try to store feedDay2DF, assume it will again write one more
> parquet file. The question is: will that file contain only the two updated
> records (keys c and d), or all keys a, b, c, d? Please guide.
>
