Ok, so I read the following line from the FAQ link shared: "Your current job is rewriting entire table/partition to deal with updates, while only a few files actually change in each partition."
My understanding from the above statement: if my daily feed has 100 rows and only 2 of them change every day, then Hudi will store the first day's parquet file with 100 rows, and from the next day onwards the parquet files inside the daily partitions will contain only the 2 changed rows. Yes? No? (A minimal sketch of the upsert call I have in mind is included after the quoted thread below.)

On Sun, Sep 29, 2019 at 1:47 AM Umesh Kacha <[email protected]> wrote:

> OK thanks. Let's say I use copy-on-write and I have daily feeds with
> hundreds of thousands of rows, and out of these rows only a few change
> while the rest stay the same every day. My question is: will copy-on-write
> create parquet files with only the changed rows, or will it create parquet
> files with the duplicates plus the changed rows, wasting storage on Hadoop?
>
> On Sat, Sep 28, 2019, 9:20 PM Vinoth Chandar <[email protected]> wrote:
>
>> +1 There is also an FAQ entry here:
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#Frequentlyaskedquestions(FAQ)-Whatisthedifferencebetweencopy-on-write(COW)vsmerge-on-read(MOR)storagetypes
>>
>> On Sat, Sep 28, 2019 at 1:10 AM Bhavani Sudha Saktheeswaran
>> <[email protected]> wrote:
>>
>> > Hi Umesh,
>> >
>> > Let me try to answer this. At a very high level, if this table is of type
>> > COPY_ON_WRITE, another version of the parquet file will be written with all
>> > keys - a, b, c and d. However, if the table is of MERGE_ON_READ type, then
>> > updates are stored as avro files, allowing for read-time reconciliation.
>> >
>> > Depending on the number of entries in your inputDF and the file size configs,
>> > there could be one or more parquet files produced instead of just one parquet
>> > file. You can refer to the documentation on File Management here -
>> > https://hudi.apache.org/concepts.html#file-management and the different
>> > storage types here - https://hudi.apache.org/concepts.html#file-management
>> >
>> > Hope this helps.
>> >
>> > Thanks,
>> > Sudha
>> >
>> > On Fri, Sep 27, 2019 at 11:38 PM Umesh Kacha <[email protected]>
>> > wrote:
>> >
>> > > Hi, I have a very basic question regarding how Hudi writes parquet files
>> > > when it finds duplicates/updates/deletes in the daily feed data. Let's say
>> > > we have the following dataframes:
>> > >
>> > > val feedDay1DF = Seq(
>> > >   Data("a", "0"),
>> > >   Data("b", "1"),
>> > >   Data("c", "2"),
>> > >   Data("d", "3")
>> > > ).toDF()
>> > >
>> > > I assume that when Hudi stores the above feedDay1DF as parquet, it writes
>> > > just one parquet file with 4 records with keys a, b, c, d.
>> > >
>> > > // c and d key values changed
>> > > val feedDay2DF = Seq(
>> > >   Data("a", "0"),
>> > >   Data("b", "1"),
>> > >   Data("c", "200"),
>> > >   Data("d", "300")
>> > > ).toDF()
>> > >
>> > > Now when we try to store feedDay2DF, assume it again writes one more
>> > > parquet file. The question is: will it store only the two updated records
>> > > (keys c and d), or will it store all keys a, b, c, d in the parquet file?
>> > > Please guide.
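For reference, here is a minimal sketch of the upsert call I have in mind, reusing feedDay2DF from the example in the quoted mail above. The Data field names, the table name, base path, partition column ("ds") and the record-key/precombine fields are placeholders I made up for illustration, not anything prescribed in this thread; the option keys are the standard Hudi Spark datasource write options as I understand them (spark-shell style, assuming a SparkSession is available).

// Assumed field names for the Data case class used in the thread.
case class Data(key: String, value: String)

import spark.implicits._
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.lit

val feedDay2DF = Seq(
  Data("a", "0"),
  Data("b", "1"),
  Data("c", "200"),
  Data("d", "300")
).toDF()
  // Add a daily partition column; "ds" is a made-up name for illustration.
  .withColumn("ds", lit("2019-09-28"))

// Upsert the day-2 feed into a COPY_ON_WRITE Hudi table.
// "feed_table", "/tmp/hudi/feed_table", "key" and "value" are placeholders.
feedDay2DF.write
  .format("org.apache.hudi")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.storage.type", "COPY_ON_WRITE")
  .option("hoodie.datasource.write.recordkey.field", "key")
  .option("hoodie.datasource.write.precombine.field", "value")
  .option("hoodie.datasource.write.partitionpath.field", "ds")
  .option("hoodie.table.name", "feed_table")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/feed_table")

// As Sudha explains above: with COPY_ON_WRITE, Hudi locates the existing file
// that holds keys a, b, c, d, merges in the updates for c and d, and writes a
// new version of that parquet file containing all four keys - it does not
// write a 2-row parquet file with only the changed records.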
