>> now out of these rows only a few changed and the rest are the same every day.
Say the hundreds of thousands of rows were saved initially in 100 files.
Copy on write will only (re)create the parquet files containing the "few"
rows that changed.
Best case, all changed rows are in a single file => 1 file re-written.
Worst case, changed rows are spread across all 100 files => 100 files
re-written.
Regarding storage amplification, you can control that using
https://hudi.apache.org/configurations.html#retainCommits .
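
To put a number on that: the retained-commits setting is just another write
option. A rough sketch below - the key "hoodie.cleaner.commits.retained", the
other option names, the column names and the path are my reading of the docs
plus plain assumptions, so please double-check them against your Hudi version:

// Keep old file versions only for the last 10 commits; the cleaner removes
// anything older, which bounds the storage amplification from the rewrites.
// "inputDF" stands for your daily feed dataframe (hypothetical name).
inputDF.write
  .format("org.apache.hudi")
  .option("hoodie.table.name", "feed_table")                      // hypothetical table name
  .option("hoodie.datasource.write.recordkey.field", "key")       // assumed record key column
  .option("hoodie.datasource.write.precombine.field", "value")    // assumed ordering column
  .option("hoodie.datasource.write.partitionpath.field", "date")  // assumed daily partition column
  .option("hoodie.cleaner.commits.retained", "10")
  .mode("append")
  .save("/data/hudi/feed_table")                                  // hypothetical base path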

>> create parquet files with only the changed rows or will it create parquet
files with duplicates plus changed rows, wasting storage on Hadoop
Copy on write will create parquet files with both old and changed data and
rewrite them, i.e. if a file had 100 rows and 2 changed, then a new parquet
file is created with the 98 old values and the 2 changed values. If you want
lower write amplification, merge on read is what fits that. Btw, there won't
be any duplicates introduced by Hudi.
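
As a concrete sketch using the feedDay2DF from the earlier mail: upserting on
the record key is what keeps duplicates out. The option names are the Spark
datasource configs as I recall them, and the column names/path are assumptions
on my part, so treat this as illustrative only:

feedDay2DF.write
  .format("org.apache.hudi")
  .option("hoodie.datasource.write.operation", "upsert")          // rows with an existing key are updated, not appended
  .option("hoodie.datasource.write.recordkey.field", "key")       // assumed name of the key column
  .option("hoodie.datasource.write.precombine.field", "value")    // assumed name of the value column
  .option("hoodie.datasource.write.partitionpath.field", "date")  // assumed daily partition column (the 2-column example df would need one)
  .option("hoodie.table.name", "feed_table")                      // hypothetical table name
  .mode("append")
  .save("/data/hudi/feed_table")                                  // hypothetical base path
// Reading the table back now returns 4 rows (a, b, c=200, d=300), not 8.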

>> then Hudi will store the first day's parquet file with 100 rows and next
day onwards the parquet files inside daily partitions will have only the 2
changed rows. Yes? No?
No in copy on write. Yes in merge on read.
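
The choice between the two is made at write time via the storage type. A
small sketch; the option key and values are my recollection of the Spark
datasource configs (they may differ across Hudi versions), and you would pass
one of these maps to a writer like the one sketched above:

// Copy on write: updates rewrite the affected parquet files.
val copyOnWrite = Map("hoodie.datasource.write.storage.type" -> "COPY_ON_WRITE")
// Merge on read: updates are appended to avro log files and compacted later.
val mergeOnRead = Map("hoodie.datasource.write.storage.type" -> "MERGE_ON_READ")

// e.g. feedDay2DF.write.format("org.apache.hudi").options(mergeOnRead) ...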


On Sat, Sep 28, 2019 at 11:38 PM Umesh Kacha <[email protected]> wrote:

> OK, so I read the following line from the FAQ link shared.
>
> Your current job is rewriting entire table/partition to deal with updates,
> while only a few files actually change in each partition.
>
> My understanding from the above statement is: if my daily feed has 100
> rows and only 2 change every day, then Hudi will store the first day's
> parquet file with 100 rows and next day onwards the parquet files inside
> daily partitions will have only the 2 changed rows. Yes? No?
>
> On Sun, Sep 29, 2019 at 1:47 AM Umesh Kacha <[email protected]> wrote:
>
> > OK thanks. Let's say I use copy on write and I have daily feeds with
> > hundreds of thousands of rows; now out of these rows only a few changed
> > and the rest are the same every day. Now my question is: will copy on
> > write create parquet files with only the changed rows or will it create
> > parquet files with duplicates plus changed rows, wasting storage on
> > Hadoop?
> >
> > On Sat, Sep 28, 2019, 9:20 PM Vinoth Chandar <[email protected]> wrote:
> >
> >> +1 There is also an FAQ entry here:
> >>
> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#Frequentlyaskedquestions(FAQ)-Whatisthedifferencebetweencopy-on-write(COW)vsmerge-on-read(MOR)storagetypes
> >>
> >>
> >> On Sat, Sep 28, 2019 at 1:10 AM Bhavani Sudha Saktheeswaran
> >> <[email protected]> wrote:
> >>
> >> > Hi Umesh,
> >> >
> >> > Let me try to answer this. At a very high level, if this table is of
> >> > type COPY_ON_WRITE, another version of the parquet file will be
> >> > written with all keys - a, b, c and d. However, if the table is of
> >> > MERGE_ON_READ type, then updates are stored as avro files allowing for
> >> > read-time reconciliation.
> >> >
> >> > Depending on the number of entries in your inputDF and file size
> >> > configs, there could be one or more parquet files produced instead of
> >> > just 1 parquet file. You can refer to the documentation on File
> >> > Management here - https://hudi.apache.org/concepts.html#file-management
> >> > and the different storage types here -
> >> > https://hudi.apache.org/concepts.html#file-management
> >> >
> >> > Hope this helps.
> >> >
> >> > Thanks,
> >> > Sudha
> >> >
> >> > On Fri, Sep 27, 2019 at 11:38 PM Umesh Kacha <[email protected]>
> >> > wrote:
> >> >
> >> > > Hi, I have a very basic question regarding how Hudi writes parquet
> >> > > files when it finds duplicates/updates/deletes in the daily feed
> >> > > data. Let's say we have the following dataframes:
> >> > >
> >> > > val feedDay1DF = Seq(
> >> > >   Data("a", "0"),
> >> > >   Data("b", "1"),
> >> > >   Data("c", "2"),
> >> > >   Data("d", "3")
> >> > > ).toDF()
> >> > >
> >> > > I assume when Hudi stores the above feedDay1DF it writes, let's say,
> >> > > just one parquet file with 4 records with keys a, b, c, d.
> >> > >
> >> > > // c and d key values changed
> >> > > val feedDay2DF = Seq(
> >> > >   Data("a", "0"),
> >> > >   Data("b", "1"),
> >> > >   Data("c", "200"),
> >> > >   Data("d", "300")
> >> > > ).toDF()
> >> > >
> >> > > Now when we try to store feedDay2DF, assume it will again store one
> >> > > more parquet file. The question is: will it store only the two
> >> > > updated records with keys c and d, or will it store all keys a, b,
> >> > > c, d in a parquet file? Please guide.
> >> > >
> >> >
> >>
> >
>
