Ok, so I read the following line from the FAQ link shared: "Your current job is rewriting entire table/partition to deal with updates, while only a few files actually change in each partition."
My understanding from the above statement: if my daily feed has 100 rows and only 2 of them change every day, then Hudi will store the first day's parquet file with 100 rows, and from the next day onwards the parquet files inside the daily partitions will contain only the 2 changed rows. Yes? No? (A minimal sketch of the upsert call I have in mind is included after the quoted thread below.)

On Sun, Sep 29, 2019 at 1:47 AM Umesh Kacha <[email protected]> wrote:

> OK thanks. Let's say I use copy-on-write and I have daily feeds with
> hundreds of thousands of rows, and out of these rows only a few change
> while the rest stay the same every day. My question is: will copy-on-write
> create parquet files with only the changed rows, or will it create parquet
> files with the duplicates plus the changed rows, wasting storage on Hadoop?
>
> On Sat, Sep 28, 2019, 9:20 PM Vinoth Chandar <[email protected]> wrote:
>
>> +1 There is also an FAQ entry here:
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#Frequentlyaskedquestions(FAQ)-Whatisthedifferencebetweencopy-on-write(COW)vsmerge-on-read(MOR)storagetypes
>>
>> On Sat, Sep 28, 2019 at 1:10 AM Bhavani Sudha Saktheeswaran
>> <[email protected]> wrote:
>>
>> > Hi Umesh,
>> >
>> > Let me try to answer this. At a very high level, if this table is of type
>> > COPY_ON_WRITE, another version of the parquet file will be written with all
>> > keys - a, b, c and d. However, if the table is of MERGE_ON_READ type, then
>> > updates are stored as avro files, allowing for read-time reconciliation.
>> >
>> > Depending on the number of entries in your inputDF and the file size configs,
>> > there could be one or more parquet files produced instead of just one parquet
>> > file. You can refer to the documentation on File Management here -
>> > https://hudi.apache.org/concepts.html#file-management and the different
>> > storage types here - https://hudi.apache.org/concepts.html#file-management
>> >
>> > Hope this helps.
>> >
>> > Thanks,
>> > Sudha
>> >
>> > On Fri, Sep 27, 2019 at 11:38 PM Umesh Kacha <[email protected]>
>> > wrote:
>> >
>> > > Hi, I have a very basic question regarding how Hudi writes parquet files
>> > > when it finds duplicates/updates/deletes in the daily feed data. Let's say
>> > > we have the following dataframes:
>> > >
>> > > val feedDay1DF = Seq(
>> > >   Data("a", "0"),
>> > >   Data("b", "1"),
>> > >   Data("c", "2"),
>> > >   Data("d", "3")
>> > > ).toDF()
>> > >
>> > > I assume that when Hudi stores the above feedDay1DF as parquet, it writes
>> > > just one parquet file with 4 records with keys a, b, c, d.
>> > >
>> > > // c and d key values changed
>> > > val feedDay2DF = Seq(
>> > >   Data("a", "0"),
>> > >   Data("b", "1"),
>> > >   Data("c", "200"),
>> > >   Data("d", "300")
>> > > ).toDF()
>> > >
>> > > Now when we try to store feedDay2DF, assume it again writes one more
>> > > parquet file. The question is: will it store only the two updated records
>> > > (keys c and d), or will it store all keys a, b, c, d in the parquet file?
>> > > Please guide.
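For reference, here is a minimal sketch of the upsert call I have in mind, reusing feedDay2DF from the example in the quoted mail above. The Data field names, the table name, base path, partition column ("ds") and the record-key/precombine fields are placeholders I made up for illustration, not anything prescribed in this thread; the option keys are the standard Hudi Spark datasource write options as I understand them (spark-shell style, assuming a SparkSession is available).

// Assumed field names for the Data case class used in the thread.
case class Data(key: String, value: String)

import spark.implicits._
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.lit

val feedDay2DF = Seq(
  Data("a", "0"),
  Data("b", "1"),
  Data("c", "200"),
  Data("d", "300")
).toDF()
  // Add a daily partition column; "ds" is a made-up name for illustration.
  .withColumn("ds", lit("2019-09-28"))

// Upsert the day-2 feed into a COPY_ON_WRITE Hudi table.
// "feed_table", "/tmp/hudi/feed_table", "key" and "value" are placeholders.
feedDay2DF.write
  .format("org.apache.hudi")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.storage.type", "COPY_ON_WRITE")
  .option("hoodie.datasource.write.recordkey.field", "key")
  .option("hoodie.datasource.write.precombine.field", "value")
  .option("hoodie.datasource.write.partitionpath.field", "ds")
  .option("hoodie.table.name", "feed_table")
  .mode(SaveMode.Append)
  .save("/tmp/hudi/feed_table")

// As Sudha explains above: with COPY_ON_WRITE, Hudi locates the existing file
// that holds keys a, b, c, d, merges in the updates for c and d, and writes a
// new version of that parquet file containing all four keys - it does not
// write a 2-row parquet file with only the changed records.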
