OK, thanks. Let's say I use copy-on-write and I have daily feeds with hundreds of thousands of rows, of which only a few change each day while the rest stay the same. My question is: will copy-on-write create parquet files containing only the changed rows, or will it create parquet files with the unchanged rows plus the changed rows, wasting storage on Hadoop?
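
For concreteness, here is a minimal sketch (not part of the original thread) of how such a daily feed could be upserted into a COPY_ON_WRITE table through the Hudi Spark datasource. The table name, base path, the static "part" partition column, and the choice of "key"/"value" as record key and precombine field are assumptions for illustration; the option keys follow the Hudi 0.5.x DataSourceWriteOptions and some are named differently in later releases.

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.lit

case class Data(key: String, value: String)

val spark = SparkSession.builder().appName("hudi-cow-sketch").getOrCreate()
import spark.implicits._

// Day 2 feed: only keys c and d changed compared to day 1.
val feedDay2DF = Seq(
  Data("a", "0"),
  Data("b", "1"),
  Data("c", "200"),
  Data("d", "300")
).toDF()
  .withColumn("part", lit("daily")) // single static partition, just to keep the sketch simple

feedDay2DF.write
  .format("org.apache.hudi")
  .option("hoodie.table.name", "daily_feed")                         // assumed table name
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.storage.type", "COPY_ON_WRITE")   // 0.5.x option name
  .option("hoodie.datasource.write.recordkey.field", "key")
  .option("hoodie.datasource.write.precombine.field", "value")
  .option("hoodie.datasource.write.partitionpath.field", "part")
  .mode(SaveMode.Append)
  .save("/data/hudi/daily_feed")                                     // assumed base path
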
On Sat, Sep 28, 2019, 9:20 PM Vinoth Chandar <[email protected]> wrote:

> +1 There is also a FAQ entry here:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#Frequentlyaskedquestions(FAQ)-Whatisthedifferencebetweencopy-on-write(COW)vsmerge-on-read(MOR)storagetypes
>
> On Sat, Sep 28, 2019 at 1:10 AM Bhavani Sudha Saktheeswaran
> <[email protected]> wrote:
>
> > Hi Umesh,
> >
> > Let me try to answer this. At a very high level, if this table is of type
> > COPY_ON_WRITE, another version of the parquet file will be written with
> > all keys - a, b, c and d. However, if the table is of MERGE_ON_READ type,
> > then updates are stored as avro files, allowing for read-time
> > reconciliation.
> >
> > Depending on the number of entries in your inputDF and your file size
> > configs, there could be one or more parquet files produced instead of
> > just one parquet file. You can refer to the documentation on File
> > Management here - https://hudi.apache.org/concepts.html#file-management
> > and the different storage types here -
> > https://hudi.apache.org/concepts.html#file-management
> >
> > Hope this helps.
> >
> > Thanks,
> > Sudha
> >
> > On Fri, Sep 27, 2019 at 11:38 PM Umesh Kacha <[email protected]>
> > wrote:
> >
> > > Hi, I have a very basic question regarding how Hudi writes parquet
> > > files when it finds duplicates/updates/deletes in the daily feed data.
> > > Let's say we have the following dataframes:
> > >
> > > val feedDay1DF = Seq(
> > >   Data("a", "0"),
> > >   Data("b", "1"),
> > >   Data("c", "2"),
> > >   Data("d", "3")
> > > ).toDF()
> > >
> > > I assume Hudi stores the above feedDay1DF as parquet - let's assume
> > > just one parquet file with 4 records with keys a, b, c, d.
> > >
> > > // c and d key values changed
> > > val feedDay2DF = Seq(
> > >   Data("a", "0"),
> > >   Data("b", "1"),
> > >   Data("c", "200"),
> > >   Data("d", "300")
> > > ).toDF()
> > >
> > > Now when we try to store feedDay2DF, assume it will again write one
> > > more parquet file. The question is: will it store only the two updated
> > > records (keys c and d), or will it store all keys a, b, c, d in the
> > > parquet file? Please guide.
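
Continuing the sketch above (same assumed table and base path, still just an illustration), one way to observe the copy-on-write behaviour Sudha describes is to snapshot-read the table and list the parquet files after the day-2 upsert: the latest file version still carries the unchanged keys a and b alongside the updated c and d, and the previous file version stays on HDFS until the Hudi cleaner removes it.

import org.apache.hadoop.fs.{FileSystem, Path}

// Snapshot read of the COPY_ON_WRITE table (glob over one partition level,
// following the 0.5.x datasource read pattern).
val snapshotDF = spark.read
  .format("org.apache.hudi")
  .load("/data/hudi/daily_feed/*/*")

// All four keys come back from the latest file version, including a and b.
snapshotDF.select("_hoodie_commit_time", "key", "value").orderBy("key").show(false)

// List the physical parquet files in the assumed partition: each commit adds
// a new full file version for the touched file group.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/data/hudi/daily_feed/daily"))
  .filter(_.getPath.getName.endsWith(".parquet"))
  .foreach(s => println(s"${s.getPath.getName}  ${s.getLen} bytes"))
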
