+1. There is also an FAQ entry here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#Frequentlyaskedquestions(FAQ)-Whatisthedifferencebetweencopy-on-write(COW)vsmerge-on-read(MOR)storagetypes?
On Sat, Sep 28, 2019 at 1:10 AM Bhavani Sudha Saktheeswaran <bhasu...@uber.com.invalid> wrote:

> Hi Umesh,
>
> Let me try to answer this. At a very high level, if the table is of type
> COPY_ON_WRITE, another version of the parquet file will be written
> containing all keys (a, b, c, and d). If the table is of type
> MERGE_ON_READ, the updates are instead stored in avro log files and
> reconciled with the base file at read time.
>
> Depending on the number of entries in your inputDF and your file size
> configs, one or more parquet files could be produced instead of just one.
> You can refer to the documentation on File Management here -
> https://hudi.apache.org/concepts.html#file-management - and to the
> section on the different storage types on the same page.
>
> Hope this helps.
>
> Thanks,
> Sudha
>
> On Fri, Sep 27, 2019 at 11:38 PM Umesh Kacha <umesh.ka...@gmail.com>
> wrote:
>
> > Hi, I have a very basic question about how Hudi writes parquet files
> > when it finds duplicates/updates/deletes in the daily feed data. Let's
> > say we have the following dataframes:
> >
> > // assuming a simple case class for the feed rows, e.g.
> > // case class Data(key: String, value: String)
> > // and import spark.implicits._ in scope for toDF()
> > val feedDay1DF = Seq(
> >   Data("a", "0"),
> >   Data("b", "1"),
> >   Data("c", "2"),
> >   Data("d", "3")
> > ).toDF()
> >
> > I assume that when Hudi stores feedDay1DF, it writes (say) one parquet
> > file with 4 records under keys a, b, c, d.
> >
> > // values for keys c and d changed
> > val feedDay2DF = Seq(
> >   Data("a", "0"),
> >   Data("b", "1"),
> >   Data("c", "200"),
> >   Data("d", "300")
> > ).toDF()
> >
> > Now when we store feedDay2DF, assume it again writes one more parquet
> > file. The question is: will that file contain only the two updated
> > records (keys c and d), or all keys a, b, c, d? Please guide.
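For readers following along, below is a minimal sketch (not from the thread) of how the two daily feeds could be written as Hudi upserts, so the COW behavior Sudha describes can be observed by inspecting the files under the base path. It assumes the hudi-spark bundle is on the classpath; the table name, base path, and the use of "value" as the precombine field are illustrative choices, and the option keys follow the Hudi Spark datasource docs (older releases use "hoodie.datasource.write.storage.type" and format "org.apache.hudi" instead of "hoodie.datasource.write.table.type" and "hudi").

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

case class Data(key: String, value: String)

object HudiUpsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-upsert-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical base path for this example table.
    val basePath = "file:///tmp/hudi/feed_table"

    def upsert(df: DataFrame, mode: SaveMode): Unit =
      df.write
        .format("hudi")
        .option("hoodie.table.name", "feed_table")
        .option("hoodie.datasource.write.recordkey.field", "key")
        // Ordinarily an event-time/version column; "value" is just a stand-in here.
        .option("hoodie.datasource.write.precombine.field", "value")
        // Non-partitioned table keeps this small example simple.
        .option("hoodie.datasource.write.partitionpath.field", "")
        .option("hoodie.datasource.write.keygenerator.class",
          "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
        // COPY_ON_WRITE rewrites a new version of the affected parquet file
        // with all keys; MERGE_ON_READ would log the updates as avro instead.
        .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
        .option("hoodie.datasource.write.operation", "upsert")
        .mode(mode)
        .save(basePath)

    // Day 1: initial load of keys a, b, c, d.
    upsert(Seq(Data("a", "0"), Data("b", "1"), Data("c", "2"), Data("d", "3")).toDF(),
      SaveMode.Overwrite)
    // Day 2: c and d changed; with COW, Hudi rewrites the file group holding
    // those keys, so the new parquet file again contains all four keys.
    upsert(Seq(Data("a", "0"), Data("b", "1"), Data("c", "200"), Data("d", "300")).toDF(),
      SaveMode.Append)

    spark.stop()
  }
}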