+1 There is also a FAQ entry here:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=113709185#Frequentlyaskedquestions(FAQ)-Whatisthedifferencebetweencopy-on-write(COW)vsmerge-on-read(MOR)storagetypes?


On Sat, Sep 28, 2019 at 1:10 AM Bhavani Sudha Saktheeswaran
<bhasu...@uber.com.invalid> wrote:

> Hi Umesh,
>
> Let me try to answer this. At a very high level, if this table is of type
> COPY_ON_WRITE, another version of the parquet file will be written with all
> keys - a, b, c and d. However, if the table is of MERGE_ON_READ type, the
> updates are stored as Avro log files, allowing for read-time reconciliation.
>
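> As a rough, untested sketch (the option keys below follow the Spark
> datasource configs from the quickstart and may be named slightly differently
> depending on your Hudi version), the storage type is picked at write time:
>
> // assumes a SparkSession `spark` and an inputDF to be written; the record
> // key, precombine and partition columns below are placeholders
> inputDF.write
>   .format("org.apache.hudi")
>   .option("hoodie.datasource.write.operation", "upsert")
>   .option("hoodie.datasource.write.recordkey.field", "key")
>   .option("hoodie.datasource.write.precombine.field", "value")
>   .option("hoodie.datasource.write.partitionpath.field", "key") // placeholder partition column
>   .option("hoodie.datasource.write.storage.type", "COPY_ON_WRITE") // or "MERGE_ON_READ"
>   .option("hoodie.table.name", "feed_table")       // placeholder table name
>   .mode("append")
>   .save("/tmp/hudi/feed_table")                    // placeholder base path
>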
> Depending on the number of entries in your inputDF and the file size configs,
> one or more parquet files could be produced instead of just one.
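>
> For example, the parquet sizing knobs are hoodie.parquet.max.file.size and
> hoodie.parquet.small.file.limit; a sketch of passing them on the same writer
> as above (values are just the illustrative defaults, please verify them for
> the Hudi version you run):
>
>   .option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString)   // target max parquet file size
>   .option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString) // files below this get new inserts packed in
>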
> You can refer to the documentation on file management here -
> https://hudi.apache.org/concepts.html#file-management and the different
> storage types here - https://hudi.apache.org/concepts.html#file-management
>
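> With COPY_ON_WRITE, a snapshot read after the second commit should therefore
> return all four keys with the latest values, roughly like this (again only a
> sketch, using the placeholder path from above; adjust the path glob to your
> partition layout):
>
> val readBack = spark.read
>   .format("org.apache.hudi")
>   .load("/tmp/hudi/feed_table/*") // glob depth depends on partitioning
> readBack.select("key", "value").show()
> // expected after day 2: a -> 0, b -> 1, c -> 200, d -> 300
>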
> Hope this helps.
>
> Thanks,
> Sudha
>
> On Fri, Sep 27, 2019 at 11:38 PM Umesh Kacha <umesh.ka...@gmail.com>
> wrote:
>
> > Hi, I have a very basic question regarding how Hudi writes parquet files
> > when it finds duplicates/updates/deletes in the daily feed data. Let's say
> > we have the following DataFrames:
> >
> > // assuming something like: case class Data(key: String, value: String)
> > // and: import spark.implicits._  (needed for .toDF(); field names are placeholders)
> > val feedDay1DF = Seq(
> >   Data("a", "0"),
> >   Data("b", "1"),
> >   Data("c", "2"),
> >   Data("d", "3")
> > ).toDF()
> >
> > I assume that when Hudi stores the above feedDay1DF as parquet - let's say
> > as just one parquet file - it will contain 4 records with keys a, b, c, d.
> >
> > // values for keys c and d changed
> > val feedDay2DF = Seq(
> >   Data("a", "0"),
> >   Data("b", "1"),
> >   Data("c", "200"),
> >   Data("d", "300")
> > ).toDF()
> >
> > Now when we store feedDay2DF, assume it again writes one more parquet file.
> > The question is: will that file contain only the two updated records (keys
> > c and d), or will it contain all keys a, b, c, d? Please guide.
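> >
> > For context, the two writes would be issued roughly like this (simplified
> > from the Hudi quickstart, not an exact job: the table name, path, key and
> > precombine fields are placeholders and partitioning options are omitted):
> >
> > def writeFeed(df: org.apache.spark.sql.DataFrame, operation: String): Unit = {
> >   df.write
> >     .format("org.apache.hudi")
> >     .option("hoodie.datasource.write.operation", operation)
> >     .option("hoodie.datasource.write.recordkey.field", "key")
> >     .option("hoodie.datasource.write.precombine.field", "value")
> >     .option("hoodie.table.name", "feed_table")
> >     .mode("append")
> >     .save("/tmp/hudi/feed_table")
> > }
> >
> > writeFeed(feedDay1DF, "insert")  // day 1: initial load of a, b, c, d
> > writeFeed(feedDay2DF, "upsert")  // day 2: c and d carry updated values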
> >
>
