Bear with me for 1-2 days... will circle back on this. This is a dear
topic to me as well :)

On Tue, May 19, 2020 at 9:21 AM Shiyan Xu <[email protected]>
wrote:

> Hi Wei,
>
> +1 on the proposal; append-only is a commonly seen use case.
>
> IIUC, the main concern is that, by default, Hudi handles small files
> internally in COW tables: setting `hoodie.parquet.small.file.limit` can
> reduce the number of small files, but it slows down the pipeline (by
> merging small files on write).
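>
> For reference, a minimal Spark datasource write with that option set might
> look like the sketch below (the table name, paths, key/ordering fields, and
> size values are all illustrative, not from your pipeline):
>
> ```scala
> import org.apache.spark.sql.SaveMode
>
> // Illustrative input; substitute your own source.
> val df = spark.read.json("/tmp/input_logs")
>
> df.write.format("hudi").
>   option("hoodie.table.name", "log_events").
>   option("hoodie.datasource.write.operation", "insert").
>   option("hoodie.datasource.write.recordkey.field", "uuid"). // assumed key field
>   option("hoodie.datasource.write.precombine.field", "ts").  // assumed ordering field
>   // Files under this size are merge candidates on the next write:
>   // fewer small files, but extra merge cost per batch.
>   option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024)).
>   mode(SaveMode.Append).
>   save("/tmp/hudi/log_events")
> ```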
>
> Regarding the option you mentioned: when writing parquet directly, have
> you considered setting parameters for bulk writes? It should be possible
> to make bulk writes bounded by time and size, so that you always get a
> reasonably sized output.
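>
> As a concrete sketch of what I mean (assuming a Kafka source; the Spark
> options are standard, but the topic, servers, paths, and limits are made up):
>
> ```scala
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.streaming.Trigger
>
> val stream = spark.readStream.format("kafka").
>   option("kafka.bootstrap.servers", "broker:9092").
>   option("subscribe", "app_logs").
>   // Size bound: cap the records pulled into each micro-batch.
>   option("maxOffsetsPerTrigger", "5000000").
>   load()
>
> stream.writeStream.
>   option("checkpointLocation", "/tmp/chk/log_events").
>   // Time bound: one batch (and thus one commit) every 2 minutes.
>   trigger(Trigger.ProcessingTime("2 minutes")).
>   foreachBatch { (batch: DataFrame, batchId: Long) =>
>     // In practice you would parse the Kafka value into columns first.
>     batch.write.format("hudi").
>       option("hoodie.table.name", "log_events").
>       option("hoodie.datasource.write.operation", "insert").
>       mode("append").
>       save("/tmp/hudi/log_events")
>   }.
>   start()
> ```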
>
> I agree with Vinoth's point:
> > The main blocker for us to send inserts into logs is having the ability
> > to do log indexing (we wanted to support someone who may want to do
> > inserts and suddenly wants to upsert the table)
>
> Logs are append-only most of the time, but due to GDPR or other
> compliance requirements we may have to scrub some fields later.
> It looks like we could phase the support: phase 1 is to write parquet as
> log files; phase 2 is to support upsert on demand. This seems to be a
> different table type (neither COW nor MOR; sounds like merge-on-demand?)
>
>
>
> On Sun, May 17, 2020 at 10:10 AM wei li <[email protected]> wrote:
>
> > Thanks, Vinoth Chandar.
> > Just like https://issues.apache.org/jira/projects/HUDI/issues/HUDI-112,
> > we need a mechanism that solves two issues:
> > 1. On the write side: skip compaction for faster writes (today, merge on
> > read can solve this).
> > 2. On the compaction/read side: a mechanism to collapse older, smaller
> > files into larger ones while also keeping the query cost low (with merge
> > on read, if we do not compact, the real-time read will be slow).
> >
> > We have one option (see the sketch after this list):
> > 1. On the write side: just write parquet, with no compaction.
> > 2. On the compaction/read side: because the small files are parquet, the
> > real-time read can be fast, and the user can run asynchronous compaction
> > to collapse older, smaller parquet files into larger ones.
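> >
> > For step 1, I think the write side can already be approximated with
> > today's configs (a rough sketch; my understanding is that setting the
> > small-file limit to 0 turns the merge step off entirely, please correct
> > me if that is wrong):
> >
> > ```scala
> > // Write side only: plain parquet inserts, no merging with existing files.
> > // Step 2 (collapsing small parquet files later) would still need the new
> > // asynchronous mechanism proposed here.
> > df.write.format("hudi").
> >   option("hoodie.table.name", "log_events").  // illustrative name
> >   option("hoodie.datasource.write.operation", "insert").
> >   // 0 should disable small-file handling, so each batch just writes
> >   // fresh parquet files at full speed.
> >   option("hoodie.parquet.small.file.limit", "0").
> >   mode("append").
> >   save("/tmp/hudi/log_events")
> > ```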
> >
> > Best Regards,
> > Wei Li.
> >
> > On 2020/05/14 16:54:24, Vinoth Chandar <[email protected]> wrote:
> > > Hi Wei,
> > >
> > > Thanks for starting this thread. I am trying to understand your
> > > concern, which seems to be that for inserts, we write parquet files
> > > instead of logging? FWIW Hudi already supports asynchronous compaction,
> > > and a record reader flag that can avoid merging for cases where there
> > > are only inserts.
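> > >
> > > (If it helps, I believe the flag in question is
> > > `hoodie.realtime.merge.skip`, set on the job conf when querying the
> > > real-time view; please double-check the name against your version.
> > > A sketch:)
> > >
> > > ```scala
> > > // Assumed flag name; skips the parquet/log merge on the realtime view
> > > // when records are insert-only, so reads stay fast without compaction.
> > > spark.sparkContext.hadoopConfiguration
> > >   .set("hoodie.realtime.merge.skip", "true")
> > > spark.sql("select count(*) from log_events_rt").show() // hypothetical _rt table
> > > ```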
> > >
> > > The main blocker for us to send inserts into logs is having the
> > > ability to do log indexing (we wanted to support someone who may want
> > > to do inserts and suddenly wants to upsert the table). If we can
> > > sacrifice on that initially, it's very doable.
> > >
> > > Will wait for others to chime in as well.
> > >
> > > On Thu, May 14, 2020 at 9:06 AM wei li <[email protected]> wrote:
> > >
> > > > The business scenarios of the data lake mainly include analysis of
> > > > databases, logs, and files.
> > > >
> > > > At present, Hudi supports the scenario of database CDC incrementally
> > > > written to Hudi fairly well, and bulk-loading files into Hudi is also
> > > > in progress.
> > > >
> > > > However, there is no good native support for log scenarios (which
> > > > require high-throughput writes, have no updates or deletions, and are
> > > > dominated by small files); today we can write through inserts without
> > > > deduplication, but they will still merge on the write side.
> > > >
> > > >    - In copy-on-write mode, when `hoodie.parquet.small.file.limit` is
> > > >    100MB, every small batch costs some time for the merge, which
> > > >    reduces write throughput.
> > > >    - This scenario is not a good fit for merge on read.
> > > >    - The actual scenario only needs to write parquet in batches on
> > > >    the write side, and then have compaction provided afterwards
> > > >    (similar to Delta Lake).
> > > >
> > > >
> > > > I created an RFC with more details:
> > > > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+hudi+support+log+append+scenario+with+better+write+and+asynchronous+compaction
> > > >
> > > >
> > > > Best Regards,
> > > > Wei Li.
> > > >
> > > >
> > > >
> > >
> >
>
