+1 from me. Also, I updated RFC-19; please take another look when you get a chance.
On Wed, May 20, 2020 at 1:43 AM Vinoth Chandar <[email protected]> wrote:

> Bear with me for 1-2 days.. Will circle around on this.. This is a dear
> topic to me as well :)
>
> On Tue, May 19, 2020 at 9:21 AM Shiyan Xu <[email protected]>
> wrote:
>
> > Hi Wei,
> >
> > +1 on the proposal; append-only is a commonly seen use case.
> >
> > IIUC, the main concern is that Hudi by default generates small files
> > internally in COW tables, and setting `hoodie.parquet.small.file.limit`
> > can reduce the number of small files but slows down the pipeline (by
> > doing compaction).
> >
> > To the option you mentioned: when writing to parquet directly, did you
> > consider setting params for bulk write? It should be possible to make
> > bulk write bounded by time and size so that you always get a reasonable
> > size for the output.
> >
> > I agree with Vinoth's point:
> > > The main blocker for us to send inserts into logs, is having the
> > > ability to do log indexing (we wanted to support someone who may want
> > > to do inserts and suddenly wants to upsert the table)
> >
> > Logs are append-only most of the time. Due to GDPR or other compliance,
> > we may have to scrub some fields later.
> > It looks like we could phase the support: 1 is to write parquet as log
> > files; 2 is to support upsert on demand. This seems to be a different
> > table type (neither COW nor MOR. Sounds like Merge-on-demand?)
> >
> > On Sun, May 17, 2020 at 10:10 AM wei li <[email protected]> wrote:
> >
> > > Thanks, Vinoth Chandar.
> > > Just like https://issues.apache.org/jira/projects/HUDI/issues/HUDI-112,
> > > we need a mechanism that solves two issues:
> > > 1. On the write side: do not compact, for faster writes (merge on read
> > > can solve this today).
> > > 2. Compaction and read: also a mechanism to collapse older, smaller
> > > files into larger ones while keeping query cost low (with merge on
> > > read, if we do not compact, the real-time read will be slow).
> > >
> > > We have an option:
> > > 1. On the write side: just write parquet, with no compaction.
> > > 2. Compaction and read: because the small files are parquet, the
> > > real-time read can be fast, and users can also run asynchronous
> > > compaction to collapse older, smaller parquet files into larger ones.
> > >
> > > Best Regards,
> > > Wei Li.
> > >
> > > On 2020/05/14 16:54:24, Vinoth Chandar <[email protected]> wrote:
> > > > Hi Wei,
> > > >
> > > > Thanks for starting this thread. I am trying to understand your
> > > > concern - which seems to be that for inserts, we write parquet files
> > > > instead of logging? FWIW Hudi already supports asynchronous
> > > > compaction... and a record reader flag that can avoid merging for
> > > > cases where there are only inserts..
> > > >
> > > > The main blocker for us to send inserts into logs, is having the
> > > > ability to do log indexing (we wanted to support someone who may
> > > > want to do inserts and suddenly wants to upsert the table).. If we
> > > > can sacrifice on that initially, it's very doable.
> > > >
> > > > Will wait for others to chime in as well.
> > > >
> > > > On Thu, May 14, 2020 at 9:06 AM wei li <[email protected]> wrote:
> > > >
> > > > > The business scenarios of the data lake mainly include analysis of
> > > > > databases, logs, and files.
> > > > > [image: 11111.jpg]
> > > > >
> > > > > At present, Hudi supports the scenario where database CDC is
> > > > > incrementally written to Hudi fairly well, and bulk-loading files
> > > > > into Hudi is also being worked on.
> > > > >
> > > > > However, there is no good native support for log scenarios
> > > > > (requiring high-throughput writes, no updates or deletions, and
> > > > > focusing on small-file handling). Today we can write through
> > > > > inserts without deduplication, but they will still be merged on
> > > > > the write side.
> > > > >
> > > > > - In copy-on-write mode, when "hoodie.parquet.small.file.limit" is
> > > > > 100MB, every small batch costs some time for the merge, which
> > > > > reduces write throughput.
> > > > > - This scenario is not a good fit for merge on read.
> > > > > - The actual scenario only needs to write parquet in batches on
> > > > > the write path, and then provide compaction afterwards (similar to
> > > > > Delta Lake).
> > > > >
> > > > > I created an RFC with more details:
> > > > > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+hudi+support+log+append+scenario+with+better+write+and+asynchronous+compaction
> > > > >
> > > > > Best Regards,
> > > > > Wei Li.
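For anyone who wants to poke at the insert path being discussed, here is a rough, untested sketch of an append-only write into a COW table that leans on hoodie.parquet.small.file.limit for file sizing, using the standard Spark datasource options. The table name, paths, and field names are placeholders I made up for illustration; adjust to your own dataset.

// Rough sketch only (untested): append-only write into a COW table via the
// Spark datasource. Assumes the hudi-spark-bundle is on the classpath.
// Table name, paths, and field names below are hypothetical placeholders.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-append-only-sketch")
  .getOrCreate()

// Hypothetical input: a batch of log events landing as JSON.
val df = spark.read.json("/data/incoming/logs/")

df.write
  .format("org.apache.hudi")
  .option("hoodie.table.name", "log_events")                        // hypothetical table name
  .option("hoodie.datasource.write.operation", "insert")            // append-only: no dedup/merge by key
  .option("hoodie.datasource.write.recordkey.field", "event_id")    // hypothetical key field
  .option("hoodie.datasource.write.precombine.field", "ts")         // hypothetical ordering field
  .option("hoodie.datasource.write.partitionpath.field", "dt")      // hypothetical partition field
  .option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString) // 100MB, as discussed above
  .mode(SaveMode.Append)
  .save("/warehouse/hudi/log_events")                               // hypothetical base path

As I read the RFC, the point is to skip the small-file bin-packing on this write path entirely and leave it to asynchronous compaction, which is exactly the trade-off being debated above.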
