Hi Wei,

+1 on the proposal; append-only is a common use case.

IIUC, the main concern is that Hudi by default generates small files
internally in COW tables, and that setting `hoodie.parquet.small.file.limit`
can reduce the number of small files but slows down the pipeline (by merging
small files on the write path).
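
For reference, a rough sketch of the write configs involved, assuming a
spark-shell session where `df` is an existing DataFrame of log records with
`uuid`, `ts` and `dt` columns (the table name, field names, path and sizes
below are placeholders, not recommendations):

    // Insert into a COW table (Hudi's default table type). The small file
    // limit is what makes Hudi merge incoming records into existing small
    // files, which is the write-side cost discussed above.
    df.write.format("hudi")
      .option("hoodie.table.name", "app_logs")                      // assumed name
      .option("hoodie.datasource.write.recordkey.field", "uuid")    // assumed field
      .option("hoodie.datasource.write.precombine.field", "ts")     // assumed field
      .option("hoodie.datasource.write.partitionpath.field", "dt")  // assumed field
      .option("hoodie.datasource.write.operation", "insert")
      .option("hoodie.parquet.small.file.limit", "104857600")       // 100 MB
      .option("hoodie.parquet.max.file.size", "125829120")          // 120 MB
      .mode("append")
      .save("/data/hudi/app_logs")                                  // assumed path

Setting the small file limit to 0 turns off the small-file handling entirely,
at the cost of the table accumulating small files over time.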

On the option you mentioned: when writing Parquet directly, have you
considered setting parameters for bulk write? It should be possible to bound
the bulk write by time and size so that the output files always have a
reasonable size.
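
Very roughly what I have in mind (just a sketch; the JSON source, schema,
field names and paths are all assumptions, not a tested setup): bound each
batch in time with the streaming trigger, and bound the output file size with
the parquet max file size.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("log-append-sketch").getOrCreate()

    // Assumed log schema; adjust to the real event format.
    val schema = new StructType()
      .add("uuid", StringType)
      .add("ts", LongType)
      .add("dt", StringType)
      .add("msg", StringType)

    // Example source: JSON log files landing in a directory (assumption).
    val logs = spark.readStream.schema(schema).json("/data/raw/app_logs")

    // Each micro-batch is bounded in time by the trigger, and the output
    // files are bounded in size by hoodie.parquet.max.file.size.
    logs.writeStream
      .format("hudi")
      .option("hoodie.table.name", "app_logs")                      // assumed name
      // assuming the streaming sink accepts bulk_insert; otherwise "insert"
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .option("hoodie.datasource.write.recordkey.field", "uuid")    // assumed field
      .option("hoodie.datasource.write.precombine.field", "ts")     // assumed field
      .option("hoodie.datasource.write.partitionpath.field", "dt")  // assumed field
      .option("hoodie.parquet.max.file.size", "125829120")          // ~120 MB
      .option("checkpointLocation", "/tmp/hudi_ckpt")               // assumption
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("5 minutes"))
      .start("/data/hudi/app_logs")                                 // assumed path

The same idea applies to a batch job scheduled every N minutes, which may be
closer to what you meant by bulk write.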

I agree with Vinoth's point:
> The main blocker for us to send inserts into logs, is having the ability to
> do log indexing (we wanted to support someone who may want to do inserts
> and suddenly wants to upsert the table)

Logs are append-only most of the time, but due to GDPR or other compliance
requirements we may have to scrub some fields later.
It looks like we could phase the support: (1) write Parquet as log files;
(2) support upsert on demand. This seems to be a different table type
(neither COW nor MOR; sounds like merge-on-demand?).



On Sun, May 17, 2020 at 10:10 AM wei li <lw309637...@gmail.com> wrote:

> Thanks, Vinoth Chandar
> Just like https://issues.apache.org/jira/projects/HUDI/issues/HUDI-112,
> we need a mechanism that solves two issues.
> 1. On the write side: skip compaction for faster writes. (Merge on read
> can already solve this problem.)
> 2. Compaction and read: we also need a mechanism to collapse older, smaller
> files into larger ones while keeping the query cost low. (With merge on
> read, if we do not compact, the real-time read will be slow.)
>
> We have an option:
> 1. On the write side: just write Parquet, with no compaction.
> 2. Compaction and read: because the small files are Parquet, the real-time
> read can be fast, and users can also run asynchronous compaction to collapse
> older, smaller Parquet files into larger Parquet files.
>
> Best Regards,
> Wei Li.
>
> On 2020/05/14 16:54:24, Vinoth Chandar <vin...@apache.org> wrote:
> > Hi Wei,
> >
> > Thanks for starting this thread. I am trying to understand your concern -
> > which seems to be that for inserts, we write parquet files instead of
> > logging?  FWIW Hudi already supports asynchronous compaction... and a
> > record reader flag that can avoid merging for cases where there are only
> > inserts..
> >
> > The main blocker for us to send inserts into logs, is having the ability
> > to do log indexing (we wanted to support someone who may want to do inserts
> > and suddenly wants to upsert the table).. If we can sacrifice on that
> > initially, it's very doable.
> >
> > Will wait for others to chime in as well.
> >
> > On Thu, May 14, 2020 at 9:06 AM wei li <lw309637...@gmail.com> wrote:
> >
> > > The business scenarios of the data lake mainly include analysis of
> > > databases, logs, and files.
> > >
> > > At present, Hudi supports the scenario where database CDC is
> > > incrementally written to Hudi fairly well, and it is also working on
> > > bulk-loading files into Hudi.
> > >
> > > However, there is no good native support for log scenarios (which
> > > require high-throughput writes, have no updates or deletions, and are
> > > dominated by small files). Today we can write through inserts without
> > > deduplication, but they will still merge on the write side.
> > >
> > >    - In copy-on-write mode, when "hoodie.parquet.small.file.limit" is
> > >    100MB, every small batch costs some time for the merge, which reduces
> > >    write throughput.
> > >    - This scenario is not a good fit for merge on read.
> > >    - The actual scenario only needs to write Parquet in batches on the
> > >    write side, and then run compaction afterwards (similar to Delta
> > >    Lake).
> > >
> > >
> > > I created an RFC with more details
> > >
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+hudi+support+log+append+scenario+with+better+write+and+asynchronous+compaction
> > >
> > >
> > > Best Regards,
> > > Wei Li.
> > >
> > >
> > >
> >
>
