+1 from me. Also, I updated RFC-19; please take another look when you get a chance.
On Wed, May 20, 2020 at 1:43 AM Vinoth Chandar <[email protected]> wrote:

> Bear with me for 1-2 days.. Will circle around on this.. This is a dear
> topic to me as well :)
>
> On Tue, May 19, 2020 at 9:21 AM Shiyan Xu <[email protected]>
> wrote:
>
> > Hi Wei,
> >
> > +1 on the proposal; append-only is a commonly seen use case.
> >
> > IIUC, the main concern is that Hudi by default generates small files
> > internally in COW tables, and setting `hoodie.parquet.small.file.limit`
> > can reduce the number of small files but slows down the pipeline (by
> > doing compaction).
> >
> > To the option you mentioned: when writing to parquet directly, did you
> > consider setting params for bulk write? It should be possible to make
> > bulk write bounded by time and size so that you always get a reasonable
> > size for the output.
> >
> > I agree with Vinoth's point:
> > > The main blocker for us to send inserts into logs, is having the
> > > ability to do log indexing (we wanted to support someone who may want
> > > to do inserts and suddenly wants to upsert the table)
> >
> > Logs are append-only most of the time. Due to GDPR or other compliance,
> > we may have to scrub some fields later.
> > It looks like we could phase the support: 1 is to write parquet as log
> > files; 2 is to support upsert on demand. This seems to be a different
> > table type (neither COW nor MOR. Sounds like Merge-on-demand?)
> >
> > On Sun, May 17, 2020 at 10:10 AM wei li <[email protected]> wrote:
> >
> > > Thanks, Vinoth Chandar.
> > > Just like https://issues.apache.org/jira/projects/HUDI/issues/HUDI-112,
> > > we need a mechanism that solves two issues:
> > > 1. On the write side: do not compact, for faster writes (merge on read
> > > can solve this today).
> > > 2. Compaction and read: also a mechanism to collapse older, smaller
> > > files into larger ones while keeping query cost low (with merge on
> > > read, if we do not compact, the real-time read will be slow).
> > >
> > > We have an option:
> > > 1. On the write side: just write parquet, with no compaction.
> > > 2. Compaction and read: because the small files are parquet, the
> > > real-time read can be fast, and users can also run asynchronous
> > > compaction to collapse older, smaller parquet files into larger ones.
> > >
> > > Best Regards,
> > > Wei Li.
> > >
> > > On 2020/05/14 16:54:24, Vinoth Chandar <[email protected]> wrote:
> > > > Hi Wei,
> > > >
> > > > Thanks for starting this thread. I am trying to understand your
> > > > concern - which seems to be that for inserts, we write parquet files
> > > > instead of logging? FWIW Hudi already supports asynchronous
> > > > compaction... and a record reader flag that can avoid merging for
> > > > cases where there are only inserts..
> > > >
> > > > The main blocker for us to send inserts into logs, is having the
> > > > ability to do log indexing (we wanted to support someone who may
> > > > want to do inserts and suddenly wants to upsert the table).. If we
> > > > can sacrifice on that initially, it's very doable.
> > > >
> > > > Will wait for others to chime in as well.
> > > >
> > > > On Thu, May 14, 2020 at 9:06 AM wei li <[email protected]> wrote:
> > > >
> > > > > The business scenarios of the data lake mainly include analysis of
> > > > > databases, logs, and files.
> > > > > [image: 11111.jpg]
> > > > >
> > > > > At present, Hudi supports the scenario where database CDC is
> > > > > incrementally written to Hudi fairly well, and bulk-loading files
> > > > > into Hudi is also being worked on.
> > > > >
> > > > > However, there is no good native support for log scenarios
> > > > > (requiring high-throughput writes, no updates or deletions, and
> > > > > focusing on small-file handling). Today we can write through
> > > > > inserts without deduplication, but they will still be merged on
> > > > > the write side.
> > > > >
> > > > > - In copy-on-write mode, when "hoodie.parquet.small.file.limit" is
> > > > > 100MB, every small batch costs some time for the merge, which
> > > > > reduces write throughput.
> > > > > - This scenario is not a good fit for merge on read.
> > > > > - The actual scenario only needs to write parquet in batches on
> > > > > the write path, and then provide compaction afterwards (similar to
> > > > > Delta Lake).
> > > > >
> > > > > I created an RFC with more details:
> > > > > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+hudi+support+log+append+scenario+with+better+write+and+asynchronous+compaction
> > > > >
> > > > > Best Regards,
> > > > > Wei Li.
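For anyone who wants to poke at the insert path being discussed, here is a rough, untested sketch of an append-only write into a COW table that leans on hoodie.parquet.small.file.limit for file sizing, using the standard Spark datasource options. The table name, paths, and field names are placeholders I made up for illustration; adjust to your own dataset.

// Rough sketch only (untested): append-only write into a COW table via the
// Spark datasource. Assumes the hudi-spark-bundle is on the classpath.
// Table name, paths, and field names below are hypothetical placeholders.
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("hudi-append-only-sketch")
  .getOrCreate()

// Hypothetical input: a batch of log events landing as JSON.
val df = spark.read.json("/data/incoming/logs/")

df.write
  .format("org.apache.hudi")
  .option("hoodie.table.name", "log_events")                        // hypothetical table name
  .option("hoodie.datasource.write.operation", "insert")            // append-only: no dedup/merge by key
  .option("hoodie.datasource.write.recordkey.field", "event_id")    // hypothetical key field
  .option("hoodie.datasource.write.precombine.field", "ts")         // hypothetical ordering field
  .option("hoodie.datasource.write.partitionpath.field", "dt")      // hypothetical partition field
  .option("hoodie.parquet.small.file.limit", (100 * 1024 * 1024).toString) // 100MB, as discussed above
  .mode(SaveMode.Append)
  .save("/warehouse/hudi/log_events")                               // hypothetical base path

As I read the RFC, the point is to skip the small-file bin-packing on this write path entirely and leave it to asynchronous compaction, which is exactly the trade-off being debated above.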
