Thank you for your patience... This was a thought-provoking RFC. I think we can solve an even more generalized problem here: data clustering (which today we support only in a limited form, for bulk_insert).
Please read my comment here:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+hudi+support+log+append+scenario+with+better+write+and+asynchronous+compaction?focusedCommentId=154995482#comment-154995482

A few notable changes I am suggesting:

- First of all, let's give this a better (IMHO) action name: *clustering* (since it clusters file groups together based on some criteria; we will get to those later). We will continue referring to what we do today as *compaction*.
- Let's implement this as a "write mode", rather than a new append API? I would like to keep things simple: insert, delete, update, as they are now. As you will see below, what I am suggesting is a generalization of what was proposed in the RFC. If we are going to collapse file groups, then we might as well do things like sorting (which we already support for bulk_insert alone) to speed up queries. A user may also want to do this clustering without needing to write small files or ingest quickly.
- We should assume that we will cluster N input file groups into M output file groups, not just 1 output file group. Say we target a file size of 256MB; it might turn out that all your accumulated small groups add up to about 450MB, requiring two output file groups instead of one. (This introduces a few limitations, as we will see; a rough sketch of the idea follows after the quoted thread below.)

On Tue, May 19, 2020 at 6:21 PM leesf <[email protected]> wrote:

> +1 from me, also I updated the RFC-19, please take another look when you
> get a chance.
>
> Vinoth Chandar <[email protected]> wrote on Wed, May 20, 2020 at 1:43 AM:
>
> > Bear with me for 1-2 days.. Will circle around on this.. This is a dear
> > topic to me as well :)
> >
> > On Tue, May 19, 2020 at 9:21 AM Shiyan Xu <[email protected]> wrote:
> >
> > > Hi Wei,
> > >
> > > +1 on the proposal; append-only is a commonly seen use case.
> > >
> > > IIUC, the main concern is that Hudi by default generates small files
> > > internally in COW tables, and that setting
> > > `hoodie.parquet.small.file.limit` can reduce the number of small files
> > > but slows down the pipeline (by doing compaction).
> > >
> > > To the option you mentioned: when writing to parquet directly, did you
> > > consider setting params for bulk write? It should be possible to make
> > > bulk write bounded by time and size, so that you always have a
> > > reasonable size for the output.
> > >
> > > I agree with Vinoth's point
> > > > The main blocker for us to send inserts into logs, is having the
> > > > ability to do log indexing (we wanted to support someone who may
> > > > want to do inserts and suddenly wants to upsert the table)
> > >
> > > Logs are most of the time append-only. Due to GDPR or other
> > > compliance, we may have to scrub some fields later.
> > > Looks like we may phase the support: 1 is to write parquet as log
> > > files; 2 is to support upsert on demand. This seems to be a different
> > > table type (neither COW nor MOR; sounds like merge-on-demand?)
> > >
> > > On Sun, May 17, 2020 at 10:10 AM wei li <[email protected]> wrote:
> > >
> > > > Thanks, Vinoth Chandar.
> > > > Just like https://issues.apache.org/jira/projects/HUDI/issues/HUDI-112,
> > > > we need a mechanism that solves two issues:
> > > > 1. On the write side: do not compact, for faster writes (merge on
> > > > read can solve this problem today).
> > > > 2. Compaction and read: we also need a mechanism to collapse older,
> > > > smaller files into larger ones while keeping the query cost low
> > > > (with merge on read, if we do not compact, the realtime read will
> > > > be slow).
> > > >
> > > > We have an option:
> > > > 1. On the write side: just write parquet, no compaction.
> > > > 2. Compaction and read: because the small files are parquet, the
> > > > realtime read can be fast, and the user can run asynchronous
> > > > compaction to collapse older, smaller parquet files into larger
> > > > parquet files.
> > > >
> > > > Best Regards,
> > > > Wei Li.
> > > >
> > > > On 2020/05/14 16:54:24, Vinoth Chandar <[email protected]> wrote:
> > > > > Hi Wei,
> > > > >
> > > > > Thanks for starting this thread. I am trying to understand your
> > > > > concern - which seems to be that for inserts, we write parquet
> > > > > files instead of logging? FWIW Hudi already supports asynchronous
> > > > > compaction... and a record reader flag that can avoid merging for
> > > > > cases where there are only inserts..
> > > > >
> > > > > The main blocker for us to send inserts into logs, is having the
> > > > > ability to do log indexing (we wanted to support someone who may
> > > > > want to do inserts and suddenly wants to upsert the table).. If we
> > > > > can sacrifice on that initially, it's very doable.
> > > > >
> > > > > Will wait for others to chime in as well.
> > > > >
> > > > > On Thu, May 14, 2020 at 9:06 AM wei li <[email protected]> wrote:
> > > > >
> > > > > > The business scenarios of the data lake mainly include analysis
> > > > > > of databases, logs, and files.
> > > > > > [image: 11111.jpg]
> > > > > >
> > > > > > At present, Hudi supports the scenario where database CDC is
> > > > > > incrementally written to Hudi quite well, and it can also
> > > > > > bulk-load files into Hudi.
> > > > > >
> > > > > > However, there is no good native support for log scenarios
> > > > > > (requiring high-throughput writes, no updates or deletions, and
> > > > > > focusing on small-file scenarios); today one can write through
> > > > > > inserts without deduplication, but they will still merge on the
> > > > > > write side.
> > > > > >
> > > > > > - In copy-on-write mode, when "hoodie.parquet.small.file.limit"
> > > > > > is 100MB, every small batch will spend some time on merging,
> > > > > > which reduces write throughput.
> > > > > > - This scenario is not a good fit for merge on read.
> > > > > > - The actual scenario only needs to write parquet in batches
> > > > > > when writing, and then provide reverse compaction (similar to
> > > > > > Delta Lake).
> > > > > >
> > > > > > I created an RFC with more details:
> > > > > > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+hudi+support+log+append+scenario+with+better+write+and+asynchronous+compaction
> > > > > >
> > > > > > Best Regards,
> > > > > > Wei Li.
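P.S. To make the "N input file groups into M output file groups" point concrete, below is a rough, purely illustrative sketch of one way such size-based packing could work. FileGroup and planClustering are hypothetical names, not Hudi APIs; a real planner would also need to take the clustering criteria (e.g. sort columns) into account.

// Greedily pack small file groups into clustered output groups bounded by a
// target size. Purely illustrative; not a Hudi API.
case class FileGroup(id: String, sizeBytes: Long)

def planClustering(smallGroups: Seq[FileGroup],
                   targetBytes: Long = 256L * 1024 * 1024): Seq[Seq[FileGroup]] = {
  val plans = scala.collection.mutable.ListBuffer.empty[Seq[FileGroup]]
  var current = Vector.empty[FileGroup]
  var currentSize = 0L
  for (fg <- smallGroups.sortBy(-_.sizeBytes)) {
    // Close the current output group once adding another input would exceed the target.
    if (current.nonEmpty && currentSize + fg.sizeBytes > targetBytes) {
      plans += current
      current = Vector.empty
      currentSize = 0L
    }
    current :+= fg
    currentSize += fg.sizeBytes
  }
  if (current.nonEmpty) plans += current
  plans.toList
}

For example, many small file groups adding up to roughly 450MB against a 256MB target pack into two output file groups rather than one, which is why we should plan for N-to-M and not N-to-1.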

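P.P.S. For anyone reproducing the small-file slowdown discussed in the thread: the `hoodie.parquet.small.file.limit` knob is passed like any other write option on the COW path. A rough spark-shell sketch follows; the table name, paths, and record/ordering fields are placeholders, not from the RFC.

import org.apache.spark.sql.SaveMode

// Assumes a spark-shell session with the Hudi bundle on the classpath;
// the input path and field names below are placeholders.
val df = spark.read.json("/tmp/incoming/logs")

df.write
  .format("hudi")
  .option("hoodie.table.name", "logs_cow")                   // placeholder table name
  .option("hoodie.datasource.write.recordkey.field", "uuid") // placeholder key field
  .option("hoodie.datasource.write.precombine.field", "ts")  // placeholder ordering field
  .option("hoodie.datasource.write.operation", "insert")
  // Files below this size are treated as "small" and new inserts are packed into
  // them, which keeps file sizes healthy but adds merge work on every batch (the
  // slowdown discussed above). Setting it to 0 turns the feature off.
  .option("hoodie.parquet.small.file.limit", 100L * 1024 * 1024)
  .mode(SaveMode.Append)
  .save("/tmp/hudi/logs_cow")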