The main business scenarios for a data lake are the analysis of databases,
logs, and files.

At present, Hudi supports the database CDC scenario well, where change
records are incrementally written into Hudi, and bulk-loading files into
Hudi is also being worked on.

However, there is no good native support for the log scenario
(high-throughput writes, no updates or deletes, and lots of small files).
Today such data can be written with insert operations and without
deduplication, but the batches are still merged on the write side.
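
For example, a plain insert write like the sketch below still goes through
the small-file handling and merge logic on the write path. This is
illustrative only: the table name, paths, and field names are placeholders,
while the options are standard Hudi write configs.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("log-ingest").getOrCreate()
    // Placeholder: one incoming batch of log records (in practice from Kafka, files, etc.)
    val logDf = spark.read.json("/staging/logs/batch-0001")

    logDf.write.format("hudi")
      .option("hoodie.table.name", "log_events")
      .option("hoodie.datasource.write.operation", "insert")
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.precombine.field", "ts")
      // do not de-duplicate records within the incoming batch
      .option("hoodie.combine.before.insert", "false")
      .mode(SaveMode.Append)
      .save("/data/hudi/log_events")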

   - In copy-on-write mode, with "hoodie.parquet.small.file.limit" set to
   100 MB, every batch spends time merging into existing small files, which
   reduces write throughput.
   - Merge-on-read is not a good fit for this scenario either.
   - The scenario actually only needs to write Parquet files batch by batch
   on the write path, and then provide compaction afterwards (similar to
   Delta Lake); see the sketch after this list.
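
To make the intended separation concrete, here is a rough sketch
(illustrative only, not what the RFC implements), reusing the logDf
placeholder from the sketch above: small-file handling is turned off so
each batch simply writes new Parquet files, and rewriting the small files
into larger ones is left to a separate asynchronous job.

    // Each insert batch writes fresh Parquet files; nothing is merged on the write path.
    logDf.write.format("hudi")
      .option("hoodie.table.name", "log_events")
      .option("hoodie.datasource.write.operation", "insert")
      // a small-file limit of 0 disables small-file handling, so the writer
      // does not pick up and rewrite existing small files
      .option("hoodie.parquet.small.file.limit", "0")
      .mode(org.apache.spark.sql.SaveMode.Append)
      .save("/data/hudi/log_events")

    // The small files produced above would then be rewritten into larger ones
    // by a separate, asynchronous compaction job, rather than on every write.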


I have created an RFC with more details:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+hudi+support+log+append+scenario+with+better+write+and+asynchronous+compaction


Best Regards,
Wei Li.
