stevenzwu commented on pull request #2867:
URL: https://github.com/apache/iceberg/pull/2867#issuecomment-892373055


   if the writers run at high parallelism (e.g., 100) and the compactor runs at 
parallelism 1, it is likely the compactor can't keep up with the workload. Even 
though compaction runs asynchronously with snapshotState(), it will 
eventually back up and block the threads executing notifyCheckpointComplete().
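As a rough illustration of why a single compactor backs up (all numbers below are hypothetical, not from the PR): if the writers collectively emit more files per checkpoint than the compactor can rewrite in one checkpoint interval, the backlog grows without bound.

```java
public class BacklogSketch {
    // Files still queued for compaction after n checkpoints, given how many
    // files are produced and how many can be compacted per checkpoint interval.
    static long backlogAfter(int checkpoints, long producedPerCp, long compactedPerCp) {
        long backlog = 0;
        for (int i = 0; i < checkpoints; i++) {
            backlog += producedPerCp;
            backlog -= Math.min(backlog, compactedPerCp);
        }
        return backlog;
    }

    public static void main(String[] args) {
        // 100 parallel writers, 1 file each per checkpoint; a single compactor
        // that can only rewrite 40 files per interval (made-up numbers).
        System.out.println(backlogAfter(10, 100, 40));
    }
}
```

Once the backlog exceeds whatever buffering exists, the pressure surfaces in notifyCheckpointComplete(), as described above.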
   
   In the streaming ingestion path, here are a few things we can do or improve 
to mitigate small files:
   1. tune the writer parallelism
   2. for partitioned tables, we can use DistributionMode to improve data 
clustering, so that not every writer task writes to many partitions at the same 
time. Right now, the Flink writer only supports HASH mode. In the future, RANGE 
or BINPACKING might be more useful and general; they can also improve query 
performance through better data clustering.
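Conceptually, HASH distribution keys each record by its partition value so that all records for a given partition are routed to a single writer subtask, instead of every subtask opening a file in every partition. This is a sketch of the routing idea only, not the actual Iceberg Flink implementation; the routing function is hypothetical:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch of HASH distribution mode: route each record to a writer
// subtask based on the hash of its partition key, so a given partition is
// written by exactly one subtask rather than all of them.
public class HashDistributionSketch {

    // Hypothetical routing function: maps a partition key to one of
    // `numWriters` writer subtasks.
    static int writerFor(String partitionKey, int numWriters) {
        return Math.floorMod(partitionKey.hashCode(), numWriters);
    }

    public static void main(String[] args) {
        int numWriters = 4;
        List<String> partitions = List.of(
            "dt=2021-08-01", "dt=2021-08-02", "dt=2021-08-03", "dt=2021-08-01");

        // With hash clustering, records of the same partition always land on
        // the same writer, so per-checkpoint file counts shrink.
        Map<String, Integer> assignment = new TreeMap<>();
        for (String p : partitions) {
            assignment.put(p, writerFor(p, numWriters));
        }
        assignment.forEach((p, w) -> System.out.println(p + " -> writer " + w));
    }
}
```

The trade-off is potential skew: a hot partition concentrates on one subtask, which is part of why RANGE or BINPACKING could be more general.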
   
   Even with the above changes, Flink streaming ingestion can still generate 
small files. The parallel compactors and second committer that Ryan mentioned 
might be able to keep up with the throughput. However, I would personally 
rather not over-complicate the streaming ingestion path and make it less 
stable. Let's get the data into long-term storage (like Iceberg tables) first; 
other optimizations (like compaction or sorting) can happen in the background 
via scheduled (Spark) batch jobs.
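The background compaction those batch jobs perform boils down to bin-packing many small data files into groups near a target output size and rewriting each group as one file. A minimal sketch of that grouping step (the sizes and target are made up for illustration; real bin-pack planners also handle partitions, delete files, etc.):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the planning step of bin-pack compaction: greedily group small
// files until a group reaches the target size, so each group can then be
// rewritten as roughly one target-sized file.
public class BinPackSketch {

    static List<List<Long>> planGroups(List<Long> fileSizes, long targetSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            // Start a new group once adding this file would overshoot the target.
            if (currentSize + size > targetSize && !current.isEmpty()) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // e.g. six ~32 MB files from streaming checkpoints, 128 MB target
        List<Long> sizes = List.of(32L, 32L, 32L, 32L, 32L, 32L);
        System.out.println(planGroups(sizes, 128L));
    }
}
```

Running this planning offline in a scheduled batch job keeps the streaming path simple, which is the point of the comment above.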
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


