stevenzwu commented on pull request #2867: URL: https://github.com/apache/iceberg/pull/2867#issuecomment-892373055
If the writers are parallel (like 100) and the compactor has a parallelism of one, it is likely the compactor can't keep up with the workload. Even though compaction runs asynchronously with `snapshotState()`, it will eventually back up and block the threads executing `notifyCheckpointComplete()`.

In the streaming ingestion path, here are a few things we can do or improve to mitigate the small-files problem:

1. Tune writer parallelism.
2. For partitioned tables, we can use `DistributionMode` to improve data clustering, so that we avoid every writer task writing to many partitions at the same time. Right now, the Flink writer only supports HASH mode. In the future, RANGE or BINPACKING might be more useful and general. They can also improve query performance through better data clustering.

Even with the above changes, Flink streaming ingestion can still generate small files. The parallel compactors and second committer that Ryan mentioned might be able to keep up with the throughput. However, personally I would rather not over-complicate the streaming ingestion path and make it less stable. Let's get the data into long-term storage (like Iceberg tables) first. Other optimizations (like compaction or sorting) can happen in the background with scheduled (Spark) batch jobs.
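To make the `DistributionMode` point concrete, here is a small self-contained sketch (not the Iceberg API; all names and numbers are invented for illustration) of why hash-distributing rows by partition key reduces file counts: without it, every writer subtask may open a file in every partition each checkpoint, while with it each partition is routed to exactly one writer.

```java
import java.util.*;

// Illustrative only: models how HASH distribution keys records by partition
// value so each partition lands on exactly one writer subtask, instead of
// every writer producing a file for every partition per checkpoint.
public class HashDistributionSketch {
    // Route a partition key to a writer subtask, Flink-keyBy style.
    static int writerFor(String partitionKey, int parallelism) {
        return Math.floorMod(partitionKey.hashCode(), parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 4;
        List<String> partitions = List.of("2021-08-01", "2021-08-02", "2021-08-03");

        // Without hash distribution: every writer may see rows for every
        // partition, producing up to parallelism * partitions files per checkpoint.
        int filesWithoutHash = parallelism * partitions.size();

        // With hash distribution: each partition goes to one writer, so at
        // most one file per partition per checkpoint.
        Set<String> assignments = new TreeSet<>();
        for (String p : partitions) {
            assignments.add(p + " -> writer-" + writerFor(p, parallelism));
        }

        System.out.println("files without hash per checkpoint: " + filesWithoutHash);
        System.out.println("files with hash per checkpoint: " + assignments.size());
    }
}
```

The trade-off, as noted above, is clustering versus skew: HASH concentrates each partition on one subtask, which is why RANGE or bin-packing variants could later balance load more evenly.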
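And to illustrate what the scheduled background compaction would do, here is a minimal bin-packing sketch (again invented for illustration, not Iceberg's rewrite-data-files implementation; file sizes and the target size are made-up numbers): many small files are grouped into fewer files near a target size.

```java
import java.util.*;

// Illustrative only: a background compaction pass bin-packs many small data
// files into fewer outputs near a target size.
public class CompactionSketch {
    // Greedily group file sizes so each group stays under targetBytes.
    static List<List<Long>> binPack(List<Long> fileSizes, long targetBytes) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentBytes + size > targetBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // e.g. 100 writers each flushing a ~5 MB file per checkpoint.
        List<Long> smallFiles = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            smallFiles.add(5L * 1024 * 1024);
        }
        long target = 128L * 1024 * 1024;  // hypothetical 128 MB target file size

        List<List<Long>> groups = binPack(smallFiles, target);
        System.out.println("input files: " + smallFiles.size());
        System.out.println("output files after compaction: " + groups.size());
    }
}
```

Because this pass only reads already-committed data files and commits a replace, it can run in a scheduled Spark batch job without adding complexity or instability to the streaming ingestion path.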
