JingsongLi commented on code in PR #4255:
URL: https://github.com/apache/paimon/pull/4255#discussion_r1776187272
##########
docs/content/maintenance/write-performance.md:
##########

@@ -160,3 +160,16 @@ You can use fine-grained-resource-management of Flink to increase committer heap
 1. Configure Flink Configuration `cluster.fine-grained-resource-management.enabled: true`. (This is default after Flink 1.18)
 2. Configure Paimon Table Options: `sink.committer-memory`, for example 300 MB, depends on your `TaskManager`. (`sink.committer-cpu` is also supported)
+
+## Changelog Compaction
+
+If Flink's checkpoint interval is short (for example, 30 seconds) and the number of buckets is large,
+each snapshot may produce lots of small changelog files.
+Too many files may put a burden on the distributed storage cluster.
+
+In order to compact small changelog files into large ones, you can set the table option `changelog.compact.parallelism`.
+This option will add a compact operator after the writer operator, which copies small changelog files into large ones.
+If the parallelism becomes larger, file copying will become faster.
+However, the number of resulting files will also become larger.
+As file copying is fast in most storage systems,
+we suggest that you start experimenting with `'changelog.compact.parallelism' = '1'` and increase the value only if needed.

Review Comment:
   My idea is to have only one switch: `changelog.precommit-compact` = `true`.

   We can add a Coordinator node to this pipeline to decide how to concatenate the changelog files into result files of the target file size, which can be one or multiple files.
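   For concreteness, a minimal sketch of how the option proposed in this diff could be used, assuming Paimon's Flink SQL `ALTER TABLE ... SET` syntax for table options; `my_table` and the checkpoint interval are placeholder values:

   ```sql
   -- A short checkpoint interval like this is what produces many small changelog files.
   SET 'execution.checkpointing.interval' = '30 s';

   -- Start with parallelism 1, as the diff suggests, and raise it only if copying is too slow.
   ALTER TABLE my_table SET ('changelog.compact.parallelism' = '1');
   ```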
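   If the single-switch design from this review comment were adopted, enabling it might look like the sketch below; note that `changelog.precommit-compact` is a proposal from the comment above, not a released option:

   ```sql
   -- Hypothetical: one boolean switch instead of a parallelism knob; a Coordinator node
   -- would decide how to concatenate changelog files up to the target file size.
   ALTER TABLE my_table SET ('changelog.precommit-compact' = 'true');
   ```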