This is an automated email from the ASF dual-hosted git repository.
junhao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-paimon.git
The following commit(s) were added to refs/heads/master by this push:
new 385bd9049 [DOC] Change the description of the document about writing
to multiple partitions (#2394)
385bd9049 is described below
commit 385bd9049f449fd8872a11b53a32e8ded2a913ed
Author: PLASH SPEED <[email protected]>
AuthorDate: Mon Nov 27 13:48:47 2023 +0800
[DOC] Change the description of the document about writing to multiple
partitions (#2394)
---
docs/content/concepts/append-only-table.md | 24 +++++++++++++++++++-----
1 file changed, 19 insertions(+), 5 deletions(-)
diff --git a/docs/content/concepts/append-only-table.md b/docs/content/concepts/append-only-table.md
index c549f1d84..3e4356c5d 100644
--- a/docs/content/concepts/append-only-table.md
+++ b/docs/content/concepts/append-only-table.md
@@ -264,11 +264,25 @@ CREATE TABLE MyTable (
## Multiple Partitions Write
-While writing multiple partitions in a single insert job, we may get an out-of-memory error if
-too many records arrived between two checkpoint.
+The number of write tasks that the Paimon sink needs to handle is the number of partitions being written multiplied by the number of buckets per partition.
+We therefore need to keep the number of write tasks handled by each Paimon sink task within a reasonable range.
+If each sink task handles too many write tasks, the job not only produces too many small files but may also run into out-of-memory errors.
-You can `write-buffer-for-append` option for append-only table. Setting this parameter to true, writer will cache
-the records use Segment Pool to avoid OOM.
+In addition, write failures leave orphan files behind, which adds to the cost of maintaining Paimon, so we should avoid them as much as possible.
+For Flink jobs with auto-merge enabled, we recommend adjusting the parallelism of the Paimon sink according to the following formula (it applies to most scenarios, not only to append-only tables):
+```
+(N * B) / P < 100 (adjust this value according to the actual situation)
+N: the number of partitions to which the data is written
+B: the number of buckets
+P: the parallelism of the Paimon sink
+100: an empirically derived threshold. For Flink jobs with auto-merge disabled, this value can be reduced.
+However, note that disabling auto-merge only transfers part of the work to the user compaction job; the underlying problem still has to be dealt with.
+The total amount of work is not reduced, and the user compaction job still needs to be adjusted according to the formula above.
+```
You can also set `write-buffer-spillable` to true, writer can spill the records to disk. This can reduce small
-files as much as possible.
+files as much as possible. To use this option, your Flink cluster needs a certain amount of local disk space. This is especially important when running Flink on Kubernetes.
+
+For append-only tables, you can set the `write-buffer-for-append` option. When this parameter is set to true, the writer caches
+the records using the Segment Pool to avoid OOM.
+
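
As a rough illustration of the sizing formula added in the patch above, the following Python sketch computes the smallest sink parallelism `P` that satisfies `(N * B) / P < threshold`. The helper names `within_threshold` and `min_sink_parallelism` are hypothetical and are not part of Paimon or Flink; the default threshold of 100 is the empirical value quoted in the patch and should be tuned for the actual workload.

```python
def within_threshold(num_partitions: int, num_buckets: int,
                     parallelism: int, threshold: int = 100) -> bool:
    """Return True if (N * B) / P < threshold.

    Hypothetical helper: uses integer arithmetic (N * B < threshold * P)
    to avoid floating-point rounding.
    """
    if min(num_partitions, num_buckets, parallelism, threshold) < 1:
        raise ValueError("all arguments must be positive integers")
    return num_partitions * num_buckets < threshold * parallelism


def min_sink_parallelism(num_partitions: int, num_buckets: int,
                         threshold: int = 100) -> int:
    """Return the smallest P such that (N * B) / P < threshold.

    Hypothetical helper: because the inequality is strict,
    P must exceed (N * B) / threshold, so P = floor(N * B / threshold) + 1.
    """
    if min(num_partitions, num_buckets, threshold) < 1:
        raise ValueError("all arguments must be positive integers")
    write_tasks = num_partitions * num_buckets
    return write_tasks // threshold + 1


if __name__ == "__main__":
    # Example: 50 partitions written concurrently, 16 buckets each.
    p = min_sink_parallelism(50, 16)
    print(f"suggested minimum sink parallelism: {p}")
```

For a job writing 50 partitions with 16 buckets each (800 write tasks), a parallelism of 8 gives exactly 100 write tasks per sink task, which does not satisfy the strict inequality, so the sketch suggests 9. This is only a lower bound from the formula; the actual setting still depends on memory, file sizes, and whether auto-merge is enabled.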