[incubator-paimon] branch master updated: [doc] rephrase some maintenance docs (#1803)

lzljs3620320 Sun, 13 Aug 2023 06:24:28 -0700

This is an automated email from the ASF dual-hosted git repository.

lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-paimon.git



The following commit(s) were added to refs/heads/master by this push:
     new b4e90c6bf [doc] rephrase some maintenance docs (#1803)
b4e90c6bf is described below

commit b4e90c6bf36e1fabefed2c7fd91fb5a6bdccb89c
Author: stayrascal <[email protected]>
AuthorDate: Sun Aug 13 21:24:20 2023 +0800

    [doc] rephrase some maintenance docs (#1803)
---
 docs/content/maintenance/manage-files.md      |  2 +-
 docs/content/maintenance/multiple-writers.md  | 14 ++++-----
 docs/content/maintenance/read-performance.md  |  8 ++---
 docs/content/maintenance/write-performance.md | 43 ++++++++++++++++++++-------
 4 files changed, 44 insertions(+), 23 deletions(-)

diff --git a/docs/content/maintenance/manage-files.md b/docs/content/maintenance/manage-files.md
index 160a875a3..ab4f8b21f 100644
--- a/docs/content/maintenance/manage-files.md
+++ b/docs/content/maintenance/manage-files.md
@@ -36,7 +36,7 @@ Many users are concerned about small files, which can lead to:
 Assuming you are using Flink Writer, each checkpoint generates 1-2 snapshots, and the checkpoint forces the files to be
 generated on DFS, so the smaller the checkpoint interval the more small files will be generated.
 
-1. So first thing is decrease checkpoint interval.
+1. So first thing is increase checkpoint interval.
 
 By default, not only checkpoint will cause the file to be generated, but writer's memory (write-buffer-size) exhaustion
 will also flush data to DFS and generate the corresponding file. You can enable `write-buffer-spillable` to generate
diff --git a/docs/content/maintenance/multiple-writers.md b/docs/content/maintenance/multiple-writers.md
index 29650d523..17ede6059 100644
--- a/docs/content/maintenance/multiple-writers.md
+++ b/docs/content/maintenance/multiple-writers.md
@@ -26,10 +26,10 @@ under the License.
 
 # Multiple Writers
 
-Paimon's snapshot management supports writing to multiple writers.
+Paimon's snapshot management supports writing with multiple writers.
 
 {{< hint info >}}
-For S3-like object store, its `'RENAME'` does not have atomic semantics. We need to configure Hive metastore and
+For S3-like object store, its `'RENAME'` does not have atomic semantic. We need to configure Hive metastore and
 enable `'lock.enabled'` option for the catalog.
 {{< /hint >}}
 
@@ -39,16 +39,16 @@ historical partition.
 
 {{< img src="/img/multiple-writers.png">}}
 
-So far, everything has worked very well, but if you need to multiple writers to the same partition, things
-become a bit more complicated. For example, you don't want to use `UNION ALL`, you need to have multiple
-streaming job to write a `'partial-update'` table. Please refer to the `'Dedicated Compaction Job'` below.
+So far, everything works very well, but if you need multiple writers to write records to the same partition, it will 
+be a bit more complicated. For example, you don't want to use `UNION ALL`, you have multiple
+streaming jobs to write records to a `'partial-update'` table. Please refer to the `'Dedicated Compaction Job'` below.
 
 ## Dedicated Compaction Job
 
-By default, Paimon writers will perform compaction as needed when writing records. This is sufficient for most use cases, but there are two downsides:
+By default, Paimon writers will perform compaction as needed during writing records. This is sufficient for most use cases, but there are two downsides:
 
 * This may result in unstable write throughput because throughput might temporarily drop when performing a compaction.
-* Compaction will mark some data files as "deleted" (not really deleted, see [expiring snapshots]({{< ref "maintenance/manage-snapshots#expiring-snapshots" >}}) for more info). If multiple writers mark the same file a conflict will occur when committing the changes. Paimon will automatically resolve the conflict, but this may result in job restarts.
+* Compaction will mark some data files as "deleted" (not really deleted, see [expiring snapshots]({{< ref "maintenance/manage-snapshots#expiring-snapshots" >}}) for more info). If multiple writers mark the same file, a conflict will occur when committing the changes. Paimon will automatically resolve the conflict, but this may result in job restarts.
 
 To avoid these downsides, users can also choose to skip compactions in writers, and run a dedicated job only for compaction. As compactions are performed only by the dedicated job, writers can continuously write records without pausing and no conflicts will ever occur.
 
diff --git a/docs/content/maintenance/read-performance.md b/docs/content/maintenance/read-performance.md
index eb8182656..70b35d14e 100644
--- a/docs/content/maintenance/read-performance.md
+++ b/docs/content/maintenance/read-performance.md
@@ -39,7 +39,7 @@ this full-compaction option without any requirements, as it will have a signific
 ### Primary Key Table
 
 For Primary Key Table, it's a 'MergeOnRead' technology. When reading data, multiple layers of LSM data are merged,
-and the number of parallelism will be limited by the number of buckets. Although Paimon's merge will be efficient,
+and the number of parallelism will be limited by the number of buckets. Although Paimon's merge performance is efficient,
 it still cannot catch up with the ordinary AppendOnly table.
 
 If you want to query fast enough in certain scenarios, but can only find older data, you can:
@@ -51,9 +51,9 @@ You can flexibly balance query performance and data latency when reading.
 
 ### Append Only Table
 
-Small files can slow reading and affect DFS stability. By default, when there are more than 'compaction.max.file-num'
-(default 50) small files in a single bucket, a compaction is triggered. However, when there are multiple buckets, many
-small files will be generated.
+Small files will slow down reading performance and affect the stability of DFS. By default, when there are more than 
+'compaction.max.file-num' (default 50) small files in a single bucket, a compaction task will be triggered to compact 
+them. Furthermore, if there are multiple buckets, many small files will be generated.
 
 You can use full-compaction to reduce small files. Full-compaction will eliminate most small files.
 
diff --git a/docs/content/maintenance/write-performance.md b/docs/content/maintenance/write-performance.md
index 4bb0feec4..4ce4ed29a 100644
--- a/docs/content/maintenance/write-performance.md
+++ b/docs/content/maintenance/write-performance.md
@@ -35,7 +35,7 @@ Paimon's write performance is closely related to checkpoint, so if you need grea
 
 Option `'changelog-producer' = 'lookup' or 'full-compaction'`, and option `'full-compaction.delta-commits'` have a
 large impact on write performance, if it is a snapshot / full synchronization phase you can unset these options and
-then enable them on again when needed in the incremental phase.
+then enable them again in the incremental phase.
 
 ## Parallelism
 
@@ -80,9 +80,9 @@ performance during low write periods.
 
 ### Number of Sorted Runs to Pause Writing
 
-When number of sorted runs is small, Paimon writers will perform compaction asynchronously in separated threads, so
-records can be continuously written into the table. However to avoid unbounded growth of sorted runs, writers will
-have to pause writing when the number of sorted runs hits the threshold. The following table property determines
+When the number of sorted runs is small, Paimon writers will perform compaction asynchronously in separated threads, so
+records can be continuously written into the table. However, to avoid unbounded growth of sorted runs, writers will
+pause writing when the number of sorted runs hits the threshold. The following table property determines
 the threshold.
 
 <table class="table table-bordered">
@@ -108,9 +108,30 @@ the threshold.
 
 Write stalls will become less frequent when `num-sorted-run.stop-trigger` becomes larger, thus improving writing
 performance. However, if this value becomes too large, more memory and CPU time will be needed when querying the
-table. If you are concerned about the OOM of memory, please configure the following option `sort-spill-threshold`.
+table. If you are concerned about the OOM problem, please configure the following option.
 Its value depends on your memory size.
 
+<table class="table table-bordered">
+    <thead>
+    <tr>
+      <th class="text-left" style="width: 20%">Option</th>
+      <th class="text-left" style="width: 5%">Required</th>
+      <th class="text-left" style="width: 5%">Default</th>
+      <th class="text-left" style="width: 10%">Type</th>
+      <th class="text-left" style="width: 60%">Description</th>
+    </tr>
+    </thead>
+    <tbody>
+    <tr>
+      <td><h5>sort-spill-threshold</h5></td>
+      <td>No</td>
+      <td style="word-wrap: break-word;">(none)</td>
+      <td>Integer</td>
+      <td>If the maximum number of sort readers exceeds this value, a spill will be attempted. This prevents too many readers from consuming too much memory and causing OOM.</td>
+    </tr>
+    </tbody>
+</table>
+
 ### Number of Sorted Runs to Trigger Compaction
 
 Paimon uses [LSM tree]({{< ref "concepts/file-layouts#lsm-trees" >}}) which supports a large number of updates. LSM organizes files in several [sorted runs]({{< ref "concepts/file-layouts#sorted-runs" >}}). When querying records from an LSM tree, all sorted runs must be combined to produce a complete view of all records.
@@ -166,15 +187,15 @@ layers to be in Avro format.
 
 ## File Compression
 
-By default, Paimon uses high-performance compression algorithms such as LZ4 and SNAPPY. But their compression rate
-will be not so good. If you can reduce the write/read performance, you can modify the compression algorithm:
+By default, Paimon uses high-performance compression algorithms such as LZ4 and SNAPPY, but their compression rates
+are not so good. If you want to reduce the write/read performance, you can modify the compression algorithm:
 
 1. `'file.compression'`: Default file compression format. If you need a higher compression rate, I recommend using `'ZSTD'`.
 2. `'file.compression.per.level'`: Define different compression policies for different level. For example `'0:lz4,1:zstd'`.
 
 ## Stability
 
-If there are too few buckets, or too few resources, full-compaction may cause checkpoint to timeout, Flink's default
+If there are too few buckets or resources, full-compaction may cause the checkpoint timeout, Flink's default
 checkpoint timeout is 10 minutes.
 
 If you expect stability even in this case, you can turn up the checkpoint timeout, for example:
@@ -195,10 +216,10 @@ There are three main places in Paimon writer that takes up memory:
 
 * Writer's memory buffer, shared and preempted by all writers of a single task. This memory value can be adjusted by the `write-buffer-size` table property.
 * Memory consumed when merging several sorted runs for compaction. Can be adjusted by the `num-sorted-run.compaction-trigger` option to change the number of sorted runs to be merged.
-* If the row is very large, reading too many lines of data at once can consume a lot of memory when making a compaction. Reducing the `read.batch-size` option can alleviate the impact of this case.
-* The memory consumed by writing columnar (ORC, Parquet, etc.) file. Decreasing the `orc.write.batch-size` option can reduce the consume of memory for ORC format.
+* If the row is very large, reading too many lines of data at once will consume a lot of memory when making a compaction. Reducing the `read.batch-size` option can alleviate the impact of this case.
+* The memory consumed by writing columnar (ORC, Parquet, etc.) file. Decreasing the `orc.write.batch-size` option can reduce the consumption of memory for ORC format.
 
-If your Flink job does not rely on state, please avoid using managed memory, which you can control with the following Flink parameters:
+If your Flink job does not rely on state, please avoid using managed memory, which you can control with the following Flink parameter:
 ```shell
 taskmanager.memory.managed.size=1m
 ```
