(paimon) 05/31: [doc] Reorg Append table pages

lzljs3620320 Thu, 30 May 2024 00:40:48 -0700

This is an automated email from the ASF dual-hosted git repository.

lzljs3620320 pushed a commit to branch release-0.8
in repository https://gitbox.apache.org/repos/asf/paimon.git


commit d83a06255d2ce8f9871ef3ac72457abfc50b5753
Author: Jingsong <[email protected]>
AuthorDate: Sat May 11 14:12:00 2024 +0800

    [doc] Reorg Append table pages
---
 .../{append-queue-table.md => append-queue.md}     |  48 ++++-----
 docs/content/append-table/append-scalable-table.md | 114 ---------------------
 docs/content/append-table/append-table.md          |  64 ++++++++++++
 docs/content/append-table/overview.md              |  36 -------
 docs/content/learn-paimon/understand-files.md      |   2 +-
 docs/content/maintenance/dedicated-compaction.md   |   2 +-
 docs/content/migration/migration-from-hive.md      |   2 +-
 docs/static/img/for-scalable.png                   | Bin 1331757 -> 0 bytes
 8 files changed, 89 insertions(+), 179 deletions(-)

diff --git a/docs/content/append-table/append-queue-table.md b/docs/content/append-table/append-queue.md
similarity index 92%
rename from docs/content/append-table/append-queue-table.md
rename to docs/content/append-table/append-queue.md
index 3d3e3f22e..07bd7d980 100644
--- a/docs/content/append-table/append-queue-table.md
+++ b/docs/content/append-table/append-queue.md
@@ -1,9 +1,9 @@
 ---
-title: "Append Queue Table"
+title: "Append Queue"
 weight: 3
 type: docs
 aliases:
-- /append-table/append-queue-table.html
+- /append-table/append-queue.html
 ---
 <!--
 Licensed to the Apache Software Foundation (ASF) under one
@@ -24,17 +24,35 @@ specific language governing permissions and limitations
 under the License.
 -->
 
-# Append Queue Table
+# Append Queue
 
 ## Definition
 
 In this mode, you can regard append table as a queue separated by bucket. Every record in the same bucket is ordered strictly,
 streaming read will transfer the record to down-stream exactly in the order of writing. To use this mode, you do not need
 to config special configurations, all the data will go into one bucket as a queue. You can also define the `bucket` and
-`bucket-key` to enable larger parallelism and disperse data (see [Example]({{< ref "#example" >}})).
+`bucket-key` to enable larger parallelism and disperse data.
 
 {{< img src="/img/for-queue.png">}}
 
+Example to create append queue table:
+
+{{< tabs "create-append-queue" >}}
+{{< tab "Flink" >}}
+
+```sql
+CREATE TABLE my_table (
+    product_id BIGINT,
+    price DOUBLE,
+    sales BIGINT
+) WITH (
+    'bucket' = '8',
+    'bucket-key' = 'product_id'
+);
+```
+{{< /tab >}}
+{{< /tabs >}}
+
 ## Compaction
 
 By default, the sink node will automatically perform compaction to control the number of files. The following options
@@ -158,25 +176,3 @@ INSERT INTO paimon_table SELECT * FROM kakfa_table;
 -- launch a bounded streaming job to read paimon_table
 SELECT * FROM paimon_table /*+ OPTIONS('scan.bounded.watermark'='...') */;
 ```
-
-## Example
-
-The following is an example of creating the Append table and specifying the bucket key.
-
-{{< tabs "create-append-table" >}}
-
-{{< tab "Flink" >}}
-
-```sql
-CREATE TABLE my_table (
-                         product_id BIGINT,
-                         price DOUBLE,
-                         sales BIGINT
-) WITH (
-      'bucket' = '8',
-      'bucket-key' = 'product_id'
-      );
-```
-{{< /tab >}}
-
-{{< /tabs >}}
diff --git a/docs/content/append-table/append-scalable-table.md b/docs/content/append-table/append-scalable-table.md
deleted file mode 100644
index f2b80c202..000000000
--- a/docs/content/append-table/append-scalable-table.md
+++ /dev/null
@@ -1,114 +0,0 @@
----
-title: "Append Scalable Table"
-weight: 2
-type: docs
-aliases:
-- /append-table/append-scalable-table.html
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-# Append Scalable Table
-
-## Definition
-
-By defining `'bucket' = '-1'` in table properties, you can assign a special mode (we call it "unaware-bucket mode") to this
-table (see [Example]({{< ref "#example" >}})). In this mode, all the things are different. We don't have
-the concept of bucket anymore, and we don't guarantee the order of streaming read. We regard this table as a batch off-line table (
-although we can stream read and write still). All the records will go into one directory (for compatibility, we put them in bucket-0),
-and we do not maintain the order anymore. As we don't have the concept of bucket, we will not shuffle the input records by bucket anymore,
-which will speed up the inserting.
-
-Using this mode, you can replace your Hive table to lake table.
-
-{{< img src="/img/for-scalable.png">}}
-
-## Compaction
-
-In unaware-bucket mode, we don't do compaction in writer, instead, we use `Compact Coordinator` to scan the small files and submit compaction task
-to `Compact Worker`. By this, we can easily do compaction for one simple data directory in parallel. In streaming mode, if you run insert sql in flink,
-the topology will be like this:
-
-{{< img src="/img/unaware-bucket-topo.png">}}
-
-It will do its best to compact small files, but when a single small file in one partition remains long time
-and no new file added to the partition, the `Compact Coordinator` will remove it from memory to reduce memory usage.
-After you restart the job, it will scan the small files and add it to memory again. The options to control the compact
-behavior is exactly the same as [Append For Qeueue]({{< ref "#compaction" >}}). If you set `write-only` to true, the
-`Compact Coordinator` and `Compact Worker` will be removed in the topology.
-
-The auto compaction is only supported in Flink engine streaming mode. You can also start a compaction job in flink by flink action in paimon
-and disable all the other compaction by set `write-only`.
-
-## Sort Compact
-
-The data in a per-partition out of order will lead a slow select, compaction may slow down the inserting. It is a good choice for you to set
-`write-only` for inserting job, and after per-partition data done, trigger a partition `Sort Compact` action. See [Sort Compact]({{< ref "maintenance/dedicated-compaction#sort-compact" >}}).
-
-## Streaming Source
-
-Unaware-bucket mode append table supported streaming read and write, but no longer guarantee order anymore. You cannot regard it
-as a queue, instead, as a lake with storage bins. Every commit will generate a new record bin, we can read the
-increase by reading the new record bin, but records in one bin are flowing to anywhere they want, and we fetch them in any possible order.
-While in the `Append For Queue` mode, records are not stored in bins, but in record pipe. We can see the difference below.
-
-## Streaming Multiple Partitions Write
-
-Since the number of write tasks that Paimon-sink needs to handle is: the number of partitions to which the data is written * the number of buckets per partition.
-Therefore, we need to try to control the number of write tasks per paimon-sink task as much as possible,so that it is distributed within a reasonable range.
-If each sink-task handles too many write tasks, not only will it cause problems with too many small files, but it may also lead to out-of-memory errors.
-
-In addition, write failures introduce orphan files, which undoubtedly adds to the cost of maintaining paimon. We need to avoid this problem as much as possible.
-
-For flink-jobs with auto-merge enabled, we recommend trying to follow the following formula to adjust the parallelism of paimon-sink(This doesn't just apply to append-tables, it actually applies to most scenarios):
-```
-(N*B)/P < 100   (This value needs to be adjusted according to the actual situation)
-N(the number of partitions to which the data is written)
-B(bucket number)
-P(parallelism of paimon-sink)
-100 (This is an empirically derived threshold,For flink-jobs with auto-merge disabled, this value can be reduced.
-However, please note that you are only transferring part of the work to the user-compaction-job, you still have to deal with the problem in essence,
-the amount of work you have to deal with has not been reduced, and the user-compaction-job still needs to be adjusted according to the above formula.)
-```
-You can also set `write-buffer-spillable` to true, writer can spill the records to disk. This can reduce small
-files as much as possible.To use this option, you need to have a certain size of local disks for your flink cluster. This is especially important for those using flink on k8s.
-
-For append-table, You can set `write-buffer-for-append` option for append table. Setting this parameter to true, writer will cache
-the records use Segment Pool to avoid OOM.
-
-## Example
-
-The following is an example of creating the Append table and specifying the bucket key.
-
-{{< tabs "create-append-table-unaware-bucket" >}}
-
-{{< tab "Flink" >}}
-
-```sql
-CREATE TABLE my_table (
-    product_id BIGINT,
-    price DOUBLE,
-    sales BIGINT
-) WITH (
-    'bucket' = '-1'
-);
-```
-{{< /tab >}}
-
-{{< /tabs >}}
diff --git a/docs/content/append-table/append-table.md b/docs/content/append-table/append-table.md
new file mode 100644
index 000000000..239a19646
--- /dev/null
+++ b/docs/content/append-table/append-table.md
@@ -0,0 +1,64 @@
+---
+title: "Append Table"
+weight: 1
+type: docs
+aliases:
+- /append-table/append-table.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Append Table
+
+If a table does not have a primary key defined, it is an append table by default.
+
+You can only insert a complete record into the table in streaming. This type of table is suitable for use cases that
+do not require streaming updates (such as log data synchronization).
+
+{{< tabs "create-append-table" >}}
+{{< tab "Flink" >}}
+```sql
+CREATE TABLE my_table (
+    product_id BIGINT,
+    price DOUBLE,
+    sales BIGINT
+);
+```
+{{< /tab >}}
+{{< /tabs >}}
+
+## Data Distribution
+
+By default, append table has no bucket concept. It acts just like a Hive Table. The data files are placed under
+partitions where they can be reorganized and reordered to speed up queries.
+
+## Automatic small file merging
+
+In streaming writing job, without bucket definition, there is no compaction in writer, instead, will use
+`Compact Coordinator` to scan the small files and pass compaction task to `Compact Worker`. In streaming mode, if you
+run insert sql in flink, the topology will be like this:
+
+{{< img src="/img/unaware-bucket-topo.png">}}
+
+Do not worry about backpressure, compaction never backpressure.
+
+If you set `write-only` to true, the `Compact Coordinator` and `Compact Worker` will be removed in the topology.
+
+The auto compaction is only supported in Flink engine streaming mode. You can also start a compaction job in flink by
+flink action in paimon and disable all the other compaction by set `write-only`.
\ No newline at end of file
diff --git a/docs/content/append-table/overview.md b/docs/content/append-table/overview.md
deleted file mode 100644
index 5c0f4c560..000000000
--- a/docs/content/append-table/overview.md
+++ /dev/null
@@ -1,36 +0,0 @@
----
-title: "Overview"
-weight: 1
-type: docs
-aliases:
-- /append-table/overview.html
----
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements.  See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership.  The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License.  You may obtain a copy of the License at
-
-  http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied.  See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-
-# Overview
-
-If a table does not have a primary key defined, it is an append table by default.
-
-You can only insert a complete record into the table in streaming. This type of table is suitable for use cases that
-do not require streaming updates (such as log data synchronization).
-
-{{< hint info >}}
-We highly recommend using [Append Scalable Table]({{< ref "append-table/append-scalable-table" >}}). (bucket = -1).
-{{< /hint >}}
\ No newline at end of file
diff --git a/docs/content/learn-paimon/understand-files.md b/docs/content/learn-paimon/understand-files.md
index adbc614ba..00129c2c0 100644
--- a/docs/content/learn-paimon/understand-files.md
+++ b/docs/content/learn-paimon/understand-files.md
@@ -455,7 +455,7 @@ this means that there are at least 5 files in a bucket. If you want to reduce th
 By default, Append also does automatic compaction to reduce the number of small files.
 
 However, for Bucket's Append table, it will only compact the files within the Bucket for sequential
-purposes, which may keep more small files. See [Append Queue Table]({{< ref "append-table/append-queue-table" >}}).
+purposes, which may keep more small files. See [Append Queue]({{< ref "append-table/append-queue" >}}).
 
 ### Understand Full-Compaction
 
diff --git a/docs/content/maintenance/dedicated-compaction.md b/docs/content/maintenance/dedicated-compaction.md
index 41f0e6c91..f000849bf 100644
--- a/docs/content/maintenance/dedicated-compaction.md
+++ b/docs/content/maintenance/dedicated-compaction.md
@@ -229,7 +229,7 @@ For more usage of the compact_database action, see
 ## Sort Compact
 
 If your table is configured with [dynamic bucket primary key table]({{< ref "primary-key-table/data-distribution#dynamic-bucket" >}})
-or [unaware bucket append table]({{< ref "append-table/append-scalable-table" >}}) ,
+or [append table]({{< ref "append-table/append-table" >}}) ,
 you can trigger a compact with specified column sort to speed up queries.
 
 ```bash  
diff --git a/docs/content/migration/migration-from-hive.md b/docs/content/migration/migration-from-hive.md
index 4f0383625..dd1132444 100644
--- a/docs/content/migration/migration-from-hive.md
+++ b/docs/content/migration/migration-from-hive.md
@@ -28,7 +28,7 @@ under the License.
 
 Apache Hive supports ORC, Parquet file formats that could be migrated to Paimon.
 When migrating data to a paimon table, the origin table will be permanently disappeared. So please back up your data if you
-still need the original table. The migrated table will be [unaware-bucket append table]({{< ref "append-table/append-scalable-table" >}}).
+still need the original table. The migrated table will be [append table]({{< ref "append-table/append-table" >}}).
 
 Now, we can use paimon hive catalog with Migrate Table Procedure and Migrate File Procedure to totally migrate a table from hive to paimon.
 At the same time, you can use paimon hive catalog with Migrate Database Procedure to fully synchronize all tables in the database to paimon.
diff --git a/docs/static/img/for-scalable.png b/docs/static/img/for-scalable.png
deleted file mode 100644
index ea5a015c2..000000000
Binary files a/docs/static/img/for-scalable.png and /dev/null differ
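
Note on the new append-table.md page added by this commit: it states that setting `write-only` to true removes the `Compact Coordinator` and `Compact Worker` from the write topology. A minimal Flink SQL sketch of that setup follows. The table name `my_table` and its columns come from the page's own example. `source_table` is a hypothetical source, and placing the option in the `WITH` clause is an assumption about typical usage, not something this commit shows.

```sql
-- Append table (no primary key, no bucket definition), as in the new page's example.
CREATE TABLE my_table (
    product_id BIGINT,
    price DOUBLE,
    sales BIGINT
) WITH (
    -- Per the page: with write-only set to true, the Compact Coordinator and
    -- Compact Worker are removed from the topology of the writing job.
    'write-only' = 'true'
);

-- source_table is hypothetical; any table with matching columns would do.
INSERT INTO my_table
SELECT product_id, price, sales FROM source_table;
```

With this setting the writing job no longer merges small files itself, so the separate compaction job that the page mentions would be the place where merging happens.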
