This is an automated email from the ASF dual-hosted git repository.
lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-paimon.git
The following commit(s) were added to refs/heads/master by this push:
new 67851ca22 [docs] docs for unaware bucket (#1510)
67851ca22 is described below
commit 67851ca222d0f8cf3473c90b0c2f1316426dceb8
Author: YeJunHao <[email protected]>
AuthorDate: Fri Jul 21 19:07:44 2023 +0800
[docs] docs for unaware bucket (#1510)
---
docs/content/concepts/append-only-table.md | 91 +++++++++++++++++++++++++----
docs/static/img/for-queue.png | Bin 0 -> 339649 bytes
docs/static/img/for-scalable.png | Bin 0 -> 1331757 bytes
docs/static/img/unaware-bucket-topo.png | Bin 0 -> 41276 bytes
4 files changed, 79 insertions(+), 12 deletions(-)
diff --git a/docs/content/concepts/append-only-table.md
b/docs/content/concepts/append-only-table.md
index fd6ea6d82..cbede7344 100644
--- a/docs/content/concepts/append-only-table.md
+++ b/docs/content/concepts/append-only-table.md
@@ -26,19 +26,25 @@ under the License.
# Append Only Table
-If a table does not have a primary key defined, it is an append-only table by
default.
+If a table does not have a primary key defined, it is an append-only table by default. Depending on the bucket definition,
+there are two different append-only modes: "Append For Queue" and "Append For Scalable Table".
-You can only insert a complete record into the table. No delete or update is
supported and you cannot define primary keys.
+## Append For Queue
+
+You can only insert a complete record into the table. No delete or update is
supported, and you cannot define primary keys.
This type of table is suitable for use cases that do not require updates (such
as log data synchronization).
-## Bucketing
+### Definition
+
+In this mode, you can regard the append-only table as a queue separated by bucket. Records in the same bucket are strictly ordered,
+and streaming reads deliver them downstream exactly in the order they were written. No special configuration is needed
+to use this mode; by default, all data goes into a single bucket as one queue. You can also define `bucket` and
+`bucket-key` to increase parallelism and disperse data (see [Example]({{< ref "#example" >}})).
-You can also define bucket number for Append-only table, see [Bucket]({{< ref
"concepts/basic-concepts#bucket" >}}).
+{{< img src="/img/for-queue.png">}}
-It is recommended that you set the `bucket-key` field. Otherwise, the data
will be hashed according to the whole row,
-and the performance will be poor.
-## Compaction
+### Compaction
By default, the sink node will automatically perform compaction to control the
number of files. The following options
control the strategy of compaction:
@@ -80,11 +86,11 @@ control the strategy of compaction:
</tbody>
</table>
-## Streaming Source
+### Streaming Source
Streaming source behavior is only supported in Flink engine at present.
-### Streaming Read Order
+#### Streaming Read Order
For streaming reads, records are produced in the following order:
@@ -94,7 +100,7 @@ For streaming reads, records are produced in the following
order:
* For any two records from the same partition and the same bucket, the first
written record will be produced first.
* For any two records from the same partition but two different buckets,
different buckets are processed by different tasks, there is no order guarantee
between them.
-### Watermark Definition
+#### Watermark Definition
You can define watermark for reading Paimon tables:
@@ -139,7 +145,7 @@ which will make sure no sources/splits/shards/partitions
increase their watermar
</tbody>
</table>
-### Bounded Stream
+#### Bounded Stream
Streaming Source can also be bounded, you can specify 'scan.bounded.watermark'
to define the end condition for bounded streaming mode, stream reading will end
until a larger watermark snapshot is encountered.
@@ -162,7 +168,7 @@ INSERT INTO paimon_table SELECT * FROM kakfa_table;
SELECT * FROM paimon_table /*+ OPTIONS('scan.bounded.watermark'='...') */;
```
-## Example
+### Example
The following is an example of creating the Append-Only table and specifying
the bucket key.
@@ -180,7 +186,68 @@ CREATE TABLE MyTable (
'bucket-key' = 'product_id'
);
```
+{{< /tab >}}
+
+{{< /tabs >}}
+
+
+
+## Append For Scalable Table
+
+### Definition
+
+By defining `'bucket' = '-1'` in table properties, you can assign a special mode (we call it "unaware-bucket mode") to the
+table (see [Example]({{< ref "#example-1" >}})). In this mode, things work quite differently: the concept of bucket
+no longer exists, and the order of streaming reads is not guaranteed. The table is treated as a batch offline table
+(although it can still be read and written in streaming mode). All records go into a single directory (for compatibility, they are placed in bucket-0),
+and ordering is no longer maintained. Since there is no bucket concept, input records are no longer shuffled by bucket,
+which speeds up inserting.
+
+{{< img src="/img/for-scalable.png">}}
+
+
+### Compaction
+
+In unaware-bucket mode, compaction is not performed in the writer. Instead, a `Compact Coordinator` scans for small files and submits compaction tasks
+to `Compact Worker`s. This makes it easy to compact a single data directory in parallel. In streaming mode, if you run an insert SQL job in Flink,
+the topology will look like this:
+
+{{< img src="/img/unaware-bucket-topo.png">}}
+
+It will do its best to compact small files, but when a single small file in a partition remains for a long time
+and no new files are added to that partition, the `Compact Coordinator` will remove it from memory to reduce memory usage.
+After you restart the job, it will scan the small files and add them to memory again. The options controlling the compaction
+behavior are exactly the same as for [Append For Queue]({{< ref "#compaction" >}}). If you set `write-only` to true, the
+`Compact Coordinator` and `Compact Worker` will be removed from the topology.
+
+Automatic compaction is only supported in Flink engine streaming mode. You can also start a compaction job in Flink via the Paimon Flink action
+and disable all other compaction by setting `write-only`.
+
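+Disabling write-side compaction in this mode can be sketched as follows (a minimal sketch; the table name and columns are illustrative):
+
+```sql
+CREATE TABLE MyWriteOnlyTable (
+    product_id BIGINT,
+    price DOUBLE,
+    sales BIGINT
+) WITH (
+    -- unaware-bucket mode
+    'bucket' = '-1',
+    -- skip compaction in the writer; a dedicated compaction job handles it instead
+    'write-only' = 'true'
+);
+```
+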
+### Streaming Source
+An append-only table in unaware-bucket mode supports streaming read and write, but it no longer guarantees order. You cannot regard it
+as a queue; instead, regard it as a lake with storage bins. Every commit generates a new record bin, and we can read the
+incremental data by reading the new record bins, but records within one bin can land anywhere, so we fetch them in an arbitrary order.
+In `Append For Queue` mode, by contrast, records are stored not in bins but in a record pipe. We can see the difference below.
+
+
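+A streaming read of such a table in Flink SQL is just a plain query run in streaming execution mode, for example (the table name is illustrative):
+
+```sql
+-- In Flink streaming mode, this continuously reads new record bins as they are committed;
+-- no per-record ordering is guaranteed in unaware-bucket mode.
+SELECT * FROM MyTable;
+```
+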
+### Example
+
+The following is an example of creating an append-only table in unaware-bucket mode.
+
+{{< tabs "create-append-only-table-unaware-bucket" >}}
+
+{{< tab "Flink" >}}
+
+```sql
+CREATE TABLE MyTable (
+ product_id BIGINT,
+ price DOUBLE,
+ sales BIGINT
+) WITH (
+ 'bucket' = '-1'
+);
+```
{{< /tab >}}
{{< /tabs >}}
\ No newline at end of file
diff --git a/docs/static/img/for-queue.png b/docs/static/img/for-queue.png
new file mode 100644
index 000000000..5e453b7c1
Binary files /dev/null and b/docs/static/img/for-queue.png differ
diff --git a/docs/static/img/for-scalable.png b/docs/static/img/for-scalable.png
new file mode 100644
index 000000000..ea5a015c2
Binary files /dev/null and b/docs/static/img/for-scalable.png differ
diff --git a/docs/static/img/unaware-bucket-topo.png
b/docs/static/img/unaware-bucket-topo.png
new file mode 100644
index 000000000..73bc86205
Binary files /dev/null and b/docs/static/img/unaware-bucket-topo.png differ