This is an automated email from the ASF dual-hosted git repository.
lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-paimon.git
The following commit(s) were added to refs/heads/master by this push:
new d2978ddb6 [doc] Move Watermark and Bounded Stream to Append only table
page
d2978ddb6 is described below
commit d2978ddb6720186d27110f7ff14d06f3ed688b7c
Author: JingsongLi <[email protected]>
AuthorDate: Wed Apr 5 14:54:23 2023 +0800
[doc] Move Watermark and Bounded Stream to Append only table page
---
docs/content/concepts/append-only-table.md | 100 +++++++++++++++++++++++++----
docs/content/how-to/querying-tables.md | 72 ---------------------
2 files changed, 89 insertions(+), 83 deletions(-)
diff --git a/docs/content/concepts/append-only-table.md
b/docs/content/concepts/append-only-table.md
index 502593ce8..e20b9252c 100644
--- a/docs/content/concepts/append-only-table.md
+++ b/docs/content/concepts/append-only-table.md
@@ -38,16 +38,6 @@ You can also define bucket number for Append-only table, see
[Bucket]({{< ref "c
It is recommended that you set the `bucket-key` field. Otherwise, the data
will be hashed according to the whole row,
and the performance will be poor.
-## Streaming Read Order
-
-For streaming reads, records are produced in the following order:
-
-* For any two records from two different partitions
- * If `scan.plan-sort-partition` is set to true, the record with a smaller
partition value will be produced first.
- * Otherwise, the record with an earlier partition creation time will be
produced first.
-* For any two records from the same partition and the same bucket, the first
written record will be produced first.
-* For any two records from the same partition but two different buckets,
different buckets are processed by different tasks, there is no order guarantee
between them.
-
## Compaction
By default, the sink node will automatically perform compaction to control the
number of files. The following options
@@ -76,14 +66,102 @@ control the strategy of compaction:
<td>For file set [f_0,...,f_N], the minimum file number which
satisfies sum(size(f_i)) >= targetFileSize to trigger a compaction for
append-only table. This value avoids almost-full-file to be compacted, which is
not cost-effective.</td>
</tr>
<tr>
- <td><h5>compaction.early-max.file-num</h5></td>
+ <td><h5>compaction.max.file-num</h5></td>
<td style="word-wrap: break-word;">50</td>
<td>Integer</td>
<td>For file set [f_0,...,f_N], the maximum file number to trigger
a compaction for append-only table, even if sum(size(f_i)) < targetFileSize.
This value avoids pending too much small files, which slows down the
performance.</td>
</tr>
+ <tr>
+ <td><h5>full-compaction.delta-commits</h5></td>
+ <td style="word-wrap: break-word;">(none)</td>
+ <td>Integer</td>
+ <td>Full compaction will be constantly triggered after delta
commits.</td>
+ </tr>
+ </tbody>
+</table>
+
+## Streaming Source
+
+Streaming source behavior is currently only supported in the Flink engine.
+
+### Streaming Read Order
+
+For streaming reads, records are produced in the following order:
+
+* For any two records from two different partitions
+ * If `scan.plan-sort-partition` is set to true, the record with a smaller
partition value will be produced first.
+ * Otherwise, the record with an earlier partition creation time will be
produced first.
+* For any two records from the same partition and the same bucket, the first
written record will be produced first.
+* For any two records from the same partition but two different buckets,
different buckets are processed by different tasks, there is no order guarantee
between them.
+
+### Watermark Definition
+
+You can define watermarks for reading Paimon tables:
+
+```sql
+CREATE TABLE T (
+ `user` BIGINT,
+ product STRING,
+ order_time TIMESTAMP(3),
+ WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
+) WITH (...);
+
+-- launch a streaming window-aggregation job to read T
+SELECT window_start, window_end, SUM(f0)
+FROM TABLE(TUMBLE(TABLE T, DESCRIPTOR(order_time), INTERVAL '10' MINUTES))
+GROUP BY window_start, window_end;
+```
+
+You can also enable [Flink Watermark
alignment](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment-_beta_),
+which will make sure no sources/splits/shards/partitions increase their
watermarks too far ahead of the rest:
+
+<table class="configuration table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left" style="width: 20%">Key</th>
+ <th class="text-left" style="width: 15%">Default</th>
+ <th class="text-left" style="width: 10%">Type</th>
+ <th class="text-left" style="width: 55%">Description</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><h5>scan.watermark.alignment.group</h5></td>
+ <td style="word-wrap: break-word;">(none)</td>
+ <td>String</td>
+ <td>A group of sources to align watermarks.</td>
+ </tr>
+ <tr>
+ <td><h5>scan.watermark.alignment.max-drift</h5></td>
+ <td style="word-wrap: break-word;">(none)</td>
+ <td>Duration</td>
+ <td>Maximal drift to align watermarks, before we pause consuming
from the source/task/partition.</td>
+ </tr>
</tbody>
</table>
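+
+These options can be set per query through Flink's dynamic table options hint; the group name and max-drift values below are only illustrative:
+
+```sql
+-- align this source's watermark with the other sources in the same group,
+-- pausing consumption if it drifts more than 1 minute ahead of them
+SELECT * FROM T /*+ OPTIONS(
+    'scan.watermark.alignment.group' = 'consumer-group-1',
+    'scan.watermark.alignment.max-drift' = '1 min') */;
+```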
+
+### Bounded Stream
+
+Streaming Source can also be bounded. You can specify 'scan.bounded.watermark'
to define the end condition for bounded streaming mode: stream reading will end
once a snapshot with a larger watermark is encountered.
+
+The watermark in a snapshot is generated by the writer. For example, you can
specify a kafka source and declare a watermark definition on it.
+When you use this kafka source to write to a Paimon table, the snapshots of
the Paimon table will carry the corresponding watermark,
+so that you can use the bounded watermark feature when streaming-reading this
Paimon table.
+
+```sql
+CREATE TABLE kafka_table (
+ `user` BIGINT,
+ product STRING,
+ order_time TIMESTAMP(3),
+ WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
+) WITH ('connector' = 'kafka'...);
+
+-- launch a streaming insert job
+INSERT INTO paimon_table SELECT * FROM kafka_table;
+
+-- launch a bounded streaming job to read paimon_table
+SELECT * FROM paimon_table /*+ OPTIONS('scan.bounded.watermark'='...') */;
+```
+
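+Watermarks in Flink are millisecond timestamps, so a bounded read looks like the following sketch (the watermark value below is an illustrative epoch-millisecond value, not taken from a real run):
+
+```sql
+-- the stream ends once a snapshot with a watermark beyond this value is seen
+SELECT * FROM paimon_table
+    /*+ OPTIONS('scan.bounded.watermark' = '1680677663000') */;
+```
+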
## Example
The following is an example of creating the Append-Only table and specifying
the bucket key.
diff --git a/docs/content/how-to/querying-tables.md
b/docs/content/how-to/querying-tables.md
index 55863e705..cf8aad5e6 100644
--- a/docs/content/how-to/querying-tables.md
+++ b/docs/content/how-to/querying-tables.md
@@ -92,78 +92,6 @@ Users can also adjust `changelog-producer` table property to
specify the pattern
{{< img src="/img/scan-mode.png">}}
-## Streaming Source
-
-Streaming source behavior is only supported in Flink engine at present.
-
-### Watermark Definition
-
-You can define watermark for reading Paimon tables:
-
-```sql
-CREATE TABLE T (
- `user` BIGINT,
- product STRING,
- order_time TIMESTAMP(3),
- WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
-);
-
--- launch a bounded streaming job to read paimon_table
-SELECT window_start, window_end, SUM(f0) FROM
- TUMBLE(TABLE T, DESCRIPTOR(order_time), INTERVAL '10' MINUTES)) GROUP BY
window_start, window_end; */;
-```
-
-You can also enable [Flink Watermark
alignment](https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/event-time/generating_watermarks/#watermark-alignment-_beta_),
-which will make sure no sources/splits/shards/partitions increase their
watermarks too far ahead of the rest:
-
-<table class="configuration table table-bordered">
- <thead>
- <tr>
- <th class="text-left" style="width: 20%">Key</th>
- <th class="text-left" style="width: 15%">Default</th>
- <th class="text-left" style="width: 10%">Type</th>
- <th class="text-left" style="width: 55%">Description</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td><h5>scan.watermark.alignment.group</h5></td>
- <td style="word-wrap: break-word;">(none)</td>
- <td>String</td>
- <td>A group of sources to align watermarks.</td>
- </tr>
- <tr>
- <td><h5>scan.watermark.alignment.max-drift</h5></td>
- <td style="word-wrap: break-word;">(none)</td>
- <td>Duration</td>
- <td>Maximal drift to align watermarks, before we pause consuming
from the source/task/partition.</td>
- </tr>
- </tbody>
-</table>
-
-### Bounded Stream
-
-Streaming Source can also be bounded, you can specify 'scan.bounded.watermark'
to define the end condition for bounded streaming mode, stream reading will end
until a larger watermark snapshot is encountered.
-
-Watermark in snapshot is generated by writer, for example, you can specify a
kafka source and declare the definition of watermark.
-When using this kafka source to write to Paimon table, the snapshots of Paimon
table will generate the corresponding watermark,
-so that you can use the feature of bounded watermark when streaming reads of
this Paimon table.
-
-```sql
-CREATE TABLE kafka_table (
- `user` BIGINT,
- product STRING,
- order_time TIMESTAMP(3),
- WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
-) WITH ('connector' = 'kafka'...);
-
--- launch a streaming insert job
-INSERT INTO paimon_table SELECT * FROM kakfa_table;
-
--- launch a bounded streaming job to read paimon_table
-SELECT * FROM paimon_table /*+ OPTIONS('scan.bounded.watermark'='...') */;
-```
-
## Time Travel
Currently, Paimon supports time travel for Flink and Spark 3 (requires Spark
3.3+).