This is an automated email from the ASF dual-hosted git repository.
yufei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/iceberg.git
The following commit(s) were added to refs/heads/master by this push:
new e5ad5ce4c0 Spark 3.4: Add the net_changes option in the changelog procedure doc (#8364)
e5ad5ce4c0 is described below
commit e5ad5ce4c09d1efcd26e3cf0bd27bcf8c06c2400
Author: Yufei Gu <[email protected]>
AuthorDate: Mon Aug 28 11:36:29 2023 -0700
Spark 3.4: Add the net_changes option in the changelog procedure doc (#8364)
---
docs/spark-procedures.md | 39 +++++++++++++++++++++++++++++----------
1 file changed, 29 insertions(+), 10 deletions(-)
diff --git a/docs/spark-procedures.md b/docs/spark-procedures.md
index 9bd4ec8c94..5303982b27 100644
--- a/docs/spark-procedures.md
+++ b/docs/spark-procedures.md
@@ -722,14 +722,15 @@ Creates a view that contains the changes from a given table.
#### Usage
-| Argument Name | Required? | Type | Description |
-|---------------|-----------|------|-------------|
-| `table` | ✔️ | string | Name of the source table for the changelog |
-| `changelog_view` | | string | Name of the view to create |
-| `options` | | map<string, string> | A map of Spark read options to use |
-| `compute_updates` | | boolean | Whether to compute pre/post update images (see below for more information). Defaults to false. |
-| `identifier_columns` | | array<string> | The list of identifier columns to compute updates. If the argument `compute_updates` is set to true and `identifier_columns` are not provided, the table’s current identifier fields will be used to compute updates. |
-| `remove_carryovers` | | boolean | Whether to remove carry-over rows (see below for more information). Defaults to true. |
+| Argument Name        | Required? | Type                | Description |
+|----------------------|-----------|---------------------|-------------|
+| `table`              | ✔️        | string              | Name of the source table for the changelog |
+| `changelog_view`     |           | string              | Name of the view to create |
+| `options`            |           | map<string, string> | A map of Spark read options to use |
+| `net_changes`        |           | boolean             | Whether to output net changes (see below for more information). Defaults to false. |
+| `compute_updates`    |           | boolean             | Whether to compute pre/post update images (see below for more information). Defaults to false. |
+| `identifier_columns` |           | array<string>       | The list of identifier columns to compute updates. If the argument `compute_updates` is set to true and `identifier_columns` are not provided, the table’s current identifier fields will be used. |
+| `remove_carryovers`  |           | boolean             | Whether to remove carry-over rows (see below for more information). Defaults to true. Deprecated since 1.4.0, will be removed in 1.5.0; please query `SparkChangelogTable` to view carry-over rows. |
Here is a list of commonly used Spark read options:
* `start-snapshot-id`: the exclusive start snapshot ID. If not provided, it reads from the table’s first snapshot inclusively.
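Putting the arguments and read options together, a sketch of a call that bounds the changelog to a snapshot range and computes update images over explicit identifier columns (the table name, snapshot IDs, and column list here are placeholders, not values from this commit):

```sql
CALL spark_catalog.system.create_changelog_view(
  table => 'db.tbl',
  options => map('start-snapshot-id', '1', 'end-snapshot-id', '2'),
  compute_updates => true,
  identifier_columns => array('id')
)
```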
@@ -792,6 +793,22 @@ second snapshot deleted 1 record.
|2 | Bob |INSERT |0 |5390529835796506035|
|1 | Alice |DELETE |1 |8764748981452218370|
+Create a changelog view that computes net changes. It removes intermediate changes and outputs only the net changes.
+```sql
+CALL spark_catalog.system.create_changelog_view(
+ table => 'db.tbl',
+ options => map('end-snapshot-id', '87647489814522183702'),
+ net_changes => true
+)
+```
+
+With the net changes, the above changelog view contains only the following row, since Alice was inserted in the first snapshot and deleted in the second snapshot.
+
+| id | name |_change_type | _change_ordinal | _change_snapshot_id |
+|---|--------|---|---|---|
+|2 | Bob |INSERT |0 |5390529835796506035|
+
+
#### Carry-over Rows
The procedure removes the carry-over rows by default. Carry-over rows are the result of row-level operations (`MERGE`, `UPDATE` and `DELETE`)
@@ -804,8 +821,10 @@ reports this as the following pair of rows, despite it not being an actual chang
| 1 | Alice | DELETE |
| 1 | Alice | INSERT |
-By default, this view finds the carry-over rows and removes them from the result. User can disable this
-behavior by setting the `remove_carryovers` option to `false`.
+To see carry-over rows, query `SparkChangelogTable` as follows:
+```sql
+SELECT * FROM spark_catalog.db.tbl.changes
+```
#### Pre/Post Update Images