This is an automated email from the ASF dual-hosted git repository.
yufei pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/iceberg.git
The following commit(s) were added to refs/heads/master by this push:
new e5ad5ce4c0 Spark 3.4: Add the net_changes option in the changelog procedure doc (#8364)
e5ad5ce4c0 is described below
commit e5ad5ce4c09d1efcd26e3cf0bd27bcf8c06c2400
Author: Yufei Gu <[email protected]>
AuthorDate: Mon Aug 28 11:36:29 2023 -0700
Spark 3.4: Add the net_changes option in the changelog procedure doc (#8364)
---
docs/spark-procedures.md | 39 +++++++++++++++++++++++++++++----------
1 file changed, 29 insertions(+), 10 deletions(-)
diff --git a/docs/spark-procedures.md b/docs/spark-procedures.md
index 9bd4ec8c94..5303982b27 100644
--- a/docs/spark-procedures.md
+++ b/docs/spark-procedures.md
@@ -722,14 +722,15 @@ Creates a view that contains the changes from a given table.
#### Usage
-| Argument Name | Required? | Type | Description |
-|---------------|-----------|------|-------------|
-| `table` | ✔️ | string | Name of the source table for the changelog |
-| `changelog_view` | | string | Name of the view to create |
-| `options` | | map<string, string> | A map of Spark read options to use |
-| `compute_updates` | | boolean | Whether to compute pre/post update images (see below for more information). Defaults to false. |
-| `identifier_columns` | | array<string> | The list of identifier columns to compute updates. If the argument `compute_updates` is set to true and `identifier_columns` are not provided, the table’s current identifier fields will be used to compute updates. |
-| `remove_carryovers` | | boolean | Whether to remove carry-over rows (see below for more information). Defaults to true. |
+| Argument Name        | Required? | Type                | Description |
+|----------------------|-----------|---------------------|-------------|
+| `table`              | ✔️        | string              | Name of the source table for the changelog |
+| `changelog_view`     |           | string              | Name of the view to create |
+| `options`            |           | map<string, string> | A map of Spark read options to use |
+| `net_changes`        |           | boolean             | Whether to output net changes (see below for more information). Defaults to false. |
+| `compute_updates`    |           | boolean             | Whether to compute pre/post update images (see below for more information). Defaults to false. |
+| `identifier_columns` |           | array<string>       | The list of identifier columns to compute updates. If the argument `compute_updates` is set to true and `identifier_columns` are not provided, the table’s current identifier fields will be used. |
+| `remove_carryovers`  |           | boolean             | Whether to remove carry-over rows (see below for more information). Defaults to true. Deprecated since 1.4.0, will be removed in 1.5.0; please query `SparkChangelogTable` to view carry-over rows. |
Here is a list of commonly used Spark read options:
* `start-snapshot-id`: the exclusive start snapshot ID. If not provided, it reads from the table’s first snapshot inclusively.
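Putting the arguments and read options together, a sketch of a call that bounds the changelog to a snapshot range and computes update images over explicit identifier columns (the table name, snapshot IDs, and column list here are placeholders, not values from this commit):

```sql
CALL spark_catalog.system.create_changelog_view(
  table => 'db.tbl',
  options => map('start-snapshot-id', '1', 'end-snapshot-id', '2'),
  compute_updates => true,
  identifier_columns => array('id')
)
```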
@@ -792,6 +793,22 @@ second snapshot deleted 1 record.
|2 | Bob |INSERT |0 |5390529835796506035|
|1 | Alice |DELETE |1 |8764748981452218370|
+Create a changelog view that computes net changes. It removes intermediate changes and outputs only the net changes.
+```sql
+CALL spark_catalog.system.create_changelog_view(
+ table => 'db.tbl',
+ options => map('end-snapshot-id', '87647489814522183702'),
+ net_changes => true
+)
+```
+
+With the net changes, the above changelog view contains only the following row, since Alice was inserted in the first snapshot and deleted in the second snapshot.
+
+| id | name |_change_type | _change_ordinal | _change_snapshot_id |
+|---|--------|---|---|---|
+|2 | Bob |INSERT |0 |5390529835796506035|
+
+
#### Carry-over Rows
The procedure removes the carry-over rows by default. Carry-over rows are the result of row-level operations (`MERGE`, `UPDATE` and `DELETE`)
@@ -804,8 +821,10 @@ reports this as the following pair of rows, despite it not being an actual chang
| 1 | Alice | DELETE |
| 1 | Alice | INSERT |
-By default, this view finds the carry-over rows and removes them from the result. User can disable this
-behavior by setting the `remove_carryovers` option to `false`.
+To see carry-over rows, query `SparkChangelogTable` as follows:
+```sql
+SELECT * FROM spark_catalog.db.tbl.changes
+```
#### Pre/Post Update Images