[I] Spark: Add scd_type2 option to create_changelog_view procedure [iceberg]

via GitHub Wed, 11 Mar 2026 22:52:56 -0700


moomindani opened a new issue, #15593:
URL: https://github.com/apache/iceberg/issues/15593


   ## Motivation
   
   The `create_changelog_view` procedure lets users query changelog rows from 
an Iceberg table, but producing a Slowly Changing Dimensions Type-2 (SCD 
Type-2) history view still requires users to write non-trivial SQL manually: 
joining with snapshot metadata, applying LEAD window functions, and filtering 
intermediate rows. This is a recurring pattern across many analytical workloads.
   
   This issue proposes adding a `scd_type2` parameter to 
`create_changelog_view` so Iceberg can produce an SCD Type-2 view natively.
   
   ## Proposed API
   
   ```sql
   CALL catalog.system.create_changelog_view(
       table              => 'db.products',
       changelog_view     => 'products_history',
       identifier_columns => array('product_id'),
       scd_type2          => true
   )
   ```
   
   ## New columns added to the view
   
   | Column | Type | Nullable | Description |
   |---|---|---|---|
   | `_valid_from` | TIMESTAMP | no | Commit timestamp when this version became 
active |
   | `_valid_to` | TIMESTAMP | yes | Commit timestamp when superseded; `NULL` 
if still active |
   | `_is_current` | BOOLEAN | no | `true` when `_valid_to IS NULL AND 
_change_type != 'DELETE'` |
   
   ## Design choices
   
   - **`NULL` for open-ended rows** — aligns with dbt 
(`dbt_valid_from`/`dbt_valid_to`) and Databricks DLT (`__START_AT`/`__END_AT`)
   - **`_valid_from`/`_valid_to` naming** — closest to dbt and Kimball 
semantics; prefixed with `_` to match Iceberg metadata column conventions
   - **`_is_current` retained** — DELETE rows have `_valid_to IS NULL` but are 
NOT current; the convenience flag avoids the two-condition footgun
   - **DELETE rows kept** — marks hard deletes with `_is_current = false`, 
aligns with dbt `hard_deletes='new_record'`
   - **`UPDATE_BEFORE` rows filtered** — they are intermediate computation 
artifacts
   
   ## Constraints
   
   - `scd_type2 = true` requires `identifier_columns` (explicit or from table 
schema identifier fields)
   - `scd_type2 = true` and `net_changes = true` are mutually exclusive
   - `scd_type2 = true` implicitly forces `compute_updates = true`
   
   ## Query engine
   
   Spark
   
   ## Willingness to contribute
   
   - [x] I can contribute this improvement/feature independently


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Spark: Add scd_type2 option to create_changelog_view procedure [iceberg]

Reply via email to