[ https://issues.apache.org/jira/browse/SPARK-57625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-57625:
-----------------------------------
Labels: pull-request-available (was: )
> Manage Auxiliary Table Lifecycle in DatasetManager
> --------------------------------------------------
>
> Key: SPARK-57625
> URL: https://issues.apache.org/jira/browse/SPARK-57625
> Project: Spark
> Issue Type: Sub-task
> Components: Declarative Pipelines
> Affects Versions: 4.3.0
> Reporter: Anish Mahto
> Priority: Major
> Labels: pull-request-available
>
> Today, the auxiliary table is managed (created, full-refreshed, etc.) during
> flow execution. For SCD1 this is functionally correct but poorly timed. The
> auxiliary table should be managed side by side with the target table that it
> accompanies, and table-level validations for the AutoCDC auxiliary table
> (e.g. key/SCD type drift) should happen well before flow execution.
>
> For SCD2, which is coming soon, the existing control flow is incompatible.
> SCD2 auxiliary tables will contain data columns (SCD1 auxiliary tables
> contain only key columns plus a CDC metadata column), and will therefore need
> to undergo schema evolution the same way that target tables do.
>
> The proposal here is to refactor auxiliary table management so that
> DatasetManager recognizes the general concept of an auxiliary table and
> manages full refresh, schema evolution, and catalog table validations
> alongside the target table it accompanies.
>
> We should no longer materialize or validate tables during flow execution.
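
The proposal above could be pictured roughly as follows. This is only a toy
sketch in Python, not Spark's actual DatasetManager: the in-memory catalog,
TableSpec, TableValidationError, and every method name are hypothetical, and
the choice to skip drift checks on full refresh is an assumption made for
illustration. It shows the target and its auxiliary table going through one
validate-then-apply path before any flow runs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TableSpec:
    """Hypothetical desired state of a table (not a Spark class)."""
    name: str
    columns: Dict[str, str]                      # column name -> type name
    keys: List[str] = field(default_factory=list)
    scd_type: Optional[int] = None
    auxiliary: Optional["TableSpec"] = None      # companion table, if any


class TableValidationError(Exception):
    """Raised when a desired spec drifts from the cataloged table."""


class DatasetManager:
    """Toy stand-in that manages a target and its auxiliary table together."""

    def __init__(self, catalog: Dict[str, TableSpec]):
        self.catalog = catalog                   # assumed in-memory catalog

    def materialize(self, target: TableSpec, full_refresh: bool = False) -> None:
        specs = [target] + ([target.auxiliary] if target.auxiliary else [])
        # Validate every table first so drift on the auxiliary table
        # fails before the target is touched (assumption: a full refresh
        # recreates the tables, so drift checks are skipped).
        if not full_refresh:
            for spec in specs:
                self._validate(spec)
        for spec in specs:
            self._apply(spec, full_refresh)

    def _validate(self, spec: TableSpec) -> None:
        existing = self.catalog.get(spec.name)
        if existing is None:
            return
        if existing.keys != spec.keys or existing.scd_type != spec.scd_type:
            raise TableValidationError(f"key or SCD type drift on {spec.name}")

    def _apply(self, spec: TableSpec, full_refresh: bool) -> None:
        existing = self.catalog.get(spec.name)
        if existing is None or full_refresh:
            # Create (or recreate) the table from the desired spec.
            self.catalog[spec.name] = TableSpec(
                spec.name, dict(spec.columns), list(spec.keys), spec.scd_type
            )
        else:
            # Additive schema evolution: add columns the table lacks.
            for column, dtype in spec.columns.items():
                existing.columns.setdefault(column, dtype)
```

In this sketch flow execution would only read and write tables that already
exist and have already been validated, which is the separation the
description asks for.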
--
This message was sent by Atlassian Jira
(v8.20.10#820010)