Anish Mahto created SPARK-57625:
-----------------------------------
Summary: Manage Auxiliary Table Lifecycle in DatasetManager
Key: SPARK-57625
URL: https://issues.apache.org/jira/browse/SPARK-57625
Project: Spark
Issue Type: Sub-task
Components: Declarative Pipelines
Affects Versions: 4.3.0
Reporter: Anish Mahto
Today, the auxiliary table is managed (created, full-refreshed, etc.) during
flow execution. For SCD1 this is functionally correct but poorly timed. The
auxiliary table should be managed side by side with the target table it
accompanies, and table-level validations for the AutoCDC auxiliary table
(e.g. key or SCD type drift) should happen well before flow execution.
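
As a rough illustration of the kind of table-level check meant here, the sketch below compares an auxiliary table's recorded keys and SCD type against what the pipeline currently declares. It is a plain-Python model, not Spark code: `AuxTableSpec` and `validate_aux_table` are hypothetical names, and treating key order as significant is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AuxTableSpec:
    """Hypothetical description of an AutoCDC auxiliary table."""
    keys: Tuple[str, ...]  # key columns of the CDC flow (order assumed significant)
    scd_type: int          # 1 or 2


def validate_aux_table(existing: AuxTableSpec, declared: AuxTableSpec) -> List[str]:
    """Return one message per drift between the existing table and the declaration."""
    errors: List[str] = []
    if existing.keys != declared.keys:
        errors.append(f"key drift: {existing.keys} -> {declared.keys}")
    if existing.scd_type != declared.scd_type:
        errors.append(f"SCD type drift: {existing.scd_type} -> {declared.scd_type}")
    return errors
```

Because a check like this needs only table metadata, it could run while datasets are being materialized, so a drifted pipeline would fail before any flow starts.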
For SCD2, which is coming soon, the existing control flow is incompatible. SCD2
auxiliary tables will contain data columns (SCD1 auxiliary tables contain only
key columns and a CDC metadata column) and will therefore need to undergo
schema evolution the same way that target tables do.
The proposal here is to refactor auxiliary table management so that
DatasetManager recognizes the general concept of an auxiliary table and manages
full refresh, schema evolution, and catalog table validation alongside the
target table it accompanies.
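
A minimal sketch of that control flow, under stated assumptions: this is a toy plain-Python model rather than the actual DatasetManager, and `TableDef`, `InMemoryCatalog`, and `materialize_datasets` are hypothetical names. Schema evolution is modelled as additive only (new columns are appended), which may differ from what the real implementation supports.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TableDef:
    """Hypothetical table definition: a name, a schema, and an optional companion."""
    name: str
    schema: Dict[str, str]                  # column name -> type name
    auxiliary: Optional["TableDef"] = None  # auxiliary table, if the target has one


class InMemoryCatalog:
    """Toy stand-in for a catalog: maps table name -> schema."""

    def __init__(self) -> None:
        self.tables: Dict[str, Dict[str, str]] = {}

    def materialize(self, table: TableDef, full_refresh: bool) -> None:
        existing = self.tables.get(table.name)
        if existing is None or full_refresh:
            # Create the table, or recreate it from scratch on full refresh.
            self.tables[table.name] = dict(table.schema)
        else:
            # Additive schema evolution: keep existing columns, append new ones.
            merged = dict(existing)
            for column, type_name in table.schema.items():
                merged.setdefault(column, type_name)
            self.tables[table.name] = merged


def materialize_datasets(
    catalog: InMemoryCatalog, tables: List[TableDef], full_refresh: bool = False
) -> None:
    """Materialize each target and its auxiliary table in the same pass."""
    for table in tables:
        catalog.materialize(table, full_refresh)
        if table.auxiliary is not None:
            # The companion goes through the same lifecycle as its target.
            catalog.materialize(table.auxiliary, full_refresh)
```

In this model the auxiliary table never has a lifecycle of its own: it is created, evolved, and full-refreshed in the same pass as its target, before any flow would run.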
We should no longer be materializing or validating tables during flow execution.