Anish Mahto created SPARK-57625:
-----------------------------------

             Summary: Manage Auxiliary Table Lifecycle in DatasetManager
                 Key: SPARK-57625
                 URL: https://issues.apache.org/jira/browse/SPARK-57625
             Project: Spark
          Issue Type: Sub-task
          Components: Declarative Pipelines
    Affects Versions: 4.3.0
            Reporter: Anish Mahto


Today, the auxiliary table is managed (created, full-refreshed, etc.) during 
flow execution. For SCD1 this is functionally correct but poorly timed. The 
auxiliary table should be managed side by side with the target table it 
accompanies, and table-level validations for the AutoCDC auxiliary table 
(e.g., key or SCD type drift) should happen well before flow execution.

For SCD2, which is coming soon, the existing control flow is incompatible. SCD2 
auxiliary tables will contain data columns (SCD1 auxiliary tables contain only 
key columns and a CDC metadata column), and will therefore need to undergo 
schema evolution the same way that target tables do.

The proposal here is to refactor auxiliary table management so that 
DatasetManager recognizes the general concept of an auxiliary table and manages 
full refresh, schema evolution, and catalog table validation for it alongside 
the target table it accompanies.

We should no longer be materializing or validating tables during flow execution.
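
As a rough illustration of the intended control flow, the sketch below 
materializes a target table and its auxiliary table together, before any flow 
runs. It is not the actual DatasetManager code: ToyCatalog, TableSpec, 
AuxiliarySpec, DriftError, validate_auxiliary, and materialize are all 
hypothetical names, the catalog is an in-memory dict, schema evolution is 
simplified to appending new columns, and skipping drift validation on full 
refresh is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class AuxiliarySpec:
    """Hypothetical description of an AutoCDC auxiliary table."""
    name: str
    keys: Tuple[str, ...]
    scd_type: int
    columns: Tuple[str, ...]


@dataclass
class TableSpec:
    """Hypothetical target table, optionally paired with an auxiliary table."""
    name: str
    columns: Tuple[str, ...]
    auxiliary: Optional[AuxiliarySpec] = None


class DriftError(Exception):
    """Raised when the stored keys or SCD type differ from the new definition."""


class ToyCatalog:
    """In-memory stand-in for a catalog; not a Spark API."""

    def __init__(self) -> None:
        self.tables: Dict[str, dict] = {}

    def create_or_evolve(self, name, columns, props, full_refresh) -> None:
        existing = self.tables.get(name)
        if existing is None or full_refresh:
            # Create the table, or recreate it from scratch on full refresh.
            self.tables[name] = {"columns": list(columns), "props": dict(props)}
            return
        # Simplified schema evolution: append columns that are not yet present.
        for column in columns:
            if column not in existing["columns"]:
                existing["columns"].append(column)


def validate_auxiliary(catalog: ToyCatalog, aux: AuxiliarySpec) -> None:
    """Table-level validation that runs before flow execution."""
    existing = catalog.tables.get(aux.name)
    if existing is None:
        return
    props = existing["props"]
    if props["keys"] != aux.keys or props["scd_type"] != aux.scd_type:
        raise DriftError(f"key or SCD type drift on {aux.name}")


def materialize(catalog: ToyCatalog, table: TableSpec,
                full_refresh: bool = False) -> None:
    """Manage the target and its auxiliary table as one unit."""
    aux = table.auxiliary
    # Assumption: a full refresh recreates both tables, so drift is allowed.
    if aux is not None and not full_refresh:
        validate_auxiliary(catalog, aux)
    catalog.create_or_evolve(table.name, table.columns, {}, full_refresh)
    if aux is not None:
        aux_props = {"keys": aux.keys, "scd_type": aux.scd_type}
        catalog.create_or_evolve(aux.name, aux.columns, aux_props, full_refresh)
```

In this sketch, a flow would only read from and write to tables that already 
exist and have already been validated, so a key or SCD type drift surfaces as 
a DriftError during materialization rather than partway through flow 
execution.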



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

