[ 
https://issues.apache.org/jira/browse/SPARK-57625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-57625:
-----------------------------------
    Labels: pull-request-available  (was: )

> Manage Auxiliary Table Lifecycle in DatasetManager
> --------------------------------------------------
>
>                 Key: SPARK-57625
>                 URL: https://issues.apache.org/jira/browse/SPARK-57625
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Declarative Pipelines
>    Affects Versions: 4.3.0
>            Reporter: Anish Mahto
>            Priority: Major
>              Labels: pull-request-available
>
> Today, the auxiliary table is managed (created, full-refreshed, etc.) during 
> flow execution. For SCD1 this is functionally correct but poorly timed. The 
> auxiliary table should be managed side by side with the target table that it 
> is a companion to, and table-level validations for the AutoCDC auxiliary 
> table (e.g. key/SCD type drift) should happen well before flow execution.
> For SCD2, which is coming soon, the existing control flow is incompatible. 
> SCD2 auxiliary tables will contain data columns (SCD1 auxiliary tables 
> contain only key columns plus a CDC metadata column), and will therefore need 
> to undergo schema evolution the same way that target tables do.
> The proposal here is to refactor auxiliary table management so that 
> DatasetManager recognizes the general concept of an auxiliary table and 
> manages full refresh, schema evolution, and catalog table validations 
> alongside the target table it accompanies. 
> Tables should no longer be materialized or validated during flow 
> execution.
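
The proposed control flow can be illustrated with a minimal sketch. This is not Spark's DatasetManager API: every name below (`AuxTableSpec`, `TargetTableSpec`, `DriftError`, `materialize`) and the dictionary standing in for the catalog are hypothetical, and the language is Python only for brevity. It assumes that "managing the auxiliary table alongside the target" means creating, fully refreshing, and validating both tables in one pass that runs before any flow executes.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class AuxTableSpec:
    """Hypothetical description of an AutoCDC auxiliary table."""
    name: str
    keys: Tuple[str, ...]
    scd_type: int


@dataclass(frozen=True)
class TargetTableSpec:
    """Hypothetical target table with an optional companion auxiliary table."""
    name: str
    auxiliary: Optional[AuxTableSpec] = None


class DriftError(Exception):
    """Raised when an existing auxiliary table disagrees with its new spec."""


def materialize(
    catalog: Dict[str, Optional[AuxTableSpec]],
    targets: List[TargetTableSpec],
    full_refresh: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Create, refresh, and validate tables before any flow runs.

    `catalog` is a stand-in for a real catalog: it maps a table name to the
    auxiliary spec stored for it (None for plain target tables).
    Returns a log of the actions taken, in order.
    """
    actions: List[str] = []
    for target in targets:
        refresh = target.name in full_refresh
        aux = target.auxiliary

        # Validate the auxiliary table first, so a key or SCD type drift
        # fails before this target or its companion is touched.
        if aux is not None and not refresh and aux.name in catalog:
            stored = catalog[aux.name]
            if stored is not None and (stored.keys, stored.scd_type) != (
                aux.keys,
                aux.scd_type,
            ):
                raise DriftError(f"{aux.name}: keys or SCD type changed")

        # Apply the same lifecycle decision to the target and its companion.
        tables = [(target.name, None)]
        if aux is not None:
            tables.append((aux.name, aux))
        for name, spec in tables:
            if name not in catalog:
                actions.append(f"create {name}")
            elif refresh:
                actions.append(f"reset {name}")
            else:
                actions.append(f"keep {name}")
            catalog[name] = spec
    return actions
```

In this sketch a full refresh of the target also resets its auxiliary table, and a drifted key set is rejected unless the target is being fully refreshed. Schema evolution for SCD2 auxiliary tables is omitted; under the same assumption it would be one more step applied to both tables in the inner loop.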



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

