[jira] [Updated] (SPARK-52576) In Declarative Pipelines, drop/recreate on full refresh and MV update

Sandy Ryza (Jira) Wed, 25 Jun 2025 13:41:35 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-52576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sandy Ryza updated SPARK-52576:
-------------------------------
    Description: 
Some pipeline runs result in wiping out and replacing all the data for a table:
 * Every run of a materialized view
 * Runs of streaming tables that have the "full refresh" flag

In the current implementation, this "wipe out and replace" is implemented by:
 * Truncating the table
 * Altering the table to drop/update/add columns that don't match the columns 
in the DataFrame for the current run
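
The two steps above can be sketched as a small planner that emits Spark SQL statements. This is only an illustration, not the actual Declarative Pipelines code: the function name, the dict-based schema representation, and the table name are hypothetical, and a type change is simplified to a drop followed by an add rather than an in-place update.

```python
from typing import Dict, List


def truncate_and_alter_plan(
    table: str, existing: Dict[str, str], target: Dict[str, str]
) -> List[str]:
    """Hypothetical helper: SQL that empties `table` and reshapes it to `target`.

    `existing` and `target` map column names to SQL type names.
    """
    stmts = [f"TRUNCATE TABLE {table}"]
    # Columns missing from the target, or whose type differs, get dropped.
    to_drop = [c for c, t in existing.items() if target.get(c) != t]
    # Columns new in the target, or whose type differs, get (re-)added.
    to_add = [(c, t) for c, t in target.items() if existing.get(c) != t]
    if to_drop:
        stmts.append(f"ALTER TABLE {table} DROP COLUMNS ({', '.join(to_drop)})")
    if to_add:
        cols = ", ".join(f"{c} {t}" for c, t in to_add)
        stmts.append(f"ALTER TABLE {table} ADD COLUMNS ({cols})")
    return stmts


plan = truncate_and_alter_plan(
    "sales",
    existing={"id": "BIGINT", "name": "STRING", "legacy": "INT"},
    target={"id": "BIGINT", "name": "STRING", "score": "DOUBLE"},
)
for stmt in plan:
    print(stmt)
```

Note that the plan contains a DROP COLUMNS statement whenever the existing table has a column the new DataFrame lacks, so it only works on catalogs that can execute that statement.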

The reason we originally wanted to truncate + alter instead of drop / 
recreate is that dropping has some undesirable effects. E.g. it interrupts 
readers of the table and wipes away things like ACLs.

However, we discovered that not all catalogs support dropping columns (e.g. 
Hive does not), and there’s no way to tell whether a given catalog supports 
it. So change the implementation to drop/recreate the table instead of 
truncate/alter.
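
For comparison, a drop/recreate plan might look like the sketch below. It is again illustrative only: the function name and schema representation are hypothetical, and the real implementation would go through the catalog API rather than SQL strings. The point is that the plan depends only on the target schema, so it never needs the catalog to support dropping columns.

```python
from typing import Dict, List


def drop_and_recreate_plan(table: str, target: Dict[str, str]) -> List[str]:
    """Hypothetical helper: SQL that replaces `table` with an empty table
    whose columns match `target` (column name -> SQL type name)."""
    cols = ", ".join(f"{c} {t}" for c, t in target.items())
    return [
        # Dropping discards the old data and the old schema in one step ...
        f"DROP TABLE IF EXISTS {table}",
        # ... so the new table is created directly with the target columns.
        f"CREATE TABLE {table} ({cols})",
    ]


recreate_plan = drop_and_recreate_plan(
    "sales", {"id": "BIGINT", "name": "STRING", "score": "DOUBLE"}
)
for stmt in recreate_plan:
    print(stmt)
```

The trade-off described above still applies: the DROP TABLE step is what interrupts readers and discards table-level metadata such as ACLs.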

> In Declarative Pipelines, drop/recreate on full refresh and MV update
> ---------------------------------------------------------------------
>
>                 Key: SPARK-52576
>                 URL: https://issues.apache.org/jira/browse/SPARK-52576
>             Project: Spark
>          Issue Type: Improvement
>          Components: Declarative Pipelines
>    Affects Versions: 4.1.0
>            Reporter: Sandy Ryza
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

