[ https://issues.apache.org/jira/browse/SPARK-31375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sai Krishna Chaitanya Chaganti updated SPARK-31375:
---------------------------------------------------
    Description: 
While overwriting data in specific partitions using insertInto, Spark appends 
data to those partitions even though the save mode is overwrite. The property 
below is set so that we don't overwrite all partitions; if it is set to 
'static' instead, Spark truncates all partitions and inserts the data.


 _spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')_


 _df.write.mode('overwrite').format('parquet').insertInto('<db>.<tbl>')_

However, if the statement above is changed to

_df.write.mode('overwrite').format('parquet').insertInto('<db>.<tbl>', overwrite=True)_

it behaves correctly, i.e. it overwrites only the data in the matching 
partitions.
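
A minimal reproduction sketch (the database/table name demo.sales and its 
schema are hypothetical placeholders, assuming a Hive-backed partitioned 
Parquet table):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

# Hypothetical partitioned table, for illustration only.
spark.sql("CREATE TABLE IF NOT EXISTS demo.sales (amount INT) "
          "PARTITIONED BY (dt STRING) STORED AS PARQUET")

df = spark.createDataFrame([(1, '2020-04-01')], ['amount', 'dt'])

# Expected with dynamic overwrite: only partition dt='2020-04-01' is replaced.
# Observed in 2.4.3: the rows are appended instead.
df.write.mode('overwrite').format('parquet').insertInto('demo.sales')

# Workaround: pass overwrite=True explicitly.
df.write.mode('overwrite').format('parquet').insertInto('demo.sales', overwrite=True)
{code}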

It seems that even though the save mode was set earlier via mode('overwrite'), 
precedence is given to the overwrite parameter of the insertInto call:
+_*insertInto('<db>.<tbl>', overwrite=True)*_+

This happens in PySpark.
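
This would be consistent with the PySpark 2.4 DataFrameWriter.insertInto 
wrapper, which appears to derive the save mode from its own overwrite flag. 
A paraphrased sketch of the suspected logic (an assumption, not the verbatim 
library source):

{code:python}
# Paraphrased sketch of the suspected PySpark 2.4 wrapper (assumption, not
# verbatim source): the mode is rebuilt from the overwrite flag, so any
# earlier .mode('overwrite') call on the writer is silently discarded.
def insertInto(self, tableName, overwrite=False):
    self._jwrite.mode("overwrite" if overwrite else "append").insertInto(tableName)
{code}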

> Overwriting into dynamic partitions is appending data in pyspark
> ----------------------------------------------------------------
>
>                 Key: SPARK-31375
>                 URL: https://issues.apache.org/jira/browse/SPARK-31375
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.4.3
>         Environment: databricks, s3, EMR, PySpark.
>            Reporter: Sai Krishna Chaitanya Chaganti
>            Priority: Major


