[ 
https://issues.apache.org/jira/browse/SPARK-16410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365883#comment-15365883
 ] 

Ian Hellstrom commented on SPARK-16410:
---------------------------------------

The fact that truncate may be faster is a nice bonus, but the main reason is 
that in relational databases a DROP/CREATE deletes all structure on a table 
(partitions, indexes, constraints, clustered keys, etc.). Once these are lost, 
performance may be severely degraded.
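
To illustrate the loss of structure, here is a minimal sketch using Python's 
standard-library sqlite3 module. The table name `events`, the index name 
`idx_events_id`, and the helper functions are hypothetical. SQLite has no 
TRUNCATE statement, so an unqualified DELETE FROM stands in for it here. Other 
databases differ in detail, and this is not what Spark itself executes.

```python
import sqlite3


def index_names(conn, table):
    """Return the sorted names of all indexes currently defined on `table`."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = ?",
        (table,),
    ).fetchall()
    return sorted(name for (name,) in rows)


def make_table(conn):
    """Create a hypothetical `events` table with one secondary index and one row."""
    conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
    conn.execute("CREATE INDEX idx_events_id ON events (id)")
    conn.execute("INSERT INTO events VALUES (1, 'old')")


def overwrite_by_drop(conn):
    """DROP/CREATE/INSERT: the table definition is rebuilt from scratch."""
    conn.execute("DROP TABLE events")
    conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
    conn.execute("INSERT INTO events VALUES (2, 'new')")


def overwrite_by_truncate(conn):
    """TRUNCATE/INSERT: only the rows are removed (DELETE FROM is the stand-in)."""
    conn.execute("DELETE FROM events")
    conn.execute("INSERT INTO events VALUES (2, 'new')")


# Run each overwrite strategy against its own in-memory database.
dropped = sqlite3.connect(":memory:")
make_table(dropped)
overwrite_by_drop(dropped)

truncated = sqlite3.connect(":memory:")
make_table(truncated)
overwrite_by_truncate(truncated)

print(index_names(dropped, "events"))    # index was removed with the table
print(index_names(truncated, "events"))  # index is still defined
```

Both paths end up with the same single new row, but only the DELETE path 
still has `idx_events_id` afterwards. The DROP/CREATE path prints an empty 
list.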

Moreover, in some databases (e.g. Oracle), views that are based on this 
recreated table may become invalid. They will automatically be recompiled the 
next time someone tries to access them, but it's odd behaviour because in an 
RDBMS context overwrite usually means TRUNCATE/INSERT and not 
DROP/CREATE/INSERT. 

For JDBC, I think the truncate option should default to true, as that 
is what users expect.

> DataFrameWriter's jdbc method drops table in overwrite mode
> -----------------------------------------------------------
>
>                 Key: SPARK-16410
>                 URL: https://issues.apache.org/jira/browse/SPARK-16410
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1, 1.6.2
>            Reporter: Ian Hellstrom
>
> According to the [API 
> documentation|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter],
>  the write mode {{overwrite}} should _overwrite the existing data_, which 
> suggests that the data is removed, i.e. the table is truncated. 
> However, that is not what happens in the [source 
> code|https://github.com/apache/spark/blob/0ad6ce7e54b1d8f5946dde652fa5341d15059158/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L421]:
> {code}
> if (mode == SaveMode.Overwrite && tableExists) {
>         JdbcUtils.dropTable(conn, table)
>         tableExists = false
>       }
> {code}
> This clearly shows that the table is first dropped and then recreated. This 
> causes two major issues:
> * Existing indexes, partitioning schemes, etc. are completely lost.
> * The case of identifiers may be changed without the user understanding why.
> In my opinion, the table should be truncated, not dropped. Overwriting data 
> is a DML operation and should not cause DDL.


