[
https://issues.apache.org/jira/browse/SPARK-16410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15365883#comment-15365883
]
Ian Hellstrom commented on SPARK-16410:
---------------------------------------
The fact that truncate may be faster is a nice bonus, but the main reason is
that in relational databases a DROP/CREATE discards all structure defined on
the table (partitions, indexes, constraints, clustered keys, etc.). Losing
these may severely degrade performance.
Moreover, in some databases (e.g. Oracle), views that are based on this
recreated table may become invalid. They will automatically be recompiled the
next time someone tries to access them, but it's odd behaviour because in an
RDBMS context overwrite usually means TRUNCATE/INSERT and not
DROP/CREATE/INSERT.
For JDBC I think the truncate option should be set to true (by default) as that
is what users expect.
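
To make the difference concrete, here is a minimal, standalone sketch of the
statements each strategy would issue before the INSERTs. It is not Spark's
implementation: the object OverwritePlan, its statements method, and the
createDdl parameter are hypothetical names used only for illustration, and
TRUNCATE TABLE syntax is assumed to be supported by the target database.

```scala
// Hypothetical helper, not part of Spark: lists the SQL statements that an
// overwrite would run before any rows are inserted.
object OverwritePlan {
  def statements(
      table: String,
      tableExists: Boolean,
      truncate: Boolean,
      createDdl: String): Seq[String] =
    if (!tableExists) {
      // Nothing to overwrite: the table only has to be created.
      Seq(createDdl)
    } else if (truncate) {
      // DML-style overwrite: rows are removed, but indexes, partitions and
      // constraints defined on the table stay in place.
      Seq(s"TRUNCATE TABLE $table")
    } else {
      // Behaviour described in this issue: the table is dropped and
      // recreated, so any structure defined on it is lost.
      Seq(s"DROP TABLE $table", createDdl)
    }
}
```

With truncate = true and an existing table, the only statement issued is the
TRUNCATE, so dependent views and indexes are left alone; with truncate = false
the same call yields a DROP followed by the CREATE.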
> DataFrameWriter's jdbc method drops table in overwrite mode
> -----------------------------------------------------------
>
> Key: SPARK-16410
> URL: https://issues.apache.org/jira/browse/SPARK-16410
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.1, 1.6.2
> Reporter: Ian Hellstrom
>
> According to the [API
> documentation|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter],
> the write mode {{overwrite}} should _overwrite the existing data_, which
> suggests that the data is removed, i.e. the table is truncated.
> However, that is not what happens in the [source
> code|https://github.com/apache/spark/blob/0ad6ce7e54b1d8f5946dde652fa5341d15059158/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L421]:
> {code}
> if (mode == SaveMode.Overwrite && tableExists) {
>   JdbcUtils.dropTable(conn, table)
>   tableExists = false
> }
> {code}
> This clearly shows that the table is first dropped and then recreated. This
> causes two major issues:
> * Existing indexes, partitioning schemes, etc. are completely lost.
> * The case of identifiers may be changed without the user understanding why.
> In my opinion, the table should be truncated, not dropped. Overwriting data
> is a DML operation and should not cause DDL.