[
https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15343029#comment-15343029
]
Wenchen Fan edited comment on SPARK-16032 at 6/22/16 1:15 AM:
--------------------------------------------------------------
I don't think it makes sense to use `partitionBy` with `insertInto`, because we
cannot map `DataFrameWriter.insertInto` to SQL INSERT, for 2 reasons:
1. `DataFrameWriter` doesn't support static partitions.
2. `DataFrameWriter` specifies the partition columns of the data to insert, not
those of the table to be inserted into.
It's also already (mostly) broken in 1.6. According to the test cases at
https://gist.github.com/cloud-fan/14ada3f2b3225b5db52ccaa12aacfbd4 , the only
case that seems reasonable in 1.6 is when the data to insert has the same
schema as the table to be inserted into and `partitionBy` specifies the correct
partition columns. But I think it's worth breaking it to make the overall
semantics clearer.
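To make the two reasons and the one "reasonable" 1.6 case concrete, here is a
minimal sketch in plain Python. It is only a model, not Spark code: the function
names, the `PartitionSpec` type, and the example table `t(a, b)` partitioned by
`b` are all hypothetical, and the check is my reading of the condition described
above rather than what Spark actually implements.

```python
from typing import Dict, List, Optional

# Hypothetical model of a SQL-style partition spec:
# column name -> static value, or None for a dynamic partition.
PartitionSpec = Dict[str, Optional[str]]


def spec_as_partition_by(spec: PartitionSpec) -> Optional[List[str]]:
    """Try to express a SQL partition spec as a partitionBy-style column list.

    A column list carries names only, so a spec with any static value
    cannot be represented (reason 1); None is returned in that case.
    """
    if any(value is not None for value in spec.values()):
        return None
    return list(spec)


def partition_by_matches_table(
    data_columns: List[str],
    partition_by: List[str],
    table_columns: List[str],
    table_partition_columns: List[str],
) -> bool:
    """Model of the one case described as reasonable in 1.6.

    partitionBy names columns of the data, not of the table (reason 2), so
    it only lines up when the data has the same schema as the table and
    partitionBy lists exactly the table's partition columns.
    """
    return (
        data_columns == table_columns
        and partition_by == table_partition_columns
    )


# Hypothetical table t(a, b), partitioned by b.
print(spec_as_partition_by({"b": None}))  # ['b']  (dynamic only)
print(spec_as_partition_by({"b": "1"}))   # None   (static value)
print(partition_by_matches_table(["a", "b"], ["b"], ["a", "b"], ["b"]))  # True
print(partition_by_matches_table(["a", "b"], ["a"], ["a", "b"], ["b"]))  # False
```

The last call shows the kind of mismatch that has no clean meaning: the data is
"partitioned by" `a`, while the table is partitioned by `b`.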
Maybe we are wrong, and it would be good if we could come up with clean
semantics to explain the behavior of `DataFrame.insertInto`. But after spending
a lot of time on it, we failed, and that's why we want to make these changes
and rush them into 2.0.
> Audit semantics of various insertion operations related to partitioned tables
> -----------------------------------------------------------------------------
>
> Key: SPARK-16032
> URL: https://issues.apache.org/jira/browse/SPARK-16032
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Cheng Lian
> Assignee: Wenchen Fan
> Priority: Critical
> Attachments: [SPARK-16032] Spark SQL table insertion auditing -
> Google Docs.pdf
>
>
> We found that the semantics of various insertion operations related to
> partitioned tables can be inconsistent. This is an umbrella ticket for all
> related tickets.