[jira] [Comment Edited] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables

Wenchen Fan (JIRA) Tue, 21 Jun 2016 18:17:02 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15343029#comment-15343029
 ]


Wenchen Fan edited comment on SPARK-16032 at 6/22/16 1:15 AM:
--------------------------------------------------------------

I don't think it makes sense to use `partitionBy` with `insertInto`, because we 
cannot map `DataFrameWriter.insertInto` to SQL INSERT, for two reasons:

1. `DataFrameWriter` doesn't support static partitions.
2. With `DataFrameWriter`, `partitionBy` specifies the partition columns of the 
data to insert, not those of the table being inserted into.
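
To make the two points above concrete, here is a rough sketch of how a SQL 
INSERT with a PARTITION clause resolves partition values. It is a toy model in 
plain Python, not Spark's code: the function name `resolve_partitions` and the 
sample `year`/`month` layout are made up for illustration. It assumes the usual 
SQL convention that a static partition gets its value from the statement 
itself, while a dynamic partition takes its value from the trailing columns of 
the inserted row.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple


def resolve_partitions(
    partition_cols: Sequence[str],
    partition_spec: Dict[str, Optional[Any]],
    row: Sequence[Any],
) -> Tuple[List[Any], Dict[str, Any]]:
    """Toy model of `INSERT INTO t PARTITION (...) SELECT ...` (hypothetical helper).

    partition_spec maps a partition column to a literal (static partition)
    or to None (dynamic partition). Dynamic values are read, in order, from
    the trailing columns of `row`; the remaining leading columns are data.
    """
    dynamic = [c for c in partition_cols if partition_spec.get(c) is None]
    split = len(row) - len(dynamic)
    data = list(row[:split])
    dynamic_values = dict(zip(dynamic, row[split:]))

    partitions = {}
    for col in partition_cols:
        static_value = partition_spec.get(col)
        # Static partitions come from the statement, dynamic ones from the row.
        partitions[col] = static_value if static_value is not None else dynamic_values[col]
    return data, partitions


# Models: INSERT INTO t PARTITION (year = 2016, month) SELECT name, n, month ...
mixed = resolve_partitions(["year", "month"], {"year": 2016, "month": None}, ("a", 1, 6))

# Models: INSERT INTO t PARTITION (year = 2016, month = 6) SELECT name, n ...
all_static = resolve_partitions(["year", "month"], {"year": 2016, "month": 6}, ("a", 1))
```

In this model the literal `2016` lives in the statement, not in the row. 
`partitionBy` only takes column names of the data being written, so it has no 
place to carry such a literal, and the names it takes describe the data rather 
than the target table.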

It is also already mostly broken in 1.6. According to the test cases at 
https://gist.github.com/cloud-fan/14ada3f2b3225b5db52ccaa12aacfbd4 , the only 
case that behaves reasonably in 1.6 is when the data to insert has the same 
schema as the target table and `partitionBy` specifies the correct partition 
columns. Still, I think it's worth breaking that case to make the overall 
semantics clearer.

Maybe we are wrong, and it would be good if we could come up with clean 
semantics that explain the behavior of `DataFrame.insertInto`. But after 
spending a lot of time on it, we failed, and that's why we want to make these 
changes and rush them into 2.0.


was (Author: cloud_fan):
I think it's nonsense to use `partitionBy` with `insertInto`, as we can not map 
`DataFrameWriter.insertInto` to SQL INSERT for 2 reasons:

1. `DataFrameWriter` doesn't support static partition
2. `DataFrameWriter` specifies the partition columns of the data to insert, not 
the table to be inserted.

And it's already broken(mostly) in 1.6, according to the test cases at 
https://gist.github.com/cloud-fan/14ada3f2b3225b5db52ccaa12aacfbd4 , the only 
case that seems reasonable in 1.6 is when the data to insert has same schema 
with the table to be inserted and the `partitionBy` specifies the correct 
partition columns. But I think it's worth to break it and make the overall 
semantics more clear.

Maybe we are wrong, it will be good if we come up with a clean semantics to 
explain the behavior of `DataFrame.insertInto`, but after spent a lot of time 
on it, we failed, and that's why we wanna make these changes and rush in into 
2.0.

> Audit semantics of various insertion operations related to partitioned tables
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-16032
>                 URL: https://issues.apache.org/jira/browse/SPARK-16032
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Wenchen Fan
>            Priority: Critical
>         Attachments: [SPARK-16032] Spark SQL table insertion auditing - 
> Google Docs.pdf
>
>
> We found that the semantics of various insertion operations related to 
> partitioned tables can be inconsistent. This is an umbrella ticket for all 
> related tickets.



