Github user rdblue commented on the pull request:

    https://github.com/apache/spark/pull/12313#issuecomment-222205450
  
    @yhuai, I'll answer #2 first since it's quick: the column names are used to 
create a projection of the incoming data frame so any extra columns aren't 
selected and are ignored. If we think that's a bad assumption, I'm happy to add 
an analysis exception or an option to turn on/off strict checking for that 
case. What do you think is the right thing to do?
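
    As a rough illustration of the projection described above, here is a 
minimal sketch in plain Scala rather than Spark. `ProjectionSketch`, 
`projectByName`, and `extraColumns` are hypothetical names, a row is modeled 
as a `Map` from column name to value, and failing on a missing column is an 
assumption, not something this PR is known to do.

```scala
// Hypothetical model, not Spark code: a row maps column names to values.
object ProjectionSketch {
  type Row = Map[String, Any]

  // Select the table's columns by name, in the table's order.
  // Columns that the table does not declare are simply never looked up.
  def projectByName(tableColumns: Seq[String], row: Row): Seq[Any] =
    tableColumns.map { name =>
      // Assumption: a column the table needs but the row lacks is an error.
      row.getOrElse(name, throw new IllegalArgumentException(s"missing column: $name"))
    }

  // Columns that a strict-checking option could reject instead of ignoring.
  def extraColumns(tableColumns: Seq[String], row: Row): Set[String] =
    row.keySet -- tableColumns
}
```

    A strict mode could raise an error whenever `extraColumns` is non-empty; 
the lenient behavior described above just ignores that set.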
    
    For #1, can you explain why you think it is better to move the by-name 
column resolution into a separate command? The problem I'm trying to fix is 
that the SQL-based behavior of `insertInto` is not obvious when using data 
frames because users expect the API to work like normal serialization 
libraries, where names matter rather than order. If we were to leave 
`insertInto` as it is and add another command, then I think it would be more 
confusing, not less. Adding another command won't make it obvious that 
`insertInto` doesn't do what a user expects.
    
    The data frame API doesn't make the ordering explicit, unlike SQL. For 
example, `df.groupBy("c").agg(max("d") as "m")` results in a data frame with 
columns `c` and `m`. Adding `c` automatically is not behavior carried over 
from SQL, and it isn't obvious whether `c` will be missing, at the beginning, 
or at the end. I think this flexibility is a strength of data frames, but it 
means that we should handle some situations, like writing, in the way that 
users expect. Users commonly refer to columns by name, so I think it makes 
sense to have an option for this in the writer API.
    
    Say I have one table with columns `a`, `b`, `c` and a second with columns 
`a`, `c` that's partitioned by `b`. Using 
`spark.table("one").write.insertInto("two")` works as expected. But if I go the 
other way, `spark.table("two").write.insertInto("one")`, then it silently maps 
`c -> b` and `b -> c`, even if `b` is a string and `c` is an int. I don't think 
this aligns with user expectations because the user didn't select columns in 
either case. In SQL, the column order would be explicit; even selecting `*` 
implies that there is an ordering to worry about.
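
    The `two` into `one` case can be modeled without Spark. This is only a 
sketch: `InsertMappingSketch`, `byPosition`, and `byName` are hypothetical 
names, and it assumes table `two` exposes its partition column `b` last, so 
its column order is `a`, `c`, `b`.

```scala
// Hypothetical model of how source columns are paired with target columns.
object InsertMappingSketch {
  // Assumed column order: partition column `b` of table "two" comes last.
  val two: Seq[String] = Seq("a", "c", "b")
  val one: Seq[String] = Seq("a", "b", "c")

  // By position: the i-th source column feeds the i-th target column.
  def byPosition(source: Seq[String], target: Seq[String]): Seq[(String, String)] =
    source.zip(target)

  // By name: each target column is fed by the source column of the same name.
  def byName(source: Seq[String], target: Seq[String]): Seq[(String, String)] =
    target.filter(source.contains).map(name => name -> name)
}
```

    With these definitions, `byPosition(two, one)` pairs `c` with `b` and `b` 
with `c`, which is the silent swap described above, while `byName(two, one)` 
pairs every column with its namesake.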


