GitHub user rdblue commented on the pull request: https://github.com/apache/spark/pull/12313#issuecomment-222205450

@yhuai, I'll answer #2 first since it's quick: the column names are used to create a projection of the incoming data frame, so any extra columns aren't selected and are ignored. If we think that's a bad assumption, I'm happy to add an analysis exception or an option to turn strict checking on or off for that case. What do you think is the right thing to do?

For #1, can you explain why you think it is better to separate the by-name column resolution into a separate command? The problem I'm trying to fix is that the SQL-based behavior of `insertInto` is not obvious when using data frames, because users expect the API to work like normal serialization libraries, where names matter rather than order. If we were to leave `insertInto` as it is and add another command, then I think it would be more confusing, not less. Adding another command won't make it obvious that `insertInto` doesn't do what a user expects.

The data frame API doesn't make the ordering explicit, unlike SQL. For example, `df.groupBy("c").agg(max("d") as "m")` results in a data frame with `c` and `m`. Adding `c` automatically isn't carried over from SQL, and it isn't obvious whether it will be missing, at the beginning, or at the end. I think this flexibility is a strength of data frames, but it means that we should handle some situations, like writing, in the way that users expect. Users commonly refer to columns by name, so I think it makes sense to have an option for this in the writer API.

Say I have one table with columns `a`, `b`, `c` and a second with columns `a`, `c` that's partitioned by `b`. Using `spark.table("one").write.insertInto("two")` works as expected. But if I go the other way, `spark.table("two").write.insertInto("one")`, then it silently maps `c -> b` and `b -> c`, even if `b` is a string and `c` is an int. I don't think this aligns with user expectations because the user didn't select columns in either case. In SQL, the column order would be explicit; even selecting `*` implies that there is an ordering to worry about.
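
To make the difference concrete, here is a minimal plain-Python sketch of the two resolution strategies for the `two` into `one` case. It does not use Spark and is not the code in this PR. The function names (`resolve_by_position`, `resolve_by_name`, `type_mismatches`) are hypothetical, and the source column order `a`, `c`, `b` is an assumption inferred from the `c -> b`, `b -> c` mapping described above. The extra column `d` is only there to illustrate the projection that drops unselected columns.

```python
# Plain-Python model of positional vs. by-name column resolution.
# Schemas are ordered lists of (column name, type) pairs. No Spark involved.

# Assumed column order of the source data frame (table "two"): a, c, b.
SOURCE = [("a", "int"), ("c", "int"), ("b", "string")]
# Column order of the target table (table "one"): a, b, c.
TARGET = [("a", "int"), ("b", "string"), ("c", "int")]


def resolve_by_position(source, target):
    """Pair the i-th source column with the i-th target column."""
    if len(source) != len(target):
        raise ValueError("column count mismatch")
    return [(s_name, t_name) for (s_name, _), (t_name, _) in zip(source, target)]


def resolve_by_name(source, target):
    """Pair columns by name; extra source columns are projected away."""
    source_names = {name for name, _ in source}
    missing = [name for name, _ in target if name not in source_names]
    if missing:
        raise ValueError(f"source is missing columns: {missing}")
    return [(name, name) for name, _ in target]


def type_mismatches(pairs, source, target):
    """Return the (source, target) pairs whose declared types differ."""
    source_types = dict(source)
    target_types = dict(target)
    return [(s, t) for s, t in pairs if source_types[s] != target_types[t]]


positional = resolve_by_position(SOURCE, TARGET)
by_name = resolve_by_name(SOURCE + [("d", "int")], TARGET)  # "d" is ignored

print("by position:", positional)
print("  type mismatches:", type_mismatches(positional, SOURCE, TARGET))
print("by name:", by_name)
print("  type mismatches:", type_mismatches(by_name, SOURCE, TARGET))
```

In this model the positional strategy pairs `c` with `b` and `b` with `c` without complaint, while the by-name strategy pairs each column with itself and never looks at the extra column.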