GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/12313
[SPARK-14543] [SQL] Fix InsertIntoTable column resolution. (WIP)

WIP: this depends on #12239 and includes its commits for SPARK-14459.

## What changes were proposed in this pull request?

1. This updates the logic that resolves the output table's columns from the incoming `LogicalPlan`. It catches cases where there are too many data columns and throws an `AnalysisException` rather than silently dropping the extra data. It also improves the error message when there are too few columns and warns when the output columns appear to be out of order.
2. This combines the pre-insert casts for Hive's `MetastoreRelation` with the pre-insert cast and rename for `LogicalRelations`. Both are now handled as a single `ResolveOutputColumns` step in the analyzer that implements the above improvements. Casts are now `UpCast`s to avoid silently adding incorrect casts when columns are misaligned.
3. This adds a by-name column resolution strategy that matches output columns to the incoming data by name. This is exposed on the `DataFrameWriter`:

```scala
sqlContext.table("source").write.byName.insertInto("destination")
```

## How was this patch tested?

This patch includes unit tests that exercise the cases outlined above.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark SPARK-14543-fix-hive-write-cast-and-rename

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12313.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #12313

----

commit 792be85b9c694b9967dcd863075c3230abd70225
Author: Ryan Blue <b...@apache.org>
Date: 2016-04-01T19:02:59Z

SPARK-14459: Detect relation partitioning and adjust the logical plan to match.

This detects a relation's partitioning and adds checks to the analyzer.
If an InsertIntoTable node has no partitioning, it is replaced by the relation's partition scheme, and the input columns are adjusted accordingly, placing the partition columns at the end in partition order. If an InsertIntoTable node has partitioning, it is checked against the table's reported partitions.

These changes required adding a PartitionedRelation trait to the catalog interface because Hive's MetastoreRelation doesn't extend CatalogRelation.

This commit also includes a fix to InsertIntoTable's resolved logic, which now verifies that all expected columns are present, including dynamic partition columns. Previously, the number of expected columns was not checked, and resolved was true even when columns were missing.

commit 247f0566588259cf8d4e2dec77cb06a739c7b86f
Author: Ryan Blue <b...@apache.org>
Date: 2016-04-07T17:17:59Z

SPARK-14459: Add test for InsertIntoTable resolution.

This tests the bug in InsertIntoTable's resolve method, where zip ignores expected output columns that have no corresponding input column.

commit 2a807a97cf793a480dbaf172789a315624db44b9
Author: Ryan Blue <b...@apache.org>
Date: 2016-04-08T22:23:34Z

SPARK-14459: Update partition spec validation test.

This test expected to fail the strict partition check, but with support for table partitioning in the analyzer, the problem is caught sooner and produces a better error message. The message now complains that the partitioning doesn't match rather than about strict mode, which wouldn't help.

commit 6491788a3419b091dfd6ca2a14eebb9b0ec33635
Author: Ryan Blue <b...@apache.org>
Date: 2016-04-11T19:51:02Z

SPARK-14459: Fix error message broken by partition checks.

SPARK-6941 added a test that verifies a reasonable error message when trying to write to a OneRowRelation. OneRowRelation's output is always Nil, so the additional checks were catching a different problem. This commit adds a OneRowRelation class and uses it to set the correct expected columns in the test.
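The partition-column adjustment described in the first commit above can be sketched as follows. This is a minimal, hypothetical helper operating on plain column names; the actual analyzer rule works on `LogicalPlan` attributes, not strings:

```scala
// Sketch only (hypothetical helper, not the real rule): move the
// partition columns to the end of the insert's column list, in the
// table's declared partition order, keeping the data columns in their
// original input order.
def placePartitionColumnsLast(
    columns: Seq[String],
    partitionOrder: Seq[String]): Seq[String] = {
  val dataColumns = columns.filterNot(c => partitionOrder.contains(c))
  dataColumns ++ partitionOrder.filter(p => columns.contains(p))
}
```

For example, inserting columns `ds, id, value` into a table partitioned by `ds` would be rewritten so that `ds` trails the data columns.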
commit cf82d952c731af8ca7188bf1dc99b453ff00aa4b
Author: Ryan Blue <b...@apache.org>
Date: 2016-04-05T20:35:50Z

Move pre-insertion casts into the analyzer and fix edge cases.

This combines Hive's pre-insertion casts (without renames) that handle partitioning with the pre-insertion casts/renames in core. The combined rule, ResolveOutputColumns, resolves columns by name or by position. Resolving by position detects and fails on cases where the number of columns is incorrect or where the input columns are a permutation of the output columns. When resolving by name, each output column is located by name in the child plan, which handles cases where a subset of a data frame is written out.

----
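The two resolution strategies combined in ResolveOutputColumns can be illustrated with a simplified sketch. The names below are hypothetical and the helpers operate on plain strings; the actual rule works on analyzer attributes and inserts `UpCast`s:

```scala
// Illustrative sketch of the two strategies described above; not the
// actual ResolveOutputColumns rule.
object OutputResolutionSketch {
  // By position: fail when the column counts differ (instead of silently
  // dropping extra data), and warn when the input columns look like a
  // permutation of the output columns.
  def byPosition(data: Seq[String], output: Seq[String]): Seq[(String, String)] = {
    if (data.size != output.size)
      throw new IllegalArgumentException(
        s"Expected ${output.size} columns but got ${data.size}")
    if (data != output && data.sorted == output.sorted)
      Console.err.println(s"Warning: input columns may be out of order: $data")
    data.zip(output)
  }

  // By name: locate each output column among the incoming columns; this
  // also handles writing out a subset of a data frame's columns.
  def byName(data: Seq[String], output: Seq[String]): Seq[String] =
    output.map { name =>
      if (data.contains(name)) name
      else throw new IllegalArgumentException(s"Cannot resolve column: $name")
    }
}
```

By-position matching pairs columns in order and rejects count mismatches; by-name matching ignores input order entirely and fails only when an output column has no same-named input column.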