GitHub user rdblue opened a pull request:

    https://github.com/apache/spark/pull/12313

    [SPARK-14543] [SQL] Fix InsertIntoTable column resolution. (WIP)

    WIP: this depends on #12239 and includes its commits for SPARK-14459.
    
    ## What changes were proposed in this pull request?
    
    1. This updates the logic that resolves the output table's columns from the incoming `LogicalPlan`. It catches cases where there are too many data columns and throws an `AnalysisException` rather than silently dropping the extra data. It also improves the error message when there are too few columns and warns when the output columns appear to be out of order.
    2. This combines the pre-insert casts for Hive's `MetastoreRelation` with the pre-insert cast and rename for `LogicalRelation`s. Both are now handled as a single `ResolveOutputColumns` step in the analyzer that implements the above improvements. Casts are now `UpCast`s to avoid silently adding incorrect casts when columns are misaligned.
    3. This adds a by-name column resolution strategy that matches output columns to the incoming data by name. This is exposed on the `DataFrameWriter`:
    
    ```scala
    sqlContext.table("source").write.byName.insertInto("destination")
    ```
    
    ## How was this patch tested?
    
    This patch includes unit tests that exercise the cases outlined above.
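
The two resolution strategies can be illustrated with a plain-Scala sketch (the helper names here are hypothetical and stand in for the actual analyzer logic): by-name resolution looks up each output column in the incoming data by name, while by-position resolution can only pair columns safely when the counts match exactly:

```scala
object ResolutionSketch {
  // Pair each output column with the same-named input column, if any.
  def byName(output: Seq[String], input: Seq[String]): Option[Seq[String]] = {
    val resolved = output.map(col => input.find(_ == col))
    if (resolved.forall(_.isDefined)) Some(resolved.flatten) else None
  }

  // Positional matching is only safe when the column counts line up.
  def byPosition(output: Seq[String], input: Seq[String]): Option[Seq[String]] =
    if (output.length == input.length) Some(input) else None
}
```

Under this sketch, by-name resolution succeeds as long as every output column is found in the input, which covers writing out a reordered data frame; positional resolution refuses a mismatched column count instead of silently dropping data.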

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark SPARK-14543-fix-hive-write-cast-and-rename

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12313.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12313
    
----
commit 792be85b9c694b9967dcd863075c3230abd70225
Author: Ryan Blue <b...@apache.org>
Date:   2016-04-01T19:02:59Z

    SPARK-14459: Detect relation partitioning and adjust the logical plan to match.
    
    This detects a relation's partitioning and adds checks to the analyzer.
    If an InsertIntoTable node has no partitioning, it is replaced by the
    relation's partition scheme and input columns are correctly adjusted,
    placing the partition columns at the end in partition order. If an
    InsertIntoTable node has partitioning, it is checked against the table's
    reported partitions.
    
    These changes required adding a PartitionedRelation trait to the catalog
    interface because Hive's MetastoreRelation doesn't extend
    CatalogRelation.
    
    This commit also includes a fix to InsertIntoTable's resolved logic,
    which now detects that all expected columns are present, including
    dynamic partition columns. Previously, the number of expected columns
    was not checked and resolved was true if there were missing columns.
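
The fixed check can be sketched in plain Scala (hypothetical types; the real logic lives in `InsertIntoTable`'s `resolved`): the plan only counts as resolved when the child supplies one matching column for every expected column, dynamic partition columns included:

```scala
case class Col(name: String, dataType: String)

// Sketch: resolved must verify the column count first; the old code
// zipped the lists, so missing trailing columns were silently skipped.
def resolvedSketch(expected: Seq[Col], incoming: Seq[Col]): Boolean =
  expected.length == incoming.length &&
    expected.zip(incoming).forall { case (e, i) => e.dataType == i.dataType }
```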

commit 247f0566588259cf8d4e2dec77cb06a739c7b86f
Author: Ryan Blue <b...@apache.org>
Date:   2016-04-07T17:17:59Z

    SPARK-14459: Add test for InsertIntoTable resolution.
    
    This tests the bug in InsertIntoTable's resolve method, where zip
    ignores expected output columns because there is no corresponding input
    column.
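
For reference, this is standard Scala `zip` behavior: it truncates to the shorter sequence, so an expected column with no corresponding input column simply vanishes from the pairs instead of causing a failure:

```scala
val expectedCols = Seq("a", "b", "c") // expected output columns
val incomingCols = Seq("x", "y")      // one input column short

// zip drops the unmatched "c" rather than signalling the mismatch.
val pairs = expectedCols.zip(incomingCols)
```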

commit 2a807a97cf793a480dbaf172789a315624db44b9
Author: Ryan Blue <b...@apache.org>
Date:   2016-04-08T22:23:34Z

    SPARK-14459: Update partition spec validation test.
    
    This test expected to fail the strict partition check, but with support
    for table partitioning in the analyzer the problem is caught sooner and
    has a better error message. The message now complains that the
    partitioning doesn't match, rather than about strict mode, which
    wouldn't help.

commit 6491788a3419b091dfd6ca2a14eebb9b0ec33635
Author: Ryan Blue <b...@apache.org>
Date:   2016-04-11T19:51:02Z

    SPARK-14459: Fix error message broken by partition checks.
    
    SPARK-6941 added a test that verifies a reasonable error message when
    trying to write to a OneRowRelation. OneRowRelation's output is always
    Nil, so the additional checks were catching a different problem. This
    commit adds a OneRowRelation class and uses it to set the correct
    expected columns in the test.

commit cf82d952c731af8ca7188bf1dc99b453ff00aa4b
Author: Ryan Blue <b...@apache.org>
Date:   2016-04-05T20:35:50Z

    Move pre-insertion casts into the analyzer and fix edge cases.
    
    This combines Hive's pre-insertion casts (without renames) that handle
    partitioning with the pre-insertion casts/renames in core. The combined
    rule, ResolveOutputColumns, will resolve columns by name or by position.
    Resolving by position detects, and fails on, cases where the number
    of columns is incorrect or where the input columns are a permutation
    of the output columns. When resolving by name, each output column is
    located by name in the child plan, which handles cases where a
    subset of a data frame is written out.
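
The permutation case mentioned above can be illustrated with a small plain-Scala predicate (a hypothetical helper, not the actual `ResolveOutputColumns` code): when the input names are a reordering of the output names, positional matching would silently scramble the data, so the rule should fail rather than cast:

```scala
// Hypothetical check: the inputs are a permutation of the outputs,
// but not in the same order, so a positional insert would misplace data.
def looksReordered(output: Seq[String], input: Seq[String]): Boolean =
  output != input && output.sorted == input.sorted
```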

----


