GitHub user zheh12 opened a pull request:

    https://github.com/apache/spark/pull/21554

    [SPARK-24546] InsertIntoDataSourceCommand make data frame with wrong schema when use kudu.

    
    ## What changes were proposed in this pull request?
    
    I have an HDFS table:
    ```
    hdfs_table(a int,b int,c int)
    ```
    and a Kudu table with a different column order:
    ```
    kudu_table(b int primary key, a int, c int)
    ```
    
    I want to insert into kudu_table:
    ```
    insert into kudu_table select * from hdfs_table
    ```
    
    But the data written to Kudu is misordered: values end up in the wrong columns.
    
    I think the cause is this line of code:
    
    ```
    val df = sparkSession.internalCreateDataFrame(data.queryExecution.toRdd, logicalRelation.schema)
    ```
    I think this code performs no check, so it can break the rule:
    
    > the row data must match the schema it is created with
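    
    A minimal sketch of the mismatch, in plain Scala with no Spark dependency. The names `querySchema`, `relationSchema`, and `row` are hypothetical stand-ins for the schema of the `select`, the schema of `kudu_table`, and one row of `data.queryExecution.toRdd`; they are assumptions for illustration, not Spark code:
    
    ```scala
    object SchemaMismatchSketch {
      // Column order produced by `select * from hdfs_table` (hypothetical stand-in).
      val querySchema: Seq[String] = Seq("a", "b", "c")
      // Column order declared by kudu_table (hypothetical stand-in).
      val relationSchema: Seq[String] = Seq("b", "a", "c")
      // One row laid out in query order: a = 1, b = 2, c = 3.
      val row: Seq[Int] = Seq(1, 2, 3)

      // Pairing the row positionally with the relation schema swaps a and b.
      val mislabeled: Map[String, Int] = relationSchema.zip(row).toMap
      // Pairing the row with the query schema keeps each value under its own name.
      val labeled: Map[String, Int] = querySchema.zip(row).toMap

      def main(args: Array[String]): Unit = {
        println(s"with relation schema: $mislabeled")
        println(s"with query schema:    $labeled")
      }
    }
    ```
    
    With the relation schema, the value of `a` is stored under `b` and vice versa, which would be consistent with the misordering I see in Kudu.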
    
    When the schema of the logicalRelation has a different column order, as with Kudu, we should let the Kudu code handle the conversion, as it already does here:
    
    ```
        val table: KuduTable = syncClient.openTable(tableName)
        val indices: Array[(Int, Int)] = schema.fields.zipWithIndex.map({ case (field, sparkIdx) =>
          sparkIdx -> table.getSchema.getColumnIndex(field.name)
        })
    ```
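    
    The name-based lookup above can be sketched in plain Scala without a Kudu client. The helpers `indices` and `reorder` below are hypothetical, written only for illustration, and assume the source and target have the same set of column names:
    
    ```scala
    object ColumnIndexSketch {
      // For each source column, find its position in the target schema by name.
      def indices(source: Seq[String], target: Seq[String]): Seq[(Int, Int)] =
        source.zipWithIndex.map { case (name, srcIdx) =>
          val tgtIdx = target.indexOf(name)
          require(tgtIdx >= 0, s"column $name not found in target schema")
          srcIdx -> tgtIdx
        }

      // Rearrange a row from source column order into target column order.
      def reorder[A](row: Seq[A], mapping: Seq[(Int, Int)]): Seq[A] =
        mapping.sortBy(_._2).map { case (srcIdx, _) => row(srcIdx) }

      def main(args: Array[String]): Unit = {
        val mapping = indices(Seq("a", "b", "c"), Seq("b", "a", "c"))
        println(reorder(Seq(1, 2, 3), mapping))
      }
    }
    ```
    
    Because the mapping is resolved by column name, it stays correct only if the schema it is given really describes the row layout, which is why the data frame should carry the query schema.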
    
    So I suggest creating the data frame with the query schema and leaving the column conversion to code outside Spark SQL.
    
    ## How was this patch tested?
    
    I tested with Spark 2.3 and Kudu.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zheh12/spark SPARK-24546

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21554.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21554
    
----
commit bc92fcfbd226468960574c487e8be48bc58bb67d
Author: yangz <zheh12@...>
Date:   2018-06-13T10:44:46Z

    [SPARK-24546] InsertIntoDataSourceCommand make data frame with wrong schema

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
