[ 
https://issues.apache.org/jira/browse/KUDU-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346494#comment-15346494
 ] 

Tom White commented on KUDU-1493:
---------------------------------

Key columns in Kudu must be declared before other, non-key columns. To cope
with this constraint, the write side of the Spark-Kudu integration maps each
Spark dataframe field index to the corresponding Kudu column index:

https://github.com/apache/incubator-kudu/blob/master/java/kudu-spark/src/main/scala/org/kududb/spark/kudu/KuduContext.scala#L152-L154

However, on the read side, Spark dataframe indexes and Kudu column indexes are
conflated:

https://github.com/apache/incubator-kudu/blob/master/java/kudu-spark/src/main/scala/org/kududb/spark/kudu/KuduRDD.scala#L114-L128

The read fails if the key columns are not the leading columns of the dataframe
schema. Worse, if the permuted fields happen to have the same types, the data
is read incorrectly.

The fix is to do the reverse mapping on the read side.
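
To illustrate, here is a minimal sketch in plain Scala. It is not the actual
KuduRDD code: it uses no Spark or Kudu types, and the names sparkFields,
kuduColumns, readByPosition and readByName are hypothetical. It assumes a
dataframe schema of (A, B, C) and Kudu keys (A, C), so Kudu stores the columns
as (A, C, B).

```scala
object ReadSideMapping {
  // Hypothetical stand-ins: column names only, no Spark or Kudu types.
  val sparkFields: Seq[String] = Seq("A", "B", "C") // dataframe schema order
  val kuduColumns: Seq[String] = Seq("A", "C", "B") // Kudu order: keys (A, C) first

  // One row as it would come back from Kudu, in Kudu column order.
  val kuduRow: Seq[Any] = Seq("a-val", "c-val", "b-val")

  // Conflated read: assumes Spark field i lives at Kudu column i.
  def readByPosition(row: Seq[Any]): Seq[Any] =
    sparkFields.indices.map(i => row(i))

  // Mapped read: look up each Spark field's Kudu column index by name.
  def readByName(row: Seq[Any]): Seq[Any] = {
    val kuduIndex: Map[String, Int] = kuduColumns.zipWithIndex.toMap
    sparkFields.map(name => row(kuduIndex(name)))
  }
}
```

With these inputs, readByPosition returns the values for B and C swapped,
while readByName returns them in dataframe order. If B and C had different
types, the positional read would typically fail with a type error instead.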

> Spark read fails if key columns are not leading columns
> -------------------------------------------------------
>
>                 Key: KUDU-1493
>                 URL: https://issues.apache.org/jira/browse/KUDU-1493
>             Project: Kudu
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 0.9.0
>            Reporter: Tom White
>
> If the Spark dataframe schema is (A, B, C) then reading will fail if the Kudu 
> keys are (A, C). Keys (A, B) work fine.



