[ https://issues.apache.org/jira/browse/KUDU-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280553#comment-16280553 ]

Dan Burkert commented on KUDU-2235:
-----------------------------------

This is 'by design', and was chosen in order to more closely match the SparkSQL 
on Parquet semantics, where an insert is really an append.  The behavior is 
configurable via the 'kudu.operation' option, which can be set to one of 
'insert', 'insert-ignore', 'upsert', 'update', or 'delete'.  Impala works 
differently because we were able to work with the Impala team to provide 
dedicated syntax for each of these operation types; with SparkSQL we had to 
make do with its existing, more limited APIs.  
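As a sketch, writing with strict insert semantics through the kudu-spark DataSource might look like the following (assuming a DataFrame {{df}} and the {{kuduMaster}} and table names from the report below; this requires a live Kudu cluster and is untested here):

{code}
// Sketch only: needs a running Kudu cluster and the kudu-spark artifact.
// "kudu.operation" selects the write behavior; "insert" fails on duplicate
// keys instead of silently upserting like the default Spark SQL INSERT path.
df.write
  .format("org.apache.kudu.spark.kudu")
  .option("kudu.master", kuduMaster)                // assumed to be in scope
  .option("kudu.table", "impala::default.test13")
  .option("kudu.operation", "insert")               // or insert-ignore, upsert, update, delete
  .mode("append")
  .save()
{code}

Choosing 'insert-ignore' instead would skip rows whose keys already exist rather than erroring, which is often the closest match to Impala's warning-and-continue behavior.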

> Spark SQL insert command is actually an upsert
> ----------------------------------------------
>
>                 Key: KUDU-2235
>                 URL: https://issues.apache.org/jira/browse/KUDU-2235
>             Project: Kudu
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 1.5.0
>         Environment: CDH 5.13
>            Reporter: Diana Carroll
>
> The Spark SQL 'INSERT' command is actually doing an upsert when used on a 
> Kudu table.  
> Example:
> 1) Create a table in Impala like this:
> {code}
> create table test13 (k1 string, c2 string, c3 string, primary key(k1))
> partition by hash partitions 2 stored as kudu
> {code}
> 2) Try an Impala INSERT to demonstrate correct insert behavior
> {code}
> > insert into test13 values ('x','x','x'),('y','y','y');
> Modified 2 row(s), 0 row error(s) in 3.70s
> > select * from test13;
> +----+----+-------------+
> | k1 | c2 | c3          |
> +----+----+-------------+
> | x  | x  | x           |
> | y  | y  | y           |
> +----+----+-------------+
> > insert into test13 values ('x','x','test insert'),('z','z','z');
> WARNINGS: Key already present in Kudu table 'impala::default.test13'.
> Modified 1 row(s), 1 row error(s) in 0.11s
> > select * from test13;
> +----+----+-------------+
> | k1 | c2 | c3          |
> +----+----+-------------+
> | x  | x  | x           |
> | y  | y  | y           |
> | z  | z  | z           |
> +----+----+-------------+
> {code}
> 3) Try the same sequence of operations in Spark (Scala)
> {code}
> scala> val test13 = spark.read.format("org.apache.kudu.spark.kudu").
>   option("kudu.master", kuduMaster).
>   option("kudu.table", "impala::default.test13").load
> scala> test13.createTempView("test13")
> scala> spark.sql("insert into test13 values ('a','a','a'),('c','c','c')")
> scala> test13.show
> +---+---+---+
> | k1| c2| c3|
> +---+---+---+
> |  a|  a|  a|
> |  c|  c|  c|
> +---+---+---+
> scala> spark.sql("insert into test13 values ('a','a','test update'),('d','d','d')")
> scala> test13.show
> +---+---+-----------+
> | k1| c2|         c3|
> +---+---+-----------+
> |  a|  a|test update|
> |  c|  c|          c|
> |  d|  d|          d|
> +---+---+-----------+
> {code}
> Note that in Spark, but not in Impala, the row matching the existing key was 
> changed (updated), and the row with the new key was added.  Neither should 
> happen with an INSERT.
> 'UPSERT' isn't a valid Spark SQL command, so there is no way to request this 
> behavior explicitly.
> This important difference between Impala SQL and Spark SQL with respect to 
> Kudu is confusing.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
