[ https://issues.apache.org/jira/browse/KUDU-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16280648#comment-16280648 ]
Dan Burkert commented on KUDU-2235:
-----------------------------------
I don't think so. The only Spark documentation I'm aware of is http://kudu.apache.org/docs/developing.html#_kudu_integration_with_spark, and that doesn't cover it. We should leave this JIRA open to track adding docs for the feature.

> Spark SQL insert command is actually an upsert
> ----------------------------------------------
>
>                 Key: KUDU-2235
>                 URL: https://issues.apache.org/jira/browse/KUDU-2235
>             Project: Kudu
>          Issue Type: Bug
>          Components: spark
>    Affects Versions: 1.5.0
>        Environment: CDH 5.13
>            Reporter: Diana Carroll
>
> The Spark SQL 'INSERT' command actually performs an upsert when used on a
> Kudu table.
> Example:
> 1) Create a table in Impala like this:
> {code}
> create table test13 (k1 string, c2 string, c3 string, primary key(k1))
> partition by hash partitions 2 stored as kudu
> {code}
> 2) Run an Impala INSERT to demonstrate correct insert behavior:
> {code}
> insert into test13 values ('x','x','x'),('y','y','y');
> Modified 2 row(s), 0 row error(s) in 3.70s
>
> select * from test13;
> +----+----+----+
> | k1 | c2 | c3 |
> +----+----+----+
> | x  | x  | x  |
> | y  | y  | y  |
> +----+----+----+
>
> insert into test13 values ('x','x','test insert'),('z','z','z');
> WARNINGS: Key already present in Kudu table 'impala::default.test13'.
> Modified 1 row(s), 1 row error(s) in 0.11s
>
> select * from test13;
> +----+----+----+
> | k1 | c2 | c3 |
> +----+----+----+
> | x  | x  | x  |
> | y  | y  | y  |
> | z  | z  | z  |
> +----+----+----+
> {code}
> 3) Run the same sequence of operations in Spark (Scala):
> {code}
> scala> val test13 = spark.read.format("org.apache.kudu.spark.kudu").option("kudu.master", kuduMaster).option("kudu.table", "impala::default.test13").load
> scala> test13.createTempView("test13")
> scala> spark.sql("insert into test13 values ('a','a','a'),('c','c','c')")
> scala> test13.show
> +---+---+---+
> | k1| c2| c3|
> +---+---+---+
> |  a|  a|  a|
> |  c|  c|  c|
> +---+---+---+
> scala> spark.sql("insert into test13 values ('a','a','test update'),('d','d','d')")
> scala> test13.show
> +---+---+-----------+
> | k1| c2|         c3|
> +---+---+-----------+
> |  a|  a|test update|
> |  c|  c|          c|
> |  d|  d|          d|
> +---+---+-----------+
> {code}
> Note that in Spark, but not in Impala, the row matching the existing key was
> changed (updated), and the row with the new key was added. Neither should
> happen with an insert. 'Upsert' isn't actually a valid command.
> This important difference between Impala SQL and Spark SQL with respect to
> Kudu is confusing.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
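Until the docs cover this, one way to get true insert semantics (reject rows whose key already exists rather than update them) is to bypass Spark SQL's INSERT and use the kudu-spark `KuduContext` API directly. A minimal sketch for the spark-shell, assuming a reachable Kudu master (`kuduMaster`) and the `impala::default.test13` table from the example above:

```scala
import org.apache.kudu.spark.kudu.KuduContext
import spark.implicits._  // already in scope in spark-shell

// KuduContext lets us choose the write operation explicitly,
// unlike the DataSource path behind Spark SQL's INSERT.
val kuduContext = new KuduContext(kuduMaster, spark.sparkContext)

val newRows = Seq(("a", "a", "test update"), ("d", "d", "d"))
  .toDF("k1", "c2", "c3")

// insertRows issues true INSERT operations: the row with the
// existing key 'a' is reported as an error instead of being
// silently updated, while the new row 'd' is inserted.
kuduContext.insertRows(newRows, "impala::default.test13")
```

There is a matching `kuduContext.upsertRows(...)` for when upsert is the behavior actually wanted, which makes the intent explicit rather than relying on INSERT's surprising default.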