[jira] [Updated] (KUDU-1533) Spark Kudu Rdd/Dataframe upsert

Qutiba (JIRA) Fri, 15 Jul 2016 04:02:25 -0700

     [ 
https://issues.apache.org/jira/browse/KUDU-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Qutiba updated KUDU-1533:
-------------------------
    Description: 
It is not clear how to upsert a Kudu RDD or DataFrame into an existing Kudu table.
The documentation, under "Kudu integration with Spark", lists these write
operations:
***********************************************
// then we can insert data into the kudu table
df.write.options(Map("kudu.master" -> "your.kudu.master.here",
  "kudu.table" -> "your.kudu.table.here")).mode("append").kudu

// to update existing data change the mode to 'overwrite'
df.write.options(Map("kudu.master" -> "your.kudu.master.here",
  "kudu.table" -> "your.kudu.table.here")).mode("overwrite").kudu
****************************************************************
But there is no way to request an upsert, for example:
kuduDataFrame.write.options(Map("kudu.master" -> Kudu_Master,
  "kudu.table" -> TargetTable)).mode("upsert").kudu
***************************************************************
The current workaround, which is quite slow, is:
Call DataFrame.foreachPartition and, in each partition:
- open the table
- create a session
- for each row in the partition:
    - create an upsert operation
    - get the row from the operation
    - add all fields and values to this row
    - apply the operation
----------------------------------
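The steps above can be sketched roughly as follows. This is a minimal sketch, not the reporter's actual code: the column names "id" and "count", their types, and the object name KuduUpsertSketch are hypothetical, and the org.apache.kudu.client package name is an assumption (releases of that period may still ship the client under a different package). It also switches the session to AUTO_FLUSH_BACKGROUND, because the client's default AUTO_FLUSH_SYNC mode waits for the server on every applied operation. That may be one contributor to the slowness, though the report does not confirm it.

```scala
import org.apache.kudu.client.{KuduClient, SessionConfiguration}
import org.apache.spark.sql.{DataFrame, Row}

object KuduUpsertSketch {
  // Hypothetical column names, for illustration only.
  val KeyColumn = "id"      // assumed to be a STRING primary key
  val ValueColumn = "count" // assumed to be an INT32 column

  /** Upserts all rows of one partition through a single client and session. */
  def upsertPartition(rows: Iterator[Row], kuduMaster: String, tableName: String): Unit = {
    val client = new KuduClient.KuduClientBuilder(kuduMaster).build()
    try {
      // Open the table and create a session once per partition.
      val table = client.openTable(tableName)
      val session = client.newSession()
      // Let the client batch operations in the background instead of
      // waiting for a server round trip on every apply().
      session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND)
      try {
        rows.foreach { sparkRow =>
          val upsert = table.newUpsert()   // create the upsert operation
          val kuduRow = upsert.getRow()    // get the row from the operation
          // Add fields and values to the row (hypothetical columns).
          kuduRow.addString(KeyColumn, sparkRow.getAs[String](KeyColumn))
          kuduRow.addInt(ValueColumn, sparkRow.getAs[Int](ValueColumn))
          session.apply(upsert)            // queue the operation
        }
        // Push out anything still buffered, then surface background errors.
        session.flush()
        val failed = session.countPendingErrors()
        if (failed > 0) {
          throw new RuntimeException(s"$failed upserts failed on table $tableName")
        }
      } finally {
        session.close()
      }
    } finally {
      client.close()
    }
  }

  /** Runs the per-partition upsert on every partition of the DataFrame. */
  def upsert(df: DataFrame, kuduMaster: String, tableName: String): Unit =
    // RDD.foreachPartition takes a plain Scala function over the partition's rows.
    df.rdd.foreachPartition(rows => upsertPartition(rows, kuduMaster, tableName))
}
```

With these assumptions, the call would look like KuduUpsertSketch.upsert(kuduDataFrame, Kudu_Master, TargetTable). It still hand-codes the column mapping per table, which is the boilerplate an upsert write mode would remove.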
This workaround is quite slow. Adding an upsert mode to the DataFrame writer
for Kudu tables would be better than opening sessions and creating operations
by hand as in the workaround above, for example:
kuduDataFrame.write.options(Map("kudu.master" -> Kudu_Master,
  "kudu.table" -> TargetTable)).mode("upsert").kudu




> Spark Kudu Rdd/Dataframe upsert 
> --------------------------------
>
>                 Key: KUDU-1533
>                 URL: https://issues.apache.org/jira/browse/KUDU-1533
>             Project: Kudu
>          Issue Type: Bug
>         Environment: Spark
>            Reporter: Qutiba



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
