[ 
https://issues.apache.org/jira/browse/KUDU-2371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fengling Wang updated KUDU-2371:
--------------------------------
    Description: 
We've seen cases where users use Spark Streaming to process JSON and upsert 
the results into Kudu. The JSON records may not specify every field; in that 
case, Spark sets the missing column values to NULL. When those records are 
upserted, the corresponding existing row values are replaced by NULL. This is 
correct Kudu-Spark behavior, but not what users expect to see.

 
{noformat}
// This is the original Kudu table.
scala> df.printSchema
root
 |-- key: long (nullable = false)
 |-- int_val: integer (nullable = true)
 |-- string_val: string (nullable = true)

scala> df.show()
+---+-------+----------+
|key|int_val|string_val|
+---+-------+----------+
|123|    200|       foo|
+---+-------+----------+
 
// Put JSON string into dataframe with matching schema.
scala> val json_str_with_partial_columns = "{\"key\" : 123, \"int_val\" : 1}"
scala> val json_rdd = sc.parallelize(Seq(json_str_with_partial_columns))
scala> val df_from_json = sqlContext.read.schema(df.schema).json(json_rdd)
scala> df_from_json.show()
+---+-------+----------+
|key|int_val|string_val|
+---+-------+----------+
|123|      1|      null|
+---+-------+----------+


scala> kuduContext.upsertRows(df_from_json, "kudu_table")

// Below is actual result.
scala> df.show()
+---+-------+----------+
|key|int_val|string_val|
+---+-------+----------+
|123|      1|      null|
+---+-------+----------+

// Below is the desired result.
scala> df.show()
+---+-------+----------+
|key|int_val|string_val|
+---+-------+----------+
|123|      1|       foo|
+---+-------+----------+{noformat}
  

To avoid this situation, we suggest adding an extra flag/option to the 
upsertRows() API that allows treating NULL as unset/omitted. With this option, 
unspecified column values would stay unchanged in the Kudu table.
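A minimal sketch of the merge semantics such an option would give. This is not the Kudu-Spark implementation; the table is modeled as a plain map, and the null-skipping rule is the only point being illustrated: a None (NULL) in the incoming record leaves the existing value untouched instead of overwriting it.

```scala
// Sketch of "treat NULL as unset" upsert semantics, using Option to model
// nullable columns. Not actual Kudu-Spark code; illustration only.
object NullSkippingUpsert {
  type Row = Map[String, Option[Any]]

  // Merge an incoming partial record into an existing row. Columns whose
  // incoming value is None are treated as "not set" and skipped, so the
  // existing value survives; defined values overwrite as usual.
  def merge(existing: Row, incoming: Row): Row =
    existing ++ incoming.filter { case (_, v) => v.isDefined }

  def main(args: Array[String]): Unit = {
    val existing: Row =
      Map("key" -> Some(123L), "int_val" -> Some(200), "string_val" -> Some("foo"))
    // Incoming JSON record with string_val missing (parsed as NULL by Spark).
    val incoming: Row =
      Map("key" -> Some(123L), "int_val" -> Some(1), "string_val" -> None)

    // int_val is updated to 1; string_val keeps its previous value "foo".
    println(merge(existing, incoming))
  }
}
```

With the flag disabled (today's behavior), the filter step would be skipped and `string_val` would be overwritten with NULL, as in the transcript above.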

> Allow Kudu-Spark upsert API to ignore NULL column values
> --------------------------------------------------------
>
>                 Key: KUDU-2371
>                 URL: https://issues.apache.org/jira/browse/KUDU-2371
>             Project: Kudu
>          Issue Type: Improvement
>          Components: spark
>    Affects Versions: 1.6.0
>            Reporter: Fengling Wang
>            Assignee: Fengling Wang
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)