[jira] [Comment Edited] (HUDI-2558) Clustering w/ sort columns with null values fails

Sagar Sumit (Jira) Fri, 05 Nov 2021 09:34:04 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17439242#comment-17439242
 ]


Sagar Sumit edited comment on HUDI-2558 at 11/5/21, 4:33 PM:
-------------------------------------------------------------

Hudi [simply returns null|https://github.com/apache/hudi/blob/3af6568d316f410184e3d4dcfdbf00a8802b1fb8/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java]
for a sort column whose value is null. Subsequently, when the RDD is sorted,
[Spark does a compare|https://github.com/apache/spark/blob/v2.4.7/core/src/main/java/org/apache/spark/util/collection/TimSort.java#L270-L277]
that requires the keys being compared to be non-null, so the sort fails when it
encounters the null key.
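
As a rough analogy only (plain Python, not Hudi or Spark code), the sketch below shows the same kind of failure: a sort whose comparison step cannot handle a null key. The record layout and the helper names {{sort_by_column}} and {{try_sort}} are made up for illustration.

```python
# Illustrative records; the second one has a null sort key.
records = [
    {"id": 1, "sort_column": "b"},
    {"id": 2, "sort_column": None},
    {"id": 3, "sort_column": "a"},
]


def sort_by_column(rows, column):
    """Sort rows by the raw column value, with no special handling of None."""
    return sorted(rows, key=lambda row: row[column])


def try_sort(rows, column):
    """Return the sorted rows, or the TypeError raised while comparing keys."""
    try:
        return sort_by_column(rows, column)
    except TypeError as exc:
        # Python 3 cannot order None against a str, so the comparison fails.
        return exc


result = try_sort(records, "sort_column")
print(type(result).__name__)
```

The point of the analogy is that the failure comes from the comparison inside the sort, not from reading the null value itself.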

We could make this behavior configurable, i.e. replace nulls with a
user-configured default value. However, I think it's best to retain the current
behavior and document it. There is some discussion in the Guava community around
this behavior; see https://github.com/google/guava/issues/5460

A simple workaround is to fill in default values after reading the dataframe but
before writing to the Hudi table:

{code:python}
df = df.fillna({'sort_column': 'default_value'})
{code}
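
To make the effect of the workaround concrete, here is a minimal plain-Python sketch of the same idea (fill nulls first, then sort). It does not use Spark or Hudi; {{fill_nulls}} is a hypothetical helper that only mimics what {{fillna}} does for the given columns, and the rows are invented sample data.

```python
def fill_nulls(rows, defaults):
    """Return copies of rows with None replaced by per-column defaults.

    A missing column is treated like a null here (an assumption of this sketch).
    """
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for column, default in defaults.items():
            if new_row.get(column) is None:
                new_row[column] = default
        filled.append(new_row)
    return filled


rows = [
    {"id": 1, "sort_column": "b"},
    {"id": 2, "sort_column": None},
    {"id": 3, "sort_column": "a"},
]

# Fill first, then sort: every key is now a non-null string.
filled = fill_nulls(rows, {"sort_column": "default_value"})
ordered = sorted(filled, key=lambda row: row["sort_column"])
print([row["id"] for row in ordered])
```

Note that the chosen default decides where the formerly-null rows land in the sort order, so it is worth picking a value that sorts where you want those rows to go.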

cc [~vinoth] [~satishkotha]




> Clustering w/ sort columns with null values fails
> -------------------------------------------------
>
>                 Key: HUDI-2558
>                 URL: https://issues.apache.org/jira/browse/HUDI-2558
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Writer Core
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>              Labels: sev:critical, user-support-issues
>             Fix For: 0.10.0
>
>
> https://github.com/apache/hudi/issues/3766



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
