[
https://issues.apache.org/jira/browse/HUDI-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17439242#comment-17439242
]
Sagar Sumit edited comment on HUDI-2558 at 11/5/21, 4:33 PM:
-------------------------------------------------------------
Hudi [simply returns
null|https://github.com/apache/hudi/blob/3af6568d316f410184e3d4dcfdbf00a8802b1fb8/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java]
for a column whose value is null. Subsequently, when the RDD is sorted, [Spark
does a
compare|https://github.com/apache/spark/blob/v2.4.7/core/src/main/java/org/apache/spark/util/collection/TimSort.java#L270-L277]
that requires the keys being compared to be non-null.
We could make this behavior configurable, i.e. replace nulls with a
user-configured default value. However, I think it's best to retain the current
behavior and document it. There is some discussion in the Guava community around
this behavior; refer to https://github.com/google/guava/issues/5460
A simple workaround is to fill in default values after reading the dataframe but
before writing the Hudi table:
{code:python}
df = df.fillna({'sort_column': 'default_value'})
{code}
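If there are several sort columns of different types, the fill mapping can be built from the dataframe's column types, since each default should match the type of its column. The following is only a rough sketch of that idea: the helper name {{build_fill_values}} and the defaults in {{DEFAULTS_BY_TYPE}} are hypothetical and not part of Hudi or Spark, and the defaults should be chosen to suit the data.

```python
# Hypothetical per-type defaults (an assumption for illustration);
# pick values that make sense for your data and sort order.
DEFAULTS_BY_TYPE = {
    "string": "",
    "int": 0,
    "bigint": 0,
    "double": 0.0,
    "boolean": False,
}


def build_fill_values(column_types, sort_columns):
    """Return a {column: default} mapping for the given sort columns.

    column_types maps column name -> type name, e.g. dict(df.dtypes)
    for a PySpark DataFrame.
    """
    fill_values = {}
    for name in sort_columns:
        type_name = column_types[name]
        if type_name not in DEFAULTS_BY_TYPE:
            # Fail loudly rather than silently leaving nulls in a sort column.
            raise ValueError(f"no default configured for {name!r} of type {type_name!r}")
        fill_values[name] = DEFAULTS_BY_TYPE[type_name]
    return fill_values


# Usage with a PySpark DataFrame `df` (not executed here):
#   df = df.fillna(build_fill_values(dict(df.dtypes), ["sort_column"]))
```

Raising on an unknown type is a deliberate choice in this sketch, so that a sort column is never left with nulls unnoticed.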
cc [~vinoth] [~satishkotha]
> Clustering w/ sort columns with null values fails
> -------------------------------------------------
>
> Key: HUDI-2558
> URL: https://issues.apache.org/jira/browse/HUDI-2558
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Writer Core
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Major
> Labels: sev:critical, user-support-issues
> Fix For: 0.10.0
>
>
> https://github.com/apache/hudi/issues/3766