[
https://issues.apache.org/jira/browse/HUDI-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
satish updated HUDI-1443:
-------------------------
Description:
https://github.com/apache/hudi/pull/2263#discussion_r533653930 has the context.
When sorting is specified as part of clustering, we use the custom partitioner
RDDCustomColumnsSortPartitioner. It deserializes the schema to get the values
of the sort columns. Check whether it is possible to avoid this and implement
the suggestion in the PR.
We tried another approach by adding
[SerializableSchema|https://github.com/apache/hudi/pull/2263/files#diff-92c1237d9746afaae1b8ad84d38eab0f74399587c4858a6ded64d60bfdbc33dc].
However, this does not work for nested schemas. See the failing test
[here|https://github.com/apache/hudi/pull/2263/files#diff-8e3156ac499f22081cf2e13aa6b192eadbc2e8a8d82c627df7811414bc7a60cfR50].
Fix this serialization and use it in RDDCustomColumnsSortPartitioner.
was: https://github.com/apache/hudi/pull/2263#discussion_r533653930 has the
context. When sorting is specified as part of clustering, we use the custom
partitioner RDDCustomColumnsSortPartitioner. It deserializes the schema to get
the values of the sort columns. Check whether it is possible to avoid this and
implement the suggestion in the PR.
> Remove record deserialization in RDDCustomColumnsSortPartitioner
> ----------------------------------------------------------------
>
> Key: HUDI-1443
> URL: https://issues.apache.org/jira/browse/HUDI-1443
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Performance
> Reporter: satish
> Priority: Major
>
> https://github.com/apache/hudi/pull/2263#discussion_r533653930 has the
> context. When sorting is specified as part of clustering, we use the custom
> partitioner RDDCustomColumnsSortPartitioner. It deserializes the schema to get
> the values of the sort columns. Check whether it is possible to avoid this and
> implement the suggestion in the PR.
> We tried another approach by adding
> [SerializableSchema|https://github.com/apache/hudi/pull/2263/files#diff-92c1237d9746afaae1b8ad84d38eab0f74399587c4858a6ded64d60bfdbc33dc].
> However, this does not work for nested schemas. See the failing test
> [here|https://github.com/apache/hudi/pull/2263/files#diff-8e3156ac499f22081cf2e13aa6b192eadbc2e8a8d82c627df7811414bc7a60cfR50].
> Fix this serialization and use it in RDDCustomColumnsSortPartitioner.
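
For readers unfamiliar with the pattern, the description above refers to wrapping a schema object that is not java.io.Serializable so that it can be shipped inside a partitioner. The following is only a minimal sketch of that general technique, not the SerializableSchema class from the PR. FakeSchema, SerializableFakeSchema and roundTrip are hypothetical names introduced here, and FakeSchema merely stands in for a real schema type that can be rendered to text and parsed back. The sketch does not attempt to diagnose the nested-schema failure mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

class SerializableWrapperSketch {

    /** Hypothetical stand-in for a schema type that is not java.io.Serializable. */
    static final class FakeSchema {
        private final String json;

        private FakeSchema(String json) {
            this.json = json;
        }

        static FakeSchema parse(String json) {
            return new FakeSchema(json);
        }

        @Override
        public String toString() {
            return json;
        }
    }

    /** Hypothetical wrapper: ships the schema as text and re-parses it on read. */
    static final class SerializableFakeSchema implements Serializable {
        private static final long serialVersionUID = 1L;

        // transient: default serialization skips this field entirely.
        private transient FakeSchema schema;

        SerializableFakeSchema(FakeSchema schema) {
            this.schema = schema;
        }

        FakeSchema get() {
            return schema;
        }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            // Length-prefixed UTF-8 bytes; writeUTF is limited to 65535 encoded bytes.
            byte[] bytes = schema.toString().getBytes(StandardCharsets.UTF_8);
            out.writeInt(bytes.length);
            out.write(bytes);
        }

        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            schema = FakeSchema.parse(new String(bytes, StandardCharsets.UTF_8));
        }
    }

    /** Serializes the wrapper to a byte array and reads it back. */
    static SerializableFakeSchema roundTrip(SerializableFakeSchema wrapper) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(wrapper);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return (SerializableFakeSchema) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String json = "{\"type\":\"record\",\"name\":\"outer\",\"fields\":[]}";
        SerializableFakeSchema copy = roundTrip(new SerializableFakeSchema(FakeSchema.parse(json)));
        System.out.println(copy.get().toString().equals(json));
    }
}
```

The length-prefixed byte array is a design choice of this sketch: it avoids the size limit of writeUTF, which matters once a schema's text form grows large.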