[jira] [Updated] (HUDI-1443) Remove record deserialization in RDDCustomColumnsSortPartitioner

satish (Jira) Mon, 21 Dec 2020 12:35:35 -0800


     [ https://issues.apache.org/jira/browse/HUDI-1443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


satish updated HUDI-1443:
-------------------------
    Description: 
https://github.com/apache/hudi/pull/2263#discussion_r533653930 has the context. 
When sorting is specified as part of clustering, we use the custom partitioner 
RDDCustomColumnsSortPartitioner. This deserializes each record using the schema 
to get the values of the sort columns. Check whether it is possible to avoid 
this and implement the suggestion in the PR.

We tried another approach by adding 
[SerializableSchema|https://github.com/apache/hudi/pull/2263/files#diff-92c1237d9746afaae1b8ad84d38eab0f74399587c4858a6ded64d60bfdbc33dc], 
but it does not work for nested schemas; see the failing test 
[here|https://github.com/apache/hudi/pull/2263/files#diff-8e3156ac499f22081cf2e13aa6b192eadbc2e8a8d82c627df7811414bc7a60cfR50]. 
Fix this serialization and use it in RDDCustomColumnsSortPartitioner.
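
To illustrate the general idea behind a serializable schema wrapper, here is a 
minimal sketch. It assumes the schema type cannot be Java-serialized directly 
but can be rebuilt from its JSON string. SchemaStandIn and 
SerializableSchemaSketch are hypothetical names used only for illustration; 
they are not the classes in the PR, and the sketch does not reproduce the 
nested-schema failure described above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper: ships a schema as its JSON text and re-parses it on arrival.
final class SerializableSchemaSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    // Hypothetical stand-in for a schema type that is NOT java.io.Serializable.
    static final class SchemaStandIn {
        private final String json;
        private SchemaStandIn(String json) { this.json = json; }
        static SchemaStandIn parse(String json) { return new SchemaStandIn(json); }
        @Override public String toString() { return json; }
    }

    // transient: default serialization skips this field, the hooks below handle it.
    private transient SchemaStandIn schema;

    SerializableSchemaSketch(SchemaStandIn schema) { this.schema = schema; }

    SchemaStandIn get() { return schema; }

    // Write the schema as length-prefixed UTF-8 bytes (no writeUTF size limit).
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        byte[] json = schema.toString().getBytes(StandardCharsets.UTF_8);
        out.writeInt(json.length);
        out.write(json);
    }

    // Rebuild the schema from the JSON bytes written by writeObject.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        byte[] json = new byte[in.readInt()];
        in.readFully(json);
        schema = SchemaStandIn.parse(new String(json, StandardCharsets.UTF_8));
    }

    // Helper that serializes and deserializes in memory, as a closure shipped to a worker would be.
    static SerializableSchemaSketch roundTrip(SerializableSchemaSketch wrapper) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(wrapper);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SerializableSchemaSketch) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        String nested = "{\"type\":\"record\",\"name\":\"outer\",\"fields\":["
            + "{\"name\":\"inner\",\"type\":{\"type\":\"record\",\"name\":\"in\",\"fields\":[]}}]}";
        SerializableSchemaSketch copy =
            roundTrip(new SerializableSchemaSketch(SchemaStandIn.parse(nested)));
        // Compare the schema text before and after the round trip.
        System.out.println(copy.get().toString().equals(nested));
    }
}
```

A wrapper of this shape would be captured by the partitioner's closure in 
place of the raw schema, so the schema is parsed once per deserialized wrapper 
rather than carried as a non-serializable field.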

  was:https://github.com/apache/hudi/pull/2263#discussion_r533653930 has the 
context. When sorting is specified as part of clustering, we use the custom 
partitioner RDDCustomColumnsSortPartitioner. This deserializes each record 
using the schema to get the values of the sort columns. Check whether it is 
possible to avoid this and implement the suggestion in the PR.


> Remove record deserialization in RDDCustomColumnsSortPartitioner
> ----------------------------------------------------------------
>
>                 Key: HUDI-1443
>                 URL: https://issues.apache.org/jira/browse/HUDI-1443
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Performance
>            Reporter: satish
>            Priority: Major
>
> https://github.com/apache/hudi/pull/2263#discussion_r533653930 has the 
> context. When sorting is specified as part of clustering, we use the custom 
> partitioner RDDCustomColumnsSortPartitioner. This deserializes each record 
> using the schema to get the values of the sort columns. Check whether it is 
> possible to avoid this and implement the suggestion in the PR.
> We tried another approach by adding 
> [SerializableSchema|https://github.com/apache/hudi/pull/2263/files#diff-92c1237d9746afaae1b8ad84d38eab0f74399587c4858a6ded64d60bfdbc33dc], 
> but it does not work for nested schemas; see the failing test 
> [here|https://github.com/apache/hudi/pull/2263/files#diff-8e3156ac499f22081cf2e13aa6b192eadbc2e8a8d82c627df7811414bc7a60cfR50]. 
> Fix this serialization and use it in RDDCustomColumnsSortPartitioner.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
