[ 
https://issues.apache.org/jira/browse/HUDI-5284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

scx updated HUDI-5284:
----------------------
    Description: 
When the input data comes from a complex RDD lineage, a Hudi write triggers 
repeated computation of that lineage.
For example, the write deduplicates the input data by key, and the tag-location 
step collects all partitions the data will be written to. Each of these steps 
evaluates the input RDD again, so I think we should cache the data to be 
written for downstream use.
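
To illustrate the recomputation, here is a minimal toy model in plain Python. 
It does not use Spark or Hudi. LazyDataset, write, run and the flag 
persist_input_before_write are hypothetical names made up for this sketch 
(the issue does not name the config key), and the two passes only stand in 
for key deduplication and partition collection.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Tuple[str, str, int]  # (record key, partition path, value)


class LazyDataset:
    """Toy stand-in for an RDD: recomputes its lineage on every action
    unless it has been marked as persisted."""

    def __init__(self, compute: Callable[[], List[Record]]) -> None:
        self._compute = compute
        self._persist = False
        self._cached: Optional[List[Record]] = None

    def persist(self) -> "LazyDataset":
        self._persist = True
        return self

    def collect(self) -> List[Record]:
        if self._cached is not None:
            return self._cached
        data = self._compute()  # walks the whole (expensive) lineage
        if self._persist:
            self._cached = data
        return data


def write(records: LazyDataset,
          persist_input_before_write: bool) -> Tuple[List[Record], List[str]]:
    """Hypothetical write path that needs two passes over the input."""
    if persist_input_before_write:
        records.persist()

    # Pass 1: deduplicate by record key, keeping the last record seen.
    latest: Dict[str, Record] = {}
    for record in records.collect():
        latest[record[0]] = record
    deduped = list(latest.values())

    # Pass 2: collect every partition that will be written to.
    partitions = sorted({partition for _, partition, _ in records.collect()})
    return deduped, partitions


def run(persist: bool) -> int:
    """Return how many times the upstream lineage was evaluated."""
    calls = 0

    def expensive_lineage() -> List[Record]:
        nonlocal calls
        calls += 1
        return [("k1", "p1", 1), ("k1", "p1", 2), ("k2", "p2", 3)]

    write(LazyDataset(expensive_lineage), persist)
    return calls


if __name__ == "__main__":
    print("lineage evaluations without persist:", run(False))
    print("lineage evaluations with persist:", run(True))
```

In this model the lineage is evaluated once per pass when the flag is off and 
only once when it is on, which is the saving the proposed config is meant to 
make available. Keeping it behind a flag lets callers who already persist 
their input skip the extra caching.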

> Add a new config that controls whether the input RDD should be persisted 
> before insert.
> --------------------------------------------------------------------------------
>
>                 Key: HUDI-5284
>                 URL: https://issues.apache.org/jira/browse/HUDI-5284
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: scx
>            Priority: Major
>
> When the input data comes from a complex RDD lineage, a Hudi write triggers 
> repeated computation of that lineage.
> For example, the write deduplicates the input data by key, and the 
> tag-location step collects all partitions the data will be written to. Each 
> of these steps evaluates the input RDD again, so I think we should cache the 
> data to be written for downstream use.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
