[jira] [Commented] (HUDI-860) Ability to do small file handling without need for caching

Ethan Guo (Jira) Wed, 06 Oct 2021 17:00:09 -0700


    [ 
https://issues.apache.org/jira/browse/HUDI-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425284#comment-17425284
 ]


Ethan Guo commented on HUDI-860:
--------------------------------

Cool, I'll take a look.

> Ability to do small file handling without need for caching
> ----------------------------------------------------------
>
>                 Key: HUDI-860
>                 URL: https://issues.apache.org/jira/browse/HUDI-860
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Writer Core
>            Reporter: Vinoth Chandar
>            Assignee: Ethan Guo
>            Priority: Blocker
>             Fix For: 0.10.0
>
>
> As of now, in the upsert path:
>  * Hudi builds a workloadProfile to determine the total inserts and updates 
> (with location info).
>  * Following that, the small-file info is populated.
>  * Buckets are then populated using the above info.
>  * These buckets are later used when getPartition(Object key) is invoked in 
> UpsertPartitioner.
> In step 1, building the global workload profile requires running an action on 
> the entire JavaRDD<HoodieRecord> from the driver, and Hudi also saves the 
> workload profile. 
> For large, write-intensive batch jobs (COW types), caching this incurs 
> additional overhead. This effort explores whether that caching can be 
> avoided by some means. 
>  
>  
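
For readers unfamiliar with the flow, the four steps in the quoted description can be sketched roughly as below. This is a minimal, language-neutral illustration in Python, not Hudi's actual Java implementation: the Record class, the function names, the small_files mapping, and the records_per_file threshold are all hypothetical stand-ins, and the CRC-based routing of inserts is an assumed simplification. The point it shows is that step 1 needs a full pass over the input before any record can be routed, which is why the input is typically cached.

```python
import zlib
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    # Hypothetical stand-in for a tagged record.
    key: str
    partition_path: str
    file_id: Optional[str] = None  # known location => update; None => insert


def build_workload_profile(records):
    """Step 1: one full pass over the input to count inserts and updates."""
    profile = defaultdict(lambda: {"inserts": 0, "updates": defaultdict(int)})
    for r in records:
        stats = profile[r.partition_path]
        if r.file_id is None:
            stats["inserts"] += 1
        else:
            stats["updates"][r.file_id] += 1
    return profile


def assign_buckets(profile, small_files, records_per_file):
    """Steps 2-3: one bucket per updated file; inserts fill small files first.

    small_files maps partition_path -> {file_id: records already in the file}.
    """
    update_buckets = {}                 # (partition, file_id) -> bucket number
    insert_buckets = defaultdict(list)  # partition -> [(bucket, record count)]
    next_bucket = 0

    def bucket_for(partition, file_id):
        nonlocal next_bucket
        key = (partition, file_id)
        if key not in update_buckets:
            update_buckets[key] = next_bucket
            next_bucket += 1
        return update_buckets[key]

    for partition, stats in profile.items():
        for file_id in stats["updates"]:
            bucket_for(partition, file_id)
        remaining = stats["inserts"]
        # Top up existing small files before opening new ones.
        for file_id, size in small_files.get(partition, {}).items():
            room = min(records_per_file - size, remaining)
            if room > 0:
                bucket = bucket_for(partition, file_id)
                insert_buckets[partition].append((bucket, room))
                remaining -= room
        # Whatever is left goes to brand-new file buckets.
        while remaining > 0:
            take = min(records_per_file, remaining)
            insert_buckets[partition].append((next_bucket, take))
            next_bucket += 1
            remaining -= take
    return update_buckets, insert_buckets


def get_partition(record, update_buckets, insert_buckets):
    """Step 4: route one record to its bucket."""
    if record.file_id is not None:
        return update_buckets[(record.partition_path, record.file_id)]
    buckets = insert_buckets[record.partition_path]
    total = sum(count for _, count in buckets)
    # Stable hash so the same key always lands in the same bucket.
    slot = zlib.crc32(record.key.encode()) % total
    for bucket, count in buckets:
        if slot < count:
            return bucket
        slot -= count
```

Note that get_partition can only be called after build_workload_profile has consumed every record, so the input is read twice; avoiding that second read, or the cache that makes it cheap, is what this issue is about.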



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
