[
https://issues.apache.org/jira/browse/HUDI-860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425284#comment-17425284
]
Ethan Guo commented on HUDI-860:
--------------------------------
Cool, I'll take a look.
> Ability to do small file handling without need for caching
> ----------------------------------------------------------
>
> Key: HUDI-860
> URL: https://issues.apache.org/jira/browse/HUDI-860
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Writer Core
> Reporter: Vinoth Chandar
> Assignee: Ethan Guo
> Priority: Blocker
> Fix For: 0.10.0
>
>
> As of now, in the upsert path:
> * Hudi builds a workloadProfile to determine the total inserts and updates
> (with location info).
> * Small-file info is then populated.
> * Buckets are then populated from the above info.
> * These buckets are later used when getPartition(Object key) is invoked in
> UpsertPartitioner.
> In step 1, to build the global workload profile, we have to run an action on
> the entire JavaRDD<HoodieRecord> from the driver, and Hudi saves the workload
> profile as well.
> For large, write-intensive batch jobs (COW types), caching this incurs
> additional overhead. This effort is to see whether we can avoid doing so
> by some means.
>
>
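The four steps quoted above can be sketched roughly as follows. This is a minimal, Spark-free illustration only: every class and method name here (UpsertBucketSketch, Rec, WorkloadProfile, Buckets, buildProfile, buildBuckets) is hypothetical and is not Hudi's actual code, the small-file capacities are supplied by the caller as an assumption, and inserts are spread over insert buckets by a plain hash rather than by whatever weighting the real UpsertPartitioner applies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified model of the quoted upsert path. None of these
// classes are Hudi's real ones; only the order of the steps follows the issue.
class UpsertBucketSketch {

    // An incoming record: fileId is the located file for an update, null for an insert.
    static final class Rec {
        final String key;
        final String fileId;
        Rec(String key, String fileId) { this.key = key; this.fileId = fileId; }
    }

    // Step 1 output: total inserts plus update counts per existing file.
    static final class WorkloadProfile {
        long inserts = 0;
        final Map<String, Long> updatesPerFile = new TreeMap<>();
    }

    // Steps 2-3 output: bucket numbers for updated files and for insert targets.
    static final class Buckets {
        final Map<String, Integer> updateBucketByFile = new TreeMap<>();
        final List<Integer> insertBuckets = new ArrayList<>();
        int total = 0;
    }

    // Step 1: a full pass over the input. On an RDD this would be an action,
    // which is the step the issue associates with caching overhead.
    static WorkloadProfile buildProfile(List<Rec> records) {
        WorkloadProfile p = new WorkloadProfile();
        for (Rec r : records) {
            if (r.fileId == null) {
                p.inserts++;
            } else {
                p.updatesPerFile.merge(r.fileId, 1L, Long::sum);
            }
        }
        return p;
    }

    // Steps 2-3: one bucket per updated file; inserts first use the spare
    // capacity of small files (assumed input), then spill into new-file buckets.
    static Buckets buildBuckets(WorkloadProfile p,
                                Map<String, Long> smallFileCapacity,
                                long recordsPerNewFile) {
        if (recordsPerNewFile <= 0) {
            throw new IllegalArgumentException("recordsPerNewFile must be positive");
        }
        Buckets b = new Buckets();
        for (String fileId : p.updatesPerFile.keySet()) {
            b.updateBucketByFile.put(fileId, b.total++);
        }
        long remaining = p.inserts;
        for (long capacity : new TreeMap<>(smallFileCapacity).values()) {
            if (remaining <= 0) break;
            b.insertBuckets.add(b.total++);
            remaining -= capacity;
        }
        while (remaining > 0) {
            b.insertBuckets.add(b.total++);
            remaining -= recordsPerNewFile;
        }
        return b;
    }

    // Step 4: route a record to a bucket. Updates go to their file's bucket;
    // inserts are hashed over the insert buckets (unweighted, a simplification).
    static int getPartition(Rec r, Buckets b) {
        if (r.fileId != null) {
            return b.updateBucketByFile.get(r.fileId);
        }
        int slot = Math.floorMod(r.key.hashCode(), b.insertBuckets.size());
        return b.insertBuckets.get(slot);
    }
}
```

Note that buildProfile has to see every record before a single bucket exists, and getPartition then needs the same records again. That double read is the dependency this issue is asking to remove.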