[jira] [Updated] (HUDI-860) Ability to do small file handling without need for caching

sivabalan narayanan (Jira) Sun, 05 Jul 2020 17:01:20 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


sivabalan narayanan updated HUDI-860:
-------------------------------------
    Description: 
As of now, in the upsert path:
 * Hudi builds a workloadProfile to determine the total number of inserts and 
updates (with location info).
 * Small-file info is then populated.
 * Buckets are then populated from the above info.
 * These buckets are later used when getPartition(Object key) is invoked in 
UpsertPartitioner.
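
The four steps above can be sketched in plain Java. This is a simplified illustration, not Hudi's actual code: the class UpsertBucketSketch, its Bucket type, the method signatures, and the per-file "remaining capacity" input are all assumptions made for the example, and only a single partition path is modeled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the upsert bucketing flow (not Hudi's classes).
public class UpsertBucketSketch {

    /** One write target: an existing small file to top up, or a new file. */
    static final class Bucket {
        final String fileId;  // hypothetical file identifier
        final long inserts;   // number of inserts routed to this bucket
        Bucket(String fileId, long inserts) {
            this.fileId = fileId;
            this.inserts = inserts;
        }
    }

    /** Step 1 (workload profile): records with no known file location are inserts. */
    static long countInserts(Map<String, String> recordKeyToFileId) {
        long inserts = 0;
        for (String fileId : recordKeyToFileId.values()) {
            if (fileId == null) {
                inserts++;
            }
        }
        return inserts;
    }

    /**
     * Steps 2 and 3: fill the remaining capacity of each small file first,
     * then spill the rest into new files of at most recordsPerNewFile records.
     */
    static List<Bucket> assignInsertBuckets(long totalInserts,
                                            Map<String, Long> smallFileCapacity,
                                            long recordsPerNewFile) {
        if (recordsPerNewFile <= 0) {
            throw new IllegalArgumentException("recordsPerNewFile must be positive");
        }
        List<Bucket> buckets = new ArrayList<>();
        long remaining = totalInserts;
        for (Map.Entry<String, Long> smallFile : smallFileCapacity.entrySet()) {
            long take = Math.min(remaining, smallFile.getValue());
            if (take > 0) {
                buckets.add(new Bucket(smallFile.getKey(), take));
                remaining -= take;
            }
        }
        int newFileIndex = 0;
        while (remaining > 0) {
            long take = Math.min(remaining, recordsPerNewFile);
            buckets.add(new Bucket("new-file-" + newFileIndex++, take));
            remaining -= take;
        }
        return buckets;
    }

    /** Step 4: route an insert to a bucket, weighted by each bucket's share of inserts. */
    static int getPartition(String recordKey, List<Bucket> buckets) {
        long total = 0;
        for (Bucket b : buckets) {
            total += b.inserts;
        }
        if (total == 0) {
            throw new IllegalArgumentException("no insert buckets available");
        }
        long slot = Math.floorMod((long) recordKey.hashCode(), total);
        long cumulative = 0;
        for (int i = 0; i < buckets.size(); i++) {
            cumulative += buckets.get(i).inserts;
            if (slot < cumulative) {
                return i;
            }
        }
        throw new IllegalStateException("unreachable when total > 0");
    }

    public static void main(String[] args) {
        Map<String, Long> smallFiles = new LinkedHashMap<>();
        smallFiles.put("file-A", 2L); // room for 2 more records (assumed)
        for (Bucket b : assignInsertBuckets(5, smallFiles, 2)) {
            System.out.println(b.fileId + " <- " + b.inserts + " inserts");
        }
    }
}
```

The point of the sketch is that the bucket list depends on a global count of inserts, which is why the whole input has to be scanned once before any record can be routed.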

In step 1, to build the global workload profile, we have to run an action on 
the entire JavaRDD<HoodieRecord> from the driver, and Hudi saves the workload 
profile as well. 

For large, write-intensive batch jobs (COW types), this caching incurs 
additional overhead. This effort therefore explores whether the caching can be 
avoided by some means. 
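
The trade-off can be illustrated with a plain-Java analogue. This is not Spark or Hudi code: computeRecords() is a hypothetical stand-in for a lazily evaluated input, and the sketch assumes the usual reading that the input is cached so the write pass does not recompute it after the profiling pass.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical analogue of "profile pass + write pass" over a lazy input.
public class TwoPassSketch {

    // Counts how many times the upstream computation is evaluated.
    static final AtomicInteger evaluations = new AtomicInteger();

    /** Stand-in for a lazily computed input: every call recomputes the records. */
    static Stream<String> computeRecords() {
        evaluations.incrementAndGet();
        return Stream.of("k1", "k2", "k3");
    }

    /** Runs a profiling pass, then a write pass; returns the number of records written. */
    static int profileThenWrite(Supplier<Stream<String>> source) {
        long profiled = source.get().count();                              // pass 1: profile
        List<String> written = source.get().collect(Collectors.toList()); // pass 2: write
        if (profiled != written.size()) {
            throw new IllegalStateException("passes saw different inputs");
        }
        return written.size();
    }

    /** Without caching, each pass re-evaluates the input. */
    static int evaluationsWithoutCache() {
        evaluations.set(0);
        profileThenWrite(TwoPassSketch::computeRecords);
        return evaluations.get();
    }

    /** With caching, the input is evaluated once but must be held in memory. */
    static int evaluationsWithCache() {
        evaluations.set(0);
        List<String> cached = computeRecords().collect(Collectors.toList());
        profileThenWrite(cached::stream);
        return evaluations.get();
    }

    public static void main(String[] args) {
        System.out.println("without cache: " + evaluationsWithoutCache());
        System.out.println("with cache:    " + evaluationsWithCache());
    }
}
```

Caching trades memory (or disk) for avoiding the second evaluation; for very large batches that storage cost is the overhead this issue aims to remove.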

 

 

> Ability to do small file handling without need for caching
> ----------------------------------------------------------
>
>                 Key: HUDI-860
>                 URL: https://issues.apache.org/jira/browse/HUDI-860
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Writer Core
>            Reporter: Vinoth Chandar
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>             Fix For: 0.6.0
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
