[ https://issues.apache.org/jira/browse/HUDI-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-2634:
---------------------------------
Priority: Blocker (was: Major)
> Improve bootstrap performance for very large tables
> ---------------------------------------------------
>
> Key: HUDI-2634
> URL: https://issues.apache.org/jira/browse/HUDI-2634
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: Prashant Wason
> Assignee: Prashant Wason
> Priority: Blocker
> Labels: pull-request-available
>
> The existing bootstrap code lists all files in the dataset and caches a
> FileStatus object for each file found. Each FileStatus object has many
> fields that consume memory, and most of these fields are never used later
> in bootstrap.
> For a very large production table, the bootstrap code fails with an OOM
> error and also times out because a very large number of executors are spawned.
> The dataset has 1299 partitions and 12 million files.
>
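The memory concern described above can be illustrated with a minimal sketch. It is not Hudi code: FullFileStatus is a hypothetical stand-in for a heavyweight listing record, SlimFileInfo and slim() are invented names, and the choice of path and length as the only fields needed later is an assumption, since the issue does not say which fields bootstrap actually uses.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, NamedTuple


@dataclass
class FullFileStatus:
    """Hypothetical stand-in for a heavyweight per-file listing record."""
    path: str
    length: int
    is_dir: bool
    replication: int
    block_size: int
    modification_time: int
    access_time: int
    owner: str
    group: str
    permission: str


class SlimFileInfo(NamedTuple):
    """Assumed minimal subset of fields needed after listing."""
    path: str
    length: int


def slim(statuses: Iterable[FullFileStatus]) -> Iterator[SlimFileInfo]:
    """Lazily project each listing entry to (path, length), skipping directories."""
    for status in statuses:
        if not status.is_dir:
            yield SlimFileInfo(status.path, status.length)
```

Caching list(slim(listing)) instead of the full listing keeps two fields per file rather than ten. With roughly 12 million files, any per-file saving is multiplied 12 million times, which is why trimming unused fields matters at this scale.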
--
This message was sent by Atlassian Jira
(v8.3.4#803005)