Prashant Wason created HUDI-2634:
------------------------------------
Summary: Improve bootstrap performance for very large tables
Key: HUDI-2634
URL: https://issues.apache.org/jira/browse/HUDI-2634
Project: Apache Hudi
Issue Type: Sub-task
Reporter: Prashant Wason
Assignee: Prashant Wason
The existing bootstrap code lists all files in the dataset and caches a FileStatus
object for each file found. A FileStatus object has many fields, all of which consume
memory, and most of them are never used later in the bootstrap.
For a very large production table, the bootstrap fails with an OOM error and also
times out because a very large number of executors are spawned.
The dataset has 1,299 partitions and 12 million files.
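
One possible direction, sketched below, is to keep only the fields the bootstrap actually reads instead of caching the full listing entry. This is an illustration under stated assumptions, not the Hudi implementation: FullStatus is a hypothetical stand-in for a FileStatus-like object, FileRef and compact() are invented names, and the choice of path and length as the "needed" fields is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these class or method names come from Hudi.
public class CompactFileListing {

    /** Stand-in for a full listing entry with many rarely used fields. */
    static final class FullStatus {
        final String path;
        final long length;
        final long modificationTime;
        final long accessTime;
        final long blockSize;
        final short replication;
        final String owner;
        final String group;

        FullStatus(String path, long length, long modificationTime, long accessTime,
                   long blockSize, short replication, String owner, String group) {
            this.path = path;
            this.length = length;
            this.modificationTime = modificationTime;
            this.accessTime = accessTime;
            this.blockSize = blockSize;
            this.replication = replication;
            this.owner = owner;
            this.group = group;
        }
    }

    /** Compact entry holding only the fields assumed to be needed later. */
    static final class FileRef {
        final String path;
        final long length;

        FileRef(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Copies the needed fields so the full entries can be garbage collected. */
    static List<FileRef> compact(List<FullStatus> listing) {
        List<FileRef> refs = new ArrayList<>(listing.size());
        for (FullStatus status : listing) {
            refs.add(new FileRef(status.path, status.length));
        }
        return refs;
    }

    public static void main(String[] args) {
        List<FullStatus> listing = new ArrayList<>();
        listing.add(new FullStatus("/data/p=1/a.parquet", 1024L, 0L, 0L,
                134217728L, (short) 3, "hudi", "hudi"));
        List<FileRef> refs = compact(listing);
        System.out.println(refs.get(0).path + " " + refs.get(0).length);
    }
}
```

With 12 million files, every field dropped from the cached entry is saved 12 million times, so even a few strings per entry add up. Processing the listing one partition at a time rather than caching it whole would be a complementary option.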