Prashant Wason created HUDI-2634:
------------------------------------

             Summary: Improve bootstrap performance for very large tables
                 Key: HUDI-2634
                 URL: https://issues.apache.org/jira/browse/HUDI-2634
             Project: Apache Hudi
          Issue Type: Sub-task
            Reporter: Prashant Wason
            Assignee: Prashant Wason


The existing bootstrap code lists all files in the dataset and caches a 
FileStatus object for each file found. A FileStatus object has many fields that 
consume memory, and most of these fields are never used later in the bootstrap.
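
One possible direction, shown only as a minimal sketch: copy the few fields 
that are needed out of each listing entry and drop the rest. Every name below 
(SlimFileListing, FullEntry, SlimEntry, toSlim) is hypothetical and is not 
Hudi or Hadoop API. FullEntry is a stand-in for a full listing entry, not the 
real FileStatus class, and the choice of path and length as the fields worth 
keeping is an assumption rather than something stated in this issue.

```java
import java.util.ArrayList;
import java.util.List;

public class SlimFileListing {

    /** Hypothetical stand-in for a full listing entry; NOT Hadoop's FileStatus. */
    static final class FullEntry {
        final String path;
        final long length;
        final long modificationTime;
        final String owner;
        final String group;

        FullEntry(String path, long length, long modificationTime,
                  String owner, String group) {
            this.path = path;
            this.length = length;
            this.modificationTime = modificationTime;
            this.owner = owner;
            this.group = group;
        }
    }

    /** Compact entry holding only the fields this sketch assumes are needed. */
    static final class SlimEntry {
        final String path;
        final long length;

        SlimEntry(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /**
     * Copies the assumed-needed fields into compact entries. Once the caller
     * drops its reference to the full listing, the full entries become
     * eligible for garbage collection and only the compact ones stay cached.
     */
    static List<SlimEntry> toSlim(List<FullEntry> listing) {
        List<SlimEntry> slim = new ArrayList<>(listing.size());
        for (FullEntry entry : listing) {
            slim.add(new SlimEntry(entry.path, entry.length));
        }
        return slim;
    }
}
```

Converting one partition's listing at a time, instead of the whole table's, 
would also keep the full entries short-lived. Whether that fits the bootstrap 
code path is not established here.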

For a very large production table, the bootstrap code fails with an OOM error 
and also leads to timeouts, as a very large number of executors are spawned.

The dataset has 1299 partitions and 12 million files.


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
