Prashant Wason created HUDI-1554:
------------------------------------

             Summary: Introduce buffering for streams in HUDI
                 Key: HUDI-1554
                 URL: https://issues.apache.org/jira/browse/HUDI-1554
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Prashant Wason
            Assignee: Prashant Wason


Input and Output streams created in HUDI through calls to 
HoodieWrapperFileSystem do not include any buffering unless the underlying file 
system implements buffering.

DistributedFileSystem (over HDFS) does not implement any buffering. This leads 
to a very large number of small-sized IO calls being sent to HDFS while 
performing HUDI IO operations like reading parquet, writing parquet, 
reading/writing log files, reading/writing instants, etc. 

This patch introduces buffering at the HoodieWrapperFileSystem level so that 
all types of reads and writes benefit from buffering.
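
To illustrate the idea, here is a minimal sketch using only java.io. It is not 
the code from the patch: the class BufferingSketch, the CountingOutputStream 
helper, and the 16-byte record and 4096-byte buffer sizes are all hypothetical, 
and a ByteArrayOutputStream stands in for the underlying HDFS stream. It counts 
how many write calls reach the underlying stream with and without a 
BufferedOutputStream wrapper.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

class BufferingSketch {

    /** Hypothetical helper: counts write calls that reach the wrapped stream. */
    static final class CountingOutputStream extends FilterOutputStream {
        long calls = 0;

        CountingOutputStream(OutputStream out) {
            super(out);
        }

        @Override
        public void write(int b) throws IOException {
            calls++;
            out.write(b);
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
            calls++;
            out.write(b, off, len);
        }
    }

    /**
     * Writes `records` small records and returns how many write calls reached
     * the underlying stream (the stand-in for the file system).
     */
    static long countUnderlyingWrites(boolean buffered, int records, int bufferSize) {
        CountingOutputStream sink = new CountingOutputStream(new ByteArrayOutputStream());
        // The wrapping step: callers write to `out`, the buffer batches small writes.
        OutputStream out = buffered ? new BufferedOutputStream(sink, bufferSize) : sink;
        byte[] record = new byte[16]; // assumed small record size
        try {
            for (int i = 0; i < records; i++) {
                out.write(record);
            }
            out.flush(); // push any bytes still held in the buffer
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.calls;
    }

    public static void main(String[] args) {
        System.out.println("unbuffered calls: " + countUnderlyingWrites(false, 1000, 4096));
        System.out.println("buffered calls:   " + countUnderlyingWrites(true, 1000, 4096));
    }
}
```

Without the wrapper, every small write becomes one call on the underlying 
stream. With it, calls are issued only when the buffer fills or is flushed. The 
same wrapping applies on the read side with BufferedInputStream.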

 

In my tests at scale on HDFS, writing 1 million records into a parquet file 
(read from an existing parquet file in the same dataset), I observed the 
following benefits:
 # About 40% reduction in total time to run the test
 # Total write calls to HDFS reduced from 19.1M -> 328
 # Total read calls reduced from 229M -> 515K

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
