Prashant Wason created HUDI-1554:
------------------------------------
Summary: Introduce buffering for streams in HUDI
Key: HUDI-1554
URL: https://issues.apache.org/jira/browse/HUDI-1554
Project: Apache Hudi
Issue Type: Improvement
Reporter: Prashant Wason
Assignee: Prashant Wason
Input and output streams created in HUDI through calls to
HoodieWrapperFileSystem are not buffered unless the underlying file
system implements buffering.
DistributedFileSystem (over HDFS) does not implement any buffering. This leads
to a very large number of small IO calls being sent to HDFS while
performing HUDI IO operations such as reading parquet, writing parquet,
reading/writing log files, reading/writing instants, etc.
This patch introduces buffering at the HoodieWrapperFileSystem level so that
all types of reads and writes benefit from buffering.
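
To illustrate the effect, here is a minimal sketch using only java.io. It is
not the actual patch: BufferingSketch, CountingOutputStream and
countUnderlyingWrites are hypothetical names, and the counting stream is only a
stand-in for an unbuffered file system stream such as the one HDFS returns. It
counts how many write calls reach the underlying stream with and without a
BufferedOutputStream wrapper.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BufferingSketch {

    /** Hypothetical stand-in for an unbuffered stream: counts every call that reaches it. */
    static final class CountingOutputStream extends OutputStream {
        long calls = 0;
        long bytes = 0;

        @Override
        public void write(int b) {
            calls++;
            bytes++;
        }

        @Override
        public void write(byte[] buf, int off, int len) {
            calls++;
            bytes += len;
        }
    }

    /**
     * Writes numRecords records of recordSize bytes each and returns the number
     * of write calls seen by the underlying stream. A bufferSize of 0 means the
     * stream is used directly, without any buffering wrapper.
     */
    static long countUnderlyingWrites(int numRecords, int recordSize, int bufferSize)
            throws IOException {
        CountingOutputStream sink = new CountingOutputStream();
        OutputStream out = bufferSize > 0
                ? new BufferedOutputStream(sink, bufferSize)
                : sink;
        byte[] record = new byte[recordSize];
        for (int i = 0; i < numRecords; i++) {
            out.write(record);
        }
        out.close(); // flushes whatever is still held in the buffer
        return sink.calls;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("unbuffered calls: " + countUnderlyingWrites(1000, 16, 0));
        System.out.println("buffered calls:   " + countUnderlyingWrites(1000, 16, 4096));
    }
}
```

In this sketch, 1000 writes of 16 bytes reach the sink as 1000 calls when
unbuffered and as 4 calls with a 4 KB buffer. The buffer size is an arbitrary
choice for the illustration, not a value taken from the patch.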
In my tests at scale on HDFS, writing 1 million records into a parquet file
(read from an existing parquet file in the same dataset), I observed the
following benefits:
# About 40% reduction in total time to run the test
# Total write calls to HDFS reduced from 19.1M to 328
# Total read calls reduced from 229M to 515K
--
This message was sent by Atlassian Jira
(v8.3.4#803005)