Balajee Nagasubramaniam created HUDI-335:
--------------------------------------------
Summary: Improvements to DiskBasedMap
Key: HUDI-335
URL: https://issues.apache.org/jira/browse/HUDI-335
Project: Apache Hudi (incubating)
Issue Type: Improvement
Components: Common Core
Reporter: Balajee Nagasubramaniam
Fix For: 0.5.1
DiskBasedMap is used by ExternalSpillableMap to write (K, V) pairs to a file
while keeping only (K, fileMetadata) in memory, to reduce the memory footprint
of records kept on disk.
This change improves the performance of record get/read operations from disk
by using a BufferedInputStream to cache the data.
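To illustrate the idea, here is a minimal sketch, not the actual DiskBasedMap code. The class SpillFileSketch, its ValueMetadata type, and the methods spill and readAll are hypothetical names introduced for this example. It assumes values are appended back to back and read sequentially in write order, so a single BufferedInputStream can serve many small reads from its buffer; random-access gets would need a different approach.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a spill file with an in-memory (key -> file metadata) index. */
public class SpillFileSketch {

    /** Metadata kept in memory per key: where the value lives in the spill file. */
    public static final class ValueMetadata {
        public final long offset;
        public final int length;

        public ValueMetadata(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Appends each value to the file through a buffer and returns key -> (offset, length). */
    public static Map<String, ValueMetadata> spill(File file, Map<String, byte[]> records)
            throws IOException {
        Map<String, ValueMetadata> index = new LinkedHashMap<>();
        long offset = 0;
        try (BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(file))) {
            for (Map.Entry<String, byte[]> entry : records.entrySet()) {
                byte[] value = entry.getValue();
                out.write(value); // buffered: many small writes become few large ones
                index.put(entry.getKey(), new ValueMetadata(offset, value.length));
                offset += value.length;
            }
        }
        return index;
    }

    /**
     * Reads every value back in write order through one buffered stream.
     * Assumes the index iterates in the same order the values were written.
     */
    public static Map<String, byte[]> readAll(File file, Map<String, ValueMetadata> index)
            throws IOException {
        Map<String, byte[]> result = new LinkedHashMap<>();
        try (DataInputStream in =
                 new DataInputStream(new BufferedInputStream(new FileInputStream(file)))) {
            for (Map.Entry<String, ValueMetadata> entry : index.entrySet()) {
                byte[] value = new byte[entry.getValue().length];
                in.readFully(value); // served from the buffer when the data is already cached
                result.put(entry.getKey(), value);
            }
        }
        return result;
    }
}
```

Only the small ValueMetadata objects stay on the heap; the values themselves live in the file until they are read back.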
Results from the POC are promising. Before the write performance improvement,
spilling/writing 1 million records (record size ~350 bytes) to the file took
about 104 seconds.
After the improvement, the same operation completes in under 5 seconds.
Similarly, before the read performance improvement, reading 1 million records
(record size ~350 bytes) from the spill file took about 23 seconds. After the
improvement, the same operation completes in under 4 seconds.
{{Without read/write performance improvements (times in ms):
RecordsHandled: 10000 totalTestTime: 3145 writeTime: 1176 readTime: 255
RecordsHandled: 50000 totalTestTime: 5775 writeTime: 4187 readTime: 1175
RecordsHandled: 100000 totalTestTime: 10570 writeTime: 7718 readTime: 2203
RecordsHandled: 500000 totalTestTime: 59723 writeTime: 45618 readTime: 11093
RecordsHandled: 1000000 totalTestTime: 120022 writeTime: 87918 readTime: 22355
RecordsHandled: 2000000 totalTestTime: 258627 writeTime: 187185 readTime: 56431}}
{{With write improvement (times in ms):
RecordsHandled: 10000 totalTestTime: 2013 writeTime: 700 readTime: 503
RecordsHandled: 50000 totalTestTime: 2525 writeTime: 390 readTime: 1247
RecordsHandled: 100000 totalTestTime: 3583 writeTime: 464 readTime: 2352
RecordsHandled: 500000 totalTestTime: 22934 writeTime: 3731 readTime: 15778
RecordsHandled: 1000000 totalTestTime: 42415 writeTime: 4816 readTime: 30332
RecordsHandled: 2000000 totalTestTime: 74158 writeTime: 10192 readTime: 53195}}
{{With read improvements (times in ms):
RecordsHandled: 10000 totalTestTime: 2473 writeTime: 1562 readTime: 87
RecordsHandled: 50000 totalTestTime: 6169 writeTime: 5151 readTime: 438
RecordsHandled: 100000 totalTestTime: 9967 writeTime: 8636 readTime: 252
RecordsHandled: 500000 totalTestTime: 50889 writeTime: 46766 readTime: 1014
RecordsHandled: 1000000 totalTestTime: 114482 writeTime: 104353 readTime: 3776
RecordsHandled: 2000000 totalTestTime: 239251 writeTime: 219041 readTime: 8127}}
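The speedup and per-record cost implied by the 1,000,000-record rows above can be worked out with a few lines of arithmetic. The sketch below assumes the logged times are in milliseconds (consistent with the ~104 second and ~23 second figures quoted in the description); SpeedupSketch and its methods are hypothetical helpers, not part of Hudi.

```java
/** Hypothetical helper for comparing the logged timings; assumes times are in milliseconds. */
public class SpeedupSketch {

    /** Ratio of the time before a change to the time after it. */
    public static double speedup(long beforeMs, long afterMs) {
        return (double) beforeMs / afterMs;
    }

    /** Average cost of handling one record, in microseconds. */
    public static double perRecordMicros(long totalMs, long records) {
        return totalMs * 1000.0 / records;
    }

    public static void main(String[] args) {
        long records = 1_000_000L;

        // Baseline run versus the run with the write improvement (writeTime column).
        System.out.printf("write speedup: %.2fx%n", speedup(87918, 4816));
        System.out.printf("write cost before: %.2f us/record%n", perRecordMicros(87918, records));
        System.out.printf("write cost after:  %.2f us/record%n", perRecordMicros(4816, records));

        // Baseline run versus the run with the read improvement (readTime column).
        System.out.printf("read speedup: %.2fx%n", speedup(22355, 3776));
        System.out.printf("read cost before: %.2f us/record%n", perRecordMicros(22355, records));
        System.out.printf("read cost after:  %.2f us/record%n", perRecordMicros(3776, records));
    }
}
```

Note that each improved run changes only one path, so writeTime in the read-improvement run and readTime in the write-improvement run remain at roughly baseline levels.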