[ https://issues.apache.org/jira/browse/HUDI-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-335:
--------------------------------
Fix Version/s: (was: 0.5.1)
0.5.2
> Improvements to DiskBasedMap
> ----------------------------
>
> Key: HUDI-335
> URL: https://issues.apache.org/jira/browse/HUDI-335
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Common Core
> Reporter: Balajee Nagasubramaniam
> Priority: Major
> Labels: Hoodie, pull-request-available
> Fix For: 0.5.2
>
> Attachments: Screen Shot 2019-11-11 at 1.22.44 PM.png, Screen Shot
> 2019-11-13 at 2.56.53 PM.png
>
> Original Estimate: 504h
> Time Spent: 20m
> Remaining Estimate: 503h 40m
>
> DiskBasedMap is used by ExternalSpillableMap to write each (K, V) pair to a
> file while keeping only (K, fileMetadata) in memory, to reduce the memory
> footprint of records spilled to disk.
> This change improves the performance of record get/read operations from
> disk by using a BufferedInputStream to cache the data.
> Results from the POC are promising. Before the write performance improvement,
> spilling/writing 1 million records (record size ~350 bytes) to the file took
> about 104 seconds.
> After the improvement, the same operation completes in under 5 seconds.
> Similarly, before the read performance improvement, reading 1 million records
> (record size ~350 bytes) from the spill file took about 23 seconds. After the
> improvement, the same operation completes in under 4 seconds.
> {{Without read/write performance improvements:
> RecordsHandled: 10000 totalTestTime: 3145 writeTime: 1176 readTime: 255
> RecordsHandled: 50000 totalTestTime: 5775 writeTime: 4187 readTime: 1175
> RecordsHandled: 100000 totalTestTime: 10570 writeTime: 7718 readTime: 2203
> RecordsHandled: 500000 totalTestTime: 59723 writeTime: 45618 readTime: 11093
> RecordsHandled: 1000000 totalTestTime: 120022 writeTime: 87918 readTime: 22355
> RecordsHandled: 2000000 totalTestTime: 258627 writeTime: 187185 readTime: 56431}}
> {{With write improvement:
> RecordsHandled: 10000 totalTestTime: 2013 writeTime: 700 readTime: 503
> RecordsHandled: 50000 totalTestTime: 2525 writeTime: 390 readTime: 1247
> RecordsHandled: 100000 totalTestTime: 3583 writeTime: 464 readTime: 2352
> RecordsHandled: 500000 totalTestTime: 22934 writeTime: 3731 readTime: 15778
> RecordsHandled: 1000000 totalTestTime: 42415 writeTime: 4816 readTime: 30332
> RecordsHandled: 2000000 totalTestTime: 74158 writeTime: 10192 readTime: 53195}}
> {{With read improvements:
> RecordsHandled: 10000 totalTestTime: 2473 writeTime: 1562 readTime: 87
> RecordsHandled: 50000 totalTestTime: 6169 writeTime: 5151 readTime: 438
> RecordsHandled: 100000 totalTestTime: 9967 writeTime: 8636 readTime: 252
> RecordsHandled: 500000 totalTestTime: 50889 writeTime: 46766 readTime: 1014
> RecordsHandled: 1000000 totalTestTime: 114482 writeTime: 104353 readTime: 3776
> RecordsHandled: 2000000 totalTestTime: 239251 writeTime: 219041 readTime: 8127}}
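
For readers unfamiliar with the pattern the issue describes, here is a minimal sketch of a spill file whose values live on disk while only (key, file metadata) stays in memory, read back through a BufferedInputStream. It is not the Hudi implementation: the class SpillFileSketch, the ValueMetadata type, and both method names are hypothetical, the use of a BufferedOutputStream on the write path is an assumption (the issue does not say how writes were improved), and the read is a sequential scan in write order rather than a random-access get.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Hudi's DiskBasedMap.
final class SpillFileSketch {

    /** Hypothetical stand-in for the per-record file metadata kept in memory. */
    static final class ValueMetadata {
        final long offset; // byte position of the value in the spill file
        final int length;  // size of the value in bytes

        ValueMetadata(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    /** Appends every value to the file and returns the in-memory (key -> metadata) index. */
    static Map<String, ValueMetadata> writeRecords(File file, Map<String, String> records)
            throws IOException {
        Map<String, ValueMetadata> index = new LinkedHashMap<>();
        long offset = 0;
        // Assumption: buffering the output batches many small writes into fewer system calls.
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            for (Map.Entry<String, String> e : records.entrySet()) {
                byte[] value = e.getValue().getBytes(StandardCharsets.UTF_8);
                out.write(value);
                index.put(e.getKey(), new ValueMetadata(offset, value.length));
                offset += value.length;
            }
        }
        return index;
    }

    /** Reads all values back in write order through a BufferedInputStream. */
    static Map<String, String> readRecords(File file, Map<String, ValueMetadata> index)
            throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        long position = 0;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            for (Map.Entry<String, ValueMetadata> e : index.entrySet()) {
                ValueMetadata meta = e.getValue();
                // This sketch only supports a contiguous scan in write order.
                if (meta.offset != position) {
                    throw new IllegalStateException("index is not in write order");
                }
                byte[] buf = new byte[meta.length];
                in.readFully(buf); // served from the stream's buffer when possible
                position += meta.length;
                result.put(e.getKey(), new String(buf, StandardCharsets.UTF_8));
            }
        }
        return result;
    }
}
```

A caller would pass an insertion-ordered map (such as LinkedHashMap) to writeRecords and hand the returned index to readRecords. Supporting arbitrary get(key) lookups would additionally require seeking to the stored offset, which this sketch leaves out.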
--
This message was sent by Atlassian Jira
(v8.3.4#803005)