[ https://issues.apache.org/jira/browse/HUDI-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16975969#comment-16975969 ]
leesf commented on HUDI-335:
----------------------------

Looks promising. Would you please send a PR? [~balajeeUber]

> Improvements to DiskBasedMap
> ----------------------------
>
>                 Key: HUDI-335
>                 URL: https://issues.apache.org/jira/browse/HUDI-335
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Common Core
>            Reporter: Balajee Nagasubramaniam
>            Priority: Major
>              Labels: Hoodie
>             Fix For: 0.5.1
>
>         Attachments: Screen Shot 2019-11-11 at 1.22.44 PM.png, Screen Shot 2019-11-13 at 2.56.53 PM.png
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> DiskBasedMap is used by ExternalSpillableMap for writing (K, V) pairs to a file,
> keeping the (K, fileMetadata) in memory, to reduce the footprint of the
> record on disk.
> This change improves the performance of the record get/read operation from
> disk by using a BufferedInputStream to cache the data.
> Results from the POC are promising. Before the write performance improvement,
> spilling/writing 1 million records (record size ~350 bytes) to the file took
> about 104 seconds. After the improvement, the same operation can be performed
> in under 5 seconds.
> Similarly, before the read performance improvement, reading 1 million records
> (record size ~350 bytes) from the spill file took about 23 seconds. After the
> improvement, the same operation can be performed in under 4 seconds.
> {{Without read/write performance improvements:
> RecordsHandled:   10000  totalTestTime:   3145  writeTime:   1176  readTime:   255
> RecordsHandled:   50000  totalTestTime:   5775  writeTime:   4187  readTime:  1175
> RecordsHandled:  100000  totalTestTime:  10570  writeTime:   7718  readTime:  2203
> RecordsHandled:  500000  totalTestTime:  59723  writeTime:  45618  readTime: 11093
> RecordsHandled: 1000000  totalTestTime: 120022  writeTime:  87918  readTime: 22355
> RecordsHandled: 2000000  totalTestTime: 258627  writeTime: 187185  readTime: 56431}}
>
> {{With write improvement:
> RecordsHandled:   10000  totalTestTime:   2013  writeTime:    700  readTime:   503
> RecordsHandled:   50000  totalTestTime:   2525  writeTime:    390  readTime:  1247
> RecordsHandled:  100000  totalTestTime:   3583  writeTime:    464  readTime:  2352
> RecordsHandled:  500000  totalTestTime:  22934  writeTime:   3731  readTime: 15778
> RecordsHandled: 1000000  totalTestTime:  42415  writeTime:   4816  readTime: 30332
> RecordsHandled: 2000000  totalTestTime:  74158  writeTime:  10192  readTime: 53195}}
>
> {{With read improvements:
> RecordsHandled:   10000  totalTestTime:   2473  writeTime:   1562  readTime:    87
> RecordsHandled:   50000  totalTestTime:   6169  writeTime:   5151  readTime:   438
> RecordsHandled:  100000  totalTestTime:   9967  writeTime:   8636  readTime:   252
> RecordsHandled:  500000  totalTestTime:  50889  writeTime:  46766  readTime:  1014
> RecordsHandled: 1000000  totalTestTime: 114482  writeTime: 104353  readTime:  3776
> RecordsHandled: 2000000  totalTestTime: 239251  writeTime: 219041  readTime:  8127}}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
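
For readers who want to see the buffering idea from the issue description in isolation, here is a minimal sketch. It is not the Hudi DiskBasedMap code: the class name SpillFileSketch, the length-prefixed record layout, the 64 KB buffer size, and the use of BufferedOutputStream on the write path are all assumptions made for illustration. The issue itself only names BufferedInputStream for the read path. The sketch writes records sequentially to a temporary file and reads them back through a buffered stream, so that most per-record reads are served from memory rather than from individual file system calls.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the actual Hudi implementation.
public class SpillFileSketch {
    // Assumed buffer size, chosen arbitrarily for the sketch.
    private static final int BUFFER_SIZE = 64 * 1024;

    // Writes each record as a 4-byte length followed by the payload.
    static void writeRecords(Path file, List<byte[]> records) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file), BUFFER_SIZE))) {
            for (byte[] record : records) {
                out.writeInt(record.length);
                out.write(record);
            }
        }
    }

    // Reads length-prefixed records back through a BufferedInputStream.
    static List<byte[]> readRecords(Path file) throws IOException {
        List<byte[]> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file), BUFFER_SIZE))) {
            while (true) {
                int length;
                try {
                    length = in.readInt();
                } catch (EOFException endOfFile) {
                    break; // no more records
                }
                byte[] record = new byte[length];
                in.readFully(record);
                records.add(record);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("spill-sketch", ".bin");
        try {
            List<byte[]> records = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                records.add(("record-" + i).getBytes(StandardCharsets.UTF_8));
            }
            writeRecords(file, records);
            List<byte[]> readBack = readRecords(file);
            System.out.println("records read back: " + readBack.size());
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The sketch reads the file front to back. A map that looks records up by stored offset would also need to seek, which this example does not attempt.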