[ 
https://issues.apache.org/jira/browse/HUDI-335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16975969#comment-16975969
 ] 

leesf commented on HUDI-335:
----------------------------

Looks promising. Would you please send a PR? [~balajeeUber]

> Improvements to DiskBasedMap
> ----------------------------
>
>                 Key: HUDI-335
>                 URL: https://issues.apache.org/jira/browse/HUDI-335
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Common Core
>            Reporter: Balajee Nagasubramaniam
>            Priority: Major
>              Labels: Hoodie
>             Fix For: 0.5.1
>
>         Attachments: Screen Shot 2019-11-11 at 1.22.44 PM.png, Screen Shot 
> 2019-11-13 at 2.56.53 PM.png
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> DiskBasedMap is used by ExternalSpillableMap to write each (K, V) pair to a 
> file while keeping only (K, fileMetadata) in memory, which reduces the memory 
> footprint by holding the record itself on disk.
> This change improves the performance of the record get/read operation from 
> disk by using a BufferedInputStream to cache the data.
> Results from the POC are promising. Before the write performance improvement, 
> spilling/writing 1 million records (record size ~350 bytes) to the file took 
> about 104 seconds. After the improvement, the same operation can be performed 
> in under 5 seconds.
> Similarly, before the read performance improvement, reading 1 million records 
> (record size ~350 bytes) from the spill file took about 23 seconds. After the 
> improvement, the same operation can be performed in under 4 seconds.
> {{Without read/write performance improvements:
> RecordsHandled: 10000    totalTestTime: 3145    writeTime: 1176    readTime: 255
> RecordsHandled: 50000    totalTestTime: 5775    writeTime: 4187    readTime: 1175
> RecordsHandled: 100000   totalTestTime: 10570   writeTime: 7718    readTime: 2203
> RecordsHandled: 500000   totalTestTime: 59723   writeTime: 45618   readTime: 11093
> RecordsHandled: 1000000  totalTestTime: 120022  writeTime: 87918   readTime: 22355
> RecordsHandled: 2000000  totalTestTime: 258627  writeTime: 187185  readTime: 56431}}
> {{With write improvement:
> RecordsHandled: 10000    totalTestTime: 2013    writeTime: 700     readTime: 503
> RecordsHandled: 50000    totalTestTime: 2525    writeTime: 390     readTime: 1247
> RecordsHandled: 100000   totalTestTime: 3583    writeTime: 464     readTime: 2352
> RecordsHandled: 500000   totalTestTime: 22934   writeTime: 3731    readTime: 15778
> RecordsHandled: 1000000  totalTestTime: 42415   writeTime: 4816    readTime: 30332
> RecordsHandled: 2000000  totalTestTime: 74158   writeTime: 10192   readTime: 53195}}
> {{With read improvements:
> RecordsHandled: 10000    totalTestTime: 2473    writeTime: 1562    readTime: 87
> RecordsHandled: 50000    totalTestTime: 6169    writeTime: 5151    readTime: 438
> RecordsHandled: 100000   totalTestTime: 9967    writeTime: 8636    readTime: 252
> RecordsHandled: 500000   totalTestTime: 50889   writeTime: 46766   readTime: 1014
> RecordsHandled: 1000000  totalTestTime: 114482  writeTime: 104353  readTime: 3776
> RecordsHandled: 2000000  totalTestTime: 239251  writeTime: 219041  readTime: 8127}}
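
For readers unfamiliar with the spill-to-disk idea in the quoted description, here is a minimal sketch of it, not Hudi's actual DiskBasedMap. The class name SpillFileSketch, the Entry type, the String key/value types, and the 64 KB buffer size are all assumptions made for illustration. Values are appended to a file through a BufferedOutputStream, and only (key, offset, length) is kept in memory. The read path here uses a plain RandomAccessFile seek for simplicity; it does not reproduce the buffered-read change proposed in the issue.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; names and layout are not taken from Hudi. */
final class SpillFileSketch implements AutoCloseable {

    /** In-memory metadata per key: where the value lives in the spill file. */
    private static final class Entry {
        final long offset;
        final int length;

        Entry(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    private final Map<String, Entry> index = new HashMap<>();
    private final File file;
    private final OutputStream out;
    private RandomAccessFile reader; // opened lazily on the first get()
    private long writePosition = 0;

    SpillFileSketch(File file) throws IOException {
        this.file = file;
        // Buffering batches many small record writes into fewer disk writes.
        this.out = new BufferedOutputStream(new FileOutputStream(file), 64 * 1024);
    }

    /** Appends the value to the file and remembers only its location. */
    void put(String key, String value) throws IOException {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        out.write(bytes);
        index.put(key, new Entry(writePosition, bytes.length));
        writePosition += bytes.length;
    }

    /** Reads the value back from disk, or returns null for an unknown key. */
    String get(String key) throws IOException {
        Entry entry = index.get(key);
        if (entry == null) {
            return null;
        }
        out.flush(); // make buffered writes visible to the reader
        if (reader == null) {
            reader = new RandomAccessFile(file, "r");
        }
        byte[] bytes = new byte[entry.length];
        reader.seek(entry.offset);
        reader.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    int size() {
        return index.size();
    }

    @Override
    public void close() throws IOException {
        out.close();
        if (reader != null) {
            reader.close();
        }
    }
}
```

In this sketch, writing the same key twice appends a second copy and points the index at the newest one; the older bytes stay in the file as dead space, which a real implementation would have to account for.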



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
