Balajee Nagasubramaniam created HUDI-335:
--------------------------------------------

             Summary: Improvements to DiskBasedMap
                 Key: HUDI-335
                 URL: https://issues.apache.org/jira/browse/HUDI-335
             Project: Apache Hudi (incubating)
          Issue Type: Improvement
          Components: Common Core
            Reporter: Balajee Nagasubramaniam
             Fix For: 0.5.1


DiskBasedMap is used by ExternalSpillableMap to write (K, V) pairs to a file
while keeping only (K, fileMetadata) in memory, which reduces the in-memory
footprint of each record.
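The idea can be sketched as follows. This is a minimal illustration and not Hudi's actual DiskBasedMap: the class name TinyDiskMap, the String key/value types, and the (offset, length) metadata layout are all assumptions made for the example.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not Hudi's DiskBasedMap.
class TinyDiskMap implements AutoCloseable {
    // Per-key metadata kept in memory: where the value lives in the spill file.
    private static final class Entry {
        final long offset;
        final int length;
        Entry(long offset, int length) { this.offset = offset; this.length = length; }
    }

    private final Map<String, Entry> index = new HashMap<>();
    private final RandomAccessFile file;
    private long writePosition = 0;

    TinyDiskMap(File spillFile) {
        RandomAccessFile raf;
        try {
            raf = new RandomAccessFile(spillFile, "rw");
            raf.setLength(0); // start from an empty spill file
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        this.file = raf;
    }

    // Convenience factory that spills to a temp file (assumption for the demo).
    static TinyDiskMap onTempFile() {
        try {
            File f = File.createTempFile("spill", ".data");
            f.deleteOnExit();
            return new TinyDiskMap(f);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Appends the value bytes to the file; only (offset, length) stays in memory.
    // Re-putting a key appends a new copy and leaves the old bytes unreferenced.
    void put(String key, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        try {
            file.seek(writePosition);
            file.write(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        index.put(key, new Entry(writePosition, bytes.length));
        writePosition += bytes.length;
    }

    // Seeks to the recorded offset and reads the value back from disk.
    String get(String key) {
        Entry e = index.get(key);
        if (e == null) {
            return null;
        }
        byte[] bytes = new byte[e.length];
        try {
            file.seek(e.offset);
            file.readFully(bytes);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override
    public void close() {
        try {
            file.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        try (TinyDiskMap map = TinyDiskMap.onTempFile()) {
            map.put("k1", "first record");
            map.put("k2", "second record");
            System.out.println(map.get("k2"));
        }
    }
}
```

Each in-memory entry costs a key plus a long and an int, regardless of how large the value on disk is.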

This change improves the performance of record get/read operations from disk
by using a BufferedInputStream to cache the data.
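To illustrate why buffering helps, here is a rough sketch that reads length-prefixed records back from a file, once directly from a FileInputStream and once through a BufferedInputStream. The class BufferedReadSketch, the record layout, and the 64 KB buffer size are assumptions for the example, not the code in the patch.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical illustration only; not the code from the HUDI-335 patch.
class BufferedReadSketch {
    static final int RECORD_SIZE = 350; // roughly the record size quoted in the issue

    // Writes `count` records as [int length][payload] and returns the file.
    // The output is buffered only to keep the setup step quick.
    static File writeRecords(int count) {
        try {
            File f = File.createTempFile("spill-read", ".data");
            f.deleteOnExit();
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream(f)))) {
                byte[] payload = new byte[RECORD_SIZE];
                for (int i = 0; i < count; i++) {
                    Arrays.fill(payload, (byte) i);
                    out.writeInt(payload.length);
                    out.write(payload);
                }
            }
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads every record and returns the sum of each payload's first byte
    // as a cheap checksum. With buffered == false, each readInt() issues
    // four single-byte reads against the underlying file.
    static long readAll(File f, boolean buffered) {
        try (InputStream raw = new FileInputStream(f);
             DataInputStream in = new DataInputStream(
                     buffered ? new BufferedInputStream(raw, 64 * 1024) : raw)) {
            long checksum = 0;
            long remaining = f.length();
            while (remaining > 0) {
                int len = in.readInt();
                byte[] payload = new byte[len];
                in.readFully(payload);
                checksum += payload[0];
                remaining -= 4 + len;
            }
            return checksum;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File f = writeRecords(100_000);
        for (boolean buffered : new boolean[] {false, true}) {
            long start = System.nanoTime();
            long checksum = readAll(f, buffered);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("buffered=" + buffered
                    + " checksum=" + checksum + " readTimeMs=" + elapsedMs);
        }
    }
}
```

Both paths return the same data; the buffered path simply replaces many small reads against the file with fewer, larger ones. Actual timings depend on the machine and file system, so none are shown here.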

Results from the POC are promising. Before the write performance improvement,
spilling/writing 1 million records (record size ~350 bytes) to the file took
about 104 seconds. After the improvement, the same operation can be performed
in under 5 seconds.

Similarly, before the read performance improvement, reading 1 million records
(record size ~350 bytes) from the spill file took about 23 seconds. After the
improvement, the same operation can be performed in under 4 seconds.

{{Without read/write performance improvements (times in ms):
RecordsHandled: 10000    totalTestTime: 3145    writeTime: 1176    readTime: 255
RecordsHandled: 50000    totalTestTime: 5775    writeTime: 4187    readTime: 1175
RecordsHandled: 100000   totalTestTime: 10570   writeTime: 7718    readTime: 2203
RecordsHandled: 500000   totalTestTime: 59723   writeTime: 45618   readTime: 11093
RecordsHandled: 1000000  totalTestTime: 120022  writeTime: 87918   readTime: 22355
RecordsHandled: 2000000  totalTestTime: 258627  writeTime: 187185  readTime: 56431}}

{{With write improvement (times in ms):
RecordsHandled: 10000    totalTestTime: 2013    writeTime: 700     readTime: 503
RecordsHandled: 50000    totalTestTime: 2525    writeTime: 390     readTime: 1247
RecordsHandled: 100000   totalTestTime: 3583    writeTime: 464     readTime: 2352
RecordsHandled: 500000   totalTestTime: 22934   writeTime: 3731    readTime: 15778
RecordsHandled: 1000000  totalTestTime: 42415   writeTime: 4816    readTime: 30332
RecordsHandled: 2000000  totalTestTime: 74158   writeTime: 10192   readTime: 53195}}

{{With read improvements (times in ms):
RecordsHandled: 10000    totalTestTime: 2473    writeTime: 1562    readTime: 87
RecordsHandled: 50000    totalTestTime: 6169    writeTime: 5151    readTime: 438
RecordsHandled: 100000   totalTestTime: 9967    writeTime: 8636    readTime: 252
RecordsHandled: 500000   totalTestTime: 50889   writeTime: 46766   readTime: 1014
RecordsHandled: 1000000  totalTestTime: 114482  writeTime: 104353  readTime: 3776
RecordsHandled: 2000000  totalTestTime: 239251  writeTime: 219041  readTime: 8127}}





--
This message was sent by Atlassian Jira
(v8.3.4#803005)
