baohe-zhang opened a new pull request #29149:
URL: https://github.com/apache/spark/pull/29149


   ### What changes were proposed in this pull request?
   This PR improves the performance of HybridStore by adding batch write
support to LevelDB. #28412 introduced HybridStore, which first writes data to
InMemoryStore and then uses a background thread to dump the data to LevelDB
once writing to InMemoryStore is complete. In the comments on #28412, @mridulm
mentioned that batch writing can improve the performance of this dumping
process, and he wrote the code for writeAll().
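
   The difference between the two dumping strategies can be sketched with a toy
model. This is only an illustration under stated assumptions: `ToyDiskStore`,
`BatchWriteSketch`, and the commit counter are hypothetical names invented for
this sketch, not Spark's LevelDB class or the writeAll() code in this PR. The
counter stands in for the fixed cost paid on each write call.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a disk-backed store. This is NOT Spark's LevelDB
// class; the commit counter only models the fixed cost paid per write call.
class ToyDiskStore {
    private final Map<String, String> data = new LinkedHashMap<>();
    private int commits = 0;

    // One-by-one write: every call pays for its own commit.
    void write(String key, String value) {
        data.put(key, value);
        commits++;
    }

    // Batch write: all entries are applied together and committed once.
    void writeAll(Map<String, String> entries) {
        if (entries.isEmpty()) {
            return; // nothing to commit
        }
        data.putAll(entries);
        commits++;
    }

    int commits() { return commits; }
    int size() { return data.size(); }
}

class BatchWriteSketch {
    // Builds n entries that stand in for the contents of the in-memory store.
    static Map<String, String> inMemoryEntries(int n) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            entries.put("task-" + i, "payload-" + i);
        }
        return entries;
    }

    // Dumps the entries with one write() call per entry.
    static ToyDiskStore dumpOneByOne(Map<String, String> entries) {
        ToyDiskStore store = new ToyDiskStore();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            store.write(e.getKey(), e.getValue());
        }
        return store;
    }

    // Dumps the entries with a single writeAll() call.
    static ToyDiskStore dumpInBatch(Map<String, String> entries) {
        ToyDiskStore store = new ToyDiskStore();
        store.writeAll(entries);
        return store;
    }

    public static void main(String[] args) {
        Map<String, String> entries = inMemoryEntries(1024);
        System.out.println("one-by-one commits: " + dumpOneByOne(entries).commits());
        System.out.println("batch commits: " + dumpInBatch(entries).commits());
    }
}
```

   Both strategies end with the same stored data; only the number of simulated
commits differs, which is the overhead batch writing is meant to reduce.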
   
   ### Why are the changes needed?
   I compared the HybridStore switching time between one-by-one writes and
batch writes on an HDD. When the disk is idle, the batch write gives around a
25% improvement, and when the disk is 100% busy, it gives a 7x to 10x
improvement.
   
   When the disk is at 0% utilization:
   | Log size, jobs, and tasks per job | Original switching time, with write() | Switching time with writeAll() |
   | ---------------------------------- | ------------------------------------- | ------------------------------ |
   | 133 MB, 400 jobs, 100 tasks per job | 16s | 13s |
   | 265 MB, 400 jobs, 200 tasks per job | 30s | 23s |
   | 1.3 GB, 1000 jobs, 400 tasks per job | 136s | 108s |
   
   When the disk is at 100% utilization:
   | Log size, jobs, and tasks per job | Original switching time, with write() | Switching time with writeAll() |
   | --------------------------------- | ------------------------------------- | ------------------------------ |
   | 133 MB, 400 jobs, 100 tasks per job | 116s | 17s |
   | 265 MB, 400 jobs, 200 tasks per job | 251s | 26s |
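
   As a quick check on the figures above, the speedup for each row can be
recomputed as the write() time divided by the writeAll() time. With the idle
disk this gives roughly 1.23x to 1.30x, and with the busy disk roughly 6.8x and
9.7x. The class and method names below are illustrative only and are not part
of the PR.

```java
// Recomputes speedups from the switching times reported in the tables above.
class SwitchingTimeSpeedup {
    // Speedup factor: how many times faster writeAll() is than write().
    static double speedup(double writeSeconds, double writeAllSeconds) {
        return writeSeconds / writeAllSeconds;
    }

    // Percentage of the original switching time that writeAll() saves.
    static double percentSaved(double writeSeconds, double writeAllSeconds) {
        return 100.0 * (writeSeconds - writeAllSeconds) / writeSeconds;
    }

    static void report(String label, double[][] rows) {
        System.out.println(label);
        for (double[] row : rows) {
            System.out.printf("  %.0fs -> %.0fs: %.2fx faster, %.1f%% less time%n",
                    row[0], row[1], speedup(row[0], row[1]), percentSaved(row[0], row[1]));
        }
    }

    public static void main(String[] args) {
        // Each pair is {time with write(), time with writeAll()} in seconds.
        double[][] idleDisk = {{16, 13}, {30, 23}, {136, 108}};
        double[][] busyDisk = {{116, 17}, {251, 26}};
        report("disk at 0% utilization:", idleDisk);
        report("disk at 100% utilization:", busyDisk);
    }
}
```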
   
   I also ran some write-related benchmark tests in LevelDBBenchmark.java and
measured the total time to write 1024 objects. The tests were conducted with
the disk at 0% utilization.
   
   | Benchmark test           | with write(), ms | with writeAll(), ms |
   | ------------------------ | ---------------- | ------------------- |
   | randomUpdatesIndexed     | 213.06           | 157.356             |
   | randomUpdatesNoIndex     | 57.869           | 35.439              |
   | randomWritesIndexed      | 298.854          | 229.274             |
   | randomWritesNoIndex      | 66.764           | 38.361              |
   | sequentialUpdatesIndexed | 87.019           | 56.219              |
   | sequentialUpdatesNoIndex | 61.851           | 41.942              |
   | sequentialWritesIndexed  | 94.044           | 56.534              |
   | sequentialWritesNoIndex  | 118.345          | 66.483              |
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Manually tested.
   

