baohe-zhang opened a new pull request #29149: URL: https://github.com/apache/spark/pull/29149
### What changes were proposed in this pull request?

This PR improves the performance of HybridStore by adding batch write support to LevelDB. #28412 introduced HybridStore, which first writes data to InMemoryStore and then uses a background thread to dump the data to LevelDB once writing to InMemoryStore has completed. In the comments on #28412, @mridulm pointed out that batch writing can speed up this dumping process, and he wrote the code for writeAll().

### Why are the changes needed?

I compared the HybridStore switching time for one-by-one writes and batch writes on an HDD. When the disk is idle, batch writing is around 25% faster; when the disk is 100% busy, it is 7x - 10x faster.

When the disk is at 0% utilization:

| log size, jobs and tasks per job     | original switching time, with write() | switching time with writeAll() |
| ------------------------------------ | ------------------------------------- | ------------------------------ |
| 133 MB, 400 jobs, 100 tasks per job  | 16s                                   | 13s                            |
| 265 MB, 400 jobs, 200 tasks per job  | 30s                                   | 23s                            |
| 1.3 GB, 1000 jobs, 400 tasks per job | 136s                                  | 108s                           |

When the disk is at 100% utilization:

| log size, jobs and tasks per job    | original switching time, with write() | switching time with writeAll() |
| ----------------------------------- | ------------------------------------- | ------------------------------ |
| 133 MB, 400 jobs, 100 tasks per job | 116s                                  | 17s                            |
| 265 MB, 400 jobs, 200 tasks per job | 251s                                  | 26s                            |

I also ran the write-related benchmark tests in LevelDBBenchmark.java and measured the total time to write 1024 objects. The tests were run with the disk at 0% utilization.

| Benchmark test           | with write(), ms | with writeAll(), ms |
| ------------------------ | ---------------- | ------------------- |
| randomUpdatesIndexed     | 213.06           | 157.356             |
| randomUpdatesNoIndex     | 57.869           | 35.439              |
| randomWritesIndexed      | 298.854          | 229.274             |
| randomWritesNoIndex      | 66.764           | 38.361              |
| sequentialUpdatesIndexed | 87.019           | 56.219              |
| sequentialUpdatesNoIndex | 61.851           | 41.942              |
| sequentialWritesIndexed  | 94.044           | 56.534              |
| sequentialWritesNoIndex  | 118.345          | 66.483              |

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually tested.
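
For readers unfamiliar with the difference between the two write paths, the following is a minimal sketch of the idea only, not the code in this PR. `BatchWriteSketch`, `ToyStore`, and `sampleEntries` are hypothetical names, and the in-memory map with a commit counter merely stands in for a disk-backed store: each commit represents one round trip to storage, which is the cost that batching is assumed to reduce.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: contrasts one-by-one writes with a single batched write. */
public class BatchWriteSketch {

    /** Hypothetical store; the commit counter stands in for round trips to disk. */
    static final class ToyStore {
        private final Map<String, String> data = new LinkedHashMap<>();
        private int commits = 0;

        /** One-by-one path: every entry is committed on its own. */
        void write(String key, String value) {
            data.put(key, value);
            commits++;
        }

        /** Batch path: all entries are staged together and committed once. */
        void writeAll(Map<String, String> entries) {
            if (entries.isEmpty()) {
                return; // nothing to commit
            }
            data.putAll(entries);
            commits++;
        }

        int commits() {
            return commits;
        }

        int size() {
            return data.size();
        }
    }

    /** Builds n distinct key/value pairs to write. */
    static Map<String, String> sampleEntries(int n) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            entries.put("key-" + i, "value-" + i);
        }
        return entries;
    }

    public static void main(String[] args) {
        Map<String, String> entries = sampleEntries(1024);

        ToyStore oneByOne = new ToyStore();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            oneByOne.write(e.getKey(), e.getValue());
        }

        ToyStore batched = new ToyStore();
        batched.writeAll(entries);

        System.out.println("one-by-one commits: " + oneByOne.commits()); // 1024
        System.out.println("batched commits: " + batched.commits());     // 1
    }
}
```

Both stores end up holding the same 1024 entries; only the number of commits differs. How much wall-clock time this saves on a real disk depends on the store and on disk load, as the measurements above show.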
