[ https://issues.apache.org/jira/browse/SPARK-32350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Baohe Zhang updated SPARK-32350:
--------------------------------
    Description:

The idea is to improve the performance of HybridStore by adding batch write support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 introduced HybridStore, which first writes data to InMemoryStore and then uses a background thread to dump the data to LevelDB once writing to InMemoryStore is complete. In the comments of [https://github.com/apache/spark/pull/28412], Mridul Muralidharan suggested that batch writing could improve the performance of this dumping process, and he wrote the code for writeAll().

I compared the HybridStore switching time between one-by-one write and batch write on an HDD. When the disk is free, batch write gives around a 25% improvement; when the disk is 100% busy, batch write gives a 7x to 10x improvement.

When the disk is at 0% utilization:
||log size, jobs and tasks per job||original switching time, with write()||switching time with writeAll()||
|133m, 400 jobs, 100 tasks per job|16s|13s|
|265m, 400 jobs, 200 tasks per job|30s|23s|
|1.3g, 1000 jobs, 400 tasks per job|136s|108s|

When the disk is at 100% utilization:
||log size, jobs and tasks per job||original switching time, with write()||switching time with writeAll()||
|133m, 400 jobs, 100 tasks per job|116s|17s|
|265m, 400 jobs, 200 tasks per job|251s|26s|

I also ran some write-related benchmarks from LevelDBBenchmark.java and measured the total time of writing 1024 objects.

When the disk is at 0% utilization:
||Benchmark test||with write(), ms||with writeAll(), ms||
|randomUpdatesIndexed|213.060|157.356|
|randomUpdatesNoIndex|57.869|35.439|
|randomWritesIndexed|298.854|229.274|
|randomWritesNoIndex|66.764|38.361|
|sequentialUpdatesIndexed|87.019|56.219|
|sequentialUpdatesNoIndex|61.851|41.942|
|sequentialWritesIndexed|94.044|56.534|
|sequentialWritesNoIndex|118.345|66.483|

When the disk is at 50% utilization:
||Benchmark test||with write(), ms||with writeAll(), ms||
|randomUpdatesIndexed|230.386|180.817|
|randomUpdatesNoIndex|58.935|50.113|
|randomWritesIndexed|315.241|254.400|
|randomWritesNoIndex|96.709|41.164|
|sequentialUpdatesIndexed|89.971|70.387|
|sequentialUpdatesNoIndex|72.021|53.769|
|sequentialWritesIndexed|103.052|67.358|
|sequentialWritesNoIndex|76.194|99.037|


> Add batch write support on LevelDB to improve performance of HybridStore
> ------------------------------------------------------------------------
>
>                 Key: SPARK-32350
>                 URL: https://issues.apache.org/jira/browse/SPARK-32350
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>    Affects Versions: 3.0.1, 3.1.0
>            Reporter: Baohe Zhang
>            Priority: Major
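The gain comes from amortizing the per-write commit cost: write() pays one LevelDB commit per object, while writeAll() groups the whole dump into a single batch, which matters most when the disk is the bottleneck. A minimal self-contained sketch of that pattern, where LevelDBLike and its commit counter are illustrative stand-ins and not Spark's actual kvstore API:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchDumpSketch {
    // Hypothetical stand-in for a disk-backed store that counts commits.
    static class LevelDBLike {
        int commits = 0;
        List<String> data = new ArrayList<>();

        // One-by-one write: each call pays a full commit (seek/sync on HDD).
        void write(String value) {
            data.add(value);
            commits++;
        }

        // Batch write: all values land in a single commit, mirroring the
        // idea behind writeAll() in the description above.
        void writeAll(List<String> values) {
            data.addAll(values);
            commits++;
        }
    }

    // Dump n in-memory objects both ways; return {one-by-one, batched} commit counts.
    public static int[] compare(int n) {
        List<String> inMemory = new ArrayList<>();
        for (int i = 0; i < n; i++) inMemory.add("task-" + i);

        LevelDBLike oneByOne = new LevelDBLike();
        for (String v : inMemory) oneByOne.write(v);

        LevelDBLike batched = new LevelDBLike();
        batched.writeAll(inMemory);

        return new int[] { oneByOne.commits, batched.commits };
    }

    public static void main(String[] args) {
        int[] commits = compare(1024);
        // 1024 commits for one-by-one vs. a single commit for the batch.
        System.out.println(commits[0] + " " + commits[1]);
    }
}
```

The real implementation would build on LevelDB's atomic batch-update facility rather than a counter, but the commit-amortization logic is the same.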
--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org