[ https://issues.apache.org/jira/browse/SPARK-32350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Baohe Zhang updated SPARK-32350:
--------------------------------
    Description: 
The idea is to improve the performance of HybridStore by adding batch write 
support to LevelDB. https://issues.apache.org/jira/browse/SPARK-31608 
introduced HybridStore, which first writes data to InMemoryStore and then uses 
a background thread to dump the data to LevelDB once writing to InMemoryStore 
is complete. In the comments of [https://github.com/apache/spark/pull/28412], 
Mridul Muralidharan mentioned that batch writing can improve the performance 
of this dumping process, and he wrote the code of writeAll().
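The intended benefit of writeAll() can be sketched with a minimal, stdlib-only illustration. The Store class below and its commit counter are hypothetical stand-ins, not Spark's actual KVStore or LevelDB API: the point is only that a batched writeAll() pays the per-write commit overhead (in real LevelDB, the synchronization and log write for each WriteBatch) once per batch instead of once per record.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a LevelDB-backed store; "commits" simulates the
// per-write overhead that batching amortizes.
class Store {
    final Map<String, String> data = new HashMap<>();
    int commits = 0;

    void write(String key, String value) {
        data.put(key, value);
        commits++;              // one commit per record
    }

    void writeAll(Map<String, String> batch) {
        data.putAll(batch);
        commits++;              // one commit for the whole batch
    }
}

public class BatchWriteSketch {
    public static void main(String[] args) {
        Store oneByOne = new Store();
        for (int i = 0; i < 1000; i++) oneByOne.write("k" + i, "v" + i);

        Store batched = new Store();
        Map<String, String> batch = new LinkedHashMap<>();
        for (int i = 0; i < 1000; i++) batch.put("k" + i, "v" + i);
        batched.writeAll(batch);

        System.out.println("one-by-one commits: " + oneByOne.commits); // 1000
        System.out.println("batched commits: " + batched.commits);     // 1
    }
}
```

Both stores end up with identical contents; only the number of simulated commits differs, which is why the gap widens when the disk is already saturated.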

I compared the HybridStore switching time between one-by-one writes and batch 
writes on an HDD. When the disk is free, batch write gives around a 25% 
improvement; when the disk is 100% busy, batch write gives a 7x - 10x 
improvement.

when the disk is at 0% utilization:

||log size, jobs and tasks per job||original switching time, with write()||switching time with writeAll()||
|133 MB, 400 jobs, 100 tasks per job|16s|13s|
|265 MB, 400 jobs, 200 tasks per job|30s|23s|
|1.3 GB, 1000 jobs, 400 tasks per job|136s|108s|

 

when the disk is at 100% utilization:

||log size, jobs and tasks per job||original switching time, with write()||switching time with writeAll()||
|133 MB, 400 jobs, 100 tasks per job|116s|17s|
|265 MB, 400 jobs, 200 tasks per job|251s|26s|

I also ran some write-related benchmark tests from LevelDBBenchmark.java, 
measuring the total time of writing 1024 objects.

when the disk is at 0% utilization:

||Benchmark test||with write(), ms||with writeAll(), ms||
|randomUpdatesIndexed|213.060|157.356|
|randomUpdatesNoIndex|57.869|35.439|
|randomWritesIndexed|298.854|229.274|
|randomWritesNoIndex|66.764|38.361|
|sequentialUpdatesIndexed|87.019|56.219|
|sequentialUpdatesNoIndex|61.851|41.942|
|sequentialWritesIndexed|94.044|56.534|
|sequentialWritesNoIndex|118.345|66.483|

 

when the disk is at 50% utilization:
||Benchmark test||with write(), ms||with writeAll(), ms||
|randomUpdatesIndexed|230.386|180.817|
|randomUpdatesNoIndex|58.935|50.113|
|randomWritesIndexed|315.241|254.400|
|randomWritesNoIndex|96.709|41.164|
|sequentialUpdatesIndexed|89.971|70.387|
|sequentialUpdatesNoIndex|72.021|53.769|
|sequentialWritesIndexed|103.052|67.358|
|sequentialWritesNoIndex|76.194|99.037|



> Add batch write support on LevelDB to improve performance of HybridStore
> ------------------------------------------------------------------------
>
>                 Key: SPARK-32350
>                 URL: https://issues.apache.org/jira/browse/SPARK-32350
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>    Affects Versions: 3.0.1, 3.1.0
>            Reporter: Baohe Zhang
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
