[
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
yucai updated SPARK-12196:
--------------------------
Description:
*Problem*:
Nowadays, many users have both SSDs and HDDs.
SSDs offer great performance, but their capacity is low. HDDs offer good
capacity, but their performance is 2x-3x lower than that of SSDs.
How can we get the best of both?
*Solution*:
Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup
storage.
When Spark core allocates blocks (for either shuffle or RDD cache), it
allocates them from SSDs first; when the SSDs' usable space falls below some
threshold, it allocates them from HDDs instead.
In our implementation, we actually go further: we support building a hierarchy
store with any number of levels across all storage media (NVM, SSD, HDD, etc.).
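The allocation rule above can be sketched roughly as follows. This is only an
illustration in Python, not the code from the patch and not a Spark API: the
names Tier, usable_bytes and pick_tier, and the per-tier min_free_bytes
threshold, are hypothetical. It assumes tiers are listed fastest first and that
the last tier always serves as the backup.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Tier:
    """One storage level (hypothetical structure, not a Spark class)."""
    name: str            # e.g. "nvm", "ssd", "hdd"
    path: str            # directory located on that device
    min_free_bytes: int  # skip this tier once usable space drops below this


def usable_bytes(path: str) -> int:
    """Free space, in bytes, of the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


def pick_tier(
    tiers: Sequence[Tier],
    free_space: Callable[[str], int] = usable_bytes,
) -> Tier:
    """Return the fastest tier whose usable space is at or above its threshold.

    Tiers are expected fastest first. The last tier is the backup and is
    returned whenever every faster tier is below its threshold.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    for tier in tiers[:-1]:
        if free_space(tier.path) >= tier.min_free_bytes:
            return tier
    return tiers[-1]
```

With tiers listed as NVM, SSD, HDD, a new block would go to NVM until its
usable space drops below the threshold, then to SSD, and finally to HDD. The
free_space argument exists only so the rule can be exercised without real
devices.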
*Performance*:
1. In the best case, our solution performs the same as an all-SSD setup.
2. In the worst case, e.g. when all data is spilled to HDDs, there is no
performance regression.
3. Compared with an all-HDD setup, the hierarchy store improves performance by
more than 1.86x (it could be higher; the CPU became the bottleneck in our test
environment).
4. Compared with Tachyon, our hierarchy store is still 1.3x faster, because we
support both RDD cache and shuffle and need no extra inter-process
communication.
> Store blocks in storage devices with hierarchy way
> --------------------------------------------------
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: yucai