[
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
yucai updated SPARK-12196:
--------------------------
Comment: was deleted
(was: Sorry, I have to delete this PR because of my wrong github operation, I
will send a new one ASAP.)
> Store blocks in storage devices with hierarchy way
> --------------------------------------------------
>
> Key: SPARK-12196
> URL: https://issues.apache.org/jira/browse/SPARK-12196
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs.
> SSDs have great performance, but their capacity is small. HDDs have good
> capacity, but their performance is 2x-3x lower than that of SSDs.
> How can we get the best of both?
> *Solution*
> Our idea is to build a hierarchy store: use SSDs as a cache and HDDs as backup
> storage.
> When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it
> gets blocks from SSDs first, and when the SSDs' usable space falls below some
> threshold, it gets blocks from HDDs instead.
> In our implementation, we actually go further. We support building a hierarchy
> store with any number of levels across all storage media (NVM, SSD, HDD, etc.).
> *Performance*
> 1. In the best case, our solution performs the same as all SSDs.
> 2. In the worst case, such as when all data is spilled to HDDs, there is no
> performance regression.
> 3. Compared with all HDDs, the hierarchy store improves performance by more
> than *_x1.86_* (it could be higher; the CPU became the bottleneck in our test
> environment).
> 4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, because
> we support both RDD cache and shuffle and need no extra inter-process
> communication.
> *Usage*
> 1. In spark-defaults.conf, configure spark.hierarchyStore.
> {code}
> spark.hierarchyStore nvm 50GB, ssd 80GB
> {code}
> It builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", and
> all the rest form the last layer.
> 2. Configure the "nvm" and "ssd" locations in the local dirs, such as
> spark.local.dir or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> After that, restart your Spark application; it will allocate blocks from nvm
> first.
> When nvm's usable space is less than 50GB, it starts to allocate from ssd.
> When ssd's usable space is less than 80GB, it starts to allocate from the
> last layer.
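
The layer selection described above can be sketched roughly as follows. This is
only an illustration in Python, not the code from the PR: the function names
(parse_hierarchy_store, group_dirs, pick_layer) are hypothetical, and two
details are assumptions because the description does not state them: that a
local dir belongs to a layer when the layer name occurs in its path, and that a
layer's usable space is the sum over its dirs.

```python
import shutil
from typing import Callable, List, Tuple

_UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_hierarchy_store(spec: str) -> List[Tuple[str, int]]:
    """Parse 'nvm 50GB, ssd 80GB' into [(layer_name, threshold_bytes), ...]."""
    layers = []
    for entry in spec.split(","):
        name, size = entry.split()
        layers.append((name, int(size[:-2]) * _UNITS[size[-2:].upper()]))
    return layers


def group_dirs(local_dirs: str, layers: List[Tuple[str, int]]) -> List[List[str]]:
    """Group local dirs by layer; unmatched dirs form the implicit last layer.

    Assumption: a dir belongs to the first layer whose name occurs in its path.
    """
    groups: List[List[str]] = [[] for _ in range(len(layers) + 1)]
    for path in local_dirs.split(","):
        for i, (name, _) in enumerate(layers):
            if name in path:
                groups[i].append(path)
                break
        else:
            groups[-1].append(path)
    return groups


def free_bytes(path: str) -> int:
    """Usable space of the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


def pick_layer(
    layers: List[Tuple[str, int]],
    groups: List[List[str]],
    usable: Callable[[str], int] = free_bytes,
) -> int:
    """Return the index of the first layer whose usable space is at or above
    its threshold; fall through to the last layer otherwise.

    Assumption: a layer's usable space is summed over all of its dirs.
    """
    for i, (_, threshold) in enumerate(layers):
        if groups[i] and sum(usable(d) for d in groups[i]) >= threshold:
            return i
    return len(layers)
```

With the example settings above, parse_hierarchy_store("nvm 50GB, ssd 80GB")
yields two explicit layers, group_dirs puts /mnt/nvm1 in the first, the three
/mnt/ssd* dirs in the second and everything else in the last, and pick_layer
moves down a layer each time the one above drops below its threshold.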
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)