[jira] [Commented] (SPARK-12196) Store blocks in different speed storage devices by hierarchy way

wei wu (JIRA) Mon, 28 Dec 2015 01:36:07 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-12196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15072567#comment-15072567
 ]


wei wu commented on SPARK-12196:
--------------------------------

We have a similar idea for SSD support in the Block Manager and have built a 
prototype of it. We have also run some performance tests on the prototype. 
How about adding the following functionality and API?

We used the benchmark from Databricks: 
https://github.com/databricks/spark-perf/tree/master/spark-tests
Test configuration: 3 executors, each with 4GB of memory and 2 cores; data 
size: 1867MB.
The performance results are:
Test case            Memory    SSD      HDD
count                0.259s    3s       6.75s
count-with-filter    0.56s     3.24s    10s
aggregate-by-key     2s        4.8s     9s
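
To make the SSD vs. HDD gap easier to read, the ratios can be worked out directly from the timings in the table. This is only arithmetic on the numbers reported above (the helper name is made up for illustration); it shows HDD taking roughly 1.9x to 3.1x as long as SSD on these three tests.

```python
# Timings in seconds, copied from the results table above.
timings = {
    "count": {"memory": 0.259, "ssd": 3.0, "hdd": 6.75},
    "count-with-filter": {"memory": 0.56, "ssd": 3.24, "hdd": 10.0},
    "aggregate-by-key": {"memory": 2.0, "ssd": 4.8, "hdd": 9.0},
}


def hdd_over_ssd(results):
    """Return, per test case, how many times longer HDD takes than SSD."""
    return {name: t["hdd"] / t["ssd"] for name, t in results.items()}


ratios = hdd_over_ssd(timings)
for name, ratio in ratios.items():
    print(f"{name}: HDD takes {ratio:.2f}x as long as SSD")
```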

The prototype is configured as follows. The format is similar to the Hadoop 
DataNode path configuration:
spark.local.dir = [DISK]file:///disk0;[SSD]file:///disk1;[DISK]file:///disk2;[SSD]file:///disk3;[DISK]file:///disk4;[DISK]file:///disk5;[DISK]file:///disk6;[DISK]file:///disk7;
or
spark.local.dir = file:///disk0;[SSD]file:///disk1;file:///disk2;[SSD]file:///disk3;file:///disk4;file:///disk5;file:///disk6;file:///disk7;
or
spark.local.dir = file:///disk0;file:///disk1;file:///disk2;file:///disk3;file:///disk4;file:///disk5;file:///disk6;file:///disk7;

We add [SSD] and [DISK] identifiers to the disk paths. 
[SSD] marks a path as SSD storage; [DISK] marks a path as HDD storage.
If the identifier is omitted, the path defaults to HDD storage.
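
As a rough illustration of the rule above, a tagged spark.local.dir value could be parsed along these lines. This is a sketch, not the prototype's code: the function names are hypothetical, and the handling of a trailing ';' and of unknown tags is an assumption.

```python
# Hypothetical parser for the tagged spark.local.dir format described above.
KNOWN_TAGS = {"SSD", "DISK"}


def parse_entry(entry):
    """Parse one entry such as '[SSD]file:///disk1' into (storage_type, uri)."""
    if entry.startswith("["):
        tag, closing, uri = entry[1:].partition("]")
        if not closing:
            raise ValueError(f"missing ']' in entry: {entry!r}")
    else:
        # No identifier: the path defaults to HDD storage, i.e. [DISK].
        tag, uri = "DISK", entry
    tag, uri = tag.strip().upper(), uri.strip()
    if tag not in KNOWN_TAGS:
        raise ValueError(f"unknown storage tag: {tag!r}")
    if not uri:
        raise ValueError(f"missing path in entry: {entry!r}")
    return tag, uri


def parse_local_dirs(value):
    """Split a ';'-separated list into (storage_type, uri) pairs."""
    entries = (raw.strip() for raw in value.split(";"))
    # Empty entries (for example after a trailing ';') are skipped.
    return [parse_entry(e) for e in entries if e]


if __name__ == "__main__":
    conf = "file:///disk0;[SSD]file:///disk1;[DISK]file:///disk2;"
    for storage_type, uri in parse_local_dirs(conf):
        print(storage_type, uri)
```

Untagged and [DISK] entries end up in the same group, so the third configuration form (no tags at all) simply yields all-HDD storage.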

Add the related StorageLevel API for SSD:
StorageLevel.MEMORY_AND_SSD            // cache the block in memory, then ssd
StorageLevel.SSD_ONLY                  // cache the block only in ssd
StorageLevel.MEMORY_AND_SSD_AND_DISK   // cache the block in memory, then ssd, then hdd
StorageLevel.SSD_AND_DISK              // cache the block in ssd, then hdd

For example, the user can use the following API calls to cache block data:
RDD.persist(StorageLevel.MEMORY_AND_SSD)
RDD.persist(StorageLevel.SSD_ONLY)
RDD.persist(StorageLevel.SSD_AND_DISK)
RDD.persist(StorageLevel.MEMORY_AND_SSD_AND_DISK)
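
The fallback order implied by each proposed level can be modelled as an ordered list of tiers. The sketch below is plain Python, not Spark code: these levels do not exist in Spark's StorageLevel today, and the names choose_tier and free_bytes are invented for illustration. It only shows the "memory, then ssd, then hdd" ordering and ignores eviction and serialization.

```python
# Hypothetical model of the proposed levels: each maps to tiers tried in order.
PROPOSED_LEVELS = {
    "MEMORY_AND_SSD": ["memory", "ssd"],
    "SSD_ONLY": ["ssd"],
    "MEMORY_AND_SSD_AND_DISK": ["memory", "ssd", "hdd"],
    "SSD_AND_DISK": ["ssd", "hdd"],
}


def choose_tier(level, block_size, free_bytes):
    """Return the first tier of `level` with room for the block, or None."""
    for tier in PROPOSED_LEVELS[level]:
        if free_bytes.get(tier, 0) >= block_size:
            return tier
    return None  # no tier allowed by this level can hold the block


if __name__ == "__main__":
    free = {"memory": 64, "ssd": 512, "hdd": 4096}
    # Memory is too small for a 128-byte block, so the block lands on ssd.
    print(choose_tier("MEMORY_AND_SSD", 128, free))  # prints "ssd"
```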






> Store blocks in different speed storage devices by hierarchy way
> ----------------------------------------------------------------
>
>                 Key: SPARK-12196
>                 URL: https://issues.apache.org/jira/browse/SPARK-12196
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>            Reporter: yucai
>
> *Problem*
> Nowadays, users have both SSDs and HDDs. 
> SSDs have great performance, but their capacity is small. HDDs have good 
> capacity, but their performance is 2x-3x lower than that of SSDs.
> How can we get the best of both?
> *Solution*
> Our idea is to build a hierarchy store: use SSDs as cache and HDDs as backup 
> storage. 
> When Spark core allocates blocks for an RDD (either shuffle or RDD cache), it 
> gets blocks from SSDs first, and when the SSDs' usable space falls below some 
> threshold, it gets blocks from HDDs.
> In our implementation, we actually go further. We support a way to build a 
> hierarchy store with any number of levels across all storage media (NVM, SSD, 
> HDD, etc.).
> *Performance*
> 1. In the best case, our solution performs the same as all SSDs.
> 2. In the worst case, such as when all data is spilled to HDDs, there is no 
> performance regression.
> 3. Compared with all HDDs, the hierarchy store improves performance by more 
> than *_x1.86_* (it could be higher; the CPU becomes the bottleneck in our 
> test environment).
> 4. Compared with Tachyon, our hierarchy store is still *_x1.3_* faster, 
> because we support both RDD cache and shuffle and need no extra 
> inter-process communication.
> *Usage*
> 1. Set the priority and threshold for each layer in 
> spark.storage.hierarchyStore.
> {code}
> spark.storage.hierarchyStore='nvm 50GB,ssd 80GB'
> {code}
> It builds a 3-layer hierarchy store: the 1st is "nvm", the 2nd is "ssd", and 
> all the rest form the last layer.
> 2. Configure each layer's location. The user just needs to put the keywords 
> specified in step 1, like "nvm" and "ssd", into the local dirs, such as 
> spark.local.dir or yarn.nodemanager.local-dirs.
> {code}
> spark.local.dir=/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/ssd3,/mnt/disk1,/mnt/disk2,/mnt/disk3,/mnt/disk4,/mnt/others
> {code}
> After that, restart your Spark application; it will allocate blocks from nvm 
> first.
> When nvm's usable space is less than 50GB, it starts to allocate from ssd.
> When ssd's usable space is less than 80GB, it starts to allocate from the 
> last layer.
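
The two usage steps quoted above can be sketched roughly as follows. This is not the implementation attached to the issue; the function names are hypothetical, and two details are assumptions: a directory is matched to a layer when the layer keyword appears anywhere in its path, and "GB" is read as 1024^3 bytes.

```python
# Hypothetical sketch of the hierarchy-store configuration described above.
UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_size(text):
    """Convert a size such as '50GB' to bytes (binary units assumed)."""
    text = text.strip().upper()
    for suffix, factor in UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    return int(text)  # plain number of bytes


def parse_hierarchy(value):
    """'nvm 50GB,ssd 80GB' -> [('nvm', <bytes>), ('ssd', <bytes>)]."""
    layers = []
    for item in value.split(","):
        keyword, threshold = item.split()
        layers.append((keyword, parse_size(threshold)))
    return layers


def group_dirs(layers, local_dirs):
    """Group dirs by the first layer keyword found in the path.

    Directories that match no keyword form the extra, last layer.
    """
    groups = [[] for _ in range(len(layers) + 1)]
    for path in local_dirs:
        for index, (keyword, _) in enumerate(layers):
            if keyword in path:
                groups[index].append(path)
                break
        else:
            groups[-1].append(path)
    return groups


def pick_layer(layers, usable_bytes):
    """Return the index of the first layer still at or above its threshold.

    usable_bytes[i] is the usable space of configured layer i; when every
    configured layer is below its threshold, the last layer is used.
    """
    for index, (_, threshold) in enumerate(layers):
        if usable_bytes[index] >= threshold:
            return index
    return len(layers)


if __name__ == "__main__":
    layers = parse_hierarchy("nvm 50GB,ssd 80GB")
    dirs = "/mnt/nvm1,/mnt/ssd1,/mnt/ssd2,/mnt/disk1,/mnt/others".split(",")
    print(group_dirs(layers, dirs))
    gb = 1024**3
    # nvm has dropped below 50GB usable, so allocation moves on to ssd.
    print(pick_layer(layers, [40 * gb, 100 * gb]))  # prints 1
```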


