[
https://issues.apache.org/jira/browse/SPARK-13004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen updated SPARK-13004:
------------------------------
Target Version/s: (was: 1.6.0)
OK, sounds interesting, but it sounds like you are already developing this
separately. I don't see why this is a Spark JIRA (yet). What specific change
does this propose?
> Support Non-Volatile Data and Operations
> ----------------------------------------
>
> Key: SPARK-13004
> URL: https://issues.apache.org/jira/browse/SPARK-13004
> Project: Spark
> Issue Type: Epic
> Components: Input/Output, Spark Core
> Affects Versions: 1.5.0, 1.6.0
> Reporter: Wang, Gang
> Labels: Non-VolatileRDD, Non-volatileComputing, RDD, performance
>
> Based on our experiments, SerDe-like operations have a significant
> negative performance impact on the majority of industrial Spark workloads,
> especially when the volume of the datasets is much larger than the system
> memory available in the Spark cluster for caching, checkpointing,
> shuffling/dispatching, and data loading and storing. JVM on-heap memory
> management also degrades performance under the pressure of large
> memory demand and frequent memory allocation/free operations.
> With the trend of adopting advanced server platform technologies, e.g. Large
> Memory Server, Non-volatile Memory and NVMe/Fast SSD Array Storage, this
> project focuses on adopting new features provided by the server platform for
> Spark applications and retrofitting the utilization of hybrid addressable
> memory resources onto Spark whenever possible.
> *Data Object Management*
> * Using our non-volatile generic object programming model (NVGOP) to avoid
> SerDe as well as reduce GC overhead.
> * Minimizing memory footprint by loading data lazily.
> * Fitting naturally with RDD schemas in non-volatile RDD and off-heap RDD.
> * Using non-volatile/off-heap RDDs to transform Spark datasets.
> * Avoiding memory caching by way of in-place non-volatile RDD
> operations.
> * Avoiding checkpoints for Spark computing.
> *Data Memory Management*
>
> * Managing heterogeneous memory devices as a unified hybrid memory cache
> pool for Spark.
> * Using non-volatile memory-like devices for Spark checkpoint and shuffle.
> * Supporting automatic reclamation of allocated memory blocks.
> * Providing a unified memory block API for general-purpose memory
> usage.
>
> *Computing Device Management*
> * AVX instructions, programmable FPGA and GPU.
>
> Our customized Spark prototype has shown some potential improvements.
> [https://github.com/NonVolatileComputing/spark/tree/NonVolatileRDD]
> !http://bigdata-memory.github.io/images/Spark_mlib_kmeans.png|width=300!
> !http://bigdata-memory.github.io/images/total_GC_STW_pausetime.png|width=300!
>
> This epic aims to further improve Spark performance with our
> non-volatile solutions.