RE: [DISCUSS] Mnemonic incubator proposal

Wang, Yanping Mon, 22 Feb 2016 21:25:42 -0800

Hi, All

I uploaded a PDF presentation that describes Project Mnemonic with some nice 
pictures.
Click the attachment link below to see the presentation.


Attachment name: Project_Mnemonic_Pub1.0.pdf
Attachment size: 1493317
Attachment link: 
https://wiki.apache.org/incubator/MnemonicProposal?action=AttachFile&do=get&target=Project_Mnemonic_Pub1.0.pdf
 

Page link: https://wiki.apache.org/incubator/MnemonicProposal 

Thanks
Yanping

-----Original Message-----
From: Wang, Yanping [mailto:yanping.w...@intel.com] 
Sent: Sunday, February 21, 2016 11:47 AM
To: general@incubator.apache.org
Subject: [DISCUSS] Mnemonic incubator proposal 

Hi all 

We'd like to start a discussion regarding a proposal to submit Mnemonic to the 
Apache Incubator.

The proposal text is available on the Wiki here:
https://wiki.apache.org/incubator/MnemonicProposal

and pasted below for convenience.

We are excited to make this proposal, and look forward to the community's input!

Best,
Yanping


= Mnemonic Proposal =
=== Abstract ===
Mnemonic is a Java-based non-volatile memory library for in-place structured 
data processing and computing. It is a solution for generic object and block 
persistence on heterogeneous block and byte-addressable devices, such as DRAM, 
persistent memory, NVMe, SSD, and cloud network storage.

=== Proposal ===
Mnemonic is an in-memory, in-place structured data persistence library for 
Java-based applications and frameworks. It provides unified interfaces for data 
manipulation on heterogeneous block/byte-addressable devices, such as DRAM, 
persistent memory, NVMe, SSD, and cloud network devices.

The design motivation for this project is to create a non-volatile programming 
paradigm for in-memory data object persistence, in-memory data object caching, 
and JNI-less IPC.
Mnemonic simplifies the usage of data object caching, persistence, and JNI-less 
IPC for massive object-oriented structured datasets.

Mnemonic defines Non-Volatile Java objects that store data fields in persistent 
memory and storage. At program runtime, only methods and volatile fields are 
instantiated in the Java heap; Non-Volatile data fields are accessed directly 
via GET/SET operations to and from persistent memory and storage. Mnemonic 
avoids SerDes and significantly reduces the amount of garbage in the Java heap.
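
As a rough illustration of this access model, the sketch below keeps an 
object's data fields in a direct (off-heap) buffer and reads and writes them in 
place through getters and setters. It is not Mnemonic's API: the class name 
OffHeapPoint and its field layout are hypothetical, and a direct ByteBuffer 
merely stands in for persistent memory.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; this is not Mnemonic's API.
// The object on the Java heap holds just a buffer reference and an offset;
// the data fields themselves live in the off-heap region.
final class OffHeapPoint {
    private static final int X_OFFSET = 0;
    private static final int Y_OFFSET = Long.BYTES;
    static final int SIZE = 2 * Long.BYTES;

    private final ByteBuffer region; // stands in for persistent memory
    private final int base;          // start of this object's fields in the region

    OffHeapPoint(ByteBuffer region, int base) {
        this.region = region;
        this.base = base;
    }

    // GET/SET operate directly on the backing region, with no SerDes step.
    long getX() { return region.getLong(base + X_OFFSET); }
    void setX(long v) { region.putLong(base + X_OFFSET, v); }
    long getY() { return region.getLong(base + Y_OFFSET); }
    void setY(long v) { region.putLong(base + Y_OFFSET, v); }

    public static void main(String[] args) {
        ByteBuffer region = ByteBuffer.allocateDirect(2 * SIZE);
        OffHeapPoint a = new OffHeapPoint(region, 0);
        a.setX(3);
        a.setY(4);
        // A second handle over the same bytes sees the data without copying or decoding.
        OffHeapPoint view = new OffHeapPoint(region, 0);
        System.out.println(view.getX() + "," + view.getY());
    }
}
```

Because the handle carries no field data of its own, creating or discarding 
handles adds very little garbage to the Java heap.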

Major features of Mnemonic:
* Provides an abstract viewpoint for utilizing heterogeneous 
block/byte-addressable devices as a whole (e.g., DRAM, persistent memory, NVMe, 
SSD, HD, cloud network storage).
* Provides seamless support for object-oriented design and programming without 
the burden of transforming object data into a different form.
* Avoids object data serialization/de-serialization for data retrieval, 
caching, and storage.
* Reduces the consumption of on-heap memory, which in turn reduces and 
stabilizes Java Garbage Collection (GC) pauses for latency-sensitive 
applications.
* Overcomes current limitations of Java GC to manage much larger memory 
resources for massive dataset processing and computing.
* Supports migrating the data usage model from traditional NVMe/SSD/HD to 
non-volatile memory with ease.
* Uses a lazy loading mechanism to avoid unnecessary memory consumption when 
some data is not needed for computing immediately.
* Bypasses JNI calls for the interaction between a Java runtime application and 
its native code.
* Provides an allocation-aware auto-reclaim mechanism to prevent external 
memory resource leaks.
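
The lazy loading feature listed above can be pictured with a minimal sketch 
like the following. This is an assumption-laden illustration rather than 
Mnemonic code: the class LazyField is hypothetical, the loader stands in for 
whatever reads the data from a device, and the sketch is not thread-safe.

```java
import java.util.function.Supplier;

// Hypothetical illustration only; this is not Mnemonic's API.
// The value is fetched by the loader the first time it is requested
// and reused afterwards, so unused data never occupies memory.
final class LazyField<T> {
    private final Supplier<T> loader; // stands in for a read from storage
    private T value;
    private boolean loaded;

    LazyField(Supplier<T> loader) {
        this.loader = loader;
    }

    // Loads on first access only; not thread-safe in this sketch.
    T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }

    boolean isLoaded() {
        return loaded;
    }

    public static void main(String[] args) {
        LazyField<String> field = new LazyField<>(() -> "payload");
        System.out.println(field.isLoaded()); // nothing has been loaded yet
        System.out.println(field.get());
        System.out.println(field.isLoaded());
    }
}
```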


=== Background ===
Big Data and Cloud applications increasingly require both high throughput and 
low latency processing. Java-based applications targeting the Big Data and 
Cloud space should be tuned for better throughput, lower latency, and more 
predictable response time.
Several issues typically limit the performance and scalability of Big Data 
applications:

1) The Complexity of Data Transformation/Organization: In most cases, during 
data processing, applications use their own complicated caching mechanisms to 
serialize and deserialize data objects, spill them to different storage, and 
evict large amounts of data. Some data objects contain complex values and 
structures that make data organization much more difficult. Loading datasets 
from storage and then parsing/decoding them consumes significant system 
resources and computation power.

2) Lack of Caching, Burst Temporary Object Creation/Destruction Causes Frequent 
Long GC Pauses: Big Data computing/syntax generates large amounts of temporary 
objects during processing (e.g., from lambdas, SerDes, and copying). This 
triggers frequent, long Java GC pauses to scan references, update reference 
lists, and blindly copy live objects from one memory location to another.

3) The Unpredictable GC Pause: Latency-sensitive applications, such as 
databases, search engines, web queries, and real-time/streaming computing, 
require latency/request-response times to be kept under control, but current 
Java GC does not provide predictable GC activity when managing large on-heap 
memory.

4) High JNI Invocation Cost: JNI calls are expensive, yet high-performance 
applications often rely on native code to improve performance. JNI calls need 
to convert Java objects into something that C/C++ can understand. In addition, 
some comprehensive native code needs to communicate with the Java-based 
application, which causes frequent JNI calls along with stack marshalling.

The Mnemonic project provides a solution that addresses the above issues and 
performance bottlenecks for structured data processing and computing. It also 
simplifies massive data handling with much reduced GC activity.

=== Rationale ===
There is a strong need for a cohesive, easy-to-use non-volatile programming 
model for unified management and allocation of heterogeneous memory resources. 
The Mnemonic project provides a reusable and flexible framework that can 
accommodate other special types of memory/block devices for better performance 
without changing client code.

Most of the Big Data frameworks (e.g., Apache Spark™, Apache™ Hadoop®, Apache 
HBase™, Apache Flink™, Apache Kafka™, etc.) have their own complicated memory 
management modules for caching and checkpointing. Many of these approaches 
increase complexity and make the code error-prone to maintain.

We have observed heavy overheads during data parsing, SerDes, pack/unpack, and 
encode/decode operations for data loading, storage, checkpointing, caching, 
marshaling, and transfer. Mnemonic provides a generic in-memory persistent 
object model to address those overheads for better performance. In addition, it 
manages its in-memory persistent objects and blocks the way GC does, which 
means their underlying memory resources can be reclaimed without being 
explicitly released.
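
One common way to get GC-style reclamation of external resources on JDK 8 is 
to pair each holder object with a phantom reference and run a cleanup action 
once the collector reports the holder unreachable. The sketch below shows that 
general technique under stated assumptions; it is not Mnemonic's reclaim 
implementation, and the class ReclaimTracker and its methods are hypothetical.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; this is not Mnemonic's reclaim implementation.
final class ReclaimTracker {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // Keeps each phantom reference reachable and maps it to its cleanup action.
    private final Map<Reference<?>, Runnable> cleanups = new HashMap<>();

    // Registers a cleanup to run after `holder` becomes unreachable.
    // The cleanup must not capture `holder`, or it would never be collected.
    Reference<?> track(Object holder, Runnable cleanup) {
        Reference<?> ref = new PhantomReference<>(holder, queue);
        cleanups.put(ref, cleanup);
        return ref;
    }

    // Runs the cleanup for every reference already enqueued; returns how many ran.
    int drain() {
        int released = 0;
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            Runnable cleanup = cleanups.remove(ref);
            if (cleanup != null) {
                cleanup.run();
                released++;
            }
        }
        return released;
    }

    public static void main(String[] args) {
        ReclaimTracker tracker = new ReclaimTracker();
        Object holder = new Object();
        tracker.track(holder, () -> System.out.println("external block released"));
        holder = null;
        System.gc(); // only a hint; collection timing is not guaranteed
        System.out.println("released now: " + tracker.drain());
    }
}
```

Client code never calls a release method here; the cleanup runs on a later 
drain() once the collector has enqueued the reference, so the timing depends 
on GC activity.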

Some existing Big Data applications suffer from poor Java GC behavior when 
they process their massive unstructured datasets. This behavior either causes 
very long stop-the-world GC pauses or takes significant system resources during 
computing, which impacts throughput and incurs significant perceivable pauses 
for interactive analytics.

More and more compute-intensive Big Data applications rely on JNI to offload 
their computing tasks to native code, which dramatically increases the cost of 
JNI invocation and IPC. Mnemonic provides a mechanism to communicate with 
native code directly through in-place object data updates, avoiding complex 
object data type conversion and stack marshaling. In addition, this project can 
be extended to support various locks for threads between Java code and native 
code.

=== Initial Goals ===
Our initial goal is to bring Mnemonic into the ASF and transition the 
engineering and governance processes to the "Apache Way."  We would like to 
enrich a collaborative development model that closely aligns with current and 
future industry memory and storage technologies.

Another important goal is to encourage efforts to integrate the non-volatile 
programming model into data-centric processing/analytics 
frameworks/applications (e.g., Apache Spark™, Apache HBase™, Apache Flink™, 
Apache™ Hadoop®, Apache Cassandra™, etc.).

We expect the Mnemonic project to continuously develop new functionality in 
an open, community-driven way. We envision accelerating innovation under ASF 
governance in order to meet the requirements of a wide variety of use cases for 
in-memory non-volatile and volatile data caching programming.

=== Current Status ===
The Mnemonic project is available in Intel’s internal repository and managed by 
its designers and developers. It is also temporarily hosted on GitHub for 
general viewing: https://github.com/NonVolatileComputing/Mnemonic.git 

We have integrated this project with Apache Spark™ 1.5.0 and measured a 2X 
performance improvement for the Spark™ MLlib k-means workload. In our 
experiments we also observed the expected benefits of removing SerDes, with 
total GC pause time reduced by 40%.

==== Meritocracy ====
Mnemonic was originally created by Gang (Gary) Wang and Yanping Wang in early 
2015. The initial committers are the current Mnemonic R&D team members from the 
US, China, and India in the Big Data Technologies Group at Intel. This group 
will form a base for a much broader community to collaborate on this code base.

We intend to radically expand the initial developer and user community by 
running the project in accordance with the "Apache Way." Users and new 
contributors will be treated with respect and welcomed. By participating in the 
community and providing quality patches/support that move the project forward, 
they will earn merit. They also will be encouraged to provide non-code 
contributions (documentation, events, community management, etc.) and will gain 
merit for doing so. Those with a proven support and quality track record will 
be encouraged to become committers.

==== Community ====
If Mnemonic is accepted for incubation, the primary initial goal is to 
transition the core community towards embracing the Apache Way of project 
governance. We would solicit major existing contributors to become committers 
on the project from the start.

==== Core Developers ====
Mnemonic core developers are all skilled software developers and system 
performance engineers at Intel Corp with years of experience in their fields. 
They have contributed much code to Apache projects. PMC members and 
experienced committers from Apache Spark™, Apache HBase™, Apache Phoenix™, and 
Apache™ Hadoop® have been working with us on this project's open source efforts.

=== Alignment ===
The initial code base targets data-centric processing and analysis in 
general. Mnemonic has been building connections and integrations with Apache 
projects and other projects.

We believe Mnemonic will evolve into a promising project for real-time 
processing, in-memory streaming analytics and more, along with current and 
future new server platforms with persistent memory as base storage devices.

=== Known Risks ===
==== Orphaned products ====
Intel’s Big Data Technologies Group is actively working with the community on 
integrating this project into Big Data frameworks and applications. We are 
continuously adding new concepts and code to this project and supporting new 
use cases and features for the Apache Big Data ecosystem.

The project contributors are leading contributors of Hadoop-based technologies 
and have long standing in the Hadoop community. As we are addressing major 
Big Data processing performance issues, there is minimal risk of this work 
becoming non-strategic and unsupported. 

Our contributors are confident that a larger community will be formed within 
the project in a relatively short period of time.

==== Inexperience with Open Source ====
This project has long-standing, experienced mentors and interested contributors 
from Apache Spark™, Apache HBase™, Apache Phoenix™, and Apache™ Hadoop® to help 
us move through the open source process. We are actively working with 
experienced Apache community PMC members and committers to improve our project 
and further testing.

==== Homogeneous Developers ====
All initial committers and interested contributors are employed at Intel. As 
this is an infrastructure memory project, a wide range of Apache projects are 
interested in an innovative memory project that fits large persistent memory 
and storage devices. Various Apache projects such as Apache Spark™, Apache 
HBase™, Apache Phoenix™, Apache Flink™, Apache Cassandra™, etc. can take good 
advantage of this project to overcome serialization/de-serialization, Java GC, 
and caching issues. We expect a wide range of interest to be generated after 
we open source this project to Apache.

==== Reliance on Salaried Developers ====
All developers are paid by their employers to contribute to this project. We 
welcome all others to contribute to this project after it is open sourced.

==== Relationships with Other Apache Products ====
Mnemonic can be integrated into various Big Data and Cloud frameworks and 
applications.
We are currently working on several Apache projects with Mnemonic:

For Apache Spark™ we integrated Mnemonic to improve: 
a) Local checkpoints
b) Memory management for caching
c) Persistent memory datasets input
d) Non-Volatile RDD operations
The best use case for Apache Spark™ computing is when the input data is stored 
in the form of Mnemonic native storage to avoid caching its row data for 
iterative processing. Moreover, Spark applications can leverage Mnemonic to 
perform data transformations in persistent or non-persistent memory without 
SerDes.

For Apache™ Hadoop®, we are integrating HDFS Caching with Mnemonic instead of 
mmap. This will take advantage of persistent memory related features. We also 
plan to evaluate integrating the NameNode EditLog and FSImage persistent data 
into the Mnemonic persistent memory area.

For Apache HBase™, we are using Mnemonic for BucketCache and evaluating 
performance improvements.

We expect Mnemonic to be further developed and integrated into many Apache 
Big Data projects and beyond, to enhance memory management solutions for much 
improved performance and reliability.

==== An Excessive Fascination with the Apache Brand ====
While we expect the Apache brand to help attract more contributors, our 
interest in starting this project is based on the factors mentioned in the 
Rationale section.

We would like Mnemonic to become an Apache project to further foster a healthy 
community of contributors and consumers in Big Data technology R&D areas. Since 
Mnemonic can directly benefit many Apache projects and solve major performance 
problems, we expect the Apache Software Foundation to increase interaction with 
the larger community as well.

=== Documentation ===
The documentation is currently available at Intel and will be posted under: 
https://mnemonic.incubator.apache.org/docs 

=== Initial Source ===
Initial source code is temporarily hosted on GitHub for general viewing:
https://github.com/NonVolatileComputing/Mnemonic.git 
It will be moved to Apache http://git.apache.org/ after it becomes a podling.

The initial source is written in Java (88%), mixed with JNI C code (11%) and 
shell script (1%) for the underlying native allocation libraries.

=== Source and Intellectual Property Submission Plan ===
As soon as Mnemonic is approved to join the Incubator, the source code will be 
transitioned via the Software Grant Agreement onto ASF infrastructure and in 
turn made available under the Apache License, version 2.0.

=== External Dependencies ===
The required external dependencies are all under the Apache license or other 
compatible licenses.
Note: The runtime dependent licenses of Mnemonic are all declared as Apache 
2.0; the GNU licensed components are used for Mnemonic build and deployment. 
The Mnemonic JNI libraries are built using the GNU tools.

maven and its plugins (http://maven.apache.org/ ) [Apache 2.0]
JDK8 or OpenJDK 8 (http://java.com/) [Oracle or Openjdk JDK License]  
Nvml (http://pmem.io ) [optional] [Open Source]
PMalloc (https://github.com/bigdata-memory/pmalloc ) [optional] [Apache 2.0]

Build and test dependencies:
org.testng.testng v6.8.17  (http://testng.org) [Apache 2.0]
org.flowcomputing.commons.commons-resgc v0.8.7 [Apache 2.0]
org.flowcomputing.commons.commons-primitives v.0.6.0 [Apache 2.0]
com.squareup.javapoet v1.3.1-SNAPSHOT [Apache 2.0]
JDK8 or OpenJDK 8 (http://java.com/) [Oracle or Openjdk JDK License]

=== Cryptography ===
Project Mnemonic does not use cryptography itself; however, Hadoop projects use 
standard APIs and tools for SSH and SSL communication where necessary.

=== Required Resources ===
We request that the following resources be created for the project to use:

==== Mailing lists ====
priv...@mnemonic.incubator.apache.org (moderated subscriptions)
comm...@mnemonic.incubator.apache.org
d...@mnemonic.incubator.apache.org

==== Git repository ====
https://github.com/apache/incubator-mnemonic

==== Documentation ====
https://mnemonic.incubator.apache.org/docs/

==== JIRA instance ====
https://issues.apache.org/jira/browse/mnemonic

=== Initial Committers ===
* Gang (Gary) Wang (gang1 dot wang at intel dot com)
* Yanping Wang (yanping dot wang at intel dot com)
* Uma Maheswara Rao G (umamahesh at apache dot org)  
* Kai Zheng (drankye at apache dot org) 
* Rakesh Radhakrishnan Potty  (rakeshr at apache dot org) 
* Sean Zhong  (seanzhong at apache dot org) 
* Henry Saputra  (hsaputra at apache dot org) 
* Hao Cheng (hao dot cheng at intel dot com) 

=== Affiliations ===
* Gang (Gary) Wang, Intel 
* Yanping Wang, Intel 
* Uma Maheswara Rao G, Intel 
* Kai Zheng, Intel 
* Rakesh Radhakrishnan Potty, Intel 
* Sean Zhong, Intel 
* Henry Saputra, Independent 
* Hao Cheng, Intel 

=== Sponsors ===
==== Champion ====
Patrick Hunt

==== Nominated Mentors ====
* Patrick Hunt <phunt at apache dot org> - Apache IPMC member
* Andrew Purtell <apurtell at apache dot org> - Apache IPMC member 
* James Taylor <jamestaylor at apache dot org> - Apache IPMC member 
* Henry Saputra <hsaputra at apache dot org> - Apache IPMC member 

==== Sponsoring Entity ====
Apache Incubator PMC

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org
