[ https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14733169#comment-14733169 ]
Paul Wais commented on SPARK-10399:
-----------------------------------

Image processing is a great use case. I've deployed a JNA-based image processing Spark app on a cluster of ~200 cores, and one of the pain points was memory management. That solution copied images (via memcpy) since there was no time to implement a better one. Spark would have the JVM use essentially all available memory and would not account for native off-heap usage, so the native code would typically trigger an OOM after a while. Tuning to curtail OOMs was hard. Direct access to off-heap memory would have helped a ton here.

A similar use case is large-scale processing of text data (e.g. web pages, tweets, blog posts, etc.). java.lang.String is not very portable (noted below), and direct access to string buffers (especially if they're in proper UTF format) is very desirable. Direct access to UTF-8 could also benefit Python support.

A major advantage of *in-process* native code (as opposed to, say, using `RDD.pipe()`) is that exceptions can be propagated, logged, and handled by Spark. This feature alone IMO warrants the software cost of in-process native code. Unfortunately, properly handling JNI-related exceptions and other nuances is tricky and a major pain. I recommend Djinni, which helps a ton here (and is used in consumer mobile apps): https://bit.ly/djinnitalk  Djinni also recently added a type-marshaling feature that enables zero-copy type translation (the default type marshaling does deep copying).

Some related issues:
* Spark's BlockManager makes use of on-heap byte buffers for e.g. compression. On-heap byte arrays are *not* necessarily zero-copy: the JVM is allowed to copy data in a JNI `GetPrimitiveArrayCritical()` call; for more discussion, see https://github.com/dropbox/djinni/issues/54 . A complete solution to this JIRA may necessitate some changes to Spark's core serializer API.
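The copy-vs-view distinction above can be sketched with plain JDK types (a minimal illustration only; the class and variable names here are hypothetical, not Spark or Djinni APIs):

```java
import java.nio.ByteBuffer;

public class ViewVsCopy {
    public static void main(String[] args) {
        // Hypothetical stand-in for an off-heap buffer of the kind Spark's
        // unsafe rows live in; allocateDirect places it outside the Java heap.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(16);
        offHeap.putLong(0, 42L);

        // Deep copy: the bytes are copied into an on-heap array and lose
        // any connection to the source memory.
        byte[] copy = new byte[8];
        offHeap.duplicate().get(copy);

        // Zero-copy "view": duplicate() shares the same off-heap storage,
        // so no bytes move and readers see writes to the underlying memory.
        ByteBuffer view = offHeap.duplicate();

        offHeap.putLong(0, 99L); // mutate the underlying memory

        System.out.println(ByteBuffer.wrap(copy).getLong()); // 42 (stale copy)
        System.out.println(view.getLong(0));                 // 99 (live view)
    }
}
```

A direct-buffer view like this is what one would want to hand to native code (which can reach the storage via JNI's `GetDirectBufferAddress` with no copy), whereas the on-heap array behaves like a deep-copying accessor.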
(In particular, it might be nice to have a code path that avoids any temporary on-heap buffers.)
* While Spark's Unsafe UTF-8 strings are likely portable, java.lang.String is *not* particularly portable to C++: https://github.com/dropbox/djinni/blob/master/support-lib/jni/djinni_support.cpp#L431  I've microbenchmarked that code and found it to be a major overhead. A solution to this JIRA might need some subtle API changes to encourage/help users avoid Java Strings.
* Shipping and running a native library on a cluster is tricky. Containers / virtualization (e.g. Docker) can help ensure the availability of dependencies, but sometimes those technologies aren't available. One can compile all dependencies (i.e. including libc++) into a single dynamic library, but that takes some special build set-up. On-executor dynamic code compilation (e.g. through Cling: https://root.cern.ch/cling ) would be desirable but is probably beyond the scope of this JIRA.

I'm hoping to contribute a change to Djinni soon ( https://github.com/dropbox/djinni/compare/master...pwais:pwais_linux_build ) that will address the common use case where one simply wants to ship and run (on Spark) an app jar that contains a native library (and use system dependencies).

Are there any followers of this JIRA who have specific API requests? My take on this issue is that there are a few main components:
* Ensuring the accessibility of UnsafeRow to user code (which would then invoke native code). (It's not clear to me that this is already part of Spark 1.5; DataFrames simply interop with Row.)
* Creating a byte buffer 'view', similar to UTF8String, for byte-buffer row attributes. `UnsafeRow.getBytes()` currently deep-copies (into an on-heap array), and we'd want a 'view' of the bytes instead.
* Defining and implementing core type mappers, e.g. Spark UTF8String <-> std::string. It might be nice for "Spark C++" types to be simple arrays (e.g. (pointer, length, nullable deleter)) with adapters to standard types (e.g. std::string and std::vector). The deleter part is important if native code is allowed to consume (and gain ownership of) data; a full solution needs a 'move' API component.

With those pieces in place (and especially if any "Spark C++ support code" is header-only), it wouldn't be too hard for users to build & package Spark jars with native libs as they please. As mentioned above, I'd recommend Djinni as a facilitator for this project (and for users who want to write & deploy native libs).

There are some other misc issues:
* Is Unsafe memory always aligned? If not, how can we flag this to native code?
* As mentioned above, can we modify BlockManager to have a path that skips any on-heap buffers?
* If native code *does* need to use substantial memory, how can it communicate that need to Spark? Can native code interop with how Spark tracks Unsafe-allocated memory?

> Off Heap Memory Access for non-JVM libraries (C++)
> --------------------------------------------------
>
>          Key: SPARK-10399
>          URL: https://issues.apache.org/jira/browse/SPARK-10399
>      Project: Spark
>   Issue Type: Improvement
>   Components: Spark Core
>     Reporter: Paul Weiss
>
> *Summary*
> Provide direct off-heap memory access to an external non-JVM program such as
> a C++ library within the running Spark JVM/executor. As Spark moves to
> storing all data in off-heap memory, it makes sense to provide access points
> to the memory for non-JVM programs.
> ----
> *Assumptions*
> * Zero copies will be made during the call into the non-JVM library
> * Access into non-JVM libraries will be accomplished via JNI
> * A generic JNI interface will be created so that developers will not need to
>   deal with raw JNI calls
> * C++ will be the initial target non-JVM use case
> * Memory management will remain on the JVM/Spark side
> * The API from C++ will be similar to DataFrames as much as feasible and NOT
>   require expert knowledge of JNI
> * Data organization and layout will support complex (multi-type, nested,
>   etc.) types
>
> ----
> *Design*
> * Initially, Spark JVM -> non-JVM will be supported
> * Creating an embedded JVM with Spark running from a non-JVM program is
>   initially out of scope
>
> ----
> *Technical*
> * GetDirectBufferAddress is the JNI call used to access a byte buffer without
>   a copy

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
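A concrete footnote on the String-portability point in the comment above: JNI's `GetStringUTFChars()` (and `DataOutputStream.writeUTF()`) emit the JVM's *modified* UTF-8, not standard UTF-8. For example, U+0000 is encoded as the overlong two-byte pair 0xC0 0x80, which a standards-conforming C++ UTF-8 decoder must reject. A minimal sketch using only the JDK (the class name is illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        String s = "\u0000"; // a single NUL character

        // Standard UTF-8 -- what C++ libraries (and Spark's UTF8String) expect:
        byte[] standard = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(standard.length); // 1 (the single byte 0x00)

        // Modified UTF-8 -- what JNI's GetStringUTFChars() hands to native
        // code; writeUTF() uses the same encoding behind a 2-byte length prefix.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        byte[] modified = bos.toByteArray();
        // NUL becomes the overlong pair 0xC0 0x80, invalid in standard UTF-8:
        System.out.println((modified[2] & 0xFF) + " " + (modified[3] & 0xFF)); // 192 128
    }
}
```

This mismatch (plus the UTF-16-to-UTF-8 transcoding cost) is part of why handing native code buffers that are already in proper UTF-8, as proposed above, is attractive.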