[ https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14733169#comment-14733169 ]
Paul Wais commented on SPARK-10399:
-----------------------------------

Image processing is a great use case. I've deployed a JNA-based image processing Spark app on a cluster of ~200 cores, and one of the pain points was memory management. That solution copied images (via memcpy) since there was no time to implement a better one. Spark would have the JVM use essentially all available memory and would not account for native off-heap usage, so the native code would typically trigger an OOM after a while. Tuning to curtail OOMs was hard. Direct access to off-heap memory would have helped a ton here.

A similar use case is large-scale processing of text data (e.g. web pages, tweets, blog posts, etc.). java.lang.String is not very portable (noted below), and direct access to string buffers (especially if they're in proper UTF format) is very desirable. Direct access to UTF-8 could also benefit Python support.

A major advantage of *in-process* native code (as opposed to, say, using `RDD.pipe()`) is that exceptions can be propagated, logged, and handled by Spark. This feature alone IMO warrants the software cost of in-process native code. Unfortunately, properly handling JNI-related exceptions and other nuances is tricky and a major pain. I recommend Djinni, which helps a ton here (and is used in consumer mobile apps): https://bit.ly/djinnitalk  Djinni also recently added a type-marshaling feature that enables zero-copy type translation (the default type marshaling does deep copying).

Some related issues:
* Spark's BlockManager makes use of on-heap byte buffers for e.g. compression. On-heap byte arrays are *not* necessarily zero-copy: the JVM is allowed to copy data in a JNI `GetPrimitiveArrayCritical()` call; for more discussion, see https://github.com/dropbox/djinni/issues/54 . A complete solution to this JIRA may necessitate some changes to Spark's core serializer API.
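The copy-vs-view distinction above can be sketched with plain JDK types (a minimal illustration only; the class and variable names here are hypothetical, not Spark or Djinni APIs):

```java
import java.nio.ByteBuffer;

public class ViewVsCopy {
    public static void main(String[] args) {
        // Hypothetical stand-in for an off-heap buffer of the kind Spark's
        // unsafe rows live in; allocateDirect places it outside the Java heap.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(16);
        offHeap.putLong(0, 42L);

        // Deep copy: the bytes are copied into an on-heap array and lose
        // any connection to the source memory.
        byte[] copy = new byte[8];
        offHeap.duplicate().get(copy);

        // Zero-copy "view": duplicate() shares the same off-heap storage,
        // so no bytes move and readers see writes to the underlying memory.
        ByteBuffer view = offHeap.duplicate();

        offHeap.putLong(0, 99L); // mutate the underlying memory

        System.out.println(ByteBuffer.wrap(copy).getLong()); // 42 (stale copy)
        System.out.println(view.getLong(0));                 // 99 (live view)
    }
}
```

A direct-buffer view like this is what one would want to hand to native code (which can reach the storage via JNI's `GetDirectBufferAddress` with no copy), whereas the on-heap array behaves like a deep-copying accessor.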
(In particular, it might be nice to have a code path that avoids any temporary on-heap buffers.)
* While Spark's Unsafe UTF-8 strings are likely portable, java.lang.String is *not* particularly portable to C++: https://github.com/dropbox/djinni/blob/master/support-lib/jni/djinni_support.cpp#L431  I've microbenchmarked that code and found it to be a major overhead. A solution to this JIRA might need some subtle API changes to encourage/help users avoid Java Strings.
* Shipping and running a native library on a cluster is tricky. Containers / virtualization (e.g. Docker) can help ensure the availability of dependencies, but sometimes those technologies aren't available. One can compile all dependencies (i.e. including libc++) into a single dynamic library, but that takes some special build set-up. On-executor dynamic code compilation (e.g. through Cling: https://root.cern.ch/cling ) would be desirable but is probably beyond the scope of this JIRA.

I'm hoping to contribute a change to Djinni soon ( https://github.com/dropbox/djinni/compare/master...pwais:pwais_linux_build ) that will address the common use case where one simply wants to ship and run (on Spark) an app jar that contains a native library (and use system dependencies).

Are there any followers of this JIRA who have specific API requests? My take on this issue is that there are a few main components:
* Ensuring the accessibility of UnsafeRow to user code (which would then invoke native code). (It's not clear to me that this is already part of Spark 1.5; DataFrames simply interop with Row.)
* Creating a byte buffer 'view', similar to UTF8String, for byte-buffer row attributes. `UnsafeRow.getBytes()` currently deep-copies (into an on-heap array), and we'd want a 'view' of the bytes instead.
* Defining and implementing core type mappers, e.g. Spark UTF8String <-> std::string. It might be nice for "Spark C++" types to be simple arrays (e.g. (pointer, length, nullable deleter)) with adapters to standard types (e.g. std::string and std::vector). The deleter part is important if native code is allowed to consume (and gain ownership of) data; a full solution needs a 'move' API component.

With those pieces in place (and especially if any "Spark C++ support code" is header-only), it wouldn't be too hard for users to build & package Spark jars with native libs as they please. As mentioned above, I'd recommend Djinni as a facilitator for this project (and for users who want to write & deploy native libs).

There are some other misc issues:
* Is Unsafe memory always aligned? If not, how can we flag this to native code?
* As mentioned above, can we modify BlockManager to have a path that skips any on-heap buffers?
* If native code *does* need to use substantial memory, how can it communicate that need to Spark? Can native code interop with how Spark tracks Unsafe-allocated memory?

> Off Heap Memory Access for non-JVM libraries (C++)
> --------------------------------------------------
>
>          Key: SPARK-10399
>          URL: https://issues.apache.org/jira/browse/SPARK-10399
>      Project: Spark
>   Issue Type: Improvement
>   Components: Spark Core
>     Reporter: Paul Weiss
>
> *Summary*
> Provide direct off-heap memory access to an external non-JVM program such as
> a C++ library within the running Spark JVM/executor. As Spark moves to
> storing all data in off-heap memory, it makes sense to provide access points
> to the memory for non-JVM programs.
> ----
> *Assumptions*
> * Zero copies will be made during the call into the non-JVM library
> * Access into non-JVM libraries will be accomplished via JNI
> * A generic JNI interface will be created so that developers will not need to
>   deal with raw JNI calls
> * C++ will be the initial target non-JVM use case
> * Memory management will remain on the JVM/Spark side
> * The API from C++ will be similar to DataFrames as much as feasible and NOT
>   require expert knowledge of JNI
> * Data organization and layout will support complex (multi-type, nested,
>   etc.) types
>
> ----
> *Design*
> * Initially, Spark JVM -> non-JVM will be supported
> * Creating an embedded JVM with Spark running from a non-JVM program is
>   initially out of scope
>
> ----
> *Technical*
> * GetDirectBufferAddress is the JNI call used to access a byte buffer without
>   a copy

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
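A concrete footnote on the String-portability point in the comment above: JNI's `GetStringUTFChars()` (and `DataOutputStream.writeUTF()`) emit the JVM's *modified* UTF-8, not standard UTF-8. For example, U+0000 is encoded as the overlong two-byte pair 0xC0 0x80, which a standards-conforming C++ UTF-8 decoder must reject. A minimal sketch using only the JDK (the class name is illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    public static void main(String[] args) throws IOException {
        String s = "\u0000"; // a single NUL character

        // Standard UTF-8 -- what C++ libraries (and Spark's UTF8String) expect:
        byte[] standard = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(standard.length); // 1 (the single byte 0x00)

        // Modified UTF-8 -- what JNI's GetStringUTFChars() hands to native
        // code; writeUTF() uses the same encoding behind a 2-byte length prefix.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeUTF(s);
        byte[] modified = bos.toByteArray();
        // NUL becomes the overlong pair 0xC0 0x80, invalid in standard UTF-8:
        System.out.println((modified[2] & 0xFF) + " " + (modified[3] & 0xFF)); // 192 128
    }
}
```

This mismatch (plus the UTF-16-to-UTF-8 transcoding cost) is part of why handing native code buffers that are already in proper UTF-8, as proposed above, is attractive.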