Reynold,

That is great to hear. I am definitely interested in how 2. (the memory layout) is being implemented and how it will be exposed in C++. An important aspect of leveraging the off-heap memory is how the data is organized, and how easily it can be accessed from the C++ side. For example, how would you store a multi-dimensional array of doubles, and how would you specify that layout? Perhaps Avro or Protobuf could be used for storing complex nested structures, although making that zero-copy could be a challenge. Regardless of how the internals lay the data out in memory, the important requirements are:

a) ensuring zero copy
b) providing a friendly API on the C++ side so folks don't have to deal with raw bytes, serialization, and JNI
c) the ability to specify a complex (multi-type and nested) structure via a schema for memory storage (compile-time generated would be sufficient, but run-time dynamic would be extremely flexible)

Perhaps a simple way to accomplish this would be to enhance dataframes to have a C++ API that can access the off-heap memory in a clean way from Spark (in process and with zero copy).

Also, is this work being done on a branch I could look into further and try out?

thanks,

-paul

On Sat, Aug 29, 2015 at 9:40 PM, Reynold Xin <r...@databricks.com> wrote:

> Supporting non-JVM code without memory copying and serialization is
> actually one of the motivations behind Tungsten. We didn't talk much
> about it since it is not end-user-facing and it is still too early.
> There are a few challenges still:
>
> 1. Spark cannot run entirely in off-heap mode (by entirely here I'm
> referring to all the data-plane memory, not control-plane such as RPCs
> since those don't matter much). There is nothing fundamental. It just
> takes a while to make sure all code paths allocate/free memory using
> the proper allocators.
>
> 2. The memory layout of data is still in flux, since we are only 4
> months into Tungsten. It will change pretty frequently for the
> foreseeable future, and as a result, the C++ side of things will have
> to change as well.
>
> On Sat, Aug 29, 2015 at 12:29 PM, Timothy Chen <tnac...@gmail.com> wrote:
>
>> I would also like to see data shared off-heap to a 3rd-party C++
>> library with JNI. I think the complications would be how to manage
>> the memory and make sure the 3rd-party libraries also adhere to the
>> access contracts.
>>
>> Tim
>>
>> On Sat, Aug 29, 2015 at 12:17 PM, Paul Weiss <paulweiss....@gmail.com> wrote:
>> > Hi,
>> >
>> > Would the benefits of Project Tungsten be available to non-JVM
>> > programs accessing the off-heap memory directly? Spark using
>> > dataframes with the Tungsten improvements will definitely help
>> > analytics within the JVM world, but accessing outside 3rd-party C++
>> > libraries is a challenge, especially when trying to do it with zero
>> > copy.
>> >
>> > Ideally the off-heap memory would be accessible to a non-JVM program
>> > and be invoked in process using JNI for each partition. The
>> > alternatives to this involve the additional cost of starting another
>> > process if using pipes, as well as an additional copy of all the
>> > data.
>> >
>> > In addition to read-only non-JVM access in process, would there be a
>> > way to share the dataframe that is in memory out of process and
>> > across Spark contexts? This way an expensive, complicated initial
>> > build-up of a dataframe would not have to be replicated, and the
>> > penalty of the startup costs would not have to be paid on failure.
>> >
>> > thanks,
>> >
>> > -paul
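
To make the multi-dimensional array question above concrete, here is a minimal C++ sketch of what requirements a) and b) might look like for a 2-D array of doubles. It assumes the data sits in one contiguous, row-major block whose address has already been handed to native code (with JNI, one possible route is a direct ByteBuffer and `GetDirectBufferAddress`). `MatrixView` and `row_sum` are hypothetical names made up for this illustration; they are not part of Spark or Tungsten, and the real Tungsten layout may look nothing like this.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical non-owning view over a contiguous, row-major block of doubles.
// In the scenario discussed in this thread, `data` would point at off-heap
// memory owned by the JVM side; this class never allocates, copies, or frees.
class MatrixView {
public:
    MatrixView(const double* data, std::size_t rows, std::size_t cols)
        : data_(data), rows_(rows), cols_(cols) {}

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

    // Element access is plain index arithmetic on the shared buffer.
    double at(std::size_t r, std::size_t c) const {
        assert(r < rows_ && c < cols_);  // bounds check in debug builds
        return data_[r * cols_ + c];
    }

    // Expose the underlying pointer so callers can confirm nothing was copied.
    const double* raw() const { return data_; }

private:
    const double* data_;
    std::size_t rows_;
    std::size_t cols_;
};

// Example consumer: sums one row through the view without touching raw bytes.
inline double row_sum(const MatrixView& m, std::size_t r) {
    double total = 0.0;
    for (std::size_t c = 0; c < m.cols(); ++c) {
        total += m.at(r, c);
    }
    return total;
}
```

A caller would wrap an existing buffer, for example `MatrixView m(ptr, 2, 3);`, and then use `m.at(r, c)` or `row_sum(m, r)`. The shape (rows, cols) is passed in by hand here; requirement c) would replace that with something derived from a schema, which this sketch does not attempt.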