Reynold,

That is great to hear.  I am definitely interested in how item 2 (the
memory layout) is being implemented and how it will be exposed in C++.  One
important aspect of leveraging the off-heap memory is how the data is
organized, as well as being able to access it easily from the C++ side.  For
example, how would you store a multidimensional array of doubles, and how
would you specify that?  Perhaps Avro or Protobuf could be used for storing
complex nested structures, although making that zero-copy could be a
challenge.  Regardless of how the internals lay the data out in memory, the
important requirements are:

a) ensuring zero copy
b) providing a friendly API on the C++ side so folks don't have to deal
with raw bytes, serialization, and JNI
c) the ability to specify a complex (multi-type and nested) structure via a
schema for memory storage (compile-time generation would be sufficient, but
dynamic run-time specification would be extremely flexible)
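
As a rough illustration of (a) and (b) for the array-of-doubles case, a
non-owning view over the off-heap buffer could look something like the
sketch below.  Everything in it is hypothetical: DoubleMatrixView and
view_from_address are made-up names, not an existing Spark or Tungsten API,
and the row-major layout is an assumption, since the real layout is still in
flux.  On the JNI side the address could come from something like
GetDirectBufferAddress or a raw long passed in from the JVM.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical non-owning view over a row-major 2-D array of doubles.
// It never copies or frees the buffer; the JVM side is assumed to keep
// the off-heap memory alive for as long as the view is in use.
class DoubleMatrixView {
public:
    DoubleMatrixView(const double* data, std::size_t rows, std::size_t cols)
        : data_(data), rows_(rows), cols_(cols) {}

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

    // Element access computed from the assumed row-major layout.
    double at(std::size_t r, std::size_t c) const {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }

    // Example of a computation that reads the buffer in place.
    double sum() const {
        double total = 0.0;
        for (std::size_t i = 0; i < rows_ * cols_; ++i) {
            total += data_[i];
        }
        return total;
    }

private:
    const double* data_;
    std::size_t rows_;
    std::size_t cols_;
};

// Hypothetical helper: wrap a raw address (e.g. a jlong handed over via
// JNI) in a typed view so callers never touch raw bytes themselves.
inline DoubleMatrixView view_from_address(std::uintptr_t address,
                                          std::size_t rows,
                                          std::size_t cols) {
    return DoubleMatrixView(reinterpret_cast<const double*>(address),
                            rows, cols);
}
```

A schema-driven version of (c) would presumably generate such views for
nested types rather than having them written by hand.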

Perhaps a simple way to accomplish this would be to enhance DataFrames to
have a C++ API that can access the off-heap memory in a clean way from Spark
(in process and with zero copy).

Also, is this work being done on a branch I could look into further and try
out?

thanks,
-paul



On Sat, Aug 29, 2015 at 9:40 PM, Reynold Xin <r...@databricks.com> wrote:

> Supporting non-JVM code without memory copying and serialization is
> actually one of the motivations behind Tungsten. We didn't talk much about
> it since it is not end-user-facing and it is still too early. There are a
> few challenges still:
>
> 1. Spark cannot run entirely in off-heap mode (by entirely here I'm
> referring to all the data-plane memory, not control-plane such as RPCs
> since those don't matter much). There is nothing fundamental. It just takes
> a while to make sure all code paths allocate/free memory using the proper
> allocators.
>
> 2. The memory layout of data is still in flux, since we are only 4 months
> into Tungsten. It will change pretty frequently for the foreseeable
> future, and as a result, the C++ side of things will have to change as well.
>
>
>
> On Sat, Aug 29, 2015 at 12:29 PM, Timothy Chen <tnac...@gmail.com> wrote:
>
>> I would also like to see data shared off-heap with a 3rd party C++
>> library via JNI. I think the complications would be how to manage
>> this memory and how to make sure the 3rd party libraries adhere to
>> the access contracts.
>>
>> Tim
>>
>> On Sat, Aug 29, 2015 at 12:17 PM, Paul Weiss <paulweiss....@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > Would the benefits of Project Tungsten be available to non-JVM programs
>> > accessing the off-heap memory directly?  Spark using DataFrames with the
>> > Tungsten improvements will definitely help analytics within the JVM
>> > world, but accessing outside 3rd party C++ libraries is a challenge,
>> > especially when trying to do it with zero copy.
>> >
>> > Ideally the off-heap memory would be accessible to a non-JVM program,
>> > which would be invoked in process using JNI for each partition.  The
>> > alternatives to this involve the additional cost of starting another
>> > process if using pipes, as well as an additional copy of all the data.
>> >
>> > In addition to read-only non-JVM access in process, would there be a way
>> > to share the in-memory DataFrame out of process and across Spark
>> > contexts?  This way an expensive, complicated initial build-up of a
>> > DataFrame would not have to be replicated, and one would not have to pay
>> > the penalty of the startup costs on failure.
>> >
>> > thanks,
>> >
>> > -paul
>> >
>>
>
