Thanks for all the great comments & discussion.  Let me expand a bit
on our use case, and then I'll combine responses to the various
questions.

In general, when we use spark, we have some really big RDDs that use
up a lot of memory (10s of GB per node) that are really our "core"
data sets.  We tend to start up a spark application, immediately load
all those data sets, and just leave them loaded for the lifetime of
that process.  We definitely create a lot of other RDDs along the way,
and lots of intermediate objects that we'd like to go through normal
garbage collection.  But those all require much less memory, maybe
1/10th of the big RDDs that we just keep around.  I know this is a bit
of a special case, but it seems like it probably isn't that different
from a lot of use cases.

Reynold Xin wrote:
> This is especially attractive if the application can read directly from a byte
> buffer without generic serialization (like Shark).

Interesting -- can you explain how this works in Shark?  Do you have
some general way of storing data in byte buffers that avoids
serialization?  Or do you mean that if the user is effectively
creating an RDD of ints, you create an RDD[ByteBuffer] and then read /
write the ints into the byte buffer yourself?
Sorry, I'm familiar with the basic idea of Shark but not the code at
all -- even a pointer to the code would be helpful.

Haoyuan Li wrote:
> One possible solution is that you can use
> Tachyon<https://github.com/amplab/tachyon>.

This is a good idea that I had overlooked.  There are two potential
issues I can think of with this approach, though:
1) I was under the impression that Tachyon is still not really tested
in production systems, and I need something a bit more mature.  Of
course, my changes wouldn't be thoroughly tested either, but somehow I
feel better about deploying my 5-line patch to a codebase I understand
than adding another entire system.  (This isn't a good reason to add
this to Spark in general, though -- it just might be a temporary patch
we deploy locally.)
2) I may have misunderstood Tachyon, but it seems there is a big
difference in data locality between these two approaches.  On a large
cluster, HDFS will spread the data all over the cluster, so any
particular piece of on-disk data will only live on a few machines.
When you start a Spark application that only uses a small subset of
the nodes, odds are the data you want is *not* on those nodes.  So
even if Tachyon caches data from HDFS into memory, it won't be on the
same nodes as the Spark application.  That means when the Spark
application reads data from the RDD, even though the data is in memory
on some node in the cluster, it still has to be read over the network
by the actual Spark worker assigned to the application.

Is my understanding correct?  I haven't measured the performance
difference at all, but it seems this would be much slower.


Mark Hamstra wrote:
> What worries me is the
> combinatoric explosion of different caching and persistence mechanisms.

Great points, and I have no hard numbers on the real advantages yet.
I agree we'd need to actually observe an improvement before adding yet
another option.  (I would really like some alternative to what I'm
doing now, but maybe Tachyon is all I need ...)

Reynold Xin wrote:
> Mark - you don't necessarily need to construct a separate storage level.
> One simple way to accomplish this is for the user application to pass Spark
> a DirectByteBuffer.

Hmm, that's true, I suppose, but I had originally thought of making it
another storage level, just for convenience & consistency.  Couldn't
you get rid of all the storage levels and just have the user apply
various transformations to an RDD?  E.g.

rdd.persist(MEMORY_ONLY_SER)

could be

rdd.map{x => serializer.serialize(x)}.persist(MEMORY_ONLY)


And I agree with all of Lijie's comments that using off-heap memory is
unsafe & difficult.  But I feel that isn't a reason to completely
disallow it, if there is a significant performance improvement.  It
would need to be clearly documented as an advanced feature with some
risks involved.
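
To make the idea concrete, here is roughly the shape of what I mean by
serializing into off-heap buffers (just a toy Java sketch of the
technique, not my actual patch -- the method names are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;

public class OffHeapSketch {

    // Serialize an object and copy the bytes into a direct (off-heap)
    // byte buffer.  Because the bytes live outside the Java heap, the
    // GC never has to scan or copy them, no matter how long they live.
    static ByteBuffer serializeOffHeap(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        byte[] bytes = bos.toByteArray();
        ByteBuffer direct = ByteBuffer.allocateDirect(bytes.length);
        direct.put(bytes);
        direct.flip();  // make the buffer readable from position 0
        return direct;
    }

    // Read the object back out of the direct buffer.  duplicate() is
    // used so concurrent readers don't disturb each other's positions.
    static Object deserialize(ByteBuffer buf)
            throws IOException, ClassNotFoundException {
        byte[] bytes = new byte[buf.remaining()];
        buf.duplicate().get(bytes);
        try (ObjectInputStream ois =
                new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ByteBuffer buf = serializeOffHeap("hello off-heap");
        System.out.println(buf.isDirect());
        System.out.println(deserialize(buf));
    }
}
```

The only real difference from the existing memory-serialized path is
the allocateDirect call; everything else is the usual serialize /
deserialize round trip.  (The off-heap allocation is capped by
-XX:MaxDirectMemorySize rather than -Xmx, which is part of why the
configuration story gets confusing.)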

thanks,
imran

On Mon, Aug 26, 2013 at 4:38 AM, Lijie Xu <[email protected]> wrote:
> I remember that I talked about this off-heap approach with Reynold in
> person several months ago. I think this approach is attractive to
> Spark/Shark, since there are many large objects in the JVM. But the main
> problem in the original Spark (without Tachyon support) is that it uses
> the same memory space both for storing critical data and for processing
> temporary data. Separating storage from processing is more important than
> looking for a memory-efficient storage technique, so I think this
> separation is the main contribution of Tachyon.
>
>
> As for the off-heap approach, we are not the first to notice this
> problem. Apache DirectMemory is promising, though not mature currently.
> However, I think there are some problems with using direct memory:
>
> 1) Unsafe. Just as in C++, there may be memory leaks. Users will also be
> confused about setting the right memory-related configurations, such as
> -Xmx and -XX:MaxDirectMemorySize.
>
> 2) Difficult. Designing an effective and efficient memory management
> system is not an easy job. How to allocate, replace, and reclaim objects
> at the right time and in the right location is challenging. It's a bit
> similar to GC algorithms.
>
> 3) Limited usage. It's useful for write-once-read-many-times large
> objects, but not for others.
>
>
>
> I also have two related questions:
>
> 1) Can the JVM's heap use virtual memory, or only physical memory?
>
> 2) Can direct memory use virtual memory, or only physical memory?
>
>
>
>
> On Mon, Aug 26, 2013 at 8:06 AM, Haoyuan Li <[email protected]> wrote:
>
>> Hi Imran,
>>
>> One possible solution is that you can use
>> Tachyon<https://github.com/amplab/tachyon>.
>> When data is in Tachyon, Spark jobs will read it from off-heap memory.
>> Internally, it uses direct byte buffers to store memory-serialized RDDs as
>> you mentioned. Also, different Spark jobs can share the same data in
>> Tachyon's memory. Here is a presentation (slides) we did in May:
>> https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F3138542%2FTachyon_2013-05-09_Spark_Meetup.pdf
>>
>> Haoyuan
>>
>>
>> On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid <[email protected]>
>> wrote:
>>
>> > Hi,
>> >
>> > I was wondering if anyone has thought about putting cached data in an
>> > RDD into off-heap memory, eg. w/ direct byte buffers.  For really
>> > long-lived RDDs that use a lot of memory, this seems like a huge
>> > improvement, since all the memory is now totally ignored during GC.
>> > (and reading data from direct byte buffers is potentially faster as
>> > well, but that's just a nice bonus).
>> >
>> > The easiest thing to do is to store memory-serialized RDDs in direct
>> > byte buffers, but I guess we could also store the serialized RDD on
>> > disk and use a memory mapped file.  Serializing into off-heap buffers
>> > is a really simple patch, I just changed a few lines (I haven't done
>> > any real tests w/ it yet, though).  But I don't really have a ton of
>> > experience w/ off-heap memory, so I thought I would ask what others
>> > think of the idea, if it makes sense or if there are any gotchas I
>> > should be aware of, etc.
>> >
>> > thanks,
>> > Imran
>> >
>>
