I remember talking about this off-heap approach with Reynold in person several months ago. I think the approach is attractive for Spark/Shark, since there are many large objects in the JVM. But the main problem in the original Spark (without Tachyon support) is that it uses the same memory space both for storing critical data and for processing temporary data. Separating storage from processing is more important than looking for a memory-efficient storage technique, so I think this separation is the main contribution of Tachyon.
As for the off-heap approach, we are not the first to recognize this problem. Apache DirectMemory is promising, though not yet mature. However, I think there are some problems with using direct memory:

1) Unsafe. As in C++, there can be memory leaks. Users will also be confused about setting the right memory-related configurations, such as -Xmx and -XX:MaxDirectMemorySize.

2) Difficult. Designing an effective and efficient memory-management system is not an easy job. Deciding how to allocate, replace, and reclaim objects at the right time and in the right location is challenging; it is a bit similar to GC algorithms.

3) Limited usage. It's useful for large write-once-read-many objects, but not for others.

I also have two related questions: 1) Can the JVM's heap use virtual memory, or only physical memory? 2) Can direct memory use virtual memory, or only physical memory?

On Mon, Aug 26, 2013 at 8:06 AM, Haoyuan Li <[email protected]> wrote:

> Hi Imran,
>
> One possible solution is that you can use
> Tachyon <https://github.com/amplab/tachyon>.
> When data is in Tachyon, Spark jobs will read it from off-heap memory.
> Internally, it uses direct byte buffers to store memory-serialized RDDs as
> you mentioned. Also, different Spark jobs can share the same data in
> Tachyon's memory. Here is a presentation
> (slides <https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F3138542%2FTachyon_2013-05-09_Spark_Meetup.pdf>)
> we did in May.
>
> Haoyuan
>
>
> On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid <[email protected]>
> wrote:
>
> > Hi,
> >
> > I was wondering if anyone has thought about putting cached data in an
> > RDD into off-heap memory, e.g. with direct byte buffers. For really
> > long-lived RDDs that use a lot of memory, this seems like a huge
> > improvement, since all that memory is now totally ignored during GC
> > (and reading data from direct byte buffers is potentially faster as
> > well, but that's just a nice bonus).
> >
> > The easiest thing to do is to store memory-serialized RDDs in direct
> > byte buffers, but I guess we could also store the serialized RDD on
> > disk and use a memory-mapped file. Serializing into off-heap buffers
> > is a really simple patch; I just changed a few lines (I haven't done
> > any real tests with it yet, though). But I don't really have a ton of
> > experience with off-heap memory, so I thought I would ask what others
> > think of the idea, whether it makes sense, or if there are any gotchas
> > I should be aware of, etc.
> >
> > thanks,
> > Imran
> >
>
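The off-heap caching Imran describes can be sketched in plain Java (the class and method names below are hypothetical, just for illustration, not anything in Spark): serialize a value with standard Java serialization, then copy the bytes into a direct ByteBuffer so the GC no longer scans them.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.ByteBuffer;

public class OffHeapCacheSketch {
    // Serialize a value with plain Java serialization, then copy the
    // bytes into a direct (off-heap) buffer. The buffer's storage is
    // invisible to the garbage collector; only the small ByteBuffer
    // wrapper object lives on the heap.
    static ByteBuffer serializeOffHeap(Object value) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);
        }
        byte[] bytes = bos.toByteArray();
        // This allocation counts against -XX:MaxDirectMemorySize, not -Xmx.
        ByteBuffer direct = ByteBuffer.allocateDirect(bytes.length);
        direct.put(bytes);
        direct.flip(); // make the buffer ready for reading
        return direct;
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer buf = serializeOffHeap("hello off-heap");
        System.out.println(buf.isDirect() + " " + buf.remaining());
    }
}
```

This also illustrates the "unsafe" concern above: the off-heap bytes are only reclaimed when the wrapper object is collected (or freed via internal APIs), so holding many such buffers can exhaust direct memory while the heap looks nearly empty.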

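The memory-mapped-file alternative Imran mentions can be sketched similarly (again, the class and file names are made up for illustration): write the serialized bytes to disk, then map the file read-only, so the data lives in the OS page cache rather than on the Java heap and the kernel can reclaim the pages under memory pressure.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapCacheSketch {
    // Spill serialized bytes to a temp file, then read them back through a
    // read-only memory mapping. The mapped pages live in the OS page cache,
    // outside the Java heap, so the GC never touches them.
    static byte[] roundTripViaMmap(byte[] serialized) throws IOException {
        Path spill = Files.createTempFile("partition-", ".bin");
        try {
            Files.write(spill, serialized);
            try (FileChannel ch = FileChannel.open(spill, StandardOpenOption.READ)) {
                MappedByteBuffer mapped =
                        ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                byte[] back = new byte[mapped.remaining()];
                mapped.get(back);
                return back;
            }
        } finally {
            Files.deleteIfExists(spill);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] out = roundTripViaMmap(
                "pretend-serialized-partition".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(out, StandardCharsets.UTF_8));
    }
}
```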