Hi Imran,

One possible solution is to use Tachyon<https://github.com/amplab/tachyon>. When data is in Tachyon, Spark jobs read it from off-heap memory. Internally, Tachyon uses direct byte buffers to store memory-serialized RDDs, as you mentioned. Different Spark jobs can also share the same data in Tachyon's memory. Here is a presentation (slides<https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F3138542%2FTachyon_2013-05-09_Spark_Meetup.pdf>) we did in May.
Haoyuan

On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid <[email protected]> wrote:
> Hi,
>
> I was wondering if anyone has thought about putting cached data in an
> RDD into off-heap memory, e.g. with direct byte buffers. For really
> long-lived RDDs that use a lot of memory, this seems like a huge
> improvement, since all the memory is now totally ignored during GC.
> (And reading data from direct byte buffers is potentially faster as
> well, but that's just a nice bonus.)
>
> The easiest thing to do is to store memory-serialized RDDs in direct
> byte buffers, but I guess we could also store the serialized RDD on
> disk and use a memory-mapped file. Serializing into off-heap buffers
> is a really simple patch; I just changed a few lines (I haven't done
> any real tests with it yet, though). But I don't really have a ton of
> experience with off-heap memory, so I thought I would ask what others
> think of the idea, whether it makes sense, or if there are any gotchas I
> should be aware of, etc.
>
> thanks,
> Imran
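For anyone following along, the basic idea Imran describes can be sketched in plain Java. This is not Spark's or Tachyon's actual code — just a minimal illustration, using standard Java serialization and `ByteBuffer.allocateDirect`, of how serialized data can be parked in off-heap memory (which the GC never scans) and read back. The class and method names here are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;

public class OffHeapSketch {

    // Serialize a value into a direct (off-heap) ByteBuffer.
    // The buffer object itself lives on the heap, but its backing
    // memory does not, so the serialized bytes are invisible to GC.
    static ByteBuffer serializeOffHeap(Serializable value) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);
        }
        byte[] bytes = bos.toByteArray();
        ByteBuffer buf = ByteBuffer.allocateDirect(bytes.length);
        buf.put(bytes);
        buf.flip(); // make the buffer readable from position 0
        return buf;
    }

    // Copy the bytes back onto the heap and deserialize.
    static Object deserialize(ByteBuffer buf) throws IOException, ClassNotFoundException {
        byte[] bytes = new byte[buf.remaining()];
        buf.duplicate().get(bytes); // duplicate() so the caller's position is untouched
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String[] partition = {"a", "b", "c"}; // stand-in for a cached RDD partition
        ByteBuffer buf = serializeOffHeap(partition);
        String[] back = (String[]) deserialize(buf);
        System.out.println(buf.isDirect() + " " + back[2]); // prints "true c"
    }
}
```

Note the deserialize step still copies the bytes back onto the heap; the win is only for data that sits cached for a long time between reads. One real gotcha: direct buffer memory is reclaimed lazily (when the owning buffer object is eventually GC'd), so heavy allocate/discard churn can exhaust `-XX:MaxDirectMemorySize` even while the heap looks healthy.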
