Mark - you don't necessarily need to construct a separate storage level. One simple way to accomplish this is for the user application to pass Spark a DirectByteBuffer.
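For what it's worth, the mechanics on the application side are just stock java.nio; a minimal sketch of serializing a value into an off-heap direct buffer (plain Java serialization here, purely for illustration; none of this is Spark API):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class OffHeapSketch {
    // Serialize a value, then copy the bytes into a direct (off-heap)
    // buffer, so the payload is never scanned or moved by the GC.
    static ByteBuffer serializeOffHeap(Serializable value) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);
        }
        byte[] bytes = bos.toByteArray();
        ByteBuffer direct = ByteBuffer.allocateDirect(bytes.length);
        direct.put(bytes);
        direct.flip(); // make the buffer readable from position 0
        return direct;
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer buf = serializeOffHeap((Serializable) Arrays.asList(1, 2, 3));
        System.out.println(buf.isDirect());  // true
        System.out.println(buf.remaining() > 0);  // true: serialized payload present
    }
}
```

The only caveat is the usual one for direct buffers: the capacity counts against `-XX:MaxDirectMemorySize` rather than the heap, so it still has to be budgeted.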
On Sun, Aug 25, 2013 at 6:06 PM, Mark Hamstra <[email protected]> wrote:

> I'd need to see a clear and significant advantage to using off-heap RDDs
> directly within Spark vs. leveraging Tachyon. What worries me is the
> combinatoric explosion of different caching and persistence mechanisms.
> With too many of these, not only will users potentially be baffled
> (@user-list: "What are the performance trade-offs in
> using MEMORY_ONLY_SER_2 vs. MEMORY_ONLY vs. off-heap RDDs? Or should I
> store some of my RDDs in Tachyon? Which ones?", etc. ad infinitum), but
> we've got to make sure that all of the combinations work correctly. At
> some point we end up needing some sort of caching/persistence manager
> to automate some of the choices and wrangle the permutations.
>
> That's not to say that off-heap RDDs are a bad idea or are necessarily the
> combinatoric last straw, but I'm concerned about adding significant
> complexity for only marginal gains in limited cases over a more general
> solution via Tachyon. I'm willing to be shown that those concerns are
> misplaced.
>
> On Sun, Aug 25, 2013 at 5:06 PM, Haoyuan Li <[email protected]> wrote:
>
> > Hi Imran,
> >
> > One possible solution is that you can use
> > Tachyon <https://github.com/amplab/tachyon>.
> > When data is in Tachyon, Spark jobs will read it from off-heap memory.
> > Internally, it uses direct byte buffers to store memory-serialized RDDs, as
> > you mentioned. Also, different Spark jobs can share the same data in
> > Tachyon's memory. Here is a presentation
> > (slides <https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F3138542%2FTachyon_2013-05-09_Spark_Meetup.pdf>)
> > we did in May.
> >
> > Haoyuan
> >
> > On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I was wondering if anyone has thought about putting the cached data of an
> > > RDD into off-heap memory, e.g. with direct byte buffers. For really
> > > long-lived RDDs that use a lot of memory, this seems like a huge
> > > improvement, since all that memory is now totally ignored during GC.
> > > (Reading data from direct byte buffers is potentially faster as
> > > well, but that's just a nice bonus.)
> > >
> > > The easiest thing to do is to store memory-serialized RDDs in direct
> > > byte buffers, but I guess we could also store the serialized RDD on
> > > disk and use a memory-mapped file. Serializing into off-heap buffers
> > > is a really simple patch; I just changed a few lines (I haven't done
> > > any real tests with it yet, though). But I don't really have a ton of
> > > experience with off-heap memory, so I thought I would ask what others
> > > think of the idea, whether it makes sense, if there are any gotchas I
> > > should be aware of, etc.
> > >
> > > thanks,
> > > Imran
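On the memory-mapped-file variant Imran mentions above, that plumbing is also stock java.nio; a rough sketch (the temp file here just stands in for a serialized RDD partition; nothing below is Spark code):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    public static void main(String[] args) throws IOException {
        // Stand-in for a serialized RDD partition spilled to disk.
        Path spill = Files.createTempFile("rdd-partition", ".bin");
        Files.write(spill, "serialized partition bytes".getBytes(StandardCharsets.UTF_8));

        try (FileChannel ch = FileChannel.open(spill, StandardOpenOption.READ)) {
            // The mapping lives outside the Java heap: the OS faults pages in
            // on demand, and the GC never scans or copies the data.
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] first = new byte[10];
            map.get(first);
            System.out.println(new String(first, StandardCharsets.UTF_8)); // "serialized"
        }
        Files.deleteIfExists(spill);
    }
}
```

One gotcha worth knowing up front: a `MappedByteBuffer` is only released when the buffer object itself is garbage collected, so the file (and its address-space mapping) can stay pinned longer than expected.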
