No, you don't necessarily need a separate storage level, but I don't think you can avoid the "when do I use on-heap RDDs vs. off-heap RDDs vs. RDDs in Tachyon vs. ...?" questions. If off-heap RDDs don't gain us much over Tachyon in most use cases, then I'm not sure they are worth the extra complexity. If you can show me how to do them really simply, and in a way that makes their appropriate use cases obvious, then that changes the calculus.
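To make the combinatoric concern concrete, here is a small sketch. This is a simplified, hypothetical model of the boolean persistence flags, not Spark's actual StorageLevel class: each added dimension (such as an off-heap option) doubles the number of levels users must choose among, document, and test.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model (not Spark's real StorageLevel) of boolean persistence
// flags. Adding one dimension -- e.g. off-heap -- doubles the combinations.
public class LevelSpace {
    static final class Level {
        final boolean useDisk, useMemory, deserialized, offHeap;
        Level(boolean d, boolean m, boolean s, boolean o) {
            useDisk = d; useMemory = m; deserialized = s; offHeap = o;
        }
    }

    // Enumerate every combination of the four boolean flags.
    static List<Level> allLevels() {
        List<Level> levels = new ArrayList<>();
        boolean[] bools = {true, false};
        for (boolean d : bools)
            for (boolean m : bools)
                for (boolean s : bools)
                    for (boolean o : bools)
                        levels.add(new Level(d, m, s, o));
        return levels;
    }

    public static void main(String[] args) {
        // 16 combinations before the replication factor is even considered
        System.out.println(allLevels().size());
    }
}
```

Even ignoring replication, that is sixteen combinations to verify and to explain on the user list.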
On Sun, Aug 25, 2013 at 6:15 PM, Reynold Xin <[email protected]> wrote:

> Mark - you don't necessarily need to construct a separate storage level.
> One simple way to accomplish this is for the user application to pass Spark
> a DirectByteBuffer.
>
>
> On Sun, Aug 25, 2013 at 6:06 PM, Mark Hamstra <[email protected]> wrote:
>
> > I'd need to see a clear and significant advantage to using off-heap RDDs
> > directly within Spark vs. leveraging Tachyon. What worries me is the
> > combinatoric explosion of different caching and persistence mechanisms.
> > With too many of these, not only will users potentially be baffled
> > (@user-list: "What are the performance trade-offs in
> > using MEMORY_ONLY_SER_2 vs. MEMORY_ONLY vs. off-heap RDDs? Or should I
> > store some of my RDDs in Tachyon? Which ones?", etc. ad infinitum), but
> > we've got to make sure that all of the combinations work correctly. At
> > some point we end up needing some sort of caching/persistence manager
> > to automate some of the choices and wrangle the permutations.
> >
> > That's not to say that off-heap RDDs are a bad idea or are necessarily
> > the combinatoric last straw, but I'm concerned about adding significant
> > complexity for only marginal gains in limited cases over a more general
> > solution via Tachyon. I'm willing to be shown that those concerns are
> > misplaced.
> >
> >
> > On Sun, Aug 25, 2013 at 5:06 PM, Haoyuan Li <[email protected]> wrote:
> >
> > > Hi Imran,
> > >
> > > One possible solution is to use Tachyon <https://github.com/amplab/tachyon>.
> > > When data is in Tachyon, Spark jobs will read it from off-heap memory.
> > > Internally, it uses direct byte buffers to store memory-serialized RDDs
> > > as you mentioned. Also, different Spark jobs can share the same data in
> > > Tachyon's memory. Here is a presentation (slides:
> > > https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F3138542%2FTachyon_2013-05-09_Spark_Meetup.pdf)
> > > we did in May.
> > >
> > > Haoyuan
> > >
> > >
> > > On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I was wondering if anyone has thought about putting cached data in an
> > > > RDD into off-heap memory, e.g. w/ direct byte buffers. For really
> > > > long-lived RDDs that use a lot of memory, this seems like a huge
> > > > improvement, since all the memory is now totally ignored during GC.
> > > > (Reading data from direct byte buffers is potentially faster as
> > > > well, but that's just a nice bonus.)
> > > >
> > > > The easiest thing to do is to store memory-serialized RDDs in direct
> > > > byte buffers, but I guess we could also store the serialized RDD on
> > > > disk and use a memory-mapped file. Serializing into off-heap buffers
> > > > is a really simple patch; I just changed a few lines (I haven't done
> > > > any real tests w/ it yet, though). But I don't really have a ton of
> > > > experience w/ off-heap memory, so I thought I would ask what others
> > > > think of the idea, whether it makes sense, and whether there are any
> > > > gotchas I should be aware of.
> > > >
> > > > thanks,
> > > > Imran
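For reference, the direct-byte-buffer approach Imran and Reynold describe can be sketched as below. These are hypothetical helper methods, not Spark code: serialize a value, park the bytes in a direct (off-heap) buffer that the garbage collector never scans, and copy them back on-heap to deserialize.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;

// Hypothetical sketch of off-heap caching of a serialized value.
public class OffHeapSketch {

    // Serialize with Java serialization and copy into an off-heap buffer.
    static ByteBuffer store(Serializable value) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);
        }
        byte[] bytes = bos.toByteArray();
        ByteBuffer buf = ByteBuffer.allocateDirect(bytes.length); // off-heap
        buf.put(bytes);
        buf.flip();
        return buf;
    }

    // Copy the bytes back on-heap and deserialize.
    static Object load(ByteBuffer buf) throws IOException, ClassNotFoundException {
        byte[] bytes = new byte[buf.remaining()];
        buf.duplicate().get(bytes); // duplicate() leaves the caller's position intact
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ByteBuffer buf = store("cached partition bytes");
        System.out.println(buf.isDirect()); // true: the bytes live off-heap
        System.out.println(load(buf));      // round-trips the original value
    }
}
```

The memory-mapped-file variant Imran mentions would obtain the buffer from FileChannel.map(MapMode.READ_ONLY, ...) instead of ByteBuffer.allocateDirect, which likewise keeps the cached bytes out of the GC-managed heap.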
