Mark - you don't necessarily need to construct a separate storage level. One simple way to accomplish this is for the user application to pass Spark a DirectByteBuffer.
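For what it's worth, the mechanics on the application side are just stock java.nio; a minimal sketch of serializing a value into an off-heap direct buffer (plain Java serialization here, purely for illustration; none of this is Spark API):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class OffHeapSketch {
    // Serialize a value, then copy the bytes into a direct (off-heap)
    // buffer, so the payload is never scanned or moved by the GC.
    static ByteBuffer serializeOffHeap(Serializable value) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);
        }
        byte[] bytes = bos.toByteArray();
        ByteBuffer direct = ByteBuffer.allocateDirect(bytes.length);
        direct.put(bytes);
        direct.flip(); // make the buffer readable from position 0
        return direct;
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer buf = serializeOffHeap((Serializable) Arrays.asList(1, 2, 3));
        System.out.println(buf.isDirect());  // true
        System.out.println(buf.remaining() > 0);  // true: serialized payload present
    }
}
```

The only caveat is the usual one for direct buffers: the capacity counts against `-XX:MaxDirectMemorySize` rather than the heap, so it still has to be budgeted.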
On Sun, Aug 25, 2013 at 6:06 PM, Mark Hamstra <[email protected]> wrote:

> I'd need to see a clear and significant advantage to using off-heap RDDs
> directly within Spark vs. leveraging Tachyon. What worries me is the
> combinatoric explosion of different caching and persistence mechanisms.
> With too many of these, not only will users potentially be baffled
> (@user-list: "What are the performance trade-offs in
> using MEMORY_ONLY_SER_2 vs. MEMORY_ONLY vs. off-heap RDDs? Or should I
> store some of my RDDs in Tachyon? Which ones?", etc. ad infinitum), but
> we've got to make sure that all of the combinations work correctly. At
> some point we end up needing some sort of caching/persistence manager
> to automate some of the choices and wrangle the permutations.
>
> That's not to say that off-heap RDDs are a bad idea or are necessarily the
> combinatoric last straw, but I'm concerned about adding significant
> complexity for only marginal gains in limited cases over a more general
> solution via Tachyon. I'm willing to be shown that those concerns are
> misplaced.
>
> On Sun, Aug 25, 2013 at 5:06 PM, Haoyuan Li <[email protected]> wrote:
>
> > Hi Imran,
> >
> > One possible solution is that you can use
> > Tachyon <https://github.com/amplab/tachyon>.
> > When data is in Tachyon, Spark jobs will read it from off-heap memory.
> > Internally, it uses direct byte buffers to store memory-serialized RDDs, as
> > you mentioned. Also, different Spark jobs can share the same data in
> > Tachyon's memory. Here is a presentation
> > (slides <https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F3138542%2FTachyon_2013-05-09_Spark_Meetup.pdf>)
> > we did in May.
> >
> > Haoyuan
> >
> > On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I was wondering if anyone has thought about putting the cached data of an
> > > RDD into off-heap memory, e.g. with direct byte buffers. For really
> > > long-lived RDDs that use a lot of memory, this seems like a huge
> > > improvement, since all that memory is now totally ignored during GC.
> > > (Reading data from direct byte buffers is potentially faster as
> > > well, but that's just a nice bonus.)
> > >
> > > The easiest thing to do is to store memory-serialized RDDs in direct
> > > byte buffers, but I guess we could also store the serialized RDD on
> > > disk and use a memory-mapped file. Serializing into off-heap buffers
> > > is a really simple patch; I just changed a few lines (I haven't done
> > > any real tests with it yet, though). But I don't really have a ton of
> > > experience with off-heap memory, so I thought I would ask what others
> > > think of the idea, whether it makes sense, if there are any gotchas I
> > > should be aware of, etc.
> > >
> > > thanks,
> > > Imran
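On the memory-mapped-file variant Imran mentions above, that plumbing is also stock java.nio; a rough sketch (the temp file here just stands in for a serialized RDD partition; nothing below is Spark code):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    public static void main(String[] args) throws IOException {
        // Stand-in for a serialized RDD partition spilled to disk.
        Path spill = Files.createTempFile("rdd-partition", ".bin");
        Files.write(spill, "serialized partition bytes".getBytes(StandardCharsets.UTF_8));

        try (FileChannel ch = FileChannel.open(spill, StandardOpenOption.READ)) {
            // The mapping lives outside the Java heap: the OS faults pages in
            // on demand, and the GC never scans or copies the data.
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            byte[] first = new byte[10];
            map.get(first);
            System.out.println(new String(first, StandardCharsets.UTF_8)); // "serialized"
        }
        Files.deleteIfExists(spill);
    }
}
```

One gotcha worth knowing up front: a `MappedByteBuffer` is only released when the buffer object itself is garbage collected, so the file (and its address-space mapping) can stay pinned longer than expected.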
