Hi Imran,

One possible solution is to use Tachyon<https://github.com/amplab/tachyon>. When data is in Tachyon, Spark jobs read it from off-heap memory. Internally, Tachyon uses direct byte buffers to store memory-serialized RDDs, as you mentioned. Different Spark jobs can also share the same data in Tachyon's memory. Here is a presentation (slides<https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F3138542%2FTachyon_2013-05-09_Spark_Meetup.pdf>) we did in May.
Haoyuan

On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid <[email protected]> wrote:
> Hi,
>
> I was wondering if anyone has thought about putting cached data in an
> RDD into off-heap memory, e.g. with direct byte buffers. For really
> long-lived RDDs that use a lot of memory, this seems like a huge
> improvement, since all the memory is now totally ignored during GC.
> (And reading data from direct byte buffers is potentially faster as
> well, but that's just a nice bonus.)
>
> The easiest thing to do is to store memory-serialized RDDs in direct
> byte buffers, but I guess we could also store the serialized RDD on
> disk and use a memory-mapped file. Serializing into off-heap buffers
> is a really simple patch; I just changed a few lines (I haven't done
> any real tests with it yet, though). But I don't really have a ton of
> experience with off-heap memory, so I thought I would ask what others
> think of the idea, whether it makes sense, or if there are any gotchas I
> should be aware of, etc.
>
> thanks,
> Imran
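For anyone following along, the basic idea Imran describes can be sketched in plain Java. This is not Spark's or Tachyon's actual code — just a minimal illustration, using standard Java serialization and `ByteBuffer.allocateDirect`, of how serialized data can be parked in off-heap memory (which the GC never scans) and read back. The class and method names here are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;

public class OffHeapSketch {

    // Serialize a value into a direct (off-heap) ByteBuffer.
    // The buffer object itself lives on the heap, but its backing
    // memory does not, so the serialized bytes are invisible to GC.
    static ByteBuffer serializeOffHeap(Serializable value) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value);
        }
        byte[] bytes = bos.toByteArray();
        ByteBuffer buf = ByteBuffer.allocateDirect(bytes.length);
        buf.put(bytes);
        buf.flip(); // make the buffer readable from position 0
        return buf;
    }

    // Copy the bytes back onto the heap and deserialize.
    static Object deserialize(ByteBuffer buf) throws IOException, ClassNotFoundException {
        byte[] bytes = new byte[buf.remaining()];
        buf.duplicate().get(bytes); // duplicate() so the caller's position is untouched
        try (ObjectInputStream ois =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String[] partition = {"a", "b", "c"}; // stand-in for a cached RDD partition
        ByteBuffer buf = serializeOffHeap(partition);
        String[] back = (String[]) deserialize(buf);
        System.out.println(buf.isDirect() + " " + back[2]); // prints "true c"
    }
}
```

Note the deserialize step still copies the bytes back onto the heap; the win is only for data that sits cached for a long time between reads. One real gotcha: direct buffer memory is reclaimed lazily (when the owning buffer object is eventually GC'd), so heavy allocate/discard churn can exhaust `-XX:MaxDirectMemorySize` even while the heap looks healthy.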
