Hey guys,

I would also prefer to strengthen and get behind Tachyon, rather than
implement a separate solution (though I guess if it's not officially
supported, then nobody will ask questions).  My feeling is more that
off-heap memory is difficult, so it's better to focus efforts on one
project.

Haoyuan,

Tachyon brings cached HDFS data to the local client.  Have we thought about
the opposite approach, which might be more efficient?
 - Load the data on HDFS node N into the off-heap memory of node N
 - In Spark, inform the framework (maybe via RDD partition/location info)
that the data is located on node N
 - Bring the computation to node N

This avoids network IO and may be much more efficient for many types of
applications.   I know this would be a big win for us.
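The steps above can be sketched with a toy model (plain Scala; the names here are all made up and this is not the Spark API, though Spark's RDD preferred-locations hook carries roughly this kind of partition-to-host hint):

```scala
// Toy model of locality-aware scheduling (made-up names, not Spark API).
// The idea: if the framework knows which node holds each partition's
// off-heap data, it can send the task there instead of moving the data.
case class Task(partition: Int)

// partition -> host that already holds the data in off-heap memory
val partitionHosts: Map[Int, String] =
  Map(0 -> "node1", 1 -> "node2", 2 -> "node1")

// schedule a task on the data's host when the location is known
def schedule(task: Task): String =
  partitionHosts.getOrElse(task.partition, "any-node")

assert(schedule(Task(1)) == "node2")    // computation moves to the data
assert(schedule(Task(9)) == "any-node") // no location info: no preference
```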

-Evan


On Wed, Aug 28, 2013 at 1:37 AM, Haoyuan Li <[email protected]> wrote:

> No problem. As with reading/writing data from/to an off-heap ByteBuffer,
> when a program reads/writes data from/to Tachyon, Spark/Shark needs to do
> ser/de. Efficient ser/de helps performance a lot, as people have pointed
> out. One solution is for the application to do primitive operations
> directly on the ByteBuffer, as Shark does now. Most related code is
> located at
> https://github.com/amplab/shark/tree/master/src/main/scala/shark/memstore2
> and
> https://github.com/amplab/shark/tree/master/src/tachyon_enabled/scala/shark/tachyon
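For illustration, a minimal sketch (plain Scala, not the Shark memstore2 code; the layout is made up) of doing primitive operations directly on a direct ByteBuffer, with no per-record ser/de:

```scala
import java.nio.ByteBuffer

// toy "column store": pack a column of ints into one direct buffer and
// read them back with primitive gets -- no per-record ser/de at all
val values = Array(3, 1, 4, 1, 5, 9)
val buf = ByteBuffer.allocateDirect(values.length * 4)
values.foreach(v => buf.putInt(v))
buf.flip()

var sum = 0
while (buf.hasRemaining) sum += buf.getInt()
assert(sum == values.sum) // 23
```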
>
> Haoyuan
>
>
> On Wed, Aug 28, 2013 at 1:21 AM, Imran Rashid <[email protected]>
> wrote:
>
> > Thanks Haoyuan.  It seems like we should try out Tachyon; it sounds like
> > it is what we are looking for.
> >
> > On Wed, Aug 28, 2013 at 8:18 AM, Haoyuan Li <[email protected]>
> wrote:
> > > Response inline.
> > >
> > >
> > > On Tue, Aug 27, 2013 at 1:37 AM, Imran Rashid <[email protected]>
> > wrote:
> > >
> > >> Thanks for all the great comments & discussion.  Let me expand a bit
> > >> on our use case, and then I'm gonna combine responses to various
> > >> questions.
> > >>
> > >> In general, when we use spark, we have some really big RDDs that use
> > >> up a lot of memory (10s of GB per node) that are really our "core"
> > >> data sets.  We tend to start up a spark application, immediately load
> > >> all those data sets, and just leave them loaded for the lifetime of
> > >> that process.  We definitely create a lot of other RDDs along the way,
> > >> and lots of intermediate objects that we'd like to go through normal
> > >> garbage collection.  But those all require much less memory, maybe
> > >> 1/10th of the big RDDs that we just keep around.  I know this is a bit
> > >> of a special case, but it seems like it probably isn't that different
> > >> from a lot of use cases.
> > >>
> > >> Reynold Xin wrote:
> > >> > This is especially attractive if the application can read directly
> > from
> > >> a byte
> > >> > buffer without generic serialization (like Shark).
> > >>
> > >> interesting -- can you explain how this works in Shark?  do you have
> > >> some general way of storing data in byte buffers that avoids
> > >> serialization?  Or do you mean that if the user is effectively
> > >> creating an RDD of ints, you create an RDD[ByteBuffer], and
> > >> then you read / write ints into the byte buffer yourself?
> > >> Sorry, I'm familiar with the basic idea of shark but not the code at
> > >> all -- even a pointer to the code would be helpful.
> > >>
> > >> Haoyuan Li wrote:
> > >> > One possible solution is that you can use
> > >> > Tachyon<https://github.com/amplab/tachyon>.
> > >>
> > >> This is a good idea, that I had probably overlooked.  There are two
> > >> potential issues that I can think of with this approach, though:
> > >> 1) I was under the impression that Tachyon is still not really tested
> > >> in production systems, and I need something a bit more mature.  Of
> > >> course, my changes wouldn't be thoroughly tested either, but somehow I
> > >> feel better about deploying my 5-line patch to a codebase I understand
> > >> than adding another entire system.  (This isn't a good reason to add
> > >> this to Spark in general, though; it just might be a temporary patch we
> > >> deploy locally.)
> > >>
> > >
> > > This is a legitimate concern. The good news is that several companies
> > > have been testing it for a while, and some are close to putting it into
> > > production. For example, as Yahoo mentioned in today's meetup, we are
> > > working to integrate Shark and Tachyon closely, and the results are very
> > > promising. It will be in production soon.
> > >
> > >
> > >> 2) I may have misunderstood Tachyon, but it seems there is a big
> > >> difference in the data locality in these two approaches.  On a large
> > >> cluster, HDFS will spread the data all over the cluster, and so any
> > >> particular piece of on-disk data will only live on a few machines.
> > >> When you start a spark application, which only uses a small subset of
> > >> the nodes, odds are the data you want is *not* on those nodes.  So
> > >> even if tachyon caches data from HDFS into memory, it won't be on the
> > >> same nodes as the spark application.  Which means that when the spark
> > >> application reads data from the RDD, even though the data is in memory
> > >> on some node in the cluster, it will need to be read over the network
> > >> by the actual spark worker assigned to the application.
> > >>
> > >> Is my understanding correct?  I haven't done any measurements at all
> > >> of a difference in performance, but it seems this would be much
> > >> slower.
> > >>
> > >
> > > This is a great question. Actually, from a data-locality perspective,
> > > the two approaches are no different. Tachyon does client-side caching,
> > > which means that if a client on a node reads data not on its local
> > > machine, the first read will cache the data on that node. Therefore,
> > > all future accesses on that node will read the data from its local
> > > memory. For example, suppose you have a cluster with 100 nodes all
> > > running HDFS and Tachyon. Then you launch a Spark job running on only
> > > 20 nodes. The first time it reads or caches the data, all the data will
> > > be cached on those 20 nodes. In the future, when the Spark master tries
> > > to schedule tasks, it will query Tachyon about data locations and take
> > > advantage of data locality automatically.
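The client-side caching behaviour described above can be sketched with a toy model (plain Scala, not Tachyon's API; all names are made up):

```scala
import scala.collection.mutable

// toy model of client-side caching: the first read of a block goes
// "over the network"; later reads on the same node are served locally
val remoteStore = Map("block-1" -> "payload-1")
val localCache = mutable.Map.empty[String, String]
var remoteReads = 0

def read(block: String): String =
  localCache.getOrElseUpdate(block, {
    remoteReads += 1 // only the first access is remote
    remoteStore(block)
  })

assert(read("block-1") == "payload-1") // remote read, populates local cache
assert(read("block-1") == "payload-1") // served from local memory
assert(remoteReads == 1)
```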
> > >
> > > Best,
> > >
> > > Haoyuan
> > >
> > >
> > >>
> > >>
> > >> Mark Hamstra wrote:
> > >> > What worries me is the
> > >> > combinatoric explosion of different caching and persistence
> > mechanisms.
> > >>
> > >> great points, and I have no idea of the real advantages yet.  I agree
> > >> we'd need to actually observe an improvement to add yet another option.
> > >> (I would really like some alternative to what I'm doing now, but maybe
> > >> Tachyon is all I need ...)
> > >>
> > >> Reynold Xin wrote:
> > >> > Mark - you don't necessarily need to construct a separate storage
> > level.
> > >> > One simple way to accomplish this is for the user application to
> pass
> > >> Spark
> > >> > a DirectByteBuffer.
> > >>
> > >> hmm, that's true I suppose, but I had originally thought of making it
> > >> another storage level, just for convenience & consistency.  Couldn't
> > >> you get rid of all the storage levels and just have the user apply
> > >> various transformations to an RDD?  E.g.:
> > >>
> > >> rdd.cache(MEMORY_ONLY_SER)
> > >>
> > >> could be
> > >>
> > >> rdd.map{x => serializer.serialize(x)}.cache(MEMORY_ONLY)
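Outside Spark, the serialize-then-cache equivalence above can be illustrated with plain Java serialization standing in for Spark's serializer (a sketch; a Seq plays the role of the RDD, and the helper names are made up):

```scala
import java.io._

// stand-in for Spark's serializer: plain Java serialization
def serialize(x: AnyRef): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(x)
  oos.close()
  bos.toByteArray
}

def deserialize[T](bytes: Array[Byte]): T = {
  val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
  try ois.readObject().asInstanceOf[T] finally ois.close()
}

// a Seq plays the role of the RDD; mapping serialize over it and keeping
// the result is the analogue of rdd.map(...).cache(MEMORY_ONLY)
val records: Seq[String] = Seq("a", "bb", "ccc")
val cached: Seq[Array[Byte]] = records.map(r => serialize(r))
val restored: Seq[String] = cached.map(b => deserialize[String](b))
assert(restored == records)
```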
> > >>
> > >>
> > >> And I agree with all of Lijie's comments that using off-heap memory is
> > >> unsafe & difficult.  But I feel that isn't a reason to completely
> > >> disallow it, if there is a significant performance improvement.  It
> > >> would need to be clearly documented as an advanced feature with some
> > >> risks involved.
> > >>
> > >> thanks,
> > >> imran
> > >>
> > >> On Mon, Aug 26, 2013 at 4:38 AM, Lijie Xu <[email protected]>
> wrote:
> > >> > I remember that I talked about this off-heap approach with Reynold
> > >> > in person several months ago. I think this approach is attractive to
> > >> > Spark/Shark, since there are many large objects in the JVM. But the
> > >> > main problem in original Spark (without Tachyon support) is that it
> > >> > uses the same memory space both for storing critical data and for
> > >> > processing temporary data. Separating storage from processing is more
> > >> > important than looking for a memory-efficient storage technique, so I
> > >> > think this separation is the main contribution of Tachyon.
> > >> >
> > >> >
> > >> > As for the off-heap approach, we are not the first to realize this
> > >> > problem. Apache DirectMemory is promising, though not mature
> > >> > currently. However, I think there are some problems with using direct
> > >> > memory.
> > >> >
> > >> > 1) Unsafe. As in C++, there may be memory leaks. Users may also be
> > >> > confused about setting the right memory-related configurations, such
> > >> > as -Xmx and -XX:MaxDirectMemorySize.
> > >> >
> > >> > 2) Difficult. Designing an effective and efficient memory-management
> > >> > system is not an easy job. How to allocate, replace, and reclaim
> > >> > objects at the right time and in the right location is challenging.
> > >> > It's a bit similar to GC algorithms.
> > >> >
> > >> > 3) Limited usage. It's useful for write-once-read-many-times large
> > >> > objects, but not for others.
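On the configuration point in 1), the heap and direct-memory budgets are set by separate flags; a config fragment (the sizes, jar, and class name here are made-up placeholders):

```shell
# -Xmx caps the on-heap size; -XX:MaxDirectMemorySize caps direct
# (off-heap) buffers -- the two budgets are separate and both must fit in RAM
java -Xmx8g -XX:MaxDirectMemorySize=16g -cp app.jar Main
```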
> > >> >
> > >> >
> > >> >
> > >> > I also have two related questions:
> > >> >
> > >> > 1) Can the JVM's heap use virtual memory, or just physical memory?
> > >> >
> > >> > 2) Can direct memory use virtual memory, or just physical memory?
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Mon, Aug 26, 2013 at 8:06 AM, Haoyuan Li <[email protected]>
> > >> wrote:
> > >> >
> > >> >> Hi Imran,
> > >> >>
> > >> >> One possible solution is that you can use
> > >> >> Tachyon<https://github.com/amplab/tachyon>.
> > >> >> When data is in Tachyon, Spark jobs will read it from off-heap
> > >> >> memory. Internally, it uses direct byte buffers to store
> > >> >> memory-serialized RDDs, as you mentioned. Also, different Spark jobs
> > >> >> can share the same data in Tachyon's memory. Here is a presentation
> > >> >> (slides:
> > >> >> https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F3138542%2FTachyon_2013-05-09_Spark_Meetup.pdf)
> > >> >>
> > >> >> Haoyuan
> > >> >>
> > >> >>
> > >> >> On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid <
> [email protected]>
> > >> >> wrote:
> > >> >>
> > >> >> > Hi,
> > >> >> >
> > >> >> > I was wondering if anyone has thought about putting cached data
> > >> >> > in an RDD into off-heap memory, e.g. with direct byte buffers.
> > >> >> > For really long-lived RDDs that use a lot of memory, this seems
> > >> >> > like a huge improvement, since all that memory is then completely
> > >> >> > ignored during GC.  (And reading data from direct byte buffers is
> > >> >> > potentially faster as well, but that's just a nice bonus.)
> > >> >> >
> > >> >> > The easiest thing to do is to store memory-serialized RDDs in
> > >> >> > direct byte buffers, but I guess we could also store the
> > >> >> > serialized RDD on disk and use a memory-mapped file.  Serializing
> > >> >> > into off-heap buffers is a really simple patch; I just changed a
> > >> >> > few lines (I haven't done any real tests with it yet, though).
> > >> >> > But I don't really have a ton of experience with off-heap memory,
> > >> >> > so I thought I would ask what others think of the idea, whether it
> > >> >> > makes sense, or if there are any gotchas I should be aware of.
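A minimal sketch of the memory-mapped-file variant (plain Scala over java.nio, independent of Spark; the file name and payload are arbitrary):

```scala
import java.nio.channels.FileChannel
import java.nio.file.{Files, StandardOpenOption}

// sketch: write "serialized" bytes to disk, then mmap them back so the
// pages live outside the JVM heap
val path = Files.createTempFile("rdd-partition", ".bin")
val payload = "serialized-partition-bytes".getBytes("UTF-8")
Files.write(path, payload)

val channel = FileChannel.open(path, StandardOpenOption.READ)
val mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
val out = new Array[Byte](mapped.remaining())
mapped.get(out) // reads via the mapping, not a heap-backed stream
channel.close()
Files.deleteIfExists(path)

assert(new String(out, "UTF-8") == "serialized-partition-bytes")
```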
> > >> >> >
> > >> >> > thanks,
> > >> >> > Imran
> > >> >> >
> > >> >>
> > >>
> >
>



-- 
Evan Chan
Staff Engineer
[email protected]