That will be great!

Haoyuan
On Thu, Sep 5, 2013 at 9:28 PM, Evan Chan <[email protected]> wrote:

> Haoyuan,
>
> Thanks, that sounds great, exactly what we are looking for.
>
> We might be interested in integrating Tachyon with CFS (Cassandra File
> System, the Cassandra-based implementation of HDFS).
>
> -Evan
>
> On Sat, Aug 31, 2013 at 3:33 PM, Haoyuan Li <[email protected]> wrote:
>
> > Evan,
> >
> > If I understand you correctly, you want to avoid network I/O as much
> > as possible by caching the data on the node that already has the data
> > on disk. Actually, this is exactly what the client caching I described
> > does automatically. For example, suppose you have a cluster of
> > machines with nothing cached in memory yet. Then a Spark application
> > runs on it. Spark asks Tachyon where data X is. Since nothing is in
> > memory yet, Tachyon returns disk locations the first time. The Spark
> > program will then take advantage of disk data locality and load the
> > data X on HDFS node N into the off-heap memory of node N. In the
> > future, when Spark asks Tachyon for the location of X, Tachyon will
> > return node N. There is no network I/O involved in the whole process.
> > Let me know if I misunderstood something.
> >
> > Haoyuan
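For concreteness, a rough sketch of the read flow Haoyuan describes above, from the Spark side. The hostname, path, and the "fs.tachyon.impl" -> "tachyon.hadoop.TFS" registration are illustrative assumptions, and the package names are modern Spark's (the thread predates them); check the docs for the Tachyon version you run:

    // Sketch only: reading a data set through Tachyon so the first read
    // pulls blocks from disk (HDFS) and caches them on the reading nodes;
    // later reads are scheduled memory-local.
    import org.apache.spark.SparkContext

    object TachyonLocalitySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "tachyon-locality-sketch")
        // Assumption: Tachyon's Hadoop-compatible client is on the classpath.
        sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")

        // First job: Tachyon only knows disk locations for X, so Spark
        // schedules for disk locality and the read caches X in Tachyon
        // memory on those same nodes.
        val x = sc.textFile("tachyon://master:19998/data/X")
        println(x.count())

        // Later jobs asking Tachyon for X's location get the caching nodes
        // back, so tasks run memory-local with no network I/O.
        println(x.filter(_.nonEmpty).count())
        sc.stop()
      }
    }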
> > On Fri, Aug 30, 2013 at 10:00 AM, Evan Chan <[email protected]> wrote:
> >
> > > Hey guys,
> > >
> > > I would also prefer to strengthen and get behind Tachyon, rather
> > > than implement a separate solution (though I guess if it's not
> > > officially supported, then nobody will ask questions). But my
> > > feeling is more that off-heap memory is difficult, so it's better to
> > > focus efforts on one project.
> > >
> > > Haoyuan,
> > >
> > > Tachyon brings cached HDFS data to the local client. Have we thought
> > > about the opposite approach, which might be more efficient?
> > > - Load the data in HDFS node N into the off-heap memory of node N
> > > - In Spark, inform the framework (maybe via RDD partition/location
> > >   info) that the data is located on node N
> > > - Bring the computation to node N
> > >
> > > This avoids network I/O and may be much more efficient for many
> > > types of applications. I know this would be a big win for us.
> > >
> > > -Evan
> > >
> > > On Wed, Aug 28, 2013 at 1:37 AM, Haoyuan Li <[email protected]> wrote:
> > >
> > > > No problem. As with reading/writing data from/to an off-heap
> > > > ByteBuffer, when a program reads/writes data from/to Tachyon,
> > > > Spark/Shark needs to do ser/de. Efficient ser/de helps performance
> > > > a lot, as people pointed out. One solution is for the application
> > > > to do primitive operations directly on the ByteBuffer, like how
> > > > Shark handles it now. Most of the related code is located at
> > > > https://github.com/amplab/shark/tree/master/src/main/scala/shark/memstore2
> > > > and
> > > > https://github.com/amplab/shark/tree/master/src/tachyon_enabled/scala/shark/tachyon
> > > >
> > > > Haoyuan
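A minimal sketch of the "primitive operations directly on the ByteBuffer" idea: one int column kept in a single off-heap buffer and scanned without generic serialization. Shark's memstore2 is far richer (many column types, null handling, compression); the class and method names here are made up for illustration:

    import java.nio.ByteBuffer

    final class IntColumn(capacity: Int) {
      // Off-heap allocation: the buffer's contents are invisible to the GC.
      private val buf = ByteBuffer.allocateDirect(capacity * 4)

      def append(v: Int): Unit = buf.putInt(v)

      // Scan the column by walking the buffer directly; no objects are
      // materialized and nothing is deserialized.
      def sum(): Long = {
        val read = buf.duplicate()
        read.flip()
        var total = 0L
        while (read.hasRemaining) total += read.getInt()
        total
      }
    }

    object IntColumnDemo {
      def main(args: Array[String]): Unit = {
        val col = new IntColumn(1000)
        (1 to 1000).foreach(col.append)
        println(col.sum()) // 500500
      }
    }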
> > > > On Wed, Aug 28, 2013 at 1:21 AM, Imran Rashid <[email protected]> wrote:
> > > >
> > > > > Thanks Haoyuan. It seems like we should try out Tachyon; it
> > > > > sounds like it is what we are looking for.
> > > > >
> > > > > On Wed, Aug 28, 2013 at 8:18 AM, Haoyuan Li <[email protected]> wrote:
> > > > >
> > > > > > Response inline.
> > > > > >
> > > > > > On Tue, Aug 27, 2013 at 1:37 AM, Imran Rashid <[email protected]> wrote:
> > > > > >
> > > > > >> Thanks for all the great comments & discussion. Let me expand
> > > > > >> a bit on our use case, and then I'm gonna combine responses to
> > > > > >> various questions.
> > > > > >>
> > > > > >> In general, when we use Spark, we have some really big RDDs
> > > > > >> that use up a lot of memory (10s of GB per node) and that are
> > > > > >> really our "core" data sets. We tend to start up a Spark
> > > > > >> application, immediately load all those data sets, and just
> > > > > >> leave them loaded for the lifetime of that process. We
> > > > > >> definitely create a lot of other RDDs along the way, and lots
> > > > > >> of intermediate objects that we'd like to go through normal
> > > > > >> garbage collection. But those all require much less memory,
> > > > > >> maybe 1/10th of the big RDDs that we just keep around. I know
> > > > > >> this is a bit of a special case, but it seems like it probably
> > > > > >> isn't that different from a lot of use cases.
> > > > > >>
> > > > > >> Reynold Xin wrote:
> > > > > >> > This is especially attractive if the application can read
> > > > > >> > directly from a byte buffer without generic serialization
> > > > > >> > (like Shark).
> > > > > >>
> > > > > >> Interesting -- can you explain how this works in Shark? Do you
> > > > > >> have some general way of storing data in byte buffers that
> > > > > >> avoids serialization? Or do you mean that if the user is
> > > > > >> effectively creating an RDD of ints, you create an
> > > > > >> RDD[ByteBuffer], and then you read/write the ints into the
> > > > > >> byte buffer yourself? Sorry, I'm familiar with the basic idea
> > > > > >> of Shark but not the code at all -- even a pointer to the code
> > > > > >> would be helpful.
> > > > > >>
> > > > > >> Haoyuan Li wrote:
> > > > > >> > One possible solution is that you can use Tachyon
> > > > > >> > <https://github.com/amplab/tachyon>.
> > > > > >>
> > > > > >> This is a good idea that I had probably overlooked. There are
> > > > > >> two potential issues that I can think of with this approach,
> > > > > >> though:
> > > > > >> 1) I was under the impression that Tachyon is still not really
> > > > > >> tested in production systems, and I need something a bit more
> > > > > >> mature. Of course, my changes wouldn't be thoroughly tested
> > > > > >> either, but somehow I feel better about deploying my 5-line
> > > > > >> patch to a codebase I understand than adding another entire
> > > > > >> system. (This isn't a good reason to add this to Spark in
> > > > > >> general, though; it just might be a temporary patch we deploy
> > > > > >> locally.)
> > > > > >
> > > > > > This is a legitimate concern. The good news is that several
> > > > > > companies have been testing it for a while, and some are close
> > > > > > to putting it into production. For example, as Yahoo mentioned
> > > > > > in today's meetup, we are working to integrate Shark and
> > > > > > Tachyon closely, and the results are very promising. It will be
> > > > > > in production soon.
> > > > > >
> > > > > >> 2) I may have misunderstood Tachyon, but it seems there is a
> > > > > >> big difference in data locality between these two approaches.
> > > > > >> On a large cluster, HDFS will spread the data all over the
> > > > > >> cluster, so any particular piece of on-disk data will only
> > > > > >> live on a few machines. When you start a Spark application,
> > > > > >> which only uses a small subset of the nodes, odds are the data
> > > > > >> you want is *not* on those nodes. So even if Tachyon caches
> > > > > >> data from HDFS into memory, it won't be on the same nodes as
> > > > > >> the Spark application. Which means that when the Spark
> > > > > >> application reads data from the RDD, even though the data is
> > > > > >> in memory on some node in the cluster, it will need to be read
> > > > > >> over the network by the actual Spark worker assigned to the
> > > > > >> application. Is my understanding correct? I haven't done any
> > > > > >> measurements of the difference in performance, but it seems
> > > > > >> this would be much slower.
> > > > > >
> > > > > > This is a great question. Actually, from a data locality
> > > > > > perspective, the two approaches are no different. Tachyon does
> > > > > > client-side caching, which means that if a client on a node
> > > > > > reads data that is not on its local machine, the first read
> > > > > > will cache the data on that node. Therefore, all future
> > > > > > accesses on that node will read the data from its local memory.
> > > > > > For example, suppose you have a cluster with 100 nodes, all
> > > > > > running HDFS and Tachyon. Then you launch a Spark job running
> > > > > > on 20 nodes only. When it reads or caches the data for the
> > > > > > first time, all the data will be cached on those 20 nodes. In
> > > > > > the future, when the Spark master tries to schedule tasks, it
> > > > > > will query Tachyon about data locations and take advantage of
> > > > > > data locality automatically.
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Haoyuan
> > > > > >
> > > > > >> Mark Hamstra wrote:
> > > > > >> > What worries me is the combinatoric explosion of different
> > > > > >> > caching and persistence mechanisms.
> > > > > >>
> > > > > >> Great points, and I have no idea of the real advantages yet. I
> > > > > >> agree we'd need to actually observe an improvement to add yet
> > > > > >> another option. (I would really like some alternative to what
> > > > > >> I'm doing now, but maybe Tachyon is all I need ...)
> > > > > >>
> > > > > >> Reynold Xin wrote:
> > > > > >> > Mark - you don't necessarily need to construct a separate
> > > > > >> > storage level. One simple way to accomplish this is for the
> > > > > >> > user application to pass Spark a DirectByteBuffer.
> > > > > >>
> > > > > >> Hmm, that's true I suppose, but I had originally thought of
> > > > > >> making it another storage level, just for convenience &
> > > > > >> consistency. Couldn't you get rid of all the storage levels
> > > > > >> and just have the user apply various transformations to an
> > > > > >> RDD? E.g.,
> > > > > >>
> > > > > >> rdd.cache(MEMORY_ONLY_SER)
> > > > > >>
> > > > > >> could be
> > > > > >>
> > > > > >> rdd.map{x => serializer.serialize(x)}.cache(MEMORY_ONLY)
> > > > > >>
> > > > > >> And I agree with all of Lijie's comments that using off-heap
> > > > > >> memory is unsafe & difficult. But I feel that isn't a reason
> > > > > >> to completely disallow it, if there is a significant
> > > > > >> performance improvement. It would need to be clearly
> > > > > >> documented as an advanced feature with some risks involved.
> > > > > >>
> > > > > >> thanks,
> > > > > >> imran
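A sketch of the storage-level-as-transformation equivalence Imran describes just above. Java serialization is used only to keep the example self-contained (Kryo or a custom serializer would be the realistic choice), and the helper names are made up:

    import java.io._

    object SerdeSketch {
      // Serialize one RDD element to a byte array.
      def serialize(obj: AnyRef): Array[Byte] = {
        val bytes = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(bytes)
        out.writeObject(obj)
        out.close()
        bytes.toByteArray
      }

      // Restore an element from its serialized form.
      def deserialize[T](bytes: Array[Byte]): T = {
        val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
        in.readObject().asInstanceOf[T]
      }

      def main(args: Array[String]): Unit = {
        println(deserialize[String](serialize("hello"))) // hello
      }
    }

    // Usage inside a Spark program, mirroring Imran's example: instead of a
    // dedicated MEMORY_ONLY_SER level, cache the serialized bytes directly:
    //   val serialized = rdd.map(x => SerdeSketch.serialize(x)).cache()
    //   val restored   = serialized.map(b => SerdeSketch.deserialize[MyRecord](b))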
> > > > > >> On Mon, Aug 26, 2013 at 4:38 AM, Lijie Xu <[email protected]> wrote:
> > > > > >>
> > > > > >> > I remember that I talked about this off-heap approach with
> > > > > >> > Reynold in person several months ago. I think this approach
> > > > > >> > is attractive to Spark/Shark, since there are many large
> > > > > >> > objects in the JVM. But the main problem in original Spark
> > > > > >> > (without Tachyon support) is that it uses the same memory
> > > > > >> > space both for storing critical data and for processing
> > > > > >> > temporary data. Separating storage from processing is more
> > > > > >> > important than looking for a memory-efficient storage
> > > > > >> > technique, so I think this separation is the main
> > > > > >> > contribution of Tachyon.
> > > > > >> >
> > > > > >> > As for the off-heap approach, we are not the first to notice
> > > > > >> > this problem. Apache DirectMemory is promising, though not
> > > > > >> > mature currently. However, I think there are some problems
> > > > > >> > with using direct memory:
> > > > > >> >
> > > > > >> > 1) Unsafe. As in C++, there may be memory leaks. Users will
> > > > > >> > also be confused about setting the right memory-related
> > > > > >> > configurations, such as -Xmx and -XX:MaxDirectMemorySize.
> > > > > >> >
> > > > > >> > 2) Difficult. Designing an effective and efficient memory
> > > > > >> > management system is not an easy job. How to allocate,
> > > > > >> > replace, and reclaim objects at the right time and in the
> > > > > >> > right location is challenging. It's a bit similar to GC
> > > > > >> > algorithms.
> > > > > >> >
> > > > > >> > 3) Limited usage. It's useful for write-once-read-many-times
> > > > > >> > large objects but not for others.
> > > > > >> >
> > > > > >> > I also have two related questions:
> > > > > >> >
> > > > > >> > 1) Can the JVM's heap use virtual memory, or just physical
> > > > > >> > memory?
> > > > > >> >
> > > > > >> > 2) Can direct memory use virtual memory, or just physical
> > > > > >> > memory?
> > > > > >> >
> > > > > >> > On Mon, Aug 26, 2013 at 8:06 AM, Haoyuan Li <[email protected]> wrote:
> > > > > >> >
> > > > > >> >> Hi Imran,
> > > > > >> >>
> > > > > >> >> One possible solution is that you can use Tachyon
> > > > > >> >> <https://github.com/amplab/tachyon>. When data is in
> > > > > >> >> Tachyon, Spark jobs will read it from off-heap memory.
> > > > > >> >> Internally, it uses direct byte buffers to store
> > > > > >> >> memory-serialized RDDs, as you mentioned. Also, different
> > > > > >> >> Spark jobs can share the same data in Tachyon's memory.
> > > > > >> >> Here is a presentation (slide
> > > > > >> >> <https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F3138542%2FTachyon_2013-05-09_Spark_Meetup.pdf>)
> > > > > >> >> we did in May.
> > > > > >> >>
> > > > > >> >> Haoyuan
> > > > > >> >>
> > > > > >> >> On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid <[email protected]> wrote:
> > > > > >> >>
> > > > > >> >> > Hi,
> > > > > >> >> >
> > > > > >> >> > I was wondering if anyone has thought about putting the
> > > > > >> >> > cached data in an RDD into off-heap memory, e.g. w/
> > > > > >> >> > direct byte buffers. For really long-lived RDDs that use
> > > > > >> >> > a lot of memory, this seems like a huge improvement,
> > > > > >> >> > since all that memory is now totally ignored during GC.
> > > > > >> >> > (And reading data from direct byte buffers is potentially
> > > > > >> >> > faster as well, but that's just a nice bonus.)
> > > > > >> >> >
> > > > > >> >> > The easiest thing to do is to store memory-serialized
> > > > > >> >> > RDDs in direct byte buffers, but I guess we could also
> > > > > >> >> > store the serialized RDD on disk and use a memory-mapped
> > > > > >> >> > file. Serializing into off-heap buffers is a really
> > > > > >> >> > simple patch; I just changed a few lines (I haven't done
> > > > > >> >> > any real tests w/ it yet, though). But I don't really
> > > > > >> >> > have a ton of experience w/ off-heap memory, so I thought
> > > > > >> >> > I would ask what others think of the idea, if it makes
> > > > > >> >> > sense or if there are any gotchas I should be aware of,
> > > > > >> >> > etc.
> > > > > >> >> >
> > > > > >> >> > thanks,
> > > > > >> >> > Imran
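A minimal sketch of the first option in Imran's original message: keep a memory-serialized block in a direct (off-heap) buffer so the cached bytes are ignored by the garbage collector. The class and names are hypothetical, not Spark's block store:

    import java.nio.ByteBuffer

    final class OffHeapBlock(serialized: Array[Byte]) {
      // Copy the serialized block off the heap; the GC no longer scans it.
      private val buf: ByteBuffer = {
        val b = ByteBuffer.allocateDirect(serialized.length)
        b.put(serialized)
        b.flip()
        b
      }

      // Hand out a read-only view; deserialization reads straight from
      // off-heap memory.
      def bytes: ByteBuffer = buf.asReadOnlyBuffer()
    }

    object OffHeapBlockDemo {
      def main(args: Array[String]): Unit = {
        val block = new OffHeapBlock("a serialized RDD partition".getBytes("UTF-8"))
        val view = block.bytes
        val out = new Array[Byte](view.remaining())
        view.get(out)
        println(new String(out, "UTF-8"))
      }
    }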
> > > --
> > > Evan Chan
> > > Staff Engineer
> > > [email protected] | <http://www.ooyala.com/>
> > > <http://www.facebook.com/ooyala> <http://www.linkedin.com/company/ooyala> <http://www.twitter.com/ooyala>

> --
> Evan Chan
> Staff Engineer
> [email protected] | <http://www.ooyala.com/>
> <http://www.facebook.com/ooyala> <http://www.linkedin.com/company/ooyala> <http://www.twitter.com/ooyala>
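For completeness, the memory-mapped-file variant Imran mentions in his original message, as the alternative to direct byte buffers above. The file path is arbitrary; this is a sketch, not Spark's on-disk store:

    import java.io.{File, FileOutputStream, RandomAccessFile}
    import java.nio.channels.FileChannel

    object MmapBlockDemo {
      def main(args: Array[String]): Unit = {
        // Serialize the block to disk once.
        val f = File.createTempFile("rdd-block", ".bin")
        val out = new FileOutputStream(f)
        out.write("a serialized RDD partition".getBytes("UTF-8"))
        out.close()

        // Map it read-only: the OS page cache, not the JVM heap, holds the
        // hot bytes, and pages fault in on demand.
        val channel = new RandomAccessFile(f, "r").getChannel
        val mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
        val bytes = new Array[Byte](mapped.remaining())
        mapped.get(bytes)
        println(new String(bytes, "UTF-8"))
        channel.close()
      }
    }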
