That will be great!

Haoyuan
On Thu, Sep 5, 2013 at 9:28 PM, Evan Chan <[email protected]> wrote:

> Haoyuan,
>
> Thanks, that sounds great, exactly what we are looking for.
>
> We might be interested in integrating Tachyon with CFS (Cassandra File
> System, the Cassandra-based implementation of HDFS).
>
> -Evan
>
> On Sat, Aug 31, 2013 at 3:33 PM, Haoyuan Li <[email protected]> wrote:
>
> > Evan,
> >
> > If I understand you correctly, you want to avoid network I/O as much
> > as possible by caching the data on the node that already has the data
> > on disk. Actually, this is exactly what the client caching I described
> > does automatically. For example, suppose you have a cluster of
> > machines with nothing cached in memory yet. Then a Spark application
> > runs on it. Spark asks Tachyon where data X is. Since nothing is in
> > memory yet, Tachyon returns disk locations the first time. The Spark
> > program will then take advantage of disk data locality and load the
> > data X on HDFS node N into the off-heap memory of node N. In the
> > future, when Spark asks Tachyon for the location of X, Tachyon will
> > return node N. There is no network I/O involved in the whole process.
> > Let me know if I misunderstood something.
> >
> > Haoyuan
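For concreteness, a rough sketch of the read flow Haoyuan describes above, from the Spark side. The hostname, path, and the "fs.tachyon.impl" -> "tachyon.hadoop.TFS" registration are illustrative assumptions, and the package names are modern Spark's (the thread predates them); check the docs for the Tachyon version you run:

    // Sketch only: reading a data set through Tachyon so the first read
    // pulls blocks from disk (HDFS) and caches them on the reading nodes;
    // later reads are scheduled memory-local.
    import org.apache.spark.SparkContext

    object TachyonLocalitySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[2]", "tachyon-locality-sketch")
        // Assumption: Tachyon's Hadoop-compatible client is on the classpath.
        sc.hadoopConfiguration.set("fs.tachyon.impl", "tachyon.hadoop.TFS")

        // First job: Tachyon only knows disk locations for X, so Spark
        // schedules for disk locality and the read caches X in Tachyon
        // memory on those same nodes.
        val x = sc.textFile("tachyon://master:19998/data/X")
        println(x.count())

        // Later jobs asking Tachyon for X's location get the caching nodes
        // back, so tasks run memory-local with no network I/O.
        println(x.filter(_.nonEmpty).count())
        sc.stop()
      }
    }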
> > On Fri, Aug 30, 2013 at 10:00 AM, Evan Chan <[email protected]> wrote:
> >
> > > Hey guys,
> > >
> > > I would also prefer to strengthen and get behind Tachyon, rather
> > > than implement a separate solution (though I guess if it's not
> > > officially supported, then nobody will ask questions). But my
> > > feeling is more that off-heap memory is difficult, so it's better to
> > > focus efforts on one project.
> > >
> > > Haoyuan,
> > >
> > > Tachyon brings cached HDFS data to the local client. Have we thought
> > > about the opposite approach, which might be more efficient?
> > > - Load the data in HDFS node N into the off-heap memory of node N
> > > - In Spark, inform the framework (maybe via RDD partition/location
> > >   info) that the data is located on node N
> > > - Bring the computation to node N
> > >
> > > This avoids network I/O and may be much more efficient for many
> > > types of applications. I know this would be a big win for us.
> > >
> > > -Evan
> > >
> > > On Wed, Aug 28, 2013 at 1:37 AM, Haoyuan Li <[email protected]> wrote:
> > >
> > > > No problem. As with reading/writing data from/to an off-heap
> > > > ByteBuffer, when a program reads/writes data from/to Tachyon,
> > > > Spark/Shark needs to do ser/de. Efficient ser/de helps performance
> > > > a lot, as people pointed out. One solution is for the application
> > > > to do primitive operations directly on the ByteBuffer, like how
> > > > Shark handles it now. Most of the related code is located at
> > > > https://github.com/amplab/shark/tree/master/src/main/scala/shark/memstore2
> > > > and
> > > > https://github.com/amplab/shark/tree/master/src/tachyon_enabled/scala/shark/tachyon
> > > >
> > > > Haoyuan
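A minimal sketch of the "primitive operations directly on the ByteBuffer" idea: one int column kept in a single off-heap buffer and scanned without generic serialization. Shark's memstore2 is far richer (many column types, null handling, compression); the class and method names here are made up for illustration:

    import java.nio.ByteBuffer

    final class IntColumn(capacity: Int) {
      // Off-heap allocation: the buffer's contents are invisible to the GC.
      private val buf = ByteBuffer.allocateDirect(capacity * 4)

      def append(v: Int): Unit = buf.putInt(v)

      // Scan the column by walking the buffer directly; no objects are
      // materialized and nothing is deserialized.
      def sum(): Long = {
        val read = buf.duplicate()
        read.flip()
        var total = 0L
        while (read.hasRemaining) total += read.getInt()
        total
      }
    }

    object IntColumnDemo {
      def main(args: Array[String]): Unit = {
        val col = new IntColumn(1000)
        (1 to 1000).foreach(col.append)
        println(col.sum()) // 500500
      }
    }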
> > > > On Wed, Aug 28, 2013 at 1:21 AM, Imran Rashid <[email protected]> wrote:
> > > >
> > > > > Thanks Haoyuan. It seems like we should try out Tachyon; it
> > > > > sounds like it is what we are looking for.
> > > > >
> > > > > On Wed, Aug 28, 2013 at 8:18 AM, Haoyuan Li <[email protected]> wrote:
> > > > >
> > > > > > Response inline.
> > > > > >
> > > > > > On Tue, Aug 27, 2013 at 1:37 AM, Imran Rashid <[email protected]> wrote:
> > > > > >
> > > > > >> Thanks for all the great comments & discussion. Let me expand
> > > > > >> a bit on our use case, and then I'm gonna combine responses to
> > > > > >> various questions.
> > > > > >>
> > > > > >> In general, when we use Spark, we have some really big RDDs
> > > > > >> that use up a lot of memory (10s of GB per node) and that are
> > > > > >> really our "core" data sets. We tend to start up a Spark
> > > > > >> application, immediately load all those data sets, and just
> > > > > >> leave them loaded for the lifetime of that process. We
> > > > > >> definitely create a lot of other RDDs along the way, and lots
> > > > > >> of intermediate objects that we'd like to go through normal
> > > > > >> garbage collection. But those all require much less memory,
> > > > > >> maybe 1/10th of the big RDDs that we just keep around. I know
> > > > > >> this is a bit of a special case, but it seems like it probably
> > > > > >> isn't that different from a lot of use cases.
> > > > > >>
> > > > > >> Reynold Xin wrote:
> > > > > >> > This is especially attractive if the application can read
> > > > > >> > directly from a byte buffer without generic serialization
> > > > > >> > (like Shark).
> > > > > >>
> > > > > >> Interesting -- can you explain how this works in Shark? Do you
> > > > > >> have some general way of storing data in byte buffers that
> > > > > >> avoids serialization? Or do you mean that if the user is
> > > > > >> effectively creating an RDD of ints, you create an
> > > > > >> RDD[ByteBuffer], and then you read/write the ints into the
> > > > > >> byte buffer yourself? Sorry, I'm familiar with the basic idea
> > > > > >> of Shark but not the code at all -- even a pointer to the code
> > > > > >> would be helpful.
> > > > > >>
> > > > > >> Haoyuan Li wrote:
> > > > > >> > One possible solution is that you can use Tachyon
> > > > > >> > <https://github.com/amplab/tachyon>.
> > > > > >>
> > > > > >> This is a good idea that I had probably overlooked. There are
> > > > > >> two potential issues that I can think of with this approach,
> > > > > >> though:
> > > > > >> 1) I was under the impression that Tachyon is still not really
> > > > > >> tested in production systems, and I need something a bit more
> > > > > >> mature. Of course, my changes wouldn't be thoroughly tested
> > > > > >> either, but somehow I feel better about deploying my 5-line
> > > > > >> patch to a codebase I understand than adding another entire
> > > > > >> system. (This isn't a good reason to add this to Spark in
> > > > > >> general, though; it just might be a temporary patch we deploy
> > > > > >> locally.)
> > > > > >
> > > > > > This is a legitimate concern. The good news is that several
> > > > > > companies have been testing it for a while, and some are close
> > > > > > to putting it into production. For example, as Yahoo mentioned
> > > > > > in today's meetup, we are working to integrate Shark and
> > > > > > Tachyon closely, and the results are very promising. It will be
> > > > > > in production soon.
> > > > > >
> > > > > >> 2) I may have misunderstood Tachyon, but it seems there is a
> > > > > >> big difference in data locality between these two approaches.
> > > > > >> On a large cluster, HDFS will spread the data all over the
> > > > > >> cluster, so any particular piece of on-disk data will only
> > > > > >> live on a few machines. When you start a Spark application,
> > > > > >> which only uses a small subset of the nodes, odds are the data
> > > > > >> you want is *not* on those nodes. So even if Tachyon caches
> > > > > >> data from HDFS into memory, it won't be on the same nodes as
> > > > > >> the Spark application. Which means that when the Spark
> > > > > >> application reads data from the RDD, even though the data is
> > > > > >> in memory on some node in the cluster, it will need to be read
> > > > > >> over the network by the actual Spark worker assigned to the
> > > > > >> application. Is my understanding correct? I haven't done any
> > > > > >> measurements of the difference in performance, but it seems
> > > > > >> this would be much slower.
> > > > > >
> > > > > > This is a great question. Actually, from a data locality
> > > > > > perspective, the two approaches are no different. Tachyon does
> > > > > > client-side caching, which means that if a client on a node
> > > > > > reads data that is not on its local machine, the first read
> > > > > > will cache the data on that node. Therefore, all future
> > > > > > accesses on that node will read the data from its local memory.
> > > > > > For example, suppose you have a cluster with 100 nodes, all
> > > > > > running HDFS and Tachyon. Then you launch a Spark job running
> > > > > > on 20 nodes only. When it reads or caches the data for the
> > > > > > first time, all the data will be cached on those 20 nodes. In
> > > > > > the future, when the Spark master tries to schedule tasks, it
> > > > > > will query Tachyon about data locations and take advantage of
> > > > > > data locality automatically.
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Haoyuan
> > > > > >
> > > > > >> Mark Hamstra wrote:
> > > > > >> > What worries me is the combinatoric explosion of different
> > > > > >> > caching and persistence mechanisms.
> > > > > >>
> > > > > >> Great points, and I have no idea of the real advantages yet. I
> > > > > >> agree we'd need to actually observe an improvement to add yet
> > > > > >> another option. (I would really like some alternative to what
> > > > > >> I'm doing now, but maybe Tachyon is all I need ...)
> > > > > >>
> > > > > >> Reynold Xin wrote:
> > > > > >> > Mark - you don't necessarily need to construct a separate
> > > > > >> > storage level. One simple way to accomplish this is for the
> > > > > >> > user application to pass Spark a DirectByteBuffer.
> > > > > >>
> > > > > >> Hmm, that's true I suppose, but I had originally thought of
> > > > > >> making it another storage level, just for convenience &
> > > > > >> consistency. Couldn't you get rid of all the storage levels
> > > > > >> and just have the user apply various transformations to an
> > > > > >> RDD? E.g.,
> > > > > >>
> > > > > >> rdd.cache(MEMORY_ONLY_SER)
> > > > > >>
> > > > > >> could be
> > > > > >>
> > > > > >> rdd.map{x => serializer.serialize(x)}.cache(MEMORY_ONLY)
> > > > > >>
> > > > > >> And I agree with all of Lijie's comments that using off-heap
> > > > > >> memory is unsafe & difficult. But I feel that isn't a reason
> > > > > >> to completely disallow it, if there is a significant
> > > > > >> performance improvement. It would need to be clearly
> > > > > >> documented as an advanced feature with some risks involved.
> > > > > >>
> > > > > >> thanks,
> > > > > >> imran
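A sketch of the storage-level-as-transformation equivalence Imran describes just above. Java serialization is used only to keep the example self-contained (Kryo or a custom serializer would be the realistic choice), and the helper names are made up:

    import java.io._

    object SerdeSketch {
      // Serialize one RDD element to a byte array.
      def serialize(obj: AnyRef): Array[Byte] = {
        val bytes = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(bytes)
        out.writeObject(obj)
        out.close()
        bytes.toByteArray
      }

      // Restore an element from its serialized form.
      def deserialize[T](bytes: Array[Byte]): T = {
        val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
        in.readObject().asInstanceOf[T]
      }

      def main(args: Array[String]): Unit = {
        println(deserialize[String](serialize("hello"))) // hello
      }
    }

    // Usage inside a Spark program, mirroring Imran's example: instead of a
    // dedicated MEMORY_ONLY_SER level, cache the serialized bytes directly:
    //   val serialized = rdd.map(x => SerdeSketch.serialize(x)).cache()
    //   val restored   = serialized.map(b => SerdeSketch.deserialize[MyRecord](b))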
> > > > > >> On Mon, Aug 26, 2013 at 4:38 AM, Lijie Xu <[email protected]> wrote:
> > > > > >>
> > > > > >> > I remember that I talked about this off-heap approach with
> > > > > >> > Reynold in person several months ago. I think this approach
> > > > > >> > is attractive to Spark/Shark, since there are many large
> > > > > >> > objects in the JVM. But the main problem in original Spark
> > > > > >> > (without Tachyon support) is that it uses the same memory
> > > > > >> > space both for storing critical data and for processing
> > > > > >> > temporary data. Separating storage from processing is more
> > > > > >> > important than looking for a memory-efficient storage
> > > > > >> > technique, so I think this separation is the main
> > > > > >> > contribution of Tachyon.
> > > > > >> >
> > > > > >> > As for the off-heap approach, we are not the first to notice
> > > > > >> > this problem. Apache DirectMemory is promising, though not
> > > > > >> > mature currently. However, I think there are some problems
> > > > > >> > with using direct memory:
> > > > > >> >
> > > > > >> > 1) Unsafe. As in C++, there may be memory leaks. Users will
> > > > > >> > also be confused about setting the right memory-related
> > > > > >> > configurations, such as -Xmx and -XX:MaxDirectMemorySize.
> > > > > >> >
> > > > > >> > 2) Difficult. Designing an effective and efficient memory
> > > > > >> > management system is not an easy job. How to allocate,
> > > > > >> > replace, and reclaim objects at the right time and in the
> > > > > >> > right location is challenging. It's a bit similar to GC
> > > > > >> > algorithms.
> > > > > >> >
> > > > > >> > 3) Limited usage. It's useful for write-once-read-many-times
> > > > > >> > large objects but not for others.
> > > > > >> >
> > > > > >> > I also have two related questions:
> > > > > >> >
> > > > > >> > 1) Can the JVM's heap use virtual memory, or just physical
> > > > > >> > memory?
> > > > > >> >
> > > > > >> > 2) Can direct memory use virtual memory, or just physical
> > > > > >> > memory?
> > > > > >> >
> > > > > >> > On Mon, Aug 26, 2013 at 8:06 AM, Haoyuan Li <[email protected]> wrote:
> > > > > >> >
> > > > > >> >> Hi Imran,
> > > > > >> >>
> > > > > >> >> One possible solution is that you can use Tachyon
> > > > > >> >> <https://github.com/amplab/tachyon>. When data is in
> > > > > >> >> Tachyon, Spark jobs will read it from off-heap memory.
> > > > > >> >> Internally, it uses direct byte buffers to store
> > > > > >> >> memory-serialized RDDs, as you mentioned. Also, different
> > > > > >> >> Spark jobs can share the same data in Tachyon's memory.
> > > > > >> >> Here is a presentation (slide
> > > > > >> >> <https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F3138542%2FTachyon_2013-05-09_Spark_Meetup.pdf>)
> > > > > >> >> we did in May.
> > > > > >> >>
> > > > > >> >> Haoyuan
> > > > > >> >>
> > > > > >> >> On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid <[email protected]> wrote:
> > > > > >> >>
> > > > > >> >> > Hi,
> > > > > >> >> >
> > > > > >> >> > I was wondering if anyone has thought about putting the
> > > > > >> >> > cached data in an RDD into off-heap memory, e.g. w/
> > > > > >> >> > direct byte buffers. For really long-lived RDDs that use
> > > > > >> >> > a lot of memory, this seems like a huge improvement,
> > > > > >> >> > since all that memory is now totally ignored during GC.
> > > > > >> >> > (And reading data from direct byte buffers is potentially
> > > > > >> >> > faster as well, but that's just a nice bonus.)
> > > > > >> >> >
> > > > > >> >> > The easiest thing to do is to store memory-serialized
> > > > > >> >> > RDDs in direct byte buffers, but I guess we could also
> > > > > >> >> > store the serialized RDD on disk and use a memory-mapped
> > > > > >> >> > file. Serializing into off-heap buffers is a really
> > > > > >> >> > simple patch; I just changed a few lines (I haven't done
> > > > > >> >> > any real tests w/ it yet, though). But I don't really
> > > > > >> >> > have a ton of experience w/ off-heap memory, so I thought
> > > > > >> >> > I would ask what others think of the idea, if it makes
> > > > > >> >> > sense or if there are any gotchas I should be aware of,
> > > > > >> >> > etc.
> > > > > >> >> >
> > > > > >> >> > thanks,
> > > > > >> >> > Imran
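A minimal sketch of the first option in Imran's original message: keep a memory-serialized block in a direct (off-heap) buffer so the cached bytes are ignored by the garbage collector. The class and names are hypothetical, not Spark's block store:

    import java.nio.ByteBuffer

    final class OffHeapBlock(serialized: Array[Byte]) {
      // Copy the serialized block off the heap; the GC no longer scans it.
      private val buf: ByteBuffer = {
        val b = ByteBuffer.allocateDirect(serialized.length)
        b.put(serialized)
        b.flip()
        b
      }

      // Hand out a read-only view; deserialization reads straight from
      // off-heap memory.
      def bytes: ByteBuffer = buf.asReadOnlyBuffer()
    }

    object OffHeapBlockDemo {
      def main(args: Array[String]): Unit = {
        val block = new OffHeapBlock("a serialized RDD partition".getBytes("UTF-8"))
        val view = block.bytes
        val out = new Array[Byte](view.remaining())
        view.get(out)
        println(new String(out, "UTF-8"))
      }
    }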
> > > --
> > > Evan Chan
> > > Staff Engineer
> > > [email protected] | <http://www.ooyala.com/>
> > > <http://www.facebook.com/ooyala> <http://www.linkedin.com/company/ooyala> <http://www.twitter.com/ooyala>

> --
> Evan Chan
> Staff Engineer
> [email protected] | <http://www.ooyala.com/>
> <http://www.facebook.com/ooyala> <http://www.linkedin.com/company/ooyala> <http://www.twitter.com/ooyala>
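For completeness, the memory-mapped-file variant Imran mentions in his original message, as the alternative to direct byte buffers above. The file path is arbitrary; this is a sketch, not Spark's on-disk store:

    import java.io.{File, FileOutputStream, RandomAccessFile}
    import java.nio.channels.FileChannel

    object MmapBlockDemo {
      def main(args: Array[String]): Unit = {
        // Serialize the block to disk once.
        val f = File.createTempFile("rdd-block", ".bin")
        val out = new FileOutputStream(f)
        out.write("a serialized RDD partition".getBytes("UTF-8"))
        out.close()

        // Map it read-only: the OS page cache, not the JVM heap, holds the
        // hot bytes, and pages fault in on demand.
        val channel = new RandomAccessFile(f, "r").getChannel
        val mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
        val bytes = new Array[Byte](mapped.remaining())
        mapped.get(bytes)
        println(new String(bytes, "UTF-8"))
        channel.close()
      }
    }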
