Haoyuan,

Thanks, that sounds great, exactly what we are looking for.

We might be interested in integrating Tachyon with CFS (Cassandra File
System, the Cassandra-based implementation of HDFS).

-Evan

On Sat, Aug 31, 2013 at 3:33 PM, Haoyuan Li <[email protected]> wrote:

> Evan,
>
> If I understand you correctly, you want to avoid network I/O as much as
> possible by caching the data on the node that has the data on disk.
> Actually, this is what I meant by client caching: it does this
> automatically. For example, suppose you have a cluster of machines with
> nothing cached in memory yet. Then a Spark application runs on it. Spark
> asks Tachyon where data X is. Since nothing is in memory yet, Tachyon
> returns disk locations the first time. The Spark program then takes
> advantage of disk data locality and loads the data X on HDFS node N into
> the off-heap memory of node N. In the future, when Spark asks Tachyon for
> the location of X, Tachyon will return node N. There is no network I/O
> involved in the whole process. Let me know if I misunderstood something.
>
> Haoyuan
>
>
> On Fri, Aug 30, 2013 at 10:00 AM, Evan Chan <[email protected]> wrote:
>
> > Hey guys,
> >
> > I would also prefer to strengthen and get behind Tachyon, rather than
> > implement a separate solution (though I guess if it's not officially
> > supported, then nobody will ask questions). My feeling is that off-heap
> > memory is difficult, so it's better to focus efforts on one project.
> >
> > Haoyuan,
> >
> > Tachyon brings cached HDFS data to the local client. Have we thought
> > about the opposite approach, which might be more efficient?
> > - Load the data on HDFS node N into the off-heap memory of node N
> > - In Spark, inform the framework (maybe via RDD partition/location
> >   info) of where the data is, i.e. that it is located on node N
> > - Bring the computation to node N
> >
> > This avoids network I/O and may be much more efficient for many types
> > of applications. I know this would be a big win for us.
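[Editor's note: the "bring the computation to the data" idea above reduces to a replica-selection rule that is easy to state in code. A toy sketch follows; the helper, its signature, and the host names are hypothetical illustrations, not Tachyon's or Spark's actual API.]

```java
import java.util.Arrays;
import java.util.List;

public class LocalityPreference {
    // Hypothetical lookup result: the hosts that hold a copy of a block.
    // Prefer the local copy to avoid network I/O; otherwise read any replica.
    static String pickReadLocation(List<String> blockLocations, String localHost) {
        return blockLocations.contains(localHost) ? localHost : blockLocations.get(0);
    }

    public static void main(String[] args) {
        List<String> locations = Arrays.asList("node3", "node7", "node9");
        // A worker on node7 reads locally; a worker on node1 must go remote.
        System.out.println(pickReadLocation(locations, "node7")); // node7
        System.out.println(pickReadLocation(locations, "node1")); // node3
    }
}
```

After the first remote read, client-side caching adds the reading node to the block's location list, so subsequent scheduling sees a local copy.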
> >
> > -Evan
> >
> >
> > On Wed, Aug 28, 2013 at 1:37 AM, Haoyuan Li <[email protected]> wrote:
> >
> > > No problem. As with reading/writing data from/to an off-heap
> > > ByteBuffer, when a program reads/writes data from/to Tachyon,
> > > Spark/Shark needs to do ser/de. Efficient ser/de helps performance a
> > > lot, as people have pointed out. One solution is for the application
> > > to do primitive operations directly on the ByteBuffer, which is how
> > > Shark handles it now. Most of the related code is located at
> > > https://github.com/amplab/shark/tree/master/src/main/scala/shark/memstore2
> > > and
> > > https://github.com/amplab/shark/tree/master/src/tachyon_enabled/scala/shark/tachyon
> > >
> > > Haoyuan
> > >
> > >
> > > On Wed, Aug 28, 2013 at 1:21 AM, Imran Rashid <[email protected]> wrote:
> > >
> > > > Thanks Haoyuan. It seems like we should try out Tachyon; it sounds
> > > > like it is what we are looking for.
> > > >
> > > > On Wed, Aug 28, 2013 at 8:18 AM, Haoyuan Li <[email protected]> wrote:
> > > >
> > > > > Response inline.
> > > > >
> > > > > On Tue, Aug 27, 2013 at 1:37 AM, Imran Rashid <[email protected]> wrote:
> > > > >
> > > > > > Thanks for all the great comments & discussion. Let me expand a
> > > > > > bit on our use case, and then I'm going to combine responses to
> > > > > > various questions.
> > > > > >
> > > > > > In general, when we use Spark, we have some really big RDDs
> > > > > > that use up a lot of memory (10s of GB per node) and that are
> > > > > > really our "core" data sets. We tend to start up a Spark
> > > > > > application, immediately load all those data sets, and just
> > > > > > leave them loaded for the lifetime of that process. We
> > > > > > definitely create a lot of other RDDs along the way, and lots
> > > > > > of intermediate objects that we'd like to go through normal
> > > > > > garbage collection.
> > > > > > But those all require much less memory, maybe 1/10th of the
> > > > > > big RDDs that we just keep around. I know this is a bit of a
> > > > > > special case, but it seems like it probably isn't that
> > > > > > different from a lot of use cases.
> > > > > >
> > > > > > Reynold Xin wrote:
> > > > > > > This is especially attractive if the application can read
> > > > > > > directly from a byte buffer without generic serialization
> > > > > > > (like Shark).
> > > > > >
> > > > > > Interesting -- can you explain how this works in Shark? Do you
> > > > > > have some general way of storing data in byte buffers that
> > > > > > avoids serialization? Or do you mean that if the user is
> > > > > > effectively creating an RDD of ints, you create an
> > > > > > RDD[ByteBuffer] and then read/write the ints into the byte
> > > > > > buffer yourself? Sorry, I'm familiar with the basic idea of
> > > > > > Shark but not the code at all -- even a pointer to the code
> > > > > > would be helpful.
> > > > > >
> > > > > > Haoyuan Li wrote:
> > > > > > > One possible solution is that you can use
> > > > > > > Tachyon<https://github.com/amplab/tachyon>.
> > > > > >
> > > > > > This is a good idea that I had probably overlooked. There are
> > > > > > two potential issues that I can think of with this approach,
> > > > > > though:
> > > > > > 1) I was under the impression that Tachyon is still not really
> > > > > > tested in production systems, and I need something a bit more
> > > > > > mature. Of course, my changes wouldn't be thoroughly tested
> > > > > > either, but somehow I feel better about deploying my 5-line
> > > > > > patch to a codebase I understand than adding another entire
> > > > > > system. (This isn't a good reason to add this to Spark in
> > > > > > general, though; it just might be a temporary patch we deploy
> > > > > > locally.)
> > > > >
> > > > > This is a legitimate concern.
> > > > > The good news is that several companies have been testing it for
> > > > > a while, and some are close to putting it into production. For
> > > > > example, as Yahoo mentioned in today's meetup, we are working to
> > > > > integrate Shark and Tachyon closely, and the results are very
> > > > > promising. It will be in production soon.
> > > > >
> > > > > > 2) I may have misunderstood Tachyon, but it seems there is a
> > > > > > big difference in the data locality of these two approaches.
> > > > > > On a large cluster, HDFS will spread the data all over the
> > > > > > cluster, so any particular piece of on-disk data will live on
> > > > > > only a few machines. When you start a Spark application, which
> > > > > > only uses a small subset of the nodes, odds are the data you
> > > > > > want is *not* on those nodes. So even if Tachyon caches data
> > > > > > from HDFS into memory, it won't be on the same nodes as the
> > > > > > Spark application. That means that when the Spark application
> > > > > > reads data from the RDD, even though the data is in memory on
> > > > > > some node in the cluster, it will need to be read over the
> > > > > > network by the actual Spark worker assigned to the application.
> > > > > >
> > > > > > Is my understanding correct? I haven't done any measurements
> > > > > > at all of a difference in performance, but it seems this would
> > > > > > be much slower.
> > > > >
> > > > > This is a great question. Actually, from a data locality
> > > > > perspective, the two approaches have no difference. Tachyon does
> > > > > client-side caching, which means that if a client on a node reads
> > > > > data not on its local machine, the first read will cache the data
> > > > > on that node. Therefore, all future accesses on that node will
> > > > > read the data from its local memory. For example, suppose you
> > > > > have a cluster with 100 nodes all running HDFS and Tachyon.
> > > > > Then you launch a Spark job running on only 20 nodes. When it
> > > > > reads or caches the data for the first time, all of the data will
> > > > > be cached on those 20 nodes. In the future, when the Spark master
> > > > > tries to schedule tasks, it will query Tachyon about data
> > > > > locations and take advantage of data locality automatically.
> > > > >
> > > > > Best,
> > > > >
> > > > > Haoyuan
> > > > >
> > > > > > Mark Hamstra wrote:
> > > > > > > What worries me is the combinatoric explosion of different
> > > > > > > caching and persistence mechanisms.
> > > > > >
> > > > > > Great points, and I have no idea of the real advantages yet. I
> > > > > > agree we'd need to actually observe an improvement to add yet
> > > > > > another option. (I would really like some alternative to what
> > > > > > I'm doing now, but maybe Tachyon is all I need ...)
> > > > > >
> > > > > > Reynold Xin wrote:
> > > > > > > Mark - you don't necessarily need to construct a separate
> > > > > > > storage level. One simple way to accomplish this is for the
> > > > > > > user application to pass Spark a DirectByteBuffer.
> > > > > >
> > > > > > Hmm, that's true I suppose, but I had originally thought of
> > > > > > making it another storage level, just for convenience &
> > > > > > consistency. Couldn't you get rid of all the storage levels and
> > > > > > just have the user apply various transformations to an RDD?
> > > > > > E.g.
> > > > > >
> > > > > > rdd.cache(MEMORY_ONLY_SER)
> > > > > >
> > > > > > could be
> > > > > >
> > > > > > rdd.map{x => serializer.serialize(x)}.cache(MEMORY_ONLY)
> > > > > >
> > > > > > And I agree with all of Lijie's comments that using off-heap
> > > > > > memory is unsafe & difficult. But I feel that isn't a reason to
> > > > > > completely disallow it, if there is a significant performance
> > > > > > improvement.
> > > > > > It would need to be clearly documented as an advanced feature
> > > > > > with some risks involved.
> > > > > >
> > > > > > thanks,
> > > > > > imran
> > > > > >
> > > > > > On Mon, Aug 26, 2013 at 4:38 AM, Lijie Xu <[email protected]> wrote:
> > > > > >
> > > > > > > I remember that I talked about this off-heap approach with
> > > > > > > Reynold in person several months ago. I think this approach
> > > > > > > is attractive to Spark/Shark, since there are many large
> > > > > > > objects in the JVM. But the main problem in original Spark
> > > > > > > (without Tachyon support) is that it uses the same memory
> > > > > > > space both for storing critical data and for processing
> > > > > > > temporary data. Separating storing and processing is more
> > > > > > > important than looking for a memory-efficient storage
> > > > > > > technique. So I think this separation is the main
> > > > > > > contribution of Tachyon.
> > > > > > >
> > > > > > > As for the off-heap approach, we are not the first to notice
> > > > > > > this problem. Apache DirectMemory is promising, though not
> > > > > > > mature currently. However, I think there are some problems
> > > > > > > with using direct memory:
> > > > > > >
> > > > > > > 1) Unsafe. As in C++, there may be memory leaks. Users will
> > > > > > > also be confused about setting the right memory-related
> > > > > > > configurations such as -Xmx and -XX:MaxDirectMemorySize.
> > > > > > >
> > > > > > > 2) Difficult. Designing an effective and efficient memory
> > > > > > > management system is not an easy job. How to allocate,
> > > > > > > replace, and reclaim objects at the right time and in the
> > > > > > > right location is challenging. It's a bit similar to GC
> > > > > > > algorithms.
> > > > > > >
> > > > > > > 3) Limited usage. It's useful for write-once-read-many-times
> > > > > > > large objects, but not for others.
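[Editor's note: the write-once-read-many pattern in point 3 is exactly the access pattern direct buffers handle well: fill the buffer once, then hand out independent read-only views, so readers share the off-heap bytes without copies and without the mutation hazards point 1 warns about. A minimal sketch using only the JDK (this is not DirectMemory or Tachyon code):]

```java
import java.nio.ByteBuffer;

public class SharedReadOnly {
    public static void main(String[] args) {
        // Write once: fill an off-heap buffer with a few ints.
        ByteBuffer data = ByteBuffer.allocateDirect(4 * Integer.BYTES);
        for (int i = 0; i < 4; i++) data.putInt(i + 1);
        data.flip();

        // Read many: each reader gets an independent read-only view with
        // its own position/limit; the off-heap bytes are shared, not copied.
        ByteBuffer readerA = data.asReadOnlyBuffer();
        ByteBuffer readerB = data.asReadOnlyBuffer();

        System.out.println(readerA.getInt()); // 1
        System.out.println(readerB.getInt()); // 1 -- B's cursor is independent of A's
        System.out.println(readerA.getInt()); // 2
    }
}
```

Note that the off-heap bytes are reclaimed only when the owning buffer object is garbage collected, which is the leak hazard point 1 refers to, and the total is bounded by -XX:MaxDirectMemorySize rather than -Xmx.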
> > > > > > > I also have two related questions:
> > > > > > >
> > > > > > > 1) Can the JVM's heap use virtual memory, or just physical
> > > > > > > memory?
> > > > > > >
> > > > > > > 2) Can direct memory use virtual memory, or just physical
> > > > > > > memory?
> > > > > > >
> > > > > > > On Mon, Aug 26, 2013 at 8:06 AM, Haoyuan Li <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi Imran,
> > > > > > > >
> > > > > > > > One possible solution is that you can use
> > > > > > > > Tachyon<https://github.com/amplab/tachyon>.
> > > > > > > > When data is in Tachyon, Spark jobs will read it from
> > > > > > > > off-heap memory. Internally, it uses direct byte buffers to
> > > > > > > > store memory-serialized RDDs as you mentioned. Also,
> > > > > > > > different Spark jobs can share the same data in Tachyon's
> > > > > > > > memory. Here is a presentation
> > > > > > > > (slide<https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F3138542%2FTachyon_2013-05-09_Spark_Meetup.pdf>)
> > > > > > > > we did in May.
> > > > > > > >
> > > > > > > > Haoyuan
> > > > > > > >
> > > > > > > > On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I was wondering if anyone has thought about putting the
> > > > > > > > > cached data in an RDD into off-heap memory, e.g. with
> > > > > > > > > direct byte buffers. For really long-lived RDDs that use
> > > > > > > > > a lot of memory, this seems like a huge improvement,
> > > > > > > > > since all of that memory is now totally ignored during
> > > > > > > > > GC. (And reading data from direct byte buffers is
> > > > > > > > > potentially faster as well, but that's just a nice
> > > > > > > > > bonus.)
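[Editor's note: the core of the idea above, and of the Shark-style "primitive operations directly on the ByteBuffer" discussed earlier in the thread, can be shown with plain `java.nio`: the bytes live outside the GC-managed heap, and primitives are read and written without per-object Java serialization. A minimal self-contained illustration, not Spark or Tachyon API:]

```java
import java.nio.ByteBuffer;

public class OffHeapIntColumn {
    public static void main(String[] args) {
        int n = 1000;
        // Allocate off-heap memory; the GC never scans these bytes.
        ByteBuffer buf = ByteBuffer.allocateDirect(n * Integer.BYTES);

        // Write primitives directly -- no generic serialization of boxed objects.
        for (int i = 0; i < n; i++) {
            buf.putInt(i * 2);
        }
        buf.flip();

        // Read them back with primitive accessors.
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += buf.getInt();
        }
        System.out.println(sum); // 0 + 2 + ... + 1998 = 999000
    }
}
```

The small `ByteBuffer` wrapper object itself lives on the heap; only the backing storage is off-heap, which is why a long-lived cached partition stored this way adds essentially nothing to GC pause times.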
> > > > > > > > > The easiest thing to do is to store memory-serialized
> > > > > > > > > RDDs in direct byte buffers, but I guess we could also
> > > > > > > > > store the serialized RDD on disk and use a memory-mapped
> > > > > > > > > file. Serializing into off-heap buffers is a really
> > > > > > > > > simple patch; I just changed a few lines (I haven't done
> > > > > > > > > any real tests with it yet, though). But I don't really
> > > > > > > > > have a ton of experience with off-heap memory, so I
> > > > > > > > > thought I would ask what others think of the idea,
> > > > > > > > > whether it makes sense, whether there are any gotchas I
> > > > > > > > > should be aware of, etc.
> > > > > > > > >
> > > > > > > > > thanks,
> > > > > > > > > Imran
> >
> > --
> > Evan Chan
> > Staff Engineer
> > [email protected] | <http://www.ooyala.com/>

--
Evan Chan
Staff Engineer
[email protected] | <http://www.ooyala.com/>
<http://www.facebook.com/ooyala><http://www.linkedin.com/company/ooyala><http://www.twitter.com/ooyala>
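[Editor's note: the memory-mapped-file variant mentioned in the original message is also just a few lines of `java.nio`; the mapped region is backed by the OS page cache, entirely outside the JVM heap, so it is likewise invisible to GC. A rough sketch of the idea (illustrative only, not the actual patch):]

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("partition", ".bin");

        // Stand-in for a serialized partition written to disk.
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ByteBuffer out = ByteBuffer.allocate(3 * Integer.BYTES);
            out.putInt(10).putInt(20).putInt(30);
            out.flip();
            ch.write(out);
        }

        // Map it back: reads go through the OS page cache, not the JVM heap.
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            int sum = mapped.getInt() + mapped.getInt() + mapped.getInt();
            System.out.println(sum); // 10 + 20 + 30 = 60
        }
        Files.delete(file);
    }
}
```

Compared with `allocateDirect`, the mapped variant survives process restarts and lets the OS evict cold pages under memory pressure, at the cost of possible page faults on first access.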
