Thanks Haoyuan. It seems like we should try out Tachyon; it sounds like just what we are looking for.
On Wed, Aug 28, 2013 at 8:18 AM, Haoyuan Li <[email protected]> wrote:
> Response inline.
>
> On Tue, Aug 27, 2013 at 1:37 AM, Imran Rashid <[email protected]> wrote:
>> Thanks for all the great comments & discussion. Let me expand a bit
>> on our use case, and then I'm going to combine responses to various
>> questions.
>>
>> In general, when we use Spark, we have some really big RDDs that use
>> up a lot of memory (tens of GB per node) and that are really our
>> "core" data sets. We tend to start up a Spark application, immediately
>> load all those data sets, and just leave them loaded for the lifetime
>> of that process. We definitely create a lot of other RDDs along the
>> way, and lots of intermediate objects that we'd like to go through
>> normal garbage collection, but those all require much less memory --
>> maybe 1/10th of the big RDDs that we just keep around. I know this is
>> a bit of a special case, but it probably isn't that different from a
>> lot of use cases.
>>
>> Reynold Xin wrote:
>> > This is especially attractive if the application can read directly
>> > from a byte buffer without generic serialization (like Shark).
>>
>> Interesting -- can you explain how this works in Shark? Do you have
>> some general way of storing data in byte buffers that avoids
>> serialization? Or do you mean that if the user is effectively creating
>> an RDD of ints, you create an RDD[ByteBuffer] and then read / write
>> the ints into the byte buffer yourself? Sorry, I'm familiar with the
>> basic idea of Shark but not the code at all -- even a pointer to the
>> code would be helpful.
>>
>> Haoyuan Li wrote:
>> > One possible solution is that you can use
>> > Tachyon <https://github.com/amplab/tachyon>.
>>
>> This is a good idea that I had probably overlooked. There are two
>> potential issues that I can think of with this approach, though:
>>
>> 1) I was under the impression that Tachyon is still not really tested
>> in production systems, and I need something a bit more mature. Of
>> course, my changes wouldn't be thoroughly tested either, but somehow I
>> feel better about deploying my 5-line patch to a codebase I understand
>> than adding another entire system. (This isn't a good reason to add
>> this to Spark in general, though; it just might be a temporary patch
>> we deploy locally.)
>
> This is a legitimate concern. The good news is that several companies
> have been testing it for a while, and some are close to taking it to
> production. For example, as Yahoo mentioned in today's meetup, we are
> working to integrate Shark and Tachyon closely, and the results are
> very promising. It will be in production soon.
>
>> 2) I may have misunderstood Tachyon, but it seems there is a big
>> difference in data locality between these two approaches. On a large
>> cluster, HDFS will spread the data all over the cluster, so any
>> particular piece of on-disk data will only live on a few machines.
>> When you start a Spark application, which only uses a small subset of
>> the nodes, odds are the data you want is *not* on those nodes. So even
>> if Tachyon caches data from HDFS into memory, it won't be on the same
>> nodes as the Spark application, which means that when the application
>> reads data from the RDD, even though the data is in memory on some
>> node in the cluster, it will need to be read over the network by the
>> actual Spark worker assigned to the application. Is my understanding
>> correct? I haven't done any measurements of the difference in
>> performance, but it seems this would be much slower.
>
> This is a great question. Actually, from a data locality perspective,
> the two approaches are no different. Tachyon does client-side caching,
> which means that if a client on a node reads data that is not on its
> local machine, the first read will cache that data on the node.
> Therefore, all future accesses on that node will read the data from
> its local memory. For example, suppose you have a cluster with 100
> nodes, all running HDFS and Tachyon, and you launch a Spark job on
> only 20 of those nodes. When the job reads or caches the data for the
> first time, all of the data will be cached on those 20 nodes. From
> then on, when the Spark master tries to schedule tasks, it will query
> Tachyon for data locations and take advantage of data locality
> automatically.
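>
> As a minimal sketch of that access pattern (the master URL and dataset
> path here are hypothetical, and I'm assuming a Tachyon master on its
> default port, 19998):
>
>     // Spark 0.7.x-era package name; it moved to org.apache.spark
>     // in later releases.
>     import spark.SparkContext
>
>     val sc = new SparkContext("spark://master:7077", "tachyon-demo")
>
>     // The first pass reads from HDFS through Tachyon, and Tachyon's
>     // client-side caching pulls the blocks into local memory on the
>     // 20 worker nodes as they are read.
>     val lines = sc.textFile("tachyon://master:19998/datasets/events")
>     println(lines.count())
>
>     // Later passes are scheduled for locality: Spark queries Tachyon
>     // for block locations and runs tasks on the nodes that already
>     // hold the data in memory.
>     println(lines.filter(_.contains("error")).count())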
>
> Best,
>
> Haoyuan
>
>> Mark Hamstra wrote:
>> > What worries me is the combinatoric explosion of different caching
>> > and persistence mechanisms.
>>
>> Great points, and I have no idea of the real advantages yet. I agree
>> we'd need to actually observe an improvement before adding yet
>> another option. (I would really like some alternative to what I'm
>> doing now, but maybe Tachyon is all I need ...)
>>
>> Reynold Xin wrote:
>> > Mark - you don't necessarily need to construct a separate storage
>> > level. One simple way to accomplish this is for the user application
>> > to pass Spark a DirectByteBuffer.
>>
>> Hmm, that's true, I suppose, but I had originally thought of making it
>> another storage level, just for convenience & consistency. Couldn't
>> you get rid of all the storage levels and just have the user apply
>> various transformations to an RDD? E.g.,
>>
>>   rdd.cache(MEMORY_ONLY_SER)
>>
>> could be
>>
>>   rdd.map{x => serializer.serialize(x)}.cache(MEMORY_ONLY)
>>
>> And I agree with all of Lijie's comments that using off-heap memory is
>> unsafe & difficult. But I feel that isn't a reason to completely
>> disallow it if there is a significant performance improvement. It
>> would need to be clearly documented as an advanced feature with some
>> risks involved.
>>
>> thanks,
>> imran
>>
>> On Mon, Aug 26, 2013 at 4:38 AM, Lijie Xu <[email protected]> wrote:
>> > I remember that I talked about this off-heap approach with Reynold
>> > in person several months ago. I think this approach is attractive
>> > for Spark/Shark, since there are many large objects in the JVM. But
>> > the main problem in the original Spark (without Tachyon support) is
>> > that it uses the same memory space both for storing critical data
>> > and for processing temporary data. Separating storage from
>> > processing is more important than looking for a memory-efficient
>> > storage technique, so I think this separation is the main
>> > contribution of Tachyon.
>> >
>> > As for the off-heap approach, we are not the first to recognize this
>> > problem. Apache DirectMemory is promising, though not mature
>> > currently. However, I think there are some problems with using
>> > direct memory:
>> >
>> > 1) Unsafe. As in C++, there may be memory leaks. Users will also be
>> > confused about setting the right memory-related configurations, such
>> > as -Xmx and -XX:MaxDirectMemorySize.
>> >
>> > 2) Difficult. Designing an effective and efficient memory management
>> > system is not an easy job. How to allocate, replace, and reclaim
>> > objects at the right time and in the right location is challenging.
>> > It's a bit similar to GC algorithms.
>> >
>> > 3) Limited usage. It's useful for write-once-read-many-times large
>> > objects but not for others.
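>> >
>> > To sketch why those two settings confuse people (the sizes here are
>> > arbitrary): a process started with -Xmx8g and
>> > -XX:MaxDirectMemorySize=2g can die with an OutOfMemoryError even
>> > though its heap is nearly empty, because direct allocations are not
>> > counted against -Xmx:
>> >
>> >   import java.nio.ByteBuffer
>> >   // Each buffer lives outside the GC-managed heap.
>> >   val buffers = (1 to 64).map(_ => ByteBuffer.allocateDirect(64 << 20))
>> >   // With -XX:MaxDirectMemorySize=2g this throws
>> >   // java.lang.OutOfMemoryError: Direct buffer memory
>> >   // after about 32 of the 64 MB allocations, regardless of -Xmx.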
>> >
>> > I also have two related questions:
>> >
>> > 1) Can the JVM's heap use virtual memory, or only physical memory?
>> >
>> > 2) Can direct memory use virtual memory, or only physical memory?
>> >
>> > On Mon, Aug 26, 2013 at 8:06 AM, Haoyuan Li <[email protected]> wrote:
>> >> Hi Imran,
>> >>
>> >> One possible solution is that you can use
>> >> Tachyon <https://github.com/amplab/tachyon>.
>> >> When data is in Tachyon, Spark jobs will read it from off-heap
>> >> memory. Internally, it uses direct byte buffers to store
>> >> memory-serialized RDDs, as you mentioned. Also, different Spark jobs
>> >> can share the same data in Tachyon's memory. Here is a presentation
>> >> (slides <https://docs.google.com/viewer?url=http%3A%2F%2Ffiles.meetup.com%2F3138542%2FTachyon_2013-05-09_Spark_Meetup.pdf>)
>> >> we did in May.
>> >>
>> >> Haoyuan
>> >>
>> >> On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid <[email protected]> wrote:
>> >> > Hi,
>> >> >
>> >> > I was wondering if anyone has thought about putting the data cached
>> >> > in an RDD into off-heap memory, e.g. with direct byte buffers. For
>> >> > really long-lived RDDs that use a lot of memory, this seems like a
>> >> > huge improvement, since all of that memory is then completely
>> >> > ignored during GC. (Reading data from direct byte buffers is
>> >> > potentially faster as well, but that's just a nice bonus.)
>> >> >
>> >> > The easiest thing to do is to store memory-serialized RDDs in
>> >> > direct byte buffers, but I guess we could also store the serialized
>> >> > RDD on disk and use a memory-mapped file. Serializing into off-heap
>> >> > buffers is a really simple patch -- I just changed a few lines,
>> >> > though I haven't done any real tests with it yet. But I don't
>> >> > really have a ton of experience with off-heap memory, so I thought
>> >> > I would ask what others think of the idea, whether it makes sense,
>> >> > whether there are any gotchas I should be aware of, etc.
>> >> >
>> >> > thanks,
>> >> > Imran
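>> >> >
>> >> > P.S. A rough sketch of the idea (not the actual patch -- plain
>> >> > Java serialization for simplicity, and the helper name is made
>> >> > up):
>> >> >
>> >> >   import java.io.{ByteArrayOutputStream, ObjectOutputStream}
>> >> >   import java.nio.ByteBuffer
>> >> >
>> >> >   // Serialize a partition's elements, then copy the bytes into a
>> >> >   // direct buffer so they live outside the GC-managed heap.
>> >> >   def serializeOffHeap(values: Iterator[AnyRef]): ByteBuffer = {
>> >> >     val bos = new ByteArrayOutputStream()
>> >> >     val oos = new ObjectOutputStream(bos)
>> >> >     values.foreach(oos.writeObject)
>> >> >     oos.close()
>> >> >     val bytes = bos.toByteArray
>> >> >     val buf = ByteBuffer.allocateDirect(bytes.length) // off-heap
>> >> >     buf.put(bytes)
>> >> >     buf.flip() // ready for reads
>> >> >     buf
>> >> >   }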
