Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-16 14:39:02]:

> > We're talking about an environment which we're always trying to optimize. Imagine that we're always trying to consolidate guests onto smaller numbers of hosts. We're effectively in a state where we _always_ want new guests.
>
> If this came at no cost to the guests, you'd be right. But at some point guest performance will be hit by this, so the advantage gained from freeing memory will be balanced by the disadvantage. Also, memory is not the only resource. At some point you become cpu bound; at that point freeing memory doesn't help and in fact may increase your cpu load.

We'll probably need control over other resources as well, but IMHO memory is the most precious because it is non-renewable.

--
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/15/2010 05:47 PM, Dave Hansen wrote:

> > That's a bug that needs to be fixed. Eventually the host will come under pressure and will balloon the guest. If that kills the guest, the ballooning is not effective as a host memory management technique.
>
> I'm not convinced that it's just a bug that can be fixed. Consider a case where a host sees a guest with 100MB of free memory at the exact moment that a database app sees that memory. The host tries to balloon that memory away at the same time that the app goes and allocates it. That can certainly lead to an OOM very quickly, even for very small amounts of memory (much less than 100MB). Where's the bug? I think the issues are really fundamental to ballooning.

There are two issues involved. One is, can the kernel accurately determine the amount of memory it needs to work? We have resources such as RAM and swap. We have liabilities in the form of swappable userspace memory, mlocked userspace memory, kernel memory to support these, and various reclaimable and non-reclaimable kernel caches. Can we determine the minimum amount of RAM to support our workload at a point in time? If we had this, we could modify the balloon to refuse to balloon if it takes the kernel beneath the minimum amount of RAM needed.

In fact, this is similar to allocating memory with overcommit_memory = 0. The difference is that the balloon allocates mlocked memory, while normal allocations can be charged against swap. But fundamentally it's the same.

> > > If all the guests do this, then it leaves that much more free memory on the host, which can be used flexibly for extra host page cache, new guests, etc...
> >
> > If the host detects lots of pagecache misses it can balloon guests down. If pagecache is quiet, why change anything?
>
> Page cache misses alone are not really sufficient. This is the classic problem where we try to differentiate streaming I/O (which we can't effectively cache) from I/O which can be effectively cached.

True. Random I/O across a very large dataset is also difficult to cache.

> > If the host wants to start new guests, it can balloon guests down. If no new guests are wanted, why change anything?
>
> We're talking about an environment which we're always trying to optimize. Imagine that we're always trying to consolidate guests onto smaller numbers of hosts. We're effectively in a state where we _always_ want new guests.

If this came at no cost to the guests, you'd be right. But at some point guest performance will be hit by this, so the advantage gained from freeing memory will be balanced by the disadvantage. Also, memory is not the only resource. At some point you become cpu bound; at that point freeing memory doesn't help and in fact may increase your cpu load.

--
error compiling committee.c: too many arguments to function
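Avi's question above — can the kernel compute the minimum RAM the current workload needs, and can the balloon refuse to go below it — can be sketched from meminfo-style counters. This is a hypothetical heuristic for illustration, not anything in the patch: the choice of fields and the idea of a "balloon floor" are assumptions.

```python
# Sketch: estimate a working-set floor from meminfo-style counters (kB)
# and reject balloon targets beneath it. Field choice is illustrative.

def min_ram_floor_kb(meminfo):
    """Lower bound on RAM the guest needs (assumed heuristic).
    Counts memory that cannot be reclaimed or swapped out."""
    # Anonymous memory that remaining swap cannot absorb.
    unswappable_anon = max(meminfo["AnonPages"] - meminfo["SwapFree"], 0)
    return (meminfo["Mlocked"]          # pinned userspace pages
            + meminfo["SUnreclaim"]     # non-reclaimable slab
            + meminfo["KernelStack"]    # kernel stacks
            + unswappable_anon)

def balloon_target_ok(meminfo, target_kb):
    """Accept a balloon target only if it leaves the floor intact."""
    return target_kb >= min_ram_floor_kb(meminfo)

sample = {"AnonPages": 400_000, "SwapFree": 100_000,
          "Mlocked": 50_000, "SUnreclaim": 30_000, "KernelStack": 5_000}
```

On a live guest these values would come from /proc/meminfo; the sample numbers here are made up.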
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/14/2010 08:45 PM, Balbir Singh wrote:

> > There are two decisions that need to be made:
> >
> > - how much memory a guest should be given
> > - given some guest memory, what's the best use for it
> >
> > The first question can perhaps be answered by looking at guest I/O rates and giving more memory to more active guests. The second question is hard, but not any different than running non-virtualized - except if we can detect sharing or duplication. In this case, dropping a duplicated page is worthwhile, while dropping a shared page provides no benefit.
>
> I think there is another way of looking at it, give some free memory
>
> 1. Can the guest run more applications or run faster

That's my second question: how to best use this memory. More applications == drop the page from cache, faster == keep page in cache. All we need is to select the right page to drop.

> 2. Can the host potentially get this memory via ballooning or some other means to start newer guest instances

Well, we already have ballooning. The question is can we improve the eviction algorithm.

> I think the answer to 1 and 2 is yes.

How the patch helps answer either question, I'm not sure.

> > I don't think preferential dropping of unmapped page cache is the answer.
>
> Preferential dropping as selected by the host, that knows about the setup and if there is duplication involved. While we use the term preferential dropping, remember it is still via the LRU and we don't always succeed. It is a best-effort (if you can, and the unmapped pages are not highly referenced) scenario.

How can the host tell if there is duplication? It may know it has some pagecache, but it has no idea whether or to what extent guest pagecache duplicates host pagecache.

> > Those tell you how to balance going after the different classes of things that we can reclaim. Again, this is useless when ballooning is being used. But, I'm thinking of a more general mechanism to force the system to both have MemFree _and_ be acting as if it is under memory pressure.

If there is no memory pressure on the host, there is no reason for the guest to pretend it is under pressure. If there is memory pressure on the host, it should share the pain among its guests by applying the balloon. So I don't think voluntarily dropping cache is a good direction.

> There are two situations
>
> 1. Voluntarily drop cache, if it was setup to do so (the host knows that it caches that information anyway)

It doesn't, really. The host only has aggregate information about itself, and no information about the guest. Dropping duplicate pages would be good if we could identify them. Even then, it's better to drop the page from the host, not the guest, unless we know the same page is cached by multiple guests.

But why would the guest voluntarily drop the cache? If there is no memory pressure, dropping caches increases cpu overhead and latency even if the data is still cached on the host.

> 2. Drop the cache on either a special balloon option, again the host knows it caches that very same information, so it prefers to free that up first.

Dropping in response to pressure is good. I'm just not convinced the patch helps in selecting the correct page to drop.

--
error compiling committee.c: too many arguments to function
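For concreteness, the "unmapped page cache" the thread keeps referring to can be roughly approximated from /proc/meminfo as Cached minus Mapped. A sketch — the subtraction is only an estimate, since Cached also counts shmem and Mapped counts file pages with at least one mapping:

```python
def unmapped_page_cache_kb(meminfo_text):
    """Approximate unmapped page cache (kB) as Cached - Mapped.
    Rough proxy only: Cached includes shmem pages too."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, rest = line.split(":", 1)
        fields[key.strip()] = int(rest.split()[0])   # value is in kB
    return max(fields["Cached"] - fields["Mapped"], 0)

# Made-up sample; a real tool would read open("/proc/meminfo").read().
sample = """MemTotal:      2048000 kB
Cached:         600000 kB
Mapped:         150000 kB"""
```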
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/14/2010 08:58 PM, Dave Hansen wrote:

> On Mon, 2010-06-14 at 19:34 +0300, Avi Kivity wrote:
> > > Again, this is useless when ballooning is being used. But, I'm thinking of a more general mechanism to force the system to both have MemFree _and_ be acting as if it is under memory pressure.
> >
> > If there is no memory pressure on the host, there is no reason for the guest to pretend it is under pressure.
>
> I can think of quite a few places where this would be beneficial. Ballooning is dangerous. I've OOMed quite a few guests by over-ballooning them. Anything that's voluntary like this is safer than things imposed by the host, although you do trade off effectiveness.

That's a bug that needs to be fixed. Eventually the host will come under pressure and will balloon the guest. If that kills the guest, the ballooning is not effective as a host memory management technique. Trying to defer ballooning by voluntarily dropping cache is simply trying to defer being bitten by the bug.

> If all the guests do this, then it leaves that much more free memory on the host, which can be used flexibly for extra host page cache, new guests, etc...

If the host detects lots of pagecache misses it can balloon guests down. If pagecache is quiet, why change anything? If the host wants to start new guests, it can balloon guests down. If no new guests are wanted, why change anything? etc...

> A system in this state where everyone is proactively keeping their footprints down is more likely to be able to handle load spikes.

That is true. But from the guest's point of view, voluntarily giving up memory means dropping the guest's cushion against load spikes.

> Reclaim is an expensive, costly activity, and this ensures that we don't have to do that when we're busy doing other things like handling load spikes.

The guest doesn't want to reclaim memory from the host when it's under a load spike either.

> This was one of the concepts behind CMM2: reduce the overhead during peak periods.

Ah, but CMM2 actually reduced work being done by sharing information between guest and host.

> It's also handy for planning. Guests exhibiting this behavior will _act_ as if they're under pressure. That's a good thing to approximate how a guest will act when it _is_ under pressure.

If a guest acts as if it is under pressure, then it will be slower and consume more cpu. Bad for both guest and host.

> > If there is memory pressure on the host, it should share the pain among its guests by applying the balloon. So I don't think voluntarily dropping cache is a good direction.
>
> I think we're trying to consider things slightly outside of ballooning at this point. If ballooning was the end-all solution, I'm fairly sure Balbir wouldn't be looking at this stuff. Just trying to keep options open. :)

I see this as an extension to ballooning - perhaps I'm missing the big picture. I would dearly love to have CMM2, where decisions are made on a per-page basis instead of using heuristics.

--
error compiling committee.c: too many arguments to function
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/14/2010 08:40 PM, Balbir Singh wrote:

> * Avi Kivity a...@redhat.com [2010-06-14 18:34:58]:
> > On 06/14/2010 06:12 PM, Dave Hansen wrote:
> > > On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote:
> > > > 1. A slab page will not be freed until the entire page is free (all slabs have been kfree'd, so to speak). Normal reclaim will definitely free this page, but a lot of it depends on how frequently we are scanning the LRU list and when this page got added.
> > >
> > > You don't have to be freeing entire slab pages for the reclaim to have been useful. You could just be making space so that _future_ allocations fill in the slab holes you just created. You may not be freeing pages, but you're reducing future system pressure.
> >
> > Depends. If you've evicted something that will be referenced soon, you're increasing system pressure.
>
> I don't think slab pages care about being referenced soon, they are either allocated or freed.

A page is just a storage unit for the data structure; a new one can be allocated on demand. If we're talking just about slab pages, I agree. If we're applying pressure on the shrinkers, then you are removing live objects which can be costly to reinstantiate.

--
error compiling committee.c: too many arguments to function
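The slab point being argued — a page is only freeable once every object on it is freed, yet partially empty pages still absorb future allocations — can be shown with a toy model. This is purely illustrative; it is not the kernel's slab allocator:

```python
class SlabPage:
    """Toy slab page: fixed object slots; freeable only when all are free."""
    def __init__(self, nr_objects):
        self.free_slots = set()                  # holes available for reuse
        self.in_use = set(range(nr_objects))     # allocated objects

    def kfree(self, obj):
        """Free one object; the page itself may still be pinned."""
        self.in_use.discard(obj)
        self.free_slots.add(obj)

    def kmalloc(self):
        """Future allocations fill holes first - Dave's point that reclaim
        helps even when it frees no whole page."""
        obj = self.free_slots.pop()
        self.in_use.add(obj)
        return obj

    def reclaimable(self):
        """The whole page can be returned only when every slot is free."""
        return not self.in_use
```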
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/14/2010 08:16 PM, Balbir Singh wrote:

> * Dave Hansen d...@linux.vnet.ibm.com [2010-06-14 10:09:31]:
> > On Mon, 2010-06-14 at 22:28 +0530, Balbir Singh wrote:
> > > If you've got duplicate pages and you know that they are duplicated and can be retrieved at a lower cost, why wouldn't we go after them first?
> >
> > I agree with this in theory. But, the guest lacks the information about what is truly duplicated and what the costs are for itself and/or the host to recreate it. Unmapped page cache may be the best proxy that we have at the moment for easy-to-recreate, but I think it's still too poor a match to make these patches useful.
>
> That is why the policy (in the next set) will come from the host. As to whether the data is truly duplicated, my experiments show up to 60% of the page cache is duplicated.

Isn't that incredibly workload dependent? We can't expect the host admin to know whether duplication will occur or not.

--
error compiling committee.c: too many arguments to function
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-15 09:58:33]:

> On 06/14/2010 08:45 PM, Balbir Singh wrote:
> > > There are two decisions that need to be made:
> > >
> > > - how much memory a guest should be given
> > > - given some guest memory, what's the best use for it
> > >
> > > The first question can perhaps be answered by looking at guest I/O rates and giving more memory to more active guests. The second question is hard, but not any different than running non-virtualized - except if we can detect sharing or duplication. In this case, dropping a duplicated page is worthwhile, while dropping a shared page provides no benefit.
> >
> > I think there is another way of looking at it, give some free memory
> >
> > 1. Can the guest run more applications or run faster
>
> That's my second question: how to best use this memory. More applications == drop the page from cache, faster == keep page in cache. All we need is to select the right page to drop.

Do we need to go down to the granularity of the individual page to drop? I think figuring out the class of pages, and making sure that we don't write our own reclaim logic but work with what we have to identify that class of pages, is a good start.

> > 2. Can the host potentially get this memory via ballooning or some other means to start newer guest instances
>
> Well, we already have ballooning. The question is can we improve the eviction algorithm.
>
> > I think the answer to 1 and 2 is yes.
>
> How the patch helps answer either question, I'm not sure.
>
> > > I don't think preferential dropping of unmapped page cache is the answer.
> >
> > Preferential dropping as selected by the host, that knows about the setup and if there is duplication involved. While we use the term preferential dropping, remember it is still via the LRU and we don't always succeed. It is a best-effort (if you can, and the unmapped pages are not highly referenced) scenario.
>
> How can the host tell if there is duplication? It may know it has some pagecache, but it has no idea whether or to what extent guest pagecache duplicates host pagecache.

Well, it is possible in host user space. I, for example, use the memory cgroup, and through the stats I have a good idea of how much is duplicated. I am of course making an assumption, with my setup of the cached mode, that the data in the guest page cache and the page cache in the cgroup will be duplicated to a large extent. I did some trivial experiments, like dropping the data from the guest and looking at the cost of bringing it in, and dropping the data from both guest and host and looking at the cost. I could see a difference. Unfortunately, I did not save the data, so I'll need to redo the experiment.

> > Those tell you how to balance going after the different classes of things that we can reclaim. Again, this is useless when ballooning is being used. But, I'm thinking of a more general mechanism to force the system to both have MemFree _and_ be acting as if it is under memory pressure.
>
> If there is no memory pressure on the host, there is no reason for the guest to pretend it is under pressure. If there is memory pressure on the host, it should share the pain among its guests by applying the balloon. So I don't think voluntarily dropping cache is a good direction.

> > There are two situations
> >
> > 1. Voluntarily drop cache, if it was setup to do so (the host knows that it caches that information anyway)
>
> It doesn't, really. The host only has aggregate information about itself, and no information about the guest. Dropping duplicate pages would be good if we could identify them. Even then, it's better to drop the page from the host, not the guest, unless we know the same page is cached by multiple guests.

On the exact pages to drop, please see my comments above on the class of pages to drop. There are reasons for wanting to get the host to cache the data:

1. Unless the guest is using cache=none, the data will still hit the host page cache
2. The host can do a better job of optimizing the writeouts

> But why would the guest voluntarily drop the cache? If there is no memory pressure, dropping caches increases cpu overhead and latency even if the data is still cached on the host.

So, there are basically two approaches:

1. First patch, proactive - enabled by a boot option
2. When ballooned, we try to (please NOTE, try to) reclaim cached pages first. Failing that, we go after regular pages in the alloc_page() call in the balloon driver.

> > 2. Drop the cache on either a special balloon option, again the host knows it caches that very same information, so it prefers to free that up first.
>
> Dropping in response to pressure is good. I'm just not convinced the patch helps in selecting the correct page to drop.

That is why I've presented data on the experiments I've run and provided more arguments to back up the approach.

--
Three Cheers,
Balbir
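Balbir's approach 2 — on balloon inflation, try unmapped page cache first and fall back to regular reclaim — might be sketched as follows. This is illustrative pseudologic, not the balloon driver's actual code; the page dictionaries and field names are invented:

```python
def pick_balloon_victim(pages):
    """Prefer unmapped page-cache pages; fall back to any reclaimable page.
    Within a class, approximate LRU by evicting the oldest last_use first."""
    unmapped_cache = [p for p in pages
                      if p["pagecache"] and not p["mapped"]]
    candidates = unmapped_cache or [p for p in pages if p["reclaimable"]]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p["last_use"])

# Hypothetical pages: A is unmapped cache but recently used; B is mapped
# cache and old; C is an old mapped anonymous page.
pages = [
    {"id": "A", "pagecache": True,  "mapped": False, "reclaimable": True, "last_use": 90},
    {"id": "B", "pagecache": True,  "mapped": True,  "reclaimable": True, "last_use": 10},
    {"id": "C", "pagecache": False, "mapped": True,  "reclaimable": True, "last_use": 50},
]
```

Note that the sketch picks A even though B is far older — which is exactly the objection raised later in the thread about evicting a recently used unmapped page ahead of an LRU mapped one.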
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-15 10:12:44]:

> On 06/14/2010 08:16 PM, Balbir Singh wrote:
> > * Dave Hansen d...@linux.vnet.ibm.com [2010-06-14 10:09:31]:
> > > On Mon, 2010-06-14 at 22:28 +0530, Balbir Singh wrote:
> > > > If you've got duplicate pages and you know that they are duplicated and can be retrieved at a lower cost, why wouldn't we go after them first?
> > >
> > > I agree with this in theory. But, the guest lacks the information about what is truly duplicated and what the costs are for itself and/or the host to recreate it. Unmapped page cache may be the best proxy that we have at the moment for easy-to-recreate, but I think it's still too poor a match to make these patches useful.
> >
> > That is why the policy (in the next set) will come from the host. As to whether the data is truly duplicated, my experiments show up to 60% of the page cache is duplicated.
>
> Isn't that incredibly workload dependent? We can't expect the host admin to know whether duplication will occur or not.

I was referring to the cache = (policy) we use based on the setup. I don't think the duplication is too workload specific. Moreover, we could use aggressive policies and restrict page cache usage, or do it selectively on ballooning. We could also add other options to make the ballooning option truly optional, so that the system management software decides.

--
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/15/2010 10:49 AM, Balbir Singh wrote:

> > All we need is to select the right page to drop.
>
> Do we need to go down to the granularity of the individual page to drop? I think figuring out the class of pages, and making sure that we don't write our own reclaim logic but work with what we have to identify that class of pages, is a good start.

Well, the class of pages is 'pages that are duplicated on the host'. Unmapped page cache pages are 'pages that might be duplicated on the host'. IMO, that's not close enough.

> > How can the host tell if there is duplication? It may know it has some pagecache, but it has no idea whether or to what extent guest pagecache duplicates host pagecache.
>
> Well, it is possible in host user space. I, for example, use the memory cgroup, and through the stats I have a good idea of how much is duplicated. I am of course making an assumption, with my setup of the cached mode, that the data in the guest page cache and the page cache in the cgroup will be duplicated to a large extent. I did some trivial experiments, like dropping the data from the guest and looking at the cost of bringing it in, and dropping the data from both guest and host and looking at the cost. I could see a difference. Unfortunately, I did not save the data, so I'll need to redo the experiment.

I'm sure we can detect it experimentally, but how do we do it programmatically at run time (without dropping all the pages)? Situations change, and I don't think we can infer from a few experiments that we'll have a similar amount of sharing. The cost of an incorrect decision is too high IMO (not that I think the kernel always chooses the right pages now, but I'd like to avoid regressions from the unvirtualized state).

btw, when running with a disk controller that has a very large cache, we might also see duplication between guest and host. So, if this is a good idea, it shouldn't be enabled just for virtualization, but for any situation where we have a sizeable cache behind us.

> > It doesn't, really. The host only has aggregate information about itself, and no information about the guest. Dropping duplicate pages would be good if we could identify them. Even then, it's better to drop the page from the host, not the guest, unless we know the same page is cached by multiple guests.
>
> On the exact pages to drop, please see my comments above on the class of pages to drop.

Well, we disagree about that. There is some value in dropping duplicated pages (not always), but that's not what the patch does. It drops unmapped pagecache pages, which may or may not be duplicated.

> There are reasons for wanting to get the host to cache the data:

There are also reasons to get the guest to cache the data - it's more efficient to access it in the guest.

> 1. Unless the guest is using cache=none, the data will still hit the host page cache
> 2. The host can do a better job of optimizing the writeouts

True, especially for non-raw storage. But even there we have to fsync all the time to keep the metadata right.

> > But why would the guest voluntarily drop the cache? If there is no memory pressure, dropping caches increases cpu overhead and latency even if the data is still cached on the host.
>
> So, there are basically two approaches:
>
> 1. First patch, proactive - enabled by a boot option
> 2. When ballooned, we try to (please NOTE, try to) reclaim cached pages first. Failing that, we go after regular pages in the alloc_page() call in the balloon driver.

Doesn't that mean you may evict a recently used unmapped page ahead of an LRU mapped page, just in the hope that it is double-cached? Maybe we need the guest and host to talk to each other about which pages to keep.

> > > 2. Drop the cache on either a special balloon option, again the host knows it caches that very same information, so it prefers to free that up first.
> >
> > Dropping in response to pressure is good. I'm just not convinced the patch helps in selecting the correct page to drop.
>
> That is why I've presented data on the experiments I've run and provided more arguments to back up the approach.

I'm still unconvinced, sorry.

--
error compiling committee.c: too many arguments to function
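Balbir's memcg-based estimate illustrates the limit Avi points at: from aggregate stats, the best one can say programmatically is that duplicated cache cannot exceed the smaller of the two sides. A sketch (the memory.stat `cache` field, in bytes, is from the cgroup v1 memory controller; the min() is an upper bound, not a measurement):

```python
def memcg_cache_kb(memory_stat_text):
    """Pull the page-cache charge out of a memcg memory.stat dump.
    cgroup v1 reports 'cache' in bytes; convert to kB."""
    for line in memory_stat_text.splitlines():
        key, value = line.split()
        if key == "cache":
            return int(value) // 1024
    return 0

def duplication_upper_bound_kb(guest_cached_kb, memory_stat_text):
    """Duplicated cache can't exceed either the guest's page cache or the
    host-side cache charged to the guest's cgroup. This bounds, but does
    not measure, actual duplication."""
    return min(guest_cached_kb, memcg_cache_kb(memory_stat_text))

# Made-up sample; on a host one would read the guest cgroup's memory.stat.
stat = "cache 614400000\nrss 104857600"
```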
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/15/2010 10:52 AM, Balbir Singh wrote:

> > > That is why the policy (in the next set) will come from the host. As to whether the data is truly duplicated, my experiments show up to 60% of the page cache is duplicated.
> >
> > Isn't that incredibly workload dependent? We can't expect the host admin to know whether duplication will occur or not.
>
> I was referring to the cache = (policy) we use based on the setup. I don't think the duplication is too workload specific. Moreover, we could use aggressive policies and restrict page cache usage, or do it selectively on ballooning. We could also add other options to make the ballooning option truly optional, so that the system management software decides.

Consider a read-only workload that exactly fits in the guest cache. Without trimming, the guest will keep hitting its own cache, and the host will see no access to the cache at all. So the host (assuming it is under even low pressure) will evict those pages, and the guest will happily use its own cache. If we start to trim, the guest will have to go to disk. That's the best case.

Now for the worst case: a random access workload that misses the cache on both guest and host. Now every page is duplicated, and trimming guest pages allows the host to increase its cache and potentially reduce misses. In this case trimming duplicated pages works.

Real life will see a mix of this. Often-used pages won't be duplicated, and less often used pages may see some duplication, especially if the host cache portion dedicated to the guest is bigger than the guest cache.

I can see that trimming duplicate pages helps, but (a) I'd like to be sure they are duplicates, and (b) often trimming them from the host is better than trimming them from the guest. Trimming from the guest is worthwhile if the pages are not used very often (but enough that caching them in the host is worth it) and if the host cache can serve more than one guest. If we can identify those pages, we don't risk degrading best-case workloads (as defined above). (Note that ksm to some extent identifies those pages, though it is a bit expensive, and doesn't share with the host pagecache.)

--
error compiling committee.c: too many arguments to function
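Avi's best and worst cases can be made concrete with a toy expected-cost model. The latencies are invented round numbers; only the structure of the argument matters:

```python
def avg_read_cost_us(hit_guest, hit_host,
                     guest_us=1.0, host_us=30.0, disk_us=5000.0):
    """Expected cost of one read given guest/host cache hit rates.
    A guest miss falls through to the host cache, then to disk."""
    miss = 1.0 - hit_guest
    return (hit_guest * guest_us
            + miss * (hit_host * host_us + (1.0 - hit_host) * disk_us))

# Best case for the guest: a read-only set that exactly fits guest cache.
# Trimming can only make things worse, even if the host caches everything.
fits_untrimmed = avg_read_cost_us(hit_guest=1.0, hit_host=0.0)   # 1.0 us
fits_trimmed   = avg_read_cost_us(hit_guest=0.5, hit_host=1.0)   # 15.5 us

# Worst case: both caches miss. Freeing guest pages lets the host cache
# grow, raising hit_host and cutting the expected cost.
```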
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-15 12:44:31]:

> On 06/15/2010 10:49 AM, Balbir Singh wrote:
> > > All we need is to select the right page to drop.
> >
> > Do we need to go down to the granularity of the individual page to drop? I think figuring out the class of pages, and making sure that we don't write our own reclaim logic but work with what we have to identify that class of pages, is a good start.
>
> Well, the class of pages is 'pages that are duplicated on the host'. Unmapped page cache pages are 'pages that might be duplicated on the host'. IMO, that's not close enough.

Agreed, but what happens in reality with the code is that it drops not-so-frequently-used cache (still reusing the reclaim mechanism), while prioritizing cached memory.

> > > How can the host tell if there is duplication? It may know it has some pagecache, but it has no idea whether or to what extent guest pagecache duplicates host pagecache.
> >
> > Well, it is possible in host user space. I, for example, use the memory cgroup, and through the stats I have a good idea of how much is duplicated. I am of course making an assumption, with my setup of the cached mode, that the data in the guest page cache and the page cache in the cgroup will be duplicated to a large extent. I did some trivial experiments, like dropping the data from the guest and looking at the cost of bringing it in, and dropping the data from both guest and host and looking at the cost. I could see a difference. Unfortunately, I did not save the data, so I'll need to redo the experiment.
>
> I'm sure we can detect it experimentally, but how do we do it programmatically at run time (without dropping all the pages)? Situations change, and I don't think we can infer from a few experiments that we'll have a similar amount of sharing. The cost of an incorrect decision is too high IMO (not that I think the kernel always chooses the right pages now, but I'd like to avoid regressions from the unvirtualized state).
>
> btw, when running with a disk controller that has a very large cache, we might also see duplication between guest and host. So, if this is a good idea, it shouldn't be enabled just for virtualization, but for any situation where we have a sizeable cache behind us.

It depends; once the disk controller has the cache and the pages in the guest are not-so-frequently-used, we can drop them. Please remember we still use the LRU to identify these pages.

> > > It doesn't, really. The host only has aggregate information about itself, and no information about the guest. Dropping duplicate pages would be good if we could identify them. Even then, it's better to drop the page from the host, not the guest, unless we know the same page is cached by multiple guests.
> >
> > On the exact pages to drop, please see my comments above on the class of pages to drop.
>
> Well, we disagree about that. There is some value in dropping duplicated pages (not always), but that's not what the patch does. It drops unmapped pagecache pages, which may or may not be duplicated.
>
> > There are reasons for wanting to get the host to cache the data:
>
> There are also reasons to get the guest to cache the data - it's more efficient to access it in the guest.
>
> > 1. Unless the guest is using cache=none, the data will still hit the host page cache
> > 2. The host can do a better job of optimizing the writeouts
>
> True, especially for non-raw storage. But even there we have to fsync all the time to keep the metadata right.
>
> > > But why would the guest voluntarily drop the cache? If there is no memory pressure, dropping caches increases cpu overhead and latency even if the data is still cached on the host.
> >
> > So, there are basically two approaches:
> >
> > 1. First patch, proactive - enabled by a boot option
> > 2. When ballooned, we try to (please NOTE, try to) reclaim cached pages first. Failing that, we go after regular pages in the alloc_page() call in the balloon driver.
>
> Doesn't that mean you may evict a recently used unmapped page ahead of an LRU mapped page, just in the hope that it is double-cached? Maybe we need the guest and host to talk to each other about which pages to keep.

Yeah.. I guess that falls into the domain of CMM.

> > > > 2. Drop the cache on either a special balloon option, again the host knows it caches that very same information, so it prefers to free that up first.
> > >
> > > Dropping in response to pressure is good. I'm just not convinced the patch helps in selecting the correct page to drop.
> >
> > That is why I've presented data on the experiments I've run and provided more arguments to back up the approach.
>
> I'm still unconvinced, sorry.

The reason for making this optional is to let the administrators decide how they want to use the memory in the system. In some situations it might be a big no-no to waste memory; in some cases it might be acceptable.

--
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-15 12:54:31]: On 06/15/2010 10:52 AM, Balbir Singh wrote: That is why the policy (in the next set) will come from the host. As to whether the data is truly duplicated, my experiments show up to 60% of the page cache is duplicated. Isn't that incredibly workload dependent? We can't expect the host admin to know whether duplication will occur or not. I was referring to cache = (policy) we use based on the setup. I don't think the duplication is too workload specific. Moreover, we could use aggressive policies and restrict page cache usage or do it selectively on ballooning. We could also add other options to make the ballooning option truly optional, so that the system management software decides. Consider a read-only workload that exactly fits in guest cache. Without trimming, the guest will keep hitting its own cache, and the host will see no access to the cache at all. So the host (assuming it is under even low pressure) will evict those pages, and the guest will happily use its own cache. If we start to trim, the guest will have to go to disk. That's the best case. Now for the worst case. A random access workload that misses the cache on both guest and host. Now every page is duplicated, and trimming guest pages allows the host to increase its cache, and potentially reduce misses. In this case trimming duplicated pages works. Real life will see a mix of this. Often used pages won't be duplicated, and less often used pages may see some duplication, especially if the host cache portion dedicated to the guest is bigger than the guest cache. I can see that trimming duplicate pages helps, but (a) I'd like to be sure they are duplicates and (b) often trimming them from the host is better than trimming them from the guest. Lets see the behaviour with these patches The first patch is a proactive approach to keep more memory around. Enabling the parameter implies we are OK paying the cost of some overhead. 
My data shows that this leaves a significant amount of free memory with a small 5% (in my case) overhead. This brings us back to what you can do with free memory. The second patch shows no overhead and selectively tries to give free cache back on memory pressure (as indicated by the balloon driver). We've discussed the reasons for doing this: 1. In the situations where cache is duplicated this should benefit us. Your contention is that we need to be specific about the duplication. That falls under the realm of CMM. 2. In the case of slab cache, duplication does not matter; it is a free page that should ideally be reclaimed ahead of mapped pages. If the slab grows, it will get another new page. What is the cost of (1)? In the worst case, we select a non-duplicated page, but for us to select it, it should be inactive, and in that case we do I/O to bring back the page. Trimming from the guest is worthwhile if the pages are not used very often (but enough that caching them in the host is worth it) and if the host cache can serve more than one guest. If we can identify those pages, we don't risk degrading best-case workloads (as defined above). (Note: KSM to some extent identifies those pages, though it is a bit expensive, and doesn't share with the host pagecache.) I see that you are hinting towards finding exact duplicates; I don't know if the cost and complexity justify it. I hope more users can try the patches with and without the boot parameter and provide additional feedback. -- Three Cheers, Balbir -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On Tue, 2010-06-15 at 10:07 +0300, Avi Kivity wrote: On 06/14/2010 08:58 PM, Dave Hansen wrote: On Mon, 2010-06-14 at 19:34 +0300, Avi Kivity wrote: Again, this is useless when ballooning is being used. But, I'm thinking of a more general mechanism to force the system to both have MemFree _and_ be acting as if it is under memory pressure. If there is no memory pressure on the host, there is no reason for the guest to pretend it is under pressure. I can think of quite a few places where this would be beneficial. Ballooning is dangerous. I've OOMed quite a few guests by over-ballooning them. Anything that's voluntary like this is safer than things imposed by the host, although you do trade off effectiveness. That's a bug that needs to be fixed. Eventually the host will come under pressure and will balloon the guest. If that kills the guest, the ballooning is not effective as a host memory management technique. I'm not convinced that it's just a bug that can be fixed. Consider a case where a host sees a guest with 100MB of free memory at the exact moment that a database app sees that memory. The host tries to balloon that memory away at the same time that the app goes and allocates it. That can certainly lead to an OOM very quickly, even for very small amounts of memory (much less than 100MB). Where's the bug? I think the issues are really fundamental to ballooning. If all the guests do this, then it leaves that much more free memory on the host, which can be used flexibly for extra host page cache, new guests, etc... If the host detects lots of pagecache misses it can balloon guests down. If pagecache is quiet, why change anything? Page cache misses alone are not really sufficient. This is the classic problem where we try to differentiate streaming I/O (which we can't effectively cache) from I/O which can be effectively cached. If the host wants to start new guests, it can balloon guests down. If no new guests are wanted, why change anything? 
We're talking about an environment which we're always trying to optimize. Imagine that we're always trying to consolidate guests on to smaller numbers of hosts. We're effectively in a state where we _always_ want new guests. -- Dave
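The host-side trigger debated above (balloon guests down when host pagecache misses rise; if pagecache is quiet, change nothing) can be sketched roughly as follows. This is purely illustrative; the function and parameter names are my own assumptions, not anything from the patches. The cap at half of reported free memory is a guard against the over-balloon OOM race Dave describes.

```python
# Illustrative host-side policy sketch (not from the patch set): balloon
# guests down only when the observed host pagecache miss rate is high.

def plan_balloon(guests, miss_rate, miss_threshold=0.2, step_mb=64):
    """Return a per-guest deflation plan in MB.

    guests: dict of guest name -> free_mb as reported by the guest.
    miss_rate: observed host pagecache miss ratio (0.0 - 1.0).
    Guests are never asked for more than half of what they report free,
    to reduce the risk of racing an in-guest allocation into an OOM.
    """
    if miss_rate < miss_threshold:
        return {}                      # pagecache quiet: change nothing
    plan = {}
    for name, free_mb in guests.items():
        take = min(step_mb, free_mb // 2)
        if take > 0:
            plan[name] = take
    return plan
```

As the thread notes, miss rate alone cannot distinguish streaming I/O from cacheable I/O, so a real policy would need more signals than this.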
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/11/2010 07:56 AM, Balbir Singh wrote: Just to be clear, let's say we have a mapped page (say of /sbin/init) that's been unreferenced since _just_ after the system booted. We also have an unmapped page cache page of a file often used at runtime, say one from /etc/resolv.conf or /etc/passwd. Which page will be preferred for eviction with this patch set? In this case the order is as follows: 1. First we pick free pages if any 2. If we don't have free pages, we go after unmapped page cache and slab cache 3. If that fails as well, we go after regular memory In the scenario that you describe, we'll not be able to easily free up the frequently referenced page from /etc/*. The code will move on to step 3 and do its regular reclaim. Still it seems to me you are subverting the normal order of reclaim. I don't see why an unmapped page cache or slab cache item should be evicted before a mapped page. Certainly the cost of rebuilding a dentry, compared to the gain from evicting it, is much higher than that of reestablishing a mapped page. -- error compiling committee.c: too many arguments to function
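The three-step order described above can be modeled as a toy priority sort. This is a sketch of the stated policy only, not the actual patch code; the class names are my own:

```python
# Toy model of the reclaim preference from the thread: free pages first,
# then unmapped page cache and empty slab pages, then regular reclaim
# of mapped memory.

def reclaim(pages, goal):
    """pages: list of dicts with 'kind' in {'free', 'unmapped_cache',
    'empty_slab', 'mapped'}; goal: number of pages to release.
    Returns the pages chosen for release, cheapest class first."""
    order = {'free': 0, 'unmapped_cache': 1, 'empty_slab': 1, 'mapped': 2}
    return sorted(pages, key=lambda p: order[p['kind']])[:goal]
```

Avi's objection is precisely that this flat ordering ignores per-object refill cost: a dentry backing an empty-looking slab page may be far more expensive to rebuild than a mapped page is to re-fault.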
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-14 11:09:44]: On 06/11/2010 07:56 AM, Balbir Singh wrote: Just to be clear, let's say we have a mapped page (say of /sbin/init) that's been unreferenced since _just_ after the system booted. We also have an unmapped page cache page of a file often used at runtime, say one from /etc/resolv.conf or /etc/passwd. Which page will be preferred for eviction with this patch set? In this case the order is as follows: 1. First we pick free pages if any 2. If we don't have free pages, we go after unmapped page cache and slab cache 3. If that fails as well, we go after regular memory In the scenario that you describe, we'll not be able to easily free up the frequently referenced page from /etc/*. The code will move on to step 3 and do its regular reclaim. Still it seems to me you are subverting the normal order of reclaim. I don't see why an unmapped page cache or slab cache item should be evicted before a mapped page. Certainly the cost of rebuilding a dentry compared to the gain from evicting it, is much higher than that of reestablishing a mapped page. Subverting to avoid memory duplication; the word subverting is overloaded, so let me try to reason a bit. First, let me explain the problem. Memory is a precious resource in a consolidated environment. We don't want to waste memory via page cache duplication (cache=writethrough and cache=writeback mode). Now here is what we are trying to do: 1. A slab page will not be freed until the entire page is free (all slabs have been kfree'd, so to speak). Normal reclaim will definitely free this page, but a lot of it depends on how frequently we are scanning the LRU list and when this page got added. 2. In the case of page cache (specifically unmapped page cache), there is duplication already, so why not go after unmapped page cache when the system is under memory pressure? 
In the case of 1, we don't force a dentry to be freed, but rather a freed page in the slab cache to be reclaimed ahead of forcing reclaim of mapped pages. Does the problem statement make sense? If so, do you agree with 1 and 2? Is there major concern about subverting regular reclaim? Does subverting it make sense in the duplicated scenario? -- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/14/2010 11:48 AM, Balbir Singh wrote: In this case the order is as follows: 1. First we pick free pages if any 2. If we don't have free pages, we go after unmapped page cache and slab cache 3. If that fails as well, we go after regular memory In the scenario that you describe, we'll not be able to easily free up the frequently referenced page from /etc/*. The code will move on to step 3 and do its regular reclaim. Still it seems to me you are subverting the normal order of reclaim. I don't see why an unmapped page cache or slab cache item should be evicted before a mapped page. Certainly the cost of rebuilding a dentry compared to the gain from evicting it, is much higher than that of reestablishing a mapped page. Subverting to avoid memory duplication; the word subverting is overloaded, Right, should have used a different one. Let me try to reason a bit. First, let me explain the problem. Memory is a precious resource in a consolidated environment. We don't want to waste memory via page cache duplication (cache=writethrough and cache=writeback mode). Now here is what we are trying to do: 1. A slab page will not be freed until the entire page is free (all slabs have been kfree'd, so to speak). Normal reclaim will definitely free this page, but a lot of it depends on how frequently we are scanning the LRU list and when this page got added. 2. In the case of page cache (specifically unmapped page cache), there is duplication already, so why not go after unmapped page cache when the system is under memory pressure? In the case of 1, we don't force a dentry to be freed, but rather a freed page in the slab cache to be reclaimed ahead of forcing reclaim of mapped pages. Sounds like this should be done unconditionally, then. An empty slab page is worth less than an unmapped pagecache page at all times, no? Does the problem statement make sense? If so, do you agree with 1 and 2? Is there major concern about subverting regular reclaim? 
Does subverting it make sense in the duplicated scenario? In the case of 2, how do you know there is duplication? You know the guest caches the page, but you have no information about the host. Since the page is cached in the guest, the host doesn't see it referenced, and is likely to drop it. If there is no duplication, then you may have dropped a recently-used page and will likely cause a major fault soon. -- error compiling committee.c: too many arguments to function
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-14 15:40:28]: On 06/14/2010 11:48 AM, Balbir Singh wrote: In this case the order is as follows: 1. First we pick free pages if any 2. If we don't have free pages, we go after unmapped page cache and slab cache 3. If that fails as well, we go after regular memory In the scenario that you describe, we'll not be able to easily free up the frequently referenced page from /etc/*. The code will move on to step 3 and do its regular reclaim. Still it seems to me you are subverting the normal order of reclaim. I don't see why an unmapped page cache or slab cache item should be evicted before a mapped page. Certainly the cost of rebuilding a dentry compared to the gain from evicting it, is much higher than that of reestablishing a mapped page. Subverting to avoid memory duplication; the word subverting is overloaded, Right, should have used a different one. Let me try to reason a bit. First, let me explain the problem. Memory is a precious resource in a consolidated environment. We don't want to waste memory via page cache duplication (cache=writethrough and cache=writeback mode). Now here is what we are trying to do: 1. A slab page will not be freed until the entire page is free (all slabs have been kfree'd, so to speak). Normal reclaim will definitely free this page, but a lot of it depends on how frequently we are scanning the LRU list and when this page got added. 2. In the case of page cache (specifically unmapped page cache), there is duplication already, so why not go after unmapped page cache when the system is under memory pressure? In the case of 1, we don't force a dentry to be freed, but rather a freed page in the slab cache to be reclaimed ahead of forcing reclaim of mapped pages. Sounds like this should be done unconditionally, then. An empty slab page is worth less than an unmapped pagecache page at all times, no? In a consolidated environment, even at the cost of some CPU to run shrinkers, I think potentially yes. 
Does the problem statement make sense? If so, do you agree with 1 and 2? Is there major concern about subverting regular reclaim? Does subverting it make sense in the duplicated scenario? In the case of 2, how do you know there is duplication? You know the guest caches the page, but you have no information about the host. Since the page is cached in the guest, the host doesn't see it referenced, and is likely to drop it. True, that is why the first patch is controlled via a boot parameter that the host can pass. For the second patch, I think we'll need something like a 'balloon size [cache]' command, with the cache argument being optional. If there is no duplication, then you may have dropped a recently-used page and will likely cause a major fault soon. Yes, agreed. -- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/14/2010 03:50 PM, Balbir Singh wrote: Let me try to reason a bit. First, let me explain the problem. Memory is a precious resource in a consolidated environment. We don't want to waste memory via page cache duplication (cache=writethrough and cache=writeback mode). Now here is what we are trying to do: 1. A slab page will not be freed until the entire page is free (all slabs have been kfree'd, so to speak). Normal reclaim will definitely free this page, but a lot of it depends on how frequently we are scanning the LRU list and when this page got added. 2. In the case of page cache (specifically unmapped page cache), there is duplication already, so why not go after unmapped page cache when the system is under memory pressure? In the case of 1, we don't force a dentry to be freed, but rather a freed page in the slab cache to be reclaimed ahead of forcing reclaim of mapped pages. Sounds like this should be done unconditionally, then. An empty slab page is worth less than an unmapped pagecache page at all times, no? In a consolidated environment, even at the cost of some CPU to run shrinkers, I think potentially yes. I don't understand. If you're running the shrinkers then you're evicting live entries, which could cost you an I/O each. That's expensive, consolidated or not. If you're not running the shrinkers, why does it matter if you're consolidated or not? Drop that page unconditionally. Does the problem statement make sense? If so, do you agree with 1 and 2? Is there major concern about subverting regular reclaim? Does subverting it make sense in the duplicated scenario? In the case of 2, how do you know there is duplication? You know the guest caches the page, but you have no information about the host. Since the page is cached in the guest, the host doesn't see it referenced, and is likely to drop it. True, that is why the first patch is controlled via a boot parameter that the host can pass. 
For the second patch, I think we'll need something like a 'balloon size [cache]' command, with the cache argument being optional. Whether a page is duplicated on the host or not is per-page; it cannot be a boot parameter. If we drop unmapped pagecache pages, we need to be sure they can be backed by the host, and that depends on the amount of sharing. Overall, I don't see how a user can tune this. If I were a guest admin, I'd play it safe by not assuming the host will back me, and disabling the feature. To get something like this to work, we need to reward cooperating guests somehow. If there is no duplication, then you may have dropped a recently-used page and will likely cause a major fault soon. Yes, agreed. So how do we deal with this? -- error compiling committee.c: too many arguments to function
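The "reward cooperating guests" idea is left open in the thread. One possible shape for it, sketched purely as my own assumption (nothing like this exists in the patches): the host keeps a ledger of voluntary cache give-backs and, when pressure eases, deflates the most cooperative guests first.

```python
# Hypothetical host-side bookkeeping: guests that voluntarily return
# pagecache earn credit, and when host pressure eases the balloon is
# deflated for the most cooperative guests first. Names are illustrative.

class BalloonLedger:
    def __init__(self):
        self.credit = {}               # guest name -> MB returned voluntarily

    def record_giveback(self, guest, mb):
        self.credit[guest] = self.credit.get(guest, 0) + mb

    def deflate_order(self):
        # Most cooperative guests get their memory back first.
        return sorted(self.credit, key=self.credit.get, reverse=True)

ledger = BalloonLedger()
ledger.record_giveback("web", 32)
ledger.record_giveback("db", 128)
ledger.record_giveback("web", 16)
```

A guest admin who can expect memory back sooner has an incentive not to disable the feature, which is the trust problem Avi raises.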
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On Mon, 2010-06-14 at 16:01 +0300, Avi Kivity wrote: If we drop unmapped pagecache pages, we need to be sure they can be backed by the host, and that depends on the amount of sharing. You also have to set the host up properly, and continue to maintain it in a way that finds and eliminates duplicates. I saw some benchmarks where KSM was doing great, finding lots of duplicate pages. Then, the host filled up, and guests started reclaiming. As memory pressure got worse, so did KSM's ability to find duplicates. At the same time, I see what you're trying to do with this. It really can be an alternative to ballooning if we do it right, since ballooning would probably evict similar pages. Although it would only work in idle guests, what about a knob that the host can turn to just get the guest to start running reclaim? -- Dave
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/14/2010 06:12 PM, Dave Hansen wrote: On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote: 1. A slab page will not be freed until the entire page is free (all slabs have been kfree'd, so to speak). Normal reclaim will definitely free this page, but a lot of it depends on how frequently we are scanning the LRU list and when this page got added. You don't have to be freeing entire slab pages for the reclaim to have been useful. You could just be making space so that _future_ allocations fill in the slab holes you just created. You may not be freeing pages, but you're reducing future system pressure. Depends. If you've evicted something that will be referenced soon, you're increasing system pressure. If unmapped page cache is the easiest thing to evict, then it should be the first thing that goes when a balloon request comes in, which is the case this patch is trying to handle. If it isn't the easiest thing to evict, then we _shouldn't_ evict it. Easy to evict is just one measure. There's benefit (size of data evicted), cost to refill (seeks, cpu), and likelihood that the cost to refill will be incurred (recency). It's all very complicated. We need better information to make these decisions. For one thing, I'd like to see age information tied to objects. We may have two pages that were referenced at wildly different times be next to each other in LRU order. We have many LRUs, but no idea of the relative recency of the tails of those LRUs. If each page or object had an age, we could scale those ages by the benefit from reclaim and the cost to refill, and make a better decision as to what to evict first. But of course page-age means increasing the size of struct page, and we can only approximate its value by scanning the accessed bit, not determine it accurately (unlike the other objects managed by the cache). 
-- error compiling committee.c: too many arguments to function
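The age-scaled eviction idea above can be sketched as a simple scoring function. The formula and names are illustrative assumptions layered on Avi's description (age scaled by reclaim benefit and refill cost), not anything implemented in the kernel:

```python
# Sketch: rank eviction candidates by age scaled by reclaim benefit
# (bytes freed) and refill cost. Older, bigger, cheaper-to-refill
# objects score higher and are evicted first.

def eviction_score(age_s, benefit_bytes, refill_cost_ms):
    return age_s * benefit_bytes / (1.0 + refill_cost_ms)

def pick_victim(objs):
    """objs: list of (name, age_s, benefit_bytes, refill_cost_ms)."""
    return max(objs, key=lambda o: eviction_score(o[1], o[2], o[3]))[0]
```

With equal ages and refill costs, a 4KB unmapped page outscores a small dentry; with equal sizes, a long-idle page outscores a recently touched one, which is the cross-LRU comparison the thread says is missing today.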
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/14/2010 06:33 PM, Dave Hansen wrote: On Mon, 2010-06-14 at 16:01 +0300, Avi Kivity wrote: If we drop unmapped pagecache pages, we need to be sure they can be backed by the host, and that depends on the amount of sharing. You also have to set the host up properly, and continue to maintain it in a way that finds and eliminates duplicates. I saw some benchmarks where KSM was doing great, finding lots of duplicate pages. Then, the host filled up, and guests started reclaiming. As memory pressure got worse, so did KSM's ability to find duplicates. Yup. KSM needs to be backed up by ballooning, swap, and live migration. At the same time, I see what you're trying to do with this. It really can be an alternative to ballooning if we do it right, since ballooning would probably evict similar pages. Although it would only work in idle guests, what about a knob that the host can turn to just get the guest to start running reclaim? Isn't the knob in this proposal the balloon? AFAICT, the idea here is to change how the guest reacts to being ballooned, but the trigger itself would not change. My issue is that changing the type of object being preferentially reclaimed just changes the type of workload that would prematurely suffer from reclaim. In this case, workloads that use a lot of unmapped pagecache would suffer. btw, aren't /proc/sys/vm/swappiness and vfs_cache_pressure similar knobs? -- error compiling committee.c: too many arguments to function
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On Mon, 2010-06-14 at 18:44 +0300, Avi Kivity wrote: On 06/14/2010 06:33 PM, Dave Hansen wrote: At the same time, I see what you're trying to do with this. It really can be an alternative to ballooning if we do it right, since ballooning would probably evict similar pages. Although it would only work in idle guests, what about a knob that the host can turn to just get the guest to start running reclaim? Isn't the knob in this proposal the balloon? AFAICT, the idea here is to change how the guest reacts to being ballooned, but the trigger itself would not change. I think the patch was made on the following assumptions: 1. Guests will keep filling their memory with relatively worthless page cache that they don't really need. 2. When they do this, it hurts the overall system with no real gain for anyone. In the case of a ballooned guest, they _won't_ keep filling memory. The balloon will prevent them. So, I guess I was just going down the path of considering if this would be useful without ballooning in place. To me, it's really hard to justify _with_ ballooning in place. My issue is that changing the type of object being preferentially reclaimed just changes the type of workload that would prematurely suffer from reclaim. In this case, workloads that use a lot of unmapped pagecache would suffer. btw, aren't /proc/sys/vm/swappiness and vfs_cache_pressure similar knobs? Those tell you how to balance going after the different classes of things that we can reclaim. Again, this is useless when ballooning is being used. But, I'm thinking of a more general mechanism to force the system to both have MemFree _and_ be acting as if it is under memory pressure. Balbir, can you elaborate a bit on why you would need these patches on a guest that is being ballooned? -- Dave
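For reference, vm.swappiness does roughly what Dave says: it balances how eagerly reclaim goes after anon versus file pages. A simplified model of the kernel's scan-balance split (the real code also weights by recent LRU rotation counts, so treat this as an approximation):

```python
# Simplified model of the kernel's swappiness split: anon pages are
# weighted by swappiness, file pages by 200 - swappiness, so the two
# priorities always sum to 200.

def scan_split(swappiness):
    anon_prio = swappiness
    file_prio = 200 - swappiness
    total = anon_prio + file_prio        # always 200
    return anon_prio / total, file_prio / total
```

At the default swappiness of 60, file pages are scanned more than twice as eagerly as anon pages; at 0, reclaim avoids swapping anon entirely. Neither knob distinguishes host-duplicated pagecache from unique pagecache, which is Avi's point.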
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Dave Hansen d...@linux.vnet.ibm.com [2010-06-14 08:12:56]: On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote: 1. A slab page will not be freed until the entire page is free (all slabs have been kfree'd so to speak). Normal reclaim will definitely free this page, but a lot of it depends on how frequently we are scanning the LRU list and when this page got added. You don't have to be freeing entire slab pages for the reclaim to have been useful. You could just be making space so that _future_ allocations fill in the slab holes you just created. You may not be freeing pages, but you're reducing future system pressure. If unmapped page cache is the easiest thing to evict, then it should be the first thing that goes when a balloon request comes in, which is the case this patch is trying to handle. If it isn't the easiest thing to evict, then we _shouldn't_ evict it. Like I said earlier, a lot of that works correctly as you said, but it is also an idealization. If you've got duplicate pages and you know that they are duplicated and can be retrieved at a lower cost, why wouldn't we go after them first? -- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On Mon, 2010-06-14 at 22:28 +0530, Balbir Singh wrote: If you've got duplicate pages and you know that they are duplicated and can be retrieved at a lower cost, why wouldn't we go after them first? I agree with this in theory. But, the guest lacks the information about what is truly duplicated and what the costs are for itself and/or the host to recreate it. Unmapped page cache may be the best proxy that we have at the moment for easy to recreate, but I think it's still too poor a match to make these patches useful. -- Dave
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Dave Hansen d...@linux.vnet.ibm.com [2010-06-14 10:09:31]: On Mon, 2010-06-14 at 22:28 +0530, Balbir Singh wrote: If you've got duplicate pages and you know that they are duplicated and can be retrieved at a lower cost, why wouldn't we go after them first? I agree with this in theory. But, the guest lacks the information about what is truly duplicated and what the costs are for itself and/or the host to recreate it. Unmapped page cache may be the best proxy that we have at the moment for easy to recreate, but I think it's still too poor a match to make these patches useful. That is why the policy (in the next set) will come from the host. As to whether the data is truly duplicated, my experiments show up to 60% of the page cache is duplicated. The first patch today is again enabled by the host. Both of them are expected to be useful in the cache != none case. The data I have shows more details including the performance and overhead. -- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/14/2010 06:55 PM, Dave Hansen wrote: On Mon, 2010-06-14 at 18:44 +0300, Avi Kivity wrote: On 06/14/2010 06:33 PM, Dave Hansen wrote: At the same time, I see what you're trying to do with this. It really can be an alternative to ballooning if we do it right, since ballooning would probably evict similar pages. Although it would only work in idle guests, what about a knob that the host can turn to just get the guest to start running reclaim? Isn't the knob in this proposal the balloon? AFAICT, the idea here is to change how the guest reacts to being ballooned, but the trigger itself would not change. I think the patch was made on the following assumptions: 1. Guests will keep filling their memory with relatively worthless page cache that they don't really need. 2. When they do this, it hurts the overall system with no real gain for anyone. In the case of a ballooned guest, they _won't_ keep filling memory. The balloon will prevent them. So, I guess I was just going down the path of considering if this would be useful without ballooning in place. To me, it's really hard to justify _with_ ballooning in place. There are two decisions that need to be made: - how much memory a guest should be given - given some guest memory, what's the best use for it The first question can perhaps be answered by looking at guest I/O rates and giving more memory to more active guests. The second question is hard, but not any different than running non-virtualized - except if we can detect sharing or duplication. In this case, dropping a duplicated page is worthwhile, while dropping a shared page provides no benefit. How the patch helps answer either question, I'm not sure. I don't think preferential dropping of unmapped page cache is the answer. My issue is that changing the type of object being preferentially reclaimed just changes the type of workload that would prematurely suffer from reclaim. In this case, workloads that use a lot of unmapped pagecache would suffer. 
btw, aren't /proc/sys/vm/swappiness and vfs_cache_pressure similar knobs? Those tell you how to balance going after the different classes of things that we can reclaim. Again, this is useless when ballooning is being used. But, I'm thinking of a more general mechanism to force the system to both have MemFree _and_ be acting as if it is under memory pressure. If there is no memory pressure on the host, there is no reason for the guest to pretend it is under pressure. If there is memory pressure on the host, it should share the pain among its guests by applying the balloon. So I don't think voluntarily dropping cache is a good direction. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-14 18:34:58]: On 06/14/2010 06:12 PM, Dave Hansen wrote: On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote: 1. A slab page will not be freed until the entire page is free (all slabs have been kfree'd so to speak). Normal reclaim will definitely free this page, but a lot of it depends on how frequently we are scanning the LRU list and when this page got added. You don't have to be freeing entire slab pages for the reclaim to have been useful. You could just be making space so that _future_ allocations fill in the slab holes you just created. You may not be freeing pages, but you're reducing future system pressure. Depends. If you've evicted something that will be referenced soon, you're increasing system pressure. I don't think slab pages care about being referenced soon, they are either allocated or freed. A page is just a storage unit for the data structure. A new one can be allocated on demand. -- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-14 19:34:00]: On 06/14/2010 06:55 PM, Dave Hansen wrote: On Mon, 2010-06-14 at 18:44 +0300, Avi Kivity wrote: On 06/14/2010 06:33 PM, Dave Hansen wrote: At the same time, I see what you're trying to do with this. It really can be an alternative to ballooning if we do it right, since ballooning would probably evict similar pages. Although it would only work in idle guests, what about a knob that the host can turn to just get the guest to start running reclaim? Isn't the knob in this proposal the balloon? AFAICT, the idea here is to change how the guest reacts to being ballooned, but the trigger itself would not change. I think the patch was made on the following assumptions: 1. Guests will keep filling their memory with relatively worthless page cache that they don't really need. 2. When they do this, it hurts the overall system with no real gain for anyone. In the case of a ballooned guest, they _won't_ keep filling memory. The balloon will prevent them. So, I guess I was just going down the path of considering if this would be useful without ballooning in place. To me, it's really hard to justify _with_ ballooning in place. There are two decisions that need to be made: - how much memory a guest should be given - given some guest memory, what's the best use for it The first question can perhaps be answered by looking at guest I/O rates and giving more memory to more active guests. The second question is hard, but not any different than running non-virtualized - except if we can detect sharing or duplication. In this case, dropping a duplicated page is worthwhile, while dropping a shared page provides no benefit. I think there is another way of looking at it: given some free memory, 1. Can the guest run more applications or run faster? 2. Can the host potentially get this memory via ballooning or some other means to start newer guest instances? I think the answer to both 1 and 2 is yes. 
How the patch helps answer either question, I'm not sure. I don't think preferential dropping of unmapped page cache is the answer.

Preferential dropping as selected by the host, which knows about the setup and whether there is duplication involved. While we use the term preferential dropping, remember it is still via the LRU and we don't always succeed. It is a best-effort scenario (we drop only if we can and the unmapped pages are not highly referenced).

My issue is that changing the type of object being preferentially reclaimed just changes the type of workload that would prematurely suffer from reclaim. In this case, workloads that use a lot of unmapped pagecache would suffer. btw, aren't /proc/sys/vm/swappiness and vfs_cache_pressure similar knobs? Those tell you how to balance going after the different classes of things that we can reclaim.

Again, this is useless when ballooning is being used. But, I'm thinking of a more general mechanism to force the system to both have MemFree _and_ be acting as if it is under memory pressure.

If there is no memory pressure on the host, there is no reason for the guest to pretend it is under pressure. If there is memory pressure on the host, it should share the pain among its guests by applying the balloon. So I don't think voluntarily dropping cache is a good direction.

There are two situations:

1. Voluntarily drop cache, if it was set up to do so (the host knows that it caches that information anyway).
2. Drop the cache on a special balloon option; again, the host knows it caches that very same information, so it prefers to free that up first.

-- Three Cheers, Balbir
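Those two sysctls bias the reclaim scan between classes of reclaimable pages. As a rough userspace illustration of the idea, here is a toy model loosely patterned on how the kernel's get_scan_count() weighs anonymous versus file-backed LRUs; the function names are made up, swappiness is simplified to 0..100, and the real kernel additionally factors in LRU sizes and rotation history:

```c
#include <assert.h>

/* Toy model of a swappiness-style balance knob: split a reclaim scan of
 * `nr_to_scan` pages between the anonymous and file-backed LRU lists.
 * swappiness = 0 scans only file pages; 100 scans both classes equally
 * in this simplified model. Hypothetical sketch, not kernel code. */
struct scan_split { unsigned long anon; unsigned long file; };

struct scan_split split_scan(unsigned long nr_to_scan, unsigned swappiness)
{
    unsigned anon_prio = swappiness;        /* 0..100 */
    unsigned file_prio = 200 - swappiness;  /* echoes the kernel's 200 base */
    struct scan_split s;

    s.anon = nr_to_scan * anon_prio / (anon_prio + file_prio);
    s.file = nr_to_scan - s.anon;
    return s;
}
```

The point of the analogy: like this patch, such knobs don't target individual pages; they only tilt which class of pages reclaim visits first.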
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On Mon, 2010-06-14 at 19:34 +0300, Avi Kivity wrote:

Again, this is useless when ballooning is being used. But, I'm thinking of a more general mechanism to force the system to both have MemFree _and_ be acting as if it is under memory pressure. If there is no memory pressure on the host, there is no reason for the guest to pretend it is under pressure.

I can think of quite a few places where this would be beneficial. Ballooning is dangerous. I've OOMed quite a few guests by over-ballooning them. Anything that's voluntary like this is safer than things imposed by the host, although you do trade off some effectiveness.

If all the guests do this, then it leaves that much more free memory on the host, which can be used flexibly for extra host page cache, new guests, etc... A system in this state where everyone is proactively keeping their footprints down is more likely to be able to handle load spikes. Reclaim is an expensive activity, and this ensures that we don't have to do it while we're busy doing other things like handling load spikes. This was one of the concepts behind CMM2: reduce the overhead during peak periods.

It's also handy for planning. Guests exhibiting this behavior will _act_ as if they're under pressure. That's a good way to approximate how a guest will act when it _is_ under pressure.

If there is memory pressure on the host, it should share the pain among its guests by applying the balloon. So I don't think voluntarily dropping cache is a good direction.

I think we're trying to consider things slightly outside of ballooning at this point. If ballooning was the end-all solution, I'm fairly sure Balbir wouldn't be looking at this stuff. Just trying to keep options open. :)

-- Dave
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-11 14:05:53]:

On Fri, 11 Jun 2010 10:16:32 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote:
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-11 10:54:41]:
On Thu, 10 Jun 2010 17:07:32 -0700 Dave Hansen d...@linux.vnet.ibm.com wrote:
On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote:

I'm not sure victimizing unmapped cache pages is a good idea. Shouldn't page selection use the LRU for recency information instead of the cost of guest reclaim? Dropping a frequently used unmapped cache page can be more expensive than dropping an unused text page that was loaded as part of some executable's initialization and forgotten.

We victimize the unmapped cache only if it is unused (in LRU order). We don't force the issue too much. We also have free slab cache to go after.

Just to be clear, let's say we have a mapped page (say of /sbin/init) that's been unreferenced since _just_ after the system booted. We also have an unmapped page cache page of a file often used at runtime, say one from /etc/resolv.conf or /etc/passwd.

Hmm. I'm not a fan of estimating working set size by calculation based on some numbers without considering history or feedback. Can't we use some kind of feedback algorithm such as hi-low watermarks, random walk or GA (or something smarter) to detect the size?

Could you please clarify at what level you are suggesting size detection? I assume it is outside the OS, right?

OS includes kernel and system programs ;) I can think of both ways, in kernel and in user approach, and they should complement each other. An example of a kernel-based approach:

1. add a shrinker callback (A) for balloon-driver-for-guest as guest kswapd.
2. add a shrinker callback (B) for balloon-driver-for-host as host kswapd. (I guess current balloon driver is only for host. Please imagine.)

(A) increases free memory in Guest. (B) increases free memory in Host.
This is an example of feedback based memory resizing between host and guest. I think (B) is necessary at least before considering complicated things.

B is left to the hypervisor and the memory policy running on it. My patches address Linux running as a guest, with a Linux hypervisor at the moment, but that can be extended to other balloon drivers as well.

To implement something clever, (A) and (B) should take into account how frequently memory reclaim in the guest (which requires some I/O) happens.

Yes, I think the policy in the hypervisor needs to look at those details as well.

If doing it outside the kernel, I think using memcg is better than depending on the balloon driver. But co-operative balloon and memcg may show us something good.

Yes, agreed. Co-operative is better; if there is no co-operation then memcg might be used for enforcement.

-- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On 06/08/2010 06:51 PM, Balbir Singh wrote:

Balloon unmapped page cache pages first

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch builds on the ballooning infrastructure by ballooning unmapped page cache pages first. It looks for low hanging fruit first and tries to reclaim clean unmapped pages first.

I'm not sure victimizing unmapped cache pages is a good idea. Shouldn't page selection use the LRU for recency information instead of the cost of guest reclaim? Dropping a frequently used unmapped cache page can be more expensive than dropping an unused text page that was loaded as part of some executable's initialization and forgotten. Many workloads have many unmapped cache pages, for example static web serving and the all-important kernel build.

The key advantage was that it resulted in lesser RSS usage in the host and more cached usage, indicating that the caching had been pushed towards the host. The guest cached memory usage was lower and free memory in the guest was also higher.

Caching in the host is only helpful if the cache can be shared, otherwise it's better to cache in the guest.

-- error compiling committee.c: too many arguments to function
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Avi Kivity a...@redhat.com [2010-06-10 12:43:11]:

On 06/08/2010 06:51 PM, Balbir Singh wrote:

Balloon unmapped page cache pages first

From: Balbir Singh bal...@linux.vnet.ibm.com

This patch builds on the ballooning infrastructure by ballooning unmapped page cache pages first. It looks for low hanging fruit first and tries to reclaim clean unmapped pages first.

I'm not sure victimizing unmapped cache pages is a good idea. Shouldn't page selection use the LRU for recency information instead of the cost of guest reclaim? Dropping a frequently used unmapped cache page can be more expensive than dropping an unused text page that was loaded as part of some executable's initialization and forgotten.

We victimize the unmapped cache only if it is unused (in LRU order). We don't force the issue too much. We also have free slab cache to go after.

Many workloads have many unmapped cache pages, for example static web serving and the all-important kernel build.

I've tested kernbench; you can see the results in the original posting, and there is no observable overhead as a result of the patch in my run.

The key advantage was that it resulted in lesser RSS usage in the host and more cached usage, indicating that the caching had been pushed towards the host. The guest cached memory usage was lower and free memory in the guest was also higher.

Caching in the host is only helpful if the cache can be shared, otherwise it's better to cache in the guest.

Hmm.. so we would need a balloon cache hint from the monitor, so that it is not unconditional? Overall my results show the following:

1. No drastic reduction of guest unmapped cache, just sufficient to show lower RSS in the host. More freeable memory (as in cached memory + free memory) visible on the host.
2. No significant impact on the benchmark (numbers) running in the guest.
-- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote:

I'm not sure victimizing unmapped cache pages is a good idea. Shouldn't page selection use the LRU for recency information instead of the cost of guest reclaim? Dropping a frequently used unmapped cache page can be more expensive than dropping an unused text page that was loaded as part of some executable's initialization and forgotten.

We victimize the unmapped cache only if it is unused (in LRU order). We don't force the issue too much. We also have free slab cache to go after.

Just to be clear, let's say we have a mapped page (say of /sbin/init) that's been unreferenced since _just_ after the system booted. We also have an unmapped page cache page of a file often used at runtime, say one from /etc/resolv.conf or /etc/passwd. Which page will be preferred for eviction with this patch set?

-- Dave
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On Thu, 10 Jun 2010 17:07:32 -0700 Dave Hansen d...@linux.vnet.ibm.com wrote:
On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote:

I'm not sure victimizing unmapped cache pages is a good idea. Shouldn't page selection use the LRU for recency information instead of the cost of guest reclaim? Dropping a frequently used unmapped cache page can be more expensive than dropping an unused text page that was loaded as part of some executable's initialization and forgotten.

We victimize the unmapped cache only if it is unused (in LRU order). We don't force the issue too much. We also have free slab cache to go after.

Just to be clear, let's say we have a mapped page (say of /sbin/init) that's been unreferenced since _just_ after the system booted. We also have an unmapped page cache page of a file often used at runtime, say one from /etc/resolv.conf or /etc/passwd.

Hmm. I'm not a fan of estimating working set size by calculation based on some numbers without considering history or feedback. Can't we use some kind of feedback algorithm such as hi-low watermarks, random walk or GA (or something smarter) to detect the size?

Thanks, -Kame
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-11 10:54:41]:

On Thu, 10 Jun 2010 17:07:32 -0700 Dave Hansen d...@linux.vnet.ibm.com wrote:
On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote:

I'm not sure victimizing unmapped cache pages is a good idea. Shouldn't page selection use the LRU for recency information instead of the cost of guest reclaim? Dropping a frequently used unmapped cache page can be more expensive than dropping an unused text page that was loaded as part of some executable's initialization and forgotten.

We victimize the unmapped cache only if it is unused (in LRU order). We don't force the issue too much. We also have free slab cache to go after.

Just to be clear, let's say we have a mapped page (say of /sbin/init) that's been unreferenced since _just_ after the system booted. We also have an unmapped page cache page of a file often used at runtime, say one from /etc/resolv.conf or /etc/passwd.

Hmm. I'm not a fan of estimating working set size by calculation based on some numbers without considering history or feedback. Can't we use some kind of feedback algorithm such as hi-low watermarks, random walk or GA (or something smarter) to detect the size?

Could you please clarify at what level you are suggesting size detection? I assume it is outside the OS, right?

-- Three Cheers, Balbir
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
* Dave Hansen d...@linux.vnet.ibm.com [2010-06-10 17:07:32]:

On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote:

I'm not sure victimizing unmapped cache pages is a good idea. Shouldn't page selection use the LRU for recency information instead of the cost of guest reclaim? Dropping a frequently used unmapped cache page can be more expensive than dropping an unused text page that was loaded as part of some executable's initialization and forgotten.

We victimize the unmapped cache only if it is unused (in LRU order). We don't force the issue too much. We also have free slab cache to go after.

Just to be clear, let's say we have a mapped page (say of /sbin/init) that's been unreferenced since _just_ after the system booted. We also have an unmapped page cache page of a file often used at runtime, say one from /etc/resolv.conf or /etc/passwd. Which page will be preferred for eviction with this patch set?

In this case the order is as follows:

1. First we pick free pages, if any.
2. If we don't have free pages, we go after unmapped page cache and slab cache.
3. If that fails as well, we go after regular memory.

In the scenario that you describe, we'll not be able to easily free up the frequently referenced page from /etc/*. The code will move on to step 3 and do its regular reclaim.

-- Three Cheers, Balbir
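The three-step fallback described here can be sketched as a small userspace model; the struct and function names are invented for illustration, and "cold_unmapped" lumps together the unused unmapped page cache and freeable slab of step 2 (hot unmapped pages are deliberately excluded, matching the best-effort behavior described):

```c
#include <assert.h>

/* Hypothetical model of the eviction order when the balloon asks the
 * guest for `demand` pages: free pages first, then cold unmapped page
 * cache + slab, and only then the regular reclaim path. */
struct guest_mem {
    long free_pages;    /* step 1: already-free pages */
    long cold_unmapped; /* step 2: unused unmapped cache + free slab */
    long other;         /* step 3: everything regular reclaim can take */
};

long balloon_take(struct guest_mem *m, long demand)
{
    long got = 0, n;

    n = demand - got;                         /* step 1: free pages */
    if (n > m->free_pages) n = m->free_pages;
    m->free_pages -= n; got += n;

    n = demand - got;                         /* step 2: cold unmapped/slab */
    if (n > m->cold_unmapped) n = m->cold_unmapped;
    m->cold_unmapped -= n; got += n;

    n = demand - got;                         /* step 3: regular reclaim */
    if (n > m->other) n = m->other;
    m->other -= n; got += n;

    return got;
}
```

With 10 free pages, 20 cold unmapped pages and a 25-page demand, the model satisfies the request from steps 1 and 2 and never touches step 3, which is the behavior the thread is debating.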
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On Fri, 11 Jun 2010 10:16:32 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote:
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-11 10:54:41]:
On Thu, 10 Jun 2010 17:07:32 -0700 Dave Hansen d...@linux.vnet.ibm.com wrote:
On Thu, 2010-06-10 at 19:55 +0530, Balbir Singh wrote:

I'm not sure victimizing unmapped cache pages is a good idea. Shouldn't page selection use the LRU for recency information instead of the cost of guest reclaim? Dropping a frequently used unmapped cache page can be more expensive than dropping an unused text page that was loaded as part of some executable's initialization and forgotten.

We victimize the unmapped cache only if it is unused (in LRU order). We don't force the issue too much. We also have free slab cache to go after.

Just to be clear, let's say we have a mapped page (say of /sbin/init) that's been unreferenced since _just_ after the system booted. We also have an unmapped page cache page of a file often used at runtime, say one from /etc/resolv.conf or /etc/passwd.

Hmm. I'm not a fan of estimating working set size by calculation based on some numbers without considering history or feedback. Can't we use some kind of feedback algorithm such as hi-low watermarks, random walk or GA (or something smarter) to detect the size?

Could you please clarify at what level you are suggesting size detection? I assume it is outside the OS, right?

OS includes kernel and system programs ;) I can think of both ways, in kernel and in user approach, and they should complement each other. An example of a kernel-based approach:

1. add a shrinker callback (A) for balloon-driver-for-guest as guest kswapd.
2. add a shrinker callback (B) for balloon-driver-for-host as host kswapd. (I guess current balloon driver is only for host. Please imagine.)

(A) increases free memory in Guest. (B) increases free memory in Host. This is an example of feedback based memory resizing between host and guest.
I think (B) is necessary at least before considering complicated things. To implement something clever, (A) and (B) should take into account how frequently memory reclaim in the guest (which requires some I/O) happens. If doing it outside the kernel, I think using memcg is better than depending on the balloon driver. But co-operative balloon and memcg may show us something good.

Thanks, -Kame
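The hi-low watermark feedback suggested earlier in the thread could look roughly like this in a userspace model: inflate the balloon (return pages to the host) when guest free memory has slack above a high watermark, deflate when it drops below a low watermark. All names and the page accounting are hypothetical, and a real driver would ratelimit and hysteresis-bound these adjustments:

```c
#include <assert.h>

struct balloon { long size; };  /* pages currently held by the balloon */

/* One feedback step. Returns the change in guest free memory:
 * negative when pages are handed to the host, positive when the
 * balloon deflates and returns pages to the guest. */
long balloon_adjust(struct balloon *b, long free, long low, long high)
{
    if (free > high) {                 /* guest has slack: inflate */
        long delta = free - high;
        b->size += delta;
        return -delta;
    }
    if (free < low && b->size > 0) {   /* guest under pressure: deflate */
        long want = low - free;
        long delta = b->size < want ? b->size : want;
        b->size -= delta;
        return delta;
    }
    return 0;                          /* between watermarks: steady state */
}
```

The appeal of this scheme over a one-shot working-set estimate is exactly Kame's point: it self-corrects from observed behavior instead of trusting a single calculated number.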
Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control
On Fri, 11 Jun 2010 14:05:53 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

I can think of both ways, in kernel and in user approach, and they should complement each other. An example of a kernel-based approach:

1. add a shrinker callback (A) for balloon-driver-for-guest as guest kswapd.
2. add a shrinker callback (B) for balloon-driver-for-host as host kswapd. (I guess current balloon driver is only for host. Please imagine.)

guest. Sorry.

-Kame