[Devel] Re: memrlimit controller merge to mainline
On Tue, 5 Aug 2008, Balbir Singh wrote:
> Hugh Dickins wrote:
> [snip]
>> BUG: unable to handle kernel paging request at 6b6b6b8b
>> IP: [<7817078f>] memrlimit_cgroup_uncharge_as+0x18/0x29
>> Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
>>  [<78161323>] ? exit_mmap+0xaf/0x133
>>  [<781226b1>] ? mmput+0x4c/0xba
>>  [<78165ce3>] ? try_to_unuse+0x20b/0x3f5
>>  [<78371534>] ? _spin_unlock+0x22/0x3c
>>  [<7816636a>] ? sys_swapoff+0x17b/0x37c
>>  [<78102d95>] ? sysenter_past_esp+0x6a/0xa5
>
> I am unable to reproduce the problem,

Me neither: I've spent many hours trying 2.6.27-rc1-mm1 and then back to
2.6.26-rc8-mm1. But I've been SO stupid: I saw it originally on one
machine with SLAB_DEBUG=y, and have been trying since mostly on another
with SLUB_DEBUG=y, but never thought to boot with slub_debug=P,task_struct
until now.

> but I do have an initial hypothesis
>
>	CPU0				CPU1
>					try_to_unuse
>	task 1 starts exiting		look at mm = task1->mm
>					..increment mm_users
>	task 1 exits
>	mm->owner needs to be updated, but no new owner
>	is found (mm_users > 1, but no other task
>	has task->mm = task1->mm)
>	mm_update_next_owner() leaves
>
>	grace period
>					user count drops, call mmput(mm)
>	task 1 freed
>					dereferencing mm->owner fails

Yes, that looks right to me: seems obvious now. I don't think your
careful alternation of CPU0/1 events at the end matters: the swapoff CPU
simply dereferences mm->owner after that task has gone. (That's a shame,
I'd always hoped that mm->owner->comm was going to be good for use in mm
messages, even when tearing down the mm.)

> I do have a potential solution in mind, but I want to make sure my
> hypothesis is correct.

It seems wrong that memrlimit_cgroup_uncharge_as should be called after
mm->owner may have been changed, even if it's to something safe. But I
forget the mm/task exit details; surely they're tricky.

By the way, is the ordering in mm_update_next_owner the best? Would there
be less movement if it searched amongst siblings before it searched
amongst children? Ought it to make a first pass trying to stay within the
same cgroup?
Hugh ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
On Tue, 29 Jul 2008, KAMEZAWA Hiroyuki wrote:
> On Fri, 25 Jul 2008 17:46:45 +0100 (BST)
> Hugh Dickins [EMAIL PROTECTED] wrote:
>> IIRC Rik expressed the same by pointing out that a cgroup at its swap
>> limit would then be forced to grow in mem (until it hits its mem
>> limit): so controlling the less precious resource would increase
>> pressure on the more precious resource. (Actually, that probably bears
>> little relation to what he said - sorry, Rik!) I don't recall what
>> answer he got, perhaps I'd be persuaded if I heard it again.
>
> Added Nishimura to CC. IMHO, from the user's point of view, both of
>  - having 2 controls as mem controller + swap controller
>  - mem + swap controller
> doesn't have much difference. The users will use as they like.

I'm not suggesting either one of those alternatives. I'm suggesting we
have a mem controller (the thing we already have) and a mem+swap
controller (which we don't yet have: a controller for the total mem+swap
of a cgroup); the mem+swap controller likely making use of much that is
in the mem controller, as Paul has said. (Unfortunately I don't have a
good name for this mem+swap.)

I happen to believe that the mem+swap controller would actually be a lot
more useful than the current mem controller, and would expect many to run
with the mem+swap controller enabled but the mem controller disabled or
unlimited. How much is mem and how much is swap being left to global
reclaim to decide, not imposed by any cgroup policy.

What I don't like the sound of at all is a swap controller.

Do you think that a mem controller (limit 1G) and a mem+swap controller
(limit 2G) is equivalent to a mem controller (limit 1G) and a swap
controller (limit 1G)? No: imagine memory pressure from outside the
cgroup - with the mem+swap controller it can push as much as suits of the
2G out to swap; whereas with the swap controller, once 1G is out, it has
to stop pushing any more of that cgroup out.
I think that's absurd - but perhaps I just haven't looked, and I've
totally misinterpreted the talk of a swap controller.

> From the memory controller's point of view, treating mem+swap by the
> same controller makes sense. Because the memory controller can check
> whether we can use more swap or not, we can avoid hopeless scanning of
> anon at swap-shortage. (By split-lru, I think we can do this avoidance.)

That's a detail I'm not concerned with on this level.

> Another topic? In recent servers, memory is big, swap is (relatively)
> small.

You'll know much more about those common proportions than I do. I'd
wonder why such big memory servers have any swap at all: to cope with VM
management defects we should be fixing?

> And under the memory resource controller, the whole swap is easily
> occupied by a group. I want to avoid it.

Why? I presume because you're thinking it a precious resource. I don't
think its relative smallness makes it more precious.

> For users, swap is not precious because it's not fast.

Yes, and that's my view.

> But for memory reclaiming, swap is a precious resource to page out
> anonymous/shmem/tmpfs memory.

I see that makes swap a useful resource, I don't see that it makes it a
precious resource. We page out to it precisely because it's less precious
than the memory; both users and kernel would much prefer to keep all the
data in memory, but sometimes there isn't enough memory so we go to swap.

There is just one way in which I see swap as precious, and that is to get
around some VM management stupidity. If, for example, on i386 there's a
shortage of lowmem and lots of anonymous in lowmem that we should shift
to highmem, then I think it's still the case that we have to do that
balancing via writing out to and reading in from swap, because nobody has
actually hooked up page migration to do that when appropriate? But that's
an argument for extending page migration, not for needing a swap
controller.

> I think the usual system-admin considers swap as some emergency spare
> of memory.
Yes, I do too.

> I'd like to allow this emergency spare to each cgroup.

We do allow that emergency spare to each cgroup. Perhaps you're saying
you want to divide it up in advance between the cgroups? But why? Sounds
like a nice idea (reminds me of what Paul said about using temporary
files), but a solution to what problem?

> (For example, swap is used even if vm.swappiness==0. This is for
> avoiding the OOM-killer under some situation; this behavior was added
> by Rik.)

Sorry, I don't know what you're referring to there, but again, suspect
it's a detail we don't need to be concerned with here.

> == following is another use case I explained to Rik at 23/May/08 ==
> IIRC, a man showed his motivation to control swap in OLS2007/BOF as
> following. Consider the following system (and there is no swap
> controller): Memory 4G, Swap 1G, with 2 cgroups A, B.
>
> state 1) swap is not used.
>   A: memory limit 1G, no swap usage, memory_usage=0M
>   B: memory limit 1G, no swap usage, memory_usage=0M
>
> state 2) Run
[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008, Paul Menage wrote:
> On Fri, Jul 25, 2008 at 12:46 PM, Hugh Dickins [EMAIL PROTECTED] wrote:
>> No, I'm trying to say something stronger than that. I'm saying, as I've
>> said before, that I cannot imagine why anyone would want to control
>> swap itself - what they want to control is the total of mem+swap. Swap
>> is a second-class citizen, nobody wants swap if they can have mem, so
>> why control it separately?
>
> Scheduling jobs on to machines is much more straightforward when they
> request xGB of memory and yGB of swap rather than just (x+y)GB of
> (memory+swap). We want to be able to guarantee to jobs that they will be
> able to use xGB of real memory.

I don't see that I'm denying you a way to guarantee that (though I've
been thinking more of the limits than the guarantees): I'm not saying
that you cannot have a mem controller, I'm saying that you can also have
a mem+swap controller; but that a swap-by-itself controller makes no
sense to me.

> Actually my preferred approach to swap controlling would be something
> like:
> - allow malloc to support mmapping pages from a temporary file rather
>   than mmapping anonymous memory

I think that works until you get to fork: shared files and
private/anonymous/swap behave differently from then on.

Hugh
[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008, Balbir Singh wrote:
> I see what you're saying. When you look at Linux right now, we control
> swap independent of memory, so I am not totally opposed to setting swap,
> instead of swap+mem. I might not want to swap from a particular cgroup,
> in which case, I set swap to 0 and risk OOMing, which might be an
> acceptable trade-off depending on my setup. I could easily change this
> policy on demand and add swap if OOMing was no longer OK.

It's taken me a while to understand your point. I think you're saying
that with a swap controller, you can set the swap limit to 0 on a cgroup
if you want to keep it entirely in memory, without setting any mem limit
upon it; whereas with my mem+swap controller, you'd have to set a mem
limit then an equal mem+swap limit to achieve the same "never go to swap"
effect, and maybe you don't want to set a mem limit.

Hmm, but an unreachably high mem limit, and equal mem+swap limit, would
achieve that effect. Sorry, I don't think I have understood (and even if
the unreachably high limit didn't work, this seems more about setting a
don't-swap flag than imposing a swap limit).

Hugh
[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008, Paul Menage wrote:
> So I think we'd be complicating some of the vm paths in mainline with a
> feature that isn't likely to get a lot of real use. What do you (and
> others on the containers list) think? Should we ask Andrew/Linus to hold
> off on this for now? My preference would be to do that until we have
> someone who can stand up with a concrete scenario where they want to use
> this in a real environment.

I see Andrew has already acted, so it's now moot. But I'd like to say
that I do agree with you and the conclusion to hold off for now.

I was a bit alarmed earlier to see those patches sailing on through; but
realized that I'd done very little to substantiate my hatred of the whole
thing, and decided that I didn't feel strongly enough to stand in the way
now. But I am glad you've stepped in, thank you.

(Different topic, but one day I ought to get around to saying again how
absurd I think a swap controller, whereas a mem+swap controller makes
plenty of sense. I think Rik and others said the same.)

By the way, here's a BUG I got from CONFIG_CGROUP_MEMRLIMIT_CTLR=y but no
use of it, when doing swapoff a week ago. Not investigated at all, I'm
afraid, but at a guess it might come from memrlimit work placing too much
faith in the mm_users count - swapoff is only one of several places which
have to inc/dec mm_users for some reason.
BUG: unable to handle kernel paging request at 6b6b6b8b
IP: [<7817078f>] memrlimit_cgroup_uncharge_as+0x18/0x29
*pde =
Oops: [#1] PREEMPT SMP
last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
Modules linked in: acpi_cpufreq snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device thermal ac battery button
Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
EIP: 0060:[<7817078f>] EFLAGS: 00010206 CPU: 0
EIP is at memrlimit_cgroup_uncharge_as+0x18/0x29
EAX: 6b6b6b6b EBX: 7963215c ECX: 7c032000 EDX: 0025e000
ESI: 96902518 EDI: 9fbb1aa0 EBP: 7c033e9c ESP: 7c033e9c
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process swapoff (pid: 22500, ti=7c032000 task=907e2b70 task.ti=7c032000)
Stack: 7c033edc 78161323 9fbb1aa0 025e ff77 7c033ecc 96902518 7c033ec8
       0089 7963215c 9fbb1aa0 9fbb1b28 a272f040 7c033ef4 781226b1 9fbb1aa0
       9fbb1aa0 790fa884 a272f0c8 7c033f80 78165ce3
Call Trace:
 [<78161323>] ? exit_mmap+0xaf/0x133
 [<781226b1>] ? mmput+0x4c/0xba
 [<78165ce3>] ? try_to_unuse+0x20b/0x3f5
 [<78371534>] ? _spin_unlock+0x22/0x3c
 [<7816636a>] ? sys_swapoff+0x17b/0x37c
 [<78102d95>] ? sysenter_past_esp+0x6a/0xa5
 ===
Code: 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 ad a6 fd ff c9 c3 55 89 e5 8b 45 08 8b 55 0c 8b 80 30 02 00 00 c1 e2 0c 8b 80 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 e6 a6 fd ff 58 5a c9 c3 55 89 e5 8b
EIP: [<7817078f>] memrlimit_cgroup_uncharge_as+0x18/0x29 SS:ESP 0068:7c033e9c

Hugh
[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008, Paul Menage wrote:
> On Fri, Jul 25, 2008 at 5:06 AM, Hugh Dickins [EMAIL PROTECTED] wrote:
>> (Different topic, but one day I ought to get around to saying again how
>> absurd I think a swap controller; whereas a mem+swap controller makes
>> plenty of sense. I think Rik and others said the same.)
>
> Agreed that a swap controller without a memory controller doesn't make
> much sense, but a memory controller without a swap controller can make
> sense on machines that don't intend to use swap.

I agree that a memory controller without a swap controller can make
sense: I hope so, anyway, since that's what's in mainline. Even if swap
is used, memory is a more precious resource than swap, and you were right
to go about controlling memory first.

> So if they were separate controllers, we'd use the proposed cgroup
> dependency features to make the swap controller depend on the memory
> controller - in which case you'd only be able to mount the swap
> controller on a hierarchy that also had the memory controller, and the
> swap controller would be able to make use of the page ownership
> information. It's more of a modularity issue than a functionality issue,
> I think - the swap controller and memory controller are tracking
> fundamentally different things (space on disk versus pages in memory),
> and the only dependency between them is the memory controller tracking
> the ownership of a page and providing it to the swap controller.

It sounds as if you're interpreting my mem+swap controller as a mem
controller and a swap controller, where the swap controller makes use of
some of the mem controller infrastructure.

No, I'm trying to say something stronger than that. I'm saying, as I've
said before, that I cannot imagine why anyone would want to control swap
itself - what they want to control is the total of mem+swap. Swap is a
second-class citizen, nobody wants swap if they can have mem, so why
control it separately?
IIRC Rik expressed the same by pointing out that a cgroup at its swap
limit would then be forced to grow in mem (until it hits its mem limit):
so controlling the less precious resource would increase pressure on the
more precious resource. (Actually, that probably bears little relation to
what he said - sorry, Rik!) I don't recall what answer he got, perhaps
I'd be persuaded if I heard it again.

Hugh
[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008, Balbir Singh wrote:
> I'll try and recreate the problem and fix it. If
> memrlimit_cgroup_uncharge_as() created the problem, it's most likely
> related to mm->owner not being correct and we are dereferencing the
> wrong memory. Thanks for the bug report, I'll look further.

Good luck! I have only seen it once, on a dual-core laptop; though I
don't remember to try swapoff while busy as often as I should (be sure to
alternate between a couple or more swapareas, so you can swap a new one
on just before swapping an old one off, to be pretty sure of success).

It may be easier to find in the source: my suspicion is that a bad
mm_users assumption will come into it. But I realize now that it could be
entirely unrelated to memrlimit, just that uncharge_as was the one to get
hit by bad refcounting elsewhere.

Oh, that reminds me, I never reported back on my res_counter warnings at
shutdown: never saw them again, once I added in the set of changes you
came up with shortly after that - thanks.

Hugh
[Devel] Re: [RFC/PATCH] cgroup swap subsystem
On Wed, 5 Mar 2008, Pavel Emelyanov wrote:
> Daisuke Nishimura wrote:
>> Todo:
>> - rebase new kernel, and split into some patches.
>> - Merge with memory subsystem (if it would be better), or remove
>>   dependency on CONFIG_CGROUP_MEM_CONT if possible (needs to make
>>   page_cgroup more generic one).
>
> Merge is a must IMHO. I can hardly imagine a situation in which someone
> would need these two separately.

Strongly agree. Nobody's interested in swap as such: it's just secondary
memory, where RAM is primary memory. People want to control memory as the
sum of the two; and I expect they may also want to control primary memory
(all that the current memcg does) within that. I wonder if such nesting
of limits fits easily into cgroups or will be problematic.

Hugh
[Devel] Re: [PATCH][SHMEM] Factor out sbi-free_inodes manipulations
On Mon, 26 Nov 2007, Andrew Morton wrote:
> It is peculiar to (wrongly) return -ENOMEM
>
> +	if (shmem_reserve_inode(inode->i_sb))
> +		return -ENOSPC;
>
> and to then correct it in the caller..

Oops, I missed that completely. Something boringly conventional such as
the below, perhaps? Much better, thanks.

Hugh
[Devel] Re: [PATCH][SHMEM] Factor out sbi-free_inodes manipulations
Looks good, but we can save slightly more there (depending on config),
and I found your inc/dec names a little confusing, since the count is
going the other way: how do you feel about this version? (I'd like it
better if those helpers could take a struct inode *, but they cannot.)

Hugh

From: Pavel Emelyanov [EMAIL PROTECTED]

The shmem_sb_info structure has a number of free_inodes. This value is
altered in appropriate places under spinlock and with the
sbi->max_inodes != 0 check. Consolidate these manipulations into two
helpers.

This is minus 42 bytes of shmem.o and minus 4 :) lines of code.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]
Signed-off-by: Hugh Dickins [EMAIL PROTECTED]
---

 mm/shmem.c |   72 ++++++++++++++++++++++++++++----------------------------------------
 1 file changed, 34 insertions(+), 38 deletions(-)

--- 2.6.24-rc3/mm/shmem.c	2007-11-07 04:21:45.000000000 +0000
+++ linux/mm/shmem.c	2007-11-23 12:43:28.000000000 +0000
@@ -207,6 +207,31 @@ static void shmem_free_blocks(struct ino
 	}
 }
 
+static int shmem_reserve_inode(struct super_block *sb)
+{
+	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+	if (sbinfo->max_inodes) {
+		spin_lock(&sbinfo->stat_lock);
+		if (!sbinfo->free_inodes) {
+			spin_unlock(&sbinfo->stat_lock);
+			return -ENOMEM;
+		}
+		sbinfo->free_inodes--;
+		spin_unlock(&sbinfo->stat_lock);
+	}
+	return 0;
+}
+
+static void shmem_free_inode(struct super_block *sb)
+{
+	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+	if (sbinfo->max_inodes) {
+		spin_lock(&sbinfo->stat_lock);
+		sbinfo->free_inodes++;
+		spin_unlock(&sbinfo->stat_lock);
+	}
+}
+
 /*
  * shmem_recalc_inode - recalculate the size of an inode
  *
@@ -762,7 +787,6 @@ static int shmem_notify_change(struct de
 
 static void shmem_delete_inode(struct inode *inode)
 {
-	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 	struct shmem_inode_info *info = SHMEM_I(inode);
 
 	if (inode->i_op->truncate == shmem_truncate) {
@@ -777,11 +801,7 @@ static void shmem_delete_inode(struct in
 		}
 	}
 	BUG_ON(inode->i_blocks);
-	if (sbinfo->max_inodes) {
-		spin_lock(&sbinfo->stat_lock);
-		sbinfo->free_inodes++;
-		spin_unlock(&sbinfo->stat_lock);
-	}
+	shmem_free_inode(inode->i_sb);
 	clear_inode(inode);
 }
 
@@ -1398,15 +1418,8 @@ shmem_get_inode(struct super_block *sb,
 	struct shmem_inode_info *info;
 	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
 
-	if (sbinfo->max_inodes) {
-		spin_lock(&sbinfo->stat_lock);
-		if (!sbinfo->free_inodes) {
-			spin_unlock(&sbinfo->stat_lock);
-			return NULL;
-		}
-		sbinfo->free_inodes--;
-		spin_unlock(&sbinfo->stat_lock);
-	}
+	if (shmem_reserve_inode(sb))
+		return NULL;
 
 	inode = new_inode(sb);
 	if (inode) {
@@ -1450,11 +1463,8 @@ shmem_get_inode(struct super_block *sb,
 					    NULL);
 			break;
 		}
-	} else if (sbinfo->max_inodes) {
-		spin_lock(&sbinfo->stat_lock);
-		sbinfo->free_inodes++;
-		spin_unlock(&sbinfo->stat_lock);
-	}
+	} else
+		shmem_free_inode(sb);
 	return inode;
 }
 
@@ -1797,22 +1807,14 @@ static int shmem_create(struct inode *di
 static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
 {
 	struct inode *inode = old_dentry->d_inode;
-	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
 
 	/*
 	 * No ordinary (disk based) filesystem counts links as inodes;
 	 * but each new link needs a new dentry, pinning lowmem, and
 	 * tmpfs dentries cannot be pruned until they are unlinked.
 	 */
-	if (sbinfo->max_inodes) {
-		spin_lock(&sbinfo->stat_lock);
-		if (!sbinfo->free_inodes) {
-			spin_unlock(&sbinfo->stat_lock);
-			return -ENOSPC;
-		}
-		sbinfo->free_inodes--;
-		spin_unlock(&sbinfo->stat_lock);
-	}
+	if (shmem_reserve_inode(inode->i_sb))
+		return -ENOSPC;
 
 	dir->i_size += BOGO_DIRENT_SIZE;
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
@@ -1827,14 +1829,8 @@ static int shmem_unlink(struct inode *di
 {
 	struct inode *inode = dentry->d_inode;
 
-	if (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)) {
-		struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
-		if (sbinfo->max_inodes) {
-			spin_lock(&sbinfo->stat_lock);
-			sbinfo->free_inodes++;
-			spin_unlock(&sbinfo->stat_lock
[Devel] Re: [PATCH 6/6 mm] memcgroup: revert swap_state mods
On Fri, 9 Nov 2007, KAMEZAWA Hiroyuki wrote:
> On Fri, 9 Nov 2007 07:14:22 +0000 (GMT)
> Hugh Dickins [EMAIL PROTECTED] wrote:
>> If we're charging rss and we're charging cache, it seems obvious that
>> we should be charging swapcache - as has been done. But in practice
>> that doesn't work out so well: both swapin readahead and swapoff leave
>> the majority of pages charged to the wrong cgroup (the cgroup that
>> happened to read them in, rather than the cgroup to which they belong).
>
> Thank you. I welcome this patch :)

Thank you! But perhaps less welcome if I don't confirm...

> Could I confirm a change in the logic?
> * Before this patch, wrong swapcache charge is added to one who called
>   try_to_free_page().

try_to_free_pages? No, I don't think any wrong charge was made there. It
was when reading in swap pages. The usual way is by swapin_readahead,
which reads in a cluster of swap pages, which are quite likely to belong
to different memcgroups, but were all charged to the one which is doing
the fault on its target page. Another way is in swapoff, where they all
got charged to whoever was doing the swapoff (and the charging in
unuse_pte was a no-op).

> * After this patch, anonymous page's charge will drop to 0 when
>   page_remove_rmap() is called.

Yes, when its final (usually its only) page_remove_rmap is called.

Hugh
[Devel] [PATCH 1/6 mm] swapoff: scan ptes preemptibly
Provided that CONFIG_HIGHPTE is not set, unuse_pte_range can reduce
latency in swapoff by scanning the page table preemptibly: so long as
unuse_pte is careful to recheck that entry under pte lock.

(To tell the truth, this patch was not inspired by any cries for lower
latency here: rather, this restructuring permits a future memory
controller patch to allocate with GFP_KERNEL in unuse_pte, where before
it could not. But it would be wrong to tuck this change away inside a
memcgroup patch.)

Signed-off-by: Hugh Dickins [EMAIL PROTECTED]
---
This patch could go anywhere in the mm series before the
memory-controller patches: I suggest just after
swapin-fix-valid-swaphandles-defect.patch

Subsequent patches N/6 go in different places amongst the
memory-controller patches: please see accompanying suggestions.

 mm/swapfile.c |   38 +++++++++++++++++++++++++++++++-------
 1 file changed, 31 insertions(+), 7 deletions(-)

--- patch0/mm/swapfile.c	2007-11-07 19:41:45.000000000 +0000
+++ patch1/mm/swapfile.c	2007-11-08 12:34:12.000000000 +0000
@@ -506,9 +506,19 @@ unsigned int count_swap_pages(int type,
  * just let do_wp_page work it out if a write is requested later - to
  * force COW, vm_page_prot omits write permission from any private vma.
  */
-static void unuse_pte(struct vm_area_struct *vma, pte_t *pte,
+static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, swp_entry_t entry, struct page *page)
 {
+	spinlock_t *ptl;
+	pte_t *pte;
+	int found = 1;
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (unlikely(!pte_same(*pte, swp_entry_to_pte(entry)))) {
+		found = 0;
+		goto out;
+	}
+
 	inc_mm_counter(vma->vm_mm, anon_rss);
 	get_page(page);
 	set_pte_at(vma->vm_mm, addr, pte,
@@ -520,6 +530,9 @@ static void unuse_pte(struct vm_area_str
 	 * immediately swapped out again after swapon.
 	 */
 	activate_page(page);
+out:
+	pte_unmap_unlock(pte, ptl);
+	return found;
 }
 
 static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
@@ -528,22 +541,33 @@ static int unuse_pte_range(struct vm_are
 {
 	pte_t swp_pte = swp_entry_to_pte(entry);
 	pte_t *pte;
-	spinlock_t *ptl;
 	int found = 0;
 
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	/*
+	 * We don't actually need pte lock while scanning for swp_pte: since
+	 * we hold page lock and mmap_sem, swp_pte cannot be inserted into the
+	 * page table while we're scanning; though it could get zapped, and on
+	 * some architectures (e.g. x86_32 with PAE) we might catch a glimpse
+	 * of unmatched parts which look like swp_pte, so unuse_pte must
+	 * recheck under pte lock.  Scanning without pte lock lets it be
+	 * preemptible whenever CONFIG_PREEMPT but not CONFIG_HIGHPTE.
+	 */
+	pte = pte_offset_map(pmd, addr);
 	do {
 		/*
 		 * swapoff spends a _lot_ of time in this loop!
 		 * Test inline before going to call unuse_pte.
 		 */
 		if (unlikely(pte_same(*pte, swp_pte))) {
-			unuse_pte(vma, pte++, addr, entry, page);
-			found = 1;
-			break;
+			pte_unmap(pte);
+			found = unuse_pte(vma, pmd, addr, entry, page);
+			if (found)
+				goto out;
+			pte = pte_offset_map(pmd, addr);
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
-	pte_unmap_unlock(pte - 1, ptl);
+	pte_unmap(pte - 1);
+out:
 	return found;
 }
[Devel] [PATCH 2/6 mm] memcgroup: temporarily revert swapoff mod
Whaaa? This patch precisely reverts the "swapoff: scan ptes preemptibly"
patch just presented. It's a temporary measure to allow existing memory
controller patches to apply without rejects: in due course they should be
rendered down into one sensible patch, and this reversion disappear.

Signed-off-by: Hugh Dickins [EMAIL PROTECTED]
---
This patch should go immediately before the memory-controller patches, or
immediately before memory-controller-memory-accounting-v7.patch

 mm/swapfile.c |   38 +++++++-------------------------------
 1 file changed, 7 insertions(+), 31 deletions(-)

--- patch1/mm/swapfile.c	2007-11-08 12:34:12.000000000 +0000
+++ patch2/mm/swapfile.c	2007-11-08 12:34:12.000000000 +0000
@@ -506,19 +506,9 @@ unsigned int count_swap_pages(int type,
  * just let do_wp_page work it out if a write is requested later - to
  * force COW, vm_page_prot omits write permission from any private vma.
  */
-static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
+static void unuse_pte(struct vm_area_struct *vma, pte_t *pte,
 		unsigned long addr, swp_entry_t entry, struct page *page)
 {
-	spinlock_t *ptl;
-	pte_t *pte;
-	int found = 1;
-
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	if (unlikely(!pte_same(*pte, swp_entry_to_pte(entry)))) {
-		found = 0;
-		goto out;
-	}
-
 	inc_mm_counter(vma->vm_mm, anon_rss);
 	get_page(page);
 	set_pte_at(vma->vm_mm, addr, pte,
@@ -530,9 +520,6 @@ static int unuse_pte(struct vm_area_stru
 	 * immediately swapped out again after swapon.
 	 */
 	activate_page(page);
-out:
-	pte_unmap_unlock(pte, ptl);
-	return found;
 }
 
 static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
@@ -541,33 +528,22 @@ static int unuse_pte_range(struct vm_are
 {
 	pte_t swp_pte = swp_entry_to_pte(entry);
 	pte_t *pte;
+	spinlock_t *ptl;
 	int found = 0;
 
-	/*
-	 * We don't actually need pte lock while scanning for swp_pte: since
-	 * we hold page lock and mmap_sem, swp_pte cannot be inserted into the
-	 * page table while we're scanning; though it could get zapped, and on
-	 * some architectures (e.g. x86_32 with PAE) we might catch a glimpse
-	 * of unmatched parts which look like swp_pte, so unuse_pte must
-	 * recheck under pte lock.  Scanning without pte lock lets it be
-	 * preemptible whenever CONFIG_PREEMPT but not CONFIG_HIGHPTE.
-	 */
-	pte = pte_offset_map(pmd, addr);
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	do {
 		/*
 		 * swapoff spends a _lot_ of time in this loop!
 		 * Test inline before going to call unuse_pte.
 		 */
 		if (unlikely(pte_same(*pte, swp_pte))) {
-			pte_unmap(pte);
-			found = unuse_pte(vma, pmd, addr, entry, page);
-			if (found)
-				goto out;
-			pte = pte_offset_map(pmd, addr);
+			unuse_pte(vma, pte++, addr, entry, page);
+			found = 1;
+			break;
		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
-	pte_unmap(pte - 1);
-out:
+	pte_unmap_unlock(pte - 1, ptl);
 	return found;
 }
[Devel] [PATCH 3/6 mm] memcgroup: fix try_to_free order
Why does try_to_free_mem_cgroup_pages try for order 1 pages? It's called
when mem_cgroup_charge_common would go over the limit, and that's adding
an order 0 page. I see no reason: it has to be a typo: fix it.

Signed-off-by: Hugh Dickins [EMAIL PROTECTED]
---
Insert just after
memory-controller-add-per-container-lru-and-reclaim-v7.patch

 mm/vmscan.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- patch2/mm/vmscan.c	2007-11-08 15:46:21.000000000 +0000
+++ patch3/mm/vmscan.c	2007-11-08 15:48:08.000000000 +0000
@@ -1354,7 +1354,7 @@ unsigned long try_to_free_mem_cgroup_pag
 		.may_swap = 1,
 		.swap_cluster_max = SWAP_CLUSTER_MAX,
 		.swappiness = vm_swappiness,
-		.order = 1,
+		.order = 0,
 		.mem_cgroup = mem_cont,
 		.isolate_pages = mem_cgroup_isolate_pages,
 	};
[Devel] [PATCH 5/6 mm] memcgroup: fix zone isolation OOM
mem_cgroup_charge_common shows a tendency to OOM without good reason, when a memhog goes well beyond its rss limit but with plenty of swap available. Seen on x86 but not on PowerPC; seen when the next patch omits swapcache from memcgroup, but we presume it can happen without. mem_cgroup_isolate_pages is not quite satisfying reclaim's criteria for OOM avoidance. Already it has to scan beyond the nr_to_scan limit when it finds a !LRU page or an active page when handling inactive or an inactive page when handling active. It needs to do exactly the same when it finds a page from the wrong zone (the x86 tests had two zones, the PowerPC tests had only one). Don't increment scan and then decrement it in these cases, just move the incrementation down. Fix recent off-by-one when checking against nr_to_scan. Cut out Check if the meta page went away from under us, presumably left over from early debugging: no amount of such checks could save us if this list really were being updated without locking. This change does make the unlimited scan while holding two spinlocks even worse - bad for latency and bad for containment; but that's a separate issue which is better left to be fixed a little later. 
Signed-off-by: Hugh Dickins [EMAIL PROTECTED]
---
Insert just after bugfix-for-memory-cgroup-controller-avoid-pagelru-page-in-mem_cgroup_isolate_pages-fix.patch
or just before memory-cgroup-enhancements

 mm/memcontrol.c | 17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

--- patch4/mm/memcontrol.c	2007-11-08 16:03:33.000000000 +0000
+++ patch5/mm/memcontrol.c	2007-11-08 16:51:39.000000000 +0000
@@ -260,24 +260,20 @@ unsigned long mem_cgroup_isolate_pages(u
 	spin_lock(&mem_cont->lru_lock);
 	scan = 0;
 	list_for_each_entry_safe_reverse(pc, tmp, src, lru) {
-		if (scan++ > nr_to_scan)
+		if (scan >= nr_to_scan)
 			break;
 		page = pc->page;
 		VM_BUG_ON(!pc);
 
-		if (unlikely(!PageLRU(page))) {
-			scan--;
+		if (unlikely(!PageLRU(page)))
 			continue;
-		}
 
 		if (PageActive(page) && !active) {
 			__mem_cgroup_move_lists(pc, true);
-			scan--;
 			continue;
 		}
 		if (!PageActive(page) && active) {
 			__mem_cgroup_move_lists(pc, false);
-			scan--;
 			continue;
 		}
 
@@ -288,13 +284,8 @@ unsigned long mem_cgroup_isolate_pages(u
 		if (page_zone(page) != z)
 			continue;
 
-		/*
-		 * Check if the meta page went away from under us
-		 */
-		if (!list_empty(&pc->lru))
-			list_move(&pc->lru, &pc_list);
-		else
-			continue;
+		scan++;
+		list_move(&pc->lru, &pc_list);
 
 		if (__isolate_lru_page(page, mode) == 0) {
 			list_move(&page->lru, dst);
[Devel] [PATCH 6/6 mm] memcgroup: revert swap_state mods
If we're charging rss and we're charging cache, it seems obvious that we should be charging swapcache - as has been done. But in practice that doesn't work out so well: both swapin readahead and swapoff leave the majority of pages charged to the wrong cgroup (the cgroup that happened to read them in, rather than the cgroup to which they belong).

(Which is why unuse_pte's GFP_KERNEL while holding pte lock never showed up as a problem: no allocation was ever done there, every page read being already charged to the cgroup which initiated the swapoff.)

It all works rather better if we leave the charging to do_swap_page and unuse_pte, and do nothing for swapcache itself: revert mm/swap_state.c to what it was before the memory-controller patches. This also significantly speeds up a contained process working at its limit, because it no longer needs to keep waiting for swap writeback to complete.

Is it unfair that swap pages become uncharged once they're unmapped, even though they're still clearly private to particular cgroups? For a short while, yes; but PageReclaim arranges for those pages to go to the end of the inactive list and be reclaimed soon if necessary.

shmem/tmpfs pages are a distinct case: their charging also benefits from this change, but their second life on the lists as swapcache pages may prove more unfair - that I need to check next.

Signed-off-by: Hugh Dickins [EMAIL PROTECTED]
---
Insert just after 5/6: the tree builds okay if it goes earlier, just after memory-controller-bug_on.patch, but 5/6 fixes OOM made more likely by 6/6. Alternatively, hand edit all of the mm/swap_state.c mods out of all of the memory-controller patches which modify it.
 mm/swap_state.c | 15 ++-------------
 1 file changed, 2 insertions(+), 13 deletions(-)

--- patch5/mm/swap_state.c	2007-11-08 15:58:50.000000000 +0000
+++ patch6/mm/swap_state.c	2007-11-08 16:01:11.000000000 +0000
@@ -17,7 +17,6 @@
 #include <linux/backing-dev.h>
 #include <linux/pagevec.h>
 #include <linux/migrate.h>
-#include <linux/memcontrol.h>
 
 #include <asm/pgtable.h>
 
@@ -79,11 +78,6 @@ static int __add_to_swap_cache(struct pa
 	BUG_ON(!PageLocked(page));
 	BUG_ON(PageSwapCache(page));
 	BUG_ON(PagePrivate(page));
-
-	error = mem_cgroup_cache_charge(page, current->mm, gfp_mask);
-	if (error)
-		goto out;
-
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
 		write_lock_irq(&swapper_space.tree_lock);
@@ -95,14 +89,10 @@ static int __add_to_swap_cache(struct pa
 			set_page_private(page, entry.val);
 			total_swapcache_pages++;
 			__inc_zone_page_state(page, NR_FILE_PAGES);
-		} else
-			mem_cgroup_uncharge_page(page);
-
+		}
 		write_unlock_irq(&swapper_space.tree_lock);
 		radix_tree_preload_end();
-	} else
-		mem_cgroup_uncharge_page(page);
-out:
+	}
 	return error;
 }
 
@@ -143,7 +133,6 @@ void __delete_from_swap_cache(struct pag
 	BUG_ON(PageWriteback(page));
 	BUG_ON(PagePrivate(page));
 
-	mem_cgroup_uncharge_page(page);
 	radix_tree_delete(&swapper_space.page_tree, page_private(page));
 	set_page_private(page, 0);
 	ClearPageSwapCache(page);
[Devel] [PATCH 4/6 mm] memcgroup: reinstate swapoff mod
This patch reinstates the "swapoff: scan ptes preemptibly" mod we started with: in due course it should be rendered down into the earlier patches, leaving us with a more straightforward mem_cgroup_charge mod to unuse_pte, allocating with GFP_KERNEL while holding no spinlock and no atomic kmap.

Signed-off-by: Hugh Dickins [EMAIL PROTECTED]
---
Insert just after memory-controller-make-charging-gfp-mask-aware.patch
or you may prefer to insert 4-6 all together before memory-cgroup-enhancements

 mm/swapfile.c | 42 ++++++++++++++++++++++++++++++----------
 1 file changed, 34 insertions(+), 8 deletions(-)

--- patch3/mm/swapfile.c	2007-11-08 15:48:08.000000000 +0000
+++ patch4/mm/swapfile.c	2007-11-08 15:55:12.000000000 +0000
@@ -507,11 +507,23 @@ unsigned int count_swap_pages(int type, 
  * just let do_wp_page work it out if a write is requested later - to
  * force COW, vm_page_prot omits write permission from any private vma.
  */
-static int unuse_pte(struct vm_area_struct *vma, pte_t *pte,
+static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, swp_entry_t entry, struct page *page)
 {
+	spinlock_t *ptl;
+	pte_t *pte;
+	int ret = 1;
+
 	if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
-		return -ENOMEM;
+		ret = -ENOMEM;
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (unlikely(!pte_same(*pte, swp_entry_to_pte(entry)))) {
+		if (ret > 0)
+			mem_cgroup_uncharge_page(page);
+		ret = 0;
+		goto out;
+	}
 
 	inc_mm_counter(vma->vm_mm, anon_rss);
 	get_page(page);
@@ -524,7 +536,9 @@ static int unuse_pte(struct vm_area_stru
 	 * immediately swapped out again after swapon.
 	 */
 	activate_page(page);
-	return 1;
+out:
+	pte_unmap_unlock(pte, ptl);
+	return ret;
 }
 
 static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
@@ -533,21 +547,33 @@ static int unuse_pte_range(struct vm_are
 {
 	pte_t swp_pte = swp_entry_to_pte(entry);
 	pte_t *pte;
-	spinlock_t *ptl;
 	int ret = 0;
 
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	/*
+	 * We don't actually need pte lock while scanning for swp_pte: since
+	 * we hold page lock and mmap_sem, swp_pte cannot be inserted into the
+	 * page table while we're scanning; though it could get zapped, and on
+	 * some architectures (e.g. x86_32 with PAE) we might catch a glimpse
+	 * of unmatched parts which look like swp_pte, so unuse_pte must
+	 * recheck under pte lock.  Scanning without pte lock lets it be
+	 * preemptible whenever CONFIG_PREEMPT but not CONFIG_HIGHPTE.
+	 */
+	pte = pte_offset_map(pmd, addr);
 	do {
 		/*
 		 * swapoff spends a _lot_ of time in this loop!
 		 * Test inline before going to call unuse_pte.
 		 */
		if (unlikely(pte_same(*pte, swp_pte))) {
-			ret = unuse_pte(vma, pte++, addr, entry, page);
-			break;
+			pte_unmap(pte);
+			ret = unuse_pte(vma, pmd, addr, entry, page);
+			if (ret)
+				goto out;
+			pte = pte_offset_map(pmd, addr);
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
-	pte_unmap_unlock(pte - 1, ptl);
+	pte_unmap(pte - 1);
+out:
	return ret;
 }
[Devel] Re: [RFC] [-mm PATCH] Memory controller fix swap charging context in unuse_pte()
On Mon, 29 Oct 2007, Balbir Singh wrote:
> On Mon, Oct 29, 2007 at 01:57:40AM +0530, Balbir Singh wrote:
> > Hugh Dickins wrote:
> > [snip]
> > Without your mem_cgroup mods in mm/swap_state.c, unuse_pte makes the
> > right assignments (I believe). But I find that swapout (using 600M in
> > a 512M machine) from a 200M cgroup quickly OOMs, whereas it behaves
> > correctly with your mm/swap_state.c.
>
> On my UML setup, I booted the UML instance with 512M of memory and used
> the swapout program that you shared. I tried two things:
>
> 1. Ran swapout without any changes. The program ran well without any
>    OOM condition occurring; a lot of reclaim occurred.
> 2. Ran swapout with the changes to mm/swap_state.c removed (diff below)
>    and I still did not see any OOM. The reclaim count was much smaller,
>    since swap cache did not get accounted back to the cgroup from which
>    pages were being evicted.
>
> I am not sure why I don't see the OOM that you see, still trying.
> Maybe I'm missing something obvious at this late hour in the night :-)

I reconfirm that I do see those OOMs. I'll have to try harder to analyze how they come about: I sure don't expect you to debug a problem you cannot reproduce. But what happens if you try it native rather than using UML?

Hugh
[Devel] Re: [RFC] [-mm PATCH] Memory controller fix swap charging context in unuse_pte()
On Wed, 24 Oct 2007, Balbir Singh wrote:
> Hugh Dickins wrote:
> > Thanks, Balbir. Sorry for the delay. I've not forgotten our agreement
> > that I should be splitting it into before-and-after mem cgroup
> > patches. But it's low priority for me until we're genuinely assigning
> > to a cgroup there. Hope to get back to looking into that tomorrow,
> > but no promises.
>
> No problem. We have some time with this one.

Phew - I still haven't got there.

> > I think you still see no problem, where I claim that simply omitting
> > the mem charge mods from mm/swap_state.c leads to OOMs? Maybe our
> > difference is because my memhog in the cgroup is using more memory
> > than RAM, not just more memory than allowed to the cgroup. I suspect
> > that arrives at a state (when the swapcache pages are not charged)
> > where it cannot locate the pages it needs to reclaim to stay within
> > its limit.
>
> Yes, in my case there I use memory less than RAM and more than is
> allowed by the cgroup. It's quite possible that in your case the
> swapcache has grown significantly without any limit/control on it. The
> memhog program is using memory at a rate much higher than the rate of
> reclaim. Could you share your memhog program, please?

Gosh, it's nothing special. Appended below, but please don't shame me by taking it too seriously. Defaults to working on a 600M mmap because I'm in the habit of booting mem=512M. You probably have something better yourself that you'd rather use.

> In the use case you've mentioned/tested, having these mods to control
> swapcache is actually useful, right?

No idea what you mean by "these mods to control swapcache". With your mem_cgroup mods in mm/swap_state.c, swapoff assigns the pages read in from swap to whoever's running swapoff, and your unuse_pte mem_cgroup_charge never does anything useful: swap pages should get assigned to the appropriate cgroups at that point. Without your mem_cgroup mods in mm/swap_state.c, unuse_pte makes the right assignments (I believe).
But I find that swapout (using 600M in a 512M machine) from a 200M cgroup quickly OOMs, whereas it behaves correctly with your mm/swap_state.c.

Thought little yet about what happens to shmem swapped pages, and swap readahead pages; but still suspect that they and the above issue will need a "limbo" cgroup, for pages which are expected to belong to a not-yet-identified mem cgroup.

> Could you share your major objections at this point with the memory
> controller. I hope to be able to look into/resolve them as my first
> priority in my list of items to work on.

The things I've noticed so far, as mentioned before and above. But it does worry me that I only came here through finding swapoff broken by that unuse_mm return value, and then found one issue after another. It feels like the mem cgroup people haven't really thought through or tested swap at all, and that if I looked further I'd uncover more. That's simply FUD, and I apologize if I'm being unfair: but that is how it feels, and I expect we all know that phase in a project when solving one problem uncovers three - suggests it's not ready.

Hugh

/* swapout.c */
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <sys/mman.h>

int main(int argc, char *argv[])
{
	unsigned long *base = (unsigned long *)0x08400000;
	unsigned long size;
	unsigned long limit;
	unsigned long i;
	char *ptr = NULL;

	size = argv[1]? strtoul(argv[1], &ptr, 0): 600;
	if (size >= 3*1024)
		size = 0;
	size *= 1024*1024;
	limit = size / sizeof(unsigned long);
	if (size == 0 || base + limit + 1024 < size) {
		errno = EINVAL;
		perror("swapout");
		exit(1);
	}
	base = mmap(base, size, PROT_READ|PROT_WRITE,
			MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	if (base == (unsigned long *)(-1)) {
		perror("mmap");
		exit(1);
	}
	for (i = 0; i < limit; i++)
		base[i] = i;
	if (ptr && *ptr == '.') {
		printf("Type Return to continue ");
		fflush(stdout);
		getchar();
	}
	for (i = 0; i < limit; i++)
		base[i] = limit - i;
	return 0;
}
[Devel] Re: [RFC] [-mm PATCH] Memory controller fix swap charging context in unuse_pte()
On Mon, 15 Oct 2007, Balbir Singh wrote:
> Hugh Dickins wrote:
> > --- 2.6.23-rc8-mm2/mm/swapfile.c	2007-09-27 12:03:36.000000000 +0100
> > +++ linux/mm/swapfile.c	2007-10-07 14:33:05.000000000 +0100
> > @@ -507,11 +507,23 @@ unsigned int count_swap_pages(int type, 
> >   * just let do_wp_page work it out if a write is requested later - to
> >   * force COW, vm_page_prot omits write permission from any private vma.
> >   */
> > -static int unuse_pte(struct vm_area_struct *vma, pte_t *pte,
> > +static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
> >  		unsigned long addr, swp_entry_t entry, struct page *page)
> > ...
>
> I tested this patch and it seems to be working fine. I tried swapoff -a
> in the middle of tests consuming swap. Not 100% rigorous, but a good
> test nevertheless.
>
> Tested-by: Balbir Singh [EMAIL PROTECTED]

Thanks, Balbir. Sorry for the delay. I've not forgotten our agreement that I should be splitting it into before-and-after mem cgroup patches. But it's low priority for me until we're genuinely assigning to a cgroup there. Hope to get back to looking into that tomorrow, but no promises.

I think you still see no problem, where I claim that simply omitting the mem charge mods from mm/swap_state.c leads to OOMs? Maybe our difference is because my memhog in the cgroup is using more memory than RAM, not just more memory than allowed to the cgroup. I suspect that arrives at a state (when the swapcache pages are not charged) where it cannot locate the pages it needs to reclaim to stay within its limit.

Hugh
[Devel] Re: [RFC] [-mm PATCH] Memory controller fix swap charging context in unuse_pte()
On Fri, 5 Oct 2007, Balbir Singh wrote:
> Found-by: Hugh Dickins [EMAIL PROTECTED]
>
> mem_cgroup_charge() in unuse_pte() is called under a lock, the
> pte_lock. That's clearly incorrect, since we pass GFP_KERNEL to
> mem_cgroup_charge() for allocation of page_cgroup. This patch releases
> the lock and reacquires it after the call to mem_cgroup_charge().
>
> Tested on a powerpc box by calling swapoff in the middle of a cgroup
> running a workload that pushes pages to swap.

Hard to test it adequately at present, while that call to mem_cgroup_charge is never allocating anything new.

Sorry, it's a bit ugly (the intertwining of unuse_pte and its caller), it's got a bug, and fixing that bug makes it uglier. The bug is that you ignore the pte ptr returned by pte_offset_map_lock: we could be preempted on to a different cpu just there, so a different cpu's kmap_atomic area used, with a different pte pointer; which would need to be passed back to the caller for when it unmaps.

I much prefer my patch appended further down: considering how it's safe for you to drop the ptl there because of holding page lock pushed me into seeing that we can actually do our scanning without ptl, which in many configurations has the advantage of staying preemptible (though preemptible swapoff is not terribly high on anyone's ticklist ;).

But you may well prefer that we split it into two: with me taking responsibility and blame for the preliminary patch which relaxes the locking, and you then adding the mem_cgroup_charge (and the exceptional mem_cgroup_uncharge_page) on top of that.
Hugh

Signed-off-by: Balbir Singh [EMAIL PROTECTED]
---
 mm/swapfile.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff -puN mm/swapfile.c~memory-controller-fix-unuse-pte-charging mm/swapfile.c
--- linux-2.6.23-rc8/mm/swapfile.c~memory-controller-fix-unuse-pte-charging	2007-10-03 13:45:56.000000000 +0530
+++ linux-2.6.23-rc8-balbir/mm/swapfile.c	2007-10-05 08:49:54.000000000 +0530
@@ -507,11 +507,18 @@ unsigned int count_swap_pages(int type, 
  * just let do_wp_page work it out if a write is requested later - to
  * force COW, vm_page_prot omits write permission from any private vma.
  */
-static int unuse_pte(struct vm_area_struct *vma, pte_t *pte,
-		unsigned long addr, swp_entry_t entry, struct page *page)
+static int unuse_pte(struct vm_area_struct *vma, pte_t *pte, pmd_t *pmd,
+		unsigned long addr, swp_entry_t entry, struct page *page,
+		spinlock_t **ptl)
 {
-	if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
+	pte_unmap_unlock(pte - 1, *ptl);
+
+	if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL)) {
+		pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl);
 		return -ENOMEM;
+	}
+
+	pte_offset_map_lock(vma->vm_mm, pmd, addr, ptl);
 
 	inc_mm_counter(vma->vm_mm, anon_rss);
 	get_page(page);
@@ -543,7 +550,8 @@ static int unuse_pte_range(struct vm_are
 		 * Test inline before going to call unuse_pte.
 		 */
 		if (unlikely(pte_same(*pte, swp_pte))) {
-			ret = unuse_pte(vma, pte++, addr, entry, page);
+			ret = unuse_pte(vma, pte++, pmd, addr, entry, page,
+					&ptl);
 			break;
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);

--- 2.6.23-rc8-mm2/mm/swapfile.c	2007-09-27 12:03:36.000000000 +0100
+++ linux/mm/swapfile.c	2007-10-07 14:33:05.000000000 +0100
@@ -507,11 +507,23 @@ unsigned int count_swap_pages(int type, 
  * just let do_wp_page work it out if a write is requested later - to
  * force COW, vm_page_prot omits write permission from any private vma.
  */
-static int unuse_pte(struct vm_area_struct *vma, pte_t *pte,
+static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, swp_entry_t entry, struct page *page)
 {
+	spinlock_t *ptl;
+	pte_t *pte;
+	int ret = 1;
+
 	if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
-		return -ENOMEM;
+		ret = -ENOMEM;
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (unlikely(!pte_same(*pte, swp_entry_to_pte(entry)))) {
+		if (ret > 0)
+			mem_cgroup_uncharge_page(page);
+		ret = 0;
+		goto out;
+	}
 
 	inc_mm_counter(vma->vm_mm, anon_rss);
 	get_page(page);
@@ -524,7 +536,9 @@ static int unuse_pte(struct vm_area_stru
 	 * immediately swapped out again after swapon.
 	 */
 	activate_page(page);
-	return 1;
+out:
+	pte_unmap_unlock(pte, ptl);
+	return ret;
 }
 
 static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
@@ -533,21 +547,33 @@ static int unuse_pte_range(struct vm_are
 {
 	pte_t swp_pte = swp_entry_to_pte(entry);
 	pte_t *pte
[Devel] Re: [PATCH] Wake up mandatory locks waiter on chmod (v2)
On Tue, 18 Sep 2007, J. Bruce Fields wrote:
> On Tue, Sep 18, 2007 at 12:54:56PM -0400, Trond Myklebust wrote:
> > It gets even better when you throw mmap() into the mix :-)
>
> Hm. Documentation/mandatory.txt claims that mandatory locks and mmap()
> with MAP_SHARED exclude each other, but I can't see where that's
> enforced. That file doesn't make any mention of the above race.

I believe the locks_verify_locked() call from mm/mmap.c prevents mmap'ing shared-write a file with mandatory locks in force; and the mapping_writably_mapped() calls from fs/locks.c prevent mandatory locking on a file while it's mmap'ed shared-write. Though I think there's no lock to prevent those checks racing, so it's not quite watertight.

Hugh