Re: [PATCH 1/4] mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
On 04/05/2017 01:40 PM, Andrey Ryabinin wrote: > On 04/05/2017 10:46 AM, Vlastimil Babka wrote: >> The function __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent >> deadlock during page migration by lock_page() (see the comment in >> __unmap_and_move()). Then it unconditionally clears the flag, which can >> clear a >> pre-existing PF_MEMALLOC flag and result in recursive reclaim. This was not a >> problem until commit a8161d1ed609 ("mm, page_alloc: restructure direct >> compaction handling in slowpath"), because direct compation was called only >> after direct reclaim, which was skipped when PF_MEMALLOC flag was set. >> >> Even now it's only a theoretical issue, as the new callsite of >> __alloc_pages_direct_compact() is reached only for costly orders and when >> gfp_pfmemalloc_allowed() is true, which means either __GFP_NOMEMALLOC is in >is false > >> gfp_flags or in_interrupt() is true. There is no such known context, but >> let's >> play it safe and make __alloc_pages_direct_compact() robust for cases where >> PF_MEMALLOC is already set. >> >> Fixes: a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling >> in slowpath") >> Reported-by: Andrey Ryabinin <aryabi...@virtuozzo.com> >> Signed-off-by: Vlastimil Babka <vba...@suse.cz> >> Cc: <sta...@vger.kernel.org> >> --- >> mm/page_alloc.c | 3 ++- >> 1 file changed, 2 insertions(+), 1 deletion(-) >> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index 3589f8be53be..b84e6ffbe756 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -3288,6 +3288,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned >> int order, >> enum compact_priority prio, enum compact_result *compact_result) >> { >> struct page *page; >> +unsigned int noreclaim_flag = current->flags & PF_MEMALLOC; >> >> if (!order) >> return NULL; >> @@ -3295,7 +3296,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned >> int order, >> current->flags |= PF_MEMALLOC; >> *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac, >> prio); >> -current->flags &= ~PF_MEMALLOC; >> +current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag; > > Perhaps this would look better: > > tsk_restore_flags(current, noreclaim_flag, PF_MEMALLOC); > > ? Well, I didn't care much considering this is for stable only, and patch 2/4 rewrites this to the new api. >> if (*compact_result <= COMPACT_INACTIVE) >> return NULL; >> >
[PATCH 0/4] more robust PF_MEMALLOC handling
Hi, this series aims to unify the setting and clearing of PF_MEMALLOC, which prevents recursive reclaim. There are some places that clear the flag unconditionally from current->flags, which may result in clearing a pre-existing flag. This already resulted in a bug report that Patch 1 fixes (without the new helpers, to make backporting easier). Patch 2 introduces the new helpers, modelled after existing memalloc_noio_* and memalloc_nofs_* helpers, and converts mm core to use them. Patches 3 and 4 convert non-mm code. Based on next-20170404. Vlastimil Babka (4): mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC mm: introduce memalloc_noreclaim_{save,restore} treewide: convert PF_MEMALLOC manipulations to new helpers mtd: nand: nandsim: convert to memalloc_noreclaim_*() drivers/block/nbd.c| 7 --- drivers/mtd/nand/nandsim.c | 29 + drivers/scsi/iscsi_tcp.c | 7 --- include/linux/sched/mm.h | 12 mm/page_alloc.c| 10 ++ mm/vmscan.c| 17 +++-- net/core/dev.c | 7 --- net/core/sock.c| 7 --- 8 files changed, 54 insertions(+), 42 deletions(-) -- 2.12.2
[PATCH 4/4] mtd: nand: nandsim: convert to memalloc_noreclaim_*()
Nandsim has own functions set_memalloc() and clear_memalloc() for robust setting and clearing of PF_MEMALLOC. Replace them by the new generic helpers. No functional change. Signed-off-by: Vlastimil Babka <vba...@suse.cz> Cc: Boris Brezillon <boris.brezil...@free-electrons.com> Cc: Richard Weinberger <rich...@nod.at> --- drivers/mtd/nand/nandsim.c | 29 + 1 file changed, 9 insertions(+), 20 deletions(-) diff --git a/drivers/mtd/nand/nandsim.c b/drivers/mtd/nand/nandsim.c index cef818f535ed..03a0d057bf2f 100644 --- a/drivers/mtd/nand/nandsim.c +++ b/drivers/mtd/nand/nandsim.c @@ -40,6 +40,7 @@ #include #include #include +#include #include #include #include @@ -1368,31 +1369,18 @@ static int get_pages(struct nandsim *ns, struct file *file, size_t count, loff_t return 0; } -static int set_memalloc(void) -{ - if (current->flags & PF_MEMALLOC) - return 0; - current->flags |= PF_MEMALLOC; - return 1; -} - -static void clear_memalloc(int memalloc) -{ - if (memalloc) - current->flags &= ~PF_MEMALLOC; -} - static ssize_t read_file(struct nandsim *ns, struct file *file, void *buf, size_t count, loff_t pos) { ssize_t tx; - int err, memalloc; + int err; + unsigned int noreclaim_flag; err = get_pages(ns, file, count, pos); if (err) return err; - memalloc = set_memalloc(); + noreclaim_flag = memalloc_noreclaim_save(); tx = kernel_read(file, pos, buf, count); - clear_memalloc(memalloc); + memalloc_noreclaim_restore(noreclaim_flag); put_pages(ns); return tx; } @@ -1400,14 +1388,15 @@ static ssize_t read_file(struct nandsim *ns, struct file *file, void *buf, size_ static ssize_t write_file(struct nandsim *ns, struct file *file, void *buf, size_t count, loff_t pos) { ssize_t tx; - int err, memalloc; + int err; + unsigned int noreclaim_flag; err = get_pages(ns, file, count, pos); if (err) return err; - memalloc = set_memalloc(); + noreclaim_flag = memalloc_noreclaim_save(); tx = kernel_write(file, buf, count, pos); - clear_memalloc(memalloc); + memalloc_noreclaim_restore(noreclaim_flag); put_pages(ns); return tx; } -- 2.12.2
[PATCH 2/4] mm: introduce memalloc_noreclaim_{save,restore}
The previous patch has shown that simply setting and clearing PF_MEMALLOC in current->flags can result in wrongly clearing a pre-existing PF_MEMALLOC flag and potentially lead to recursive reclaim. Let's introduce helpers that support proper nesting by saving the previous stat of the flag, similar to the existing memalloc_noio_* and memalloc_nofs_* helpers. Convert existing setting/clearing of PF_MEMALLOC within mm to the new helpers. There are no known issues with the converted code, but the change makes it more robust. Suggested-by: Michal Hocko <mho...@suse.com> Signed-off-by: Vlastimil Babka <vba...@suse.cz> --- include/linux/sched/mm.h | 12 mm/page_alloc.c | 11 ++- mm/vmscan.c | 17 +++-- 3 files changed, 29 insertions(+), 11 deletions(-) diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 9daabe138c99..2b24a6974847 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -191,4 +191,16 @@ static inline void memalloc_nofs_restore(unsigned int flags) current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags; } +static inline unsigned int memalloc_noreclaim_save(void) +{ + unsigned int flags = current->flags & PF_MEMALLOC; + current->flags |= PF_MEMALLOC; + return flags; +} + +static inline void memalloc_noreclaim_restore(unsigned int flags) +{ + current->flags = (current->flags & ~PF_MEMALLOC) | flags; +} + #endif /* _LINUX_SCHED_MM_H */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b84e6ffbe756..037e32dccd7a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3288,15 +3288,15 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, enum compact_priority prio, enum compact_result *compact_result) { struct page *page; - unsigned int noreclaim_flag = current->flags & PF_MEMALLOC; + unsigned int noreclaim_flag; if (!order) return NULL; - current->flags |= PF_MEMALLOC; + noreclaim_flag = memalloc_noreclaim_save(); *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac, prio); - current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag; + memalloc_noreclaim_restore(noreclaim_flag); if (*compact_result <= COMPACT_INACTIVE) return NULL; @@ -3443,12 +3443,13 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order, { struct reclaim_state reclaim_state; int progress; + unsigned int noreclaim_flag; cond_resched(); /* We now go into synchronous reclaim */ cpuset_memory_pressure_bump(); - current->flags |= PF_MEMALLOC; + noreclaim_flag = memalloc_noreclaim_save(); lockdep_set_current_reclaim_state(gfp_mask); reclaim_state.reclaimed_slab = 0; current->reclaim_state = _state; @@ -3458,7 +3459,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order, current->reclaim_state = NULL; lockdep_clear_current_reclaim_state(); - current->flags &= ~PF_MEMALLOC; + memalloc_noreclaim_restore(noreclaim_flag); cond_resched(); diff --git a/mm/vmscan.c b/mm/vmscan.c index 58615bb27f2f..ff63b91a0f48 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2992,6 +2992,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, struct zonelist *zonelist; unsigned long nr_reclaimed; int nid; + unsigned int noreclaim_flag; struct scan_control sc = { .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX), .gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) | @@ -3018,9 +3019,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg, sc.gfp_mask, sc.reclaim_idx); - current->flags |= PF_MEMALLOC; + noreclaim_flag = memalloc_noreclaim_save(); nr_reclaimed = do_try_to_free_pages(zonelist, ); - current->flags &= ~PF_MEMALLOC; + memalloc_noreclaim_restore(noreclaim_flag); trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed); @@ -3544,8 +3545,9 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim) struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask); struct task_struct *p = current; unsigned long nr_reclaimed; + unsigned int noreclaim_flag; - p->flags |= PF_MEMALLOC; + noreclaim_flag = memalloc_noreclaim_save(); lockdep_set_current_reclaim_state(sc.gfp_mask); reclaim_state.reclaimed_slab = 0; p->reclaim_state = _state; @@ -3554,7 +3556,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
[PATCH 3/4] treewide: convert PF_MEMALLOC manipulations to new helpers
We now have memalloc_noreclaim_{save,restore} helpers for robust setting and clearing of PF_MEMALLOC. Let's convert the code which was using the generic tsk_restore_flags(). No functional change. Signed-off-by: Vlastimil Babka <vba...@suse.cz> Cc: Josef Bacik <jba...@fb.com> Cc: Lee Duncan <ldun...@suse.com> Cc: Chris Leech <cle...@redhat.com> Cc: "David S. Miller" <da...@davemloft.net> Cc: Eric Dumazet <eduma...@google.com> --- drivers/block/nbd.c | 7 --- drivers/scsi/iscsi_tcp.c | 7 --- net/core/dev.c | 7 --- net/core/sock.c | 7 --- 4 files changed, 16 insertions(+), 12 deletions(-) diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index 03ae72985c79..929fc548c7fb 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include #include @@ -210,7 +211,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send, struct socket *sock = nbd->socks[index]->sock; int result; struct msghdr msg; - unsigned long pflags = current->flags; + unsigned int noreclaim_flag; if (unlikely(!sock)) { dev_err_ratelimited(disk_to_dev(nbd->disk), @@ -221,7 +222,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send, msg.msg_iter = *iter; - current->flags |= PF_MEMALLOC; + noreclaim_flag = memalloc_noreclaim_save(); do { sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC; msg.msg_name = NULL; @@ -244,7 +245,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send, *sent += result; } while (msg_data_left()); - tsk_restore_flags(current, pflags, PF_MEMALLOC); + memalloc_noreclaim_restore(noreclaim_flag); return result; } diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c index 4228aba1f654..4842fc0e809d 100644 --- a/drivers/scsi/iscsi_tcp.c +++ b/drivers/scsi/iscsi_tcp.c @@ -30,6 +30,7 @@ #include #include #include +#include #include #include #include @@ -371,10 +372,10 @@ static inline int iscsi_sw_tcp_xmit_qlen(struct iscsi_conn *conn) static int iscsi_sw_tcp_pdu_xmit(struct iscsi_task *task) { struct iscsi_conn *conn = task->conn; - unsigned long pflags = current->flags; + unsigned int noreclaim_flag; int rc = 0; - current->flags |= PF_MEMALLOC; + noreclaim_flag = memalloc_noreclaim_save(); while (iscsi_sw_tcp_xmit_qlen(conn)) { rc = iscsi_sw_tcp_xmit(conn); @@ -387,7 +388,7 @@ static int iscsi_sw_tcp_pdu_xmit(struct iscsi_task *task) rc = 0; } - tsk_restore_flags(current, pflags, PF_MEMALLOC); + memalloc_noreclaim_restore(noreclaim_flag); return rc; } diff --git a/net/core/dev.c b/net/core/dev.c index fde8b3f7136b..e0705a126b24 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -81,6 +81,7 @@ #include #include #include +#include #include #include #include @@ -4227,7 +4228,7 @@ static int __netif_receive_skb(struct sk_buff *skb) int ret; if (sk_memalloc_socks() && skb_pfmemalloc(skb)) { - unsigned long pflags = current->flags; + unsigned int noreclaim_flag; /* * PFMEMALLOC skbs are special, they should @@ -4238,9 +4239,9 @@ static int __netif_receive_skb(struct sk_buff *skb) * Use PF_MEMALLOC as this saves us from propagating the allocation * context down to all allocation sites. */ - current->flags |= PF_MEMALLOC; + noreclaim_flag = memalloc_noreclaim_save(); ret = __netif_receive_skb_core(skb, true); - tsk_restore_flags(current, pflags, PF_MEMALLOC); + memalloc_noreclaim_restore(noreclaim_flag); } else ret = __netif_receive_skb_core(skb, false); diff --git a/net/core/sock.c b/net/core/sock.c index 392f9b6f96e2..0b2d06b4c308 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -102,6 +102,7 @@ #include #include #include +#include #include #include #include @@ -372,14 +373,14 @@ EXPORT_SYMBOL_GPL(sk_clear_memalloc); int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb) { int ret; - unsigned long pflags = current->flags; + unsigned int noreclaim_flag; /* these should have been dropped before queueing */ BUG_ON(!sock_flag(sk, SOCK_MEMALLOC)); - current->flags |= PF_MEMALLOC; + noreclaim_flag = memalloc_noreclaim_save(); ret = sk->sk_backlog_rcv(sk, skb); - tsk_restore_flags(current, pflags, PF_MEMALLOC); + memalloc_noreclaim_restore(noreclaim_flag); return ret; } -- 2.12.2
[PATCH 1/4] mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
The function __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent deadlock during page migration by lock_page() (see the comment in __unmap_and_move()). Then it unconditionally clears the flag, which can clear a pre-existing PF_MEMALLOC flag and result in recursive reclaim. This was not a problem until commit a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling in slowpath"), because direct compation was called only after direct reclaim, which was skipped when PF_MEMALLOC flag was set. Even now it's only a theoretical issue, as the new callsite of __alloc_pages_direct_compact() is reached only for costly orders and when gfp_pfmemalloc_allowed() is true, which means either __GFP_NOMEMALLOC is in gfp_flags or in_interrupt() is true. There is no such known context, but let's play it safe and make __alloc_pages_direct_compact() robust for cases where PF_MEMALLOC is already set. Fixes: a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling in slowpath") Reported-by: Andrey Ryabinin <aryabi...@virtuozzo.com> Signed-off-by: Vlastimil Babka <vba...@suse.cz> Cc: <sta...@vger.kernel.org> --- mm/page_alloc.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3589f8be53be..b84e6ffbe756 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3288,6 +3288,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, enum compact_priority prio, enum compact_result *compact_result) { struct page *page; + unsigned int noreclaim_flag = current->flags & PF_MEMALLOC; if (!order) return NULL; @@ -3295,7 +3296,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order, current->flags |= PF_MEMALLOC; *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac, prio); - current->flags &= ~PF_MEMALLOC; + current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag; if (*compact_result <= COMPACT_INACTIVE) return NULL; -- 2.12.2
Re: mm: VM_BUG_ON_PAGE(PageTail(page)) in mbind
On 26.1.2016 21:28, Kirill A. Shutemov wrote: > From 396ad132be07a2d2b9ec5d1d6ec9fe2fffe8105e Mon Sep 17 00:00:00 2001 > From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com> > Date: Tue, 26 Jan 2016 22:59:16 +0300 > Subject: [PATCH] sg: mark VMA as VM_IO to prevent migration > > Reduced testcase: > > #include > #include > #include > #include > > #define SIZE 0x2000 > > int main() > { > int fd; > void *p; > > fd = open("/dev/sg0", O_RDWR); > p = mmap(NULL, SIZE, PROT_EXEC, MAP_PRIVATE | MAP_LOCKED, fd, > 0); > mbind(p, SIZE, 0, NULL, 0, MPOL_MF_MOVE); > return 0; > } > > We shouldn't try to migrate pages in sg VMA as we don't have a way to > update Sg_scatter_hold::pages accordingly from mm core. > > Let's mark the VMA as VM_IO to indicate to mm core that the VMA is > migratable. ^ not migratable. Acked-by: Vlastimil Babka <vba...@suse.cz> > > Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com> > Reported-by: Dmitry Vyukov <dvyu...@google.com> > --- > drivers/scsi/sg.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c > index 503ab8b46c0b..5e820674432c 100644 > --- a/drivers/scsi/sg.c > +++ b/drivers/scsi/sg.c > @@ -1261,7 +1261,7 @@ sg_mmap(struct file *filp, struct vm_area_struct *vma) > } > > sfp->mmap_called = 1; > - vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP; > + vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP; > vma->vm_private_data = sfp; > vma->vm_ops = _mmap_vm_ops; > return 0; > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html