Re: [PATCH 1/4] mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC

2017-04-07 Thread Vlastimil Babka
On 04/05/2017 01:40 PM, Andrey Ryabinin wrote:
> On 04/05/2017 10:46 AM, Vlastimil Babka wrote:
>> The function __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent
>> deadlock during page migration by lock_page() (see the comment in
>> __unmap_and_move()). Then it unconditionally clears the flag, which can clear a
>> pre-existing PF_MEMALLOC flag and result in recursive reclaim. This was not a
>> problem until commit a8161d1ed609 ("mm, page_alloc: restructure direct
>> compaction handling in slowpath"), because direct compaction was called only
>> after direct reclaim, which was skipped when the PF_MEMALLOC flag was set.
>> 
>> Even now it's only a theoretical issue, as the new callsite of
>> __alloc_pages_direct_compact() is reached only for costly orders and when
>> gfp_pfmemalloc_allowed() is true, which means either __GFP_NOMEMALLOC is in
>                              ^ is false
> 
>> gfp_flags or in_interrupt() is true. There is no such known context, but let's
>> play it safe and make __alloc_pages_direct_compact() robust for cases where
>> PF_MEMALLOC is already set.
>> 
>> Fixes: a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling in slowpath")
>> Reported-by: Andrey Ryabinin <aryabi...@virtuozzo.com>
>> Signed-off-by: Vlastimil Babka <vba...@suse.cz>
>> Cc: <sta...@vger.kernel.org>
>> ---
>>  mm/page_alloc.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3589f8be53be..b84e6ffbe756 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3288,6 +3288,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>>  enum compact_priority prio, enum compact_result *compact_result)
>>  {
>>  struct page *page;
>> +unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;
>>  
>>  if (!order)
>>  return NULL;
>> @@ -3295,7 +3296,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>>  current->flags |= PF_MEMALLOC;
>>  *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>>  prio);
>> -current->flags &= ~PF_MEMALLOC;
>> +current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;
> 
> Perhaps this would look better:
> 
>   tsk_restore_flags(current, noreclaim_flag, PF_MEMALLOC);
> 
> ?

Well, I didn't care much considering this is for stable only, and patch 2/4
rewrites this to use the new API.

>>  if (*compact_result <= COMPACT_INACTIVE)
>>  return NULL;
>> 
> 



[PATCH 0/4] more robust PF_MEMALLOC handling

2017-04-05 Thread Vlastimil Babka
Hi,

this series aims to unify the setting and clearing of PF_MEMALLOC, which
prevents recursive reclaim. There are some places that clear the flag
unconditionally from current->flags, which may result in clearing a
pre-existing flag. This already resulted in a bug report that Patch 1 fixes
(without the new helpers, to make backporting easier). Patch 2 introduces the
new helpers, modelled after existing memalloc_noio_* and memalloc_nofs_*
helpers, and converts mm core to use them. Patches 3 and 4 convert non-mm code.

Based on next-20170404.

Vlastimil Babka (4):
  mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC
  mm: introduce memalloc_noreclaim_{save,restore}
  treewide: convert PF_MEMALLOC manipulations to new helpers
  mtd: nand: nandsim: convert to memalloc_noreclaim_*()

 drivers/block/nbd.c        |  7 ++++---
 drivers/mtd/nand/nandsim.c | 29 +++++++++--------------------
 drivers/scsi/iscsi_tcp.c   |  7 ++++---
 include/linux/sched/mm.h   | 12 ++++++++++++
 mm/page_alloc.c            | 10 ++++++----
 mm/vmscan.c                | 17 +++++++++++------
 net/core/dev.c             |  7 ++++---
 net/core/sock.c            |  7 ++++---
 8 files changed, 54 insertions(+), 42 deletions(-)

-- 
2.12.2



[PATCH 4/4] mtd: nand: nandsim: convert to memalloc_noreclaim_*()

2017-04-05 Thread Vlastimil Babka
Nandsim has own functions set_memalloc() and clear_memalloc() for robust
setting and clearing of PF_MEMALLOC. Replace them by the new generic helpers.
No functional change.

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
Cc: Boris Brezillon <boris.brezil...@free-electrons.com>
Cc: Richard Weinberger <rich...@nod.at>
---
 drivers/mtd/nand/nandsim.c | 29 +++++++++--------------------
 1 file changed, 9 insertions(+), 20 deletions(-)

diff --git a/drivers/mtd/nand/nandsim.c b/drivers/mtd/nand/nandsim.c
index cef818f535ed..03a0d057bf2f 100644
--- a/drivers/mtd/nand/nandsim.c
+++ b/drivers/mtd/nand/nandsim.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include <linux/sched/mm.h>
 #include 
 #include 
 #include 
@@ -1368,31 +1369,18 @@ static int get_pages(struct nandsim *ns, struct file *file, size_t count, loff_t
return 0;
 }
 
-static int set_memalloc(void)
-{
-   if (current->flags & PF_MEMALLOC)
-   return 0;
-   current->flags |= PF_MEMALLOC;
-   return 1;
-}
-
-static void clear_memalloc(int memalloc)
-{
-   if (memalloc)
-   current->flags &= ~PF_MEMALLOC;
-}
-
 static ssize_t read_file(struct nandsim *ns, struct file *file, void *buf, size_t count, loff_t pos)
 {
ssize_t tx;
-   int err, memalloc;
+   int err;
+   unsigned int noreclaim_flag;
 
err = get_pages(ns, file, count, pos);
if (err)
return err;
-   memalloc = set_memalloc();
+   noreclaim_flag = memalloc_noreclaim_save();
tx = kernel_read(file, pos, buf, count);
-   clear_memalloc(memalloc);
+   memalloc_noreclaim_restore(noreclaim_flag);
put_pages(ns);
return tx;
 }
@@ -1400,14 +1388,15 @@ static ssize_t read_file(struct nandsim *ns, struct file *file, void *buf, size_
 static ssize_t write_file(struct nandsim *ns, struct file *file, void *buf, size_t count, loff_t pos)
 {
ssize_t tx;
-   int err, memalloc;
+   int err;
+   unsigned int noreclaim_flag;
 
err = get_pages(ns, file, count, pos);
if (err)
return err;
-   memalloc = set_memalloc();
+   noreclaim_flag = memalloc_noreclaim_save();
tx = kernel_write(file, buf, count, pos);
-   clear_memalloc(memalloc);
+   memalloc_noreclaim_restore(noreclaim_flag);
put_pages(ns);
return tx;
 }
-- 
2.12.2



[PATCH 2/4] mm: introduce memalloc_noreclaim_{save,restore}

2017-04-05 Thread Vlastimil Babka
The previous patch has shown that simply setting and clearing PF_MEMALLOC in
current->flags can result in wrongly clearing a pre-existing PF_MEMALLOC flag
and potentially lead to recursive reclaim. Let's introduce helpers that support
proper nesting by saving the previous state of the flag, similar to the existing
memalloc_noio_* and memalloc_nofs_* helpers. Convert existing setting/clearing
of PF_MEMALLOC within mm to the new helpers.

There are no known issues with the converted code, but the change makes it more
robust.

Suggested-by: Michal Hocko <mho...@suse.com>
Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
 include/linux/sched/mm.h | 12 ++++++++++++
 mm/page_alloc.c          | 11 ++++++-----
 mm/vmscan.c              | 17 +++++++++++------
 3 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 9daabe138c99..2b24a6974847 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -191,4 +191,16 @@ static inline void memalloc_nofs_restore(unsigned int flags)
current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
 }
 
+static inline unsigned int memalloc_noreclaim_save(void)
+{
+   unsigned int flags = current->flags & PF_MEMALLOC;
+   current->flags |= PF_MEMALLOC;
+   return flags;
+}
+
+static inline void memalloc_noreclaim_restore(unsigned int flags)
+{
+   current->flags = (current->flags & ~PF_MEMALLOC) | flags;
+}
+
 #endif /* _LINUX_SCHED_MM_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b84e6ffbe756..037e32dccd7a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3288,15 +3288,15 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
enum compact_priority prio, enum compact_result *compact_result)
 {
struct page *page;
-   unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;
+   unsigned int noreclaim_flag;
 
if (!order)
return NULL;
 
-   current->flags |= PF_MEMALLOC;
+   noreclaim_flag = memalloc_noreclaim_save();
*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
prio);
-   current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;
+   memalloc_noreclaim_restore(noreclaim_flag);
 
if (*compact_result <= COMPACT_INACTIVE)
return NULL;
@@ -3443,12 +3443,13 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 {
struct reclaim_state reclaim_state;
int progress;
+   unsigned int noreclaim_flag;
 
cond_resched();
 
/* We now go into synchronous reclaim */
cpuset_memory_pressure_bump();
-   current->flags |= PF_MEMALLOC;
+   noreclaim_flag = memalloc_noreclaim_save();
lockdep_set_current_reclaim_state(gfp_mask);
reclaim_state.reclaimed_slab = 0;
	current->reclaim_state = &reclaim_state;
@@ -3458,7 +3459,7 @@ __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 
current->reclaim_state = NULL;
lockdep_clear_current_reclaim_state();
-   current->flags &= ~PF_MEMALLOC;
+   memalloc_noreclaim_restore(noreclaim_flag);
 
cond_resched();
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 58615bb27f2f..ff63b91a0f48 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2992,6 +2992,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
struct zonelist *zonelist;
unsigned long nr_reclaimed;
int nid;
+   unsigned int noreclaim_flag;
struct scan_control sc = {
.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
@@ -3018,9 +3019,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
sc.gfp_mask,
sc.reclaim_idx);
 
-   current->flags |= PF_MEMALLOC;
+   noreclaim_flag = memalloc_noreclaim_save();
	nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
-   current->flags &= ~PF_MEMALLOC;
+   memalloc_noreclaim_restore(noreclaim_flag);
 
trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
 
@@ -3544,8 +3545,9 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
struct task_struct *p = current;
unsigned long nr_reclaimed;
+   unsigned int noreclaim_flag;
 
-   p->flags |= PF_MEMALLOC;
+   noreclaim_flag = memalloc_noreclaim_save();
lockdep_set_current_reclaim_state(sc.gfp_mask);
reclaim_state.reclaimed_slab = 0;
	p->reclaim_state = &reclaim_state;
@@ -3554,7 +3556,7 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
 

[PATCH 3/4] treewide: convert PF_MEMALLOC manipulations to new helpers

2017-04-05 Thread Vlastimil Babka
We now have memalloc_noreclaim_{save,restore} helpers for robust setting and
clearing of PF_MEMALLOC. Let's convert the code which was using the generic
tsk_restore_flags(). No functional change.

Signed-off-by: Vlastimil Babka <vba...@suse.cz>
Cc: Josef Bacik <jba...@fb.com>
Cc: Lee Duncan <ldun...@suse.com>
Cc: Chris Leech <cle...@redhat.com>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: Eric Dumazet <eduma...@google.com>
---
 drivers/block/nbd.c      | 7 ++++---
 drivers/scsi/iscsi_tcp.c | 7 ++++---
 net/core/dev.c           | 7 ++++---
 net/core/sock.c          | 7 ++++---
 4 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 03ae72985c79..929fc548c7fb 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <linux/sched/mm.h>
 #include 
 #include 
 #include 
@@ -210,7 +211,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send,
struct socket *sock = nbd->socks[index]->sock;
int result;
struct msghdr msg;
-   unsigned long pflags = current->flags;
+   unsigned int noreclaim_flag;
 
if (unlikely(!sock)) {
dev_err_ratelimited(disk_to_dev(nbd->disk),
@@ -221,7 +222,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send,
 
msg.msg_iter = *iter;
 
-   current->flags |= PF_MEMALLOC;
+   noreclaim_flag = memalloc_noreclaim_save();
do {
sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
msg.msg_name = NULL;
@@ -244,7 +245,7 @@ static int sock_xmit(struct nbd_device *nbd, int index, int send,
*sent += result;
	} while (msg_data_left(&msg));
 
-   tsk_restore_flags(current, pflags, PF_MEMALLOC);
+   memalloc_noreclaim_restore(noreclaim_flag);
 
return result;
 }
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 4228aba1f654..4842fc0e809d 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include <linux/sched/mm.h>
 #include 
 #include 
 #include 
@@ -371,10 +372,10 @@ static inline int iscsi_sw_tcp_xmit_qlen(struct iscsi_conn *conn)
 static int iscsi_sw_tcp_pdu_xmit(struct iscsi_task *task)
 {
struct iscsi_conn *conn = task->conn;
-   unsigned long pflags = current->flags;
+   unsigned int noreclaim_flag;
int rc = 0;
 
-   current->flags |= PF_MEMALLOC;
+   noreclaim_flag = memalloc_noreclaim_save();
 
while (iscsi_sw_tcp_xmit_qlen(conn)) {
rc = iscsi_sw_tcp_xmit(conn);
@@ -387,7 +388,7 @@ static int iscsi_sw_tcp_pdu_xmit(struct iscsi_task *task)
rc = 0;
}
 
-   tsk_restore_flags(current, pflags, PF_MEMALLOC);
+   memalloc_noreclaim_restore(noreclaim_flag);
return rc;
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index fde8b3f7136b..e0705a126b24 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -81,6 +81,7 @@
 #include 
 #include 
 #include 
+#include <linux/sched/mm.h>
 #include 
 #include 
 #include 
@@ -4227,7 +4228,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
int ret;
 
if (sk_memalloc_socks() && skb_pfmemalloc(skb)) {
-   unsigned long pflags = current->flags;
+   unsigned int noreclaim_flag;
 
/*
 * PFMEMALLOC skbs are special, they should
@@ -4238,9 +4239,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
 * Use PF_MEMALLOC as this saves us from propagating the allocation
 * context down to all allocation sites.
 */
-   current->flags |= PF_MEMALLOC;
+   noreclaim_flag = memalloc_noreclaim_save();
ret = __netif_receive_skb_core(skb, true);
-   tsk_restore_flags(current, pflags, PF_MEMALLOC);
+   memalloc_noreclaim_restore(noreclaim_flag);
} else
ret = __netif_receive_skb_core(skb, false);
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 392f9b6f96e2..0b2d06b4c308 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -102,6 +102,7 @@
 #include 
 #include 
 #include 
+#include <linux/sched/mm.h>
 #include 
 #include 
 #include 
@@ -372,14 +373,14 @@ EXPORT_SYMBOL_GPL(sk_clear_memalloc);
 int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
 {
int ret;
-   unsigned long pflags = current->flags;
+   unsigned int noreclaim_flag;
 
/* these should have been dropped before queueing */
BUG_ON(!sock_flag(sk, SOCK_MEMALLOC));
 
-   current->flags |= PF_MEMALLOC;
+   noreclaim_flag = memalloc_noreclaim_save();
ret = sk->sk_backlog_rcv(sk, skb);
-   tsk_restore_flags(current, pflags, PF_MEMALLOC);
+   memalloc_noreclaim_restore(noreclaim_flag);
 
return ret;
 }
-- 
2.12.2



[PATCH 1/4] mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC

2017-04-05 Thread Vlastimil Babka
The function __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent
deadlock during page migration by lock_page() (see the comment in
__unmap_and_move()). Then it unconditionally clears the flag, which can clear a
pre-existing PF_MEMALLOC flag and result in recursive reclaim. This was not a
problem until commit a8161d1ed609 ("mm, page_alloc: restructure direct
compaction handling in slowpath"), because direct compaction was called only
after direct reclaim, which was skipped when the PF_MEMALLOC flag was set.

Even now it's only a theoretical issue, as the new callsite of
__alloc_pages_direct_compact() is reached only for costly orders and when
gfp_pfmemalloc_allowed() is true, which means either __GFP_NOMEMALLOC is in
gfp_flags or in_interrupt() is true. There is no such known context, but let's
play it safe and make __alloc_pages_direct_compact() robust for cases where
PF_MEMALLOC is already set.

Fixes: a8161d1ed609 ("mm, page_alloc: restructure direct compaction handling in slowpath")
Reported-by: Andrey Ryabinin <aryabi...@virtuozzo.com>
Signed-off-by: Vlastimil Babka <vba...@suse.cz>
Cc: <sta...@vger.kernel.org>
---
 mm/page_alloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589f8be53be..b84e6ffbe756 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3288,6 +3288,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
enum compact_priority prio, enum compact_result *compact_result)
 {
struct page *page;
+   unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;
 
if (!order)
return NULL;
@@ -3295,7 +3296,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
current->flags |= PF_MEMALLOC;
*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
prio);
-   current->flags &= ~PF_MEMALLOC;
+   current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;
 
if (*compact_result <= COMPACT_INACTIVE)
return NULL;
-- 
2.12.2



Re: mm: VM_BUG_ON_PAGE(PageTail(page)) in mbind

2016-01-26 Thread Vlastimil Babka
On 26.1.2016 21:28, Kirill A. Shutemov wrote:
> From 396ad132be07a2d2b9ec5d1d6ec9fe2fffe8105e Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
> Date: Tue, 26 Jan 2016 22:59:16 +0300
> Subject: [PATCH] sg: mark VMA as VM_IO to prevent migration
> 
> Reduced testcase:
> 
>   #include <stddef.h>
>   #include <fcntl.h>
>   #include <sys/mman.h>
>   #include <numaif.h>
> 
>   #define SIZE 0x2000
> 
>   int main()
>   {
>   int fd;
>   void *p;
> 
>   fd = open("/dev/sg0", O_RDWR);
>   p = mmap(NULL, SIZE, PROT_EXEC, MAP_PRIVATE | MAP_LOCKED, fd, 0);
>   mbind(p, SIZE, 0, NULL, 0, MPOL_MF_MOVE);
>   return 0;
>   }
> 
> We shouldn't try to migrate pages in sg VMA as we don't have a way to
> update Sg_scatter_hold::pages accordingly from mm core.
> 
> Let's mark the VMA as VM_IO to indicate to mm core that the VMA is
> migratable.

 ^ not migratable.

Acked-by: Vlastimil Babka <vba...@suse.cz>


> 
> Signed-off-by: Kirill A. Shutemov <kirill.shute...@linux.intel.com>
> Reported-by: Dmitry Vyukov <dvyu...@google.com>
> ---
>  drivers/scsi/sg.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c
> index 503ab8b46c0b..5e820674432c 100644
> --- a/drivers/scsi/sg.c
> +++ b/drivers/scsi/sg.c
> @@ -1261,7 +1261,7 @@ sg_mmap(struct file *filp, struct vm_area_struct *vma)
>   }
>  
>   sfp->mmap_called = 1;
> - vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
> + vma->vm_flags |= VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
>   vma->vm_private_data = sfp;
>   vma->vm_ops = _mmap_vm_ops;
>   return 0;
> 
