Currently, if memcg reclaim encounters a page under writeback, it waits for the writeback to finish. This is done to avoid hitting OOM when there are a lot of potentially reclaimable pages under writeback, because memcg lacks a dirty pages limit. Although it saves us from premature OOM, this technique is deadlock-prone if writeback is supposed to be done by a process that might need to allocate memory, as in the case of vstorage. If the process responsible for writeback tries to allocate a page, it might get stuck in the too_many_isolated() loop, waiting for processes performing memcg reclaim to put isolated pages back to the LRU, while memcg reclaim might itself be stuck waiting for writeback to complete, resulting in a deadlock.
To avoid this kind of deadlock, instead of waiting for page writeback directly, call congestion_wait() after returning isolated pages to the LRU if writeback pages are being recycled through the LRU before IO can complete. This should still prevent premature memcg OOM while making the deadlock described above impossible.

https://jira.sw.ru/browse/PSBM-48115

Signed-off-by: Vladimir Davydov <[email protected]>
---
 mm/vmscan.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3f6ce18df3ed..3ac08ddf50b8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -929,11 +929,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 *    __GFP_IO|__GFP_FS for this reason); but more thought
 		 *    would probably show more reasons.
 		 *
-		 * 3) memcg encounters a page that is not already marked
+		 * 3) memcg encounters a page that is already marked
 		 *    PageReclaim. memcg does not have any dirty pages
 		 *    throttling so we could easily OOM just because too many
 		 *    pages are in writeback and there is nothing else to
-		 *    reclaim. Wait for the writeback to complete.
+		 *    reclaim. Stall memcg reclaim then.
 		 */
 		if (PageWriteback(page)) {
 			/* Case 1 above */
@@ -954,7 +954,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				 * enough to care. What we do want is for this
 				 * page to have PageReclaim set next time memcg
 				 * reclaim reaches the tests above, so it will
-				 * then wait_on_page_writeback() to avoid OOM;
+				 * then stall to avoid OOM;
 				 * and it's also appropriate in global reclaim.
 				 */
 				SetPageReclaim(page);
@@ -964,7 +964,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
 			/* Case 3 above */
 			} else {
-				wait_on_page_writeback(page);
+				nr_immediate++;
+				goto keep_locked;
 			}
 		}
 
@@ -1586,10 +1587,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	if (nr_writeback && nr_writeback == nr_taken)
 		zone_set_flag(zone, ZONE_WRITEBACK);
 
-	/*
-	 * memcg will stall in page writeback so only consider forcibly
-	 * stalling for global reclaim
-	 */
+	if (!global_reclaim(sc) && nr_immediate)
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
+
 	if (global_reclaim(sc)) {
 		/*
 		 * Tag a zone as congested if all the dirty pages scanned were
-- 
2.1.4
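
For readers who want to reason about the new control flow outside the kernel, here is a minimal userspace sketch of the memcg path after this change. It is only a model under stated assumptions: struct model_page, struct scan_result, model_shrink_page_list() and model_should_stall() are hypothetical names invented for illustration, and the model ignores page locking, the gfp mask checks, and the kswapd case (Case 1). It shows only that a writeback page is never waited on while isolated: it is counted, kept for the LRU, and the caller decides afterwards whether to stall.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct page; not the kernel type. */
struct model_page {
	bool writeback;	/* models PageWriteback() */
	bool reclaim;	/* models PageReclaim() */
};

/* Hypothetical counters, loosely named after the variables in the patch. */
struct scan_result {
	size_t nr_reclaimed;	/* pages freed by this pass */
	size_t nr_kept;		/* pages that go back to the LRU */
	size_t nr_immediate;	/* writeback pages already tagged PageReclaim */
};

/*
 * Simplified model of the memcg branch of shrink_page_list() after the
 * patch. No page is ever waited on here; writeback pages are only
 * tagged or counted, then kept.
 */
static struct scan_result model_shrink_page_list(struct model_page *pages,
						 size_t n)
{
	struct scan_result res = { 0, 0, 0 };
	size_t i;

	assert(pages != NULL || n == 0);

	for (i = 0; i < n; i++) {
		struct model_page *page = &pages[i];

		if (page->writeback) {
			if (!page->reclaim) {
				/* First encounter: tag it for next time. */
				page->reclaim = true;
			} else {
				/* Case 3: seen again before IO completed. */
				res.nr_immediate++;
			}
			res.nr_kept++;	/* models "goto keep_locked" */
			continue;
		}
		res.nr_reclaimed++;
	}
	return res;
}

/*
 * Model of the check added to shrink_inactive_list(): it runs only after
 * the isolated pages are back on the LRU, so a stall here cannot hold
 * pages that too_many_isolated() is waiting for.
 */
static bool model_should_stall(bool global_reclaim,
			       const struct scan_result *res)
{
	return !global_reclaim && res->nr_immediate > 0;
}
```

In the real code the stall is congestion_wait(BLK_RW_ASYNC, HZ/10), which is bounded by its timeout, whereas the old wait_on_page_writeback() call could block for as long as the IO took.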
