On Tue, 4 Dec 2018, Vlastimil Babka wrote:

> So, AFAIK, the situation is:
> - commit 5265047ac301 in 4.1 introduced __GFP_THISNODE for THP. The
> intention came a bit earlier in 4.0 commit 077fcf116c8c. (I admit acking
> both as it seemed to make sense).

Yes, both are based on the preference to fault local thp and fallback to 
local pages before allocating remotely because it does not cause the 
performance regression introduced by not setting __GFP_THISNODE.

> - The resulting node-reclaim-like behavior regressed Andrea's KVM
> workloads, but reverting it (only for madvised or non-default
> defrag=always THP by commit ac5b2c18911f) would regress David's
> workloads starting with 4.20 to pre-4.1 levels.

Almost, but the defrag=always case had the subtle difference of also 
setting __GFP_NORETRY whereas MADV_HUGEPAGE did not.  This was different 
than the comment in __alloc_pages_slowpath() that expected thp fault 
allocations to be caught by checking __GFP_NORETRY.

> If the decision is that it's too late to revert a 4.1 regression for one
> kind of workload in 4.20 because it causes regression for another
> workload, then I guess we just revert ac5b2c18911f (patch 1) for 4.20
> and don't rush a different fix (patch 2) to 4.20. It's not a big
> difference if a 4.1 regression is fixed in 4.20 or 4.21?

The revert is certainly needed to prevent the regression, yes, but I 
anticipate that Andrea will report back that patch 2 at least improves the 
situation for the problem that he was addressing, specifically that it is 
pointless to thrash any node or reclaim unnecessarily when compaction has 
already failed.  This is what setting __GFP_NORETRY for all thp fault 
allocations fixes.

> Because there might be other unexpected consequences of patch 2 that
> testing won't be able to catch in the remaining 4.20 rc's. And I'm not
> even sure if it will fix Andrea's workloads. While it should prevent
> node-reclaim-like thrashing, it will still mean that KVM (or anyone)
> won't be able to allocate THP's remotely, even if the local node is
> exhausted of both huge and base pages.

Patch 2 does nothing with respect to the remote allocation policy; it 
simply prevents reclaim (and potentially thrashing).  Patch 1 sets 
__GFP_THISNODE to prevent the remote allocation.

Note that the commit to be reverted in patch 1, if not reverted, would 
cause an even more severe regression from Andrea's case if remote memory 
is fragmented as well: it opens the door to thrashing both local and 
remote memory instead of only local memory.  I measured this as a 40% 
allocation latency regression when purposefully fragmenting all nodes and 
faulting without __GFP_THISNODE, and that was on Haswell; I can imagine it 
would be even greater on Rome.

Reply via email to