Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-02-05 Thread Trevor Cordes
On 2017-02-05 Michal Hocko wrote:
> On Fri 03-02-17 18:36:54, Trevor Cordes wrote:
> > I ran to_test/linus-tree/oom_hickups branch (4.10.0-rc6+) for 50
> > hours and it does NOT have the bug!  No problems at all so far.  
> 
> OK, that is definitely good to know. My other fix ("mm, vmscan:
> consider eligible zones in get_scan_count") was more theoretical than
> bug driven. I would add your
> Tested-by: Trevor Cordes 
> 
> unless you have anything against that.

I am happy to be in the tested-by; go ahead.

> > So I think whatever to_test/linus-tree/oom_hickups and since-4.9
> > have that vanilla 4.10-rc6 does *not* have is indeed the fix.
> > 
> > For my reference, and I know you guys aren't distro-specific, what
> > is the best way to get this fix into Fedora 24 (currently 4.9)?  
> 
> I will send this patch to 4.9+ stable as soon as it hits Linus tree.

That's great news!  It will make everyone on the rhbz happy.  Thank you!


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-02-05 Thread Michal Hocko
On Fri 03-02-17 18:36:54, Trevor Cordes wrote:
> On 2017-02-01 Michal Hocko wrote:
> > On Wed 01-02-17 03:29:28, Trevor Cordes wrote:
> > > On 2017-01-30 Michal Hocko wrote:  
> > [...]
> > > > Testing with vanilla rc6 released just yesterday would be a good
> > > > fit. There are some more fixes sitting in mmotm on top and maybe
> > > > we want some of them in final 4.10. Anyway all those pending
> > > > changes should be merged in the next merge window - aka 4.11
> > > 
> > > After 30 hours of running vanilla 4.10.0-rc6, the box started to go
> > > bonkers at 3am, so vanilla does not fix the bug :-(  But, the bug
> > > hit differently this time, the box just bogged down like crazy and
> > > gave really weird top output.  Starting nano would take 10s, then
> > > would run full speed, then when saving a file would take 5s.
> > > Starting any prog not in cache took equally as long.  
> > 
> > Could you try with to_test/linus-tree/oom_hickups branch on the same
> > git tree? I have cherry-picked "mm, vmscan: consider eligible zones in
> > get_scan_count" which might be the missing part.
> 
> I ran to_test/linus-tree/oom_hickups branch (4.10.0-rc6+) for 50 hours
> and it does NOT have the bug!  No problems at all so far.

OK, that is definitely good to know. My other fix ("mm, vmscan: consider
eligible zones in get_scan_count") was more theoretical than bug driven.
I would add your
Tested-by: Trevor Cordes 

unless you have anything against that.

> So I think whatever to_test/linus-tree/oom_hickups and since-4.9 have
> that vanilla 4.10-rc6 does *not* have is indeed the fix.
> 
> For my reference, and I know you guys aren't distro-specific, what is
> the best way to get this fix into Fedora 24 (currently 4.9)?

I will send this patch to 4.9+ stable as soon as it hits Linus tree.

-- 
Michal Hocko
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-02-04 Thread Rik van Riel
On Fri, 2017-02-03 at 18:36 -0600, Trevor Cordes wrote:
> On 2017-02-01 Michal Hocko wrote:
> > On Wed 01-02-17 03:29:28, Trevor Cordes wrote:
> > > On 2017-01-30 Michal Hocko wrote:  
> > 
> > [...]
> > > > Testing with vanilla rc6 released just yesterday would be a good
> > > > fit. There are some more fixes sitting in mmotm on top and maybe
> > > > we want some of them in final 4.10. Anyway all those pending
> > > > changes should be merged in the next merge window - aka 4.11
> > > 
> > > After 30 hours of running vanilla 4.10.0-rc6, the box started to go
> > > bonkers at 3am, so vanilla does not fix the bug :-(  But, the bug
> > > hit differently this time, the box just bogged down like crazy and
> > > gave really weird top output.  Starting nano would take 10s, then
> > > would run full speed, then when saving a file would take 5s.
> > > Starting any prog not in cache took equally as long.
> > 
> > Could you try with to_test/linus-tree/oom_hickups branch on the same
> > git tree? I have cherry-picked "mm, vmscan: consider eligible zones in
> > get_scan_count" which might be the missing part.
> 
> I ran to_test/linus-tree/oom_hickups branch (4.10.0-rc6+) for 50 hours
> and it does NOT have the bug!  No problems at all so far.
> 
> So I think whatever to_test/linus-tree/oom_hickups and since-4.9 have
> that vanilla 4.10-rc6 does *not* have is indeed the fix.
> 
> For my reference, and I know you guys aren't distro-specific, what is
> the best way to get this fix into Fedora 24 (currently 4.9)?  Can it be
> backported or made as a patch they can apply to 4.9?  Or 4.10?

The best way would be to open a Fedora bug, and CC me on it :)

-- 
All Rights Reversed.

signature.asc
Description: This is a digitally signed message part


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-02-03 Thread Trevor Cordes
On 2017-02-01 Michal Hocko wrote:
> On Wed 01-02-17 03:29:28, Trevor Cordes wrote:
> > On 2017-01-30 Michal Hocko wrote:  
> [...]
> > > Testing with vanilla rc6 released just yesterday would be a good
> > > fit. There are some more fixes sitting in mmotm on top and maybe
> > > we want some of them in final 4.10. Anyway all those pending
> > > changes should be merged in the next merge window - aka 4.11
> > 
> > After 30 hours of running vanilla 4.10.0-rc6, the box started to go
> > bonkers at 3am, so vanilla does not fix the bug :-(  But, the bug
> > hit differently this time, the box just bogged down like crazy and
> > gave really weird top output.  Starting nano would take 10s, then
> > would run full speed, then when saving a file would take 5s.
> > Starting any prog not in cache took equally as long.  
> 
> Could you try with to_test/linus-tree/oom_hickups branch on the same
> git tree? I have cherry-picked "mm, vmscan: consider eligible zones in
> get_scan_count" which might be the missing part.

I ran to_test/linus-tree/oom_hickups branch (4.10.0-rc6+) for 50 hours
and it does NOT have the bug!  No problems at all so far.

So I think whatever to_test/linus-tree/oom_hickups and since-4.9 have
that vanilla 4.10-rc6 does *not* have is indeed the fix.

For my reference, and I know you guys aren't distro-specific, what is
the best way to get this fix into Fedora 24 (currently 4.9)?  Can it be
backported or made as a patch they can apply to 4.9?  Or 4.10?  If this
fix only goes into 4.11 then I fear we'll never see it in Fedora, and we
rhbz guys will not have a stock-Fedora fix for this until F25 or F26.
Again, I'm not trying to force this out of scope, I'm just wondering
about the logistics in these situations.

Once again, thanks to all for your great work and help!  P.S. I'll try
a couple of the other ideas Mel had about ramping the RAM back up, etc.


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-02-01 Thread Michal Hocko
On Wed 01-02-17 03:29:28, Trevor Cordes wrote:
> On 2017-01-30 Michal Hocko wrote:
[...]
> > Testing with vanilla rc6 released just yesterday would be a good fit.
> > There are some more fixes sitting in mmotm on top and maybe we want
> > some of them in final 4.10. Anyway all those pending changes should
> > be merged in the next merge window - aka 4.11
> 
> After 30 hours of running vanilla 4.10.0-rc6, the box started to go
> bonkers at 3am, so vanilla does not fix the bug :-(  But, the bug hit
> differently this time, the box just bogged down like crazy and gave
> really weird top output.  Starting nano would take 10s, then would run
> full speed, then when saving a file would take 5s.  Starting any prog
> not in cache took equally as long.

Could you try with to_test/linus-tree/oom_hickups branch on the same git
tree? I have cherry-picked "mm, vmscan: consider eligible zones in
get_scan_count" which might be the missing part.

Thanks!
-- 
Michal Hocko
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-02-01 Thread Trevor Cordes
On 2017-01-30 Michal Hocko wrote:
> On Sun 29-01-17 16:50:03, Trevor Cordes wrote:
> > On 2017-01-25 Michal Hocko wrote:  
> > > On Wed 25-01-17 04:02:46, Trevor Cordes wrote:  
> > > > OK, I patched & compiled mhocko's git tree from the other day
> > > > 4.9.0+. (To confirm, weird, but mhocko's git tree I'm using
> > > > from a couple of weeks ago shows the newest commit (git log) is
> > > > 69973b830859bc6529a7a0468ba0d80ee5117826 "Linux 4.9"?  Let me
> > > > know if I'm doing something wrong, see below.)
> > > 
> > > My fault. I should have noted that you should use since-4.9
> > > branch.  
> > 
> > OK, I have good news.  I compiled your mhocko git tree (properly
> > this time!) using since-4.9 branch (last commit
> > ca63ff9b11f958efafd8c8fa60fda14baec6149c Jan 25) and the box
> > survived 3 3am's, over 60 hours, and I made sure all the usual oom
> > culprits ran, and I ran extras (finds on the whole tree, extra
> > rdiff-backups) to try to tax it.  Based on my previous criteria I
> > would say your since-4.9 as of the above commit solves my bug, at
> > least over a 3 day test span (which it never survives when the bug
> > is present)!
> > 
> > I tested WITHOUT any cgroup/mem boot options.  I do still have my
> > mem=6G limiter on, though (I've never tested with it off, until I
> > solve the bug with it on, since I've had it on for many months for
> > other reasons).  
> 
> Good news indeed.

Even better, another guy on the rhbz reported the mhocko git tree
since-4.9 solves the bug for him too!  And it ran another night (4+
total) without problems on my box.  Whatever is in since-4.9 fixes it,
as I reported before.

But...

> Testing with vanilla rc6 released just yesterday would be a good fit.
> There are some more fixes sitting in mmotm on top and maybe we want
> some of them in final 4.10. Anyway all those pending changes should
> be merged in the next merge window - aka 4.11

After 30 hours of running vanilla 4.10.0-rc6, the box started to go
bonkers at 3am, so vanilla does not fix the bug :-(  But, the bug hit
differently this time, the box just bogged down like crazy and gave
really weird top output.  Starting nano would take 10s, then would run
full speed, then when saving a file would take 5s.  Starting any prog
not in cache took equally as long.

However, no oom hit.  I waited about 15 minutes and things seemed to
bog more, so I rebooted into since-4.9.  Maybe if I had kept waiting
the box would have oom'd, but I didn't want to take the chance (it's
remote, and I can't reset it).

I did capture a lot of the weird top, meminfo and slabinfo data before
rebooting.  I'll attach the output to this email.  Messages show a
lot of "page allocation stalls" during the bogged-down time.

So my hunch at this moment is 4.10.0-rc6 might help alleviate the
problem somewhat, but it's other things you have in since-4.9 that
solve it completely.

Let me know if you need any more testing or some bisecting or
something.  I'll keep on running since-4.9 in the meantime.  Thanks!


4.10.rc6-bogged
Description: Binary data


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-30 Thread Mel Gorman
On Sun, Jan 29, 2017 at 04:50:03PM -0600, Trevor Cordes wrote:
> On 2017-01-25 Michal Hocko wrote:
> > On Wed 25-01-17 04:02:46, Trevor Cordes wrote:
> > > OK, I patched & compiled mhocko's git tree from the other day
> > > 4.9.0+. (To confirm, weird, but mhocko's git tree I'm using from a
> > > couple of weeks ago shows the newest commit (git log) is
> > > 69973b830859bc6529a7a0468ba0d80ee5117826 "Linux 4.9"?  Let me know
> > > if I'm doing something wrong, see below.)  
> > 
> > My fault. I should have noted that you should use since-4.9 branch.
> 
> OK, I have good news.  I compiled your mhocko git tree (properly this
> time!) using since-4.9 branch (last commit
> ca63ff9b11f958efafd8c8fa60fda14baec6149c Jan 25) and the box survived 3
> 3am's, over 60 hours, and I made sure all the usual oom culprits ran,
> and I ran extras (finds on the whole tree, extra rdiff-backups) to try
> to tax it.  Based on my previous criteria I would say your since-4.9 as
> of the above commit solves my bug, at least over a 3 day test span
> (which it never survives when the bug is present)!
> 

That's good news. It means the more extreme options may not be
necessary.

> I tested WITHOUT any cgroup/mem boot options.  I do still have my
> mem=6G limiter on, though (I've never tested with it off, until I solve
> the bug with it on, since I've had it on for many months for other
> reasons).
> 

It may be an option to try relaxing that and see at what point it fails.
You may find at some point that memory is not utilised as there is not
enough lowmem for metadata to track data in highmem. That's not unexpected.
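
As a rough sketch of why (assuming roughly 32 bytes of struct page per
4 KiB page; both figures are approximations):

  # lowmem consumed by the mem_map array alone for 6 GiB of RAM
  pages=$(( 6 * 1024 * 1024 * 1024 / 4096 ))   # ~1.57M struct pages
  echo "$(( pages * 32 / 1024 / 1024 )) MiB"   # ~48 MiB of lowmem

With lowmem capped near 896 MiB on i386, each additional gigabyte of
highmem costs roughly another 8 MiB of pinned lowmem before any
dentries, inodes or buffer heads are counted.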

> What do I test next?  Does the since-4.9 stuff get pushed into vanilla
> (4.9 hopefully?) so it can find its way into Fedora's stuck F24
> kernel?
> 

Michal has already made suggestions here and I've nothing to add.

> I want to also note that the RHBZ
> https://bugzilla.redhat.com/show_bug.cgi?id=1401012 is garnering more
> interest as more people start me-too'ing.  The situation is almost
> always the same: large rsync's or similar tree-scan accesses cause oom
> on PAE boxes.  However, I wanted to note that many people there reported
> that cgroup_disable=memory doesn't fix anything for them, whereas that
> always makes the problem go away on my boxes.  Strange.
> 

It could simply be down to whether memcgs were actually in use or not.
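
A quick way for reporters to check is the controller table (cgroup v1,
as on Fedora 24):

  # shows the memory controller's enabled flag and current cgroup count
  grep -E '^#subsys|^memory' /proc/cgroups

A num_cgroups of 1 would mean memcg is compiled in but essentially
idle, which might explain why cgroup_disable=memory helps on some
boxes and not others.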

> Thanks Michal and Mel, I really appreciate it!

I appreciate the detailed testing and reporting!

-- 
Mel Gorman
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-29 Thread Michal Hocko
On Sun 29-01-17 16:50:03, Trevor Cordes wrote:
> On 2017-01-25 Michal Hocko wrote:
> > On Wed 25-01-17 04:02:46, Trevor Cordes wrote:
> > > OK, I patched & compiled mhocko's git tree from the other day
> > > 4.9.0+. (To confirm, weird, but mhocko's git tree I'm using from a
> > > couple of weeks ago shows the newest commit (git log) is
> > > 69973b830859bc6529a7a0468ba0d80ee5117826 "Linux 4.9"?  Let me know
> > > if I'm doing something wrong, see below.)  
> > 
> > My fault. I should have noted that you should use since-4.9 branch.
> 
> OK, I have good news.  I compiled your mhocko git tree (properly this
> this time!) using since-4.9 branch (last commit
> ca63ff9b11f958efafd8c8fa60fda14baec6149c Jan 25) and the box survived 3
> 3am's, over 60 hours, and I made sure all the usual oom culprits ran,
> and I ran extras (finds on the whole tree, extra rdiff-backups) to try
> to tax it.  Based on my previous criteria I would say your since-4.9 as
> of the above commit solves my bug, at least over a 3 day test span
> (which it never survives when the bug is present)!
> 
> I tested WITHOUT any cgroup/mem boot options.  I do still have my
> mem=6G limiter on, though (I've never tested with it off, until I solve
> the bug with it on, since I've had it on for many months for other
> reasons).

Good news indeed.

> 
> On 2017-01-27 Michal Hocko wrote:
> > OK, that matches the theory that these OOMs are caused by the
> > incorrect active list aging fixed by b4536f0c829c ("mm, memcg: fix
> > the active list aging for lowmem requests when memcg is enabled")
> 
> b4536f0c829c isn't in the since-4.9 I tested above though?

Yes this is a sha1 from Linus tree. The same commit is in the since-4.9
branch under 0759e73ee689f2066a4d64dd90ec5cc3fed28f86. There are some
more fixes on top of course.
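
Because cherry-picks get a new sha1 on every branch, looking the patch
up by subject is the reliable check. A sketch, assuming your local
branch is named since-4.9:

  # the hash printed here will differ from the one in Linus' tree
  git log --oneline since-4.9 --grep='fix the active list aging'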

> So something else you did must have fixed it (also)?  I don't think I've
> run any tests yet with b4536f0c829c in them?  I think the vanillas I
> was doing a couple of weeks ago were before b4536f0c829c, but I can't
> be sure.
> 
> What do I test next?  Does the since-4.9 stuff get pushed into vanilla
> (4.9 hopefully?) so it can find its way into Fedora's stuck F24
> kernel?

Testing with vanilla rc6 released just yesterday would be a good fit.
There are some more fixes sitting in mmotm on top and maybe we want
some of them in final 4.10. Anyway all those pending changes should be
merged in the next merge window - aka 4.11.

> I want to also note that the RHBZ
> https://bugzilla.redhat.com/show_bug.cgi?id=1401012 is garnering more
> interest as more people start me-too'ing.  The situation is almost
> always the same: large rsync's or similar tree-scan accesses cause oom
> on PAE boxes.

I believe your instructions in comment 20 cover it nicely. If the
problem still persists with the current mmotm tree I would suggest
writing to the mailing list (feel free to CC me) and we will have a
look. Thanks!

> However, I wanted to note that many people there reported
> that cgroup_disable=memory doesn't fix anything for them, whereas that
> always makes the problem go away on my boxes.  Strange.
> 
> Thanks Michal and Mel, I really appreciate it!

-- 
Michal Hocko
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-29 Thread Trevor Cordes
On 2017-01-25 Michal Hocko wrote:
> On Wed 25-01-17 04:02:46, Trevor Cordes wrote:
> > OK, I patched & compiled mhocko's git tree from the other day
> > 4.9.0+. (To confirm, weird, but mhocko's git tree I'm using from a
> > couple of weeks ago shows the newest commit (git log) is
> > 69973b830859bc6529a7a0468ba0d80ee5117826 "Linux 4.9"?  Let me know
> > if I'm doing something wrong, see below.)  
> 
> My fault. I should have noted that you should use since-4.9 branch.

OK, I have good news.  I compiled your mhocko git tree (properly this
this time!) using since-4.9 branch (last commit
ca63ff9b11f958efafd8c8fa60fda14baec6149c Jan 25) and the box survived 3
3am's, over 60 hours, and I made sure all the usual oom culprits ran,
and I ran extras (finds on the whole tree, extra rdiff-backups) to try
to tax it.  Based on my previous criteria I would say your since-4.9 as
of the above commit solves my bug, at least over a 3 day test span
(which it never survives when the bug is present)!

I tested WITHOUT any cgroup/mem boot options.  I do still have my
mem=6G limiter on, though (I've never tested with it off, until I solve
the bug with it on, since I've had it on for many months for other
reasons).

On 2017-01-27 Michal Hocko wrote:
> OK, that matches the theory that these OOMs are caused by the
> incorrect active list aging fixed by b4536f0c829c ("mm, memcg: fix
> the active list aging for lowmem requests when memcg is enabled")

b4536f0c829c isn't in the since-4.9 I tested above though?  So
something else you did must have fixed it (also)?  I don't think I've
run any tests yet with b4536f0c829c in them?  I think the vanillas I
was doing a couple of weeks ago were before b4536f0c829c, but I can't
be sure.

What do I test next?  Does the since-4.9 stuff get pushed into vanilla
(4.9 hopefully?) so it can find its way into Fedora's stuck F24
kernel?

I want to also note that the RHBZ
https://bugzilla.redhat.com/show_bug.cgi?id=1401012 is garnering more
interest as more people start me-too'ing.  The situation is almost
always the same: large rsync's or similar tree-scan accesses cause oom
on PAE boxes.  However, I wanted to note that many people there reported
that cgroup_disable=memory doesn't fix anything for them, whereas that
always makes the problem go away on my boxes.  Strange.

Thanks Michal and Mel, I really appreciate it!


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-26 Thread Michal Hocko
On Thu 26-01-17 17:18:58, Trevor Cordes wrote:
> On 2017-01-24 Michal Hocko wrote:
> > On Sun 22-01-17 18:45:59, Trevor Cordes wrote:
> > [...]
> > > Also, completely separate from your patch I ran mhocko's 4.9 tree
> > > with mem=2G to see if lower ram amount would help, but it didn't.
> > > Even with 2G the system oom'd and hung the same as usual.  So far the
> > > only thing that helps at all was the cgroup_disable=memory option,
> > > which makes the problem disappear completely for me.  
> > 
> > OK, can we reduce the problem space slightly more and could you boot
> > with kmem accounting disabled? cgroup.memory=nokmem,nosocket
> 
> I ran for 30 hours with cgroup.memory=nokmem,nosocket using vanilla
> 4.9.0+ and it oom'd during a big rdiff-backup at 9am.  My script was
> able to reboot it before it hung.  Only one oom occurred before the
> reboot, which is a bit odd, usually there are 5-50.  See attached
> messages log (oom6).
> 
> So, still, only cgroup_disable=memory mitigates this bug (so far).  If
> you need me to test cgroup.memory=nokmem,nosocket with your since-4.9
> branch specifically, let me know and I'll add it to the to-test list.

OK, that matches the theory that these OOMs are caused by the incorrect
active list aging fixed by b4536f0c829c ("mm, memcg: fix the active list
aging for lowmem requests when memcg is enabled")
-- 
Michal Hocko
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-26 Thread Trevor Cordes
On 2017-01-24 Michal Hocko wrote:
> On Sun 22-01-17 18:45:59, Trevor Cordes wrote:
> [...]
> > Also, completely separate from your patch I ran mhocko's 4.9 tree
> > with mem=2G to see if lower ram amount would help, but it didn't.
> > Even with 2G the system oom'd and hung the same as usual.  So far the
> > only thing that helps at all was the cgroup_disable=memory option,
> > which makes the problem disappear completely for me.  
> 
> OK, can we reduce the problem space slightly more and could you boot
> with kmem accounting disabled? cgroup.memory=nokmem,nosocket

I ran for 30 hours with cgroup.memory=nokmem,nosocket using vanilla
4.9.0+ and it oom'd during a big rdiff-backup at 9am.  My script was
able to reboot it before it hung.  Only one oom occurred before the
reboot, which is a bit odd, usually there are 5-50.  See attached
messages log (oom6).

So, still, only cgroup_disable=memory mitigates this bug (so far).  If
you need me to test cgroup.memory=nokmem,nosocket with your since-4.9
branch specifically, let me know and I'll add it to the to-test list.
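
For the record, a sketch of how I queue such boot options on Fedora
(grubby is the stock tool there; the args are whatever is under test):

  # append the requested options to every installed kernel's cmdline
  grubby --update-kernel=ALL --args='cgroup.memory=nokmem,nosocket'
  # and drop them again after the test
  grubby --update-kernel=ALL --remove-args='cgroup.memory'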

On 2017-01-25 Michal Hocko wrote:
> On Wed 25-01-17 04:02:46, Trevor Cordes wrote:
> > OK, I patched & compiled mhocko's git tree from the other day
> > 4.9.0+. (To confirm, weird, but mhocko's git tree I'm using from a
> > couple of weeks ago shows the newest commit (git log) is
> > 69973b830859bc6529a7a0468ba0d80ee5117826 "Linux 4.9"?  Let me know
> > if I'm doing something wrong, see below.)  
> 
> My fault. I should have noted that you should use since-4.9 branch.

OK, I got it now, I'm retesting the runs I did (with/without the
various patches) on your git tree and will re-report the (correct)
results.  Will take a few days.  Thanks!


oom6
Description: Binary data


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-25 Thread Michal Hocko
On Wed 25-01-17 04:02:46, Trevor Cordes wrote:
> On 2017-01-23 Mel Gorman wrote:
> > On Sun, Jan 22, 2017 at 06:45:59PM -0600, Trevor Cordes wrote:
> > > On 2017-01-20 Mel Gorman wrote:  
> > > > > 
> > > > > Thanks for the OOM report. I was expecting it to be a particular
> > > > > shape and my expectations were not matched so it took time to
> > > > > consider it further. Can you try the cumulative patch below? It
> > > > > combines three patches that
> > > > > 
> > > > > 1. Allow slab shrinking even if the LRU pages are
> > > > > unreclaimable in direct reclaim
> > > > > 2. Shrinks slab once based on the contents of all memcgs
> > > > > instead of shrinking one at a time
> > > > > 3. Tries to shrink slabs if the lowmem usage is too high
> > > > > 
> > > > > Unfortunately it's only boot tested on x86-64 as I didn't get
> > > > > the chance to setup an i386 test bed.
> > > > > 
> > > > 
> > > > There was one major flaw in that patch. This version fixes it and
> > > > addresses other minor issues. It may still be too aggressive
> > > > shrinking slab but worth trying out. Thanks.  
> > > 
> > > I ran with your patch below and it oom'd on the first night.  It was
> > > weird, it didn't hang the system, and my rebooter script started a
> > > reboot but the system never got more than half down before it just
> > > sat there in a weird state where a local console user could still
> > > login but not much was working.  So the patches don't seem to solve
> > > the problem.
> > > 
> > > For the above compile I applied your patches to 4.10.0-rc4+, I hope
> > > that's ok.
> > >   
> > 
> > It would be strongly preferred to run them on top of Michal's other
> > fixes. The main reason it's preferred is because this OOM differs from
> > earlier ones in that it OOM killed from GFP_NOFS|__GFP_NOFAIL context.
> > That meant that the slab shrinking could not happen from direct
> > reclaim so the balancing from my patches would not occur.  As
> > Michal's other patches affect how kswapd behaves, it's important.
> 
> OK, I patched & compiled mhocko's git tree from the other day 4.9.0+.
> (To confirm, weird, but mhocko's git tree I'm using from a couple of
> weeks ago shows the newest commit (git log) is
> 69973b830859bc6529a7a0468ba0d80ee5117826 "Linux 4.9"?  Let me know if
> I'm doing something wrong, see below.)

My fault. I should have noted that you should use since-4.9 branch.
-- 
Michal Hocko
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-25 Thread Trevor Cordes
On 2017-01-23 Mel Gorman wrote:
> On Sun, Jan 22, 2017 at 06:45:59PM -0600, Trevor Cordes wrote:
> > On 2017-01-20 Mel Gorman wrote:  
> > > > 
> > > > Thanks for the OOM report. I was expecting it to be a particular
> > > > shape and my expectations were not matched so it took time to
> > > > consider it further. Can you try the cumulative patch below? It
> > > > combines three patches that
> > > > 
> > > > 1. Allow slab shrinking even if the LRU pages are
> > > > unreclaimable in direct reclaim
> > > > 2. Shrinks slab once based on the contents of all memcgs
> > > > instead of shrinking one at a time
> > > > 3. Tries to shrink slabs if the lowmem usage is too high
> > > > 
> > > > Unfortunately it's only boot tested on x86-64 as I didn't get
> > > > the chance to setup an i386 test bed.
> > > > 
> > > 
> > > There was one major flaw in that patch. This version fixes it and
> > > addresses other minor issues. It may still be too aggressive
> > > shrinking slab but worth trying out. Thanks.  
> > 
> > I ran with your patch below and it oom'd on the first night.  It was
> > weird, it didn't hang the system, and my rebooter script started a
> > reboot but the system never got more than half down before it just
> > sat there in a weird state where a local console user could still
> > login but not much was working.  So the patches don't seem to solve
> > the problem.
> > 
> > For the above compile I applied your patches to 4.10.0-rc4+, I hope
> > that's ok.
> >   
> 
> It would be strongly preferred to run them on top of Michal's other
> fixes. The main reason it's preferred is because this OOM differs from
> earlier ones in that it OOM killed from GFP_NOFS|__GFP_NOFAIL context.
> That meant that the slab shrinking could not happen from direct
> reclaim so the balancing from my patches would not occur.  As
> Michal's other patches affect how kswapd behaves, it's important.

OK, I patched & compiled mhocko's git tree from the other day 4.9.0+.
(To confirm, weird, but mhocko's git tree I'm using from a couple of
weeks ago shows the newest commit (git log) is
69973b830859bc6529a7a0468ba0d80ee5117826 "Linux 4.9"?  Let me know if
I'm doing something wrong, see below.)

Anyhow, it oom'd as usual at ~3am, system froze after 20 ooms hit in 7
secs.  So no help there.  Attached is the oom log from the first oom
hit.

On 2017-01-24 Michal Hocko wrote:
> On Sun 22-01-17 18:45:59, Trevor Cordes wrote:
> [...]
> > Also, completely separate from your patch I ran mhocko's 4.9 tree
> > with mem=2G to see if lower ram amount would help, but it didn't.
> > Even with 2G the system oom'd and hung the same as usual.  So far the
> > only thing that helps at all was the cgroup_disable=memory option,
> > which makes the problem disappear completely for me.  
> 
> OK, can we reduce the problem space slightly more and could you boot
> with kmem accounting disabled? cgroup.memory=nokmem,nosocket

I will try that right now, I'll use the mhocko git tree without Mel's
emailed patch, and I'll refresh the git tree from origin first (let me
know if that's a bad move).  As usual, I'll report back within 24-48 hours.

Actually, on my tests with mhocko git tree, I'm a bit confused and want
to make sure I'm compiling the right thing.  His tree doesn't seem to
have recent commits?  I did "git fetch origin" and "git reset --hard
origin/master" to refresh the tree just now and the latest commit is
still the one shown above "Linux 4.9"?  Is Michal making changes but
not committing?  How do I ensure I'm compiling the version you guys want
me to test?  ("git log mm/vmscan.c" shows newest commit is Dec 2??)  Am
I supposed to be testing a specific branch?
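
For my own notes, here is the minimal sequence I believe I should be
running, assuming the remote is named origin and since-4.9 is the right
branch:

  git fetch origin
  git checkout -B since-4.9 origin/since-4.9  # track the test branch
  git log --oneline -3     # HEAD should no longer be "Linux 4.9"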

If I've been testing the wrong branch, this *only* affects my mhocko
tree tests (not the vanilla or fedora-patched tests).  Thankfully I
think I've only done 1 or 2 mhocko tree tests, and I can easily redo
them.  If this turns out to be the case, I'm so sorry for the
confusion, the non-vanilla git tree thing is all new to me.

In any event, I'm still trying the above, and will adjust if necessary
if it's confirmed I'm doing something wrong with the mhocko git tree.
Thanks!


oom5
Description: Binary data


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-25 Thread Trevor Cordes
On 2017-01-23 Mel Gorman wrote:
> On Sun, Jan 22, 2017 at 06:45:59PM -0600, Trevor Cordes wrote:
> > On 2017-01-20 Mel Gorman wrote:  
> > > > 
> > > > Thanks for the OOM report. I was expecting it to be a particular
> > > > shape and my expectations were not matched so it took time to
> > > > consider it further. Can you try the cumulative patch below? It
> > > > combines three patches that
> > > > 
> > > > 1. Allow slab shrinking even if the LRU patches are
> > > > unreclaimable in direct reclaim
> > > > 2. Shrinks slab based once based on the contents of all memcgs
> > > > instead of shrinking one at a time
> > > > 3. Tries to shrink slabs if the lowmem usage is too high
> > > > 
> > > > Unfortunately it's only boot tested on x86-64 as I didn't get
> > > > the chance to setup an i386 test bed.
> > > > 
> > > 
> > > There was one major flaw in that patch. This version fixes it and
> > > addresses other minor issues. It may still be too agressive
> > > shrinking slab but worth trying out. Thanks.  
> > 
> > I ran with your patch below and it oom'd on the first night.  It was
> > weird, it didn't hang the system, and my rebooter script started a
> > reboot but the system never got more than half down before it just
> > sat there in a weird state where a local console user could still
> > login but not much was working.  So the patches don't seem to solve
> > the problem.
> > 
> > For the above compile I applied your patches to 4.10.0-rc4+, I hope
> > that's ok.
> >   
> 
> It would be strongly preferred to run them on top of Michal's other
> fixes. The main reason it's preferred is because this OOM differs from
> earlier ones in that it OOM killed from GFP_NOFS|__GFP_NOFAIL context.
> That meant that the slab shrinking could not happen from direct
> reclaim so the balancing from my patches would not occur.  As
> Michal's other patches affect how kswapd behaves, it's important.

OK, I patched & compiled mhocko's git tree from the other day 4.9.0+.
(To confirm: weird, but the mhocko git tree I'm using from a couple of
weeks ago shows the newest commit (git log) is
69973b830859bc6529a7a0468ba0d80ee5117826, "Linux 4.9"?  Let me know if
I'm doing something wrong; see below.)

Anyhow, it oom'd as usual at ~3am; the system froze after 20 ooms hit in
7 secs.  So no help there.  Attached is the oom log from the first oom
hit.

On 2017-01-24 Michal Hocko wrote:
> On Sun 22-01-17 18:45:59, Trevor Cordes wrote:
> [...]
> > Also, completely separate from your patch I ran mhocko's 4.9 tree
> > with mem=2G to see if lower ram amount would help, but it didn't.
> > Even with 2G the system oom and hung same as usual.  So far the
> > only thing that helps at all was the cgroup_disable=memory option,
> > which makes the problem disappear completely for me.  
> 
> OK, can we reduce the problem space slightly more and could you boot
> with kmem accounting disabled? cgroup.memory=nokmem,nosocket

I will try that right now, I'll use the mhocko git tree without Mel's
emailed patch, and I'll refresh the git tree from origin first (let me
know if that's a bad move).  As usual, I'll report back within 24-48 hours.

Actually, regarding my tests with the mhocko git tree, I'm a bit confused
and want to make sure I'm compiling the right thing.  His tree doesn't seem
to have recent commits?  I did "git fetch origin" and "git reset --hard
origin/master" to refresh the tree just now, and the latest commit is
still the one shown above, "Linux 4.9"?  Is Michal making changes but
not committing?  How do I ensure I'm compiling the version you guys want
me to test?  ("git log mm/vmscan.c" shows the newest commit is Dec 2??)  Am
I supposed to be testing a specific branch?

If I've been testing the wrong branch, this *only* affects my mhocko
tree tests (not the vanilla or fedora-patched tests).  Thankfully I
think I've only done 1 or 2 mhocko tree tests, and I can easily redo
them.  If this turns out to be the case, I'm so sorry for the
confusion; the non-vanilla git tree thing is all new to me.

In any event, I'm still trying the above, and will adjust if necessary
if it's confirmed I'm doing something wrong with the mhocko git tree.
Thanks!


[Attachment: oom5 (binary data)]


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-25 Thread Michal Hocko
On Mon 23-01-17 11:04:12, Mel Gorman wrote:
[...]
> 1. In should_reclaim_retry, account for SLAB_RECLAIMABLE as available
>    pages when deciding to retry reclaim

I am pretty sure I have considered this but then decided to not go that
way. I do not remember details so I will think about this some more. It
might have been just "let's wait for the real issue here". Anyway we can
give it a try and it would be as simple as
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 94ebd30d0f09..87221491be84 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3566,7 +3566,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 	unsigned long min_wmark = min_wmark_pages(zone);
 	bool wmark;
 
-	available = reclaimable = zone_reclaimable_pages(zone);
+	available = reclaimable = zone_reclaimable_pages(zone) + zone_page_state_snapshot(zone, NR_SLAB_RECLAIMABLE);
 	available -= DIV_ROUND_UP((*no_progress_loops) * available,
 				  MAX_RECLAIM_RETRIES);
 	available += zone_page_state_snapshot(zone, NR_FREE_PAGES);

I am not sure it would really help much on its own without further
changes to how we scale LRU->slab scanning. Could you give this a try
on top of the mmotm or linux-next tree?
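
To make the effect of that one-liner concrete, here is a toy userspace
model of the retry heuristic (a sketch only: MAX_RECLAIM_RETRIES and the
DIV_ROUND_UP back-off mirror the kernel code, the kB figures come from the
Jan 19 oom report quoted later in this thread, and everything else,
including the helper's name, is made up for illustration):

#include <stdio.h>

#define MAX_RECLAIM_RETRIES 16	/* as in mm/page_alloc.c */

/* Toy model of the "available" estimate in should_reclaim_retry():
 * start from the reclaimable pages, back off as retries fail, then
 * add the free pages. */
static unsigned long available_estimate(unsigned long reclaimable,
					unsigned long free_pages,
					unsigned long no_progress_loops)
{
	unsigned long avail = reclaimable;

	/* DIV_ROUND_UP(no_progress_loops * avail, MAX_RECLAIM_RETRIES) */
	avail -= (no_progress_loops * avail + MAX_RECLAIM_RETRIES - 1) /
		 MAX_RECLAIM_RETRIES;
	return avail + free_pages;
}

int main(void)
{
	/* 4kB pages; kB numbers from the Normal zone in the oom report */
	unsigned long lru = (193340 + 120 + 4 + 8) / 4;	/* file + anon LRUs */
	unsigned long slab_rec = 522292 / 4;		/* slab_reclaimable */
	unsigned long free = 3436 / 4;
	unsigned long min_wmark = 3532 / 4;

	for (unsigned long loops = 0; loops <= MAX_RECLAIM_RETRIES; loops += 4)
		printf("loops=%2lu  lru-only=%6lu  +slab=%6lu  min_wmark=%lu\n",
		       loops, available_estimate(lru, free, loops),
		       available_estimate(lru + slab_rec, free, loops),
		       min_wmark);
	return 0;
}

With the slab term included, the estimate stays far above the watermark
for many more retry loops, so the allocator keeps retrying reclaim
instead of declaring OOM while ~500MB of reclaimable slab still sits in
lowmem.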

> 2. Stall in should_reclaim_retry for __GFP_NOFAIL|__GFP_NOFS with a
>    comment stating that the intent is to allow kswapd to make progress
>    with the shrinker

The current mmotm tree doesn't need this because we no longer trigger the
oom killer for this combination of flags.
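
The rough shape of that mmotm behaviour, sketched here purely from the
description (the function name and flag bits below are illustrative, not
the kernel's real ones, and the actual patch differs in detail):

typedef unsigned int gfp_t;
#define __GFP_FS     (1u << 0)	/* illustrative bit, not the real value */
#define __GFP_NOFAIL (1u << 1)	/* illustrative bit, not the real value */

/* Keep retrying instead of invoking the OOM killer when reclaim was
 * FS-restricted and the caller must not fail. */
static int oom_killer_allowed(gfp_t gfp_mask)
{
	return !((gfp_mask & __GFP_NOFAIL) && !(gfp_mask & __GFP_FS));
}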

> 3. Stall __GFP_NOFS in direct reclaimer on a workqueue when it's
>    failing to make progress to allow kswapd to do some work. This
>    may be impaired if kswapd is locked up waiting for a lock held
>    by the direct reclaimer
> 4. Schedule the system workqueue to drain slab for
>    __GFP_NOFS|__GFP_NOFAIL.
> 
> 3 and 4 are extremely heavy handed so we should try them one at a time.

I am not even sure they are really necessary.
-- 
Michal Hocko
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-24 Thread Michal Hocko
On Mon 23-01-17 10:48:58, Mel Gorman wrote:
[...]
> Unfortunately, even that will be race prone for GFP_NOFS callers as
> they'll effectively be racing to see if kswapd or another direct
> reclaimer can reclaim before the OOM conditions are hit. It is by
> design, but it's apparent that a __GFP_NOFAIL request can trigger OOM
> relatively easily as it's not necessarily throttled or waiting on kswapd
> to complete any work. I'll keep thinking about it.

Yes, we shouldn't trigger the OOM for GFP_NOFS as the memory reclaim is
much weaker in that context. And that might really matter here. So the
mmotm tree will behave differently in this regard, as we have [1]

[1] http://lkml.kernel.org/r/20161220134904.21023-3-mho...@kernel.org
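
For readers wondering why GFP_NOFS reclaim is weaker: without __GFP_FS,
reclaim must not re-enter filesystem code, so dirty file pages cannot be
written back and have to be left in place. Condensed from the 4.9-era
shrink_page_list() (from memory; exact placement approximate):

	may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
		(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
	...
	if (PageDirty(page)) {
		...
		if (!may_enter_fs)
			goto keep_locked;	/* cannot write it back here */
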
-- 
Michal Hocko
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-24 Thread Michal Hocko
On Sun 22-01-17 18:45:59, Trevor Cordes wrote:
[...]
> Also, completely separate from your patch I ran mhocko's 4.9 tree with
> mem=2G to see if lower ram amount would help, but it didn't.  Even with
> 2G the system oom and hung same as usual.  So far the only thing that
> helps at all was the cgroup_disable=memory option, which makes the
> problem disappear completely for me.

OK, can we reduce the problem space slightly more and could you boot
with kmem accounting disabled? cgroup.memory=nokmem,nosocket
-- 
Michal Hocko
SUSE Labs
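
(For anyone reproducing this on a Fedora-style install: one way to set
that option, assuming the stock GRUB 2 layout, is to append
cgroup.memory=nokmem,nosocket to GRUB_CMDLINE_LINUX in /etc/default/grub,
run grub2-mkconfig -o /boot/grub2/grub.cfg, and reboot.)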


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-24 Thread Michal Hocko
On Fri 20-01-17 00:35:44, Trevor Cordes wrote:
> On 2017-01-19 Michal Hocko wrote:
> > On Thu 19-01-17 03:48:50, Trevor Cordes wrote:
> > > On 2017-01-17 Michal Hocko wrote:  
> > > > On Tue 17-01-17 14:21:14, Mel Gorman wrote:  
> > > > > On Tue, Jan 17, 2017 at 02:52:28PM +0100, Michal Hocko
> > > > > wrote:
> > > > > > On Mon 16-01-17 11:09:34, Mel Gorman wrote:
> > > > > > [...]
> > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > > index 532a2a750952..46aac487b89a 100644
> > > > > > > --- a/mm/vmscan.c
> > > > > > > +++ b/mm/vmscan.c
> > > > > > > @@ -2684,6 +2684,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> > > > > > > 		continue;
> > > > > > >  
> > > > > > > 	if (sc->priority != DEF_PRIORITY &&
> > > > > > > +	    !buffer_heads_over_limit &&
> > > > > > > 	    !pgdat_reclaimable(zone->zone_pgdat))
> > > > > > > 		continue;	/* Let kswapd poll it */
> > > > > > 
> > > > > > I think we should rather remove pgdat_reclaimable here. This
> > > > > > sounds like a wrong layer to decide whether we want to reclaim
> > > > > > and how much.   
> > > > > 
> > > > > I had considered that but it'd also be important to add the
> > > > > other 32-bit patches you have posted to see the impact. Because
> > > > > of the ratio of LRU pages to slab pages, it may not have an
> > > > > impact but it'd need to be eliminated.
> > > > 
> > > > OK, Trevor you can pull from
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git tree
> > > > fixes/highmem-node-fixes branch. This contains the current mmotm
> > > > tree
> > > > + the latest highmem fixes. I also do not expect this would help
> > > much in your case but as Mel has said we should rule that out at
> > > > least.  
> > > 
> > > Hi!  The git tree above version oom'd after < 24 hours (3:02am) so
> > > it doesn't solve the bug.  If you need an oom messages dump let me
> > > know.  
> > 
> > Yes please.
> 
> The first oom from that night attached.  Note, the oom wasn't as dire
> with your mhocko/4.9.0+ as it usually is with stock 4.8.x: my oom
> detector and reboot script was able to do its thing cleanly before the
> system became unusable.

Just for reference: this oom was due to a bug in the active LRU aging,
fixed in the Linus tree by commit b4536f0c829c ("mm, memcg: fix the active
list aging for lowmem requests when memcg is enabled"), merged for 4.10-rc4.

Jan 19 03:02:19 firewallfsi kernel: [85602.858232] Normal free:3436kB min:3532kB low:4412kB high:5292kB active_anon:4kB inactive_anon:8kB active_file:193340kB inactive_file:120kB unevictable:0kB writepending:2516kB present:892920kB managed:816932kB mlocked:0kB slab_reclaimable:522292kB slab_unreclaimable:46724kB kernel_stack:2560kB pagetables:0kB bounce:0kB free_pcp:3468kB local_pcp:176kB free_cma:0kB

Look at how all the reclaimable memory sits on active_file while the
inactive_file list is nearly empty...
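
In page terms (4kB pages, rounding 193340/4, 522292/4, 46724/4 and
816932/4 from the line above) that is roughly 48k pages of file LRU,
almost all of it active, against roughly 130k pages of reclaimable slab
plus 12k unreclaimable, in a zone of about 204k managed pages: well over
half of lowmem is slab, slab is more than twice the LRU, and the 120kB
inactive_file list gives page reclaim almost nothing it can actually take.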
-- 
Michal Hocko
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-23 Thread Mel Gorman
On Mon, Jan 23, 2017 at 10:48:58AM +, Mel Gorman wrote:
> On Sun, Jan 22, 2017 at 06:45:59PM -0600, Trevor Cordes wrote:
> > On 2017-01-20 Mel Gorman wrote:
> > > > 
> > > > Thanks for the OOM report. I was expecting it to be a particular
> > > > shape and my expectations were not matched so it took time to
> > > > consider it further. Can you try the cumulative patch below? It
> > > > combines three patches that
> > > > 
> > > > 1. Allow slab shrinking even if the LRU pages are unreclaimable in
> > > >direct reclaim
> > > > 2. Shrinks slab once based on the contents of all memcgs
> > > > instead of shrinking one at a time
> > > > 3. Tries to shrink slabs if the lowmem usage is too high
> > > > 
> > > > Unfortunately it's only boot tested on x86-64 as I didn't get the
> > > > chance to setup an i386 test bed.
> > > >   
> > > 
> > > There was one major flaw in that patch. This version fixes it and
> > > addresses other minor issues. It may still be too aggressive shrinking
> > > slab but worth trying out. Thanks.
> > 
> > I ran with your patch below and it oom'd on the first night.  It was
> > weird, it didn't hang the system, and my rebooter script started a
> > reboot but the system never got more than half down before it just sat
> > there in a weird state where a local console user could still login but
> > not much was working.  So the patches don't seem to solve the problem.
> > 
> > For the above compile I applied your patches to 4.10.0-rc4+, I hope
> > that's ok.
> > 
> 
> It would be strongly preferred to run them on top of Michal's other
> fixes. The main reason it's preferred is because this OOM differs from
> earlier ones in that it OOM killed from GFP_NOFS|__GFP_NOFAIL context.
> That meant that the slab shrinking could not happen from direct reclaim so
> the balancing from my patches would not occur.  As Michal's other patches
> affect how kswapd behaves, it's important.
> 
> Unfortunately, even that will be race prone for GFP_NOFS callers as
> they'll effectively be racing to see if kswapd or another direct
> reclaimer can reclaim before the OOM conditions are hit. It is by
> design, but it's apparent that a __GFP_NOFAIL request can trigger OOM
> relatively easily as it's not necessarily throttled or waiting on kswapd
> to complete any work. I'll keep thinking about it.
> 

As a slight follow-up, albeit without patches, further options are to:

1. In should_reclaim_retry, account for SLAB_RECLAIMABLE as available
   pages when deciding to retry reclaim
2. Stall in should_reclaim_retry for __GFP_NOFAIL|__GFP_NOFS with a
   comment stating that the intent is to allow kswapd to make progress
   with the shrinker
3. Stall __GFP_NOFS in direct reclaimer on a workqueue when it's
   failing to make progress to allow kswapd to do some work. This
   may be impaired if kswapd is locked up waiting for a lock held
   by the direct reclaimer
4. Schedule the system workqueue to drain slab for
   __GFP_NOFS|__GFP_NOFAIL.

3 and 4 are extremely heavy handed so we should try them one at a time.
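
A very rough sketch of what option 4 could look like, written against the
4.9-era mm/vmscan.c where shrink_slab() lives; the work item, the trigger
site and the pressure numbers are all invented for illustration:

/* Hypothetical work item: drain slab from a workqueue, where the worker
 * is not bound by the allocating task's GFP_NOFS context. */
static void drain_slab_workfn(struct work_struct *work)
{
	/* scan pressure values are illustrative only */
	shrink_slab(GFP_KERNEL, numa_node_id(), NULL,
		    SWAP_CLUSTER_MAX << 4, SWAP_CLUSTER_MAX << 4);
}
static DECLARE_WORK(drain_slab_work, drain_slab_workfn);

	/* ...then, from a __GFP_NOFS|__GFP_NOFAIL direct reclaimer that
	 * keeps failing to make progress: */
	schedule_work(&drain_slab_work);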

-- 
Mel Gorman
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-23 Thread Mel Gorman
On Sun, Jan 22, 2017 at 06:45:59PM -0600, Trevor Cordes wrote:
> On 2017-01-20 Mel Gorman wrote:
> > > 
> > > Thanks for the OOM report. I was expecting it to be a particular
> > > shape and my expectations were not matched so it took time to
> > > consider it further. Can you try the cumulative patch below? It
> > > combines three patches that
> > > 
> > > 1. Allow slab shrinking even if the LRU pages are unreclaimable in
> > >direct reclaim
> > > 2. Shrinks slab once based on the contents of all memcgs
> > > instead of shrinking one at a time
> > > 3. Tries to shrink slabs if the lowmem usage is too high
> > > 
> > > Unfortunately it's only boot tested on x86-64 as I didn't get the
> > > chance to setup an i386 test bed.
> > >   
> > 
> > There was one major flaw in that patch. This version fixes it and
> > addresses other minor issues. It may still be too aggressive shrinking
> > slab but worth trying out. Thanks.
> 
> I ran with your patch below and it oom'd on the first night.  It was
> weird, it didn't hang the system, and my rebooter script started a
> reboot but the system never got more than half down before it just sat
> there in a weird state where a local console user could still login but
> not much was working.  So the patches don't seem to solve the problem.
> 
> For the above compile I applied your patches to 4.10.0-rc4+, I hope
> that's ok.
> 

It would be strongly preferred to run them on top of Michal's other
fixes. The main reason it's preferred is because this OOM differs from
earlier ones in that it OOM killed from GFP_NOFS|__GFP_NOFAIL context.
That meant that the slab shrinking could not happen from direct reclaim so
the balancing from my patches would not occur.  As Michal's other patches
affect how kswapd behaves, it's important.

Unfortunately, even that will be race prone for GFP_NOFS callers as
they'll effectively be racing to see if kswapd or another direct
reclaimer can reclaim before the OOM conditions are hit. It is by
design, but it's apparent that a __GFP_NOFAIL request can trigger OOM
relatively easily as it's not necessarily throttled or waiting on kswapd
to complete any work. I'll keep thinking about it.

-- 
Mel Gorman
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-22 Thread Trevor Cordes
On 2017-01-20 Mel Gorman wrote:
> > 
> > Thanks for the OOM report. I was expecting it to be a particular
> > shape and my expectations were not matched so it took time to
> > consider it further. Can you try the cumulative patch below? It
> > combines three patches that
> > 
> > 1. Allow slab shrinking even if the LRU pages are unreclaimable in
> >direct reclaim
> > 2. Shrinks slab once based on the contents of all memcgs
> > instead of shrinking one at a time
> > 3. Tries to shrink slabs if the lowmem usage is too high
> > 
> > Unfortunately it's only boot tested on x86-64 as I didn't get the
> > chance to setup an i386 test bed.
> >   
> 
> There was one major flaw in that patch. This version fixes it and
> addresses other minor issues. It may still be too aggressive shrinking
> slab but worth trying out. Thanks.

I ran with your patch below and it oom'd on the first night.  It was
weird, it didn't hang the system, and my rebooter script started a
reboot but the system never got more than half down before it just sat
there in a weird state where a local console user could still login but
not much was working.  So the patches don't seem to solve the problem.

For the above compile I applied your patches to 4.10.0-rc4+, I hope
that's ok.

Attached is the first oom from that night.  I include some stuff below
the oom where the kernel is obviously having issues and dumping more
strange output.  I don't think I've seen that before.  That probably
explains the strange state it was left in.

Also, completely separate from your patch I ran mhocko's 4.9 tree with
mem=2G to see if lower ram amount would help, but it didn't.  Even with
2G the system oom and hung same as usual.  So far the only thing that
helps at all was the cgroup_disable=memory option, which makes the
problem disappear completely for me.  I added that option to 3 other
PAE boxes I admin, and that plus limiting ram to <4GB gets rid of the
bug.  However, on the RHBZ where I am commenting on this bug, someone
there reports that cgroup_disable=memory doesn't help him at all.

Hopefully the oom attached can help you figure out a next step.  Thanks!

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2281ad310d06..2c735ea24a85 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2318,6 +2318,59 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> 	}
>  }
>  
> +#ifdef CONFIG_HIGHMEM
> +static void balance_slab_lowmem(struct pglist_data *pgdat,
> +				struct scan_control *sc)
> +{
> +	unsigned long lru_pages = 0;
> +	unsigned long slab_pages = 0;
> +	unsigned long managed_pages = 0;
> +	int zid;
> +
> +	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> +		struct zone *zone = &pgdat->node_zones[zid];
> +
> +		if (!populated_zone(zone) || is_highmem_idx(zid))
> +			continue;
> +
> +		lru_pages += zone_page_state(zone, NR_ZONE_INACTIVE_FILE);
> +		lru_pages += zone_page_state(zone, NR_ZONE_ACTIVE_FILE);
> +		lru_pages += zone_page_state(zone, NR_ZONE_INACTIVE_ANON);
> +		lru_pages += zone_page_state(zone, NR_ZONE_ACTIVE_ANON);
> +		slab_pages += zone_page_state(zone, NR_SLAB_RECLAIMABLE);
> +		slab_pages += zone_page_state(zone, NR_SLAB_UNRECLAIMABLE);
> +	}
> +
> +	/* Do not balance until LRU and slab exceeds 50% of lowmem */
> +	if (lru_pages + slab_pages < (managed_pages >> 1))
> +		return;
> +
> +	/*
> +	 * Shrink reclaimable slabs if the number of lowmem slab pages is
> +	 * over twice the size of LRU pages. Apply pressure relative to
> +	 * the imbalance between LRU and slab pages.
> +	 */
> +	if (slab_pages > lru_pages << 1) {
> +		struct reclaim_state *reclaim_state = current->reclaim_state;
> +		unsigned long exceed = slab_pages - (lru_pages << 1);
> +		int nid = pgdat->node_id;
> +
> +		exceed = min(exceed, slab_pages);
> +		shrink_slab(sc->gfp_mask, nid, NULL, exceed >> 3, slab_pages);
> +		if (reclaim_state) {
> +			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> +			reclaim_state->reclaimed_slab = 0;
> +		}
> +	}
> +}
> +#else
> +static void balance_slab_lowmem(struct pglist_data *pgdat,
> +				struct scan_control *sc)
> +{
> +	return;
> +}
> +#endif
> +
>  /*
>   * This is a basic per-node page freer.  Used by both kswapd and direct reclaim.
>   */
> @@ -2336,6 +2389,27 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
>  
> 	get_scan_count(lruvec, memcg, sc, nr, lru_pages);
>  
> +	/*
> +	 * If direct reclaiming at elevated priority and the node is
> +	 * unreclaimable then skip LRU reclaim and let kswapd poll it.
> +	 */
> +	if (!current_is_kswapd() &&
> +	    sc->priority != DEF_PRIORITY

Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-20 Thread Mel Gorman
On Fri, Jan 20, 2017 at 11:02:32AM +, Mel Gorman wrote:
> On Fri, Jan 20, 2017 at 12:35:44AM -0600, Trevor Cordes wrote:
> > > > Hi!  The git tree above version oom'd after < 24 hours (3:02am) so
> > > > it doesn't solve the bug.  If you need an oom messages dump let me
> > > > know.  
> > > 
> > > Yes please.
> > 
> > The first oom from that night attached.  Note, the oom wasn't as dire
> > with your mhocko/4.9.0+ as it usually is with stock 4.8.x: my oom
> > detector and reboot script was able to do its thing cleanly before the
> > system became unusable.
> > 
> > I'll await further instructions and test right away.  Maybe I'll try a
> > few tuning ideas until then.  Thanks!
> > 
> 
> Thanks for the OOM report. I was expecting it to be a particular shape and
> my expectations were not matched so it took time to consider it further. Can
> you try the cumulative patch below? It combines three patches that
> 
> 1. Allow slab shrinking even if the LRU pages are unreclaimable in
>direct reclaim
> 2. Shrinks slab once based on the contents of all memcgs instead
>of shrinking one at a time
> 3. Tries to shrink slabs if the lowmem usage is too high
> 
> Unfortunately it's only boot tested on x86-64 as I didn't get the chance
> to setup an i386 test bed.
> 

There was one major flaw in that patch. This version fixes it and
addresses other minor issues. It may still be too aggressive shrinking
slab but worth trying out. Thanks.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2281ad310d06..2c735ea24a85 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2318,6 +2318,59 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	}
 }
 
+#ifdef CONFIG_HIGHMEM
+static void balance_slab_lowmem(struct pglist_data *pgdat,
+				struct scan_control *sc)
+{
+	unsigned long lru_pages = 0;
+	unsigned long slab_pages = 0;
+	unsigned long managed_pages = 0;
+	int zid;
+
+	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+		struct zone *zone = &pgdat->node_zones[zid];
+
+		if (!populated_zone(zone) || is_highmem_idx(zid))
+			continue;
+
+		lru_pages += zone_page_state(zone, NR_ZONE_INACTIVE_FILE);
+		lru_pages += zone_page_state(zone, NR_ZONE_ACTIVE_FILE);
+		lru_pages += zone_page_state(zone, NR_ZONE_INACTIVE_ANON);
+		lru_pages += zone_page_state(zone, NR_ZONE_ACTIVE_ANON);
+		slab_pages += zone_page_state(zone, NR_SLAB_RECLAIMABLE);
+		slab_pages += zone_page_state(zone, NR_SLAB_UNRECLAIMABLE);
+	}
+
+	/* Do not balance until LRU and slab exceeds 50% of lowmem */
+	if (lru_pages + slab_pages < (managed_pages >> 1))
+		return;
+
+	/*
+	 * Shrink reclaimable slabs if the number of lowmem slab pages is
+	 * over twice the size of LRU pages. Apply pressure relative to
+	 * the imbalance between LRU and slab pages.
+	 */
+	if (slab_pages > lru_pages << 1) {
+		struct reclaim_state *reclaim_state = current->reclaim_state;
+		unsigned long exceed = slab_pages - (lru_pages << 1);
+		int nid = pgdat->node_id;
+
+		exceed = min(exceed, slab_pages);
+		shrink_slab(sc->gfp_mask, nid, NULL, exceed >> 3, slab_pages);
+		if (reclaim_state) {
+			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+			reclaim_state->reclaimed_slab = 0;
+		}
+	}
+}
+#else
+static void balance_slab_lowmem(struct pglist_data *pgdat,
+				struct scan_control *sc)
+{
+	return;
+}
+#endif
+
 /*
  * This is a basic per-node page freer.  Used by both kswapd and direct reclaim.
  */
@@ -2336,6 +2389,27 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 
 	get_scan_count(lruvec, memcg, sc, nr, lru_pages);
 
+	/*
+	 * If direct reclaiming at elevated priority and the node is
+	 * unreclaimable then skip LRU reclaim and let kswapd poll it.
+	 */
+	if (!current_is_kswapd() &&
+	    sc->priority != DEF_PRIORITY &&
+	    !pgdat_reclaimable(pgdat)) {
+		unsigned long nr_scanned;
+
+		/*
+		 * Fake scanning so that slab shrinking will continue. For
+		 * lowmem restricted allocations, shrink aggressively.
+		 */
+		nr_scanned = SWAP_CLUSTER_MAX << (DEF_PRIORITY - sc->priority);
+		if (!(sc->gfp_mask & __GFP_HIGHMEM))
+			nr_scanned = max(nr_scanned, *lru_pages);
+		sc->nr_scanned += nr_scanned;
+
+		return;
+	}
+
 	/* Record the original scan target for proportional adjustments later */
 	memcpy(targets, nr, sizeof(nr));
 
@@ -2435,6 +2509,8 @@ static void shrink_node_memcg(struct pglist_data *pgdat,
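
(Worked example of the pressure calculation above, treating the Jan 19
oom report's numbers as representative: roughly 48k lowmem LRU pages
against 142k lowmem slab pages means slab_pages > lru_pages << 1 holds,
exceed = 142k - 97k = 45k, and exceed >> 3 passes a scan target of about
5,700 against an eligible pool of 142k to shrink_slab(), with the
pressure rising as the imbalance grows.)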

Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-20 Thread Mel Gorman
On Fri, Jan 20, 2017 at 12:35:44AM -0600, Trevor Cordes wrote:
> > > Hi!  The git tree above version oom'd after < 24 hours (3:02am) so
> > > it doesn't solve the bug.  If you need an oom messages dump let me
> > > know.  
> > 
> > Yes please.
> 
> The first oom from that night attached.  Note, the oom wasn't as dire
> with your mhocko/4.9.0+ as it usually is with stock 4.8.x: my oom
> detector and reboot script was able to do its thing cleanly before the
> system became unusable.
> 
> I'll await further instructions and test right away.  Maybe I'll try a
> few tuning ideas until then.  Thanks!
> 

Thanks for the OOM report. I was expecting it to be a particular shape and
my expectations were not matched so it took time to consider it further. Can
you try the cumulative patch below? It combines three patches that

1. Allow slab shrinking even if the LRU pages are unreclaimable in
   direct reclaim
2. Shrinks slab once based on the contents of all memcgs instead
   of shrinking one at a time
3. Tries to shrink slabs if the lowmem usage is too high

Unfortunately it's only boot tested on x86-64 as I didn't get the chance
to setup an i386 test bed.

> > This is why not only Linus hates 32b systems on large memory
> > systems.
> 
> Completely off-topic: it would be great if rather than pretending PAE
> should work with large RAM (which seems more broken every day), the
> kernel guys put out an officially stated policy of a maximum RAM you
> can use, and try to have the kernel behave for <= that size, and then
> people could use more RAM but clearly "at your own risk, don't bug us
> about problems!".  Other than a few posts about Linus hating it,
> there's nothing official I can find about it in documentation, etc.  It
> gives the (mis)impression that it's perfectly fine to run PAE on a
> zillion GB modern system.  Then we later learn the hard way :-)

The unfortunate reality is that the behaviour is workload dependent so
it's impossible to make a general statement other than "your mileage may
vary considerably".

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2281ad310d06..76d68a8872c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2318,6 +2318,52 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 	}
 }
 
+#ifdef CONFIG_HIGHMEM
+static void balance_slab_lowmem(struct pglist_data *pgdat,
+				struct scan_control *sc)
+{
+	unsigned long lru_pages = 0;
+	unsigned long slab_pages = 0;
+	int zid;
+
+	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+		struct zone *zone = &pgdat->node_zones[zid];
+
+		if (!populated_zone(zone) || !is_highmem_idx(zid))
+			continue;
+
+		lru_pages += zone_page_state(zone, NR_ZONE_INACTIVE_FILE);
+		lru_pages += zone_page_state(zone, NR_ZONE_ACTIVE_FILE);
+		slab_pages += zone_page_state(zone, NR_SLAB_RECLAIMABLE);
+		slab_pages += zone_page_state(zone, NR_SLAB_UNRECLAIMABLE);
+	}
+
+	/*
+	 * Shrink reclaimable slabs if the number of lowmem slab pages is
+	 * over twice the size of LRU pages. Apply pressure relative to
+	 * the imbalance between LRU and slab pages.
+	 */
+	if (slab_pages > lru_pages << 1) {
+		struct reclaim_state *reclaim_state = current->reclaim_state;
+		unsigned long exceed = (lru_pages << 1) - slab_pages;
+		int nid = pgdat->node_id;
+
+		exceed = min(exceed, slab_pages);
+		shrink_slab(sc->gfp_mask, nid, NULL, exceed, slab_pages);
+		if (reclaim_state) {
+			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
+			reclaim_state->reclaimed_slab = 0;
+		}
+	}
+}
+#else
+static void balance_slab_lowmem(struct pglist_data *pgdat,
+				struct scan_control *sc)
+{
+	return false;
+}
+#endif
+
 /*
  * This is a basic per-node page freer.  Used by both kswapd and direct reclaim.
  */
@@ -2336,6 +2382,27 @@ static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup *memc
 
 	get_scan_count(lruvec, memcg, sc, nr, lru_pages);
 
+	/*
+	 * If direct reclaiming at elevated priority and the node is
+	 * unreclaimable then skip LRU reclaim and let kswapd poll it.
+	 */
+	if (!current_is_kswapd() &&
+	    sc->priority != DEF_PRIORITY &&
+	    !pgdat_reclaimable(pgdat)) {
+		unsigned long nr_scanned;
+
+		/*
+		 * Fake scanning so that slab shrinking will continue. For
+		 * lowmem restricted allocations, shrink aggressively.
+		 */
+		nr_scanned = SWAP_CLUSTER_MAX << (DEF_PRIORITY - sc->priority);
+		if (!(sc->gfp_mask & __GFP_HIGHMEM))
+			nr_scanned = max(nr_scanned, *lru_pages);
+		sc->nr_scanned +=

Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-19 Thread Trevor Cordes
On 2017-01-19 Michal Hocko wrote:
> On Thu 19-01-17 03:48:50, Trevor Cordes wrote:
> > On 2017-01-17 Michal Hocko wrote:  
> > > On Tue 17-01-17 14:21:14, Mel Gorman wrote:  
> > > > On Tue, Jan 17, 2017 at 02:52:28PM +0100, Michal Hocko
> > > > wrote:
> > > > > On Mon 16-01-17 11:09:34, Mel Gorman wrote:
> > > > > [...]
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index 532a2a750952..46aac487b89a 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > > > @@ -2684,6 +2684,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> > > > > > 		continue;
> > > > > >  
> > > > > > 	if (sc->priority != DEF_PRIORITY &&
> > > > > > +	    !buffer_heads_over_limit &&
> > > > > > 	    !pgdat_reclaimable(zone->zone_pgdat))
> > > > > > 		continue;	/* Let kswapd poll it */
> > > > > 
> > > > > I think we should rather remove pgdat_reclaimable here. This
> > > > > sounds like a wrong layer to decide whether we want to reclaim
> > > > > and how much.   
> > > > 
> > > > I had considered that but it'd also be important to add the
> > > > other 32-bit patches you have posted to see the impact. Because
> > > > of the ratio of LRU pages to slab pages, it may not have an
> > > > impact but it'd need to be eliminated.
> > > 
> > > OK, Trevor you can pull from
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git tree
> > > fixes/highmem-node-fixes branch. This contains the current mmotm
> > > tree
> > > + the latest highmem fixes. I also do not expect this would help
> > > much in your case but as Mel has said we should rule that out at
> > > least.  
> > 
> > Hi!  The git tree above version oom'd after < 24 hours (3:02am) so
> > > it doesn't solve the bug.  If you need an oom messages dump let me
> > know.  
> 
> Yes please.

The first oom from that night attached.  Note, the oom wasn't as dire
with your mhocko/4.9.0+ as it usually is with stock 4.8.x: my oom
detector and reboot script was able to do its thing cleanly before the
system became unusable.

I'll await further instructions and test right away.  Maybe I'll try a
few tuning ideas until then.  Thanks!

> > Let me know what to try next, guys, and I'll test it out.
> >   
> > > > Before prototyping such a thing, I'd like to hear the outcome of
> > > > this heavy hack and then add your 32-bit patches onto the list.
> > > > If the problem is still there then I'd next look at taking slab
> > > > pages into account in pgdat_reclaimable() instead of an
> > > > outright removal that has a much wider impact. If that doesn't
> > > > work then I'll prototype a heavy-handed forced slab reclaim
> > > > when lower zones are almost all slab pages.  
> > 
> > I don't think I've tried the "heavy hack" patch yet?  It's not in
> > the mhocko tree I just tried?  Should I try the heavy hack on top
> > of mhocko git or on vanilla or what?
> > 
> > I also want to mention that these PAE boxes suffer from another
> > problem/bug that I've worked around for almost a year now.  For some
> > reason it keeps gnawing at me that it might be related.  The disk
> > I/O goes to pot on this/these PAE boxes after a certain amount of
> > disk writes (like some unknown number of GB, around 10-ish maybe).
> > Like writes go from 500MB/s to 10MB/s!! Reboot and it's magically
> > 500MB/s again.  I detail this here:
> > https://muug.ca/pipermail/roundtable/2016-June/004669.html
> > My fix was to boot with mem=XG where X is <8 (like 4 or 6) to force the PAE
> > kernel to be more sane about highmem choices.  I never filed a bug
> > because I read a ton of stuff saying Linus hates PAE, don't use over
> > 4G, blah blah.  But the other fix is to:
> > set /proc/sys/vm/highmem_is_dirtyable to 1  
> 
> Yes this sounds like a dirty memory throttling and there were some
> changes in that area. I do not remember when exactly.

I think my PAE-slow-IO bug started way back in Fedora 22 (4.0?); it's hard
to know exactly when, since I didn't discover the bug for maybe a year as I
didn't realize IO was the problem right away.  Too late to bisect that
one now.  I guess it's not related, so we can ignore my tangent!

> > I'm not bringing this up to get attention to a new bug, I bring
> > this up because it smells like it might be related.  If something
> > slowly eats away at the box's vm to the point that I/O gets
> > horribly slow, perhaps it's related to the slab and high/lomem
> > issue we have here?  And if related, it may help to solve the oom
> > bug.  If I'm way off base here, just ignore my tangent!  
> 
> From your OOM reports so far it doesn't really seem related because you
> never had a large number of pages under writeback when OOM.
> 
> The situation with the PAE kernel is unfortunate but it is really hard
> to do anything about that, considering that the kernel and most of its
> allocations have to live in the small and scarce lowmem. 


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-19 Thread Michal Hocko
On Thu 19-01-17 03:48:50, Trevor Cordes wrote:
> On 2017-01-17 Michal Hocko wrote:
> > On Tue 17-01-17 14:21:14, Mel Gorman wrote:
> > > On Tue, Jan 17, 2017 at 02:52:28PM +0100, Michal Hocko wrote:  
> > > > On Mon 16-01-17 11:09:34, Mel Gorman wrote:
> > > > [...]  
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index 532a2a750952..46aac487b89a 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -2684,6 +2684,7 @@ static void shrink_zones(struct zonelist
> > > > > *zonelist, struct scan_control *sc) continue;
> > > > >  
> > > > >   if (sc->priority != DEF_PRIORITY &&
> > > > > + !buffer_heads_over_limit &&
> > > > >   !pgdat_reclaimable(zone->zone_pgdat))
> > > > >   continue;   /* Let kswapd
> > > > > poll it */  
> > > > 
> > > > I think we should rather remove pgdat_reclaimable here. This
> > > > sounds like a wrong layer to decide whether we want to reclaim
> > > > and how much. 
> > > 
> > > I had considered that but it'd also be important to add the other
> > > 32-bit patches you have posted to see the impact. Because of the
> > > ratio of LRU pages to slab pages, it may not have an impact but
> > > it'd need to be eliminated.  
> > 
> > OK, Trevor you can pull from
> > git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git tree
> > fixes/highmem-node-fixes branch. This contains the current mmotm tree
> > + the latest highmem fixes. I also do not expect this would help much
> > in your case but as Mel has said we should rule that out at least.
> 
> Hi!  The git tree version above oom'd after < 24 hours (3:02am), so
> it doesn't solve the bug.  If you need an oom message dump, let me know.

Yes please.

> Let me know what to try next, guys, and I'll test it out.
> 
> > > Before prototyping such a thing, I'd like to hear the outcome of
> > > this heavy hack and then add your 32-bit patches onto the list. If
> > > the problem is still there then I'd next look at taking slab pages
> > > into account in pgdat_reclaimable() instead of an outright removal
> > > that has a much wider impact. If that doesn't work then I'll
> > > prototype a heavy-handed forced slab reclaim when lower zones are
> > > almost all slab pages.
> 
> I don't think I've tried the "heavy hack" patch yet?  It's not in the
> mhocko tree I just tried?  Should I try the heavy hack on top of mhocko
> git or on vanilla or what?
> 
> I also want to mention that these PAE boxes suffer from another
> problem/bug that I've worked around for almost a year now.  For some
> reason it keeps gnawing at me that it might be related.  The disk I/O
> goes to pot on this/these PAE boxes after a certain amount of disk
> writes (like some unknown number of GB, around 10-ish maybe).  Like
> writes go from 500MB/s to 10MB/s!! Reboot and it's magically 500MB/s
> again.  I detail this here:
> https://muug.ca/pipermail/roundtable/2016-June/004669.html
> My fix was to boot with mem=XG where X is <8 (like 4 or 6) to force the PAE
> kernel to be more sane about highmem choices.  I never filed a bug
> because I read a ton of stuff saying Linus hates PAE, don't use over
> 4G, blah blah.  But the other fix is to:
> set /proc/sys/vm/highmem_is_dirtyable to 1

Yes this sounds like a dirty memory throttling and there were some
changes in that area. I do not remember when exactly.

> I'm not bringing this up to get attention to a new bug, I bring this up
> because it smells like it might be related.  If something slowly eats
> away at the box's vm to the point that I/O gets horribly slow, perhaps
> it's related to the slab and high/lomem issue we have here?  And if
> related, it may help to solve the oom bug.  If I'm way off base here,
> just ignore my tangent!

From your OOM reports so far it doesn't really seem related because you
never had a large number of pages under writeback when OOM.

The situation with the PAE kernel is unfortunate but it is really hard
to do anything about that, considering that the kernel and most of its
allocations have to live in the small and scarce lowmem. Moreover, the
more memory you have, the more has to be allocated from that lowmem.

This is why not only Linus hates 32b systems on large-memory systems.
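
As a rough back-of-envelope sketch of that pressure (illustrative only,
assuming the usual ~896MB i386 lowmem split and ~32 bytes per struct page;
both figures are approximations that vary by config):

#include <stdio.h>

/* Illustrative only: how much of the ~896MB i386 lowmem the mem_map
 * alone eats as RAM grows; mem_map must live in lowmem, and every
 * physical page needs a struct page (assumed ~32 bytes here). */
int main(void)
{
        const unsigned long long lowmem_mb = 896;
        const unsigned long long page_size = 4096;
        const unsigned long long struct_page = 32;      /* assumed size */
        unsigned long long ram_gb;

        for (ram_gb = 1; ram_gb <= 8; ram_gb *= 2) {
                unsigned long long pages = (ram_gb << 30) / page_size;
                unsigned long long mem_map_mb = pages * struct_page >> 20;

                printf("%lluGB RAM: mem_map ~%lluMB of %lluMB lowmem\n",
                       ram_gb, mem_map_mb, lowmem_mb);
        }
        return 0;
}

With 8GB of RAM that is already ~64MB of lowmem gone before slab, kernel
stacks or page tables are counted.
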
-- 
Michal Hocko
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-19 Thread Trevor Cordes
On 2017-01-17 Michal Hocko wrote:
> On Tue 17-01-17 14:21:14, Mel Gorman wrote:
> > On Tue, Jan 17, 2017 at 02:52:28PM +0100, Michal Hocko wrote:  
> > > On Mon 16-01-17 11:09:34, Mel Gorman wrote:
> > > [...]  
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 532a2a750952..46aac487b89a 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2684,6 +2684,7 @@ static void shrink_zones(struct zonelist
> > > > *zonelist, struct scan_control *sc) continue;
> > > >  
> > > > if (sc->priority != DEF_PRIORITY &&
> > > > +   !buffer_heads_over_limit &&
> > > > !pgdat_reclaimable(zone->zone_pgdat))
> > > > continue;   /* Let kswapd
> > > > poll it */  
> > > 
> > > I think we should rather remove pgdat_reclaimable here. This
> > > sounds like a wrong layer to decide whether we want to reclaim
> > > and how much. 
> > 
> > I had considered that but it'd also be important to add the other
> > 32-bit patches you have posted to see the impact. Because of the
> > ratio of LRU pages to slab pages, it may not have an impact but
> > it'd need to be eliminated.  
> 
> OK, Trevor you can pull from
> git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git tree
> fixes/highmem-node-fixes branch. This contains the current mmotm tree
> + the latest highmem fixes. I also do not expect this would help much
> in your case but as Mel has said we should rule that out at least.

Hi!  The git tree version above oom'd after < 24 hours (3:02am), so
it doesn't solve the bug.  If you need an oom message dump, let me know.

Let me know what to try next, guys, and I'll test it out.

> > Before prototyping such a thing, I'd like to hear the outcome of
> > this heavy hack and then add your 32-bit patches onto the list. If
> > the problem is still there then I'd next look at taking slab pages
> > into account in pgdat_reclaimable() instead of an outright removal
> > that has a much wider impact. If that doesn't work then I'll
> > prototype a heavy-handed forced slab reclaim when lower zones are
> > almost all slab pages.

I don't think I've tried the "heavy hack" patch yet?  It's not in the
mhocko tree I just tried?  Should I try the heavy hack on top of mhocko
git or on vanilla or what?

I also want to mention that these PAE boxes suffer from another
problem/bug that I've worked around for almost a year now.  For some
reason it keeps gnawing at me that it might be related.  The disk I/O
goes to pot on this/these PAE boxes after a certain amount of disk
writes (like some unknown number of GB, around 10-ish maybe).  Like
writes go from 500MB/s to 10MB/s!! Reboot and it's magically 500MB/s
again.  I detail this here:
https://muug.ca/pipermail/roundtable/2016-June/004669.html
My fix was to boot with mem=XG where X is <8 (like 4 or 6) to force the PAE
kernel to be more sane about highmem choices.  I never filed a bug
because I read a ton of stuff saying Linus hates PAE, don't use over
4G, blah blah.  But the other fix is to:
set /proc/sys/vm/highmem_is_dirtyable to 1

I'm not bringing this up to get attention to a new bug, I bring this up
because it smells like it might be related.  If something slowly eats
away at the box's vm to the point that I/O gets horribly slow, perhaps
it's related to the slab and high/lomem issue we have here?  And if
related, it may help to solve the oom bug.  If I'm way off base here,
just ignore my tangent!

The funny thing is I thought mem=XG where X<8 solved the problem, but
it doesn't!  It greatly mitigates it, but I still get a subtle slowdown
that gets worse over time (like weeks instead of days).  I now use
highmem_is_dirtyable on most boxes and that seems to solve it for good
in combo with mem=XG.  Let me note, however, that I have NOT set
highmem_is_dirtyable=1 on the test box I am using for all of this
building/testing, as I wanted the config to stay static while I work
through this oom bug.  (I'm really curious to see if
highmem_is_dirtyable=1 would have any impact on the oom though!)
Thanks!
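
For reference, the runtime half of Trevor's workaround can be applied with
something like this minimal sketch (the sysctl is the one named above, but
it only exists on highmem-enabled kernels; mem=XG itself is a boot
parameter and cannot be changed at runtime):

#include <stdio.h>

/* Minimal sketch: flip the highmem_is_dirtyable knob described above.
 * Needs root; the file is only present on CONFIG_HIGHMEM kernels. */
int main(void)
{
        FILE *f = fopen("/proc/sys/vm/highmem_is_dirtyable", "w");

        if (!f) {
                perror("highmem_is_dirtyable");
                return 1;
        }
        fputs("1\n", f);
        return fclose(f) ? 1 : 0;
}

The same effect is usually had by writing 1 to that file from an init
script or sysctl configuration.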


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-18 Thread Mel Gorman
On Tue, Jan 17, 2017 at 03:54:51PM +0100, Michal Hocko wrote:
> On Tue 17-01-17 14:21:14, Mel Gorman wrote:
> > On Tue, Jan 17, 2017 at 02:52:28PM +0100, Michal Hocko wrote:
> > > On Mon 16-01-17 11:09:34, Mel Gorman wrote:
> > > [...]
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 532a2a750952..46aac487b89a 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2684,6 +2684,7 @@ static void shrink_zones(struct zonelist 
> > > > *zonelist, struct scan_control *sc)
> > > > continue;
> > > >  
> > > > if (sc->priority != DEF_PRIORITY &&
> > > > +   !buffer_heads_over_limit &&
> > > > !pgdat_reclaimable(zone->zone_pgdat))
> > > > continue;   /* Let kswapd poll it */
> > > 
> > > I think we should rather remove pgdat_reclaimable here. This sounds like
> > > a wrong layer to decide whether we want to reclaim and how much.
> > > 
> > 
> > I had considered that but it'd also be important to add the other 32-bit
> > patches you have posted to see the impact. Because of the ratio of LRU pages
> > to slab pages, it may not have an impact but it'd need to be eliminated.
> 
> OK, Trevor you can pull from
> git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git tree
> fixes/highmem-node-fixes branch. This contains the current mmotm tree +
> the latest highmem fixes. I also do not expect this would help much in
> your case but as Mel has said we should rule that out at least.
> 

After considering slab shrinking of lower nodes, it occurred to me
that your fixes may have a bigger impact than I believed this morning.
For lowmem-constrained allocations, we account for scans on the lower
zones but shrink proportionally to the LRU size for the entire node. If
the lower zones had few LRU pages and were mostly slab pages then the
proportional calculation would be way off so direct reclaim would barely
touch slab caches. That is fixed up by "mm, vmscan: consider eligible zones
in get_scan_count" so that the slab shrinking will be proportional to the
LRU pages on the lower zones.
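
To make the proportionality concrete, a simplified toy model of the scaling
described above (illustrative names and arithmetic, not the exact
mm/vmscan.c code):

/* Toy model: slab scanning is sized by the ratio of LRU pages scanned
 * to LRU pages considered eligible. If lru_eligible covers the whole
 * node while the lowmem zones being reclaimed hold almost no LRU
 * pages, the returned target collapses to ~0 no matter how much
 * reclaimable slab sits in lowmem. */
unsigned long slab_scan_target(unsigned long freeable_slab,
                               unsigned long lru_scanned,
                               unsigned long lru_eligible)
{
        unsigned long long delta;

        delta = (unsigned long long)freeable_slab * lru_scanned;
        return (unsigned long)(delta / (lru_eligible + 1)); /* +1 avoids /0 */
}

Restricting lru_eligible to the zones the allocation can actually use is
what makes the target meaningful again for lowmem-constrained reclaim.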

-- 
Mel Gorman
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-18 Thread Mel Gorman
On Tue, Jan 17, 2017 at 03:54:51PM +0100, Michal Hocko wrote:
> On Tue 17-01-17 14:21:14, Mel Gorman wrote:
> > On Tue, Jan 17, 2017 at 02:52:28PM +0100, Michal Hocko wrote:
> > > On Mon 16-01-17 11:09:34, Mel Gorman wrote:
> > > [...]
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 532a2a750952..46aac487b89a 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2684,6 +2684,7 @@ static void shrink_zones(struct zonelist 
> > > > *zonelist, struct scan_control *sc)
> > > > continue;
> > > >  
> > > > if (sc->priority != DEF_PRIORITY &&
> > > > +   !buffer_heads_over_limit &&
> > > > !pgdat_reclaimable(zone->zone_pgdat))
> > > > continue;   /* Let kswapd poll it */
> > > 
> > > I think we should rather remove pgdat_reclaimable here. This sounds like
> > > a wrong layer to decide whether we want to reclaim and how much.
> > > 
> > 
> > I had considered that but it'd also be important to add the other 32-bit
> > patches you have posted to see the impact. Because of the ratio of LRU pages
> > to slab pages, it may not have an impact but it'd need to be eliminated.
> 
> OK, Trevor you can pull from
> git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git tree
> fixes/highmem-node-fixes branch. This contains the current mmotm tree +
> the latest highmem fixes. I also do not expect this would help much in
> your case but as Mel've said we should rule that out at least.
> 

After considering slab shrinking of lower nodes, it occurs to me that your
fixes also impact slab shrinking. For lowmem-constrained allocations,
we accounted for scans on the lower zones but shrunk slabs proportionally to
the total LRU size. If the lower zones had few LRU pages and were mostly
slab pages then the proportional calculation would be way off. This may
have a bigger impact on Trevor Cordes' situation than I had imagined at
the start of today.

-- 
Mel Gorman
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-17 Thread Trevor Cordes
On 2017-01-17 Michal Hocko wrote:
> On Tue 17-01-17 14:21:14, Mel Gorman wrote:
> > On Tue, Jan 17, 2017 at 02:52:28PM +0100, Michal Hocko wrote:  
> > > On Mon 16-01-17 11:09:34, Mel Gorman wrote:
> > > [...]  
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 532a2a750952..46aac487b89a 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2684,6 +2684,7 @@ static void shrink_zones(struct zonelist
> > > > *zonelist, struct scan_control *sc) continue;
> > > >  
> > > > if (sc->priority != DEF_PRIORITY &&
> > > > +   !buffer_heads_over_limit &&
> > > > !pgdat_reclaimable(zone->zone_pgdat))
> > > > continue;   /* Let kswapd
> > > > poll it */  
> > > 
> > > I think we should rather remove pgdat_reclaimable here. This
> > > sounds like a wrong layer to decide whether we want to reclaim
> > > and how much. 
> > 
> > I had considered that but it'd also be important to add the other
> > 32-bit patches you have posted to see the impact. Because of the
> > ratio of LRU pages to slab pages, it may not have an impact but
> > it'd need to be eliminated.  
> 
> OK, Trevor you can pull from
> git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git tree
> fixes/highmem-node-fixes branch. This contains the current mmotm tree
> + the latest highmem fixes. I also do not expect this would help much
> in your case but as Mel has said we should rule that out at least.

OK, ignore my last question re: what to do next.  I am building
this mhocko git tree now per your above instructions and will reboot
into it in a few hours with*out* the cgroup_disable=memory option.
Might take ~50 hours for a result.

I should note that the box with the bug mostly works as a
file server and iptables firewall/router.  It routes around 8GB(ytes) a
day and handles periodic file-server loads.  That's about it.  Everything
else that is running is not doing much, and not using much RAM; except
maybe clamav, by far the biggest RAM user.

I don't see this bug on other nearly identical boxes, including:
F24 4.8.15 32-bit (no PAE) 1GB ram P4
F24 4.8.15 32-bit (no PAE) 2GB ram Core2 Quad

However, I just noticed for the first time today that one other box is
also seeing this bug (gets an oom message), though with much less
frequency: twice in 2 months since upgrading to 4.8.  However, it
recovers from the oom without a reboot and hasn't hung (yet).  That
could be because this box does not do as much file serving or I/O as
the one I've been building/testing on.  Also, this box is a much older
Pentium-D with 4GB (PAE on).  If it would be helpful to see its oom
log, let me know.  (Scanning all my boxes now, I also found a single oom
on yet another computer with the same story; but this is a Xeon
E3-1220 32-bit with PAE, 4GB.)

So far the commonality seems to be >2GB RAM and PAE on.  Might be
interesting to boot my build/test box with mem=2G and isolate it to
small RAM vs PAE.  "mem=2G" would make a great, easy, immediate
workaround for this problem for me (as cgroup_disable=memory also seems
to do, so far).  Thanks!


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-17 Thread Trevor Cordes
On 2017-01-16 Mel Gorman wrote:
> > > You can easily check whether this is memcg related by trying to
> > > run the same workload with cgroup_disable=memory kernel command
> > > line parameter. This will put all the memcg specifics out of the
> > > way.  
> > 
> > I will try booting now into cgroup_disable=memory to see if that
> > helps at all.  I'll reply back in 48 hours, or when it oom's,
> > whichever comes first.
> >   
> 
> Thanks.

It has successfully survived 70 hours and 2 3am cycles (when it
normally oom's) with your first patch *and* cgroup_disable=memory
grafted onto Fedora's 4.8.13.  Since it has never before survived 2 3am
cycles, I strongly suspect cgroup_disable=memory mitigates my bug.

> > Also, should I bother trying the latest git HEAD to see if that
> > solves anything?  Thanks!  
> 
> That's worth trying. If that also fails then could you try the
> following hack to encourage direct reclaim to reclaim slab when
> buffers are over the limit please?
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 532a2a750952..46aac487b89a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2684,6 +2684,7 @@ static void shrink_zones(struct zonelist
> *zonelist, struct scan_control *sc) continue;
>  
>   if (sc->priority != DEF_PRIORITY &&
> + !buffer_heads_over_limit &&
>   !pgdat_reclaimable(zone->zone_pgdat))
>   continue;   /* Let kswapd poll
> it */ 

What's the next best step?  HEAD?  HEAD + the above patch?  A new
patch?  I'll start a HEAD compile until I hear more.  I assume I should
test without cgroup_disable=memory as that's just a kludge/workaround,
right?

Also, is there a way to spot the slab pressure you are talking about
before oom's occur?  slabinfo?  I suppose I'd be able to see some
counter slowly getting too high or low?  Thanks!
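
On the monitoring question, one low-tech option is to log the slab counters
from /proc/meminfo ahead of the 3am window; a minimal sketch (SReclaimable
and SUnreclaim are standard fields and LowFree appears on highmem kernels,
but whether a slow climb actually predicts this oom is exactly what would
need checking):

#include <stdio.h>
#include <string.h>

/* Minimal sketch: print reclaimable/unreclaimable slab and free lowmem
 * so a cron job can log them over time. */
int main(void)
{
        char line[128];
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f)
                return 1;
        while (fgets(line, sizeof(line), f)) {
                if (!strncmp(line, "SReclaimable:", 13) ||
                    !strncmp(line, "SUnreclaim:", 11) ||
                    !strncmp(line, "LowFree:", 8))
                        fputs(line, stdout);
        }
        fclose(f);
        return 0;
}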


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-17 Thread Michal Hocko
On Tue 17-01-17 14:21:14, Mel Gorman wrote:
> On Tue, Jan 17, 2017 at 02:52:28PM +0100, Michal Hocko wrote:
> > On Mon 16-01-17 11:09:34, Mel Gorman wrote:
> > [...]
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 532a2a750952..46aac487b89a 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2684,6 +2684,7 @@ static void shrink_zones(struct zonelist *zonelist, 
> > > struct scan_control *sc)
> > >   continue;
> > >  
> > >   if (sc->priority != DEF_PRIORITY &&
> > > + !buffer_heads_over_limit &&
> > >   !pgdat_reclaimable(zone->zone_pgdat))
> > >   continue;   /* Let kswapd poll it */
> > 
> > I think we should rather remove pgdat_reclaimable here. This sounds like
> > a wrong layer to decide whether we want to reclaim and how much.
> > 
> 
> I had considered that but it'd also be important to add the other 32-bit
> patches you have posted to see the impact. Because of the ratio of LRU pages
> to slab pages, it may not have an impact but it'd need to be eliminated.

OK, Trevor you can pull from
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git tree
fixes/highmem-node-fixes branch. This contains the current mmotm tree +
the latest highmem fixes. I also do not expect this would help much in
your case but as Mel has said we should rule that out at least.

> Right now, I don't either other than a heavy-handed approach of checking if
> a) it's a pgdat with a highmem node

I do not think this is the right approach because we have a similar
problem even without highmem. I have already seen cases where
slab memory has eaten the whole DMA32 zone.

> b) if the ratio of LRU pages to slab
> pages on the lower zones is out of whack and if so, ignore nr_scanned for
> the slab shrinker.

this sounds much more promising.

> Before prototyping such a thing, I'd like to hear the outcome of this
> heavy hack and then add your 32-bit patches onto the list. If the problem
> is still there then I'd next look at taking slab pages into account in
> pgdat_reclaimable() instead of an outright removal that has a much wider
> impact. If that doesn't work then I'll prototype a heavy-handed forced
> slab reclaim when lower zones are almost all slab pages.

I would be really curious to hear whether removing pgdat_reclaimable
causes any bad side effects. It just smells wrong from a high-level point
of view. Besides that, I really _hate_ pgdat_reclaimable for any decision
making. It just behaves very randomly... I do not expect it to help much in
this case, though, as the highmem can easily bias the decision.
-- 
Michal Hocko
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-17 Thread Mel Gorman
On Tue, Jan 17, 2017 at 02:52:28PM +0100, Michal Hocko wrote:
> On Mon 16-01-17 11:09:34, Mel Gorman wrote:
> [...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 532a2a750952..46aac487b89a 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2684,6 +2684,7 @@ static void shrink_zones(struct zonelist *zonelist, 
> > struct scan_control *sc)
> > continue;
> >  
> > if (sc->priority != DEF_PRIORITY &&
> > +   !buffer_heads_over_limit &&
> > !pgdat_reclaimable(zone->zone_pgdat))
> > continue;   /* Let kswapd poll it */
> 
> I think we should rather remove pgdat_reclaimable here. This sounds like
> a wrong layer to decide whether we want to reclaim and how much.
> 

I had considered that but it'd also be important to add the other 32-bit
patches you have posted to see the impact. Because of the ratio of LRU pages
to slab pages, it may not have an impact but it'd need to be eliminated.

> But even that won't help very much, I am afraid. As I've noted in the
> other response, as long as we scale the slab shrinking based on
> nr_scanned we will have a problem with situations where slab outnumbers
> the LRU lists too much. I do not have a good idea how to fix that, though...
> 

Right now, I don't either, other than a heavy-handed approach of checking if
a) it's a pgdat with a highmem node b) if the ratio of LRU pages to slab
pages on the lower zones is out of whack and if so, ignore nr_scanned for
the slab shrinker.

Before prototyping such a thing, I'd like to hear the outcome of this
heavy hack and then add your 32-bit patches onto the list. If the problem
is still there then I'd next look at taking slab pages into account in
pgdat_reclaimable() instead of an outright removal that has a much wider
impact. If that doesn't work then I'll prototype a heavy-handed forced
slab reclaim when lower zones are almost all slab pages.

-- 
Mel Gorman
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-17 Thread Michal Hocko
On Mon 16-01-17 11:09:34, Mel Gorman wrote:
[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 532a2a750952..46aac487b89a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2684,6 +2684,7 @@ static void shrink_zones(struct zonelist *zonelist, 
> struct scan_control *sc)
>   continue;
>  
>   if (sc->priority != DEF_PRIORITY &&
> + !buffer_heads_over_limit &&
>   !pgdat_reclaimable(zone->zone_pgdat))
>   continue;   /* Let kswapd poll it */

I think we should rather remove pgdat_reclaimable here. This sounds like
a wrong layer to decide whether we want to reclaim and how much.

But even that won't help very much, I am afraid. As I've noted in the
other response, as long as we scale the slab shrinking based on
nr_scanned we will have a problem with situations where slab outnumbers
the LRU lists too much. I do not have a good idea how to fix that, though...

-- 
Michal Hocko
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-17 Thread Michal Hocko
On Sun 15-01-17 00:27:52, Trevor Cordes wrote:
> On 2017-01-12 Michal Hocko wrote:
> > On Wed 11-01-17 16:52:32, Trevor Cordes wrote:
> > [...]
> > > I'm not sure how I can tell if my bug is because of memcgs so here
> > > is a full first oom example (attached).  
> > 
> > 4.7 kernel doesn't contain 71c799f4982d ("mm: add per-zone lru list
> > stat") so the OOM report will not tell us whether the Normal zone
> > doesn't age active lists, unfortunatelly.
> 
> I compiled the patch Mel provided into the stock F23 kernel
> 4.8.13-100.fc23.i686+PAE and it ran for 2 nights.  It didn't oom the
> first night, but did the second night.  So the bug persists even with
> that patch.  However, it does *seem* a bit "better" since it took 2
> nights (usually takes only one, but maybe 10% of the time it does take
> two) before oom'ing, *and* it allowed my reboot script to reboot it
> cleanly when it saw the oom (which happens only 25% of the time).
> 
> I'm attaching the 4.8.13 oom message which should have the memcg info
> (71c799f4982d) you are asking for above?

It doesn't have the memcg info, which is not part of the current
vanilla kernel output either. But we have the per-zone LRU counters, which
is what I was after. So you have the correct patch. Sorry if I confused you.

>  Hopefully?

[167409.074463] nmbd invoked oom-killer: 
gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=1, oom_score_adj=0

again lowmem request

[...]
[167409.074576] Normal free:3484kB min:3544kB low:4428kB high:5312kB 
active_anon:0kB inactive_anon:0kB active_file:3412kB inactive_file:1560kB 
unevictable:0kB writepending:0kB present:892920kB managed:815216kB mlocked:0kB 
slab_reclaimable:711068kB slab_unreclaimable:49496kB kernel_stack:2904kB 
pagetables:0kB bounce:0kB free_pcp:240kB local_pcp:120kB free_cma:0kB

But have a look here. There are basically no pages on the Normal zone
LRU list. There is a huge amount of slab allocated here, but we are not
able to reclaim it because we scale the slab reclaim based on the LRU
reclaim. This is an inherent problem of the current design and we should
address it. It is nothing really new. We just didn't have many users
affected because having a majority of memory consumed by SLAB is not a
usual situation. It seems you just hit a more aggressive slab user with
newer kernels.
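
Plugging the Normal zone numbers from that report into this observation (a
back-of-envelope check, values in kB):

#include <stdio.h>

/* From the oom report above: reclaimable slab dwarfs the Normal zone's
 * file LRU by two orders of magnitude, so LRU-proportional slab
 * scanning can never catch up on this zone. */
int main(void)
{
        const unsigned long slab_reclaimable = 711068;  /* kB */
        const unsigned long lru_file = 3412 + 1560;     /* active+inactive file, kB */

        printf("reclaimable slab is ~%lux the file LRU\n",
               slab_reclaimable / lru_file);            /* ~143x */
        return 0;
}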

Using the 32b kernel really makes all this worse because all those
allocations go to the Normal and DMA zones, which pushes LRU pages out
of those zones.

> > You can easily check whether this is memcg related by trying to run
> > the same workload with cgroup_disable=memory kernel command line
> > parameter. This will put all the memcg specifics out of the way.
> 
> I will try booting now into cgroup_disable=memory to see if that helps
> at all.  I'll reply back in 48 hours, or when it oom's, whichever comes
> first.

Most probably this will not help.
 
> Also, should I bother trying the latest git HEAD to see if that solves
> anything?  Thanks!

It might help wrt. the slab consumers, but there is nothing that I would
consider a fix for the general slab-shrinking problem, I am
afraid.

-- 
Michal Hocko
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-16 Thread Mel Gorman
On Sun, Jan 15, 2017 at 12:27:52AM -0600, Trevor Cordes wrote:
> On 2017-01-12 Michal Hocko wrote:
> > On Wed 11-01-17 16:52:32, Trevor Cordes wrote:
> > [...]
> > > I'm not sure how I can tell if my bug is because of memcgs so here
> > > is a full first oom example (attached).  
> > 
> > 4.7 kernel doesn't contain 71c799f4982d ("mm: add per-zone lru list
> > stat") so the OOM report will not tell us whether the Normal zone
> > doesn't age active lists, unfortunately.
> 
> I compiled the patch Mel provided into the stock F23 kernel
> 4.8.13-100.fc23.i686+PAE and it ran for 2 nights.  It didn't oom the
> first night, but did the second night.  So the bug persists even with
> that patch.  However, it does *seem* a bit "better" since it took 2
> nights (usually takes only one, but maybe 10% of the time it does take
> two) before oom'ing, *and* it allowed my reboot script to reboot it
> cleanly when it saw the oom (which happens only 25% of the time).
> 
> I'm attaching the 4.8.13 oom message which should have the memcg info
> (71c799f4982d) you are asking for above?  Hopefully?
> 

It shows that there is an extremely large number of reclaimable slab
pages in the lower zones. Other pages have been reclaimed as normal,
but the failure to reclaim slab pages causes a high-order allocation to
fail.

> > You can easily check whether this is memcg related by trying to run
> > the same workload with cgroup_disable=memory kernel command line
> > parameter. This will put all the memcg specifics out of the way.
> 
> I will try booting now into cgroup_disable=memory to see if that helps
> at all.  I'll reply back in 48 hours, or when it oom's, whichever comes
> first.
> 

Thanks.

> Also, should I bother trying the latest git HEAD to see if that solves
> anything?  Thanks!

That's worth trying. If that also fails then could you try the following
hack to encourage direct reclaim to reclaim slab when buffers are over
the limit please?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 532a2a750952..46aac487b89a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2684,6 +2684,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 				continue;
 
 			if (sc->priority != DEF_PRIORITY &&
+			    !buffer_heads_over_limit &&
 			    !pgdat_reclaimable(zone->zone_pgdat))
 				continue;	/* Let kswapd poll it */
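
For context: buffer_heads_over_limit is a global flag from fs/buffer.c,
raised (roughly paraphrased below, not the verbatim source) once the
buffer_head count crosses a boot-time cap sized at about 10% of lowmem;
the hunk above keeps direct reclaim scanning zones, instead of leaving
them to kswapd, while that condition holds:

	int buffer_heads_over_limit;		/* read by vmscan */
	static unsigned long max_buffer_heads;	/* ~10% of lowmem worth */

	/* paraphrase: the kernel sums per-cpu counts before this check */
	static void recalc_bh_state(unsigned long nr_buffer_heads)
	{
		buffer_heads_over_limit = (nr_buffer_heads > max_buffer_heads);
	}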
 

-- 
Mel Gorman
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-14 Thread Trevor Cordes
On 2017-01-12 Michal Hocko wrote:
> On Wed 11-01-17 16:52:32, Trevor Cordes wrote:
> [...]
> > I'm not sure how I can tell if my bug is because of memcgs so here
> > is a full first oom example (attached).  
> 
> 4.7 kernel doesn't contain 71c799f4982d ("mm: add per-zone lru list
> stat") so the OOM report will not tell us whether the Normal zone
> doesn't age active lists, unfortunately.

I compiled the patch Mel provided into the stock F23 kernel
4.8.13-100.fc23.i686+PAE and it ran for 2 nights.  It didn't oom the
first night, but did the second night.  So the bug persists even with
that patch.  However, it does *seem* a bit "better" since it took 2
nights (usually takes only one, but maybe 10% of the time it does take
two) before oom'ing, *and* it allowed my reboot script to reboot it
cleanly when it saw the oom (which happens only 25% of the time).

I'm attaching the 4.8.13 oom message which should have the memcg info
(71c799f4982d) you are asking for above?  Hopefully?

> You can easily check whether this is memcg related by trying to run
> the same workload with cgroup_disable=memory kernel command line
> parameter. This will put all the memcg specifics out of the way.

I will try booting now into cgroup_disable=memory to see if that helps
at all.  I'll reply back in 48 hours, or when it oom's, whichever comes
first.

Also, should I bother trying the latest git HEAD to see if that solves
anything?  Thanks!


oom2
Description: Binary data


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-12 Thread Michal Hocko
On Wed 11-01-17 16:52:32, Trevor Cordes wrote:
[...]
> I'm not sure how I can tell if my bug is because of memcgs so here is
> a full first oom example (attached).

4.7 kernel doesn't contain 71c799f4982d ("mm: add per-zone lru list
stat") so the OOM report will not tell us whether the Normal zone
doesn't age active lists, unfortunately.

You can easily check whether this is memcg related by trying to run the
same workload with cgroup_disable=memory kernel command line parameter.
This will put all the memcg specifics out of the way.
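
For reference, the parameter works by marking the memory controller
disabled at boot, and memcg-aware code then short-circuits through
mem_cgroup_disabled(). A minimal sketch, close to the 4.8-era
include/linux/memcontrol.h but not guaranteed verbatim:

	static inline bool mem_cgroup_disabled(void)
	{
		return !cgroup_subsys_enabled(memory_cgrp_subsys);
	}

	/*
	 * Reclaim paths test this before touching per-memcg state, so
	 * booting with cgroup_disable=memory makes them fall back to
	 * the plain global zone/node counters.
	 */
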
-- 
Michal Hocko
SUSE Labs


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-11 Thread Trevor Cordes
On 2017-01-11 Mel Gorman wrote:
> On Wed, Jan 11, 2017 at 12:11:46PM +, Mel Gorman wrote:
> > On Wed, Jan 11, 2017 at 04:32:43AM -0600, Trevor Cordes wrote:  
> > > Hi!  I have bisected a nightly oom-killer flood and crash/hang on
> > > one of the boxes I admin.  It doesn't crash on Fedora 23/24
> > > 4.7.10 kernel but does on any 4.8 Fedora kernel.  I did a vanilla
> > > bisect and the bug is here:
> > > 
> > > commit b2e18757f2c9d1cdd746a882e9878852fdec9501
> > > Author: Mel Gorman 
> > > Date:   Thu Jul 28 15:45:37 2016 -0700
> > > 
> > > mm, vmscan: begin reclaiming pages on a per-node basis
> > >   
> > 
> > Michal Hocko recently worked on a bug similar to this. Can you test
> > the following patch that is currently queued in Andrew Morton's
> > tree? It applies cleanly to 4.9
> >   
> 
> I should have pointed out that this patch primarily affects memcg but
> the bug report did not include an OOM report and did not describe
> whether memcgs could be involved or not. If memcgs are not involved
> then please post the first full OOM kill.

I will apply your patch tonight and it will take 48 hours to confirm
that it is "good" (<24 hours if it's bad), and I will reply back.

I'm not sure how I can tell if my bug is because of memcgs so here is
a full first oom example (attached).

Thanks for the help!


oom-example
Description: Binary data


Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-11 Thread Mel Gorman
On Wed, Jan 11, 2017 at 12:11:46PM +, Mel Gorman wrote:
> On Wed, Jan 11, 2017 at 04:32:43AM -0600, Trevor Cordes wrote:
> Hi!  I have bisected a nightly oom-killer flood and crash/hang on one of 
> > the boxes I admin.  It doesn't crash on Fedora 23/24 4.7.10 kernel but 
> > does on any 4.8 Fedora kernel.  I did a vanilla bisect and the bug is 
> > here:
> > 
> > commit b2e18757f2c9d1cdd746a882e9878852fdec9501
> > Author: Mel Gorman 
> > Date:   Thu Jul 28 15:45:37 2016 -0700
> > 
> > mm, vmscan: begin reclaiming pages on a per-node basis
> > 
> 
> Michal Hocko recently worked on a bug similar to this. Can you test the
> following patch that is currently queued in Andrew Morton's tree? It
> applies cleanly to 4.9
> 

I should have pointed out that this patch primarily affects memcg but
the bug report did not include an OOM report and did not describe
whether memcgs could be involved or not. If memcgs are not involved then
please post the first full OOM kill.

> Thanks.
> 
> From: Michal Hocko 
> Subject: mm, memcg: fix the active list aging for lowmem requests when memcg 
> is enabled
> 
> Nils Holland and Klaus Ethgen have reported unexpected OOM killer
> invocations with 32b kernel starting with 4.8 kernels
> 
>   kworker/u4:5 invoked oom-killer: 
> gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, 
> oom_score_adj=0
>   kworker/u4:5 cpuset=/ mems_allowed=0
>   CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
>   [...]
>   Mem-Info:
>   active_anon:58685 inactive_anon:90 isolated_anon:0
>active_file:274324 inactive_file:281962 isolated_file:0
>unevictable:0 dirty:649 writeback:0 unstable:0
>slab_reclaimable:40662 slab_unreclaimable:17754
>mapped:7382 shmem:202 pagetables:351 bounce:0
>free:206736 free_pcp:332 free_cma:0
>   Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB 
> inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
> mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB 
> shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB 
> pages_scanned:0 all_unreclaimable? no
>   DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB 
> inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB 
> writepending:96kB present:15992kB managed:15916kB mlocked:0kB 
> slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB 
> pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
>   lowmem_reserve[]: 0 813 3474 3474
>   Normal free:41332kB min:41368kB low:51708kB high:62048kB 
> active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB 
> unevictable:0kB writepending:24kB present:897016kB managed:836248kB 
> mlocked:0kB slab_reclaimable:159448kB slab_unreclaimable:69608kB 
> kernel_stack:1112kB pagetables:1404kB bounce:0kB free_pcp:528kB 
> local_pcp:340kB free_cma:0kB
>   lowmem_reserve[]: 0 0 21292 21292
>   HighMem free:781660kB min:512kB low:34356kB high:68200kB 
> active_anon:234740kB inactive_anon:360kB active_file:557232kB 
> inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB 
> managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB 
> kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB 
> free_cma:0kB
> 
> the oom killer is clearly premature because there is still a lot of
> page cache in the zone Normal which should satisfy this lowmem request. 
> Further debugging has shown that the reclaim cannot make any forward
> progress because the page cache is hidden in the active list which doesn't
> get rotated because inactive_list_is_low is not memcg aware.
> 
> It simply subtracts per-zone highmem counters from the respective memcg's
> lru sizes which doesn't make any sense.  We can simply end up always
> seeing the resulting active and inactive counts 0 and return false.  This
> issue is not limited to 32b kernels but in practice the effect on systems
> without CONFIG_HIGHMEM would be much harder to notice because we do not
> invoke the OOM killer for allocation requests targeting < ZONE_NORMAL.
> 
> Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
> and subtract per-memcg highmem counts when memcg is enabled.  Introduce
> helper lruvec_zone_lru_size which redirects to either zone counters or
> mem_cgroup_get_zone_lru_size when appropriate.
> 
> We are losing empty LRU but non-zero lru size detection introduced by
> ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
> of the inherent zone vs.  node discrepancy.
> 
> Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible 
> zones inactive ratio")
> Link: http://lkml.kernel.org/r/20170104100825.3729-1-mho...@kernel.org
> Signed-off-by: Michal Hocko 
> Reported-by: Nils Holland 
> Tested-by: Nils Holland 
> Reported-by: Klaus Ethgen 
> Acked-by: Minchan Kim 

Re: mm, vmscan: commit makes PAE kernel crash nightly (bisected)

2017-01-11 Thread Mel Gorman
On Wed, Jan 11, 2017 at 04:32:43AM -0600, Trevor Cordes wrote:
> Hi!  I have bisected a nightly oom-killer flood and crash/hang on one of 
> the boxes I admin.  It doesn't crash on Fedora 23/24 4.7.10 kernel but 
> does on any 4.8 Fedora kernel.  I did a vanilla bisect and the bug is 
> here:
> 
> commit b2e18757f2c9d1cdd746a882e9878852fdec9501
> Author: Mel Gorman 
> Date:   Thu Jul 28 15:45:37 2016 -0700
> 
> mm, vmscan: begin reclaiming pages on a per-node basis
> 

Michal Hocko recently worked on a bug similar to this. Can you test the
following patch that is currently queued in Andrew Morton's tree? It
applies cleanly to 4.9

Thanks.

From: Michal Hocko 
Subject: mm, memcg: fix the active list aging for lowmem requests when memcg is 
enabled

Nils Holland and Klaus Ethgen have reported unexpected OOM killer
invocations with 32b kernel starting with 4.8 kernels

kworker/u4:5 invoked oom-killer: 
gfp_mask=0x2400840(GFP_NOFS|__GFP_NOFAIL), nodemask=0, order=0, oom_score_adj=0
kworker/u4:5 cpuset=/ mems_allowed=0
CPU: 1 PID: 2603 Comm: kworker/u4:5 Not tainted 4.9.0-gentoo #2
[...]
Mem-Info:
active_anon:58685 inactive_anon:90 isolated_anon:0
 active_file:274324 inactive_file:281962 isolated_file:0
 unevictable:0 dirty:649 writeback:0 unstable:0
 slab_reclaimable:40662 slab_unreclaimable:17754
 mapped:7382 shmem:202 pagetables:351 bounce:0
 free:206736 free_pcp:332 free_cma:0
Node 0 active_anon:234740kB inactive_anon:360kB active_file:1097296kB 
inactive_file:1127848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
mapped:29528kB dirty:2596kB writeback:0kB shmem:0kB shmem_thp: 0kB 
shmem_pmdmapped: 184320kB anon_thp: 808kB writeback_tmp:0kB unstable:0kB 
pages_scanned:0 all_unreclaimable? no
DMA free:3952kB min:788kB low:984kB high:1180kB active_anon:0kB 
inactive_anon:0kB active_file:7316kB inactive_file:0kB unevictable:0kB 
writepending:96kB present:15992kB managed:15916kB mlocked:0kB 
slab_reclaimable:3200kB slab_unreclaimable:1408kB kernel_stack:0kB 
pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 813 3474 3474
Normal free:41332kB min:41368kB low:51708kB high:62048kB 
active_anon:0kB inactive_anon:0kB active_file:532748kB inactive_file:44kB 
unevictable:0kB writepending:24kB present:897016kB managed:836248kB mlocked:0kB 
slab_reclaimable:159448kB slab_unreclaimable:69608kB kernel_stack:1112kB 
pagetables:1404kB bounce:0kB free_pcp:528kB local_pcp:340kB free_cma:0kB
lowmem_reserve[]: 0 0 21292 21292
HighMem free:781660kB min:512kB low:34356kB high:68200kB 
active_anon:234740kB inactive_anon:360kB active_file:557232kB 
inactive_file:1127804kB unevictable:0kB writepending:2592kB present:2725384kB 
managed:2725384kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB 
kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:800kB local_pcp:608kB 
free_cma:0kB

the oom killer is clearly premature because there is still a lot of
page cache in the zone Normal which should satisfy this lowmem request. 
Further debugging has shown that the reclaim cannot make any forward
progress because the page cache is hidden in the active list which doesn't
get rotated because inactive_list_is_low is not memcg aware.

It simply subtracts per-zone highmem counters from the respective memcg's
lru sizes which doesn't make any sense.  We can simply end up always
seeing the resulting active and inactive counts 0 and return false.  This
issue is not limited to 32b kernels but in practice the effect on systems
without CONFIG_HIGHMEM would be much harder to notice because we do not
invoke the OOM killer for allocation requests targeting < ZONE_NORMAL.

Fix the issue by tracking per zone lru page counts in mem_cgroup_per_node
and subtract per-memcg highmem counts when memcg is enabled.  Introduce
helper lruvec_zone_lru_size which redirects to either zone counters or
mem_cgroup_get_zone_lru_size when appropriate.

We are losing empty LRU but non-zero lru size detection introduced by
ca707239e8a7 ("mm: update_lru_size warn and reset bad lru_size") because
of the inherent zone vs.  node discrepancy.

Fixes: f8d1a31163fc ("mm: consider whether to decivate based on eligible zones 
inactive ratio")
Link: http://lkml.kernel.org/r/20170104100825.3729-1-mho...@kernel.org
Signed-off-by: Michal Hocko 
Reported-by: Nils Holland 
Tested-by: Nils Holland 
Reported-by: Klaus Ethgen 
Acked-by: Minchan Kim 
Acked-by: Mel Gorman 
Acked-by: Johannes Weiner 
Reviewed-by: Vladimir Davydov 
Cc: [4.8+]
Signed-off-by: Andrew Morton 
---

 include/linux/memcontrol.h |   26 +++---
 include/linux/mm_inline.h  |2 +-
 mm/memcontrol.c|   18 --
 mm/vmscan.c|   27 +--
 4 files changed, 49 insertions(+), 24 deletions(-)

diff -puN
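
The diff itself is truncated in this archive. For illustration, here
are two sketches reconstructed from the changelog above rather than
from the verbatim patch. First, the broken pre-fix arithmetic in
inactive_list_is_low(), with the hypothetical zone_lru() standing in
for the zone_page_state() lookups:

	static bool inactive_list_is_low(struct lruvec *lruvec, int file,
					 struct scan_control *sc)
	{
		unsigned long inactive, active;
		int zid;

		/* memcg-local LRU sizes ... */
		inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
		active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);

		/*
		 * ... minus *global* per-zone highmem counters.  For a
		 * memcg whose lists are smaller than the zone totals,
		 * both values clamp to zero, the function keeps
		 * returning false, and the active list is never rotated.
		 */
		for (zid = sc->reclaim_idx + 1; zid < MAX_NR_ZONES; zid++) {
			inactive -= min(inactive,
					zone_lru(zid, file * LRU_FILE));
			active -= min(active,
				      zone_lru(zid, file * LRU_FILE + LRU_ACTIVE));
		}

		return inactive < active;	/* roughly; ratios elided */
	}

And second, the helper the changelog introduces, which uses the
per-memcg counters only when memcg is enabled:

	static unsigned long lruvec_zone_lru_size(struct lruvec *lruvec,
						  enum lru_list lru,
						  int zone_idx)
	{
		if (!mem_cgroup_disabled())
			return mem_cgroup_get_zone_lru_size(lruvec, lru,
							    zone_idx);

		return zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zone_idx],
				       NR_ZONE_LRU_BASE + lru);
	}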