Bug#695182: [RFC] Reproducible OOM with partial workaround

2013-01-13 Thread paul . szabo
Dear Dave,

You wrote:

 ... 64-bit kernels should basically be drop-in replacements for 32-bit
 ones.  You can keep userspace 100% 32-bit, and just have a 64-bit
 kernel.

Any advice on how I would install a 64-bit kernel, particularly in the
Debian world? Seems to me that on a 32-bit machine, apt-get does not
see the amd64 kernels.

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia


-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#695182: [RFC] Reproducible OOM with partial workaround

2013-01-11 Thread Andrew Morton
On Fri, 11 Jan 2013 12:46:15 +1100 paul.sz...@sydney.edu.au wrote:

  ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
  kernel without either violating the ABI (3GB/1GB split) or doing
  something that never got merged upstream ...
 
 Sorry to be so contradictory:
 
 psz@como:~$ uname -a
 Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5 18:34:25 EST 2013 i686 GNU/Linux
 psz@como:~$ free -l
              total       used       free     shared    buffers     cached
 Mem:      64446900    4729292   59717608          0      15972     480520
 Low:        375836     304400      71436
 High:     64071064    4424892   59646172
 -/+ buffers/cache:    4232800   60214100
 Swap:    134217724          0  134217724
 psz@como:~$ 
 
 (though I would not know about violations).
 
 But OK, I take your point that I should move with the times.

Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.
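
A minimal sketch of that slabinfo check, assuming the version 2.1 column layout (name, active_objs, num_objs, objsize, ...). The helper names and the buffer_head sample row are hypothetical; the ext3_inode_cache row uses the figures from the dump in this thread. It estimates each cache as num_objs * objsize, which ignores per-slab overhead.

```python
def slab_usage(slabinfo_text):
    """Return {cache_name: bytes}, estimated as num_objs * objsize."""
    usage = {}
    for line in slabinfo_text.splitlines():
        # Skip blank lines, the version banner and the column header.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        # Only the fields before the first ':' are needed here.
        fields = line.split(":")[0].split()
        if len(fields) < 4:
            continue
        name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
        usage[name] = num_objs * objsize
    return usage


def top_caches(slabinfo_text, count=5):
    """Return the `count` largest caches as (name, bytes), biggest first."""
    usage = slab_usage(slabinfo_text)
    return sorted(usage.items(), key=lambda kv: kv[1], reverse=True)[:count]


# Hypothetical sample; on a live system read /proc/slabinfo (usually as root).
SAMPLE = """\
slabinfo - version: 2.1
# name active_objs num_objs objsize objperslab pagesperslab : tunables limit batchcount sharedfactor : slabdata active_slabs num_slabs sharedavail
buffer_head 1000 2000 56 73 1 : tunables 0 0 0 : slabdata 28 28 0
ext3_inode_cache 16531 19965 488 33 4 : tunables 0 0 0 : slabdata 605 605 0
"""

if __name__ == "__main__":
    for name, size in top_caches(SAMPLE):
        print(f"{name:24s} {size / 1024:10.1f} KiB")
```

If buffer_head dominates the list on a machine that is close to OOM, that would point at the lowmem exhaustion described here.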

If so, you *may* be able to work around this by setting
/proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
amount of dirty pagecache around.  Then, with luck, if we haven't
broken the buffer_heads_over_limit logic in the past decade (we
probably have), the VM should be able to reclaim those buffer_heads.

Alternatively, use a filesystem which doesn't attach buffer_heads to
dirty pages.  xfs or btrfs, perhaps.





Bug#695182: [RFC] Reproducible OOM with partial workaround

2013-01-11 Thread Simon Jeons
On Fri, 2013-01-11 at 00:01 -0800, Andrew Morton wrote:
 On Fri, 11 Jan 2013 12:46:15 +1100 paul.sz...@sydney.edu.au wrote:
 
   ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
   kernel without either violating the ABI (3GB/1GB split) or doing
   something that never got merged upstream ...
  
  Sorry to be so contradictory:
  
  psz@como:~$ uname -a
  Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5 18:34:25 EST 2013 i686 GNU/Linux
  psz@como:~$ free -l
               total       used       free     shared    buffers     cached
  Mem:      64446900    4729292   59717608          0      15972     480520
  Low:        375836     304400      71436
  High:     64071064    4424892   59646172
  -/+ buffers/cache:    4232800   60214100
  Swap:    134217724          0  134217724
  psz@como:~$ 
  
  (though I would not know about violations).
  
  But OK, I take your point that I should move with the times.
 
 Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.
 
 If so, you *may* be able to work around this by setting
 /proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
 amount of dirty pagecache around.  Then, with luck, if we haven't
 broken the buffer_heads_over_limit logic in the past decade (we
 probably have), the VM should be able to reclaim those buffer_heads.
 
 Alternatively, use a filesystem which doesn't attach buffer_heads to
 dirty pages.  xfs or btrfs, perhaps.
 

Hi Andrew,

What's the meaning of attaching buffer_heads to dirty pages?




Bug#695182: [RFC] Reproducible OOM with partial workaround

2013-01-11 Thread paul . szabo
Dear Andrew,

 Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.

Please see below: I do not know what any of that means. This machine has
been running just fine, with all my users logging in here via XDMCP from
X-terminals, dozens logged in simultaneously. (But, I think I could make
it go OOM with more processes or logins.)

 If so, you *may* be able to work around this by setting
 /proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
 amount of dirty pagecache around.  Then, with luck, if we haven't
 broken the buffer_heads_over_limit logic in the past decade (we
 probably have), the VM should be able to reclaim those buffer_heads.

I tried setting dirty_ratio to funny values, that did not seem to
help. Did you notice my patch about bdi_position_ratio(), how it was
plain wrong half the time (for negative x)? Anyway that did not help.

 Alternatively, use a filesystem which doesn't attach buffer_heads to
 dirty pages.  xfs or btrfs, perhaps.

Seems there is also a problem not related to filesystem... or rather,
the essence does not seem to be filesystem or caches. The filesystem
thing now seems OK with my patch doing drop_caches.

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia


---

root@como:~# free -lm
             total       used       free     shared    buffers     cached
Mem:         62936       2317      60618          0         41        635
Low:           367        271         95
High:        62569       2045      60523
-/+ buffers/cache:       1640      61295
Swap:       131071          0     131071
root@como:~# cat /proc/slabinfo
slabinfo - version: 2.1
# name            active_objs num_objs objsize objperslab pagesperslab : tunables limit batchcount sharedfactor : slabdata active_slabs num_slabs sharedavail
fuse_request           0      0    376   43    4 : tunables    0    0    0 : slabdata      0      0      0
fuse_inode             0      0    448   36    4 : tunables    0    0    0 : slabdata      0      0      0
bsg_cmd                0      0    288   28    2 : tunables    0    0    0 : slabdata      0      0      0
ntfs_big_inode_cache      0      0    512   32    4 : tunables    0    0    0 : slabdata      0      0      0
ntfs_inode_cache       0      0    176   46    2 : tunables    0    0    0 : slabdata      0      0      0
nfs_direct_cache       0      0     80   51    1 : tunables    0    0    0 : slabdata      0      0      0
nfs_inode_cache     5404   5404    584   28    4 : tunables    0    0    0 : slabdata    193    193      0
isofs_inode_cache      0      0    360   45    4 : tunables    0    0    0 : slabdata      0      0      0
fat_inode_cache        0      0    408   40    4 : tunables    0    0    0 : slabdata      0      0      0
fat_cache              0      0     24  170    1 : tunables    0    0    0 : slabdata      0      0      0
jbd2_revoke_record      0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
journal_handle      5440   5440     24  170    1 : tunables    0    0    0 : slabdata     32     32      0
journal_head       16768  16768     64   64    1 : tunables    0    0    0 : slabdata    262    262      0
revoke_record      20224  20224     16  256    1 : tunables    0    0    0 : slabdata     79     79      0
ext4_inode_cache       0      0    584   28    4 : tunables    0    0    0 : slabdata      0      0      0
ext4_free_data         0      0     40  102    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_allocation_context      0      0    112   36    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_prealloc_space      0      0     72   56    1 : tunables    0    0    0 : slabdata      0      0      0
ext4_io_end            0      0    576   28    4 : tunables    0    0    0 : slabdata      0      0      0
ext4_io_page           0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
ext2_inode_cache       0      0    480   34    4 : tunables    0    0    0 : slabdata      0      0      0
ext3_inode_cache   16531  19965    488   33    4 : tunables    0    0    0 : slabdata    605    605      0
ext3_xattr             0      0     48   85    1 : tunables    0    0    0 : slabdata      0      0      0
dquot                840    840    192   42    2 : tunables    0    0    0 : slabdata     20     20      0
rpc_inode_cache      144    144    448   36    4 : tunables    0    0    0 : slabdata      4      4      0
UDP-Lite               0      0    576   28    4 : tunables    0    0    0 : slabdata      0      0      0
xfrm_dst_cache         0      0    320   51    4 : tunables    0    0    0 : slabdata      0      0      0
UDP                  896    896    576   28    4 : tunables    0    0    0 : slabdata     32     32      0
tw_sock_TCP         1344   1344    128   32    1 : 

Bug#695182: [RFC] Reproducible OOM with partial workaround

2013-01-11 Thread Dave Hansen
On 01/10/2013 05:46 PM, paul.sz...@sydney.edu.au wrote:
  ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
  kernel without either violating the ABI (3GB/1GB split) or doing
  something that never got merged upstream ...
 Sorry to be so contradictory:
 
 psz@como:~$ uname -a
 Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5 18:34:25 EST 2013 i686 GNU/Linux
 psz@como:~$ free -l
              total       used       free     shared    buffers     cached
 Mem:      64446900    4729292   59717608          0      15972     480520
 Low:        375836     304400      71436
 High:     64071064    4424892   59646172
 -/+ buffers/cache:    4232800   60214100
 Swap:    134217724          0  134217724

Hey, that's pretty cool!  I would swear that the mem_map[] overhead was
such that they wouldn't boot, but perhaps those brain cells died on me.





Bug#695182: [RFC] Reproducible OOM with partial workaround

2013-01-11 Thread Andrew Morton
On Fri, 11 Jan 2013 22:51:35 +1100
paul.sz...@sydney.edu.au wrote:

 Dear Andrew,
 
  Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.
 
 Please see below: I do not know what any of that means. This machine has
 been running just fine, with all my users logging in here via XDMCP from
 X-terminals, dozens logged in simultaneously. (But, I think I could make
 it go OOM with more processes or logins.)

I'm counting 107MB in slab there.  Was this dump taken when the system
was at or near oom?

Please send a copy of the oom-killer kernel message dump, if you still
have one.

  If so, you *may* be able to work around this by setting
  /proc/sys/vm/dirty_ratio really low, so the system keeps a minimum
  amount of dirty pagecache around.  Then, with luck, if we haven't
  broken the buffer_heads_over_limit logic in the past decade (we
  probably have), the VM should be able to reclaim those buffer_heads.
 
 I tried setting dirty_ratio to funny values, that did not seem to
 help.

Did you try setting it as low as possible?

 Did you notice my patch about bdi_position_ratio(), how it was
 plain wrong half the time (for negative x)? 

Nope, please resend.

 Anyway that did not help.
 
  Alternatively, use a filesystem which doesn't attach buffer_heads to
  dirty pages.  xfs or btrfs, perhaps.
 
 Seems there is also a problem not related to filesystem... or rather,
 the essence does not seem to be filesystem or caches. The filesystem
 thing now seems OK with my patch doing drop_caches.

hm, if doing a regular drop_caches fixes things then that implies the
problem is not with dirty pagecache.  Odd.





Bug#695182: [RFC] Reproducible OOM with partial workaround

2013-01-11 Thread paul . szabo
Dear Andrew,

 Check /proc/slabinfo, see if all your lowmem got eaten up by buffer_heads.
 Please see below ...
 ... Was this dump taken when the system was at or near oom?

No, that was a quiescent machine. Please see a just-before-OOM dump in
my next message (in a little while).

 Please send a copy of the oom-killer kernel message dump, if you still
 have one.

Please see one in next message, or in
http://bugs.debian.org/695182

 I tried setting dirty_ratio to funny values, that did not seem to
 help.
 Did you try setting it as low as possible?

Probably. Maybe. Sorry, cannot say with certainty.

 Did you notice my patch about bdi_position_ratio(), how it was
 plain wrong half the time (for negative x)? 
 Nope, please resend.

Quoting from
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=101;att=1;bug=695182
:
...
 - In bdi_position_ratio() get difference (setpoint-dirty) right even
   when it is negative, which happens often. Normally these numbers are
   small and even with left-shift I never observed a 32-bit overflow.
   I believe it should be possible to re-write the whole function in
   32-bit ints; maybe it is not worth the effort to make it efficient;
   seeing how this function was always wrong and we survived, it should
   simply be removed.
...
--- mm/page-writeback.c.old 2012-10-17 13:50:15.0 +1100
+++ mm/page-writeback.c 2013-01-06 21:54:59.0 +1100
[ Line numbers out because other patches not shown ]
...
@@ -559,7 +578,7 @@ static unsigned long bdi_position_ratio(
      *     => fast response on large errors; small oscillation near setpoint
      */
     setpoint = (freerun + limit) / 2;
-    x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+    x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
                 limit - setpoint + 1);
     pos_ratio = x;
     pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
...
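
To illustrate why the sign matters, here is a rough Python emulation of the arithmetic, not kernel code. It assumes that setpoint, dirty and limit are 32-bit unsigned long on i386, that RATELIMIT_CALC_SHIFT is 10, and that div_s64() truncates toward zero; the function names and the sample values are hypothetical.

```python
MASK32 = 0xFFFFFFFF          # unsigned long is 32 bits wide on i386
RATELIMIT_CALC_SHIFT = 10    # assumed value of the kernel constant


def trunc_div(a, b):
    """Integer division truncating toward zero, like C division."""
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q


def x_unsigned32(setpoint, dirty, limit):
    """Old expression: subtraction and shift wrap modulo 2**32."""
    diff = (setpoint - dirty) & MASK32
    numerator = (diff << RATELIMIT_CALC_SHIFT) & MASK32
    # The wrapped value is then widened to 64 bits as a non-negative number.
    return trunc_div(numerator, limit - setpoint + 1)


def x_signed64(setpoint, dirty, limit):
    """Patched expression: the difference is taken in signed 64-bit."""
    numerator = (setpoint - dirty) << RATELIMIT_CALC_SHIFT
    return trunc_div(numerator, limit - setpoint + 1)


if __name__ == "__main__":
    # dirty above setpoint: x should be negative
    print(x_unsigned32(1000, 1100, 1500), x_signed64(1000, 1100, 1500))
    # dirty below setpoint: both forms agree
    print(x_unsigned32(1000, 900, 1500), x_signed64(1000, 900, 1500))
```

Under these assumptions the unsigned form yields a large positive x whenever dirty exceeds setpoint, where the signed form yields a small negative one.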

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia





Bug#695182: [RFC] Reproducible OOM with partial workaround

2013-01-10 Thread paul . szabo
Dear Linux-MM,

On a machine with i386 kernel and over 32GB RAM, an OOM condition is
reliably obtained simply by writing a few files to some local disk
e.g. with:
  n=0; while [ $n -lt 99 ]; do dd bs=1M count=1024 if=/dev/zero of=x$n; ((n=$n+1)); done
Crash usually occurs after 16 or 32 files written. Seems that the
problem may be avoided by using mem=32G on the kernel boot, and that
it occurs with any amount of RAM over 32GB.

I developed a workaround patch for this particular OOM demo, dropping
filesystem caches when about to exhaust lowmem. However, subsequently
I observed OOM when running many processes (as yet I do not have an
easy-to-reproduce demo of this); so as I suspected, the essence of the
problem is not with FS caches.

Could you please help in finding the cause of this OOM bug?

Please see
http://bugs.debian.org/695182
for details, in particular my workaround patch
http://bugs.debian.org/cgi-bin/bugreport.cgi?msg=101;att=1;bug=695182

(Please reply to me directly, as I am not a subscriber to the linux-mm
mailing list.)

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia





Bug#695182: [RFC] Reproducible OOM with partial workaround

2013-01-10 Thread Dave Hansen
On 01/10/2013 01:58 PM, paul.sz...@sydney.edu.au wrote:
 I developed a workaround patch for this particular OOM demo, dropping
 filesystem caches when about to exhaust lowmem. However, subsequently
 I observed OOM when running many processes (as yet I do not have an
 easy-to-reproduce demo of this); so as I suspected, the essence of the
 problem is not with FS caches.
 
 Could you please help in finding the cause of this OOM bug?

As was mentioned in the bug, your 32GB of physical memory only ends up
giving ~900MB of low memory to the kernel.  Of that, around 600MB is
used for mem_map[], leaving only about 300MB available to the kernel
for *ALL* of its allocations at runtime.
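
The overhead can be estimated with a back-of-the-envelope calculation. The sketch below assumes 4 KiB pages, a 32-byte struct page and an 896 MiB lowmem ceiling under the 3GB/1GB split; all three constants are assumptions, and the real figures depend on the kernel version and configuration.

```python
PAGE_SIZE = 4096               # bytes per page on i386
STRUCT_PAGE_BYTES = 32         # assumed sizeof(struct page) on a 32-bit kernel
LOWMEM_BYTES = 896 * 2**20     # assumed lowmem ceiling with a 3GB/1GB split
MIB = 2**20
GIB = 2**30


def mem_map_bytes(ram_bytes):
    """Size of mem_map[]: one struct page per physical page frame."""
    return (ram_bytes // PAGE_SIZE) * STRUCT_PAGE_BYTES


def lowmem_left(ram_bytes):
    """Lowmem remaining for all other kernel allocations."""
    return LOWMEM_BYTES - mem_map_bytes(ram_bytes)


if __name__ == "__main__":
    for gib in (4, 16, 32, 64):
        ram = gib * GIB
        print(f"{gib:3d} GiB RAM: mem_map {mem_map_bytes(ram) // MIB:4d} MiB, "
              f"lowmem left {lowmem_left(ram) // MIB:4d} MiB")
```

With these assumed constants, 64 GiB of RAM leaves 384 MiB of lowmem, which is in the same range as the Low total that free -l reports elsewhere in this thread.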

Your configuration has never worked.  This isn't a regression, it's
simply something that we know never worked in Linux and it's a very hard
problem to solve.  One Linux vendor (at least) went to a huge amount of
trouble to develop, ship, and support a kernel that supported large
32-bit machines, but it was never merged upstream and work stopped on it
when such machines became rare beasts:

http://lwn.net/Articles/39925/

I believe just about any Linux vendor would call your configuration
unsupported.  Just because the kernel can boot does not mean that we
expect it to work.

It's possible that some tweaks of the vm knobs (like lowmem_reserve)
could help you here.  But, really, you don't want to run a 32-bit kernel
on such a large machine.  Very, very few folks are running 32-bit
kernels on these systems and you're likely to keep running into bugs
because this is such a rare configuration.

We've been very careful to ensure that 64-bit kernels should basically be
drop-in replacements for 32-bit ones.  You can keep userspace 100%
32-bit, and just have a 64-bit kernel.

If you're really set on staying 32-bit, I might have a NUMA-Q I can give
you. ;)





Bug#695182: [RFC] Reproducible OOM with partial workaround

2013-01-10 Thread paul . szabo
Dear Dave,

 Your configuration has never worked.  This isn't a regression ...
 ... does not mean that we expect it to work.

Do you mean that CONFIG_HIGHMEM64G is deprecated, should not be used;
that all development is for 64-bit only?

 ... 64-bit kernels should basically be drop-in replacements ...

Will think about that. I know all my servers are 64-bit capable, will
need to check all my desktops.

---

I find it puzzling that there seems to be a sharp cutoff at 32GB RAM,
no problem under but OOM just over; whereas I would have expected
lowmem starvation to be gradual, with OOM occurring much sooner with
64GB than with 34GB. Also, the kernel seems capable of reclaiming
lowmem, so I wonder why that fails just over the 32GB threshold.
(Obviously I have no idea what I am talking about.)

---

Thanks, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia





Bug#695182: [RFC] Reproducible OOM with partial workaround

2013-01-10 Thread Dave Hansen
On 01/10/2013 04:46 PM, paul.sz...@sydney.edu.au wrote:
 Your configuration has never worked.  This isn't a regression ...
 ... does not mean that we expect it to work.
 
 Do you mean that CONFIG_HIGHMEM64G is deprecated, should not be used;
 that all development is for 64-bit only?

My last 4GB laptop had a 1GB hole and needed HIGHMEM64G since it had RAM
at 0-5GB.  That worked just fine, btw.  The problem isn't with
HIGHMEM64G itself.

I'm not saying HIGHMEM64G is inherently bad, just that it gets gradually
worse and worse as you add more RAM.  I don't believe 64GB of RAM has
_ever_ been booted on a 32-bit kernel without either violating the ABI
(3GB/1GB split) or doing something that never got merged upstream (that
4GB/4GB split, or other fun stuff like page clustering).

 I find it puzzling that there seems to be a sharp cutoff at 32GB RAM,
 no problem under but OOM just over; whereas I would have expected
 lowmem starvation to be gradual, with OOM occurring much sooner with
 64GB than with 34GB. Also, the kernel seems capable of reclaiming
 lowmem, so I wonder why that fails just over the 32GB threshold.
 (Obviously I have no idea what I am talking about.)

It _is_ puzzling.  It isn't immediately obvious to me why the slab that
you have isn't being reclaimed.  There might, indeed, be a fixable bug
there.  But, there are probably a bunch more bugs which will keep you
from having a nice, smoothly-running system; mostly, those bugs have not
had much attention in the 10 years or so since 64-bit x86 became
commonplace.  Plus, even 10 years ago, when folks were working on this
actively, we _never_ got things running smoothly on 32GB of RAM.  Take a
look at this:

http://support.bull.com/ols/product/system/linux/redhat/help/kbf/g/inst/PrKB11417

You are effectively running the SMP kernel (hugemem is a completely
different beast).

I had a 32GB i386 system.  It was a really, really fun system to play
with, and its never-ending list of bugs helped keep me employed for
several years.  You don't want to unnecessarily inflict that pain on
yourself, really.





Bug#695182: [RFC] Reproducible OOM with partial workaround

2013-01-10 Thread paul . szabo
Dear Dave,

 ... I don't believe 64GB of RAM has _ever_ been booted on a 32-bit
 kernel without either violating the ABI (3GB/1GB split) or doing
 something that never got merged upstream ...

Sorry to be so contradictory:

psz@como:~$ uname -a
Linux como.maths.usyd.edu.au 3.2.32-pk06.10-t01-i386 #1 SMP Sat Jan 5 18:34:25 EST 2013 i686 GNU/Linux
psz@como:~$ free -l
             total       used       free     shared    buffers     cached
Mem:      64446900    4729292   59717608          0      15972     480520
Low:        375836     304400      71436
High:     64071064    4424892   59646172
-/+ buffers/cache:    4232800   60214100
Swap:    134217724          0  134217724
psz@como:~$ 

(though I would not know about violations).

But OK, I take your point that I should move with the times.

Cheers, Paul

Paul Szabo   p...@maths.usyd.edu.au   http://www.maths.usyd.edu.au/u/psz/
School of Mathematics and Statistics   University of Sydney    Australia

