Matt McKinnon posted on Wed, 04 Jan 2017 10:25:17 -0500 as excerpted:

> Hi All,
>
> I seem to have a similar issue to a subject in December:
>
> Subject: page allocation stall in kernel 4.9 when copying files from one
> btrfs hdd to another
>
> In my case, this is caused when rsync'ing large amounts of data over NFS
> to the server with the BTRFS file system. This was not apparent in the
> previous kernel (4.7).
>
> The poster mentioned some suggestions from Duncan here:
>
> https://mail-archive.com/linux-btrfs@vger.kernel.org/msg60083.html
>
> But those are not visible in the thread. What suggestions were given to
> help alleviate this pain?
In his case the copying was from 7.2krpm to 5.6krpm drives, and the stall did not occur in the reverse direction, copying from the slower to the faster drive. I said that sounded very much like an earlier bug report to both this list and LKML, where Linus responded, suggesting twiddling the dirty_* writecache knobs...

Here's my earlier post there, quoted (nearly) verbatim, including footnotes. I don't know how much memory your system has, but the numbers below for my 16 GiB system should give you a reasonable idea of initial ballpark settings...

It's generally accepted wisdom among kernel devs and sysadmins[1] that the existing dirty_* write-cache defaults, set at a time when common system memories were measured in MiB rather than today's GiB, are no longer appropriate and should be lowered. But there's no agreement on precisely what the new settings should be, and between inertia and the lack of practical pressure -- those who know about the problem have long since adjusted their own systems -- the now generally-agreed-inappropriate defaults remain. =:^(

These knobs can be tweaked in several ways. For temporary experimentation, it's generally easiest to write (as root) updated values directly to the /proc/sys/vm/dirty_* files themselves. Once you find values you're comfortable with, most distros have an existing sysctl config[2] that can be altered as appropriate, so the settings get reapplied at each boot.
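That experimentation workflow can be sketched as below. The read loop is safe to run unprivileged on any Linux box; the write step (shown only as a comment) needs root and lasts until reboot:

```shell
# Read the current write-cache knobs from the standard Linux proc interface.
for f in dirty_ratio dirty_background_ratio \
         dirty_expire_centisecs dirty_writeback_centisecs; do
  printf 'vm.%s = %s\n' "$f" "$(cat /proc/sys/vm/$f)"
done

# Temporary change, as root, gone after reboot -- e.g.:
#   echo 3 > /proc/sys/vm/dirty_ratio
```

Once you've settled on values, move them into your distro's sysctl config so they survive a reboot.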
Various articles with the details are easily googled so I'll be brief here, but here are the apropos settings and comments from my own /etc/sysctl.conf, with a brief explanation:

# write-cache, foreground/background flushing
# vm.dirty_ratio = 10 (% of RAM)
# make it 3% of 16G ~ half a gig
vm.dirty_ratio = 3
# vm.dirty_bytes = 0
# vm.dirty_background_ratio = 5 (% of RAM)
# make it 1% of 16G ~ 160 M
vm.dirty_background_ratio = 1
# vm.dirty_background_bytes = 0
# vm.dirty_expire_centisecs = 2999 (30 sec)
# vm.dirty_writeback_centisecs = 499 (5 sec)
# make it 10 sec
vm.dirty_writeback_centisecs = 1000

The *_bytes and *_ratio files configure the same thing in different ways: ratio is a percentage of RAM, bytes is... bytes. Set whichever one you prefer and the other will be automatically zeroed out.

The vm.dirty_background_* settings control when the kernel starts lower-priority background flushing, while the high-priority vm.dirty_* (non-background) settings control when the kernel forces threads attempting further writes to wait until some currently in-flight writes have completed.

(Rereading this now, I seem to have been inaccurate on one detail. I'm not a dev, and definitely not a kernel dev, but from what I've read, once foreground writeback is triggered the kernel charges writeback to the threads actually doing the writing, so they spend much of the time they'd otherwise use to dirty even more memory in IO-wait instead, waiting for memory they've already dirtied to be written out. That throttles their dirtying rate down to the speed at which writeback is actually occurring.)

Those size thresholds only govern writeback until the expiry time has passed, at which point writeback is forced regardless; that's what vm.dirty_expire_centisecs controls.

The problem is that memory has grown much faster than the speed of actually writing out to slow spinning rust.
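For bigger-memory boxes where even 1% is a lot, the *_bytes knobs let you set absolute thresholds instead of ratios. A sketch of an equivalent sysctl drop-in -- the filename and exact byte values here are illustrative, not from my actual config:

```shell
# Absolute write-cache thresholds; setting a *_bytes knob automatically
# zeroes its *_ratio twin.  Written to /tmp only for demonstration.
cat > /tmp/99-writeback.conf <<'EOF'
# ~160 MiB background flush trigger
vm.dirty_background_bytes = 167772160
# ~512 MiB foreground (blocking) threshold
vm.dirty_bytes = 536870912
EOF
cat /tmp/99-writeback.conf
```

On a real system the file would go in /etc/sysctl.d/ (or be merged into /etc/sysctl.conf) and be loaded at boot, or immediately with `sysctl --system` as root.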
(Fast ssds have far fewer issues in this regard, tho slow flash like common USB thumb drives remains affected -- indeed, sometimes even more so.)

Common spinning-rust write speeds are around 100 MiB/sec, and random writes may be as slow as 30 MiB/sec. Meanwhile, the default 10% dirty_ratio on a 16 GiB system approaches[3] 1.6 GiB, ~1600 MiB. At 100 MiB/sec that's 16 seconds worth of writeback to clear. At 30 MiB/sec, that's... well beyond the 30-second expiry time!

To be clear, there's still a bug if the system crashes as a result -- the normal case should simply be a system that at worst doesn't respond for the writeback period. That's certainly a problem in itself when the period exceeds double-digit seconds, but surely less of one than a total crash, as long as the system /does/ come back after perhaps half a minute or so.

Anyway, as you can see from the excerpt from my own sysctl.conf above, on my 16 GiB system I use a much more reasonable 1% background writeback trigger (~160 MiB) and a 3% high-priority/foreground threshold (~half a GiB). I actually set those long ago, before I switched to btrfs and before I switched to ssd as well, but even tho ssd should work far better with the defaults than spinning rust does, those settings don't hurt on ssd either, and I've seen no reason to change them.

So try 1% background and 3% foreground flushing ratios on your 32 GiB system as well and see if that helps, or possibly set the _bytes values instead, since even 1% of 32 GiB is still quite large in writeback-time terms. Tweaking those down certainly helped the previously reported bug -- he couldn't reproduce after that -- and based on your posted meminfo you're running 2+ GiB dirty now, so it should reduce that and hopefully eliminate the trigger for you, tho of course it won't fix the root bug.
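The drain-time arithmetic above can be sketched as a quick back-of-the-envelope calculation, using the same figures from the discussion (16 GiB RAM, the default 10% dirty_ratio, 100 vs 30 MiB/sec disks):

```shell
# Worst-case time to flush a full dirty_ratio's worth of dirty pages.
ram_mib=16384                            # 16 GiB system
ratio=10                                 # default vm.dirty_ratio, percent
dirty_mib=$(( ram_mib * ratio / 100 ))   # ~1638 MiB may be dirty
for speed in 100 30; do                  # MiB/s: fast vs slow spinning rust
  echo "${speed} MiB/s: $(( dirty_mib / speed )) s to drain ${dirty_mib} MiB"
done
```

At 100 MiB/s that's 16 seconds; at 30 MiB/s it's ~54 seconds, well past the 30-second dirty_expire_centisecs default, matching the stall behaviour described above.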
As I said, it shouldn't crash in any case, even if it goes unresponsive for half a minute or so at a time, so there's certainly a bug to fix, but the above will hopefully let you work without running into it.

Again, you can write the new values directly to the proc interface without rebooting, for experimentation. Once you find values appropriate for you, however, write them to sysctl.conf or whatever your distro uses instead, so they get applied automatically at each boot.

---
[1] Sysadmins: Like me. No claim to dev here, nor am I a professional sysadmin, but arguably I do take the responsibility of adminning my own systems more seriously than most appear to -- enough to claim sysadmin as an appropriate descriptor.

[2] Sysctl config: Look in /etc/sysctl.d/* and/or /etc/sysctl.conf, as appropriate to your distro.

[3] Approaches: The memory figure used for calculating this percentage excludes some things, so it won't actually reach 10% of total memory. But the exclusions are small enough that they can be hand-waved away for purposes of this discussion.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html