On 12/16/2014 06:42 PM, Charles Cazabon wrote:
Hi,

I've been running btrfs for various filesystems for a few years now, and have
recently run into problems with a large filesystem becoming *really* slow for
basic reading.  None of the debugging/testing suggestions I've come across in
the wiki or in the mailing list archives seems to have helped.

Background: this particular filesystem holds backups for various other
machines on the network, a mix of rdiff-backup data (so lots of small files)
and rsync copies of larger files (everything from ~5MB data files to ~60GB VM
HD images).  There's roughly 16TB of data in this filesystem (the filesystem
is ~17TB).  The btrfs filesystem is a simple single volume, no snapshots,
multiple devices, or anything like that.  It's an LVM logical volume on top of
dmcrypt on top of an mdadm RAID set (8 disks in RAID 6).

The performance:  trying to copy the data off this filesystem to another
(non-btrfs) filesystem with rsync or just cp was taking aaaages - I found one
suggestion that it could be because updating the atimes required a COW of the
metadata in btrfs, so I mounted the filesystem noatime, but this doesn't
appear to have made any difference.  The speeds I'm seeing (with iotop)
fluctuate a lot.  They spend most of the time in the range of 1-3 MB/s, with
large periods of time where no IO seems to happen at all, and occasional short
spikes to ~25-30 MB/s.  System load seems to sit around 10-12 (with only 2
processes reported as running, everything else sleeping) while this happens.
The server is doing nothing other than this copy at the time.  The only
processes using any noticeable CPU are rsync (source and destination processes,
around 3% CPU each, plus an md0:raid6 process around 2-3%), and a handful of
"kworker" processes, perhaps one per CPU (there are 8 physical cores in the
server, plus hyperthreading).

Other filesystems on the same physical disks have no trouble exceeding 100MB/s
reads.  The machine is not swapping (16GB RAM, ~8GB swap with 0 swap used).

Is there something obvious I'm missing here?  Is there a reason I can only
average ~3MB/s reads from a btrfs filesystem?

kernel is x86_64 linux-stable 3.17.6.  btrfs-progs is v3.17.3-3-g8cb0438.
Output of the various info commands is:

   $ sudo btrfs fi df /media/backup/
   Data, single: total=16.24TiB, used=15.73TiB
   System, DUP: total=8.00MiB, used=1.75MiB
   System, single: total=4.00MiB, used=0.00
   Metadata, DUP: total=35.50GiB, used=34.05GiB
   Metadata, single: total=8.00MiB, used=0.00
   unknown, single: total=512.00MiB, used=0.00

   $ btrfs --version
   Btrfs v3.17.3-3-g8cb0438

   $ sudo btrfs fi show

   Label: 'backup'  uuid: c18dfd04-d931-4269-b999-e94df3b1918c
     Total devices 1 FS bytes used 15.76TiB
     devid    1 size 16.37TiB used 16.31TiB path /dev/mapper/vg-backup

Thanks in advance for any suggestions.

Charles


Totally spit-balling ideas here (in no particular order, just typing them as they come to me):

Have you tried increasing the number of stripe buffers for the md array? If your data has gotten spread way out, you might be thrashing your stripe cache (see /sys/block/mdN/md/stripe_cache_size).
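
A rough sketch of what that tuning might look like (md0 and the sizes are just examples; substitute your actual array name and experiment with values):

```shell
# Current stripe cache size, in pages per device (the default is often 256):
cat /sys/block/md0/md/stripe_cache_size

# Try a larger cache; takes effect immediately but does not survive a reboot:
echo 8192 | sudo tee /sys/block/md0/md/stripe_cache_size

# Memory cost is stripe_cache_size * page size (4KiB) * number of disks.
# For 8192 entries on your 8-disk array:
echo "$((8192 * 4 * 8 / 1024)) MiB"    # prints "256 MiB"
```

If raising it helps, you can make it persistent with a udev rule or an rc script.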

Have you run SMART checks (smartmontools etc.) against these disks to see if any of them are reporting incipient failure conditions? If one or more drives is reporting recoverable read errors, the silent retries might just be clogging you up.
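
For example (the device names are assumptions; substitute your actual array members):

```shell
# Look for attributes that suggest a struggling drive: reallocated or
# pending sectors, and raw read errors being quietly retried.
for dev in /dev/sd{a..h}; do
    echo "== $dev =="
    sudo smartctl -H "$dev"
    sudo smartctl -A "$dev" | grep -Ei 'realloc|pending|uncorrect|raw_read_error'
done
```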

Try experimentally mounting the filesystem read-only and doing some read tests. Eliminating every possible write source will tell you things. In particular, if all your reads suddenly start breezing through, you know something in the write path is "iffy". One thing that comes to mind is that anything accessing the drive with a barrier-style operation (waiting for verification of data sync all the way to disk) has to pass all the way down through the encryption layer, which could have a multiplier effect (you know, lots of very short delays making a large net delay).
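
Something like this (the mount point is from your mail; the dd source path is a placeholder for one of your actual large files):

```shell
# Make sure nothing can write, then time a cold-cache sequential read.
sudo mount -o remount,ro /media/backup
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches   # evict the page cache first

# Time reading 1GiB from a large file; healthy disks should manage far
# more than the 1-3 MB/s you are seeing now.
dd if=/media/backup/some-large-file of=/dev/null bs=1M count=1024

# Put it back the way it was afterwards:
sudo mount -o remount,rw /media/backup
```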

Have you changed any hardware lately in a way that could de-optimize your interrupt handling?

I have a vague recollection that somewhere in the last month and a half or so there was a patch here (or in the kernel changelogs) about an extra put operation (or something like it) that would cause a worker thread's counter to roll over to -1 and then have to spin back down to zero before work could proceed. I know, could I _be_ more vague? Right? Try switching to kernel 3.18.1 to see if the issue just goes away. (Honestly, this one's been scratching at my brain since I started writing this reply and I just _can't_ remember the reference for it... dangit...)

When was the last time you did any of the maintenance things (like balance or defrag)? Not that I'd want to sit through 16TB of that sort of thing, but I'm curious about the maintenance history.
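
Worth noting from your output: the device is 16.31TiB allocated of 16.37TiB, i.e. nearly fully allocated, which is a classic btrfs slow-down trigger. A sketch of lighter-weight maintenance (the mount point is from your mail, the file paths are placeholders, and the usage thresholds are just starting points):

```shell
# Your data chunks are ~97% full (15.73TiB used of 16.24TiB allocated):
echo "$((15730 * 100 / 16240))%"    # prints "96%"

# A filtered balance only rewrites chunks at or below the given usage
# threshold, so it is far cheaper than a full balance at this size:
sudo btrfs balance start -dusage=50 -musage=50 /media/backup

# Spot-check fragmentation on a suspect file before defragmenting anything:
sudo filefrag /media/backup/some-vm-image.img

# Defragment one subtree at a time rather than the whole volume:
sudo btrfs filesystem defragment -r /media/backup/some-subdir
```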

Does the read performance fall off with uptime? That is, is it "okay" right after a system boot and then starts to fall off as uptime (and activity) increases? I _imagine_ that if your filesystem is huge and your server is modest by comparison in terms of RAM, cache pinning and fragmentation can start becoming real problems. What else besides marshaling this filesystem is this system used for?
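
One way to probe that without waiting for a reboot: drop the caches to approximate a freshly-booted state, rerun the same read test, and watch memory while the copy runs. A sketch:

```shell
# Approximate fresh-boot cache conditions:
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches

# Watch cache growth and kernel-memory use while the rsync runs:
grep -E '^(MemFree|Cached|Slab):' /proc/meminfo

# Low counts in the right-hand (high-order) columns mean physically
# fragmented memory:
cat /proc/buddyinfo
```

If performance is fine right after the cache drop and degrades as Cached/Slab grow, that points at cache pressure rather than the disks.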

Have you tried segregating some of your system memory to make sure you aren't actually having application performance issues? I've had some luck with the kernelcore= and movablecore= (particularly movablecore=) kernel command-line options when dealing with IO-induced memory fragmentation. On problematic systems I'll try classifying at least 1/4 of the system RAM as movable (e.g. on the 8GiB laptop where I do some of my experimental work, I have movablecore=2G on the command line). Pages that get locked into memory are kept out of the movable-only region. This can have a profound (usually positive) effect on applications that want to spread out in memory. If you are running anything that likes large swaths of memory, this can help a lot, particularly if you are also running programs that traverse large swaths of disk. Some programs (rsync on large files may be one) can do much better if you've done this. (BUT DON'T OVERDO IT: enough is good but too much is very bad. 8-) )
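
For reference, on a GRUB-based system that change is just a boot-config fragment (the path is the usual Debian-style one, and 4G is an assumption: roughly 1/4 of your 16GiB):

```shell
# Check what the kernel booted with:
cat /proc/cmdline

# In /etc/default/grub, then run update-grub and reboot:
GRUB_CMDLINE_LINUX_DEFAULT="quiet movablecore=4G"
```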


ASIDE: Anything that uses hugepages, transparent or explicit, in any serious number has a tendency to antagonize the system cache (and vice versa). It's a silent fight of the cache-pressure sort. When you explicitly declare an amount of RAM for movable pages only, the disk cache will not grow into that space. So movablecore=3G creates 3GiB of space where only unlocked pages (malloced heap, stack, etc.; basically things that can be moved, particularly swapped) will go. The practical effect is that certain kinds of pressure never compete: broad-format disk I/O (e.g. using find) tends to stay on one side of the barrier while video playback buffers and virtual machines' RAM regions sit on the other. The broad and deep filesystem you describe could be thwarting your programs' attempts to access it. That is, rsync's need to load a large number of inodes could be starving rsync itself for memory. Keeping the disk cache at least partly out of your programs' space can prevent some very "interesting" contention models from ruining your day.

Or it could just make things worse.

So it's worth a try but it's not gospel. 8-)
