On 12/16/2014 06:42 PM, Charles Cazabon wrote:
Hi,
I've been running btrfs for various filesystems for a few years now, and have
recently run into problems with a large filesystem becoming *really* slow for
basic reading. None of the debugging/testing suggestions I've come across in
the wiki or in the mailing list archives seems to have helped.
Background: this particular filesystem holds backups for various other
machines on the network, a mix of rdiff-backup data (so lots of small files)
and rsync copies of larger files (everything from ~5MB data files to ~60GB VM
HD images). There's roughly 16TB of data in this filesystem (the filesystem
is ~17TB). The btrfs filesystem is a simple single volume, no snapshots,
multiple devices, or anything like that. It's an LVM logical volume on top of
dmcrypt on top of an mdadm RAID set (8 disks in RAID 6).
The performance: trying to copy the data off this filesystem to another
(non-btrfs) filesystem with rsync or just cp was taking aaaages - I found one
suggestion that it could be because updating the atimes required a COW of the
metadata in btrfs, so I mounted the filesystem noatime, but this doesn't
appear to have made any difference. The speeds I'm seeing (with iotop)
fluctuate a lot. They spend most of the time in the range of 1-3 MB/s, with
large periods of time where no IO seems to happen at all, and occasional short
spikes to ~25-30 MB/s. System load seems to sit around 10-12 (with only 2
processes reported as running, everything else sleeping) while this happens.
The server is doing nothing other than this copy at the time. The only
processes using any noticeable CPU are rsync (source and destination processes,
around 3% CPU each, plus an md0:raid6 process around 2-3%), and a handful of
"kworker" processes, perhaps one per CPU (there are 8 physical cores in the
server, plus hyperthreading).
Other filesystems on the same physical disks have no trouble exceeding 100MB/s
reads. The machine is not swapping (16GB RAM, ~8GB swap with 0 swap used).
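(For reference, the kind of quick comparison I'm basing that on -- the
file paths here are just examples:

$ # cold-cache sequential read from the btrfs filesystem
$ dd if=/media/backup/some-large-file of=/dev/null bs=1M count=4096 iflag=direct
$ # same thing from one of the other filesystems on these disks
$ dd if=/other/fs/some-large-file of=/dev/null bs=1M count=4096 iflag=direct

dd reports the average throughput when it finishes.)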
Is there something obvious I'm missing here? Is there a reason I can only
average ~3MB/s reads from a btrfs filesystem?
Kernel is x86_64 linux-stable 3.17.6; btrfs-progs is v3.17.3-3-g8cb0438.
Output of the various info commands is:
$ sudo btrfs fi df /media/backup/
Data, single: total=16.24TiB, used=15.73TiB
System, DUP: total=8.00MiB, used=1.75MiB
System, single: total=4.00MiB, used=0.00
Metadata, DUP: total=35.50GiB, used=34.05GiB
Metadata, single: total=8.00MiB, used=0.00
unknown, single: total=512.00MiB, used=0.00
$ btrfs --version
Btrfs v3.17.3-3-g8cb0438
$ sudo btrfs fi show
Label: 'backup' uuid: c18dfd04-d931-4269-b999-e94df3b1918c
Total devices 1 FS bytes used 15.76TiB
devid 1 size 16.37TiB used 16.31TiB path /dev/mapper/vg-backup
Thanks in advance for any suggestions.
Charles
Totally spit-balling ideas here (e.g. no suggestion as to which one to
try first etc, just typing them as they come to me):
Have you tried increasing the number of stripe buffers for the
filesystem? If you've gotten things spread way out you might be
thrashing your stripe cache. (See /sys/block/mdN/md/stripe_cache_size,
where N is your md device number.)
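A sketch of what I mean, assuming md0 is your array (the 4096 value is
just a starting point to experiment with; the cache costs roughly
stripe_cache_size * 4KiB * number-of-member-disks of RAM):

$ # current value; the kernel default is 256
$ cat /sys/block/md0/md/stripe_cache_size
$ # try a much larger cache and re-run your read test
$ echo 4096 | sudo tee /sys/block/md0/md/stripe_cache_size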
Have you run SMART checks (smartmontools etc) against these disks to see
if any of them are reporting any sort of incipient failure conditions? If
one or more drives is reporting recoverable read errors it might just be
clogging you up.
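Something along these lines, against each raw member disk (device names
here are illustrative -- you want the underlying disks, not the md/dm
layers stacked on top of them):

$ # which disks are in the RAID set
$ cat /proc/mdstat
$ # overall health verdict
$ sudo smartctl -H /dev/sda
$ # attributes worth eyeballing: Reallocated_Sector_Ct,
$ # Current_Pending_Sector, Offline_Uncorrectable
$ sudo smartctl -A /dev/sda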
Try experimentally mounting the filesystem read-only and doing some read
tests. Eliminating all possible write sources will tell you something. In
particular, if all your reads just start breezing through then
you know something in the write path is "iffy". One thing that comes to
mind is that anything accessing the drive with a barrier-style operation
(wait for verification of data sync all the way to disk) would have to
pass all the way down through the encryption layer which could be having
a multiplier effect (you know, lots of very short delays adding up to a
large net delay).
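A minimal version of that experiment, using the device path from your
'btrfs fi show' output (the test file is whatever large file you have
handy):

$ sudo umount /media/backup
$ sudo mount -o ro,noatime /dev/mapper/vg-backup /media/backup
$ # drop the page cache so the read actually hits the disks
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ dd if=/media/backup/some-large-file of=/dev/null bs=1M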
Have you changed any hardware lately in a way that could de-optimize
your interrupt handling?
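A quick thing to eyeball is whether all the storage interrupts are
piling onto a single CPU:

$ # per-CPU interrupt counts; look at your SATA/SAS controller's lines
$ grep -iE 'ahci|sas|scsi' /proc/interrupts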
I have a vague recollection that somewhere in the last month and a half
or so there was a patch here (or in the kernel changelogs) about an
extra put operation (or something) that would cause a worker thread to
roll over to -1, then spin back down to zero before work could proceed.
I know, could I _be_ more vague? Right? Try switching to kernel 3.18.1
to see if the issue just goes away. (Honestly this one's just been
scratching at my brain since I started writing this reply and I just
_can't_ remember the reference for it... dangit...)
When was the last time you did any of the maintenance things (like
balance or defrag)? Not that I'd want to sit through 15TB of that sort
of thing, but I'm curious about the maintenance history.
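If the answer is "never", the usual incantations are something like the
following (the usage=50 filter only rewrites chunks that are at most
half full, so it avoids churning through all 15TB; the directory is
whichever part of the tree is worst):

$ # compact partially-filled data chunks
$ sudo btrfs balance start -dusage=50 /media/backup
$ # recursively defragment a subtree
$ sudo btrfs filesystem defragment -r /media/backup/some-directory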
Does the read performance fall off with uptime? E.g. is it "okay" right
after a system boot, and does it then fall off as uptime (and activity)
increase? I _imagine_ that if your filesystem is huge and your server is
modest by comparison in terms of RAM, cache pinning and fragmentation
can start becoming a real problem. What else besides marshaling this
filesystem is this system used for?
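You can watch for that without rebooting -- a couple of things I'd
sample as the throughput degrades:

$ # how much RAM the page cache and buffers are holding
$ grep -E '^(MemFree|Buffers|Cached)' /proc/meminfo
$ # free-memory fragmentation; few high-order pages = badly fragmented
$ cat /proc/buddyinfo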
Have you tried segregating some of your system memory to make sure
that you aren't actually having application performance issues? I've
had some luck with the kernelcore= and movablecore= (particularly
movablecore=) kernel command line options when dealing with IO-induced
fragmentation. On problematic systems I'll try classifying at least 1/4
of the system RAM as movable. (E.g. on my 8GiB laptop, where I do some
of my experimental work, I have movablecore=2G on the command line.)
Any pages that get locked into memory will be moved out of the
movable-only memory first. This can have a profound (usually positive)
effect on applications that want to spread out in memory. If you are
running anything that likes large swaths of memory then this can help a
lot. Particularly if you are also running programs that traverse large
swaths of disk. Some programs (rsync of large files, etc., may be such a
program) can do "much better" if you've done this. (BUT DON'T OVERDO IT,
enough is good but too much is very bad. 8-) ).
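For the record, the way it gets set (the 4G figure for your 16GB box is
just my quarter-of-RAM rule of thumb, and the grub details vary by
distro):

$ # add movablecore=4G to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
$ sudo update-grub
$ # after a reboot, confirm the kernel saw it
$ cat /proc/cmdline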
ASIDE: Anything that uses hugepages, transparent or explicit, in any
serious number has a tendency to antagonize the system cache (and
vice-versa). It's a silent fight of the cache-pressure sort. When you
explicitly declare an amount of RAM for movable pages only, the disk
cache will not grow into that space. So movablecore=3G creates 3GiB of
space where only unlocked pages (malloced heap, stack, etc. -- basically
only things that can get moved, particularly swapped -- will go in
that space). The practical effect is that certain kinds of pressures
will never compete. So broad-format disk I/O (e.g. using find etc) will
tend to be on one side of the barrier while video playback buffers and
virtual machine's ram regions are on the other. The broad and deep
filesystem you describe could be thwarting your program's attempt to
access it. That is, rsync's need to load a large number of inodes
could be starving rsync itself of memory. Keeping the disk cache out of
your program's space at least in part could prevent some very
"interesting" contention models from ruining your day.
Or it could just make things worse.
So it's worth a try but it's not gospel. 8-)
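Related to the hugepages aside above, a quick way to see whether
transparent hugepages are even in play on your box:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
$ # how much anonymous memory is THP-backed right now
$ grep AnonHugePages /proc/meminfo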