Hi,

So I dug into this a bit.  It appears that with XFS the fstrim command will 
ignore the provided "length" option once it hits a large contiguous block of 
free space and just keeps going until it reaches a non-empty block.  Most of my 
larger filesystems end up with 1TB XFS allocation groups, so the metadata at 
the start of the next allocation group is what finally stops the fstrim 
command at about the 1TB mark.  
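
Given that behavior, one way to keep each call bounded is to chunk the trim at 
allocation-group boundaries.  This is just a sketch using the sizes from my 2TB 
test FS (on a real system, AG bytes = agsize * blocksize as reported by 
`xfs_info <mountpoint>`); it only prints the commands rather than running them:

```shell
# Walk the filesystem one allocation group at a time so that no single
# fstrim call crosses an AG boundary.  512 GiB AGs on a 2 TiB FS are
# hypothetical values -- read the real ones from `xfs_info`.
AG_BYTES=$(( 512 * 1024 * 1024 * 1024 ))   # 512 GiB per allocation group
FS_BYTES=$(( 4 * AG_BYTES ))               # 2 TiB filesystem = 4 AGs
offset=0
while [ "$offset" -lt "$FS_BYTES" ]; do
    echo "fstrim -v -o $offset -l $AG_BYTES /data/bulk"
    offset=$(( offset + AG_BYTES ))
done
```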

I did capture an fstrim call with blktrace and attached the results. I did this 
test on a smaller 2TB FS where the allocation groups are 512GB.  I found an 
offset which hit a large contiguous block of empty space, so even though I only 
requested a length of 4GB it ended up trimming ~487GB.

# fstrim -v -o 549032275968 -l 4294967296 /data/bulk
/data/bulk: 487.3 GiB (523262476288 bytes) trimmed
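
As a quick sanity check on those numbers (assuming 4 MiB RBD objects, which I 
believe is the default):

```shell
# Convert the bytes reported by fstrim into integer GiB, and into an
# approximate number of RBD object discards (one per 4 MiB object).
BYTES=523262476288
GIB=$(( BYTES / 1024 / 1024 / 1024 ))
OBJECTS=$(( BYTES / (4 * 1024 * 1024) ))
echo "${GIB} GiB trimmed, ~${OBJECTS} object discards"
```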

Looking through the blktrace I see some CFQ-related activity, so maybe the 
scheduler is actually helping to reduce starvation for other processes?

These large fstrim runs can actually complete quite quickly (10-20 seconds for 
1TB), but they can also be quite slow if the FS is busy (a few minutes).

I have heard that the ATA "trim" command can cause many problems because it is 
not "queueable".  However, my understanding is that the SCSI "unmap" command 
does not have this shortcoming.  Could the virtio-scsi driver and/or librbd be 
handling these better?

Thanks for the help!

Brendan

________________________________________
From: Jason Dillaman [[email protected]]
Sent: Saturday, November 18, 2017 5:08 AM
To: Brendan Moloney
Cc: [email protected]
Subject: Re: [ceph-users] I/O stalls when doing fstrim on large RBD

Can you capture a blktrace while perform fstrim to record the discard
operations? A 1TB trim extent would cause a huge impact since it would
translate to approximately 262K IO requests to the OSDs (assuming 4MB
backing files).
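
That estimate follows directly from the assumed 4MB object size:

```shell
# 1 TiB of trimmed extent / 4 MiB per backing object
# = number of discard requests sent to the OSDs
echo $(( (1024 * 1024 * 1024 * 1024) / (4 * 1024 * 1024) ))   # 262144, i.e. ~262K
```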

On Fri, Nov 17, 2017 at 6:19 PM, Brendan Moloney <[email protected]> wrote:
> Hi,
>
> I guess this isn't strictly about Ceph, but I feel like other folks here
> must have run into the same issues.
>
> I am trying to keep my thinly provisioned RBD volumes thin.  I use
> virtio-scsi to attach the RBD volumes to my VMs with the "discard=unmap"
> option. The RBD is formatted as XFS and some of them can be quite large
> (16TB+).  I have a cron job that runs "fstrim" commands twice a week in the
> evenings.
>
> The issue is that I see massive I/O stalls on the VM during the fstrim.  To
> the point where I am getting kernel panics from hung tasks and other
> timeouts.  I have tried a number of things to lessen the impact:
>
>     - Switching from deadline to CFQ (initially I thought this helped, but
> now I am not convinced)
>     - Running fstrim with "ionice -c idle" (this doesn't seem to make a
> difference)
>     - Chunking the fstrim with the offset/length options (helps reduce worst
> case, but I can't trim less than 1TB at a time and that can still cause a
> pause for several minutes)
>
> Is there anything else I can do to avoid this issue?
>
> Thanks,
> Brendan
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



--
Jason

Attachment: fstrim_blktrace.tar.gz
Description: fstrim_blktrace.tar.gz
