Many thanks for the analysis!

I'm going to test with 4K on a heavy MSSQL database to see whether I see an 
improvement in IOPS/latency.
I'll report results in this thread.


----- Original Message -----
From: "Trent Lloyd" <[email protected]>
To: "ceph-users" <[email protected]>
Sent: Friday, 10 May 2019 09:59:39
Subject: [ceph-users] Poor performance for 512b aligned "partial" writes from 
Windows guests in OpenStack + potential fix

I was recently investigating a performance problem for a reasonably sized 
OpenStack deployment with around 220 OSDs (3.5" 7200 RPM SAS HDDs) with NVMe 
journals. The primary workload is Windows guests backed by Cinder RBD volumes. 
This specific deployment runs Ceph Jewel (FileStore + SimpleMessenger) which, 
while EOL, exhibits an issue that is also reproducible on current versions and 
on BlueStore, though there for different reasons than on FileStore. 

Generally the Ceph cluster was suffering from very poor outlier performance. 
The numbers change a little depending on the exact situation, but roughly 80% 
of I/O completed in a "reasonable" 0-200ms, while 5-20% of operations took 
excessively long: anywhere from 500ms through to 10-20+ seconds. However the 
usual commit and apply latency metrics were normal, and in fact this latency 
was hard to spot in the performance metrics available in Jewel. 

Previously I had a simpler model of FileStore: a "commit" (to journal) stage, 
where the data is written to the journal and it is OK to return to the client, 
and an "apply" (to disk) stage, where the data is flushed to disk and confirmed 
so that it can be purged from the journal. However there is really a third 
stage in the middle, where FileStore submits the I/O to the operating system, 
and this happens before the lock on the object is released. Until that 
submission succeeds, no other operation can write to the same object (generally 
a 4MB area of the disk). 
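
To make the effect of that middle stage concrete, here is a toy model (my own 
sketch, not FileStore code; compile with cc -pthread) of two operations against 
the same object. Because the per-object lock is held across the submission to 
the OS, the second operation waits out the first one's blocked pwritev() even 
though the journal commit already acknowledged it to the client:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* One lock standing in for the per-object lock. */
static pthread_mutex_t object_lock = PTHREAD_MUTEX_INITIALIZER;

static void *fstore_op(void *arg)
{
    long id = (long)arg;
    pthread_mutex_lock(&object_lock);   /* taken before the write is submitted */
    printf("op %ld: journal committed, submitting to filesystem\n", id);
    usleep(500 * 1000);                 /* stands in for a pwritev() that blocks */
    pthread_mutex_unlock(&object_lock); /* only now may the next op touch the object */
    printf("op %ld: submitted\n", id);
    return NULL;
}

int main(void)
{
    /* Two writers to the same object: the second waits the full
       length of the first op's blocked pwritev(). */
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, fstore_op, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}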

I found that the fstore_op threads would get stuck for hundreds of milliseconds 
or more inside of pwritev(), blocking inside the kernel. Normally we expect 
pwritev() to be a buffered write into the page cache and to return quite fast, 
however in this case the kernel was, in a few percent of cases, blocking with 
the stack trace included at the end of this e-mail [1]. My finding from that 
stack is that inside __block_write_begin_int there is a call to 
out_of_line_wait_on_bit, which is really an inlined call to wait_on_buffer; 
this occurs in linux/fs/buffer.c in the section around lines 2000-2024, under 
the comment "If we issued read requests - let them complete." 
(https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002)

My interpretation of that code is that for Linux to store a write in the page 
cache, it needs the entire 4K page, as that is the granularity at which it 
tracks dirty state and the unit it later submits back to the disk. Since we 
wrote only part of the page, and the page wasn't already in the cache, it has 
to fetch the remainder of the page from the disk. When this happens, it blocks 
waiting for that read to complete before returning from the pwritev() call - 
hence our normally buffered write blocks. This holds up the tp_fstore_op 
thread, of which there are (by default) only 2-4 trying to process several 
hundred operations per second. Additionally the size of the osd_op_queue is 
bounded, and operations do not clear out of this queue until the tp_fstore_op 
thread is done. Ultimately this means that not only are these partial writes 
delayed, but they knock on to delay other writes queued behind them because of 
the constrained thread pools. 
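
To illustrate the page-cache behavior in isolation (a standalone sketch, not 
Ceph code; the filename and sizes are arbitrary), the following program times a 
512-byte write into a page that has just been evicted versus a full 
page-aligned 4 KiB write. On a filesystem with 4k blocks backed by a slow disk, 
the first should take roughly one disk read while the second returns almost 
immediately:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double elapsed_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000.0 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    char buf[4096];
    memset(buf, 0xab, sizeof(buf));

    int fd = open("testfile", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Populate 4 MiB, flush it, then evict it from the page cache -
       the same hint FileStore issues on objects after writing. */
    for (off_t off = 0; off < 4 << 20; off += sizeof(buf))
        pwrite(fd, buf, sizeof(buf), off);
    fsync(fd);
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    struct timespec t0, t1;

    /* 512-byte write into an evicted page: the kernel must read the
       other 3.5 KiB of the page from disk before pwrite() returns. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pwrite(fd, buf, 512, 512);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("partial 512B write: %.3f ms\n", elapsed_ms(t0, t1));

    fsync(fd);
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    /* Full page-aligned 4 KiB write: no read needed, returns from
       the page cache almost immediately. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pwrite(fd, buf, 4096, 4096);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("aligned 4KiB write: %.3f ms\n", elapsed_ms(t0, t1));

    close(fd);
    return 0;
}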

What was further confusing is that I could easily reproduce this in a test 
deployment using an rbd benchmark writing to a total disk size of only 256MB, 
which I would have expected to fit easily in the page cache: 

rbd create -p rbd --size=256M bench2 
rbd bench-write -p rbd bench2 --io-size 512 --io-threads 256 --io-total 256M --io-pattern rand 
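
One simple way to see the read-modify-write happening during this pure-write 
benchmark is to watch the OSD's backing device while it runs; any sustained 
read traffic during a write-only workload is the kernel fetching the rest of 
each page (the device name below is just an example):

iostat -x 1 /dev/sdb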

This is explained by the fact that on secondary OSDs (at least; there was some 
refactoring of fadvise which I have not yet fully understood), FileStore issues 
fadvise(DONTNEED) on objects after writing them, which causes the kernel to 
immediately discard them from the page cache without any regard to how recently 
or frequently they were used. The motivation for this appears to be that a 
secondary OSD services only writes, not reads, and therefore we can optimize 
memory usage by throwing away this object, in theory leaving more room in the 
page cache for objects we are primary for and expect to actually service client 
reads from. Unfortunately this behavior does not take partial writes into 
account: we now pathologically throw away the cached copy instantly, such that 
a write even 1 second later has to fetch the page from disk again. I also found 
that this FADVISE_DONTNEED is issued not only during filestore sync but also by 
the WBThrottle - which, as this cluster was quite busy, was constantly flushing 
writes, leading to the cache being discarded almost instantly. 

Changing filestore_fadvise to false on this cluster led to a significant 
performance increase, as the pages could now be cached in memory in many cases. 
The number of reads from disk was reduced from around 40/second to 2/second, 
and the number of slow (>200ms) write operations was reduced by 75%. 
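
For anyone wanting to try the same change: set the option in ceph.conf on the 
OSD hosts and restart the OSDs, or inject it at runtime (whether runtime 
injection fully takes effect may depend on the version, so treat the restart as 
the reliable path):

[osd]
filestore fadvise = false

ceph tell osd.* injectargs '--filestore_fadvise=false'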

I wrote a script to parse ceph-osd logs with debug_filestore=10 or 15 to report 
the time spent inside of write(), as well as to count and report on the number 
of operations that are unaligned and also slow. It's a bit rough, but you can 
find it here: 
https://github.com/lathiat/ceph-tools/blob/master/fstore_op_latency.rb

It does not solve the problem entirely, in that a filestore thread can still 
block when the page is not cached - but at least the pathological case of the 
page never being in the cache is removed. With this problem understood, I 
looked at the situation for BlueStore. BlueStore suffers from a similar issue 
in that performance for these writes is quite poor, due both to fadvise and to 
the fact that it checksums data in 4k blocks and so needs to read the rest of 
the block in, despite not having the limitations of the Linux page cache to 
deal with. I have not yet fully investigated the BlueStore implementation, 
other than to note the following doc describing how such writes are handled and 
a possible future improvement - submitting partial writes into the WAL before 
reading the rest of the block - which is apparently not done currently (and 
would be a great optimization): 
http://docs.ceph.com/docs/mimic/dev/bluestore/


Moving on to a fuller solution for this issue: we can tell Windows guests to 
send 4k-aligned I/O where possible by setting the physical_block_size hint on 
the disk. This support was added mainly for the new generation of hard drives 
which have 4k sectors internally and need to do a similar read-modify-write 
when a smaller write is issued. With this hint set, Windows tries to align I/O 
to 4k as much as possible; at the most basic level, for example, when a new 
file is created it pads the write out to the nearest 4k boundary. You can read 
more about Windows support for 4k-sector drives here: 
https://support.microsoft.com/en-au/help/2510009/microsoft-support-policy-for-4k-sector-hard-drives-in-windows
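
For anyone wanting to experiment before any OpenStack plumbing is in place, the 
libvirt equivalent of this hint is the <blockio> element on the guest's disk 
definition, which can be set by hand on a test instance (the rest of the disk 
definition is elided here):

<disk type='network' device='disk'>
  <!-- driver/source/target elements elided -->
  <blockio logical_block_size='512' physical_block_size='4096'/>
</disk>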

On a basic test, booting a Windows 2016 instance and then installing several 
months of Windows Updates, the number of partial writes was reduced from 23% 
(753090 / 3229597) to 1.8% (54535 / 2880217) - and many of the remainder were 
during early boot and do not recur once the VM is running. 
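
You can confirm from inside the guest that Windows has picked up the hint; with 
it in place, "Bytes Per Physical Sector" in the output below should read 4096:

fsutil fsinfo ntfsinfo C: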

I have submitted a patch to the OpenStack Cinder RBD driver to support setting 
this parameter. You can find that here: 
https://review.opendev.org/#/c/658283/

I did not have much luck finding information about any of this online when I 
searched, so this e-mail serves largely to document my findings for others. But 
I am also looking for input from anyone: anything I have missed, confirmation 
that my analysis is sound, review of my Cinder patch, etc. 

There is also likely scope to make the same change - reporting 
physical_block_size=4096 - in other Ceph consumers such as the new(ish) iSCSI 
gateway. 

Regards, 
Trent 


[1] fstore_op pwritev blocking stack trace - if anyone is interested in the 
perf data, flamegraph, etc - I'd be happy to share. 

tp_fstore_op 
ceph::buffer::list::write_fd 
pwritev64 
entry_SYSCALL_64_after_hwframe 
do_syscall_64 
sys_pwritev 
do_pwritev 
vfs_writev 
do_iter_write 
do_iter_readv_writev 
xfs_file_write_iter 
xfs_file_buffered_aio_write 
iomap_file_buffered_write 
iomap_apply 
iomap_write_actor 
iomap_write_begin.constprop.18 
__block_write_begin_int 
out_of_line_wait_on_bit 
__wait_on_bit 
bit_wait_io 
io_schedule 
schedule 
__schedule 
finish_task_switch 
