Re: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

2019-05-16 Thread Marc Roos


Hmmm, looks like diskpart is off; it reports the same for a volume for 
which fsutil fsinfo ntfsinfo c: reports 512 (in this case correct, because 
it is on an SSD).
Does anyone know how to use fsutil with a disk mounted on a path (without 
a drive letter)?
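
(Untested suggestion: fsutil may also accept a volume GUID path, which 
mountvol lists for folder-mounted volumes, so something along these lines 
might work:)

REM list volume GUID paths and where each volume is mounted
mountvol

REM point fsutil at the volume GUID path instead of a drive letter (untested)
fsutil fsinfo ntfsinfo \\?\Volume{xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}\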


-Original Message-
From: Marc Roos 
Sent: Thursday 16 May 2019 13:46
To: aderumier; trent.lloyd
Cc: ceph-users
Subject: Re: [ceph-users] Poor performance for 512b aligned "partial" 
writes from Windows guests in OpenStack + potential fix


I am not sure if it is possible to run fsutil on a disk that has no drive 
letter but is mounted on a path. 
So I used:
diskpart
select volume 3
Filesystems

And gives me this: 
Current File System

  Type : NTFS
  Allocation Unit Size : 4096
  Flags : 

File Systems Supported for Formatting

  Type : NTFS (Default)
  Allocation Unit Sizes: 512, 1024, 2048, 4096 (Default), 8192, 16K, 
32K, 64K

  Type : FAT32
  Allocation Unit Sizes: 4096, 8192 (Default), 16K, 32K, 64K

  Type : REFS
  Allocation Unit Sizes: 4096 (Default), 64K

So it looks like it detects 4k correctly? But I do not have the 
<blockio physical_block_size='4096'/> element in the libvirt disk config, 
and I have the WD with 512e:

[@c01 ~]# smartctl -a /dev/sdb | grep 'Sector Size'
Sector Sizes: 512 bytes logical, 4096 bytes physical
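
(A quick way to see the same logical/physical sector sizes for all drives 
at once, with a reasonably recent util-linux:)

lsblk -d -o NAME,LOG-SEC,PHY-SEC,MODEL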

CentOS Linux release 7.6.1810 (Core)
ceph version 12.2.12
libvirt-4.5.0



-Original Message-
From: Trent Lloyd [mailto:trent.ll...@canonical.com]
Sent: Thursday 16 May 2019 9:57
To: Alexandre DERUMIER
Cc: ceph-users
Subject: Re: [ceph-users] Poor performance for 512b aligned "partial" 
writes from Windows guests in OpenStack + potential fix

For libvirt VMs, first you need to add "<blockio physical_block_size='4096'/>" 
to the relevant <disk> sections, and then stop/start the VM to apply the 
change.
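
(A minimal sketch of such a disk stanza, for reference. Element and attribute 
names follow the libvirt domain XML format; the pool/image, monitor host and 
target device below are placeholders. It can be added with virsh edit, 
followed by a full stop/start:)

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
  <source protocol='rbd' name='volumes/volume-0000'>   <!-- placeholder pool/image -->
    <host name='ceph-mon1' port='6789'/>               <!-- placeholder monitor -->
  </source>
  <blockio logical_block_size='512' physical_block_size='4096'/>
  <target dev='vdb' bus='virtio'/>
</disk>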

Then you need to make sure your VirtIO drivers (the Fedora/Red Hat 
variety anyway) are from late 2018 or so. There was a bug fixed around 
July 2018; before that date, the physical_block_size=4096 parameter was 
not used by the Windows VirtIO driver (it was supposed to be, but did 
not work).

Relevant links:
https://bugzilla.redhat.com/show_bug.cgi?id=1428641
https://github.com/virtio-win/kvm-guest-drivers-windows/pull/312 

After that, you can check if Windows is correctly recognizing the 
physical block size,

Start cmd.exe with "Run as administrator", then run fsutil fsinfo 
ntfsinfo c:

It should show "Bytes Per Physical Sector : 4096"
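
(Abridged example output for reference; field names as fsutil prints them, 
values illustrative:)

C:\> fsutil fsinfo ntfsinfo c:
...
Bytes Per Sector  :               512
Bytes Per Physical Sector :       4096
Bytes Per Cluster :               4096
...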



Lastly, at least for Windows itself, this makes it do 4096-byte writes 
"most of the time"; however, some applications, including Exchange, have 
special handling of the sector size. I'm not really sure how MSSQL 
handles it; for example, it may or may not work correctly if you switch 
to 4096 bytes after installation - you may have to create new data files 
or something for it to do 4k segments - or not. Hopefully the MSSQL 
documentation has some information about that.

It is also possible to set logical_block_size=4096 as well as
physical_block_size=4096 ("4k native"); however, this absolutely causes 
problems with some software (e.g. Exchange) if you convert an existing 
installation between the two. If you try to use 4k native mode, ideally 
you would want to do a fresh install, to avoid any such issues. Or 
again, refer to the docs and test it. Just beware it may cause issues if 
you try to switch to 4k native.

As a final note, you can use this tool to process an OSD log with "debug 
filestore = 10" enabled; it will print out how many of the operations 
were unaligned:
https://github.com/lathiat/ceph-tools/blob/master/fstore_op_latency.rb


You can just enable debug filestore = 10 dynamically on 1 OSD for about 
5 minutes, turn it off, and process the log. And you could compare 
before/after. Unfortunately I haven't written an equivalent tool for 
BlueStore, if you are already in the modern world :) I also didn't check 
whether "debug osd" or something similar also logs the writes and offsets, 
which would allow a generic tool covering both cases, but I have not done 
that either.
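
(A concrete sketch of that workflow; the OSD id, log path and the exact 
script invocation are assumptions here, but injectargs is the usual way to 
flip debug levels at runtime:)

# raise FileStore debugging on one OSD for ~5 minutes, then lower it again
ceph tell osd.12 injectargs '--debug_filestore 10'
sleep 300
ceph tell osd.12 injectargs '--debug_filestore 1'

# process the resulting log with the script linked above
ruby fstore_op_latency.rb /var/log/ceph/ceph-osd.12.log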



Hope that helps.

Regards,
Trent

On Thu, 16 May 2019 at 14:52, Alexandre DERUMIER 
wrote:


Many thanks for the analysis !


I'm going to test with 4K on heavy mssql database to see if I'm 
seeing improvement on ios/latency.
I'll report results in this thread.


- Original Message -
From: "Trent Lloyd" 
To: "ceph-users" 
Sent: Friday 10 May 2019 09:59:39
Subject: [ceph-users] Poor performance for 512b aligned "partial" 
writes from Windows guests in OpenStack + potential fix

I recently was investigating a performance problem for a reasonably 
sized OpenStack deployment having around 220 OSDs (3.5" 7200 RPM SAS 
HDD) with NVMe Journals. The primary workload is Windows guests backed 
by Cinder RBD volumes. 

Re: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

2019-05-16 Thread Marc Roos


I am not sure if it is possible to run fsutil on disk without drive 
letter, but mounted on path. 
So I used:
diskpart
select volume 3
Filesystems

And gives me this: 
Current File System

  Type : NTFS
  Allocation Unit Size : 4096
  Flags : 

File Systems Supported for Formatting

  Type : NTFS (Default)
  Allocation Unit Sizes: 512, 1024, 2048, 4096 (Default), 8192, 16K, 
32K, 64K

  Type : FAT32
  Allocation Unit Sizes: 4096, 8192 (Default), 16K, 32K, 64K

  Type : REFS
  Allocation Unit Sizes: 4096 (Default), 64K

So it looks like it detects 4k correctly? But I do not have the 
<blockio physical_block_size='4096'/> element in the libvirt disk config, 
and I have the WD with 512e:

[@c01 ~]# smartctl -a /dev/sdb | grep 'Sector Size'
Sector Sizes: 512 bytes logical, 4096 bytes physical

CentOS Linux release 7.6.1810 (Core)
ceph version 12.2.12
libvirt-4.5.0



-Original Message-
From: Trent Lloyd [mailto:trent.ll...@canonical.com] 
Sent: Thursday 16 May 2019 9:57
To: Alexandre DERUMIER
Cc: ceph-users
Subject: Re: [ceph-users] Poor performance for 512b aligned "partial" 
writes from Windows guests in OpenStack + potential fix

For libvirt VMs, first you need to add "<blockio physical_block_size='4096'/>" 
to the relevant <disk> sections, and then stop/start the VM to apply the 
change.

Then you need to make sure your VirtIO drivers (the Fedora/Red Hat 
variety anyway) are from late 2018 or so. There was a bug fixed around 
July 2018; before that date, the physical_block_size=4096 parameter was 
not used by the Windows VirtIO driver (it was supposed to be, but did 
not work).

Relevant links:
https://bugzilla.redhat.com/show_bug.cgi?id=1428641
https://github.com/virtio-win/kvm-guest-drivers-windows/pull/312 

After that, you can check if Windows is correctly recognizing the 
physical block size,

Start cmd.exe with "Run as administrator", then run fsutil fsinfo 
ntfsinfo c:

It should show "Bytes Per Physical Sector : 4096"



Lastly at least for Windows itself this makes it do 4096-byte writes 
"most of the time", however some applications including Exchange have 
special handling of the sector size. I'm not really sure how MSSQL 
handles it, for example, it may or may not work correctly if you switch 
to 4096 bytes after installation - you may have to create new data files 
or something for it to do 4k segments - or not. Hopefully the MSSQL 
documentation has some information about that.

It is also possible to set logical_block_size=4096 as well as 
physical_block_size=4096 ("4k native") however this absolutely causes 
problems with some software (e.g. exchange) if you convert an existing 
installation between the two. If you try to use 4k native mode, ideally 
you would want to do a fresh install, to avoid any such issues. Or 
again, refer to the docs and test it. Just beware it may cause issues if 
you try to switch to 4k native.

As a final note you can use this tool to process an OSD log with "debug 
filestore = 10" enabled, it will print out how many of the operations 
were unaligned:
https://github.com/lathiat/ceph-tools/blob/master/fstore_op_latency.rb


You can just enable debug filestore = 10 dynamically on 1 OSD for about 
5 minutes, turn it off, and process the log. And you could compare 
before/after. I haven't written an equivalent tool for BlueStore 
unfortunately if you are already in the modern world :) I also didn't 
check maybe debug osd or something also has the writes and offsets, so I 
could write a generic tool to cover both cases, but also I have not done 
that.



Hope that helps.

Regards,
Trent

On Thu, 16 May 2019 at 14:52, Alexandre DERUMIER  
wrote:


Many thanks for the analysis !


I'm going to test with 4K on heavy mssql database to see if I'm 
seeing improvement on ios/latency.
I'll report results in this thread.


- Original Message -
From: "Trent Lloyd" 
To: "ceph-users" 
Sent: Friday 10 May 2019 09:59:39
Subject: [ceph-users] Poor performance for 512b aligned "partial" 
writes from Windows guests in OpenStack + potential fix

I recently was investigating a performance problem for a reasonably 
sized OpenStack deployment having around 220 OSDs (3.5" 7200 RPM SAS 
HDD) with NVMe Journals. The primary workload is Windows guests backed 
by Cinder RBD volumes. 
This specific deployment is Ceph Jewel (FileStore + 
SimpleMessenger) which while it is EOL, the issue is reproducible on 
current versions and also on BlueStore however for different reasons 
than FileStore. 

Generally the Ceph cluster was suffering from very poor outlier 
performance, the numbers change a little bit depending on the exact 
situation but roughly 80% of I/O was happening in a "reasonable" time of 
0-200ms but 5-20% of I/O operations were taking excessively

Re: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

2019-05-16 Thread Trent Lloyd
For libvirt VMs, first you need to add "<blockio physical_block_size='4096'/>"
to the relevant <disk> sections, and then stop/start the VM to apply the change.

Then you need to make sure your VirtIO drivers (the Fedora/Red Hat variety
anyway) are from late 2018 or so. There was a bug fixed around July 2018;
before that date, the physical_block_size=4096 parameter was not used by the
Windows VirtIO driver (it was supposed to be, but did not work).

Relevant links:
https://bugzilla.redhat.com/show_bug.cgi?id=1428641
https://github.com/virtio-win/kvm-guest-drivers-windows/pull/312

After that, you can check if Windows is correctly recognizing the physical
block size,

Start cmd.exe with "Run as administrator", then run
fsutil fsinfo ntfsinfo c:

It should show "Bytes Per Physical Sector : 4096"


Lastly at least for Windows itself this makes it do 4096-byte writes "most
of the time", however some applications including Exchange have special
handling of the sector size. I'm not really sure how MSSQL handles it, for
example, it may or may not work correctly if you switch to 4096 bytes after
installation - you may have to create new data files or something for it to
do 4k segments - or not. Hopefully the MSSQL documentation has some
information about that.

It is also possible to set logical_block_size=4096 as well as
physical_block_size=4096 ("4k native") however this absolutely causes
problems with some software (e.g. exchange) if you convert an existing
installation between the two. If you try to use 4k native mode, ideally you
would want to do a fresh install, to avoid any such issues. Or again, refer
to the docs and test it. Just beware it may cause issues if you try to
switch to 4k native.

As a final note you can use this tool to process an OSD log with "debug
filestore = 10" enabled, it will print out how many of the operations were
unaligned:
https://github.com/lathiat/ceph-tools/blob/master/fstore_op_latency.rb

You can just enable debug filestore = 10 dynamically on 1 OSD for about 5
minutes, turn it off, and process the log. And you could compare
before/after. I haven't written an equivalent tool for BlueStore
unfortunately if you are already in the modern world :) I also didn't check
maybe debug osd or something also has the writes and offsets, so I could
write a generic tool to cover both cases, but also I have not done that.


Hope that helps.

Regards,
Trent

On Thu, 16 May 2019 at 14:52, Alexandre DERUMIER 
wrote:

> Many thanks for the analysis !
>
>
> I'm going to test with 4K on heavy mssql database to see if I'm seeing
> improvement on ios/latency.
> I'll report results in this thread.
>
>
> - Original Message -
> From: "Trent Lloyd" 
> To: "ceph-users" 
> Sent: Friday 10 May 2019 09:59:39
> Subject: [ceph-users] Poor performance for 512b aligned "partial" writes
> from Windows guests in OpenStack + potential fix
>
> I recently was investigating a performance problem for a reasonably sized
> OpenStack deployment having around 220 OSDs (3.5" 7200 RPM SAS HDD) with
> NVMe Journals. The primary workload is Windows guests backed by Cinder RBD
> volumes.
> This specific deployment is Ceph Jewel (FileStore + SimpleMessenger) which
> while it is EOL, the issue is reproducible on current versions and also on
> BlueStore however for different reasons than FileStore.
>
> Generally the Ceph cluster was suffering from very poor outlier
> performance, the numbers change a little bit depending on the exact
> situation but roughly 80% of I/O was happening in a "reasonable" time of
> 0-200ms but 5-20% of I/O operations were taking excessively long anywhere
> from 500ms through to 10-20+ seconds. However the normal metrics for commit
> and apply latency were normal, and in fact, this latency was hard to spot
> in the performance metrics available in jewel.
>
> Previously I more simply considered FileStore to have the "commit" (to
> journal) stage where it was written to the journal and it is OK to return
> to the client and then the "apply" (to disk) stage where it was flushed to
> disk and confirmed so that the data could be purged from the journal.
> However there is really a third stage in the middle where FileStore submits
> the I/O to the operating system and this is done before the lock on the
> object is released. Until that succeeds another operation cannot write to
> the same object (generally being a 4MB area of the disk).
>
> I found that the fstore_op threads would get stuck for hundreds of MS or
> more inside of pwritev() which was blocking inside of the kernel. Normally
> we expect pwritev() to be buffered I/O into the page cache and return quite
> fast however in this case the kernel was in a few percent of cases blocking
> with the stack trace included at the end 

Re: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

2019-05-16 Thread Alexandre DERUMIER
Many thanks for the analysis !


I'm going to test with 4K on a heavy MSSQL database to see if I see an 
improvement in IOs/latency.
I'll report results in this thread.


- Original Message -
From: "Trent Lloyd" 
To: "ceph-users" 
Sent: Friday 10 May 2019 09:59:39
Subject: [ceph-users] Poor performance for 512b aligned "partial" writes from 
Windows guests in OpenStack + potential fix

I recently was investigating a performance problem for a reasonably sized 
OpenStack deployment having around 220 OSDs (3.5" 7200 RPM SAS HDD) with NVMe 
Journals. The primary workload is Windows guests backed by Cinder RBD volumes. 
This specific deployment is Ceph Jewel (FileStore + SimpleMessenger) which 
while it is EOL, the issue is reproducible on current versions and also on 
BlueStore however for different reasons than FileStore. 

Generally the Ceph cluster was suffering from very poor outlier performance, 
the numbers change a little bit depending on the exact situation but roughly 
80% of I/O was happening in a "reasonable" time of 0-200ms but 5-20% of I/O 
operations were taking excessively long anywhere from 500ms through to 10-20+ 
seconds. However the normal metrics for commit and apply latency were normal, 
and in fact, this latency was hard to spot in the performance metrics available 
in jewel. 

Previously I more simply considered FileStore to have the "commit" (to journal) 
stage where it was written to the journal and it is OK to return to the client 
and then the "apply" (to disk) stage where it was flushed to disk and confirmed 
so that the data could be purged from the journal. However there is really a 
third stage in the middle where FileStore submits the I/O to the operating 
system and this is done before the lock on the object is released. Until that 
succeeds another operation cannot write to the same object (generally being a 
4MB area of the disk). 

I found that the fstore_op threads would get stuck for hundreds of MS or more 
inside of pwritev() which was blocking inside of the kernel. Normally we expect 
pwritev() to be buffered I/O into the page cache and return quite fast however 
in this case the kernel was in a few percent of cases blocking with the stack 
trace included at the end of the e-mail [1]. My finding from that stack is that 
inside __block_write_begin_int we see a call to out_of_line_wait_on_bit call 
which is really an inlined call for wait_on_buffer which occurs in 
linux/fs/buffer.c in the section around line 2000-2024 with the comment "If we 
issued read requests - let them complete." 
(https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002) 

My interpretation of that code is that for Linux to store a write in the page 
cache, it has to have the entire 4K page as that is the granularity of which it 
tracks the dirty state and it needs the entire 4K page to later submit back to 
the disk. Since we wrote a part of the page, and the page wasn't already in the 
cache, it has to fetch the remainder of the page from the disk. When this 
happens, it blocks waiting for this read to complete before returning from the 
pwritev() call - hence our normally buffered write blocks. This holds up the 
tp_fstore_op thread, of which there are (by default) only 2-4 such threads 
trying to process several hundred operations per second. Additionally the size 
of the osd_op_queue is bounded, and operations do not clear out of this queue 
until the tp_fstore_op thread is done. Which ultimately means that not only are 
these partial writes delayed but it knocks on to delay other writes behind them 
because of the constrained thread pools. 

What was further confusing to this, is that I could easily reproduce this in a 
test deployment using an rbd benchmark that was only writing to a total disk 
size of 256MB which I would easily have expected to fit in the page cache: 
rbd create -p rbd --size=256M bench2 
rbd bench-write -p rbd bench2 --io-size 512 --io-threads 256 --io-total 256M 
--io-pattern rand 

This is explained by the fact that on secondary OSDs (at least, there was some 
refactoring of fadvise which I have not fully understood as of yet), FileStore 
is using fadvise FADVISE_DONTNEED on the objects after write which causes the 
kernel to immediately discard them from the page cache without any regard to 
their statistics of being recently/frequently used. The motivation for this 
addition appears to be that on a secondary OSD we don't service reads (only 
writes) and so therefore we can optimize memory usage by throwing away this 
object and in theory leaving more room in the page cache for objects which we 
are primary for and expect to actually service reads from a client for. 
Unfortunately this behavior does not take into account partial writes, where we

Re: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

2019-05-10 Thread Martin Verges
yes, we recommend this as a precaution to get the best possible IO
performance for all workloads and usage scenarios. 512e doesn't bring any
advantage and in some cases can mean a performance disadvantage. By the
way, 4kN and 512e cost exactly the same at our dealers.

Whether this really makes a difference for virtual disks in an individual
case, given the underlying physical disks, I can't say.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Fri, 10 May 2019 at 10:54, Trent Lloyd <trent.ll...@canonical.com> wrote:

> Note that the issue I am talking about here is how a "Virtual" Ceph RBD
> disk is presented to a virtual guest, and specifically for Windows guests
> (Linux guests are not affected). I am not at all talking about how the
> physical disks are presented to Ceph itself (although Martin was, he wasn't
> clear whether changing these underlying physical disks to 4kn was for Ceph
> or other environments).
>
> I would not expect that having your underlying physical disk presented to
> Ceph itself as 512b/512e or 4kn to have a significant impact on performance
> for the reason that Linux systems generally send 4k-aligned I/O anyway
> (regardless of what the underlying disk is reporting for
> physical_block_size). There may be some exceptions to that, such as
> applications performing Direct I/O to the disk. If anyone knows otherwise,
> it would be great to hear specific details.
>
> Regards,
> Trent
>
> On Fri, May 10, 2019 at 4:40 PM Marc Roos 
> wrote:
>
>>
>> Hmmm, so if I have (wd) drives that list this in smartctl output, I
>> should try and reformat them to 4k, which will give me better
>> performance?
>>
>> Sector Sizes: 512 bytes logical, 4096 bytes physical
>>
>> Do you have a link to this download? Can only find some .cz site with
>> the rpms.
>>
>>
>> -Original Message-----
>> From: Martin Verges [mailto:martin.ver...@croit.io]
>> Sent: Friday 10 May 2019 10:21
>> To: Trent Lloyd
>> Cc: ceph-users
>> Subject: Re: [ceph-users] Poor performance for 512b aligned "partial"
>> writes from Windows guests in OpenStack + potential fix
>>
>> Hello Trent,
>>
>> many thanks for the insights. We always suggest to use 4kN over 512e
>> HDDs to our users.
>>
>> As we recently found out, is that WD Support offers a tool called HUGO
>> to reformat 512e to 4kN drives with "hugo format -m  -n
>> max --fastformat -b 4096" in seconds.
>> Maybe that helps someone that has bought the wrong disk.
>>
>> --
>> Martin Verges
>> Managing director
>>
>> Mobile: +49 174 9335695
>> E-Mail: martin.ver...@croit.io
>> Chat: https://t.me/MartinVerges
>>
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492 Com. register: Amtsgericht
>> Munich HRB 231263
>>
>> Web: https://croit.io
>> YouTube: https://goo.gl/PGE1Bx
>>
>>
>>
>> On Fri, 10 May 2019 at 10:00, Trent Lloyd wrote:
>>
>>
>> I recently was investigating a performance problem for a
>> reasonably
>> sized OpenStack deployment having around 220 OSDs (3.5" 7200 RPM SAS
>> HDD) with NVMe Journals. The primary workload is Windows guests backed
>> by Cinder RBD volumes.
>> This specific deployment is Ceph Jewel (FileStore +
>> SimpleMessenger) which while it is EOL, the issue is reproducible on
>> current versions and also on BlueStore however for different reasons
>> than FileStore.
>>
>>
>> Generally the Ceph cluster was suffering from very poor outlier
>> performance, the numbers change a little bit depending on the exact
>> situation but roughly 80% of I/O was happening in a "reasonable" time of
>> 0-200ms but 5-20% of I/O operations were taking excessively long
>> anywhere from 500ms through to 10-20+ seconds. However the normal
>> metrics for commit and apply latency were normal, and in fact, this
>> latency was hard to spot in the performance metrics available in jewel.
>>
>> Previously I more simply considered FileStore to have the
>> "commit"
>> (to journal) stage where it was written to the journal and it is OK to
>> return to the client and then the "apply" (to disk) stage where it was
>> flushed to disk and confirmed so that the data could

Re: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

2019-05-10 Thread Martin Verges
Hello,

I'm not yet sure if I'm allowed to share the files, but if you find one of
those, you can verify the md5sum.

27d2223d66027d8e989fc07efb2df514  hugo-6.8.0.i386.deb.zip
b7db78c3927ef3d53eb2113a4e369906  hugo-6.8.0.i386.rpm.zip
9a53ed8e201298de6da7ac6a7fd9dba0  hugo-6.8.0.i386.tar.gz.zip
2deaa31186adb36b92016a252b996e70  HUGO-6.8.0.win32.zip
cd031ca8bf47b8976035d08125a2c591  HUGO-6.8.0.win64.zip
b9d90bb70415c4c5ec29dc04180c65a8  HUGO-6.8.0.winArm64.zip
6d4fc696de0b0f95b54fccdb096e634f  hugo-6.8.0.x86_64.deb.zip
12f8e39dc3cdd6c03e4eb3809a37ce65  hugo-6.8.0.x86_64.rpm.zip
545527fbb28af0c0ff4611fa20be0460  hugo-6.8.0.x86_64.tar.gz.zip

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Fri, 10 May 2019 at 10:40, Marc Roos <m.r...@f1-outsourcing.eu> wrote:

>
> Hmmm, so if I have (wd) drives that list this in smartctl output, I
> should try and reformat them to 4k, which will give me better
> performance?
>
> Sector Sizes: 512 bytes logical, 4096 bytes physical
>
> Do you have a link to this download? Can only find some .cz site with
> the rpms.
>
>
> -Original Message-
> From: Martin Verges [mailto:martin.ver...@croit.io]
> Sent: Friday 10 May 2019 10:21
> To: Trent Lloyd
> Cc: ceph-users
> Subject: Re: [ceph-users] Poor performance for 512b aligned "partial"
> writes from Windows guests in OpenStack + potential fix
>
> Hello Trent,
>
> many thanks for the insights. We always suggest to use 4kN over 512e
> HDDs to our users.
>
> As we recently found out, is that WD Support offers a tool called HUGO
> to reformat 512e to 4kN drives with "hugo format -m  -n
> max --fastformat -b 4096" in seconds.
> Maybe that helps someone that has bought the wrong disk.
>
> --
> Martin Verges
> Managing director
>
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io
> Chat: https://t.me/MartinVerges
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492 Com. register: Amtsgericht
> Munich HRB 231263
>
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
>
>
>
> On Fri, 10 May 2019 at 10:00, Trent Lloyd wrote:
>
>
> I recently was investigating a performance problem for a
> reasonably
> sized OpenStack deployment having around 220 OSDs (3.5" 7200 RPM SAS
> HDD) with NVMe Journals. The primary workload is Windows guests backed
> by Cinder RBD volumes.
> This specific deployment is Ceph Jewel (FileStore +
> SimpleMessenger) which while it is EOL, the issue is reproducible on
> current versions and also on BlueStore however for different reasons
> than FileStore.
>
>
> Generally the Ceph cluster was suffering from very poor outlier
> performance, the numbers change a little bit depending on the exact
> situation but roughly 80% of I/O was happening in a "reasonable" time of
> 0-200ms but 5-20% of I/O operations were taking excessively long
> anywhere from 500ms through to 10-20+ seconds. However the normal
> metrics for commit and apply latency were normal, and in fact, this
> latency was hard to spot in the performance metrics available in jewel.
>
> Previously I more simply considered FileStore to have the "commit"
> (to journal) stage where it was written to the journal and it is OK to
> return to the client and then the "apply" (to disk) stage where it was
> flushed to disk and confirmed so that the data could be purged from the
> journal. However there is really a third stage in the middle where
> FileStore submits the I/O to the operating system and this is done
> before the lock on the object is released. Until that succeeds another
> operation cannot write to the same object (generally being a 4MB area of
> the disk).
>
> I found that the fstore_op threads would get stuck for hundreds of
> MS or more inside of pwritev() which was blocking inside of the kernel.
> Normally we expect pwritev() to be buffered I/O into the page cache and
> return quite fast however in this case the kernel was in a few percent
> of cases blocking with the stack trace included at the end of the e-mail
> [1]. My finding from that stack is that inside __block_write_begin_int
> we see a call to out_of_line_wait_on_bit call which is really an inlined
> call for wait_on_buffer which occurs in linux/fs/buffer.c in the section
> around line 2000-2024 with the comment "If we issued read requests - let
> them complete."
>

Re: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

2019-05-10 Thread Trent Lloyd
Note that the issue I am talking about here is how a "Virtual" Ceph RBD
disk is presented to a virtual guest, and specifically for Windows guests
(Linux guests are not affected). I am not at all talking about how the
physical disks are presented to Ceph itself (although Martin was, though he
wasn't clear whether changing these underlying physical disks to 4kn was for
Ceph or for other environments).

I would not expect having your underlying physical disk presented to
Ceph itself as 512b/512e or 4kn to have a significant impact on performance,
because Linux systems generally send 4k-aligned I/O anyway
(regardless of what the underlying disk reports for
physical_block_size). There may be some exceptions to that, such as
applications performing Direct I/O to the disk. If anyone knows otherwise,
it would be great to hear specific details.
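
(For what it's worth, a quick way to see which block sizes a Linux guest has
actually picked up for a virtio disk is via sysfs; vda is just an example
device name:)

cat /sys/block/vda/queue/logical_block_size
cat /sys/block/vda/queue/physical_block_size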

Regards,
Trent

On Fri, May 10, 2019 at 4:40 PM Marc Roos  wrote:

>
> Hmmm, so if I have (wd) drives that list this in smartctl output, I
> should try and reformat them to 4k, which will give me better
> performance?
>
> Sector Sizes: 512 bytes logical, 4096 bytes physical
>
> Do you have a link to this download? Can only find some .cz site with
> the rpms.
>
>
> -Original Message-
> From: Martin Verges [mailto:martin.ver...@croit.io]
> Sent: Friday 10 May 2019 10:21
> To: Trent Lloyd
> Cc: ceph-users
> Subject: Re: [ceph-users] Poor performance for 512b aligned "partial"
> writes from Windows guests in OpenStack + potential fix
>
> Hello Trent,
>
> many thanks for the insights. We always suggest to use 4kN over 512e
> HDDs to our users.
>
> As we recently found out, is that WD Support offers a tool called HUGO
> to reformat 512e to 4kN drives with "hugo format -m  -n
> max --fastformat -b 4096" in seconds.
> Maybe that helps someone that has bought the wrong disk.
>
> --
> Martin Verges
> Managing director
>
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io
> Chat: https://t.me/MartinVerges
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492 Com. register: Amtsgericht
> Munich HRB 231263
>
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
>
>
>
> On Fri, 10 May 2019 at 10:00, Trent Lloyd wrote:
>
>
> I recently was investigating a performance problem for a
> reasonably
> sized OpenStack deployment having around 220 OSDs (3.5" 7200 RPM SAS
> HDD) with NVMe Journals. The primary workload is Windows guests backed
> by Cinder RBD volumes.
> This specific deployment is Ceph Jewel (FileStore +
> SimpleMessenger) which while it is EOL, the issue is reproducible on
> current versions and also on BlueStore however for different reasons
> than FileStore.
>
>
> Generally the Ceph cluster was suffering from very poor outlier
> performance, the numbers change a little bit depending on the exact
> situation but roughly 80% of I/O was happening in a "reasonable" time of
> 0-200ms but 5-20% of I/O operations were taking excessively long
> anywhere from 500ms through to 10-20+ seconds. However the normal
> metrics for commit and apply latency were normal, and in fact, this
> latency was hard to spot in the performance metrics available in jewel.
>
> Previously I more simply considered FileStore to have the "commit"
> (to journal) stage where it was written to the journal and it is OK to
> return to the client and then the "apply" (to disk) stage where it was
> flushed to disk and confirmed so that the data could be purged from the
> journal. However there is really a third stage in the middle where
> FileStore submits the I/O to the operating system and this is done
> before the lock on the object is released. Until that succeeds another
> operation cannot write to the same object (generally being a 4MB area of
> the disk).
>
> I found that the fstore_op threads would get stuck for hundreds of
> MS or more inside of pwritev() which was blocking inside of the kernel.
> Normally we expect pwritev() to be buffered I/O into the page cache and
> return quite fast however in this case the kernel was in a few percent
> of cases blocking with the stack trace included at the end of the e-mail
> [1]. My finding from that stack is that inside __block_write_begin_int
> we see a call to out_of_line_wait_on_bit call which is really an inlined
> call for wait_on_buffer which occurs in linux/fs/buffer.c in the section
> around line 2000-2024 with the comment "If we issued read requests - let
> them complete."
> (https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002)

Re: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

2019-05-10 Thread Marc Roos
 
Hmmm, so if I have (wd) drives that list this in smartctl output, I 
should try and reformat them to 4k, which will give me better 
performance?

Sector Sizes: 512 bytes logical, 4096 bytes physical

Do you have a link to this download? Can only find some .cz site with 
the rpms. 


-Original Message-
From: Martin Verges [mailto:martin.ver...@croit.io] 
Sent: Friday 10 May 2019 10:21
To: Trent Lloyd
Cc: ceph-users
Subject: Re: [ceph-users] Poor performance for 512b aligned "partial" 
writes from Windows guests in OpenStack + potential fix

Hello Trent,

many thanks for the insights. We always suggest to use 4kN over 512e 
HDDs to our users.

As we recently found out, is that WD Support offers a tool called HUGO 
to reformat 512e to 4kN drives with "hugo format -m  -n 
max --fastformat -b 4096" in seconds.
Maybe that helps someone that has bought the wrong disk.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492 Com. register: Amtsgericht 
Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx



On Fri, 10 May 2019 at 10:00, Trent Lloyd wrote:


I recently was investigating a performance problem for a reasonably 
sized OpenStack deployment having around 220 OSDs (3.5" 7200 RPM SAS 
HDD) with NVMe Journals. The primary workload is Windows guests backed 
by Cinder RBD volumes.
This specific deployment is Ceph Jewel (FileStore + 
SimpleMessenger) which while it is EOL, the issue is reproducible on 
current versions and also on BlueStore however for different reasons 
than FileStore.


Generally the Ceph cluster was suffering from very poor outlier 
performance, the numbers change a little bit depending on the exact 
situation but roughly 80% of I/O was happening in a "reasonable" time of 
0-200ms but 5-20% of I/O operations were taking excessively long 
anywhere from 500ms through to 10-20+ seconds. However the normal 
metrics for commit and apply latency were normal, and in fact, this 
latency was hard to spot in the performance metrics available in jewel.

Previously I more simply considered FileStore to have the "commit" 
(to journal) stage where it was written to the journal and it is OK to 
return to the client and then the "apply" (to disk) stage where it was 
flushed to disk and confirmed so that the data could be purged from the 
journal. However there is really a third stage in the middle where 
FileStore submits the I/O to the operating system and this is done 
before the lock on the object is released. Until that succeeds another 
operation cannot write to the same object (generally being a 4MB area of 
the disk).

I found that the fstore_op threads would get stuck for hundreds of 
MS or more inside of pwritev() which was blocking inside of the kernel. 
Normally we expect pwritev() to be buffered I/O into the page cache and 
return quite fast however in this case the kernel was in a few percent 
of cases blocking with the stack trace included at the end of the e-mail 
[1]. My finding from that stack is that inside __block_write_begin_int 
we see a call to out_of_line_wait_on_bit call which is really an inlined 
call for wait_on_buffer which occurs in linux/fs/buffer.c in the section 
around line 2000-2024 with the comment "If we issued read requests - let 
them complete." 
(https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002)

My interpretation of that code is that for Linux to store a write 
in the page cache, it has to have the entire 4K page as that is the 
granularity of which it tracks the dirty state and it needs the entire 
4K page to later submit back to the disk. Since we wrote a part of the 
page, and the page wasn't already in the cache, it has to fetch the 
remainder of the page from the disk. When this happens, it blocks 
waiting for this read to complete before returning from the pwritev() 
call - hence our normally buffered write blocks. This holds up the 
tp_fstore_op thread, of which there are (by default) only 2-4 such 
threads trying to process several hundred operations per second. 
Additionally the size of the osd_op_queue is bounded, and operations do 
not clear out of this queue until the tp_fstore_op thread is done. Which 
ultimately means that not only are these partial writes delayed but it 
knocks on to delay other writes behind them because of the constrained 
thread pools.

What was further confusing to this, is that I could easily 
reproduce this in a test deployment using an rbd benchmark that was only 
writing to a total disk size of 256MB which I would easily have expected 
to fit in the page cache:

rbd create -p rbd --size=256M bench2
rbd bench-write -p rbd bench2 

Re: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

2019-05-10 Thread Martin Verges
Hello Trent,

many thanks for the insights. We always suggest to use 4kN over 512e HDDs
to our users.

As we recently found out, WD Support offers a tool called HUGO to
reformat 512e to 4kN drives with "hugo format -m  -n max
--fastformat -b 4096" in seconds.
Maybe that helps someone that has bought the wrong disk.
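
(After such a reformat the drive should report 4096 bytes for both the
logical and the physical sector size; a quick check along these lines, with
the device name as a placeholder and the exact smartctl wording differing
between 512e and 4Kn drives:)

smartctl -a /dev/sdX | grep -i 'sector size'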

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Fri, 10 May 2019 at 10:00, Trent Lloyd <trent.ll...@canonical.com> wrote:

> I recently was investigating a performance problem for a reasonably sized
> OpenStack deployment having around 220 OSDs (3.5" 7200 RPM SAS HDD) with
> NVMe Journals. The primary workload is Windows guests backed by Cinder RBD
> volumes.
> This specific deployment is Ceph Jewel (FileStore + SimpleMessenger) which
> while it is EOL, the issue is reproducible on current versions and also on
> BlueStore however for different reasons than FileStore.
>
> Generally the Ceph cluster was suffering from very poor outlier
> performance, the numbers change a little bit depending on the exact
> situation but roughly 80% of I/O was happening in a "reasonable" time of
> 0-200ms but 5-20% of I/O operations were taking excessively long anywhere
> from 500ms through to 10-20+ seconds. However the normal metrics for commit
> and apply latency were normal, and in fact, this latency was hard to spot
> in the performance metrics available in jewel.
>
> Previously I more simply considered FileStore to have the "commit" (to
> journal) stage where it was written to the journal and it is OK to return
> to the client and then the "apply" (to disk) stage where it was flushed to
> disk and confirmed so that the data could be purged from the journal.
> However there is really a third stage in the middle where FileStore submits
> the I/O to the operating system and this is done before the lock on the
> object is released. Until that succeeds another operation cannot write to
> the same object (generally being a 4MB area of the disk).
>
> I found that the fstore_op threads would get stuck for hundreds of MS or
> more inside of pwritev() which was blocking inside of the kernel. Normally
> we expect pwritev() to be buffered I/O into the page cache and return quite
> fast however in this case the kernel was in a few percent of cases blocking
> with the stack trace included at the end of the e-mail [1]. My finding from
> that stack is that inside __block_write_begin_int we see a call to
> out_of_line_wait_on_bit call which is really an inlined call for
> wait_on_buffer which occurs in linux/fs/buffer.c in the section around line
> 2000-2024 with the comment "If we issued read requests - let them
> complete." (
> https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002
> )
>
> My interpretation of that code is that for Linux to store a write in the
> page cache, it has to have the entire 4K page as that is the granularity of
> which it tracks the dirty state and it needs the entire 4K page to later
> submit back to the disk. Since we wrote a part of the page, and the page
> wasn't already in the cache, it has to fetch the remainder of the page from
> the disk. When this happens, it blocks waiting for this read to complete
> before returning from the pwritev() call - hence our normally buffered
> write blocks. This holds up the tp_fstore_op thread, of which there are (by
> default) only 2-4 such threads trying to process several hundred operations
> per second. Additionally the size of the osd_op_queue is bounded, and
> operations do not clear out of this queue until the tp_fstore_op thread is
> done. Which ultimately means that not only are these partial writes delayed
> but it knocks on to delay other writes behind them because of the
> constrained thread pools.
>
> What was further confusing to this, is that I could easily reproduce this
> in a test deployment using an rbd benchmark that was only writing to a
> total disk size of 256MB which I would easily have expected to fit in the
> page cache:
> rbd create -p rbd --size=256M bench2
> rbd bench-write -p rbd bench2 --io-size 512 --io-threads 256 --io-total
> 256M --io-pattern rand
>
> This is explained by the fact that on secondary OSDs (at least, there was
> some refactoring of fadvise which I have not fully understood as of yet),
> FileStore is using fadvise FADVISE_DONTNEED on the objects after write
> which causes the kernel to immediately discard them from the page cache
> without any regard to their statistics of being recently/frequently used.
> The motivation for this addition appears to be that on a secondary OSD we
> don't service reads (only writes) and so therefore we can optimize memory
> usage by throwing away this object and 

[ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

2019-05-10 Thread Trent Lloyd
I recently was investigating a performance problem for a reasonably sized
OpenStack deployment having around 220 OSDs (3.5" 7200 RPM SAS HDD) with
NVMe Journals. The primary workload is Windows guests backed by Cinder RBD
volumes.
This specific deployment is Ceph Jewel (FileStore + SimpleMessenger); while
that is EOL, the issue is reproducible on current versions and also on
BlueStore, although for different reasons than on FileStore.

Generally the Ceph cluster was suffering from very poor outlier
performance, the numbers change a little bit depending on the exact
situation but roughly 80% of I/O was happening in a "reasonable" time of
0-200ms but 5-20% of I/O operations were taking excessively long anywhere
from 500ms through to 10-20+ seconds. However the normal metrics for commit
and apply latency were normal, and in fact, this latency was hard to spot
in the performance metrics available in jewel.

Previously I more simply considered FileStore to have the "commit" (to
journal) stage where it was written to the journal and it is OK to return
to the client and then the "apply" (to disk) stage where it was flushed to
disk and confirmed so that the data could be purged from the journal.
However there is really a third stage in the middle where FileStore submits
the I/O to the operating system and this is done before the lock on the
object is released. Until that succeeds another operation cannot write to
the same object (generally being a 4MB area of the disk).

I found that the fstore_op threads would get stuck for hundreds of ms or
more inside of pwritev() which was blocking inside of the kernel. Normally
we expect pwritev() to be buffered I/O into the page cache and return quite
fast however in this case the kernel was in a few percent of cases blocking
with the stack trace included at the end of the e-mail [1]. My finding from
that stack is that inside __block_write_begin_int we see a call to
out_of_line_wait_on_bit call which is really an inlined call for
wait_on_buffer which occurs in linux/fs/buffer.c in the section around line
2000-2024 with the comment "If we issued read requests - let them
complete." (
https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002
)

My interpretation of that code is that for Linux to store a write in the
page cache, it has to have the entire 4K page as that is the granularity of
which it tracks the dirty state and it needs the entire 4K page to later
submit back to the disk. Since we wrote a part of the page, and the page
wasn't already in the cache, it has to fetch the remainder of the page from
the disk. When this happens, it blocks waiting for this read to complete
before returning from the pwritev() call - hence our normally buffered
write blocks. This holds up the tp_fstore_op thread, of which there are (by
default) only 2-4 such threads trying to process several hundred operations
per second. Additionally the size of the osd_op_queue is bounded, and
operations do not clear out of this queue until the tp_fstore_op thread is
done. Which ultimately means that not only are these partial writes delayed
but it knocks on to delay other writes behind them because of the
constrained thread pools.
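
(A crude way to observe the same read-modify-write behaviour outside of Ceph
is sketched below; the file path, offset and device are arbitrary
placeholders, and the pre-existing file needs to be larger than the write
offset:)

# drop the page cache so the target pages are cold (requires root)
sync; echo 3 > /proc/sys/vm/drop_caches

# 512-byte buffered write into the middle of an existing file
dd if=/dev/zero of=/srv/test.img bs=512 count=1 seek=12345 conv=notrunc

# in another terminal: reads appear on the underlying device while "writing"
iostat -x 1 /dev/sdb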

What was further confusing to this, is that I could easily reproduce this
in a test deployment using an rbd benchmark that was only writing to a
total disk size of 256MB which I would easily have expected to fit in the
page cache:
rbd create -p rbd --size=256M bench2
rbd bench-write -p rbd bench2 --io-size 512 --io-threads 256 --io-total
256M --io-pattern rand

This is explained by the fact that on secondary OSDs (at least, there was
some refactoring of fadvise which I have not fully understood as of yet),
FileStore is using fadvise FADVISE_DONTNEED on the objects after write
which causes the kernel to immediately discard them from the page cache
without any regard to their statistics of being recently/frequently used.
The motivation for this addition appears to be that on a secondary OSD we
don't service reads (only writes) and so therefore we can optimize memory
usage by throwing away this object and in theory leaving more room in the
page cache for objects which we are primary for and expect to actually
service reads from a client for. Unfortunately this behavior does not take
into account partial writes, where we now pathologically throw away the
cached copy instantly such that a write even 1 second later will have to
fetch the page from disk again. I also found that this FADVISE_DONTNEED is
issued not only during filestore sync but also by the WBThrottle - which, as
this cluster was quite busy, was constantly flushing writes, leading to the
cache being discarded almost instantly.

Changing filestore_fadvise to False on this cluster led to a significant
performance increase as it could now cache the pages in memory in many
cases. The number of reads from disk was reduced from around 40/second to
2/second, and the number of slow writes
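
For reference, the option being changed here is FileStore's fadvise setting;
a sketch of how it could be applied (option name as in Jewel-era Ceph, OSD id
as a placeholder):

# ceph.conf, [osd] section
filestore fadvise = false

# or at runtime on a single OSD, for testing
ceph tell osd.12 injectargs '--filestore_fadvise false'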