Re: [ceph-users] HDD-only performance, how far can it be sped up ?

2018-06-28 Thread Horace
You need 1 core per SATA disk, otherwise your load average will skyrocket
when the system is under full load and render the cluster unstable, i.e.
ceph-mon becoming unreachable, slow requests, etc.

Regards,
Horace Ng

- Original Message -
From: "Brian :" 
To: "Wladimir Mutel" , "ceph-users" 
Sent: Wednesday, June 20, 2018 4:17:29 PM
Subject: Re: [ceph-users] HDD-only performance, how far can it be sped up ?

Re: [ceph-users] HDD-only performance, how far can it be sped up ?

2018-06-20 Thread Brian :
Hi Wladimir,

A combination of a fairly slow clock speed, erasure code, a single node
and SATA spinners is probably not going to lead to a great
evaluation. Some of the experts will chime in here with answers to
your specific questions, I'm sure, but this test really isn't ever going
to give great results.

Brian

[ceph-users] HDD-only performance, how far can it be sped up ?

2018-06-20 Thread Wladimir Mutel

Dear all,

I set up a minimal 1-node Ceph cluster to evaluate its performance.
We tried to save as much as possible on the hardware, so now the box has
an Asus P10S-M WS motherboard, a Xeon E3-1235L v5 CPU, 64 GB of DDR4 ECC RAM
and 8x3TB HDDs (WD30EFRX) connected to the on-board SATA ports. We are also
trying to save on storage redundancy, so for most of our RBD images we
use an erasure-coded data-pool (default profile, jerasure 2+1) instead of
3x replication. I started with a Luminous/Xenial 12.2.5 setup which
initialized my OSDs as Bluestore during deployment, then updated it to
Mimic/Bionic 13.2.0. The base OS is Ubuntu 18.04 with the kernel updated to
4.17.2 from the Ubuntu mainline PPA.
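
For reference, an EC data-pool for RBD like the one above can be set up
roughly as follows (pool names, PG counts and image size are illustrative;
on a single node the CRUSH failure domain has to be relaxed to 'osd'):

  ceph osd erasure-code-profile set ec21 k=2 m=1 crush-failure-domain=osd
  ceph osd pool create rbd-data 64 64 erasure ec21
  ceph osd pool set rbd-data allow_ec_overwrites true
  ceph osd pool application enable rbd-data rbd
  rbd create --size 2T --data-pool rbd-data rbd/test-image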


With this setup, I created a number of RBD images to test iSCSI,
rbd-nbd and QEMU+librbd performance (running the QEMU VMs on the same box).
That worked moderately well as long as the data volume transferred within
one session was limited. The fastest transfers I got were with 'rbd import',
which pulled an ISO image file at up to 25 MBytes/sec from the remote
CIFS share over Gigabit Ethernet and stored it into the EC data-pool.
Windows 2008 R2 & 2016 setup, update installation, and a Win 2008 upgrade to
2012 and then to 2016 within a QEMU VM also went through tolerably well. I
found that cache=writeback gives the best performance with librbd,
unlike cache=unsafe, which gave the best performance with VMs on plain
local SATA drives. I also have a subjective feeling (not confirmed by
exact measurements) that providing a huge librbd cache (e.g.
cache size = 1GB, max dirty = 7/8 GB, max dirty age = 60) improved
Windows VM performance on bursty writes (such as during Windows update
installations) as well as on reboots (due to cached reads).
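
For reference, those librbd cache settings correspond to ceph.conf [client]
options roughly like the ones below (byte values illustrative, with the
dirty limit at 7/8 of a 1 GiB cache); cache=writeback can also be requested
per drive on the QEMU command line, as in the last line (the pool/image
name there is hypothetical):

  [client]
      rbd cache = true
      rbd cache size = 1073741824        # 1 GiB
      rbd cache max dirty = 939524096    # 7/8 of the cache size
      rbd cache max dirty age = 60

  qemu -drive format=raw,file=rbd:rbd/vm-image:rbd_cache=true,cache=writeback ...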


What discouraged me was my next attempt: cloning an NTFS
partition of ~2TB from a physical drive (via a USB3-SATA3 convertor) to a
partition on an RBD image. I tried to map the RBD image with rbd-nbd, either
locally or remotely over Gigabit Ethernet, and the fastest speed I got
with ntfsclone was about 8 MBytes/sec, which means it could take up to
3 days to copy these ~2TB of NTFS data. I thought about running
ntfsclone /dev/sdX1 -o - | rbd import ... - , but ntfsclone needs to
rewrite a part of the existing RBD image starting from a certain offset,
so I decided this was not a solution in my situation. Now I am thinking
about taking out one of the OSDs and using it as a 'bcache' for this
operation, but I am not sure how good bcache performance is with its cache
on a rotating HDD. I know that keeping the OSD WAL and RocksDB on the same
HDD creates a seeky workload which hurts overall transfer performance.
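
For reference, the rbd-nbd attempt was roughly along these lines (image and
device names illustrative):

  rbd-nbd map rbd/ntfs-image           # prints the nbd device, e.g. /dev/nbd0
  partprobe /dev/nbd0                  # re-read the partition table if /dev/nbd0p1 is missing
  ntfsclone --overwrite /dev/nbd0p1 /dev/sdX1   # clone the source NTFS onto the mapped partition
  rbd-nbd unmap /dev/nbd0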


I am also thinking about a number of possible next steps, and
I would like to hear your opinions on the benefits and drawbacks of each
of them.


1. Would iSCSI access to that RBD image improve performance
(compared to rbd-nbd)? I have not checked that yet, but I noticed that
Windows transferred about 2.5 MBytes/sec while formatting an NTFS volume on
this RBD attached to it via iSCSI. So, for seeky/sparse workloads like
NTFS formatting, the performance was not great.


2. Would it help to run ntfsclone in a Linux VM, with the RBD image
accessed through QEMU+librbd? (I am also going to measure that myself.)


3. Are there any performance benefits in using Ceph cache-tier pools
with my setup? I hear that use of this technique is now advised against, no?


4. We have an unused older box (Supermicro X8SIL-F mobo, Xeon X3430
CPU, 32 GB of DDR3 ECC RAM, 6 onboard SATA ports, used from 2010 to
2017, in perfectly working condition) which can be stuffed with up to 6
SATA HDDs and added to this Ceph cluster, so far with only a Gigabit
network interconnect. For example, move 4 OSDs out of the first box into
it, to have 2 boxes with 4 HDDs each. Is this going to improve Ceph
performance with the setup described above?


5. I hear that RAID controllers like the Adaptec 5805 or LSI 2108 provide
better performance with SATA HDDs exported as JBODs than onboard SATA
AHCI controllers do, due to more aggressive caching and request reordering.
Is this true?


6. On the local market we can buy a Kingston KC1000/960GB NVMe drive
for a moderately reasonable price. Its specification lists a write
endurance of 1 PB and 0.58 DWPD (drive writes per day). Are there any
contraindications against using it in a production Ceph setup (i.e., is
the write endurance too low; should we look for 8+ PB)? What is the
difference between using it as a 'bcache' or as dedicated OSD WAL+RocksDB
storage? Can it be used as a single shared partition for all OSD daemons,
or will it require splitting into 8 separate partitions?
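
If I understand ceph-volume correctly, each OSD needs its own block.db
partition or LV rather than one partition shared by all of them, so such a
deployment would look roughly like this (device names illustrative,
repeated per OSD):

  ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
  ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p2
  # ... one block.db partition (or LV) per OSD, 8 in total for this box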


Thank you in advance for your replies.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com