Using rados bench. It's just a test pool anyway. I will stick with my
current OSD setup (16 HDDs, 4 SSDs, for a 1:4 ratio of SSDs to HDDs). I can
get > 800 MB/s write and about 1 GB/s read.
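As a side note, the O_DSYNC flush behaviour Mark describes further down in this thread can be spot-checked with a small timed test. This is only a rough sketch (the temp-file path is a stand-in; for a real measurement you would point it at a file on the SSD under test, and the absolute numbers depend entirely on the device):

```python
import os
import tempfile
import time

# Rough sketch of an O_DSYNC write test, the same I/O pattern the
# Ceph journal uses: every write must be durable before it returns.
# The temp file here is purely illustrative; place it on the SSD
# under test for a meaningful number.
path = os.path.join(tempfile.mkdtemp(), "dsync-test")
block = b"\0" * 4096        # 4 KiB journal-sized writes
count = 256

fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
start = time.time()
for _ in range(count):
    os.write(fd, block)
elapsed = time.time() - start
os.close(fd)
os.unlink(path)

total_mb = count * len(block) / 1e6
print("%.2f MB in %.3f s -> %.1f MB/s O_DSYNC write"
      % (total_mb, elapsed, total_mb / elapsed))
```

Drives that honor ATA_CMD_FLUSH will show dramatically lower numbers here than in a plain sequential write test, which is exactly the gap being discussed.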

On Mon, Apr 20, 2015 at 11:19 AM, Mark Nelson <[email protected]> wrote:

> How are you measuring the 300 MB/s and 184 MB/s?  I.e., is it per drive,
> or the client throughput?  Also, what controller do you have?  We've seen
> some controllers from certain manufacturers start to top out at around
> 1-2 GB/s with write cache enabled.
>
> Mark
>
> On 04/20/2015 11:15 AM, Barclay Jameson wrote:
>
>> I have an SSD pool for testing (only 8 drives). When I use 1 SSD for the
>> journal and 1 SSD for data I get > 300 MB/s write. When I change all 8
>> disks to house their own journals I get < 184 MB/s write.
>>
>>
>> On Mon, Apr 20, 2015 at 10:16 AM, Mark Nelson <[email protected]> wrote:
>>
>>     The big question is how fast these drives can do O_DSYNC writes.
>>     The basic gist is that for every write to the journal, an
>>     ATA_CMD_FLUSH call is made to ensure that the device (or potentially
>>     the controller) knows that this data really needs to be stored safely
>>     before the flush is acknowledged.  How this gets handled is really
>>     important.
>>
>>     1) If devices have limited or no power loss protection, they need to
>>     flush the contents of any caches to non-volatile memory.  How
>>     quickly this can happen depends on a lot of factors, but even on
>>     SSDs it may be slow enough to limit performance greatly relative to
>>     how quickly writes could proceed if uninterrupted.
>>
>>     * It's very important to note that some devices that lack power loss
>>     protection may simply *ignore* ATA_CMD_FLUSH and return immediately
>>     so as to appear fast, even though this means that data may become
>>     corrupt.  Be very careful putting journals on devices that do this!
>>
>>     ** Some devices that claim to have power loss protection
>>     don't actually have capacitors big enough to flush data from cache.
>>     This has led to huge amounts of confusion, so you have to be very
>>     careful.  For a specific example, see the section titled "The Truth
>>     About Micron's Power-Loss Protection" here:
>>
>> http://www.anandtech.com/show/8528/micron-m600-128gb-256gb-1tb-ssd-review-nda-placeholder
>>
>>     2) Devices that feature proper power loss protection, such that
>>     caches can be flushed in the event of power failure, can safely
>>     ignore ATA_CMD_FLUSH and return immediately when it is called.
>>     This greatly improves the performance of Ceph journal
>>     writes and usually allows the journal to perform at or near the
>>     theoretical sequential write performance of the device.
>>
>>     3) Some controllers may be able to intercept these calls and return
>>     immediately on ATA_CMD_FLUSH if they have an on-board BBU that
>>     functions in the same way as PLP on the drives would.  Unfortunately
>>     on many controllers this is tied to enabling writeback cache and
>>     running the drives in some kind of RAID mode (single-disk RAID0 LUNs
>>     are often used for Ceph OSDs with this kind of setup).  In some
>>     cases the controller itself can become a bottleneck with SSDs, so
>>     it's important to test this out and make sure it works well in
>>     practice.
>>
>>     Regarding the 840 EVO, based on user reports it sounds like it
>>     does not have PLP and does flush data on ATA_CMD_FLUSH, resulting in
>>     quite a bit slower performance when doing O_DSYNC writes.
>>     Unfortunately we don't have any in the lab to test, but this is
>>     likely why you are seeing slower write performance when
>>     journals are placed on the SSDs.
>>
>>     Mark
>>
>>     On 04/20/2015 09:48 AM, J-P Methot wrote:
>>
>>         My journals are on-disk, each disk being an SSD. The reason I
>>         didn't go with dedicated drives for journals is that when
>>         designing the setup, I was told that dedicated journal SSDs on
>>         a full-SSD setup would not give any performance increase.
>>
>>         So that makes the journal disk to data disk ratio 1:1.
>>
>>         The replication size is 3, yes. The pools are replicated.
>>
>>         On 4/20/2015 10:43 AM, Barclay Jameson wrote:
>>
>>             Are your journals on separate disks? What is your ratio of
>>             journal
>>             disks to data disks? Are you doing replication size 3 ?
>>
>>             On Mon, Apr 20, 2015 at 9:30 AM, J-P Methot
>>             <[email protected]> wrote:
>>
>>                  Hi,
>>
>>                  This is similar to another thread running right now,
>>             but since our
>>                  current setup is completely different from the one
>>             described in
>>                  the other thread, I thought it may be better to start a
>>             new one.
>>
>>                  We are running Ceph Firefly 0.80.8 (soon to be upgraded
>>             to 0.80.9). We have 6 OSD hosts with 16 OSDs each (so a
>>             total of 96 OSDs). Each OSD is a Samsung SSD 840 EVO on
>>             which I can reach write speeds of roughly 400 MB/s, plugged
>>             in as JBOD on a controller that can theoretically transfer
>>             at 6 Gb/s.  All of that is linked to OpenStack compute
>>             nodes on two bonded 10 Gbps links (so a max transfer rate
>>             of 20 Gbps).
>>
>>                  When I run rados bench from the compute nodes, I reach
>>             the network cap in read speed. However, write speeds are
>>             vastly inferior, reaching about 920 MB/s. If I have 4
>>             compute nodes running the write benchmark at the same time,
>>             I can see the number plummet to 350 MB/s.  For our planned
>>             usage, we find this rather slow, considering we will run a
>>             high number of virtual machines on the cluster.
>>
>>                  Of course, the first thing to do would be to move the
>>             journals to faster drives. However, these are SSDs we're
>>             talking about; we don't really have access to faster
>>             drives, so I am looking for other suggestions as to how to
>>             get better write speeds.
>>
>>                  I have also thought of options myself, like:
>>                  - Upgrading to the latest stable Hammer version (would
>>             that really give me a big performance increase?)
>>                  - Crush map modifications (this is a long shot, but
>>             I'm still using the default crush map; maybe there's a
>>             change I could make there to improve performance)
>>
>>                  Any suggestions as to anything else I can tweak would
>>             be strongly
>>                  appreciated.
>>
>>                  For reference, here's part of my ceph.conf:
>>
>>                  [global]
>>                  auth_service_required = cephx
>>                  filestore_xattr_use_omap = true
>>                  auth_client_required = cephx
>>                  auth_cluster_required = cephx
>>                  osd pool default size = 3
>>
>>
>>                  osd pg bits = 12
>>                  osd pgp bits = 12
>>                  osd pool default pg num = 800
>>                  osd pool default pgp num = 800
>>
>>                  [client]
>>                  rbd cache = true
>>                  rbd cache writethrough until flush = true
>>
>>                  [osd]
>>                  filestore_fd_cache_size = 1000000
>>                  filestore_omap_header_cache_size = 1000000
>>                  filestore_fd_cache_random = true
>>                  filestore_queue_max_ops = 5000
>>                  journal_queue_max_ops = 1000000
>>                  max_open_files = 1000000
>>                  osd journal size = 10000
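[Editor's note: ceph.conf is INI-style, so settings like those quoted above can be read back and sanity-checked with a few lines of Python. This is a sketch using an inline copy of part of that [osd] section; on a real node you would read /etc/ceph/ceph.conf instead:]

```python
import configparser

# Sketch: parse an [osd] section like the one quoted above.
# Note: Ceph itself treats spaces and underscores in option names
# interchangeably; configparser does not, so match the file's spelling.
conf_text = """
[osd]
filestore_queue_max_ops = 5000
journal_queue_max_ops = 1000000
osd journal size = 10000
"""

cp = configparser.ConfigParser()
cp.read_string(conf_text)
journal_mb = cp.getint("osd", "osd journal size")
queue_ops = cp.getint("osd", "filestore_queue_max_ops")
print("osd journal size: %d MB, filestore queue max ops: %d"
      % (journal_mb, queue_ops))
```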
>>
>>                  --
>>                  ======================
>>                  Jean-Philippe Méthot
>>                  Administrateur système / System administrator
>>                  GloboTech Communications
>>                  Phone: 1-514-907-0050
>>                  Toll Free: 1-(888)-GTCOMM1
>>                  Fax: 1-(514)-907-0750
>>                  [email protected]
>>                  http://www.gtcomm.net
>>
>>                  _______________________________________________
>>                  ceph-users mailing list
>>                  [email protected]
>>                  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
