It depends on what you expect your typical workload to be like. Ceph (and distributed storage in general) likes high IO depths so writes can hit all of the drives at the same time. There are tricks (like journals, write-ahead logs, centralized caches, etc) that can help mitigate the penalty of low queue depths, but I suspect you'll see much better performance with more concurrent writes.

Regarding file size, the smaller the file, the more likely those tricks mentioned above are to help you. Based on your results, it appears filestore may be doing a better job of it than bluestore. The question you have to ask is whether or not this kind of test represents what you are likely to see for real on your cluster.

Doing writes over a much larger file, say 3-4x the total amount of RAM in all of the nodes, helps you get a better idea of what the behavior is like when those tricks are less effective. I think that's probably a more likely scenario in most production environments, but it's up to you which workload you think better represents what you are going to see in practice. A while back Nick Fisk showed some results where bluestore was slower than filestore at small sync writes, and it could be that we simply have more work to do in this area. On the other hand, we pretty consistently see bluestore doing better than filestore with 4k random writes and higher IO depths, which is why I'd be curious to see how it goes if you try that.
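
For example, something along these lines (just a sketch, adjust --size and --runtime for your hardware; the idea is a file well beyond the total RAM across the nodes and a deeper queue):

fio --name=bs_test --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
    --size=100G --iodepth=32 --numjobs=2 --time_based --runtime=300 \
    --group_reporting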

Mark

On 11/16/2017 10:11 AM, Milanov, Radoslav Nikiforov wrote:
No,
What test parameters (iodepth/file size/numjobs) would make sense for a
3 node / 27 OSD @ 4TB cluster?
- Rado

-----Original Message-----
From: Mark Nelson [mailto:mnel...@redhat.com]
Sent: Thursday, November 16, 2017 10:56 AM
To: Milanov, Radoslav Nikiforov <rad...@bu.edu>; David Turner 
<drakonst...@gmail.com>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Bluestore performance 50% of filestore

Did you happen to have a chance to try with a higher io depth?

Mark

On 11/16/2017 09:53 AM, Milanov, Radoslav Nikiforov wrote:
FYI

Having a 50GB block.db made no difference in performance.



- Rado



*From:*David Turner [mailto:drakonst...@gmail.com]
*Sent:* Tuesday, November 14, 2017 6:13 PM
*To:* Milanov, Radoslav Nikiforov <rad...@bu.edu>
*Cc:* Mark Nelson <mnel...@redhat.com>; ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] Bluestore performance 50% of filestore



I'd probably say 50GB to leave some extra space over-provisioned.
50GB should definitely prevent any DB operations from spilling over to the HDD.
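
If you want ceph-disk to carve out that partition automatically when you recreate the OSDs, something along these lines in ceph.conf should do it before the OSDs are prepared (value is in bytes; double-check the option name against your release):

[osd]
# ~50GB block.db partition, in bytes
bluestore_block_db_size = 53687091200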



On Tue, Nov 14, 2017, 5:43 PM Milanov, Radoslav Nikiforov
<rad...@bu.edu <mailto:rad...@bu.edu>> wrote:

    Thank you,

    These are 4TB OSDs and they might become full someday, so I'll try a 60GB
    db partition – sized for the max OSD capacity.



    - Rado



    *From:*David Turner [mailto:drakonst...@gmail.com
    <mailto:drakonst...@gmail.com>]
    *Sent:* Tuesday, November 14, 2017 5:38 PM


    *To:* Milanov, Radoslav Nikiforov <rad...@bu.edu <mailto:rad...@bu.edu>>

    *Cc:*Mark Nelson <mnel...@redhat.com <mailto:mnel...@redhat.com>>;
    ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>


    *Subject:* Re: [ceph-users] Bluestore performance 50% of filestore



    You have to configure the size of the db partition in the config
    file for the cluster.  If your db partition is 1GB, then I can all
    but guarantee that you're using your HDD for your block.db very
    quickly into your testing.  There have been multiple threads
    recently about what size the db partition should be, and it seems to
    be based on how many objects your OSD is likely to have on it.  The
    recommendation has been to err on the side of bigger.  If you're
    running 10TB OSDs and anticipate filling them up, then you probably
    want closer to an 80GB+ db partition.  That's why I asked how full
    your cluster was and how large your HDDs are.
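
    One quick way to sanity check whether the db has already spilled over
    onto the HDD is the bluefs perf counters (counter names can vary a bit
    by release, so treat this as a rough sketch):

    ceph daemon osd.0 perf dump | grep -E 'db_used_bytes|slow_used_bytes'

    If slow_used_bytes is non-zero, part of the db is living on the slow
    device.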



    Here's a link to one of the recent ML threads on this topic:
    http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020822.html

    On Tue, Nov 14, 2017 at 4:44 PM Milanov, Radoslav Nikiforov
    <rad...@bu.edu <mailto:rad...@bu.edu>> wrote:

        The block-db partition is the default 1GB (is there a way to modify
        this? journals are 5GB in the filestore case) and usage is low:



        [root@kumo-ceph02 ~]# ceph df

        GLOBAL:

            SIZE        AVAIL      RAW USED     %RAW USED

            100602G     99146G        1455G          1.45

        POOLS:

            NAME              ID     USED       %USED     MAX AVAIL     OBJECTS
            kumo-vms          1      19757M      0.02        31147G        5067
            kumo-volumes      2        214G      0.18        31147G       55248
            kumo-images       3        203G      0.17        31147G       66486
            kumo-vms3         11     45824M      0.04        31147G       11643
            kumo-volumes3     13     10837M         0        31147G        2724
            kumo-images3      15     82450M      0.09        31147G       10320



        - Rado



        *From:*David Turner [mailto:drakonst...@gmail.com
        <mailto:drakonst...@gmail.com>]
        *Sent:* Tuesday, November 14, 2017 4:40 PM
        *To:* Mark Nelson <mnel...@redhat.com <mailto:mnel...@redhat.com>>
        *Cc:* Milanov, Radoslav Nikiforov <rad...@bu.edu
        <mailto:rad...@bu.edu>>; ceph-users@lists.ceph.com
        <mailto:ceph-users@lists.ceph.com>


        *Subject:* Re: [ceph-users] Bluestore performance 50% of filestore



        How big was your block.db partition for each OSD and what size
        are your HDDs?  Also how full is your cluster?  It's possible
        that your block.db partition wasn't large enough to hold the
        entire db and it had to spill over onto the HDD, which would
        definitely impact performance.



        On Tue, Nov 14, 2017 at 4:36 PM Mark Nelson <mnel...@redhat.com
        <mailto:mnel...@redhat.com>> wrote:

            How big were the writes in the windows test and how much
            concurrency was
            there?

            Historically bluestore does pretty well for us with small
            random writes
            so your write results surprise me a bit.  I suspect it's the
            low queue
            depth.  Sometimes bluestore does worse with reads, especially if
            readahead isn't enabled on the client.

            Mark

            On 11/14/2017 03:14 PM, Milanov, Radoslav Nikiforov wrote:
            > Hi Mark,
            > Yes RBD is in write back, and the only thing that changed
            was converting OSDs to bluestore. It is 7200 rpm drives and
            triple replication. I also get same results (bluestore 2
            times slower) testing continuous writes on a 40GB partition
            on a Windows VM, completely different tool.
            >
            > Right now I'm going back to filestore for the OSDs so
            additional tests are possible if that helps.
            >
            > - Rado
            >
            > -----Original Message-----
            > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com
            <mailto:ceph-users-boun...@lists.ceph.com>] On Behalf Of
            Mark Nelson
            > Sent: Tuesday, November 14, 2017 4:04 PM
            > To: ceph-users@lists.ceph.com
            <mailto:ceph-users@lists.ceph.com>
            > Subject: Re: [ceph-users] Bluestore performance 50% of filestore
            >
            > Hi Radoslav,
            >
            > Is RBD cache enabled and in writeback mode?  Do you have
            client side readahead?
            >
            > Both are doing better for writes than you'd expect from
            the native performance of the disks assuming they are
            typical 7200RPM drives and you are using 3X replication
            (~150IOPS * 27 / 3 = ~1350 IOPS).  Given the small file
            size, I'd expect that you might be getting better journal
            coalescing in filestore.
            >
            > Sadly I imagine you can't do a comparison test at this
            point, but I'd be curious how it would look if you used
            libaio with a high iodepth and a much bigger partition to do
            random writes over.
            >
            > Mark
            >
            > On 11/14/2017 01:54 PM, Milanov, Radoslav Nikiforov wrote:
            >> Hi
            >>
            >> We have 3 node, 27 OSDs cluster running Luminous 12.2.1
            >>
            >> In the filestore configuration there are 3 SSDs used for journals
            >> of 9 OSDs on each host (1 SSD has 3 journal partitions for 3 OSDs).
            >>
            >> I've converted filestore to bluestore by wiping 1 host at a time
            >> and waiting for recovery. SSDs now contain block-db - again one
            >> SSD serving 3 OSDs.
            >>
            >>
            >>
            >> Cluster is used as storage for Openstack.
            >>
            >> Running fio on a VM in that Openstack reveals bluestore
            performance
            >> almost twice slower than filestore.
            >>
            >> fio --name fio_test_file --direct=1 --rw=randwrite
            --bs=4k --size=1G
            >> --numjobs=2 --time_based --runtime=180 --group_reporting
            >>
            >> fio --name fio_test_file --direct=1 --rw=randread --bs=4k
            --size=1G
            >> --numjobs=2 --time_based --runtime=180 --group_reporting
            >>
            >>
            >>
            >>
            >>
            >> Filestore
            >>
            >>   write: io=3511.9MB, bw=19978KB/s, iops=4994,
            runt=180001msec
            >>
            >>   write: io=3525.6MB, bw=20057KB/s, iops=5014,
            runt=180001msec
            >>
            >>   write: io=3554.1MB, bw=20222KB/s, iops=5055,
            runt=180016msec
            >>
            >>
            >>
            >>   read : io=1995.7MB, bw=11353KB/s, iops=2838,
            runt=180001msec
            >>
            >>   read : io=1824.5MB, bw=10379KB/s, iops=2594,
            runt=180001msec
            >>
            >>   read : io=1966.5MB, bw=11187KB/s, iops=2796,
            runt=180001msec
            >>
            >>
            >>
            >> Bluestore
            >>
            >>   write: io=1621.2MB, bw=9222.3KB/s, iops=2305,
            runt=180002msec
            >>
            >>   write: io=1576.3MB, bw=8965.6KB/s, iops=2241,
            runt=180029msec
            >>
            >>   write: io=1531.9MB, bw=8714.3KB/s, iops=2178,
            runt=180001msec
            >>
            >>
            >>
            >>   read : io=1279.4MB, bw=7276.5KB/s, iops=1819,
            runt=180006msec
            >>
            >>   read : io=773824KB, bw=4298.9KB/s, iops=1074,
            runt=180010msec
            >>
            >>   read : io=1018.5MB, bw=5793.7KB/s, iops=1448,
            runt=180001msec
            >>
            >>
            >>
            >>
            >>
            >> - Rado
            >>
            >>
            >>
            >>
            >>

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
