Update:

    After upgrading to Jewel and moving journaling to SSD, I no longer
see the slow/blocked request warnings during normal data copying.
    Thank you all.

Zhang Di

On Wed, Jul 13, 2016 at 11:04 PM, Christian Balzer <[email protected]> wrote:

>
> Hello,
>
> On Wed, 13 Jul 2016 22:47:05 -0500 Di Zhang wrote:
>
> > Hi,
> >       I changed to only use the infiniband network. For the 4KB write,
> the IOPS doesn’t improve much.
>
> That's mostly going to be bound by latencies (as I just wrote in the other
> thread), both network and internal Ceph ones.
>
> The cluster I described in the other thread has 32 OSDs and does about
> 1050 "IOPS" with "rados -p rbd bench 30 write -t 32 -b 4096".
> So about half with your 15 OSDs isn't all that unexpected.
>
> Once again, to get something more realistic use fio.
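As an illustration of the fio suggestion, a 4KB random-write run against the mounted CephFS might look like the job file below. This is only a sketch (not from the thread): the mount point, file name, and sizes are placeholders.

```ini
# 4KB random-write job against a file on the mounted CephFS.
# /mnt/cephfs is a placeholder for the actual mount point.
[cephfs-4k-randwrite]
filename=/mnt/cephfs/fio-testfile
rw=randwrite
bs=4k
size=1g
ioengine=libaio
direct=1
iodepth=32
runtime=60
time_based
group_reporting
```

Run it with `fio <jobfile>` and compare the reported IOPS against the rados bench figures.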
>
> >I also logged into the OSD nodes and atop showed the disks are not always
> at 100% busy. Please check a snapshot of one node below:
>
> When you do the 4KB bench (for 60 seconds or so), also watch the CPU
> usage, rados bench is a killer there.
>
> Christian
>
> >
> > DSK |          sdc  | busy     72% |  read    20/s |  write   86/s | KiB/w     13  | MBr/s   0.16 |  MBw/s   1.12 |  avio 6.69 ms |
> > DSK |          sda  | busy     47% |  read     0/s |  write  589/s | KiB/w      4  | MBr/s   0.00 |  MBw/s   2.83 |  avio 0.79 ms |
> > DSK |          sdb  | busy     31% |  read    14/s |  write   77/s | KiB/w     10  | MBr/s   0.11 |  MBw/s   0.76 |  avio 3.42 ms |
> > DSK |          sdd  | busy     19% |  read     4/s |  write   50/s | KiB/w     11  | MBr/s   0.03 |  MBw/s   0.55 |  avio 3.40 ms |
> > NET | transport     | tcpi   656/s |  tcpo   655/s |  udpi     0/s | udpo     0/s  | tcpao    0/s |  tcppo    0/s |  tcprs    0/s |
> > NET | network       | ipi    657/s |  ipo    655/s |  ipfrw    0/s | deliv  657/s  |              |  icmpi    0/s |  icmpo    0/s |
> > NET | p10p1     0%  | pcki     0/s |  pcko     0/s |  si    0 Kbps | so    1 Kbps  | erri     0/s |  erro     0/s |  drpo     0/s |
> > NET | ib0     ----  | pcki   637/s |  pcko   636/s |  si 8006 Kbps | so 5213 Kbps  | erri     0/s |  erro     0/s |  drpo     0/s |
> > NET | lo      ----  | pcki    19/s |  pcko    19/s |  si   14 Kbps | so   14 Kbps  | erri     0/s |  erro     0/s |  drpo     0/s |
> >
> >       /dev/sda is the OS and journaling SSD. The other three are OSDs.
> >
> >       Am I missing anything?
> >
> >       Thanks,
> >
> >
> >
> >
> > Zhang, Di
> > Postdoctoral Associate
> > Baylor College of Medicine
> >
> > > On Jul 13, 2016, at 6:56 PM, Christian Balzer <[email protected]> wrote:
> > >
> > >
> > > Hello,
> > >
> > > On Wed, 13 Jul 2016 12:01:14 -0500 Di Zhang wrote:
> > >
> > >> I also tried 4K write bench. The IOPS is ~420.
> > >
> > > That's what people usually mean (4KB blocks) when talking about IOPS.
> > > This number is pretty low, my guess would be network latency on your
> 1Gbs
> > > network for the most part.
> > >
> > > You should run atop on your storage nodes while running a test like this
> > > and see if the OSDs (HDDs) are also very busy.
> > >
> > > Lastly the rados bench gives you some basic numbers but it is not the
> same
> > > as real client I/O, for that you want to run fio inside a VM or in your
> > > case on a mounted CephFS.
> > >
> > >> I used to have better
> > >> bandwidth when I used the same network for both the cluster and
> clients. Now
> > >> the bandwidth must be limited by the 1G ethernet.
> > > That's the bandwidth you also see in your 4MB block tests below.
> > > For small I/Os the real killer is latency, though.
> > >
> > >> What would you suggest to
> > >> me to do?
> > >>
> > > That depends on your budget mostly (switch ports, client NICs).
> > >
> > > A uniform, single 10Gb/s network would be better in all aspects than
> the
> > > split network you have now.
> > >
> > > Christian
> > >
> > >> Thanks,
> > >>
> > >> On Wed, Jul 13, 2016 at 11:37 AM, Di Zhang <[email protected]>
> wrote:
> > >>
> > >>> Hello,
> > >>>    Sorry for the misunderstanding about IOPS. Here are some summary
> stats
> > >>> of my benchmark (does 20-30 IOPS seem normal to you?):
> > >>>
> > >>> ceph osd pool create test 512 512
> > >>>
> > >>> rados bench -p test 10 write --no-cleanup
> > >>>
> > >>> Total time run:         10.480383
> > >>> Total writes made:      288
> > >>> Write size:             4194304
> > >>> Object size:            4194304
> > >>> Bandwidth (MB/sec):     109.92
> > >>> Stddev Bandwidth:       11.9926
> > >>> Max bandwidth (MB/sec): 124
> > >>> Min bandwidth (MB/sec): 80
> > >>> Average IOPS:           27
> > >>> Stddev IOPS:            3
> > >>> Max IOPS:               31
> > >>> Min IOPS:               20
> > >>> Average Latency(s):     0.579105
> > >>> Stddev Latency(s):      0.19902
> > >>> Max latency(s):         1.32831
> > >>> Min latency(s):         0.245505
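The write-bench figures above are internally consistent, and they also show why Christian keeps pointing at latency: with a fixed number of ops in flight, average latency caps the achievable IOPS. A small sketch, assuming rados bench's default of 16 concurrent ops:

```python
import math

# Figures taken from the 4MB write bench output above.
total_writes = 288
object_size_mb = 4          # 4194304 bytes = 4 MB
runtime_s = 10.480383
avg_latency_s = 0.579105
concurrency = 16            # rados bench default (-t 16); an assumption here

bandwidth = total_writes * object_size_mb / runtime_s   # MB/s
iops = total_writes / runtime_s                         # ops/s

# Little's law: with 16 ops in flight and ~0.58s average latency,
# the IOPS ceiling is concurrency / latency.
iops_ceiling = concurrency / avg_latency_s

print(round(bandwidth, 2))     # ~109.92, matching "Bandwidth (MB/sec)"
print(round(iops))             # ~27, matching "Average IOPS"
print(round(iops_ceiling, 1))  # ~27.6: the run was latency-bound
```

The measured 27 IOPS sits right at the latency-derived ceiling, i.e. the disks were not the limit during this run.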
> > >>>
> > >>> rados bench -p test 10 seq
> > >>> Total time run:       10.340724
> > >>> Total reads made:     288
> > >>> Read size:            4194304
> > >>> Object size:          4194304
> > >>> Bandwidth (MB/sec):   111.404
> > >>> Average IOPS:         27
> > >>> Stddev IOPS:          2
> > >>> Max IOPS:             31
> > >>> Min IOPS:             22
> > >>> Average Latency(s):   0.564858
> > >>> Max latency(s):       1.65278
> > >>> Min latency(s):       0.141504
> > >>>
> > >>> rados bench -p test 10 rand
> > >>> Total time run:       10.546251
> > >>> Total reads made:     293
> > >>> Read size:            4194304
> > >>> Object size:          4194304
> > >>> Bandwidth (MB/sec):   111.13
> > >>> Average IOPS:         27
> > >>> Stddev IOPS:          2
> > >>> Max IOPS:             32
> > >>> Min IOPS:             24
> > >>> Average Latency(s):   0.57092
> > >>> Max latency(s):       1.8631
> > >>> Min latency(s):       0.161936
> > >>>
> > >>>
> > >>> On Tue, Jul 12, 2016 at 9:18 PM, Christian Balzer <[email protected]>
> wrote:
> > >>>
> > >>>>
> > >>>> Hello,
> > >>>>
> > >>>> On Tue, 12 Jul 2016 20:57:00 -0500 Di Zhang wrote:
> > >>>>
> > >>>>> I am using 10G infiniband for cluster network and 1G ethernet for
> > >>>> public.
> > >>>> Hmm, very unbalanced, but I guess that's HW you already had.
> > >>>>
> > >>>>> I don't have enough slots on the node, so I am using three files on
> > >>>>> the OS drive (SSD) for journaling, which improved but did not
> > >>>>> entirely solve the problem.
> > >>>>>
> > >>>> If you can, use partitions instead of files, less overhead.
> > >>>> What model SSD is that?
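For reference, the filestore-era way to point an OSD at a raw journal partition is the `osd journal` option in ceph.conf. A sketch only; the OSD id and the by-partuuid path are placeholders:

```ini
# ceph.conf fragment (filestore era): journal on a raw SSD partition
# rather than a file on the filesystem.
[osd.0]
osd journal = /dev/disk/by-partuuid/<journal-partition-uuid>
```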
> > >>>>
> > >>>> Also putting the meta-data pool on SSDs might help.
> > >>>>
> > >>>>> I am quite happy with the current IOPS, which range from 200 MB/s
> to 400
> > >>>>> MB/s sequential write, depending on the block size.
> > >>>> That's not IOPS, that's bandwidth, throughput.
> > >>>>
> > >>>>> But the problem is,
> > >>>>> when I transfer data to the cephfs at a rate below 100MB/s, I can
> > >>>> observe
> > >>>>> the slow/blocked requests warnings after a few minutes via "ceph
> -w".
> > >>>>
> > >>>> I doubt the transfer speed as such has anything to do with this; the
> > >>>> actual block size and IOPS numbers are what matter.
> > >>>>
> > >>>> As always, watch your storage nodes with atop (or iostat) during
> such
> > >>>> scenarios/tests and spot the bottlenecks.
> > >>>>
> > >>>>> It's
> > >>>>> not specific to any particular OSDs. So I started to doubt if the
> > >>>>> configuration is correct or upgrading to Jewel can solve it.
> > >>>>>
> > >>>> Jewel is likely to help in general, but can't fix insufficient HW or
> > >>>> broken configurations.
> > >>>>
> > >>>>> There are about 5,000,000 objects currently in the cluster.
> > >>>>>
> > >>>> You're probably not hitting this, but read the recent filestore merge
> and
> > >>>> split threads, including the entirety of this thread:
> > >>>>
> https://www.mail-archive.com/[email protected]/msg29243.html
> > >>>>
> > >>>> Christian
> > >>>>
> > >>>>> Thanks for the hints.
> > >>>>>
> > >>>>> On Tue, Jul 12, 2016 at 8:19 PM, Christian Balzer <[email protected]>
> > >>>> wrote:
> > >>>>>
> > >>>>>>
> > >>>>>> Hello,
> > >>>>>>
> > >>>>>> On Tue, 12 Jul 2016 19:54:38 -0500 Di Zhang wrote:
> > >>>>>>
> > >>>>>>> It's a 5 nodes cluster. Each node has 3 OSDs. I set pg_num = 512
> > >>>> for both
> > >>>>>>> cephfs_data and cephfs_metadata. I experienced some slow/blocked
> > >>>> requests
> > >>>>>>> issues when I was using hammer 0.94.x and prior. So I was
> thinking
> > >>>> if the
> > >>>>>>> pg_num is too large for metadata.
> > >>>>>>
> > >>>>>> Very, VERY much doubt this.
> > >>>>>>
> > >>>>>> Your "ideal" values for a cluster of this size (are you planning
> to
> > >>>> grow
> > >>>>>> it?) would be about 1024 PGs for data and 128 or 256 PGs for
> > >>>> meta-data.
> > >>>>>>
> > >>>>>> Not really that far off and more importantly not overloading the
> OSDs
> > >>>> with
> > >>>>>> too many PGs in total. Or do you have more pools?
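Christian's suggested values can be compared against a common rule of thumb: roughly 100 PGs per OSD, divided by the replication factor, rounded up to a power of two. A sketch; the rule of thumb is not from the thread, and the 3x replication is an assumption:

```python
import math

def pg_count(osds, pgs_per_osd=100, replicas=3):
    """Rule-of-thumb total PG count across all pools: target PGs per
    OSD times OSD count, divided by the replication factor, rounded
    up to the next power of two."""
    raw = osds * pgs_per_osd / replicas
    return 2 ** math.ceil(math.log2(raw))

# 5 nodes x 3 OSDs = 15 OSDs, assuming 3x replication:
print(pg_count(15))   # 512 in total, to be split across the pools
```

The cluster's current 512 + 512 is in the same ballpark in total, which fits Christian's point that the PG count is "not really that far off".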
> > >>>>>>
> > >>>>>>
> > >>>>>>> I just upgraded the cluster to Jewel
> > >>>>>>> today. Will watch if the problem remains.
> > >>>>>>>
> > >>>>>> Jewel improvements might mask things, but I'd venture that your
> > >>>> problems
> > >>>>>> were caused by your HW not being sufficient for the load.
> > >>>>>>
> > >>>>>> As in, do you use SSD journals, etc?
> > >>>>>> How many IOPS do you need/expect from your CephFS?
> > >>>>>> How many objects are in there?
> > >>>>>>
> > >>>>>> Christian
> > >>>>>>
> > >>>>>>> Thank you.
> > >>>>>>>
> > >>>>>>> On Tue, Jul 12, 2016 at 6:45 PM, Gregory Farnum <[email protected]>
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> I'm not at all sure that rados cppool actually captures
> > >>>> everything (it
> > >>>>>>>> might). Doug has been working on some similar stuff for disaster
> > >>>>>>>> recovery testing and can probably walk you through moving over.
> > >>>>>>>>
> > >>>>>>>> But just how large *is* your metadata pool in relation to
> others?
> > >>>>>>>> Having a too-large pool doesn't cost much unless it's
> > >>>>>>>> grossly-inflated, and having a nice distribution of your folders
> > >>>> is
> > >>>>>>>> definitely better than not.
> > >>>>>>>> -Greg
> > >>>>>>>>
> > >>>>>>>> On Tue, Jul 12, 2016 at 4:14 PM, Di Zhang <[email protected]>
> > >>>>>> wrote:
> > >>>>>>>>> Hi,
> > >>>>>>>>>
> > >>>>>>>>>    Is there any way to change the metadata pool for a cephfs
> > >>>> without
> > >>>>>>>> losing
> > >>>>>>>>> any existing data? I know how to clone the metadata pool using
> > >>>> rados
> > >>>>>>>> cppool.
> > >>>>>>>>> But the filesystem still links to the original metadata pool no
> > >>>>>> matter
> > >>>>>>>> what
> > >>>>>>>>> you name it.
> > >>>>>>>>>
> > >>>>>>>>>    The motivation here is to decrease the pg_num of the
> > >>>> metadata
> > >>>>>> pool. I
> > >>>>>>>>> created this cephfs cluster sometime ago, while I didn't
> realize
> > >>>>>> that I
> > >>>>>>>>> shouldn't assign a large pg_num to such a small pool.
> > >>>>>>>>>
> > >>>>>>>>>    I'm not sure if I can delete the fs and re-create it using
> > >>>> the
> > >>>>>>>> existing
> > >>>>>>>>> data pool and the cloned metadata pool.
> > >>>>>>>>>
> > >>>>>>>>>    Thank you.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Zhang Di
> > >>>>>>>>>
> > >>>>>>>>> _______________________________________________
> > >>>>>>>>> ceph-users mailing list
> > >>>>>>>>> [email protected]
> > >>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> --
> > >>>>>> Christian Balzer        Network/Systems Engineer
> > >>>>>> [email protected]           Global OnLine Japan/Rakuten
> Communications
> > >>>>>> http://www.gol.com/
> > >>>>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>
> > >>>
> > >
> > >
>
>
>
