Hello,
Sorry for the misunderstanding about IOPS. Here are some summary stats
from my benchmark (does 20-30 IOPS seem normal to you?):
ceph osd pool create test 512 512
rados bench -p test 10 write --no-cleanup
Total time run: 10.480383
Total writes made: 288
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 109.92
Stddev Bandwidth: 11.9926
Max bandwidth (MB/sec): 124
Min bandwidth (MB/sec): 80
Average IOPS: 27
Stddev IOPS: 3
Max IOPS: 31
Min IOPS: 20
Average Latency(s): 0.579105
Stddev Latency(s): 0.19902
Max latency(s): 1.32831
Min latency(s): 0.245505
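As far as I understand, rados bench defaults to 4 MB objects and 16
concurrent ops, so the IOPS above should simply be bandwidth / object size
(and roughly concurrency / average latency). Quick arithmetic check:

echo "109.92 / 4" | bc -l      # bandwidth / object size   ~= 27 ops/s
echo "16 / 0.579105" | bc -l   # concurrency / avg latency ~= 27.6 ops/s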
rados bench -p test 10 seq
Total time run: 10.340724
Total reads made: 288
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 111.404
Average IOPS: 27
Stddev IOPS: 2
Max IOPS: 31
Min IOPS: 22
Average Latency(s): 0.564858
Max latency(s): 1.65278
Min latency(s): 0.141504
rados bench -p test 10 rand
Total time run: 10.546251
Total reads made: 293
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 111.13
Average IOPS: 27
Stddev IOPS: 2
Max IOPS: 32
Min IOPS: 24
Average Latency(s): 0.57092
Max latency(s): 1.8631
Min latency(s): 0.161936
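If it helps, I can also do a small-block run to get a truer IOPS number
rather than 4 MB streaming throughput; something like this should work
(assuming I have the -b block-size and -t concurrency flags right):

rados bench -p test 30 write -b 4096 -t 16 --no-cleanup
rados bench -p test 30 rand -t 16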
On Tue, Jul 12, 2016 at 9:18 PM, Christian Balzer <[email protected]> wrote:
>
> Hello,
>
> On Tue, 12 Jul 2016 20:57:00 -0500 Di Zhang wrote:
>
> > I am using 10G infiniband for cluster network and 1G ethernet for public.
> Hmm, very unbalanced, but I guess that's HW you already had.
>
> > Because I don't have enough slots on the node, so I am using three files
> on
> > the OS drive (SSD) for journaling, which really improved but not entirely
> > solved the problem.
> >
> If you can, use partitions instead of files, less overhead.
> What model SSD is that?
>
> Also putting the meta-data pool on SSDs might help.
>
> > I am quite happy with the current IOPS, which range from 200 MB/s to 400
> > MB/s sequential write, depending on the block size.
> That's not IOPS, that's bandwidth, throughput.
>
> >But the problem is,
> > when I transfer data to the cephfs at a rate below 100MB/s, I can observe
> > the slow/blocked requests warnings after a few minutes via "ceph -w".
>
> I doubt the speed has anything to do with this, but the actual block size
> and IOPS numbers.
>
> As always, watch your storage nodes with atop (or iostat) during such
> scenarios/tests and spot the bottlenecks.
>
> >It's
> > not specific to any particular OSDs. So I started to doubt if the
> > configuration is correct or upgrading to Jewel can solve it.
> >
> Jewel is likely to help in general, but can't fix insufficient HW or
> broken configurations.
>
> > There are about 5,000,000 objects currently in the cluster.
> >
> You're probably not hitting this, but read the recent filestore merge and
> split threads, including the entirety of this thread:
> https://www.mail-archive.com/[email protected]/msg29243.html
>
> Christian
>
> > Thanks for the hints.
> >
> > On Tue, Jul 12, 2016 at 8:19 PM, Christian Balzer <[email protected]> wrote:
> >
> > >
> > > Hello,
> > >
> > > On Tue, 12 Jul 2016 19:54:38 -0500 Di Zhang wrote:
> > >
> > > > It's a 5 nodes cluster. Each node has 3 OSDs. I set pg_num = 512 for
> both
> > > > cephfs_data and cephfs_metadata. I experienced some slow/blocked
> requests
> > > > issues when I was using hammer 0.94.x and prior. So I was thinking
> if the
> > > > pg_num is too large for metadata.
> > >
> > > Very, VERY much doubt this.
> > >
> > > Your "ideal" values for a cluster of this size (are you planning to
> grow
> > > it?) would be about 1024 PGs for data and 128 or 256 PGs for meta-data.
> > >
> > > Not really that far off and more importantly not overloading the OSDs
> with
> > > too many PGs in total. Or do you have more pools?
> > >
> > >
> > > >I just upgraded the cluster to Jewel
> > > > today. Will watch if the problem remains.
> > > >
> > > Jewel improvements might mask things, but I'd venture that your
> problems
> > > were caused by your HW not being sufficient for the load.
> > >
> > > As in, do you use SSD journals, etc?
> > > How many IOPS do you need/expect from your CephFS?
> > > How many objects are in there?
> > >
> > > Christian
> > >
> > > > Thank you.
> > > >
> > > > On Tue, Jul 12, 2016 at 6:45 PM, Gregory Farnum <[email protected]>
> > > wrote:
> > > >
> > > > > I'm not at all sure that rados cppool actually captures everything
> (it
> > > > > might). Doug has been working on some similar stuff for disaster
> > > > > recovery testing and can probably walk you through moving over.
> > > > >
> > > > > But just how large *is* your metadata pool in relation to others?
> > > > > Having a too-large pool doesn't cost much unless it's
> > > > > grossly-inflated, and having a nice distribution of your folders is
> > > > > definitely better than not.
> > > > > -Greg
> > > > >
> > > > > On Tue, Jul 12, 2016 at 4:14 PM, Di Zhang <[email protected]>
> > > wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Is there any way to change the metadata pool for a cephfs
> without
> > > > > losing
> > > > > > any existing data? I know how to clone the metadata pool using
> rados
> > > > > cppool.
> > > > > > But the filesystem still links to the original metadata pool no
> > > matter
> > > > > what
> > > > > > you name it.
> > > > > >
> > > > > > The motivation here is to decrease the pg_num of the metadata
> > > pool. I
> > > > > > created this cephfs cluster sometime ago, while I didn't realize
> > > that I
> > > > > > shouldn't assign a large pg_num to such a small pool.
> > > > > >
> > > > > > I'm not sure if I can delete the fs and re-create it using
> the
> > > > > existing
> > > > > > data pool and the cloned metadata pool.
> > > > > >
> > > > > > Thank you.
> > > > > >
> > > > > >
> > > > > > Zhang Di
> > > > > >
> > > > > >
> > > > >
> > >
> > >
> > > --
> > > Christian Balzer Network/Systems Engineer
> > > [email protected] Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
> > >
>
>
> --
> Christian Balzer Network/Systems Engineer
> [email protected] Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com