On Wed, Sep 2, 2015 at 9:34 PM, Bob Ababurko <[email protected]> wrote:
> When I lose a disk or replace an OSD in my POC Ceph cluster, it takes a very
> long time to rebalance. I should note that my cluster is slightly unique in
> that I am using CephFS (shouldn't matter?) and it currently contains about
> 310 million objects.
>
> The last time I replaced a disk/OSD was 2.5 days ago and it is still
> rebalancing. This is on a cluster with no client load.
>
> The configuration is 5 hosts, each with 6 x 1TB 7200rpm SATA OSDs and one 850
> Pro SSD holding the journals for those OSDs, so 30 OSDs in total. The system
> disk is on its own drive, and I'm using a backend network with a single Gb
> NIC. The rebalancing rate (objects/s) seems to be very slow when it is close
> to finishing... say <1% objects misplaced.
>
> It doesn't seem right that it would take 2+ days to rebalance a 1TB disk
> with no load on the cluster. Are my expectations off?
Possibly... Ceph basically needs to treat each object as a single IO.
If you're recovering from a failed disk then you've got to replicate
roughly 310 million * 3 / 30 = 31 million objects. Even if that work is
perfectly balanced across 30 disks doing 80 IOPS each, that's about
12,917 seconds (~3.6 hours) of work just to read each file; in reality
it's likely to take more than one IO to read a file, and then you have
to spend a bunch more to write it as well.
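To make that estimate concrete, here's the arithmetic above as a quick sketch. All figures are the ones quoted in this thread; the 80 IOPS per disk is an assumed typical rate for 7200rpm SATA, and the one-IO-per-object read is the optimistic best case:

```python
# Back-of-the-envelope recovery time for one failed OSD's worth of data.
objects_total = 310_000_000   # objects in the cluster (from the thread)
replicas = 3                  # replication factor
osds = 30                     # 5 hosts x 6 OSDs
iops_per_osd = 80             # assumed for a 7200rpm SATA disk

# Objects that must be re-replicated when one of 30 OSDs is lost:
objects_to_recover = objects_total * replicas // osds   # = 31,000,000

# Best case: reads spread evenly across all OSDs, one IO per object
# (small files often need more, and writes come on top of this).
seconds = objects_to_recover / (osds * iops_per_osd)
print(f"{objects_to_recover:,} objects -> ~{seconds:,.0f} s (~{seconds/3600:.1f} h) just to read")
```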
>
> I'm not sure if my pg_num/pgp_num needs to be changed, or if the rebalance
> time is dependent on the number of objects in the pool. These are thoughts
> I've had but am not certain are relevant here.
Rebalance time is dependent on the number of objects in the pool. You
*might* see an improvement by increasing "osd max push objects" from
its default of 10...or you might not. That many small files isn't
something I've explored.
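If you want to experiment with that knob, something like the following should work; note that the value 100 below is just an illustration to try, not a tested recommendation:

```shell
# Change it at runtime on all OSDs (no restart needed):
ceph tell osd.* injectargs '--osd-max-push-objects 100'

# Or persist it across restarts in ceph.conf, in the [osd] section:
#   [osd]
#   osd max push objects = 100
```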
-Greg
>
> $ sudo ceph -v
> ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>
> $ sudo ceph -s
> cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
> health HEALTH_WARN
> 5 pgs backfilling
> 5 pgs stuck unclean
> recovery 3046506/676638611 objects misplaced (0.450%)
> monmap e1: 3 mons at
> {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
> election epoch 20, quorum 0,1,2 cephmon01,cephmon02,cephmon03
> mdsmap e6070: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
> osdmap e4395: 30 osds: 30 up, 30 in; 5 remapped pgs
> pgmap v3100039: 2112 pgs, 3 pools, 6454 GB data, 321 Mobjects
> 18319 GB used, 9612 GB / 27931 GB avail
> 3046506/676638611 objects misplaced (0.450%)
> 2095 active+clean
> 12 active+clean+scrubbing+deep
> 5 active+remapped+backfilling
> recovery io 2294 kB/s, 147 objects/s
>
> $ sudo rados df
> pool name          KB           objects    clones  degraded  unfound  rd        rd KB        wr         wr KB
> cephfs_data        6767569962   335746702  0       0         0        2136834   1            676984208  7052266742
> cephfs_metadata    42738        1058437    0       0         0        16130199  30718800215  295996938  3811963908
> rbd                0            0          0       0         0        0         0            0          0
> total used         19209068780  336805139
> total avail        10079469460
> total space        29288538240
>
> $ sudo ceph osd pool get cephfs_data pgp_num
> pg_num: 1024
> $ sudo ceph osd pool get cephfs_metadata pgp_num
> pg_num: 1024
>
>
> thanks,
> Bob
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>