Agreed with Alphe, Ceph Hammer (0.94.2) sucks when it comes to recovery
and rebalancing.

Here is my Ceph Hammer cluster, which has been like this for more than 30 hours.

You might be wondering about the one OSD that is down and out. That is
intentional; I want to remove that OSD.
I want the cluster to become healthy again before I remove it.
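
For reference, a minimal sketch of the removal sequence I plan to run once
things are healthy again, for an OSD that is already down and out (the <id>
is a placeholder for the actual OSD number):

 $ ceph osd crush remove osd.<id>    # drop it from the CRUSH map
 $ ceph auth del osd.<id>            # delete its cephx key
 $ ceph osd rm <id>                  # remove it from the osdmap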

Can someone help us with this problem?

 cluster 86edf8b8-b353-49f1-ab0a-a4827a9ea5e8
     health HEALTH_WARN
            14 pgs stuck unclean
            5 requests are blocked > 32 sec
            recovery 420/28358085 objects degraded (0.001%)
            recovery 199941/28358085 objects misplaced (0.705%)
            too few PGs per OSD (28 < min 30)
     monmap e3: 3 mons at {stor0201=10.100.1.201:6789/0,stor0202=10.100.1.202:6789/0,stor0203=10.100.1.203:6789/0}
            election epoch 1076, quorum 0,1,2 stor0201,stor0202,stor0203
     osdmap e778879: 96 osds: 95 up, 95 in; 14 remapped pgs
      pgmap v2475334: 896 pgs, 4 pools, 51364 GB data, 9231 kobjects
            150 TB used, 193 TB / 344 TB avail
            420/28358085 objects degraded (0.001%)
            199941/28358085 objects misplaced (0.705%)
                 879 active+clean
                  14 active+remapped
                   3 active+clean+scrubbing+deep
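
In case it is useful, the commands for inspecting the stuck PGs, plus the
knob behind the "too few PGs per OSD" warning (pool name, PG id and target
count are placeholders, not recommendations):

 $ ceph health detail                     # lists the stuck/remapped PGs
 $ ceph pg dump_stuck unclean             # shows which OSDs they map to
 $ ceph pg <pgid> query                   # per-PG detail
 $ ceph osd pool set <pool> pg_num <n>    # raise the PG count on a pool...
 $ ceph osd pool set <pool> pgp_num <n>   # ...then raise pgp_num to match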



On Tue, Sep 8, 2015 at 5:59 PM, Alphe Salas <[email protected]> wrote:

> I can say exactly the same. I have been using Ceph since 0.38 and I have
> never seen OSDs as laggy as with 0.94; the rebalancing/rebuild algorithm in
> 0.94 is crap. Seriously, I have 2 OSDs serving 2 discs of 2 TB with 4 GB of
> RAM, and each OSD takes 1.6 GB! Seriously! That snowballs into an avalanche.
>
> Let me be straight and explain what changed.
>
> In 0.38 you could ALWAYS stop the Ceph cluster and start it up again; it
> would check whether everyone was back and whether there were enough replicas,
> and only then start rebuilding/rebalancing what was needed. Of course it took
> about 10 minutes to bring the cluster up, but the rebuilding/rebalancing
> process itself was smooth.
> With 0.94, first you have 2 OSDs too full at 95% and 4 OSDs at 63%, out of
> 20 OSDs. Then you get a disc crash, so Ceph automatically starts to rebuild
> and rebalance, and the OSDs start to lag and then crash. You stop the
> cluster, change the drive, restart the cluster, stop all rebuild activity by
> setting nobackfill, norecover, noscrub and nodeep-scrub, rm the old OSD,
> create a new one, wait for all OSDs to be up and in, and then the
> rebuilding/rebalancing (and the lag) starts again; since it is automated
> there is not much choice there.
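>
> As a rough sketch, that flag juggling with the stock ceph CLI (these flag
> names exist in 0.94):
>
> $ ceph osd set nobackfill
> $ ceph osd set norecover
> $ ceph osd set noscrub
> $ ceph osd set nodeep-scrub
>   ... swap the drive, remove the old OSD, create the new one ...
> $ ceph osd unset nobackfill
> $ ceph osd unset norecover
> $ ceph osd unset noscrub
> $ ceph osd unset nodeep-scrub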
>
> And again all OSDs are stuck in an endless lag/down/recovery cycle...
>
> It is seriously a pain. Five days after changing the faulty disc it is still
> locked in the lag/down/recovery cycle.
>
> Sure, it can be argued that my machines are really resource-limited and that
> I should buy servers worth at least three thousand dollars. But up to 0.72
> that rebalancing/rebuilding process worked smoothly on the same hardware.
>
> It seems to me that the rebalancing/rebuilding algorithm is stricter now
> than it was in the past; back then, only what really, really needed to be
> rebuilt or rebalanced actually was.
>
> I can still delete everything and go back to 0.72... just as I could buy a
> Cray T-90 so that I never have problems again and Ceph runs smoothly. But
> that will not help make Ceph a better product.
>
> For me, Ceph 0.94 is like Windows Vista...
>
> Alphe Salas
> I.T. engineer
>
>
> On 09/08/2015 10:20 AM, Gregory Farnum wrote:
>
>> On Wed, Sep 2, 2015 at 9:34 PM, Bob Ababurko <[email protected]> wrote:
>>
>>> When I lose a disk or replace an OSD in my POC Ceph cluster, it takes a
>>> very long time to rebalance. I should note that my cluster is slightly
>>> unique in that I am using CephFS (shouldn't matter?) and it currently
>>> contains about 310 million objects.
>>>
>>> The last time I replaced a disk/OSD was 2.5 days ago and it is still
>>> rebalancing.  This is on a cluster with no client load.
>>>
>>> The configuration is 5 hosts, each with 6 x 1 TB 7200 rpm SATA OSDs and one
>>> 850 Pro SSD which holds the journals for those OSDs. That means 30 OSDs in
>>> total. The system disk is on its own disk. I'm also using a backend network
>>> with a single Gb NIC. The rebalancing rate (objects/s) seems to be very slow
>>> when it is close to finishing... say <1% of objects misplaced.
>>>
>>> It doesn't seem right that it would take 2+ days to rebalance a 1TB disk
>>> with no load on the cluster.  Are my expectations off?
>>>
>>
>> Possibly...Ceph basically needs to treat each object as a single IO.
>> If you're recovering from a failed disk then you've got to replicate
>> roughly 310 million * 3 / 30 = 31 million objects. If it's perfectly
>> balanced across 30 disks that get 80 IOPS that's 12916 seconds (~3.5
>> hours) worth of work just to read each file — and in reality it's
>> likely to take more than one IO to read the file, and then you have to
>> spend a bunch to write it as well.
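>>
>> (As a quick check of that arithmetic, with the 80 IOPS per disk assumption:
>>
>>   $ python -c 'o = 310e6 * 3 / 30; print(o); print(o / (30 * 80) / 3600)'
>>
>> which prints roughly 31 million objects and ~3.6 hours of pure reads.)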
>>
>>
>>> I'm not sure whether my pg_num/pgp_num needs to be changed, or whether the
>>> rebalance time depends on the number of objects in the pool. These are
>>> thoughts I've had but am not certain are relevant here.
>>>
>>
>> Rebalance time is dependent on the number of objects in the pool. You
>> *might* see an improvement by increasing "osd max push objects" from
>> its default of 10...or you might not. That many small files isn't
>> something I've explored.
>> -Greg
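>>
>> A sketch of bumping that at runtime with injectargs (the value 32 is only an
>> illustration, not a tested recommendation):
>>
>>   $ ceph tell osd.* injectargs '--osd-max-push-objects 32'
>>
>> The same setting can also be put in ceph.conf under [osd] as
>> "osd max push objects = 32".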
>>
>>
>>> $ sudo ceph -v
>>> ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>>>
>>> $ sudo ceph -s
>>>      cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
>>>       health HEALTH_WARN
>>>              5 pgs backfilling
>>>              5 pgs stuck unclean
>>>              recovery 3046506/676638611 objects misplaced (0.450%)
>>>       monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
>>>              election epoch 20, quorum 0,1,2 cephmon01,cephmon02,cephmon03
>>>       mdsmap e6070: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
>>>       osdmap e4395: 30 osds: 30 up, 30 in; 5 remapped pgs
>>>        pgmap v3100039: 2112 pgs, 3 pools, 6454 GB data, 321 Mobjects
>>>              18319 GB used, 9612 GB / 27931 GB avail
>>>              3046506/676638611 objects misplaced (0.450%)
>>>                  2095 active+clean
>>>                    12 active+clean+scrubbing+deep
>>>                     5 active+remapped+backfilling
>>> recovery io 2294 kB/s, 147 objects/s
>>>
>>> $ sudo rados df
>>> pool name            KB          objects    clones  degraded  unfound  rd        rd KB        wr         wr KB
>>> cephfs_data          6767569962  335746702  0       0         0        2136834   1            676984208  7052266742
>>> cephfs_metadata      42738       1058437    0       0         0        16130199  30718800215  295996938  3811963908
>>> rbd                  0           0          0       0         0        0         0            0          0
>>>    total used     19209068780    336805139
>>>    total avail    10079469460
>>>    total space    29288538240
>>>
>>> $ sudo ceph osd pool get cephfs_data pgp_num
>>> pg_num: 1024
>>> $ sudo ceph osd pool get cephfs_metadata pgp_num
>>> pg_num: 1024
>>>
>>>
>>> thanks,
>>> Bob
>>>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
