In addition, I was able to extract some logs from the last time the active/peering problem happened: http://pastebin.com/BakFREFP (it ends with me restarting the OSD).
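Next time it happens I'll try to capture some state before restarting anything. Roughly something like the following (a rough sketch; the pg id and the osd number are placeholders, the real pg ids would come from the stuck entries in the dump):

    # snapshot the overall state and the full pg table
    ceph -s > /tmp/ceph-status.txt
    ceph health detail > /tmp/ceph-health.txt
    ceph pg dump > /tmp/pg-dump.txt

    # query one of the pgs stuck in active or peering (2.1f is a placeholder)
    ceph pg 2.1f query > /tmp/pg-2.1f.json

    # only then restart the suspect OSD (osd.12 is a placeholder)
    service ceph restart osd.12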
On Mon, Apr 1, 2013 at 10:23 AM, Erdem Agaoglu <[email protected]> wrote:

> Hi all,
>
> We are currently in the process of enlarging our bobtail cluster by
> adding OSDs. We have 12 disks per machine and we create one OSD per
> disk, adding them one by one as recommended. The only thing we don't do
> is start with a small weight and increase it gradually; all weights are 1.
>
> In this scenario both rbd and radosgw are unresponsive for only the
> first two minutes after a new OSD is added. After that small hiccup we
> have some pgs in states like active+remapped+wait_backfill,
> active+remapped+backfilling, active+recovery_wait+remapped and
> active+degraded+remapped+backfilling, and everything works OK. After a
> few hours of backfilling and recovery all pgs become active+clean and
> we add another OSD.
>
> But sometimes that small hiccup takes longer than a few minutes. At
> those times the status shows some pgs stuck in active and some stuck in
> peering. When we look at the pg dump we see that all of those active or
> peering pgs are on the same 2 OSDs and are unable to move forward. At
> this stage rbd works poorly and radosgw is completely stalled. Only
> after we restart one of those 2 OSDs do the pgs start to backfill and
> the clients continue with their operations.
>
> Since this is a live cluster we don't want to wait too long, so we
> usually restart the OSD in a hurry. That's why I cannot currently
> provide status or pg query outputs. We have some logs, but I don't know
> what to look for or whether they are verbose enough.
>
> Could this be some kind of known issue? If not, where should I look to
> get an idea of what's happening when it occurs?
>
> Thanks in advance
>
> --
> erdem agaoglu

--
erdem agaoglu
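P.S. Regarding the weighting: if it turns out to matter, the gradual approach we currently skip would look roughly like this (an untested sketch; osd.12, the step size and the crude grep check are all made up):

    # ramp the crush weight of the new OSD in 0.2 steps, waiting for
    # backfill/recovery to settle before each bump
    for w in 0.2 0.4 0.6 0.8 1.0; do
        ceph osd crush reweight osd.12 $w
        # crude wait: loop until ceph -s no longer reports pgs in
        # peering/backfill/recovery states
        while ceph -s | grep -qE 'peering|backfill|recover'; do
            sleep 60
        done
    done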
