Thanks Sam, I'll provide details if it keeps happening.
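(For next time: the outputs mentioned below can be snapshotted in a few seconds before restarting the stuck OSDs. A minimal sketch, assuming recent-enough `ceph` CLI subcommands and a writable /tmp; the output directory name and the `dump_stuck inactive` filter are assumptions, adjust to taste.)

```shell
#!/bin/sh
# Sketch: capture cluster diagnostics BEFORE restarting a stuck OSD,
# so the evidence survives the restart. Output path is an assumption.
OUT="/tmp/ceph-stuck-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUT"

ceph -s      > "$OUT/status.txt"    # overall cluster status
ceph pg dump > "$OUT/pg_dump.txt"   # full pg listing with states

# Record the peering state of every pg that is stuck inactive;
# pg ids come from 'ceph pg dump_stuck' (skip its header line).
for pgid in $(ceph pg dump_stuck inactive 2>/dev/null | awk 'NR>1 {print $1}'); do
    ceph pg "$pgid" query > "$OUT/pg_query_$pgid.txt"
done
```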
On Thu, Apr 4, 2013 at 4:01 PM, Sam Lang <[email protected]> wrote:
> Hi Erdem,
>
> This is likely a bug. We've created a ticket to keep track:
> http://tracker.ceph.com/issues/4645
>
> -slang [inktank dev | http://www.inktank.com | http://www.ceph.com]
>
> On Mon, Apr 1, 2013 at 3:18 AM, Erdem Agaoglu <[email protected]> wrote:
> > In addition, I was able to extract some logs from the last time the
> > active/peering problem happened: http://pastebin.com/BakFREFP
> > It ends with me restarting it.
> >
> > On Mon, Apr 1, 2013 at 10:23 AM, Erdem Agaoglu <[email protected]> wrote:
> >> Hi all,
> >>
> >> We are currently in the process of enlarging our bobtail cluster by
> >> adding OSDs. We have 12 disks per machine, we are creating one OSD per
> >> disk, and we are adding them one by one as recommended. The only thing
> >> we don't do is start each OSD with a small weight and increase it
> >> slowly; all weights are 1.
> >>
> >> In this scenario both rbd and radosgw are unable to respond, but only
> >> during the first two minutes after adding a new OSD. After that small
> >> hiccup we have some pgs in states like active+remapped+wait_backfill,
> >> active+remapped+backfilling, active+recovery_wait+remapped, and
> >> active+degraded+remapped+backfilling, and everything works OK. After a
> >> few hours of backfilling and recovery all pgs become active+clean and
> >> we add another OSD.
> >>
> >> But sometimes that small hiccup lasts longer than a few minutes. At
> >> those times the status shows some pgs stuck in active and some stuck
> >> in peering. When we look at the pg dump we see that all of those
> >> active or peering pgs are on the same 2 OSDs and are unable to move
> >> forward. At this stage rbd performs poorly and radosgw is completely
> >> stalled. Only after restarting one of those 2 OSDs do the pgs start to
> >> backfill and the clients continue with their operations.
> >>
> >> Since this is a live cluster we don't want to wait too long, so we
> >> usually restart the OSD in a hurry. That's why I cannot currently
> >> provide status or pg query outputs. We have some logs, but I don't
> >> know what to look for or whether they are verbose enough.
> >>
> >> Could this be any kind of known issue? If not, where should I look to
> >> get any ideas about what's happening when it occurs?
> >>
> >> Thanks in advance
> >>
> >> --
> >> erdem agaoglu
> >
> > --
> > erdem agaoglu
> >
> > _______________________________________________
> > ceph-users mailing list
> > [email protected]
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
erdem agaoglu
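(For anyone finding this thread later: the "start with a small weight and increase it slowly" approach mentioned above can be sketched roughly as below. This is a hedged sketch, not an official procedure; the OSD id, the weight steps, and the use of `ceph health` as a "recovery settled" check are all assumptions.)

```shell
#!/bin/sh
# Sketch: ramp a newly added OSD's CRUSH weight up in steps instead
# of adding it at full weight 1. OSD id and step sizes are assumptions.
OSD=osd.12
for w in 0.2 0.4 0.6 0.8 1.0; do
    ceph osd crush reweight "$OSD" "$w"
    # Wait for backfill/recovery to settle before the next bump.
    # Polling 'ceph health' for HEALTH_OK is a crude but simple check.
    until ceph health | grep -q HEALTH_OK; do
        sleep 60
    done
done
```

Each smaller reweight moves fewer pgs at once, so the peering/backfill load after each step is a fraction of what a single jump to weight 1 causes.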
