Thanks Sam,

I'll provide details if it keeps happening.


On Thu, Apr 4, 2013 at 4:01 PM, Sam Lang <[email protected]> wrote:

> Hi Erdem,
>
> This is likely a bug.  We've created a ticket to keep track:
> http://tracker.ceph.com/issues/4645.
>
> -slang [inktank dev | http://www.inktank.com | http://www.ceph.com]
>
> On Mon, Apr 1, 2013 at 3:18 AM, Erdem Agaoglu <[email protected]>
> wrote:
> > In addition, I was able to extract some logs from the last time the
> > active/peering problem happened:
> > http://pastebin.com/BakFREFP
> > It ends with me restarting it.
> >
> >
> > On Mon, Apr 1, 2013 at 10:23 AM, Erdem Agaoglu <[email protected]>
> > wrote:
> >>
> >> Hi all,
> >>
> >> We are currently in the process of enlarging our bobtail cluster by
> >> adding OSDs. We have 12 disks per machine and we are creating one OSD
> >> per disk, adding them one by one as recommended. The only thing we
> >> don't do is start with a small weight and increase it slowly; the
> >> weights are all 1.
> >>
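> >> For reference, this is roughly what the gradual-weight approach would
> >> look like on our side (osd.12, the 0.2 increment, and the pauses are
> >> just illustrative, not values we actually used):
> >>
> >> ```shell
> >> # raise the new OSD's CRUSH weight in small steps, waiting for the
> >> # cluster to return to active+clean between steps
> >> ceph osd crush reweight osd.12 0.2
> >> # ... wait for active+clean ...
> >> ceph osd crush reweight osd.12 0.5
> >> # ... wait for active+clean ...
> >> ceph osd crush reweight osd.12 1.0
> >> ```
> >>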
> >> In this scenario both rbd and radosgw are unable to respond, but only
> >> during the first two minutes after adding a new OSD. After that small
> >> hiccup, we have some pgs like active+remapped+wait_backfill,
> >> active+remapped+backfilling, active+recovery_wait+remapped, and
> >> active+degraded+remapped+backfilling, and everything works OK. After a
> >> few hours of backfilling and recovery, all pgs become active+clean and
> >> we add another OSD.
> >>
> >> But sometimes that small hiccup takes longer than a few minutes. At
> >> those times, status shows some pgs stuck in active and some stuck in
> >> peering. When we look at the pg dump, we see that all those active or
> >> peering pgs are on the same 2 OSDs and are unable to move forward. At
> >> this stage rbd works poorly and radosgw is completely stalled. Only
> >> after restarting one of those 2 OSDs do the pgs start to backfill and
> >> the clients continue with their operations.
> >>
> >> Since this is a live cluster we don't want to wait too long, so we
> >> usually restart the OSD in a hurry. That's why I cannot currently
> >> provide status or pg query outputs. We have some logs, but I don't
> >> know what to look for or whether they are verbose enough.
> >>
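> >> Next time it happens, this is roughly what I plan to capture before
> >> restarting anything (the pg id 3.45 below is just a placeholder):
> >>
> >> ```shell
> >> # overall health plus the pgs stuck in non-active states
> >> ceph health detail
> >> ceph pg dump_stuck inactive
> >> # query one of the stuck pgs (placeholder pgid) for its peering state
> >> ceph pg 3.45 query
> >> ```
> >>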
> >> Could this be any kind of known issue? If not, where should I look to
> >> get an idea of what's happening when it occurs?
> >>
> >> Thanks in advance
> >>
> >> --
> >> erdem agaoglu
> >
> >
> >
> >
> > --
> > erdem agaoglu
> >
> > _______________________________________________
> > ceph-users mailing list
> > [email protected]
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>



-- 
erdem agaoglu