On 22 November 2016 at 13:09, Raghavendra Gowdappa <[email protected]> wrote:
> ----- Original Message -----
> From: "Vijay Bellur" <[email protected]>
> To: "Nithya Balachandran" <[email protected]>
> Cc: "Gluster Devel" <[email protected]>
> Sent: Wednesday, November 16, 2016 9:41:12 AM
> Subject: Re: [Gluster-devel] Upstream smoke test failures
>
> > On Tue, Nov 15, 2016 at 8:40 AM, Nithya Balachandran
> > <[email protected]> wrote:
> > >
> > > On 15 November 2016 at 18:55, Vijay Bellur <[email protected]> wrote:
> > >>
> > >> On Mon, Nov 14, 2016 at 10:34 PM, Nithya Balachandran
> > >> <[email protected]> wrote:
> > >> >
> > >> > On 14 November 2016 at 21:38, Vijay Bellur <[email protected]> wrote:
> > >> >>
> > >> >> I would prefer that we disable dbench only if we have an owner for
> > >> >> fixing the problem and re-enabling it as part of smoke tests.
> > >> >> Running dbench seamlessly on gluster has worked for a long while,
> > >> >> and if it is failing today, we need to address this regression asap.
> > >> >>
> > >> >> Does anybody have more context or clues on why dbench is failing now?
> > >> >>
> > >> > While I agree that it needs to be looked at asap, leaving it in until
> > >> > we get an owner seems rather pointless, as all it does is hold up
> > >> > various patches and waste machine time. Re-triggering it multiple
> > >> > times so that it eventually passes does not add anything to the
> > >> > regression test process or validate the patch, as we know there is a
> > >> > problem.
> > >> >
> > >> > I would vote for removing it and assigning someone to look at it
> > >> > immediately.
> > >> >
> > >>
> > >> From the debugging done so far, can we identify an owner to whom this
> > >> can be assigned? I looked around for related discussions and could
> > >> figure out that we are looking to get statedumps. Do we have more
> > >> information/context beyond this?
> > >>
> > > I have updated the BZ
> > > (https://bugzilla.redhat.com/show_bug.cgi?id=1379228) with info from
> > > the last failure - looks like hangs in write-behind and read-ahead.
> > >
> >
> > I spent some time on this today and it does look like write-behind is
> > absorbing READs without performing any WIND/UNWIND actions. I have
> > attached a statedump from a slave that had the dbench problem (thanks,
> > Nigel!) to the above bug.
> >
> > Snip from the statedump:
> >
> > [global.callpool.stack.2]
> > stack=0x7fd970002cdc
> > uid=0
> > gid=0
> > pid=31884
> > unique=37870
> > lk-owner=0000000000000000
> > op=READ
> > type=1
> > cnt=2
> >
> > [global.callpool.stack.2.frame.1]
> > frame=0x7fd9700036ac
> > ref_count=0
> > translator=patchy-read-ahead
> > complete=0
> > parent=patchy-readdir-ahead
> > wind_from=ra_page_fault
> > wind_to=FIRST_CHILD(fault_frame->this)->fops->readv
> > unwind_to=ra_fault_cbk
> >
> > [global.callpool.stack.2.frame.2]
> > frame=0x7fd97000346c
> > ref_count=1
> > translator=patchy-readdir-ahead
> > complete=0
> >
> > Note that the frame which was wound from ra_page_fault() to
> > write-behind is not yet complete, and write-behind has not progressed
> > the call. There are several call stacks with a similar signature in the
> > statedump.
>
> I think the culprit here is read-ahead, not write-behind. If the read fop
> had been dropped in write-behind, we should have seen a frame associated
> with write-behind (complete=0 for a frame associated with an xlator
> indicates that the frame was not unwound from _that_ xlator), but I didn't
> see any. The empty request queues in wb_inode also corroborate this
> hypothesis.
We have seen both. See comment #17 in
https://bugzilla.redhat.com/show_bug.cgi?id=1379228.

Regards,
Nithya

> Karthick Subrahmanya is working on a similar issue reported by a user.
> However, we've not made much progress till now.
>
> > In write-behind's readv implementation, we stub READ fops and enqueue
> > them in the relevant inode context. Once enqueued, the stub resumes
> > when the appropriate set of conditions occurs in write-behind. This is
> > not happening now, and I am not certain whether:
> >
> > - READ fops are languishing in a queue and not being resumed, or
> > - READ fops are prematurely dropped from a queue without being wound
> >   or unwound.
> >
> > When I gdb'd into the client process and examined the inode contexts
> > for write-behind, I found all queues to be empty. This seems to
> > indicate that the latter reason is more plausible, but I have not yet
> > found a code path to account for this possibility.
> >
> > One approach to proceed further is to add more logs in write-behind to
> > get a better understanding of the problem. I will try that out
> > sometime later this week. We are also considering disabling
> > write-behind for smoke tests in the interim, after a trial run (with
> > write-behind disabled) later in the day.
> >
> > Thanks,
> > Vijay
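For reference, a statedump like the one quoted above can be captured from a
hung client without restarting it. A rough recipe, assuming a FUSE mount of
the smoke-test volume "patchy" and default statedump paths (both are
assumptions; adjust for the build in question):

$ pgrep -af 'glusterfs.*patchy'     # find the glusterfs client (FUSE) process
$ kill -USR1 <pid-from-above>       # SIGUSR1 asks a gluster process to dump state
$ ls -lt /var/run/gluster/          # dumps land here by default, one file per process

For the brick side, `gluster volume statedump patchy` collects the same
information from the server processes.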
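The stub-and-queue mechanism Vijay describes follows a common xlator
pattern. Below is a minimal standalone C model of that pattern, not the
actual write-behind code: all names and types are invented for illustration
(the real xlator parks a call stub, built with fop_readv_stub(), on a list
in the inode context).

/* Minimal model of "stub the fop, queue it, resume it later".
 * Invented types; the real code lives in the write-behind xlator. */
#include <stdio.h>
#include <stdlib.h>

struct stub {
    struct stub *next;
    int          fop_id;           /* stands in for the wrapped READ fop   */
};

struct wb_inode {                  /* stands in for write-behind's inode ctx */
    struct stub *head, *tail;      /* queue of parked requests             */
    int          writes_in_flight; /* condition that blocks reads          */
};

/* Enqueue: called from the readv path instead of winding immediately. */
static void wb_enqueue(struct wb_inode *wb, int fop_id)
{
    struct stub *s = calloc(1, sizeof(*s));
    s->fop_id = fop_id;
    if (wb->tail)
        wb->tail->next = s;
    else
        wb->head = s;
    wb->tail = s;
}

/* Resume: must run whenever the blocking condition clears. If any path
 * empties the queue without reaching this point (or drops a stub), the
 * READ is never wound or unwound -- the hang signature in the statedump. */
static void wb_process_queue(struct wb_inode *wb)
{
    while (wb->head && wb->writes_in_flight == 0) {
        struct stub *s = wb->head;
        wb->head = s->next;
        if (!wb->head)
            wb->tail = NULL;
        printf("winding READ fop %d\n", s->fop_id); /* STACK_WIND here */
        free(s);
    }
}

int main(void)
{
    struct wb_inode wb = {0};
    wb.writes_in_flight = 1;       /* writes are outstanding            */
    wb_enqueue(&wb, 1);            /* READ arrives and gets parked      */
    wb.writes_in_flight = 0;       /* writes complete ...               */
    wb_process_queue(&wb);         /* ... and must resume parked READs  */
    return 0;
}

The failure mode Vijay suspects corresponds to a stub leaving the queue
(hence gdb seeing empty queues) without wb_process_queue() ever winding or
unwinding it.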
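The gdb inspection mentioned above can be reproduced roughly as follows,
assuming debug symbols are installed; the exact structure and member names
depend on the source tree, so treat them as placeholders:

$ gdb -p <client-pid>
(gdb) thread apply all bt       # look for threads parked under wb_*/ra_* frames
(gdb) frame <n>                 # select a frame inside the write-behind xlator
(gdb) print *wb_inode           # inspect its request lists by hand
(gdb) detach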
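For the trial run with write-behind disabled, the relevant volume options
can be toggled at runtime. Assuming the smoke-test volume is named "patchy":

$ gluster volume set patchy performance.write-behind off
$ gluster volume set patchy performance.read-ahead off   # given the read-ahead suspicion above

# Restore the defaults once an owner has a fix in hand:
$ gluster volume reset patchy performance.write-behind
$ gluster volume reset patchy performance.read-ahead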
_______________________________________________
Gluster-devel mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-devel
