On 22 November 2016 at 13:09, Raghavendra Gowdappa <[email protected]> wrote:
> ----- Original Message -----
> From: "Vijay Bellur" <[email protected]>
> To: "Nithya Balachandran" <[email protected]>
> Cc: "Gluster Devel" <[email protected]>
> Sent: Wednesday, November 16, 2016 9:41:12 AM
> Subject: Re: [Gluster-devel] Upstream smoke test failures
>
> > On Tue, Nov 15, 2016 at 8:40 AM, Nithya Balachandran
> > <[email protected]> wrote:
> > >
> > > On 15 November 2016 at 18:55, Vijay Bellur <[email protected]> wrote:
> > >>
> > >> On Mon, Nov 14, 2016 at 10:34 PM, Nithya Balachandran
> > >> <[email protected]> wrote:
> > >> >
> > >> > On 14 November 2016 at 21:38, Vijay Bellur <[email protected]> wrote:
> > >> >>
> > >> >> I would prefer that we disable dbench only if we have an owner for
> > >> >> fixing the problem and re-enabling it as part of smoke tests.
> > >> >> Running dbench seamlessly on gluster has worked for a long while,
> > >> >> and if it is failing today, we need to address this regression asap.
> > >> >>
> > >> >> Does anybody have more context or clues on why dbench is failing now?
> > >> >>
> > >> > While I agree that it needs to be looked at asap, leaving it in until
> > >> > we get an owner seems rather pointless, as all it does is hold up
> > >> > various patches and waste machine time. Re-triggering it multiple
> > >> > times so that it eventually passes does not add anything to the
> > >> > regression test process or validate the patch, as we know there is a
> > >> > problem.
> > >> >
> > >> > I would vote for removing it and assigning someone to look at it
> > >> > immediately.
> > >> >
> > >>
> > >> From the debugging done so far, can we identify an owner to whom this
> > >> can be assigned? I looked around for related discussions and could
> > >> figure out that we are looking to get statedumps. Do we have more
> > >> information/context beyond this?
> > >>
> > > I have updated the BZ
> > > (https://bugzilla.redhat.com/show_bug.cgi?id=1379228) with info from
> > > the last failure - looks like hangs in write-behind and read-ahead.
> > >
> >
> > I spent some time on this today and it does look like write-behind is
> > absorbing READs without performing any WIND/UNWIND actions. I have
> > attached a statedump from a slave that had the dbench problem (thanks,
> > Nigel!) to the above bug.
> >
> > Snip from the statedump:
> >
> > [global.callpool.stack.2]
> > stack=0x7fd970002cdc
> > uid=0
> > gid=0
> > pid=31884
> > unique=37870
> > lk-owner=0000000000000000
> > op=READ
> > type=1
> > cnt=2
> >
> > [global.callpool.stack.2.frame.1]
> > frame=0x7fd9700036ac
> > ref_count=0
> > translator=patchy-read-ahead
> > complete=0
> > parent=patchy-readdir-ahead
> > wind_from=ra_page_fault
> > wind_to=FIRST_CHILD(fault_frame->this)->fops->readv
> > unwind_to=ra_fault_cbk
> >
> > [global.callpool.stack.2.frame.2]
> > frame=0x7fd97000346c
> > ref_count=1
> > translator=patchy-readdir-ahead
> > complete=0
> >
> > Note that the frame which was wound from ra_page_fault() to
> > write-behind is not yet complete, and write-behind has not progressed
> > the call. There are several call stacks with a similar signature in the
> > statedump.
>
> I think the culprit here is read-ahead, not write-behind. If the read fop
> had been dropped in write-behind, we should have seen a frame associated
> with write-behind (complete=0 for a frame associated with an xlator
> indicates that the frame was not unwound from _that_ xlator), but I didn't
> see any. The empty request queues in wb_inode also corroborate this
> hypothesis.
We have seen both. See comment #17 in
https://bugzilla.redhat.com/show_bug.cgi?id=1379228.

Regards,
Nithya

> Karthick Subrahmanya is working on a similar issue reported by a user.
> However, we've not made much progress till now.
>
> > In write-behind's readv implementation, we stub READ fops and enqueue
> > them in the relevant inode context. Once enqueued, the stub resumes
> > when the appropriate set of conditions occurs in write-behind. This is
> > not happening now, and I am not certain whether:
> >
> > - READ fops are languishing in a queue and not being resumed, or
> > - READ fops are prematurely dropped from a queue without being wound
> >   or unwound.
> >
> > When I gdb'd into the client process and examined the inode contexts
> > for write-behind, I found all queues to be empty. This seems to
> > indicate that the latter reason is more plausible, but I have not yet
> > found a code path to account for this possibility.
> >
> > One approach to proceed further is to add more logs in write-behind to
> > get a better understanding of the problem. I will try that out
> > sometime later this week. We are also considering disabling
> > write-behind for smoke tests in the interim, after a trial run (with
> > write-behind disabled) later in the day.
> >
> > Thanks,
> > Vijay
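For reference, a statedump like the one quoted above can be captured from a
hung client without restarting it. A rough recipe, assuming a FUSE mount of
the smoke-test volume "patchy" and default statedump paths (both are
assumptions; adjust for the build in question):

$ pgrep -af 'glusterfs.*patchy'     # find the glusterfs client (FUSE) process
$ kill -USR1 <pid-from-above>       # SIGUSR1 asks a gluster process to dump state
$ ls -lt /var/run/gluster/          # dumps land here by default, one file per process

For the brick side, `gluster volume statedump patchy` collects the same
information from the server processes.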
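The stub-and-queue mechanism Vijay describes follows a common xlator
pattern. Below is a minimal standalone C model of that pattern, not the
actual write-behind code: all names and types are invented for illustration
(the real xlator parks a call stub, built with fop_readv_stub(), on a list
in the inode context).

/* Minimal model of "stub the fop, queue it, resume it later".
 * Invented types; the real code lives in the write-behind xlator. */
#include <stdio.h>
#include <stdlib.h>

struct stub {
    struct stub *next;
    int          fop_id;           /* stands in for the wrapped READ fop   */
};

struct wb_inode {                  /* stands in for write-behind's inode ctx */
    struct stub *head, *tail;      /* queue of parked requests             */
    int          writes_in_flight; /* condition that blocks reads          */
};

/* Enqueue: called from the readv path instead of winding immediately. */
static void wb_enqueue(struct wb_inode *wb, int fop_id)
{
    struct stub *s = calloc(1, sizeof(*s));
    s->fop_id = fop_id;
    if (wb->tail)
        wb->tail->next = s;
    else
        wb->head = s;
    wb->tail = s;
}

/* Resume: must run whenever the blocking condition clears. If any path
 * empties the queue without reaching this point (or drops a stub), the
 * READ is never wound or unwound -- the hang signature in the statedump. */
static void wb_process_queue(struct wb_inode *wb)
{
    while (wb->head && wb->writes_in_flight == 0) {
        struct stub *s = wb->head;
        wb->head = s->next;
        if (!wb->head)
            wb->tail = NULL;
        printf("winding READ fop %d\n", s->fop_id); /* STACK_WIND here */
        free(s);
    }
}

int main(void)
{
    struct wb_inode wb = {0};
    wb.writes_in_flight = 1;       /* writes are outstanding            */
    wb_enqueue(&wb, 1);            /* READ arrives and gets parked      */
    wb.writes_in_flight = 0;       /* writes complete ...               */
    wb_process_queue(&wb);         /* ... and must resume parked READs  */
    return 0;
}

The failure mode Vijay suspects corresponds to a stub leaving the queue
(hence gdb seeing empty queues) without wb_process_queue() ever winding or
unwinding it.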
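The gdb inspection mentioned above can be reproduced roughly as follows,
assuming debug symbols are installed; the exact structure and member names
depend on the source tree, so treat them as placeholders:

$ gdb -p <client-pid>
(gdb) thread apply all bt       # look for threads parked under wb_*/ra_* frames
(gdb) frame <n>                 # select a frame inside the write-behind xlator
(gdb) print *wb_inode           # inspect its request lists by hand
(gdb) detach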
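For the trial run with write-behind disabled, the relevant volume options
can be toggled at runtime. Assuming the smoke-test volume is named "patchy":

$ gluster volume set patchy performance.write-behind off
$ gluster volume set patchy performance.read-ahead off   # given the read-ahead suspicion above

# Restore the defaults once an owner has a fix in hand:
$ gluster volume reset patchy performance.write-behind
$ gluster volume reset patchy performance.read-ahead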
_______________________________________________
Gluster-devel mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-devel
