On Thu, Aug 2, 2018 at 4:37 PM, Kotresh Hiremath Ravishankar <[email protected]> wrote:
> On Thu, Aug 2, 2018 at 3:49 PM, Xavi Hernandez <[email protected]> wrote:
>
>> On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee <[email protected]> wrote:
>>
>>> On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee <[email protected]> wrote:
>>>
>>>> I just went through the nightly regression report of brick mux runs and
>>>> here's what I can summarize.
>>>>
>>>> ===========================================================================
>>>> Fails only with brick-mux
>>>> ===========================================================================
>>>> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
>>>> 400 secs. Refer
>>>> https://fstat.gluster.org/failure/209?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all,
>>>> specifically the latest report
>>>> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText .
>>>> Wasn't timing out as frequently as it was till 12 July, but since 27 July
>>>> it has timed out twice. Beginning to believe commit
>>>> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now
>>>> 400 secs isn't sufficient (Mohit?)
>>>>
>>>> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
>>>> (Ref - https://build.gluster.org/job/regression-test-with-multiplex/814/console)
>>>> - Test fails only in brick-mux mode. AI on Atin to look at and get back.
>>>>
>>>> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t
>>>> (https://build.gluster.org/job/regression-test-with-multiplex/813/console)
>>>> - Seems to have failed just twice in the last 30 days as per
>>>> https://fstat.gluster.org/failure/251?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all.
>>>> Need help from the AFR team.
>>>>
>>>> tests/bugs/quota/bug-1293601.t
>>>> (https://build.gluster.org/job/regression-test-with-multiplex/812/console)
>>>> - Hasn't failed after 26 July; earlier it was failing regularly. Did we
>>>> fix this test through any patch (Mohit?)
>>>>
>>>> tests/bitrot/bug-1373520.t
>>>> (https://build.gluster.org/job/regression-test-with-multiplex/811/console)
>>>> - Hasn't failed after 27 July; earlier it was failing regularly. Did we
>>>> fix this test through any patch (Mohit?)
>>>
>>> I see this has failed in the day before yesterday's regression run as well
>>> (and I could reproduce it locally with brick mux enabled). The test fails
>>> in healing a file within a particular time period.
>>>
>>> 15:55:19 not ok 25 Got "0" instead of "512", LINENUM:55
>>> 15:55:19 FAILED COMMAND: 512 path_size /d/backends/patchy5/FILE1
>>>
>>> Need EC devs' help here.
>>
>> I'm not sure where the problem is exactly. I've seen that when the test
>> fails, self-heal is attempting to heal the file, but when the file is
>> accessed, an Input/Output error is returned, aborting the heal. I've
>> checked that a heal is attempted every time the file is accessed, but it
>> always fails. This error seems to come from the bit-rot stub xlator.
>>
>> When in this situation, if I stop and start the volume, self-heal
>> immediately heals the files. It seems like a stale state kept by the stub
>> xlator is preventing the file from being healed.
>>
>> Adding bit-rot maintainers for help on this one.
>
> Bitrot-stub marks the file as corrupted in inode_ctx.
> But when the file and its hardlink are deleted from that brick and a lookup
> is done on the file, it cleans up the marker on getting ENOENT. This is part
> of the recovery steps, and only md-cache is disabled during the process.
> Are there any other perf xlators that need to be disabled for this scenario,
> so that a lookup/revalidate actually reaches the brick where the backend
> file is deleted?

Can you make sure there are no perf xlators in the bitrot stack while doing
it? Keeping them in the stack for internal 'validations' may not be a good
idea. A minimal sketch of the failure mode I have in mind is at the bottom
of this mail.

>> Xavi
>
>>>> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core;
>>>> not sure if brick mux is the culprit here or not. Ref -
>>>> https://build.gluster.org/job/regression-test-with-multiplex/806/console .
>>>> Seems to be a glustershd crash. Need help from AFR folks.
>>>>
>>>> ===========================================================================
>>>> Fails for non-brick mux case too
>>>> ===========================================================================
>>>> tests/bugs/distribute/bug-1122443.t - Seems to be failing at my setup
>>>> very often, without brick mux as well. Refer
>>>> https://build.gluster.org/job/regression-test-burn-in/4050/consoleText .
>>>> There's an email in gluster-devel and a BZ 1610240 for the same.
>>>>
>>>> tests/bugs/bug-1368312.t
>>>> (https://build.gluster.org/job/regression-test-with-multiplex/815/console)
>>>> - Seems to be a new failure; however, I have seen this for a
>>>> non-brick-mux case too -
>>>> https://build.gluster.org/job/regression-test-burn-in/4039/consoleText .
>>>> Need some eyes from AFR folks.
>>>>
>>>> tests/00-geo-rep/georep-basic-dr-tarssh.t - This isn't specific to brick
>>>> mux; I have seen this failing in multiple default regression runs. Refer
>>>> https://fstat.gluster.org/failure/392?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all .
>>>> We need help from geo-rep devs to root cause this sooner rather than later.
>>>>
>>>> tests/00-geo-rep/georep-basic-dr-rsync.t - This isn't specific to brick
>>>> mux; I have seen this failing in multiple default regression runs. Refer
>>>> https://fstat.gluster.org/failure/393?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all .
>>>> We need help from geo-rep devs to root cause this sooner rather than later.
>>>>
>>>> tests/bugs/glusterd/validating-server-quorum.t
>>>> (https://build.gluster.org/job/regression-test-with-multiplex/810/console)
>>>> - Fails for non-brick-mux cases too:
>>>> https://fstat.gluster.org/failure/580?state=2&start_date=2018-06-30&end_date=2018-07-31&branch=all .
>>>> Atin has a patch https://review.gluster.org/20584 which resolves it, but
>>>> the patch is failing regression for a different, unrelated test.
>>>>
>>>> tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t
>>>> (Ref - https://build.gluster.org/job/regression-test-with-multiplex/809/console)
>>>> - Fails for the non-brick-mux case too:
>>>> https://build.gluster.org/job/regression-test-burn-in/4049/consoleText .
>>>> Need some eyes from AFR folks.
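To make the concern concrete, here is a minimal, self-contained C sketch of
the cleanup path Kotresh describes: a 'corrupted' marker kept in the
per-inode context, dropped only when the lookup callback actually sees
ENOENT from the brick. This is a toy model under my assumptions, not the
real br-stub code; every identifier in it (toy_inode_t, stub_lookup_cbk,
...) is hypothetical.

/*
 * Toy model of the recovery path: the bit-rot stub keeps a "corrupted"
 * marker in the per-inode context, and the lookup callback drops that
 * marker when the brick reports ENOENT (i.e. the bad file and its
 * hardlink were removed from the backend). NOT the actual br-stub code.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    char gfid[37];   /* file identity, as a printable GFID string */
    bool bad_marker; /* set when the scrubber flags the object as bad */
} toy_inode_t;

/* Lookup completion; op_ret/op_errno mimic an xlator callback. */
static void
stub_lookup_cbk(toy_inode_t *inode, int op_ret, int op_errno)
{
    if (op_ret < 0 && op_errno == ENOENT && inode->bad_marker) {
        /* Backend file is gone: the stale corrupted state must not
         * survive, otherwise self-heal keeps getting EIO forever. */
        inode->bad_marker = false;
        printf("gfid %s: ENOENT on lookup, cleared bad-object marker\n",
               inode->gfid);
    }
}

int
main(void)
{
    toy_inode_t inode = { "a1b2c3d4-0000-4000-8000-000000000001", true };

    /* If a cached (perf-xlator) lookup answer short-circuits the brick,
     * this callback never sees ENOENT and the marker is never cleared,
     * which is why md-cache (and possibly others) must be disabled. */
    stub_lookup_cbk(&inode, -1, ENOENT);

    printf("marker now: %s\n", inode.bad_marker ? "set" : "cleared");
    return 0;
}

If any perf xlator serves the lookup from its cache, the callback never
observes ENOENT and the stale marker survives, which would also be
consistent with what Xavi saw: a volume stop/start rebuilds the inode
table (and with it the ctx), so the files heal immediately afterwards.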
> --
> Thanks and Regards,
> Kotresh H R

--
Amar Tumballi (amarts)
_______________________________________________
Gluster-devel mailing list
[email protected]
https://lists.gluster.org/mailman/listinfo/gluster-devel
