Deepshikha, I see that the tests/bugs/glusterd/optimized-basic-testcases-in-cluster.t
test failed in today's run #273, but I couldn't get the logs from
https://ci-logs.gluster.org/distributed-regression-logs-273.tgz. I get a
404 Not Found error saying "The requested URL
/distributed-regression-logs-273.tgz was not found on this server."
Please help me get the logs.
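In case it helps anyone else chasing these archives, here is a minimal
sketch of how one might probe whether the collector uploaded logs for a
given run. The URL pattern is simply lifted from the link above, and the
run number is assumed to be the only varying part:

    #!/usr/bin/env python3
    # Probe the log collector for a run's archive without downloading it.
    import urllib.error
    import urllib.request

    RUN = 273  # run number from the nightly pipeline
    URL = "https://ci-logs.gluster.org/distributed-regression-logs-%d.tgz" % RUN

    try:
        req = urllib.request.Request(URL, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            print("archive exists: %s bytes" % resp.headers.get("Content-Length"))
    except urllib.error.HTTPError as err:
        # A 404 here means the collector never uploaded logs for this run.
        print("no archive for run %d: HTTP %d" % (RUN, err.code))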
On Thu, Oct 4, 2018 at 10:31 PM Atin Mukherjee <amukh...@redhat.com> wrote:
> Deepshikha,
>
> Please keep us posted if you see the particular glusterd test failing
> again. It'll be great to see this nightly job green sooner rather than
> later :-).
>
> On Thu, 4 Oct 2018 at 15:07, Deepshikha Khandelwal <dkhan...@redhat.com> wrote:
>> On Thu, Oct 4, 2018 at 6:10 AM Sanju Rakonde <srako...@redhat.com> wrote:
>> >
>> > On Wed, Oct 3, 2018 at 3:26 PM Deepshikha Khandelwal <dkhan...@redhat.com> wrote:
>> >>
>> >> Hello folks,
>> >>
>> >> The distributed-regression job[1] is now a part of Gluster's
>> >> nightly-master build pipeline. The following are the issues we have
>> >> resolved since we started working on this:
>> >>
>> >> 1) Gluster logs are now collected from the servers.
>> >> 2) Tests that failed due to infra-related issues have been fixed.
>> >> 3) The time taken to run the regression suite is down to ~50-60
>> >> minutes.
>> >>
>> >> Getting the time down to 40 minutes needs your help!
>> >>
>> >> Currently, there is a test that is failing:
>> >>
>> >> tests/bugs/glusterd/optimized-basic-testcases-in-cluster.t
>> >>
>> >> This needs fixing first.
>> >
>> > Where can I get the logs of this test case? In
>> > https://build.gluster.org/job/distributed-regression/264/console
>> > I see this test case failed and was re-attempted, but I couldn't
>> > find the logs.
>>
>> There's a link at the end of the console output where you can look
>> for the logs of failed tests. We had a bug in the setup and the logs
>> were not getting saved. We've fixed this, and future jobs should have
>> the logs available at the log collector's link shown in the console
>> output.
>>
>> >> There's a test that takes 14 minutes to complete:
>> >> tests/bugs/index/bug-1559004-EMLINK-handling.t. A single test
>> >> taking 14 minutes is not something we can distribute (see the
>> >> scheduling sketch below this thread). Can we look at how we can
>> >> speed this up[2]? When this test fails, it is re-attempted, further
>> >> increasing the time. This happens in the regular centos7-regression
>> >> job as well.
>> >>
>> >> If you see any other issues, please file a bug[3].
>> >>
>> >> [1]: https://build.gluster.org/job/distributed-regression
>> >> [2]: https://build.gluster.org/job/distributed-regression/264/console
>> >> [3]: https://bugzilla.redhat.com/enter_bug.cgi?product=glusterfs&component=project-infrastructure
>> >>
>> >> Thanks,
>> >> Deepshikha Khandelwal
>> >>
>> >> On Tue, Jun 26, 2018 at 9:02 AM Nigel Babu <nig...@redhat.com> wrote:
>> >> >
>> >> > On Mon, Jun 25, 2018 at 7:28 PM Amar Tumballi <atumb...@redhat.com> wrote:
>> >> >>
>> >> >>> There are currently a few known issues:
>> >> >>>
>> >> >>> * Not collecting the entire logs (/var/log/glusterfs) from
>> >> >>> servers.
>> >> >>
>> >> >> If I look at the activities involved with regression failures,
>> >> >> this can wait.
>> >> >
>> >> > Well, we can't debug the current failures without having the logs,
>> >> > so this has to be fixed first.
>> >> >
>> >> >>> * A few tests fail due to infra-related issues, like the geo-rep
>> >> >>> tests.
>> >> >>
>> >> >> Please open bugs for these, so we can track them and take them to
>> >> >> closure.
>> >> >
>> >> > These are failing due to infra reasons, most likely subtle
>> >> > differences between the setup of these nodes and our normal nodes.
>> >> > We'll only be able to debug them once we get the logs. I know the
>> >> > geo-rep ones are easy to fix: the playbook for setting up geo-rep
>> >> > correctly just didn't make it over to the playbook used for these
>> >> > images.
>> >> >
>> >> >>> * Takes ~80 minutes with 7 distributed servers (targeting 60
>> >> >>> minutes).
>> >> >>
>> >> >> The time can change as more tests are added; also, please plan to
>> >> >> make the number of servers configurable from 1 to n.
>> >> >
>> >> > While n is configurable, it will be fixed to a single-digit number
>> >> > for now. We need to place *some* limitation somewhere, or else
>> >> > we'll end up not being able to control our cloud bills.
>> >> >
>> >> >>> * We've only tested plain regressions. ASAN and Valgrind are
>> >> >>> currently untested.
>> >> >>
>> >> >> It would be great to have it running not 'per patch', but nightly,
>> >> >> or weekly to start with.
>> >> >
>> >> > This is currently not targeted until we phase out the current
>> >> > regressions.
>> >> >
>> >> >>> Before bringing it into production, we'll run this job nightly
>> >> >>> and watch it for a month to debug the other failures.
>> >> >>
>> >> >> I would say, bring it to production sooner, say in 2 weeks, and
>> >> >> also plan to keep the current regression as is, with a special
>> >> >> Gerrit command like 'run regression in-one-machine' (or something
>> >> >> similar) with voting rights, so we can fall back to that method if
>> >> >> something is broken in parallel testing.
>> >> >>
>> >> >> I have seen that regardless of how much time we spend testing
>> >> >> scripts, the day we move to production something breaks. So let
>> >> >> that happen earlier rather than later; it would help the next
>> >> >> release branch out. We don't want to be stuck at branching due to
>> >> >> infra failures.
>> >> >
>> >> > Having two regression jobs that can vote is going to cause more
>> >> > confusion than it's worth. There are a couple of intermittent
>> >> > memory issues with the test script that we need to debug and fix
>> >> > before I'm comfortable making this job a voting job. We've worked
>> >> > around these problems for now, but they still pop up now and again.
>> >> > The fact that things break often is no excuse not to prevent the
>> >> > failures we can avoid. The one-month timeline was chosen with all
>> >> > these factors taken into consideration. The 2-week timeline is a
>> >> > no-go at this point.
>> >> >
>> >> > When we are ready to make the switch, we won't be switching 100%
>> >> > of the jobs at once. We'll start with a sliding scale so that we
>> >> > can monitor failures and machine creation adequately.
>> >> >
>> >> > --
>> >> > nigelb
>> >
>> > --
>> > Thanks,
>> > Sanju
>
> --
> - Atin (atinm)

--
Thanks,
Sanju
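As an aside, here is a back-of-the-envelope sketch (illustrative only,
not the project's actual scheduler) of the constraint Deepshikha
describes above: a distributed run can never finish faster than
max(total runtime / servers, longest single test), and a re-attempted
failure doubles that test's cost on whichever server runs it. Every
runtime below except the 14-minute EMLINK test is made up:

    #!/usr/bin/env python3
    # Illustrative sketch only; hypothetical per-test runtimes in minutes.
    import heapq

    SERVERS = 7
    # Only the 14-minute entry (bug-1559004-EMLINK-handling.t) comes from
    # this thread; the rest is an invented mix of smaller tests.
    durations = [14.0] + [1.5] * 120 + [4.0] * 25

    def makespan(tests, servers):
        # Greedy longest-processing-time assignment: hand each test,
        # longest first, to the currently least-loaded server.
        loads = [0.0] * servers
        heapq.heapify(loads)
        for d in sorted(tests, reverse=True):
            heapq.heappush(loads, heapq.heappop(loads) + d)
        return max(loads)

    total = sum(durations)
    print("total runtime:  %.0f min, ideal split: %.1f min"
          % (total, total / SERVERS))
    print("lower bound:    %.1f min (max of ideal split and longest test)"
          % max(total / SERVERS, max(durations)))
    print("clean run:      %.1f min" % makespan(durations, SERVERS))
    # One failure plus a re-attempt: the EMLINK test now costs 28 minutes
    # on whichever server runs it.
    print("EMLINK retried: %.1f min"
          % makespan([28.0] + durations[1:], SERVERS))

However the real numbers shake out, that lower bound is why splitting or
speeding up the 14-minute test helps where adding more servers eventually
cannot.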
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel