On Thu, Jun 9, 2016 at 12:14 PM, Kaushal M <[email protected]> wrote:
> On Thu, Jun 9, 2016 at 12:03 PM, Saravanakumar Arumugam
> <[email protected]> wrote:
>> Hi Kaushal,
>>
>> One of the patches (http://review.gluster.org/#/c/14653/) is
>> failing in NetBSD.
>> Its log:
>> https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/15624/
>>
>> But the patch mentioned in the NetBSD log is a different
>> one (http://review.gluster.org/#/c/13872/).
>>
>
> Yup. We know this is happening, but don't know why yet. I'll keep this
> thread updated with any findings I have.
>
>> Thanks,
>> Saravana
>>
>> On 06/09/2016 11:52 AM, Kaushal M wrote:
>>>
>>> In addition to the builder issues we're having, we are also facing
>>> problems with Jenkins voting/commenting randomly.
>>>
>>> The comments generally link to older jobs for older patchsets, which
>>> were run about 2 months back (beginning of April). For example,
>>> https://review.gluster.org/14665 has a NetBSD regression +1 vote from
>>> a job run in April for review 13873, which actually failed.
>>>
>>> Another observation I've made is that these fake votes sometimes
>>> provide a -1 Verified. Jenkins shouldn't be using this flag anymore.
>>>
>>> These two observations make me wonder if another Jenkins instance is
>>> running somewhere, possibly from our old backups. Michael, could this
>>> be possible?
>>>
>>> To check where these votes/comments were coming from, I tried
>>> checking the Gerrit sshd logs. This wasn't helpful, because all logins
>>> apparently happen from 127.0.0.1. This is probably some firewall rule
>>> that was set up post-migration, because I see older logs giving
>>> proper IPs. I'll need Michael's help with fixing this, if possible.
>>>
>>> I'll continue to investigate and update this thread with anything I find.
>>>
My guess was right!! This problem should now be fixed, as should the problem with the builders. The cause of both is the same: our old Jenkins server, back from the dead (zombie-jenkins from now on).

The hypervisor at iWeb which hosted our services earlier, and which was supposed to be off, started up about 4 days back. This brought back zombie-jenkins, which continued from where it left off around early April: it started receiving Gerrit events and running jobs for them.

Zombie-jenkins numbered its jobs from where it had left off, and used those numbers when reporting back to Gerrit. But those job numbers had already been used by new-jenkins when it started about 2 months back. This is why the links in the comments pointed to old jobs on new-jenkins. I've checked the logs on Gerrit (with help from Michael) and can verify that these comments/votes did come from zombie-jenkins's IP.

Zombie-jenkins also explains the random build failures seen on the builders. Zombie-jenkins and new-jenkins each thought they had the slaves to themselves and launched jobs on them, so jobs sometimes clashed, which resulted in the random failures reported on new-jenkins. I've yet to log in to a slave and verify this, but I'm pretty sure this is what happened.

For now, Michael has stopped the iWeb hypervisor and zombie-jenkins. This should stop any more random comments in Gerrit and failures in Jenkins. I'll get Michael (once he's back on Monday) to figure out why zombie-jenkins restarted, and write up a proper postmortem about the issues.

>>> ~kaushal
>>> _______________________________________________
>>> Gluster-infra mailing list
>>> [email protected]
>>> http://www.gluster.org/mailman/listinfo/gluster-infra

_______________________________________________
Gluster-infra mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-infra
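For anyone who wants to verify the slave-clash theory before the postmortem: if two masters were driving the same builder, the builder would have had two Jenkins agent (slave.jar) processes running at once. A rough sketch of the check is below; the process listing here is a made-up sample (the `old-jenkins.example` URL and PIDs are hypothetical), since I haven't logged in to a real slave yet.

```shell
# Hypothetical snapshot of `ps` output on a builder; real output will differ.
# Two slave.jar agents pointing at different masters would confirm the clash.
cat > /tmp/ps.sample <<'EOF'
jenkins   1201  java -jar slave.jar -jnlpUrl https://build.gluster.org/computer/nbslave70/slave-agent.jnlp
jenkins   3456  java -jar slave.jar -jnlpUrl https://old-jenkins.example/computer/nbslave70/slave-agent.jnlp
root      9999  sshd: jenkins [priv]
EOF

# Count agent processes; a count greater than 1 means more than one
# master believes it owns this slave.
grep -c 'slave.jar' /tmp/ps.sample
```

On a live builder the equivalent would be something like `ps aux | grep '[s]lave.jar'`, then comparing the `-jnlpUrl` hosts each agent was launched with.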
