On Mon, Jan 22, 2018 at 5:13 PM, Nigel Babu <nig...@redhat.com> wrote:
> Update: All the nodes that had problems with geo-rep are now fixed.
> Waiting on the patch to be merged before we switch over to CentOS 7. If
> things go well, we'll replace nodes one by one as soon as we have one green
> on CentOS 7.
>

I just noticed we failed again on the geo-rep tests @
https://build.gluster.org/job/centos6-regression/8604/console . Nigel
reconfirmed that we have all the machines cleaned up. What else could be
going wrong here?

> On Mon, Jan 22, 2018 at 12:21 PM, Nigel Babu <nig...@redhat.com> wrote:
>
>> Hello folks,
>>
>> As you may have noticed, we've had a lot of centos6-regression failures
>> lately. The geo-replication failures are the new ones which particularly
>> concern me. These failures have nothing to do with the test. The tests are
>> exposing a problem in our infrastructure that we've carried around for a
>> long time. Our machines are not clean machines that we automated. We set up
>> automation on machines that were already created. At some point, we loaned
>> machines for debugging. During this time, developers have inadvertently
>> done 'make install' on the system to install onto system paths rather than
>> into /build/install. This is what is causing the geo-replication tests
>> to fail. I've tried cleaning the machines up several times with little to
>> no success.
>>
>> Last week, we decided to take an aggressive path to fix this problem. We
>> planned to replace all our problematic nodes with new CentOS 7 nodes. This
>> exposed more problems. We expected a specific type of machine from
>> Rackspace. These are no longer offered. Thus, our automation fails on some
>> steps. I've spent this weekend tweaking our automation so that it works
>> on the new Rackspace machines and I'm down to just one test failure[1].
>> I have a patch up to fix this failure[2]. As soon as that patch is merged,
>> we can push forward with CentOS 7 nodes. In 4.0, we're dropping support for
>> CentOS 6, so this decision makes more sense to do sooner rather than later.
>>
>> We'll not be lending machines anymore from production. We'll be creating
>> new nodes which are a snapshot of an existing production node. These
>> machines will be destroyed after use. This helps prevent this particular
>> problem in the future. This also means that our machine capacity at all
>> times is at 100 with very minimal wastage.
>>
>> [1]: https://build.gluster.org/job/cage-test/184/consoleText
>> [2]: https://review.gluster.org/#/c/19262/
>>
>> --
>> nigelb
>>
>
>
>
> --
> nigelb
>
> _______________________________________________
> Gluster-devel mailing list
> gluster-de...@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>
_______________________________________________
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra