Re: Failing Master

Jarek Potiuk Sun, 02 Feb 2020 01:20:13 -0800

OK. Master is fixed now (finally!), so please rebase all of your open PRs
onto master.


In the end we had a number of different problems, some of them coinciding,
which is why it was so hectic and difficult to diagnose:

   - The Travis queue was stalled (at one point we had some 20 builds
   waiting in the queue), so to save time we did not rebase some PRs and
   merged them from an old master
   - Some of the master merges were cancelled - so we could not see which
   commit broke the build - and that made us come up with different
   hypotheses for the problem
   - Our optimisations for CI builds (skipping Kubernetes builds when there
   are no Kubernetes-related changes) caused the contrib/example_dags move
   to slip under the radar of the PR CI checks
   - Even without the optimisations, Kubernetes Git Sync uses Airflow
   master, so we would not have detected this through a PR failure (only
   after the merge)
   - We had a number of “false positives” and a lack of detailed logs for
   Kubernetes.
   - We had a mysterious hang in the Kerberos tests - but it was likely
   caused by a Travis environment change (it’s gone now)
   - We had Redis test failures caused by the 3.4 release of the redis-py
   library, which contained a change (the Redis class became un-hashable
   because an __eq__ hook was added) - luckily they reverted it two hours
   ago (https://github.com/andymccurdy/redis-py/blob/master/CHANGES)
   - We downloaded the Apache RAT tool from a Maven repository, and this
   Maven repo has been very unstable recently.
   - There are a number of follow-up PRs (already merged or building on
   Travis now) that will resolve those problems and prevent them in the
   future.
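
For anyone wondering why adding an __eq__ hook can make a class
un-hashable: in Python 3, a class that defines __eq__ without also defining
__hash__ gets its __hash__ set to None. Below is a minimal sketch of that
language rule only. The class and function names (PlainClient,
ComparableClient, is_hashable) are hypothetical and this is not the
redis-py code.

```python
class PlainClient:
    """Hypothetical class with no __eq__: it inherits object's
    identity-based __eq__ and __hash__, so instances are hashable."""

    def __init__(self, host):
        self.host = host


class ComparableClient:
    """Hypothetical class that defines __eq__ but not __hash__:
    Python implicitly sets __hash__ to None for such a class."""

    def __init__(self, host):
        self.host = host

    def __eq__(self, other):
        return isinstance(other, ComparableClient) and self.host == other.host


def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


if __name__ == "__main__":
    print(is_hashable(PlainClient("localhost")))
    print(is_hashable(ComparableClient("localhost")))
```

Any code that puts such an object in a set or uses it as a dict key fails
with a TypeError once the class loses its hash, which is one way a change
like this can surface as test failures.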

J.


On Thu, Jan 30, 2020 at 11:16 AM Ash Berlin-Taylor <a...@apache.org> wrote:

> Spent a little bit of time looking at this and it seems it was (super)
> flaky tests -- I've managed to get 1 commit back on master passing by just
> retrying the one failed job.
>
> Looking at the latest commit now.
>
> On Jan 30 2020, at 7:54 am, Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> > It looks like we have a failing master - seems that yesterday's Travis'
> > super-slow queue and a number of PRs that were merged without rebasing and
> > caused master to be broken.
> >
> > I will not be at my PC for couple of hours at least so maybe some other
> > committers can take a look in the meantime.
> >
> > J.
> >
> > --
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129 <+48660796129>
> > [image: Polidea] <https://www.polidea.com/>
> >
>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
