It seems the "timeout" in the last Kerberos job is back now
(intermittent) - it appears when we run more of those builds in
parallel.
So one more round of diagnosis/fixing is needed, I am afraid.

On Sun, Feb 2, 2020 at 11:06 AM Ash Berlin-Taylor <a...@apache.org> wrote:

> Great work Jarek!
>
> On 2 February 2020 09:18:52 GMT, Jarek Potiuk <jarek.pot...@polidea.com>
> wrote:
>>
>> OK. Master is fixed now (finally!) and working again, so please rebase
>> all of your open PRs onto master.
>>
>> In the end we had a number of different problems, several of them
>> coinciding - that’s why it was so hectic and difficult to diagnose:
>>
>>    - The Travis queue was stalled (at one point we had some 20 builds
>>    waiting in the queue), so to save time we did not rebase some merges and
>>    merged them from old masters
>>    - Some of the master merge builds were cancelled, so we could not see
>>    which commit broke the build - that made us come up with different
>>    hypotheses for the problem
>>    - Our CI build optimisations (skipping Kubernetes builds when there are
>>    no Kubernetes-related changes) caused the contrib/example_dags move to
>>    slip under the radar of the PR CI checks (see the path-filter sketch
>>    after this list)
>>    - Even without those optimisations, the Kubernetes Git Sync tests use
>>    the master branch of Airflow, so we would not have caught the problem as
>>    a PR failure (only after the merge)
>>    - We had a number of “false positives” and lacked detailed logs for the
>>    Kubernetes tests.
>>    - We had a mysterious hang in the Kerberos tests - but it was likely
>>    caused by a Travis environment change (it’s gone now)
>>    - We had Redis test failures caused by the 3.4 release of the redis-py
>>    library, which contained a change that made the Redis class unhashable
>>    (by adding an __eq__ hook) - luckily they reverted it two hours ago (
>>    https://github.com/andymccurdy/redis-py/blob/master/CHANGES); see the
>>    __eq__/__hash__ sketch after this list
>>    - We downloaded the Apache RAT tool from a Maven repository, and that
>>    Maven repo has been very unstable recently.
>>    - There are a number of follow-up PRs (already merged or building on
>>    Travis now) that will resolve those problems and prevent them in the
>>    future.
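>>
>> A minimal sketch of the path-filter idea mentioned above - illustrative
>> only, not our actual CI script; the path patterns and the base branch
>> are assumptions:
>>
>>     # Sketch: skip the Kubernetes test suite when no changed file looks
>>     # Kubernetes-related. The patterns here are hypothetical examples.
>>     import subprocess
>>     import sys
>>
>>     KUBERNETES_PATTERNS = ("kubernetes", "chart/", "kubernetes_tests")
>>
>>     def changed_files(base="origin/master"):
>>         out = subprocess.check_output(
>>             ["git", "diff", "--name-only", base, "HEAD"], text=True
>>         )
>>         return out.splitlines()
>>
>>     if not any(p in f for f in changed_files() for p in KUBERNETES_PATTERNS):
>>         print("No Kubernetes-related changes - skipping Kubernetes tests.")
>>         sys.exit(0)  # the CI job exits successfully without running tests
>>
>> This is exactly how a move like contrib/example_dags can slip through:
>> the changed paths match none of the patterns, so the Kubernetes job
>> never runs on the PR.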
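>>
>> For the redis-py issue, the mechanism is standard Python data-model
>> behaviour (the sketch below is illustrative, not redis-py's actual
>> code): defining __eq__ on a class without also defining __hash__ sets
>> __hash__ to None, so instances become unhashable:
>>
>>     # Defining __eq__ without __hash__ makes instances unhashable.
>>     class WithEq:
>>         def __eq__(self, other):
>>             return isinstance(other, WithEq)
>>
>>     hash(object())  # fine: default identity-based hash
>>     try:
>>         hash(WithEq())
>>     except TypeError as e:
>>         print(e)  # unhashable type: 'WithEq'
>>
>> Any test or fixture that keeps Redis clients in a set or uses them as
>> dict keys starts failing as soon as the library adds __eq__ without a
>> matching __hash__.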
>>
>> J.
>>
>>
>> On Thu, Jan 30, 2020 at 11:16 AM Ash Berlin-Taylor <a...@apache.org> wrote:
>>
>>>  Spent a little bit of time looking at this and it seems it was (super)
>>>  flaky tests -- I've managed to get 1 commit back on master passing by just
>>>  retrying the one failed job.
>>>
>>>  Looking at the latest commit now.
>>>
>>>  On Jan 30 2020, at 7:54 am, Jarek Potiuk <jarek.pot...@polidea.com> wrote:
>>>
>>>> It looks like we have a failing master - it seems that yesterday's
>>>> super-slow Travis queue and a number of PRs that were merged without
>>>> rebasing caused master to be broken.
>>>>
>>>>  I will not be at my PC for a couple of hours at least, so maybe some
>>>>  other committers can take a look in the meantime.
>>>>
>>>>  J.
>>>>
>>>>  --
>>>>  Jarek Potiuk
>>>>  Polidea <https://www.polidea.com/> | Principal Software Engineer
>>>>
>>>>  M: +48 660 796 129
>>>>
>>>>
>>>
>>>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129
