Thanks for asking! Really appreciated! I think adding better diagnostics
for the Docker containers might be a good idea.

At this point it seems that most of the builds are passing. Yesterday we had
some failures due to an old PR being merged (but it was quickly fixed by Kaxil!).

We still have intermittent failures, though:

   - I saw occasional failures when the database went missing mid-flight -
   for example here: https://travis-ci.org/apache/airflow/jobs/645157182.
   The DB was OK at the beginning of the test (it is checked now) but then
   it disappeared.
   - The Kerberos tests occasionally hang, with
tests/cli/commands/test_task_command.py::TestCliTaskBackfill::test_run_ignores_all_dependencies
   running longer than 8 minutes (typically it runs for about 1 minute, and
   it is our longest non-integration test). This one is really difficult to
   track down, but the diagnostics should help.

It is getting a bit difficult to investigate the remaining issues.

What you could definitely help with is adding more diagnostics: I already
added them for the Kubernetes tests. Right now we are uploading all the
Kubernetes logs to file.io so that we can download them from there in case
we need to investigate a failure. It would be great to also dump the logs
from all the containers (docker logs) after each build and upload them to
file.io (no matter whether the build succeeds or fails). That would make it
easier to analyse the root causes of problems when they happen intermittently.
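
Roughly, I imagine something like the sketch below (just an illustration -
the script name, dump directory and the way it gets hooked into CI are all
made up and would need to be adapted to our ci scripts):

    #!/usr/bin/env bash
    # Hypothetical helper (not in the repo yet): dump "docker logs" from
    # every container - running or exited - into one directory after a build.
    set -euo pipefail

    DUMP_DIR="${DUMP_DIR:-/tmp/container_logs}"
    mkdir -p "${DUMP_DIR}"

    # Iterate over all containers, including the ones that already exited.
    for container in $(docker ps -a --format '{{.Names}}'); do
        echo "Dumping logs for ${container}"
        docker logs "${container}" > "${DUMP_DIR}/${container}.log" 2>&1 || true
    done

    # Pack everything into a single archive ready for upload.
    tar -czf "${DUMP_DIR}.tar.gz" -C "$(dirname "${DUMP_DIR}")" "$(basename "${DUMP_DIR}")"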

I already do something similar when I check the environment at the
beginning of the tests:
https://github.com/apache/airflow/blob/master/scripts/ci/in_container/check_environment.sh#L56
-
in case anything is wrong, I dump all the logs to standard output.
However, I think with file.io we have a chance to always upload such logs
and have access to them without cluttering the Travis output.

Uploading logs to file.io is super easy. For the Kubernetes tests I did it in
Python, but it should be even easier in Bash:
https://github.com/apache/airflow/blob/master/tests/runtime/kubernetes/test_kubernetes_executor.py#L107
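
Something along these lines should be enough in Bash (I am assuming
file.io's simple multipart upload API here - worth double-checking against
their docs):

    # Upload the archive produced after the build; file.io replies with JSON
    # containing a one-time download link that we can print into the CI log.
    curl --fail --silent --show-error \
        -F "file=@/tmp/container_logs.tar.gz" https://file.io

The returned link could then be printed at the end of the Travis job so it
is easy to find when a build needs investigating.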


J.

On Mon, Feb 3, 2020 at 7:46 AM Michał Słowikowski <
michal.slowikow...@polidea.com> wrote:

> Can I help you somehow with failing master?
>
> On Mon, Feb 3, 2020 at 7:44 AM Michał Słowikowski <
> michal.slowikow...@polidea.com> wrote:
>
> > Thanks Jarek, awesome work!
> >
> > On Sun, Feb 2, 2020 at 11:09 AM Jarek Potiuk <jarek.pot...@polidea.com>
> > wrote:
> >
> >> It still seems that the "timeout" in the last Kerberos job is back now
> >> (intermittent) - it seems to appear when we run more of those builds in
> >> parallel.
> >> So one more diagnosis/fix is still needed, I am afraid.
> >>
> >> On Sun, Feb 2, 2020 at 11:06 AM Ash Berlin-Taylor <a...@apache.org>
> wrote:
> >>
> >> > Great work Jarek!
> >> >
> >> > On 2 February 2020 09:18:52 GMT, Jarek Potiuk <
> jarek.pot...@polidea.com
> >> >
> >> > wrote:
> >> >>
> >> >> Ok. The master is fixed now (finally!). Master is now working, so
> >> >> please rebase all of your open PRs onto master.
> >> >>
> >> >> In the end we had a number of different problems, some of them
> >> >> coinciding, which is why it was so hectic and difficult to diagnose:
> >> >>
> >> >>    - The Travis queue was stalled (at some point we had some 20
> >> >>    builds waiting in the queue), so we did not rebase some merges to
> >> >>    save time and merged them from old masters
> >> >>    - Some of the master merges were cancelled - so we could not see
> >> >>    which commit broke the build - that made us come up with different
> >> >>    hypotheses for the problem
> >> >>    - Our optimisations for CI builds (skipping Kubernetes builds when
> >> >>    there are no Kubernetes-related changes) caused the
> >> >>    contrib/example_dags move to slip under the radar of the PR CI
> >> >>    checks
> >> >>    - Even if we did not have the optimisations, Kubernetes Git Sync
> >> >>    uses Airflow's master, so we would not have detected that by a PR
> >> >>    failure (only after the merge)
> >> >>    - We had a number of “false positives” and a lack of detailed logs
> >> >>    for Kubernetes.
> >> >>    - We had a mysterious hang in the Kerberos tests - but it was
> >> >>    likely caused by a Travis environment change (it’s gone now)
> >> >>    - We had Redis test failures caused by the 3.4 release of the
> >> >>    redis-py library, which contained a change (the Redis class became
> >> >>    un-hashable by adding an __eq__ hook) - luckily they reverted it
> >> >>    two hours ago (
> >> >>    https://github.com/andymccurdy/redis-py/blob/master/CHANGES)
> >> >>    - We downloaded the Apache RAT tool from a Maven repository, and
> >> >>    this Maven repo has been very unstable recently.
> >> >>    - There are a number of follow-up PRs (already merged or building
> >> >>    on Travis now) that will resolve those problems and prevent them
> >> >>    in the future.
> >> >>
> >> >> J.
> >> >>
> >> >>
> >> >> On Thu, Jan 30, 2020 at 11:16 AM Ash Berlin-Taylor <a...@apache.org>
> >> wrote:
> >> >>
> >> >>>  Spent a little bit of time looking at this and it seems it was
> >> >>>  (super) flaky tests -- I've managed to get 1 commit back on master
> >> >>>  passing by just retrying the one failed job.
> >> >>>
> >> >>>  Looking at the latest commit now.
> >> >>>
> >> >>>  On Jan 30 2020, at 7:54 am, Jarek Potiuk <jarek.pot...@polidea.com
> >
> >> wrote:
> >> >>>
> >> >>>> It looks like we have a failing master - it seems that yesterday's
> >> >>>> super-slow Travis queue and a number of PRs that were merged without
> >> >>>> rebasing caused master to be broken.
> >> >>>>
> >> >>>>  I will not be at my PC for a couple of hours at least, so maybe
> >> >>>>  some other committers can take a look in the meantime.
> >> >>>>
> >> >>>>  J.
> >> >>>>
> >> >>>>  --
> >> >>>>  Jarek Potiuk
> >> >>>>  Polidea <https://www.polidea.com/> | Principal Software Engineer
> >> >>>>
> >> >>>>  M: +48 660 796 129 <+48660796129>
> >> >>>>  [image: Polidea] <https://www.polidea.com/>
> >> >>>>
> >> >>>>
> >> >>>
> >> >>>
> >>
> >> --
> >>
> >> Jarek Potiuk
> >> Polidea <https://www.polidea.com/> | Principal Software Engineer
> >>
> >> M: +48 660 796 129 <+48660796129>
> >> [image: Polidea] <https://www.polidea.com/>
> >>
> >
> >
> > --
> >
> > Michał Słowikowski
> > Polidea <https://www.polidea.com/> | Junior Software Engineer
> >
> > E: michal.slowikow...@polidea.com
> >
> > Unique Tech
> > Check out our projects! <https://www.polidea.com/our-work>
> >
>
>
> --
>
> Michał Słowikowski
> Polidea <https://www.polidea.com/> | Junior Software Engineer
>
> E: michal.slowikow...@polidea.com
>
> Unique Tech
> Check out our projects! <https://www.polidea.com/our-work>
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>
