OK. I finally have a green build on the CI optimisations PR with Quarantined tests (and they all passed this time). I expect we might discover a few more tests that need to be quarantined in the next few days - I will keep an eye on that, organize a "Test Cleaning" project on GitHub and involve others.

PR here: https://github.com/apache/airflow/pull/8393
Build here: https://github.com/apache/airflow/actions/runs/82894632

I already have an approval from Tomek, but please take a look and comment
if you want. I am still testing caching for images on my own fork, but
once I get it confirmed, I'd love to merge it.

Some optimisations that I brought back/introduced:

- tests are not executed for doc-only changes
- images will be (once merged) downloaded from GitHub Registry, so likely
  much faster
- we have a "scheduled" nightly build that will build everything from
  scratch and check that no requirements have been broken
- updated documentation and removed Travis references
- coloured output where needed
- much nicer static check output now (we have timestamps in GA so we could
  disable verbosity)
- improved split of static checks between two static check jobs, to
  utilise parallelism better
- reorganised some fast jobs (requirements, prod image) that do not depend
  on tests so that they can run earlier
- shorter names for jobs so that they are nicer to view in the Actions view
- matrix definitions of the jobs so that we can manage them better

What is left is to bring Kubernetes jobs to GitHub Actions. Working on it
next.

J.

On Mon, Apr 20, 2020 at 12:11 PM Jarek Potiuk wrote:

> Absolutely! Great idea! Happy to coordinate - and I hope others would
> like to join it as well :)
>
> On Mon, Apr 20, 2020 at 12:04 PM Tomasz Urbaszek wrote:
>
>> Got it!
>>
>> What would you say to organizing a more coordinated effort to improve
>> our testing suite, something like "Fridays with tests"? In a few weeks,
>> this should result in a much better test suite and probably fewer
>> problems with CI. This is also a nice way to take a look at Airflow
>> internals :)
>>
>> Tomek
>>
>>
>> On Mon, Apr 20, 2020 at 10:18 AM Jarek Potiuk wrote:
>> >
>> > Both - depending on the tests.
>> > I think for now I've been a bit over-cautious, and after merging and
>> > observing a few runs in production (and other people's PRs) we might
>> > quickly bring down the number of quarantined tests.
>> >
>> > I think most of the problematic tests are really "long-running" and
>> > pretty stand-alone ones. I think part of the process should be that if
>> > we find that they require some side effects, we will be able to fix
>> > that, and eventually we will only have a few quarantined "single tests"
>> > rather than "whole classes".
>> >
>> > On Mon, Apr 20, 2020 at 7:42 AM Tomasz Urbaszek wrote:
>> >
>> > > Thank you Jarek for your work!
>> > > +1 for the idea of quarantine tests. Just one question: are we marking
>> > > single tests or whole classes? This question is mostly related to
>> > > tests that require some side effects from previous tests.
>> > >
>> > > Tomek
>> > >
>> > >
>> > > On Mon, Apr 20, 2020 at 2:38 AM Jarek Potiuk wrote:
>> > > >
>> > > > Hello everyone,
>> > > >
>> > > > I have a proposal - very much COVID-19-inspired - on how to fix our
>> > > > CI tests...
>> > > >
>> > > > After the recent problems with CI, together with Daniel and Tomek we
>> > > > decided to make an emergency migration to GitHub Actions. So we did.
>> > > >
>> > > > I think overall it was a good move, but we had some problems with it.
>> > > > It turns out that while we were blaming Travis for everything wrong
>> > > > that happened in our builds, it was not always Travis' fault. We have
>> > > > some tests that are also failing in GitHub Actions and I think it's
>> > > > high time we fixed them.
>> > > >
>> > > > I spent the better part of the weekend trying different things and
>> > > > bringing numerous optimizations back to our CI configuration (a
>> > > > lot of those were lost during the emergency move).
>> > > >
>> > > > While running it I had many issues and I think I found a good way to
>> > > > handle our flaky tests. I would love others to think about it.
>> > > >
>> > > > Those interested - please take a look at the PR "Bring back CI
>> > > > optimisations" https://github.com/apache/airflow/pull/8393
>> > > > Corresponding GitHub Actions run here:
>> > > > https://github.com/apache/airflow/actions/runs/82410109
>> > > >
>> > > > I implemented a lot of optimizations in this PR (some of them will
>> > > > only take effect after we merge to master) but most of all I wanted
>> > > > to introduce a concept of "quarantined tests" (good name, isn't it :) )
>> > > >
>> > > > Here is the idea:
>> > > >
>> > > > - tests that are marked as @pytest.mark.quarantined are skipped in
>> > > >   regular runs (I identified 58 potential candidates - not all of
>> > > >   them are flaky but I wanted to be safe)
>> > > > - there is one dedicated "Quarantine" job that runs only quarantined
>> > > >   tests (it's Postgres 9.6 with Python 3.6 for now)
>> > > > - those "quarantined" tests are run with a 90 s timeout each and
>> > > >   rerun up to 3 times if they fail
>> > > > - failure of any of the Quarantine tests does not fail the whole CI
>> > > > - I plan to create GitHub issues for groups of those tests
>> > > >   (MoveOutOfQuarantine NNNN)
>> > > > - I think it's best if we split them between committers
>> > > > - the job of the committers will be to observe the stability of
>> > > >   those tests
>> > > > - once we fix the tests and observe that they are "stable", we move
>> > > >   them out of Quarantine back to regular tests (by removing
>> > > >   @pytest.mark.quarantined)
>> > > > - the goal is to move all our tests out of Quarantine
>> > > > - in the future we can move any flaky test to Quarantine (by adding
>> > > >   @pytest.mark.quarantined) and it will give us time to observe it
>> > > >   and fix any flakiness.
>> > > >
>> > > > Let me know what you think of it.
>> > > >
>> > > > J.
>> > > >
>> > > > --
>> > > > Jarek Potiuk
>> > > > Polidea | Principal Software Engineer
>> > > >
>> > > > M: +48 660 796 129
>> > >
>> > > --
>> > >
>> > > Tomasz Urbaszek
>> > > Polidea | Software Engineer
>> > >
>> > > M: +48 505 628 493

--
Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer
M: +48 660 796 129
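
For anyone who has not looked at the PR yet, the "skipped in regular runs" part of the quarantine idea from the original proposal could be wired up roughly like this. This is only a minimal conftest.py sketch and not the code from the PR: the --include-quarantined flag and the should_skip() helper are names made up for illustration. Only the @pytest.mark.quarantined marker comes from the proposal itself.

```python
"""conftest.py sketch: skip quarantined tests unless explicitly requested."""

QUARANTINE_MARKER = "quarantined"        # marker name from the proposal
INCLUDE_FLAG = "--include-quarantined"   # hypothetical flag, assumed for this sketch


def should_skip(is_quarantined: bool, include_quarantined: bool) -> bool:
    """Return True when a test must be skipped in the current run."""
    return is_quarantined and not include_quarantined


def pytest_addoption(parser):
    # Register the opt-in flag that a dedicated "Quarantine" job could pass.
    parser.addoption(
        INCLUDE_FLAG,
        action="store_true",
        default=False,
        help="also run tests marked as quarantined",
    )


def pytest_configure(config):
    # Declare the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line(
        "markers", f"{QUARANTINE_MARKER}: flaky test kept out of regular runs"
    )


def pytest_collection_modifyitems(config, items):
    # Imported lazily so should_skip() can be used without pytest installed.
    import pytest

    include = config.getoption(INCLUDE_FLAG)
    skip = pytest.mark.skip(reason=f"quarantined; pass {INCLUDE_FLAG} to run")
    for item in items:
        is_quarantined = item.get_closest_marker(QUARANTINE_MARKER) is not None
        if should_skip(is_quarantined, include):
            item.add_marker(skip)
```

With something like this in place, a plain "pytest" run skips everything marked @pytest.mark.quarantined, while the Quarantine job could run "pytest -m quarantined --include-quarantined". The 90 s timeout and the up-to-3 reruns are not covered here; one option, assuming the pytest-timeout and pytest-rerunfailures plugins are installed, would be to add "--timeout=90 --reruns 3" to that command. Which plugins the PR actually uses is not stated in this thread.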
