Thanks Chesnay for bringing up this discussion and sharing those thoughts
to speed up the building process.

I'd +1 options 2 and 3.

We can benefit a lot from Option 2. Developing the table, connector,
library, and docs modules would require far fewer tests (1/3 to 1/tens) to run.
PRs for those modules take up more than half of all the PRs in my

Option 3 can supplement option 2 when the PR modifies fundamental modules
like flink-core or flink-runtime.
It could even be a switch for the test scope (basic/full) of a PR, so that
committers do not need to trigger it multiple times.
With it we can postpone the testing of IT cases or connectors until the PR
reaches a stable state.
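
To illustrate the basic/full switch, here is a minimal sketch. It assumes the scope is stored per PR and stays in effect for later builds; `PrBuildState`, `set_scope`, `stages_for`, and the stage names are hypothetical and not existing ci-bot features.

```python
from dataclasses import dataclass

VALID_SCOPES = ("basic", "full")


@dataclass
class PrBuildState:
    """Hypothetical per-PR state that a CI bot could keep between builds."""
    scope: str = "basic"  # every PR starts with the cheap test scope


def set_scope(state, scope):
    """Flip the switch once; later builds of the PR keep using it."""
    if scope not in VALID_SCOPES:
        raise ValueError(f"unknown scope: {scope!r}")
    state.scope = scope


def stages_for(state):
    """Return the (made-up) stage names to run for the current scope."""
    stages = ["unit-tests"]
    if state.scope == "full":
        # Expensive stages are deferred until the scope is switched to full.
        stages += ["it-cases", "connector-tests"]
    return stages
```

A committer would call `set_scope(state, "full")` once the PR looks stable, and every later push would then run the full set without another manual trigger.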

Zhu Zhu

Chesnay Schepler <> wrote on Thu, Aug 15, 2019 at 3:38 PM:

> Hello everyone,
> improving our build times is a hot topic at the moment so let's discuss
> the different ways how they could be reduced.
>         Current state:
> First up, let's look at some numbers:
> 1 full build currently consumes 5h of build time total ("total time"),
> and in the ideal case takes about 1h20m ("run time") to complete from
> start to finish. The run time may fluctuate of course depending on the
> current Travis load. This applies both to builds on the Apache and
> flink-ci Travis.
> At the time of writing, the current queue time for PR jobs (reminder:
> running on flink-ci) is about 30 minutes (which basically means that we
> are processing builds at the rate that they come in), however we are in
> an admittedly quiet period right now.
> 2 weeks ago the queue times on flink-ci peaked at around 5-6h as
> everyone was scrambling to get their changes merged in time for the
> feature freeze.
> (Note: Recently optimizations were added to ci-bot where pending builds
> are canceled if a new commit was pushed to the PR or the PR was closed,
> which should prove especially useful during the rush hours we see before
> feature-freezes.)
>         Past approaches
> Over the years we have done rather few things to improve this situation
> (hence our current predicament).
> Beyond the sporadic speedup of some tests, the only notable reduction in
> total build times was the introduction of cron jobs, which consolidated
> the per-commit matrix from 4 configurations (different scala/hadoop
> versions) to 1.
> The separation into multiple build profiles was only a work-around for
> the 50m limit on Travis. Running tests in parallel has the obvious
> potential of reducing run time, but we're currently hitting a hard limit
> since a few modules (flink-tests, flink-runtime,
> flink-table-planner-blink) are so loaded with tests that they nearly
> consume an entire profile by themselves (and thus no further splitting
> is possible).
> The rework that introduced stages also did not provide a speedup at the
> time of its introduction, although this changed slightly once more
> profiles were added and some optimizations to the caching were made.
> Very recently we modified the surefire-plugin configuration for
> flink-table-planner-blink to reuse JVM forks for IT cases, providing a
> significant speedup (18 minutes!). So far we have not seen any negative
> consequences.
>         Suggestions
> This is a list of /all/ suggestions for reducing run/total times that I
> have seen recently (in other words, they aren't necessarily mine, nor do
> I necessarily agree with all of them).
>  1. Enable JVM reuse for IT cases in more modules.
>       * We've seen significant speedups in the blink planner, and this
>         should be applicable for all modules. However, I presume there's
>         a reason why we disabled JVM reuse (information on this would be
>         appreciated)
>  2. Custom differential build scripts
>       * Set up custom scripts for determining which modules might be
>         affected by a change, and manipulate the splits accordingly. This
>         approach is conceptually quite straight-forward, but has limits
>         since it has to be pessimistic; i.e. a change in flink-core
>         _must_ result in testing all modules.
>  3. Only run smoke tests when PR is opened, run heavy tests on demand.
>       * With the introduction of the ci-bot we now have significantly
>         more options on how to handle PR builds. One option could be to
>         only run basic tests when the PR is created (which may be only
>         modified modules, or all unit tests, or another low-cost
>         scheme), and then have a committer trigger other builds (full
>         test run, e2e tests, etc...) on demand.
>  4. Move more tests into cron builds
>       * The budget version of 3); move certain tests that are either
>         expensive (like some runtime tests that take minutes) or in
>         rarely modified modules (like gelly) into cron jobs.
>  5. Gradle
>       * Gradle was brought up a few times for its built-in support for
>         differential builds; basically providing 2) without the overhead
>         of maintaining additional scripts.
>       * To date no PoC was provided that shows it working in our CI
>         environment (i.e., handling splits & caching etc).
>       * This is the most disruptive change by a fair margin, as it would
>         affect the entire project, developers and potentially users (if
>         they build from source).
>  6. CI service
>       * Our current artifact caching setup on Travis is basically a
>         hack; we're abusing the Travis cache, which is meant
>         for long-term caching, to ship build artifacts across jobs. It's
>         brittle at times due to timing/visibility issues and on branches
>         the cleanup processes can interfere with running builds. It is
>         also not as effective as it could be.
>       * There are CI services that provide build artifact caching out of
>         the box, which could be useful for us.
>       * To date, no PoC for using another CI service has been provided.
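
The pessimistic selection described in suggestion 2 could be sketched roughly as follows. This is only an illustration: the dependency edges in `DEPENDS_ON` are a made-up, heavily simplified stand-in for the real module graph, treating the top-level directory as the module is an assumption, and `module_of` / `affected_modules` are hypothetical helper names.

```python
from collections import deque

# Illustrative only: module -> modules it depends on directly.
# Module names are taken from this thread; the edges are a simplification.
DEPENDS_ON = {
    "flink-core": [],
    "flink-runtime": ["flink-core"],
    "flink-tests": ["flink-runtime"],
    "flink-table-planner-blink": ["flink-runtime"],
    "flink-gelly": ["flink-runtime"],
}


def module_of(path):
    """Assume the top-level directory of a changed file names its module."""
    return path.split("/", 1)[0]


def affected_modules(changed_files, depends_on=DEPENDS_ON):
    """Return the changed modules plus everything that depends on them."""
    # Invert the graph: module -> modules that depend on it directly.
    dependents = {module: set() for module in depends_on}
    for module, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(module)

    changed = {module_of(path) for path in changed_files}
    # Pessimistic fallback: a change outside any known module
    # (e.g. the root pom.xml) selects every module.
    if not changed.issubset(depends_on):
        return set(depends_on)

    # Breadth-first walk over the transitive dependents.
    affected, queue = set(changed), deque(changed)
    while queue:
        for dependent in dependents[queue.popleft()]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

With this toy graph, a change under `flink-gelly/` selects only `flink-gelly`, while a change under `flink-core/` selects every module, which is the pessimistic behaviour described above.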

