Re: [DISCUSS] Reducing build times

Arvid Heise Fri, 16 Aug 2019 01:14:35 -0700

Thank you for starting the discussion as well!

+1 to 1. It seems to be quite a low-hanging fruit that we should use as
much as possible.


-0 to 2. The build setup is already very complicated. Adding new
functionality that I would expect a modern build tool to provide out of the
box seems like too much effort to me. I'm proposing a 7th action item that
I would like to try out first before making the setup more complicated.

+0 to 3. What is the actual intent here? If it's about failing earlier,
then I'd rather propose to reorder the tests such that unit and smoke tests
of every module are run before IT tests. If it's about being able to
approve a PR quicker, are smoke tests really enough? However, if we have
layered tests, then it would be rather easy to omit IT tests altogether in
specific (local) builds.
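
To make the layering idea a bit more concrete, here is a minimal sketch in
Python. It is purely illustrative: the layer names, the (module, layer)
pairs, and the schedule() helper are my own assumptions, not anything that
exists in our build today.

```python
# Hypothetical sketch: order test tasks by layer first, so every module's
# cheap tests run before any module's IT cases.
LAYER_ORDER = {"unit": 0, "smoke": 1, "it": 2}


def schedule(tasks, skip_layers=()):
    """Return (module, layer) pairs sorted by layer.

    sorted() is stable, so tasks within one layer keep their incoming
    order (for example, the module build order).
    """
    kept = [task for task in tasks if task[1] not in skip_layers]
    return sorted(kept, key=lambda task: LAYER_ORDER[task[1]])


# Assumed example input; the layer assignment is made up for illustration.
tasks = [
    ("flink-core", "unit"),
    ("flink-core", "it"),
    ("flink-runtime", "unit"),
    ("flink-runtime", "smoke"),
    ("flink-runtime", "it"),
]

full_run = schedule(tasks)                        # fail early on cheap tests
local_run = schedule(tasks, skip_layers=("it",))  # omit IT cases locally
```

With such a layering, failing earlier and skipping IT cases in local builds
are the same mechanism with different parameters.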

-1 to 4. I really want to see when stuff breaks right away, not only once
per day (or whatever the CRON cycle is). I can really see more broken code
being merged into master because of the disconnect.

+1 to 5. Gradle build cache has worked well for me in the past. If there is
a general interest, I can start a POC (or improve upon older POCs). I
currently expect shading to be the most effort.

+1 to 6. Travis had so many drawbacks in the past and now that most of the
senior staff has been laid off, I don't expect any improvements at all.
At my old company, I switched our open source projects to Azure Pipelines
with great success. Azure Pipelines offers 10 instances for open source
projects and its payment model is pay-as-you-go [1]. Since artifact
sharing seems to be an issue with Travis anyway, it is worth noting that it
looks rather easy to do in Azure Pipelines [2].
I'd also expect GitHub CI to be a good fit for our needs [3], but it's
rather young and I have no experience with it.

---

7. I'd like to first try the global build cache that Gradle Enterprise
provides for Maven [4]. It basically fingerprints a task
(fingerprint of upstream tasks, source files + black magic) and whenever
the fingerprint matches, it fetches the results from the build cache. In
theory, we would get the results of 2. implicitly without any effort. Of
course, Gradle Enterprise costs money (I could inquire about pricing if
there is general interest) but it would also allow us to downgrade the
Travis plan (and Travis is really expensive).


[1]
https://azure.microsoft.com/en-in/blog/announcing-azure-pipelines-with-unlimited-ci-cd-minutes-for-open-source/
[2]
https://docs.microsoft.com/en-us/azure/devops/pipelines/artifacts/pipeline-artifacts?view=azure-devops&tabs=yaml
[3] https://github.blog/2019-08-08-github-actions-now-supports-ci-cd/
[4] https://docs.gradle.com/enterprise/maven-extension/

On Fri, Aug 16, 2019 at 5:20 AM Jark Wu <[email protected]> wrote:

> Thanks Chesnay for starting this discussion.
>
> +1 for #1, it might be the easiest way to get a significant speedup.
> If the only reason is for isolation. I think we can fix the static fields
> or global state used in Flink if possible.
>
> +1 for #2, and thanks Aleksey for the prototype. I think it's a good
> approach which doesn't introduce too many things to maintain.
>
> +1 for #3(run CRON or e2e tests on demand).
> We have this requirement when reviewing some pull requests, because we
> are not sure whether they will break some specific e2e test.
> Currently, we have to run it locally by building the whole project, or
> enable CRON jobs for the pushed branch in the contributor's own Travis.
>
> Besides that, I think FLINK-11464[1] is also a good way to cache
> distributions to save a lot of download time.
>
> Best,
> Jark
>
> [1]: https://issues.apache.org/jira/browse/FLINK-11464
>
> On Thu, 15 Aug 2019 at 21:47, Aleksey Pak <[email protected]> wrote:
>
> > Hi all!
> >
> > Thanks for starting this discussion.
> >
> > I'd like to also add my 2 cents:
> >
> > +1 for #2, differential build scripts.
> > I've worked on the approach. And with it, I think it's possible to reduce
> > total build time with relatively low effort, without enforcing any new
> > build tool and low maintenance cost.
> >
> > You can check a proposed change (for the old CI setup, when Flink PRs
> > were running in Apache common CI pool) here:
> > https://github.com/apache/flink/pull/9065
> > In the proposed change, the dependency check is not heavily hardcoded and
> > just uses maven's results for dependency graph analysis.
> >
> > > This approach is conceptually quite straight-forward, but has limits
> > > since it has to be pessimistic; i.e. a change in flink-core _must_
> > > result in testing all modules.
> >
> > Agreed, in Flink's case there are some core modules that would trigger
> > a whole test run with such an approach. For developers who modify such
> > components, the build time would be the longest. But this approach should
> > really help developers who touch more-or-less independent modules.
> >
> > Even for core modules, it's possible to create "abstraction" barriers by
> > changing the dependency graph. For example, it can look like:
> > flink-core-api <-- flink-core, flink-core-api <-- flink-connectors.
> > In that case, only a change in flink-core-api would trigger a whole test
> > run.
> >
> > +1 for #3, separating PR CI runs to different stages.
> > Imo, it may require more change to current CI setup, compared to #2 and
> > better it should not be silly. Best, if it integrates with the Flink bot
> > and triggers some follow up build steps only when some prerequisites are
> > done.
> >
> > +1 for #4, to move some tests into cron runs.
> > But imo, this does not scale well, it applies only to a small subset of
> > tests.
> >
> > +1 for #6, to use other CI service(s).
> > More specifically, GitHub gives build actions for free that can be used
> to
> > offload some build steps/PR checks. It can help to move out some PR
> checks
> > from the main CI build (for example: documentation builds, license
> checks,
> > code formatting checks).
> >
> > Regards,
> > Aleksey
> >
> > On Thu, Aug 15, 2019 at 11:08 AM Till Rohrmann <[email protected]>
> > wrote:
> >
> > > Thanks for starting this discussion Chesnay. I think it has become
> > obvious
> > > to the Flink community that with the existing build setup we cannot
> > really
> > > deliver fast build times which are essential for fast iteration cycles
> > and
> > > high developer productivity. The reasons for this situation are
> manifold
> > > but it is definitely affected by Flink's project growth, not always
> > optimal
> > > tests and the inflexibility that everything needs to be built. Hence, I
> > > consider the reduction of build times crucial for the project's health
> > and
> > > future growth.
> > >
> > > Without necessarily voicing a strong preference for any of the
> presented
> > > suggestions, I wanted to comment on each of them:
> > >
> > > 1. This sounds promising. Could the reason why we don't reuse JVMs date
> > > back to the time when we still had a lot of static fields in Flink
> which
> > > made it hard to reuse JVMs and the potentially mutated global state?
> > >
> > > 2. Building hand-crafted solutions around a build system in order to
> > > compensate for its limitations which other build systems support out of
> > the
> > > box sounds like the not invented here syndrome to me. Reinventing the
> > wheel
> > > has historically proven to be usually not the best solution and it
> often
> > > comes with a high maintenance price tag. Moreover, it would add just
> > > another layer of complexity around our existing build system. I think
> the
> > > current state where we have the maven setup in pom files and for Travis
> > > multiple bash scripts specializing the builds to make it fit the time
> > limit
> > > is already not very transparent/easy to understand.
> > >
> > > 3. I could see this work but it also requires a very good understanding
> > of
> > > Flink of every committer because the committer needs to know which
> tests
> > > would be good to run additionally.
> > >
> > > 4. I would be against this option solely to decrease our build time. My
> > > observation is that the community does not monitor the health of the
> cron
> > > jobs well enough. In the past the cron jobs have been unstable for as
> > long
> > > as a complete release cycle. Moreover, I've seen that PRs were merged
> > which
> > > passed Travis but broke the cron jobs. Consequently, I fear that this
> > > option would deteriorate Flink's stability.
> > >
> > > 5. I would rephrase this point into changing the build system. Gradle
> > could
> > > be one candidate but there are also other build systems out there like
> > > Bazel. Changing the build system would indeed be a major endeavour but
> I
> > > could see the long term benefits of such a change (similar to having a
> > > consistent and enforced code style) in particular if the build system
> > > supports the functionality which we would otherwise build & maintain on
> > our
> > > own. I think there would be ways to make the transition not as
> disruptive
> > > as described. For example, one could keep the Maven build and the new
> > build
> > > side by side until one is confident enough that the new build produces
> > the
> > > same output as the Maven build. Maybe it would also be possible to
> > migrate
> > > individual modules starting from the leaves. However, I admit that
> > changing
> > > the build system will affect every Flink developer because she needs to
> > > learn & understand it.
> > >
> > > 6. I would like to learn about other people's experience with different
> > CI
> > > systems. Travis worked okish for Flink so far but we see sometimes
> > problems
> > > with its caching mechanism as Chesnay stated. I think that this topic
> is
> > > actually orthogonal to the other suggestions.
> > >
> > > My gut feeling is that not a single suggestion will be our solution
> but a
> > > combination of them.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Thu, Aug 15, 2019 at 10:50 AM Zhu Zhu <[email protected]> wrote:
> > >
> > > > Thanks Chesnay for bringing up this discussion and sharing those
> > thoughts
> > > > to speed up the building process.
> > > >
> > > > I'd +1 for option 2 and 3.
> > > >
> > > > We can benefit a lot from Option 2. Developing table, connectors,
> > > > libraries, docs modules would result in far fewer tests (1/3 to
> > > > 1/tens) to run.
> > > > PRs for those modules take up more than half of all the PRs in my
> > > > observation.
> > > >
> > > > Option 3 can be a supplementary to option 2 that if the PR is
> modifying
> > > > fundamental modules like flink-core or flink-runtime.
> > > > It can even be a switch of the tests scope(basic/full) of a PR, so
> that
> > > > committers do not need to trigger it multiple times.
> > > > With it we can postpone the testing of IT cases or connectors before
> > the
> > > PR
> > > > reaches a stable state.
> > > >
> > > > Thanks,
> > > > Zhu Zhu
> > > >
> > > > On Thu, Aug 15, 2019 at 3:38 PM Chesnay Schepler <[email protected]> wrote:
> > > >
> > > > > Hello everyone,
> > > > >
> > > > > improving our build times is a hot topic at the moment so let's
> > discuss
> > > > > the different ways how they could be reduced.
> > > > >
> > > > >
> > > > >         Current state:
> > > > >
> > > > > First up, let's look at some numbers:
> > > > >
> > > > > 1 full build currently consumes 5h of build time total ("total
> > time"),
> > > > > and in the ideal case takes about 1h20m ("run time") to complete
> from
> > > > > start to finish. The run time may fluctuate of course depending on
> > the
> > > > > current Travis load. This applies both to builds on the Apache and
> > > > > flink-ci Travis.
> > > > >
> > > > > At the time of writing, the current queue time for PR jobs
> (reminder:
> > > > > running on flink-ci) is about 30 minutes (which basically means
> that
> > we
> > > > > are processing builds at the rate that they come in), however we
> are
> > in
> > > > > an admittedly quiet period right now.
> > > > > 2 weeks ago the queue times on flink-ci peaked at around 5-6h as
> > > > > everyone was scrambling to get their changes merged in time for the
> > > > > feature freeze.
> > > > >
> > > > > (Note: Recently optimizations were added to ci-bot where pending
> > > > > builds are canceled if a new commit was pushed to the PR or the PR
> > > > > was closed, which should prove especially useful during the rush
> > > > > hours we see before feature-freezes.)
> > > > >
> > > > >
> > > > >         Past approaches
> > > > >
> > > > > Over the years we have done rather few things to improve this
> > situation
> > > > > (hence our current predicament).
> > > > >
> > > > > Beyond the sporadic speedup of some tests, the only notable
> reduction
> > > in
> > > > > total build times was the introduction of cron jobs, which
> > consolidated
> > > > > the per-commit matrix from 4 configurations (different scala/hadoop
> > > > > versions) to 1.
> > > > >
> > > > > The separation into multiple build profiles was only a work-around
> > for
> > > > > the 50m limit on Travis. Running tests in parallel has the obvious
> > > > > potential of reducing run time, but we're currently hitting a hard
> > > limit
> > > > > since a few modules (flink-tests, flink-runtime,
> > > > > flink-table-planner-blink) are so loaded with tests that they
> nearly
> > > > > consume an entire profile by themselves (and thus no further
> > splitting
> > > > > is possible).
> > > > >
> > > > > The rework that introduced stages, at the time of introduction, did
> > > also
> > > > > not provide a speed up, although this changed slightly once more
> > > > > profiles were added and some optimizations to the caching have been
> > > made.
> > > > >
> > > > > Very recently we modified the surefire-plugin configuration for
> > > > > flink-table-planner-blink to reuse JVM forks for IT cases,
> providing
> > a
> > > > > significant speedup (18 minutes!). So far we have not seen any
> > negative
> > > > > consequences.
> > > > >
> > > > >
> > > > >         Suggestions
> > > > >
> > > > > This is a list of /all/ suggestions for reducing run/total times
> > that I
> > > > > have seen recently (in other words, they aren't necessarily mine
> nor
> > > may
> > > > > I agree with all of them).
> > > > >
> > > > >  1. Enable JVM reuse for IT cases in more modules.
> > > > >       * We've seen significant speedups in the blink planner, and
> > this
> > > > >         should be applicable for all modules. However, I presume
> > > there's
> > > > >         a reason why we disabled JVM reuse (information on this
> would
> > > be
> > > > >         appreciated)
> > > > >  2. Custom differential build scripts
> > > > >       * Setup custom scripts for determining which modules might be
> > > > >         affected by change, and manipulate the splits accordingly.
> > This
> > > > >         approach is conceptually quite straight-forward, but has
> > limits
> > > > >         since it has to be pessimistic; i.e. a change in flink-core
> > > > >         _must_ result in testing all modules.
> > > > >  3. Only run smoke tests when PR is opened, run heavy tests on
> > demand.
> > > > >       * With the introduction of the ci-bot we now have
> significantly
> > > > >         more options on how to handle PR builds. One option could
> be
> > to
> > > > >         only run basic tests when the PR is created (which may be
> > only
> > > > >         modified modules, or all unit tests, or another low-cost
> > > > >         scheme), and then have a committer trigger other builds
> (full
> > > > >         test run, e2e tests, etc...) on demand.
> > > > >  4. Move more tests into cron builds
> > > > >       * The budget version of 3); move certain tests that are
> either
> > > > >         expensive (like some runtime tests that take minutes) or in
> > > > >         rarely modified modules (like gelly) into cron jobs.
> > > > >  5. Gradle
> > > > >       * Gradle was brought up a few times for its built-in support
> > for
> > > > >         differential builds; basically providing 2) without the
> > > overhead
> > > > >         of maintaining additional scripts.
> > > > >       * To date no PoC was provided that shows it working in our CI
> > > > >         environment (i.e., handling splits & caching etc).
> > > > >       * This is the most disruptive change by a fair margin, as it
> > > would
> > > > >         affect the entire project, developers and potentially users
> > > > >         (if they build from source).
> > > > >  6. CI service
> > > > >       * Our current artifact caching setup on Travis is basically a
> > > > >         hack; we're basically abusing the Travis cache, which is
> > meant
> > > > >         for long-term caching, to ship build artifacts across jobs.
> > > It's
> > > > >         brittle at times due to timing/visibility issues and on
> > > branches
> > > > >         the cleanup processes can interfere with running builds. It
> > is
> > > > >         also not as effective as it could be.
> > > > >       * There are CI services that provide build artifact caching
> out
> > > of
> > > > >         the box, which could be useful for us.
> > > > >       * To date, no PoC for using another CI service has been
> > provided.
> > > > >
> > > > >
> > > >
> > >
> >
>


-- 

Arvid Heise | Senior Software Engineer

<https://www.ververica.com/>
