Thanks for starting this discussion, Chesnay. I think it has become obvious to the Flink community that with the existing build setup we cannot really deliver the fast build times which are essential for fast iteration cycles and high developer productivity. The reasons for this situation are manifold, but it is definitely affected by Flink's project growth, not always optimal tests, and the inflexibility of having to build everything. Hence, I consider the reduction of build times crucial for the project's health and future growth.
Without necessarily voicing a strong preference for any of the presented suggestions, I wanted to comment on each of them:

1. This sounds promising. Could the reason why we don't reuse JVMs date back to the time when we still had a lot of static fields in Flink, which made it hard to reuse JVMs because of potentially mutated global state?

2. Building hand-crafted solutions around a build system in order to compensate for limitations which other build systems cover out of the box sounds like not-invented-here syndrome to me. Reinventing the wheel has historically proven to rarely be the best solution, and it often comes with a high maintenance price tag. Moreover, it would add yet another layer of complexity around our existing build system. The current state, where we have the Maven setup in the pom files plus multiple bash scripts that specialize the Travis builds to fit the time limit, is already not very transparent or easy to understand.

3. I could see this work, but it also requires a very good understanding of Flink from every committer, because the committer needs to know which tests would be good to run additionally.

4. I would be against this option solely to decrease our build time. My observation is that the community does not monitor the health of the cron jobs well enough. In the past, the cron jobs have been unstable for as long as a complete release cycle. Moreover, I've seen PRs get merged that passed Travis but broke the cron jobs. Consequently, I fear that this option would deteriorate Flink's stability.

5. I would rephrase this point into changing the build system. Gradle could be one candidate, but there are also other build systems out there, like Bazel.
Changing the build system would indeed be a major endeavour, but I could see the long-term benefits of such a change (similar to having a consistent and enforced code style), in particular if the build system supports the functionality which we would otherwise have to build & maintain on our own. I think there would be ways to make the transition less disruptive than described. For example, one could keep the Maven build and the new build side by side until one is confident enough that the new build produces the same output as the Maven build. Maybe it would also be possible to migrate individual modules, starting from the leaves. However, I admit that changing the build system will affect every Flink developer, because she needs to learn & understand it.

6. I would like to learn about other people's experience with different CI systems. Travis has worked OK-ish for Flink so far, but we sometimes see problems with its caching mechanism, as Chesnay stated. I think this topic is actually orthogonal to the other suggestions.

My gut feeling is that no single suggestion will be our solution, but rather a combination of them.

Cheers,
Till

On Thu, Aug 15, 2019 at 10:50 AM Zhu Zhu <reed...@gmail.com> wrote:

> Thanks Chesnay for bringing up this discussion and sharing those thoughts
> to speed up the building process.
>
> I'd +1 for option 2 and 3.
>
> We can benefit a lot from option 2. Developing the table, connectors,
> libraries, and docs modules would result in much fewer tests (1/3 to
> 1/tens of the total) to run.
> PRs for those modules make up more than half of all PRs in my observation.
>
> Option 3 can be a supplement to option 2 for PRs that modify fundamental
> modules like flink-core or flink-runtime.
> It could even be a switch for the test scope (basic/full) of a PR, so that
> committers do not need to trigger it multiple times.
> With it we can postpone the testing of IT cases or connectors until the PR
> reaches a stable state.
>
> Thanks,
> Zhu Zhu
>
> Chesnay Schepler <ches...@apache.org> wrote on Thu, Aug 15, 2019 at 3:38 PM:
>
> > Hello everyone,
> >
> > improving our build times is a hot topic at the moment, so let's discuss
> > the different ways in which they could be reduced.
> >
> > Current state:
> >
> > First up, let's look at some numbers:
> >
> > 1 full build currently consumes 5h of build time total ("total time"),
> > and in the ideal case takes about 1h20m ("run time") to complete from
> > start to finish. The run time may of course fluctuate depending on the
> > current Travis load. This applies both to builds on the Apache and
> > flink-ci Travis.
> >
> > At the time of writing, the current queue time for PR jobs (reminder:
> > running on flink-ci) is about 30 minutes (which basically means that we
> > are processing builds at the rate they come in); however, we are in an
> > admittedly quiet period right now.
> > 2 weeks ago the queue times on flink-ci peaked at around 5-6h, as
> > everyone was scrambling to get their changes merged in time for the
> > feature freeze.
> >
> > (Note: Recently, optimizations were added to the ci-bot whereby pending
> > builds are canceled if a new commit is pushed to the PR or the PR is
> > closed, which should prove especially useful during the rush hours we
> > see before feature freezes.)
> >
> > Past approaches
> >
> > Over the years we have done rather few things to improve this situation
> > (hence our current predicament).
> >
> > Beyond the sporadic speedup of some tests, the only notable reduction in
> > total build times was the introduction of cron jobs, which consolidated
> > the per-commit matrix from 4 configurations (different Scala/Hadoop
> > versions) down to 1.
> >
> > The separation into multiple build profiles was only a work-around for
> > the 50m limit on Travis.
> > Running tests in parallel has the obvious potential of reducing run
> > time, but we're currently hitting a hard limit, since a few modules
> > (flink-tests, flink-runtime, flink-table-planner-blink) are so loaded
> > with tests that they nearly consume an entire profile by themselves
> > (and thus no further splitting is possible).
> >
> > The rework that introduced stages did not, at the time of its
> > introduction, provide a speedup, although this changed slightly once
> > more profiles were added and some optimizations to the caching were
> > made.
> >
> > Very recently we modified the surefire-plugin configuration for
> > flink-table-planner-blink to reuse JVM forks for IT cases, providing a
> > significant speedup (18 minutes!). So far we have not seen any negative
> > consequences.
> >
> > Suggestions
> >
> > This is a list of /all/ suggestions for reducing run/total times that I
> > have seen recently (in other words, they aren't necessarily mine, nor
> > may I agree with all of them).
> >
> > 1. Enable JVM reuse for IT cases in more modules.
> >     * We've seen significant speedups in the blink planner, and this
> >       should be applicable to all modules. However, I presume there's
> >       a reason why we disabled JVM reuse (information on this would be
> >       appreciated).
> > 2. Custom differential build scripts
> >     * Set up custom scripts for determining which modules might be
> >       affected by a change, and manipulate the splits accordingly. This
> >       approach is conceptually quite straight-forward, but has limits,
> >       since it has to be pessimistic; i.e. a change in flink-core
> >       _must_ result in testing all modules.
> > 3. Only run smoke tests when a PR is opened; run heavy tests on demand.
> >     * With the introduction of the ci-bot we now have significantly
> >       more options for how to handle PR builds.
> > One option could be to only run basic tests when the PR is created
> > (which may be only the modified modules, or all unit tests, or another
> > low-cost scheme), and then have a committer trigger other builds (full
> > test run, e2e tests, etc.) on demand.
> > 4. Move more tests into cron builds
> >     * The budget version of 3): move certain tests that are either
> >       expensive (like some runtime tests that take minutes) or in
> >       rarely modified modules (like gelly) into cron jobs.
> > 5. Gradle
> >     * Gradle was brought up a few times for its built-in support for
> >       differential builds, basically providing 2) without the overhead
> >       of maintaining additional scripts.
> >     * To date no PoC has been provided that shows it working in our CI
> >       environment (i.e., handling splits & caching etc.).
> >     * This is the most disruptive change by a fair margin, as it would
> >       affect the entire project, developers, and potentially users (if
> >       they build from source).
> > 6. CI service
> >     * Our current artifact caching setup on Travis is a hack: we're
> >       abusing the Travis cache, which is meant for long-term caching,
> >       to ship build artifacts across jobs. It's brittle at times due to
> >       timing/visibility issues, and on branches the cleanup processes
> >       can interfere with running builds. It is also not as effective as
> >       it could be.
> >     * There are CI services that provide build artifact caching out of
> >       the box, which could be useful for us.
> >     * To date, no PoC for using another CI service has been provided.
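For readers following suggestion 1: the JVM-reuse change discussed above amounts to a maven-surefire-plugin configuration roughly along these lines. This is only a sketch — the `forkCount` value and the way the setting is scoped to IT cases in Flink's actual poms are assumptions, not a copy of the real configuration:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- Keep a bounded number of forked test JVMs... -->
    <forkCount>1</forkCount>
    <!-- ...and reuse each fork across test classes instead of
         paying JVM startup cost per class. Requires that tests
         do not leak mutable static/global state between classes. -->
    <reuseForks>true</reuseForks>
  </configuration>
</plugin>
```

The trade-off named in the thread applies directly: `reuseForks` is only safe once tests stop relying on a fresh JVM, which is presumably why it was historically disabled.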
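The core of suggestion 2 — mapping a change to the set of modules that must be retested — is a reverse-dependency closure over the module graph. A minimal sketch follows; the module names and dependency edges are invented for illustration, and a real script would derive the changed set from `git diff` and the graph from the poms:

```python
def modules_to_test(changed, reverse_deps):
    """Return the changed modules plus every module that (transitively)
    depends on one of them -- the pessimistic set that must be retested."""
    result = set(changed)
    stack = list(changed)
    while stack:
        module = stack.pop()
        for dependent in reverse_deps.get(module, ()):
            if dependent not in result:
                result.add(dependent)
                stack.append(dependent)
    return result

# Invented toy graph: module -> modules that declare a dependency on it.
REVERSE_DEPS = {
    "flink-core": ["flink-runtime", "flink-table-planner-blink"],
    "flink-runtime": ["flink-tests"],
    "flink-table-planner-blink": ["flink-tests"],
    "flink-gelly": [],
}

# A change in a leaf module stays cheap:
print(modules_to_test({"flink-gelly"}, REVERSE_DEPS))
# -> {'flink-gelly'}

# A change in flink-core pulls in everything downstream, which is
# exactly the pessimism limit the suggestion itself points out:
print(sorted(modules_to_test({"flink-core"}, REVERSE_DEPS)))
# -> ['flink-core', 'flink-runtime', 'flink-table-planner-blink', 'flink-tests']
```

This also shows why suggestion 5 keeps coming up: a build tool with first-class dependency tracking maintains this graph for free, instead of it living in bespoke scripts.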