Yes, we can ensure the same (or better) experience for contributors.

On the powerful machines, builds finish in 1.5 hours (without any caching
enabled).

Azure Pipelines offers open source projects 10 concurrent builds with a
6-hour timeout per build. Flink needs 3.5 hours on that infrastructure (not
parallelized at all, no caching). These free machines are very similar to
those of Travis, so I expect no build time regressions if we set it up
similarly.
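
To make this concrete, here is a minimal sketch of what an
azure-pipelines.yml could look like. This is hypothetical - the job names,
module split, and VM image are placeholders, not the actual configuration:

trigger:
  - master

jobs:
  - job: core_tests
    timeoutInMinutes: 360   # the free tier allows up to 6 hours per job
    pool:
      vmImage: 'ubuntu-16.04'
    steps:
      - script: mvn clean verify -pl flink-core,flink-runtime -am
        displayName: 'Compile and test core modules'
  - job: misc_tests
    timeoutInMinutes: 360
    pool:
      vmImage: 'ubuntu-16.04'
    steps:
      - script: mvn clean verify -pl flink-tests
        displayName: 'Run flink-tests'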


On Wed, Sep 4, 2019 at 9:19 AM Chesnay Schepler <ches...@apache.org> wrote:

> Will using more powerful machines for the project make it more difficult
> to ensure that contributor builds still run in a reasonable time?
>
> As an example of this happening on Travis, contributors currently cannot
> run all e2e tests since they time out, but on apache we have a larger
> timeout.
>
> On 03/09/2019 18:57, Robert Metzger wrote:
> > Hi all,
> >
> > I wanted to give a short update on this:
> > - Arvid, Aljoscha and I have started working on a Gradle PoC, currently
> > working on making all modules compile and test with Gradle. We've also
> > identified some problematic areas (shading being the most obvious one)
> > which we will analyse as part of the PoC.
> > The goal is to see how much Gradle helps to parallelise our build, and to
> > avoid duplicate work (incremental builds).
> >
> > - I am working on setting up a Flink testing infrastructure based on
> > Azure Pipelines, using more powerful hardware. Alibaba kindly provided
> > me with two 32-core machines (temporarily), and another company reached
> > out to me privately, looking into options for cheap, fast machines :)
> > If nobody in the community disagrees, I am going to set up Azure
> > Pipelines with our apache/flink GitHub repo as build infrastructure that
> > exists next to Flinkbot and flink-ci. I would like to make sure that
> > Azure Pipelines is at least as reliable as Travis, and I want to see
> > what the required maintenance work is.
> > On top of that, Azure Pipelines is a very feature-rich tool with a lot
> > of nice options for us to improve the build experience (statistics about
> > tests (flaky tests etc.), nice Docker support, plenty of free build
> > resources for open source projects, ...).
> >
> > Best,
> > Robert
> >
> >
> >
> >
> >
> > On Mon, Aug 19, 2019 at 5:12 PM Robert Metzger <rmetz...@apache.org>
> wrote:
> >
> >> Hi all,
> >>
> >> I have summarized all arguments mentioned so far + some additional
> >> research into a Wiki page here:
> >>
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=125309279
> >>
> >> I'm happy to hear further comments on my summary! I'm pretty sure we
> >> can find more pros and cons for the different options.
> >>
> >> My opinion after looking at the options:
> >>
> >>     - Flink relies on an outdated build tool (Maven), while a good
> >>     alternative is well-established (Gradle) and will likely provide a
> >>     much better CI and local build experience through incremental
> >>     builds and cached intermediates.
> >>     Scripting around Maven, or splitting modules / test execution /
> >>     repositories, won't solve this problem. We should rather spend the
> >>     effort on migrating to a modern build tool which will provide
> >>     benefits in the long run.
> >>     - Flink relies on a fairly slow build service (Travis CI), while
> >>     simply putting more money into the problem could cut the build time
> >>     at least in half.
> >>     We should consider using a build service that provides bigger
> >>     machines to solve our build time problem.
> >>
> >> My opinion is based on many assumptions (Gradle is actually as fast as
> >> promised (I haven't used it before), we can build Flink with Gradle, we
> >> find sponsors for bigger build machines) that we need to test first
> >> through PoCs.
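> >>
> >> As one illustration of the "cached intermediates" point, a hypothetical
> >> sketch (not from the PoC): enabling Gradle's build cache lets CI reuse
> >> the outputs of tasks whose inputs did not change.
> >>
> >> // settings.gradle -- illustrative only
> >> buildCache {
> >>     local {
> >>         enabled = true
> >>     }
> >> }
> >>
> >> // gradle.properties -- turn task-output caching on for every build
> >> org.gradle.caching=true
> >>
> >> With that in place, a "gradle build" on an unchanged module resolves
> >> its compile and test tasks from the cache instead of re-running them.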
> >>
> >> Best,
> >> Robert
> >>
> >>
> >>
> >>
> >> On Mon, Aug 19, 2019 at 10:26 AM Aljoscha Krettek <aljos...@apache.org>
> >> wrote:
> >>
> >>> I did a quick test: a normal "mvn clean install -DskipTests
> >>> -Drat.skip=true -Dmaven.javadoc.skip=true -Punsafe-mapr-repo" on my
> >>> machine takes about 14 minutes. After removing all mentions of
> >>> maven-shade-plugin the build time goes down to roughly 11.5 minutes.
> >>> (Obviously the resulting Flink won't work, because some expected stuff
> >>> is not packaged, and most of the end-to-end tests use the shade plugin
> >>> to package the jars for testing.)
> >>>
> >>> Aljoscha
> >>>
> >>>> On 18. Aug 2019, at 19:52, Robert Metzger <rmetz...@apache.org>
> wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I wanted to understand the impact of the hardware we are using for
> >>>> running our tests. Each Travis worker has 2 virtual cores and 7.5 GB
> >>>> of memory [1]. They are using Google Cloud Compute Engine
> >>>> *n1-standard-2* instances. Running a full "mvn clean verify" takes
> >>>> *03:32 h* on such a machine type. Running the same workload on a
> >>>> machine with 32 virtual cores and 64 GB of memory takes *1:21 h*.
> >>>>
> >>>> What is interesting are the per-module differences in build time.
> >>>> Modules which parallelize their tests well benefit greatly from the
> >>>> additional cores:
> >>>> "flink-tests" 36:51 min vs 4:33 min
> >>>> "flink-runtime" 23:41 min vs 3:47 min
> >>>> "flink-table-planner" 15:54 min vs 3:13 min
> >>>>
> >>>> On the other hand, we have modules which are not parallel at all:
> >>>> "flink-connector-kafka": 16:32 min vs 15:19 min
> >>>> "flink-connector-kafka-0.11": 9:52 min vs 7:46 min
> >>>> Also, the checkstyle plugin is not scaling at all.
> >>>>
> >>>> Chesnay reported some significant speedups by reusing forks.
> >>>> I don't know how much effort it would be to make the Kafka tests
> >>>> parallelizable. In total, they currently use 30 minutes on the big
> >>>> machine (while 31 CPUs are idling :) )
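> >>>>
> >>>> (For context, intra-module test parallelism in Maven is usually a
> >>>> surefire setting along the lines of the sketch below - the values are
> >>>> illustrative, and the Kafka tests would additionally have to be made
> >>>> safe to run concurrently:
> >>>>
> >>>> <configuration>
> >>>>   <!-- run test classes concurrently inside one JVM -->
> >>>>   <parallel>classes</parallel>
> >>>>   <threadCount>8</threadCount>
> >>>> </configuration>)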
> >>>>
> >>>> Let me know what you think about these results. If the community is
> >>>> generally interested in investigating further in that direction, I
> >>>> could look into software to orchestrate this, as well as sponsors for
> >>>> such an infrastructure.
> >>>>
> >>>> [1] https://docs.travis-ci.com/user/reference/overview/
> >>>>
> >>>>
> >>>> On Fri, Aug 16, 2019 at 3:27 PM Chesnay Schepler <ches...@apache.org>
> >>> wrote:
> >>>>> @Aljoscha Shading takes a few minutes for a full build; you can see
> >>>>> this quite easily by looking at the compile step in the misc profile
> >>>>> <https://api.travis-ci.org/v3/job/572560060/log.txt>; all modules
> >>>>> that take longer than a fraction of a second are usually spending
> >>>>> that time shading lots of classes. Note that I cannot tell you how
> >>>>> much of this is spent on relocations, and how much on writing the
> >>>>> jar.
> >>>>>
> >>>>> Personally, I'd very much like us to move all shading to
> >>>>> flink-shaded; this would finally allow us to use newer Maven versions
> >>>>> without needing cumbersome workarounds for flink-dist. However, this
> >>>>> isn't a trivial affair in some cases; IIRC calcite could be difficult
> >>>>> to handle.
> >>>>>
> >>>>> On another note, this would also simplify switching the main repo to
> >>>>> another build system, since you would no longer have to deal with
> >>>>> relocations, just packaging + merging NOTICE files.
> >>>>>
> >>>>> @BowenLi I disagree, flink-shaded does not include any tests, API
> >>>>> compatibility checks, checkstyle, or layered shading (e.g.,
> >>>>> flink-runtime and flink-dist, where both relocate dependencies and
> >>>>> one is bundled by the other), and, most importantly, CI (and really,
> >>>>> without CI being covered in a PoC there's nothing to discuss).
> >>>>>
> >>>>> On 16/08/2019 15:13, Aljoscha Krettek wrote:
> >>>>>> Speaking of flink-shaded, do we have any idea what the impact of
> >>>>>> shading is on the build time? We could get rid of shading completely
> >>>>>> in the Flink main repository by moving everything that we shade to
> >>>>>> flink-shaded.
> >>>>>> Aljoscha
> >>>>>>
> >>>>>>> On 16. Aug 2019, at 14:58, Bowen Li <bowenl...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> +1 to Till's points on #2 and #5, especially the potential
> >>>>> non-disruptive,
> >>>>>>> gradual migration approach if we decide to go that route.
> >>>>>>>
> >>>>>>> To add on, I want to point out that we can actually start with the
> >>>>>>> flink-shaded project [1], which is a perfect candidate for a PoC.
> >>>>>>> It's of much smaller size, totally isolated from and not
> >>>>>>> interfering with the flink project [2], and it actually covers most
> >>>>>>> of our practical feature requirements for a build tool - all making
> >>>>>>> it an ideal experimental field.
> >>>>>>>
> >>>>>>> [1] https://github.com/apache/flink-shaded
> >>>>>>> [2] https://github.com/apache/flink
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, Aug 16, 2019 at 4:52 AM Till Rohrmann <
> trohrm...@apache.org>
> >>>>> wrote:
> >>>>>>>> For the sake of keeping the discussion focused and not cluttering
> >>>>>>>> the discussion thread, I would suggest splitting the detailed
> >>>>>>>> reporting on reusing JVMs into a separate thread and cross-linking
> >>>>>>>> it from here.
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Till
> >>>>>>>>
> >>>>>>>> On Fri, Aug 16, 2019 at 1:36 PM Chesnay Schepler <
> >>> ches...@apache.org>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Update:
> >>>>>>>>>
> >>>>>>>>> TL;DR: table-planner is a good candidate for enabling fork reuse
> >>>>>>>>> right away, while flink-tests has the potential for huge savings,
> >>>>>>>>> but we have to figure out some issues first.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Build link: https://travis-ci.org/zentol/flink/builds/572659220
> >>>>>>>>>
> >>>>>>>>> 4/8 profiles failed.
> >>>>>>>>>
> >>>>>>>>> No speedup in python or blink_planner; 7 minutes were saved in
> >>>>>>>>> the libraries profile (table-planner).
> >>>>>>>>>
> >>>>>>>>> The kafka and connectors profiles both fail in kafka tests due
> >>>>>>>>> to producer leaks, and no speedup could be confirmed so far:
> >>>>>>>>>
> >>>>>>>>> java.lang.AssertionError: Detected producer leak. Thread name:
> >>>>>>>>> kafka-producer-network-thread | producer-239
> >>>>>>>>>         at org.junit.Assert.fail(Assert.java:88)
> >>>>>>>>>         at
> >>>>>>>>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.checkProducerLeak(FlinkKafkaProducer011ITCase.java:677)
> >>>>>>>>>         at
> >>>>>>>>> org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011ITCase.testFlinkKafkaProducer011FailBeforeNotify(FlinkKafkaProducer011ITCase.java:210)
> >>>>>>>>> The tests profile failed due to various errors in migration
> >>>>>>>>> tests:
> >>>>>>>>>
> >>>>>>>>> junit.framework.AssertionFailedError: Did not see the expected
> >>>>>>>>> accumulator results within time limit.
> >>>>>>>>>         at
> >>>>>>>>> org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSavepoint(TypeSerializerSnapshotMigrationITCase.java:141)
> >>>>>>>>> *However*, a normal tests run takes 40 minutes, while the one
> >>>>>>>>> above failed after 19 minutes and is only missing the migration
> >>>>>>>>> tests (which currently need 6-7 minutes). So we could save
> >>>>>>>>> somewhere between 15 and 20 minutes here.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Finally, the misc profile fails in YARN:
> >>>>>>>>>
> >>>>>>>>> java.lang.AssertionError
> >>>>>>>>>         at
> >>>>>>>>> org.apache.flink.yarn.YARNITCase.setup(YARNITCase.java:64)
> >>>>>>>>>
> >>>>>>>>> No significant speedup could be observed in other modules; for
> >>>>>>>>> flink-yarn-tests we can maybe get a minute or two out of it.
> >>>>>>>>>
> >>>>>>>>> On 16/08/2019 10:43, Chesnay Schepler wrote:
> >>>>>>>>>> There appears to be general agreement that 1) should be looked
> >>>>>>>>>> into; I've set up a branch with fork reuse enabled for all
> >>>>>>>>>> tests and will report back the results.
> >>>>>>>>>>
> >>>>>>>>>> On 15/08/2019 09:38, Chesnay Schepler wrote:
> >>>>>>>>>>> Hello everyone,
> >>>>>>>>>>>
> >>>>>>>>>>> improving our build times is a hot topic at the moment, so
> >>>>>>>>>>> let's discuss the different ways they could be reduced.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>        Current state:
> >>>>>>>>>>>
> >>>>>>>>>>> First up, let's look at some numbers:
> >>>>>>>>>>>
> >>>>>>>>>>> 1 full build currently consumes 5h of build time total ("total
> >>>>>>>>>>> time"), and in the ideal case takes about 1h20m ("run time") to
> >>>>>>>>>>> complete from start to finish. The run time may of course
> >>>>>>>>>>> fluctuate depending on the current Travis load. This applies to
> >>>>>>>>>>> builds on both the Apache and flink-ci Travis.
> >>>>>>>>>>>
> >>>>>>>>>>> At the time of writing, the current queue time for PR jobs
> >>>>>>>>>>> (reminder: running on flink-ci) is about 30 minutes (which
> >>>>>>>>>>> basically means that we are processing builds at the rate they
> >>>>>>>>>>> come in); however, we are in an admittedly quiet period right
> >>>>>>>>>>> now. 2 weeks ago the queue times on flink-ci peaked at around
> >>>>>>>>>>> 5-6h as everyone was scrambling to get their changes merged in
> >>>>>>>>>>> time for the feature freeze.
> >>>>>>>>>>>
> >>>>>>>>>>> (Note: Recently, optimizations were added to the ci-bot where
> >>>>>>>>>>> pending builds are canceled if a new commit was pushed to the
> >>>>>>>>>>> PR or the PR was closed, which should prove especially useful
> >>>>>>>>>>> during the rush hours we see before feature freezes.)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>        Past approaches
> >>>>>>>>>>>
> >>>>>>>>>>> Over the years we have done rather few things to improve this
> >>>>>>>>>>> situation (hence our current predicament).
> >>>>>>>>>>>
> >>>>>>>>>>> Beyond the sporadic speedup of some tests, the only notable
> >>>>>>>>>>> reduction in total build times was the introduction of cron
> >>>>>>>>>>> jobs, which consolidated the per-commit matrix from 4
> >>>>>>>>>>> configurations (different scala/hadoop versions) to 1.
> >>>>>>>>>>>
> >>>>>>>>>>> The separation into multiple build profiles was only a
> >>>>>>>>>>> work-around for the 50m limit on Travis. Running tests in
> >>>>>>>>>>> parallel has the obvious potential of reducing run time, but
> >>>>>>>>>>> we're currently hitting a hard limit since a few modules
> >>>>>>>>>>> (flink-tests, flink-runtime, flink-table-planner-blink) are so
> >>>>>>>>>>> loaded with tests that they nearly consume an entire profile
> >>>>>>>>>>> by themselves (and thus no further splitting is possible).
> >>>>>>>>>>>
> >>>>>>>>>>> The rework that introduced stages did not provide a speedup at
> >>>>>>>>>>> the time of its introduction either, although this changed
> >>>>>>>>>>> slightly once more profiles were added and some optimizations
> >>>>>>>>>>> to the caching were made.
> >>>>>>>>>>>
> >>>>>>>>>>> Very recently we modified the surefire-plugin configuration
> >>>>>>>>>>> for flink-table-planner-blink to reuse JVM forks for IT cases,
> >>>>>>>>>>> providing a significant speedup (18 minutes!). So far we have
> >>>>>>>>>>> not seen any negative consequences.
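> >>>>>>>>>>>
> >>>>>>>>>>> (For reference, fork reuse is controlled via the surefire
> >>>>>>>>>>> configuration; a sketch of the relevant settings, with
> >>>>>>>>>>> illustrative values rather than the exact ones we committed:
> >>>>>>>>>>>
> >>>>>>>>>>> <plugin>
> >>>>>>>>>>>   <groupId>org.apache.maven.plugins</groupId>
> >>>>>>>>>>>   <artifactId>maven-surefire-plugin</artifactId>
> >>>>>>>>>>>   <configuration>
> >>>>>>>>>>>     <!-- run tests in one forked JVM per CPU core -->
> >>>>>>>>>>>     <forkCount>1C</forkCount>
> >>>>>>>>>>>     <!-- keep the forked JVMs alive across test classes instead
> >>>>>>>>>>>          of spawning a fresh JVM for every class -->
> >>>>>>>>>>>     <reuseForks>true</reuseForks>
> >>>>>>>>>>>   </configuration>
> >>>>>>>>>>> </plugin>)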
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>        Suggestions
> >>>>>>>>>>>
> >>>>>>>>>>> This is a list of /all/ suggestions for reducing run/total
> >>>>>>>>>>> times that I have seen recently (in other words, they aren't
> >>>>>>>>>>> necessarily mine nor may I agree with all of them).
> >>>>>>>>>>>
> >>>>>>>>>>> 1. Enable JVM reuse for IT cases in more modules.
> >>>>>>>>>>>      * We've seen significant speedups in the blink planner,
> >>>>>>>>>>>        and this should be applicable to all modules. However, I
> >>>>>>>>>>>        presume there's a reason why we disabled JVM reuse
> >>>>>>>>>>>        (information on this would be appreciated).
> >>>>>>>>>>> 2. Custom differential build scripts
> >>>>>>>>>>>      * Set up custom scripts for determining which modules
> >>>>>>>>>>>        might be affected by a change, and manipulate the splits
> >>>>>>>>>>>        accordingly (a sketch follows after this list). This
> >>>>>>>>>>>        approach is conceptually quite straight-forward, but has
> >>>>>>>>>>>        limits since it has to be pessimistic; i.e. a change in
> >>>>>>>>>>>        flink-core _must_ result in testing all modules.
> >>>>>>>>>>> 3. Only run smoke tests when a PR is opened; run heavy tests
> >>>>>>>>>>>    on demand.
> >>>>>>>>>>>      * With the introduction of the ci-bot we now have
> >>>>>>>>>>>        significantly more options for how to handle PR builds.
> >>>>>>>>>>>        One option could be to only run basic tests when the PR
> >>>>>>>>>>>        is created (which may be only modified modules, or all
> >>>>>>>>>>>        unit tests, or another low-cost scheme), and then have a
> >>>>>>>>>>>        committer trigger other builds (full test run, e2e
> >>>>>>>>>>>        tests, etc.) on demand.
> >>>>>>>>>>> 4. Move more tests into cron builds
> >>>>>>>>>>>      * The budget version of 3): move certain tests that are
> >>>>>>>>>>>        either expensive (like some runtime tests that take
> >>>>>>>>>>>        minutes) or in rarely modified modules (like gelly) into
> >>>>>>>>>>>        cron jobs.
> >>>>>>>>>>> 5. Gradle
> >>>>>>>>>>>      * Gradle was brought up a few times for its built-in
> >>>>>>>>>>>        support for differential builds; basically providing 2)
> >>>>>>>>>>>        without the overhead of maintaining additional scripts.
> >>>>>>>>>>>      * To date no PoC was provided that shows it working in
> >>>>>>>>>>>        our CI environment (i.e., handling splits & caching
> >>>>>>>>>>>        etc.).
> >>>>>>>>>>>      * This is the most disruptive change by a fair margin, as
> >>>>>>>>>>>        it would affect the entire project, developers, and
> >>>>>>>>>>>        potentially users (if they build from source).
> >>>>>>>>>>> 6. CI service
> >>>>>>>>>>>      * Our current artifact caching setup on Travis is a hack;
> >>>>>>>>>>>        we're abusing the Travis cache, which is meant for
> >>>>>>>>>>>        long-term caching, to ship build artifacts across jobs.
> >>>>>>>>>>>        It's brittle at times due to timing/visibility issues,
> >>>>>>>>>>>        and on branches the cleanup processes can interfere
> >>>>>>>>>>>        with running builds. It is also not as effective as it
> >>>>>>>>>>>        could be.
> >>>>>>>>>>>      * There are CI services that provide build artifact
> >>>>>>>>>>>        caching out of the box, which could be useful for us.
> >>>>>>>>>>>      * To date, no PoC for using another CI service has been
> >>>>>>>>>>>        provided.
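> >>>>>>>>>>>
> >>>>>>>>>>> The sketch promised in 2) - a hypothetical differential-build
> >>>>>>>>>>> check, not an existing script; the module layout and branch
> >>>>>>>>>>> names are illustrative:
> >>>>>>>>>>>
> >>>>>>>>>>> #!/usr/bin/env bash
> >>>>>>>>>>> # List the top-level modules touched by the change under test.
> >>>>>>>>>>> changed=$(git diff --name-only origin/master...HEAD \
> >>>>>>>>>>>   | cut -d/ -f1 | sort -u)
> >>>>>>>>>>> # Pessimistic fallback: a flink-core change invalidates
> >>>>>>>>>>> # everything, so run the full build.
> >>>>>>>>>>> if echo "$changed" | grep -qx "flink-core"; then
> >>>>>>>>>>>   mvn verify
> >>>>>>>>>>> else
> >>>>>>>>>>>   # Build only the changed modules plus their dependents (-amd).
> >>>>>>>>>>>   mvn verify -pl "$(echo "$changed" | paste -sd, -)" -amd
> >>>>>>>>>>> fi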
> >>>>>>>>>>>
> >>>>>
> >>>
>
>
