Correct. But I'm surprised that it takes 2:50 min to pull down the images. Maybe it makes sense to use ECR as a mirror?
-Marco

Joe Evans <joseph.ev...@gmail.com> wrote on Thu, Mar 26, 2020, 22:02:

> +1 on rebuilding the containers regularly without caching layers.
>
> We are both pulling down a bunch of docker layers (when docker pulls an image) and then building a new container to run the sanity build in. Pulling down all the layers is what is taking so long (2m50s.) Within the docker build, all the layers are cached, so it doesn't take long. Unless I'm missing something, it doesn't make much sense to be rebuilding the image every build.
>
> On Thu, Mar 26, 2020 at 1:12 PM Lausen, Leonard <lau...@amazon.com.invalid> wrote:
>
> > WRT Docker Cache: We need to add a mechanism to invalidate the cache and rebuild the containers on a set schedule. The builds break too often and the breakage is only detected when a contributor touches the Dockerfiles (manually causing cache invalidation)
> >
> > On Thu, 2020-03-26 at 16:06 -0400, Aaron Markham wrote:
> > > I think it is a good idea to do the sanity check first. Even at 10 minutes. And also try to fix the docker cache situation, but those can be separate tasks.
> > >
> > > On Thu, Mar 26, 2020, 12:52 Marco de Abreu <marco.g.ab...@gmail.com> wrote:
> > >
> > > > Jenkins doesn't load for me, so let me ask this way: are we actually rebuilding every single time or do you mean the docker cache? Pulling the cache should only take a few seconds from my experience - docker build should be a no-op in most cases.
> > > >
> > > > -Marco
> > > >
> > > > Joe Evans <joseph.ev...@gmail.com> wrote on Thu, Mar 26, 2020, 20:46:
> > > >
> > > > > The sanity-lint check pulls a docker image cache, builds a new container and runs inside.
> > > > > The docker setup is taking around 3 minutes, at least:
> > > > >
> > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fsanity/detail/master/1764/pipeline/39
> > > > >
> > > > > We could improve this by not having to build a new container every time. Also, our CI containers are huge so it takes a while to pull them down. I'm sure we could reduce the size by being a bit more careful in building them too.
> > > > >
> > > > > Joe
> > > > >
> > > > > On Thu, Mar 26, 2020 at 12:33 PM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
> > > > >
> > > > > > Do you know what's driving the duration for sanity? It used to be 50 sec execution and 60 sec preparation.
> > > > > >
> > > > > > -Marco
> > > > > >
> > > > > > Joe Evans <joseph.ev...@gmail.com> wrote on Thu, Mar 26, 2020, 20:31:
> > > > > >
> > > > > > > Thanks Marco and Aaron for your input.
> > > > > > >
> > > > > > > > Can you show by how much the duration will increase?
> > > > > > >
> > > > > > > The average sanity build time is around 10 min, while the average build time for unix-cpu is about 2 hours, so the entire build pipeline would increase by 2 hours if we required both unix-cpu and sanity to complete in parallel.
> > > > > > >
> > > > > > > I took a look at the CloudWatch metrics we're saving for Jenkins jobs. Here is the failure rate per job, based on builds triggered by PRs in the past year. As you can see, the sanity build failure rate is still fairly high, so gating on it would save a lot of unneeded build jobs.
> > > > > > >
> > > > > > > Job            Successful  Failed  Failure Rate
> > > > > > > sanity               6900    2729        28.34%
> > > > > > > unix-cpu             4268    4786        52.86%
> > > > > > > unix-gpu             3686    5637        60.46%
> > > > > > > centos-cpu           6777    2809        29.30%
> > > > > > > centos-gpu           6318    3350        34.65%
> > > > > > > clang                7879    1588        16.77%
> > > > > > > edge                 7654    1933        20.16%
> > > > > > > miscellaneous        8090    1510        15.73%
> > > > > > > website              7226    2179        23.17%
> > > > > > > windows-cpu          6084    3621        37.31%
> > > > > > > windows-gpu          5191    4721        47.63%
> > > > > > >
> > > > > > > We can start by requiring only the sanity job to complete before triggering the rest, and collect data to decide if it makes sense to change it from there. Any objections to this approach?
> > > > > > >
> > > > > > > Thanks.
> > > > > > > Joe
> > > > > > >
> > > > > > > On Wed, Mar 25, 2020 at 9:35 AM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Back then I created a system which exports all Jenkins results to CloudWatch. It does not include individual test results but rather stages and jobs. The data for the sanity check should be available there.
> > > > > > > >
> > > > > > > > Something I'd also be curious about is the percentage of the failures in one run. That is, if a commit failed, have there been multiple jobs failing (indicating an error in the code) or only one or two (indicating flakiness)? This should give us a proper understanding of how unnecessary these runs really are.
> > > > > > > >
> > > > > > > > -Marco
> > > > > > > >
> > > > > > > > Aaron Markham <aaron.s.mark...@gmail.com> wrote on Wed, Mar 25, 2020, 16:53:
> > > > > > > >
> > > > > > > > > +1 for sanity check - that's fast.
> > > > > > > > > -1 for unix-cpu - that's slow and can just hang.
> > > > > > > > >
> > > > > > > > > So my suggestion would be to look at the data separately - what's the failure rate on the sanity check and the unix-cpu? Actually, can we get a table of all of the tests with this data?!
> > > > > > > > > If the sanity check fails... let's say 20% of the time, but only takes a couple of minutes, then ya, let's stack it and do that one first.
> > > > > > > > > I think unix-cpu needs to be broken apart. It's too complex and fails in multiple ways. Isolate the brittle parts. Then we can restart/disable those as needed, while all of the other parts pass and don't have to be rerun.
> > > > > > > > >
> > > > > > > > > On Wed, Mar 25, 2020 at 1:32 AM Marco de Abreu <marco.g.ab...@gmail.com> wrote:
> > > > > > > > > > We had this structure in the past and the community was bothered by CI taking more time, thus we moved to the current model with everything parallelized. We'd basically revert that then.
> > > > > > > > > >
> > > > > > > > > > Can you show by how much the duration will increase?
> > > > > > > > > >
> > > > > > > > > > Also, we have zero test parallelisation, that is, we are running one test on 72 core machines (although multiple workers). Wouldn't it be way more efficient to add parallelisation and thus heavily reduce the time spent on the tasks instead of staggering?
> > > > > > > > > >
> > > > > > > > > > I feel concerned that these measures to save cost are paid for in the form of a worse user experience. I see a big potential to save costs by increasing efficiency while actually improving the user experience due to CI being faster.
> > > > > > > > > >
> > > > > > > > > > -Marco
> > > > > > > > > >
> > > > > > > > > > Joe Evans <joseph.ev...@gmail.com> wrote on Wed, Mar 25, 2020, 04:58:
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > First, I just wanted to introduce myself to the MXNet community. I’m Joe and will be working with Chai and the AWS team to improve some issues around MXNet CI. One of our goals is to reduce the costs associated with running MXNet CI. The task I’m working on now is this issue:
> > > > > > > > > > >
> > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/17802
> > > > > > > > > > >
> > > > > > > > > > > Proposal: Staggered Jenkins CI pipeline
> > > > > > > > > > >
> > > > > > > > > > > Based on data collected from Jenkins, around 55% of the time when the mxnet-validation CI build is triggered by a PR, either the sanity or unix-cpu builds fail. When either of these builds fails, it doesn’t make sense to run the rest of the pipelines and utilize all those resources if we’ve already identified a build or unit test failure.
> > > > > > > > > > >
> > > > > > > > > > > We are proposing changing the MXNet Jenkins CI pipeline by requiring the *sanity* and *unix-cpu* builds to complete and pass tests successfully before starting the other build pipelines (centos-cpu/gpu, unix-gpu, windows-cpu/gpu, etc.) Once the sanity builds successfully complete, the remaining build pipelines will be triggered and run in parallel (as they currently do.) The purpose of this change is to identify faulty code or compatibility issues early and prevent further execution of CI builds. This will increase the time required to test a PR, but will prevent unnecessary builds from running.
> > > > > > > > > > >
> > > > > > > > > > > Does anyone have any concerns with this change or suggestions?
> > > > > > > > > > >
> > > > > > > > > > > Thanks.
> > > > > > > > > > >
> > > > > > > > > > > Joe Evans
> > > > > > > > > > >
> > > > > > > > > > > joseph.ev...@gmail.com
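
For illustration, the failure rates in the table quoted in this thread and the proposed sanity gate can be sketched in Python. This is only a minimal sketch under stated assumptions, not the Jenkins configuration used by MXNet CI: `failure_rate` and `run_staggered` are hypothetical helper names, the counts are copied from the table in the thread, and the gate runs the remaining jobs one after another where the real pipeline would run them in parallel.

```python
# Counts copied from the table in the thread: (successful, failed) per job,
# for PR-triggered builds over the past year.
JOB_COUNTS = {
    "sanity": (6900, 2729),
    "unix-cpu": (4268, 4786),
    "unix-gpu": (3686, 5637),
    "centos-cpu": (6777, 2809),
    "centos-gpu": (6318, 3350),
    "clang": (7879, 1588),
    "edge": (7654, 1933),
    "miscellaneous": (8090, 1510),
    "website": (7226, 2179),
    "windows-cpu": (6084, 3621),
    "windows-gpu": (5191, 4721),
}


def failure_rate(successful, failed):
    """Return failed builds as a percentage of all finished builds."""
    return 100.0 * failed / (successful + failed)


def run_staggered(jobs, gate="sanity"):
    """Hypothetical gate: run `gate` first, then the other jobs only if it passes.

    `jobs` maps a job name to a zero-argument callable returning True on success.
    Returns the results of the jobs that actually ran.
    """
    results = {gate: jobs[gate]()}
    if results[gate]:
        # Sequential here for simplicity; the real pipeline runs these in parallel.
        for name, job in jobs.items():
            if name != gate:
                results[name] = job()
    return results


if __name__ == "__main__":
    for name, (ok, bad) in JOB_COUNTS.items():
        print(f"{name:<14}{failure_rate(ok, bad):6.2f}%")
```

With a failing gate, `run_staggered({"sanity": lambda: False, "unix-cpu": lambda: True})` returns only the sanity result, which is the saving the proposal is after: the other jobs never start.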