Great, http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/22/ seems to be passing without problems.
On Fri, May 4, 2018 at 6:07 AM, Jin, Hao <[email protected]> wrote:
> The builds are running now, thanks!
>
> On 5/3/18, 8:16 PM, "Marco de Abreu" <[email protected]> wrote:
>
> You're right, it seems like the Docker builds are hanging. I'm testing the
> new auto scaling feature on the test environment [1] and I noticed that all
> jobs hung at the exact same spot until 2:40AM German time. It seems like
> some APT servers were having problems and since apt does not have a timeout
> included, it hung the build instead of failing gracefully. It's 05:13AM now
> and it seems like my test builds recovered. I'll check the production
> environment and see if it's working fine over there as well. I'll give you
> an update in here as soon as I know more details.
>
> -Marco
>
> [1]: http://jenkins.mxnet-ci-dev.amazon-ml.com/job/incubator-mxnet/job/ci-master/
>
> On Fri, May 4, 2018 at 2:59 AM, Jin, Hao <[email protected]> wrote:
>
> > Thanks for fixing the servers! However I found that some of the builds are
> > taking an extremely long time (not even starting after ~2 hrs):
> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/18/pipeline/59
> > Seems like they are stuck during the setup phase?
> > Hao
> >
> > On 5/3/18, 2:44 PM, "Marco de Abreu" <[email protected]> wrote:
> >
> > Alright, we're back up.
> >
> > On Thu, May 3, 2018 at 10:47 PM, Marco de Abreu <[email protected]> wrote:
> >
> > > Seems like the CI will be down until some other people turn off their
> > > instances...
> > >
> > > Error
> > > We currently do not have sufficient g3.8xlarge capacity in zones with
> > > support for 'gp2' volumes. Our system will be working on provisioning
> > > additional capacity.
> > >
> > > -Marco
> > >
> > > On Thu, May 3, 2018 at 9:40 PM, Jin, Hao <[email protected]> wrote:
> > >
> > >> Thanks a lot Marco!
> > >> Hao
> > >>
> > >> On 5/3/18, 12:02 PM, "Marco de Abreu" <[email protected]> wrote:
> > >>
> > >> Hello,
> > >>
> > >> I'm already investigating the issue and it seems to be related to the
> > >> recently introduced KVStore tests. They tend to hang, leading to the job
> > >> being forcefully terminated by Jenkins. The problem here is that this does
> > >> not terminate the underlying Docker containers, leading to unreleased
> > >> resources.
> > >>
> > >> As an immediate solution, I will restart all slaves to ensure the CI is
> > >> running again. After that, I will try to find a solution to detect and
> > >> release these containers.
> > >>
> > >> Best regards,
> > >> Marco
> > >>
> > >> On Thu, May 3, 2018 at 8:55 PM, Jin, Hao <[email protected]> wrote:
> > >>
> > >> > I’ve encountered 2 failed GPU builds due to “initialization error: driver
> > >> > error: failed to process request”, the links to the failed builds are:
> > >> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/17/pipeline/674
> > >> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/18/pipeline
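
On the apt hang described earlier in this thread (a build step blocking forever instead of failing gracefully): one possible way to bound such a step is to run it under a wall-clock timeout. The snippet below is only a minimal sketch of that idea, not what the MXNet CI actually does. The helper name run_with_timeout and the timeout values are assumptions made up for illustration, and a sleeping Python child process stands in for the hung package download.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it ran longer than timeout_s seconds.

    run_with_timeout is a hypothetical helper, not part of any CI tooling.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child process before re-raising,
        # so the caller can fail the step instead of hanging.
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for a hung download: a child process that sleeps for 30 seconds.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    result = run_with_timeout(hung, timeout_s=1)
    print("timed out" if result is None else "exit code %d" % result)
```

Note that this only terminates the direct child process. Anything that child started (for example a Docker container) would still need separate cleanup, which is the unreleased-resources problem mentioned in the 12:02 PM message.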
