Sorry for the inconvenience. If there are any further issues, please let me know.

Best regards,
Marco

On Fri, May 4, 2018 at 6:21 AM, Marco de Abreu <[email protected]> wrote:

> Great,
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/22/
> seems to be passing without problems.
>
> On Fri, May 4, 2018 at 6:07 AM, Jin, Hao <[email protected]> wrote:
>
>> The builds are running now, thanks!
>>
>> On 5/3/18, 8:16 PM, "Marco de Abreu" <[email protected]> wrote:
>>
>>     You're right, it seems like the Docker builds are hanging. I'm
>>     testing the new auto scaling feature on the test environment [1]
>>     and I noticed that all jobs hung at the exact same spot until
>>     2:40AM German time. It seems like some APT servers were having
>>     problems and since apt does not have a timeout included, it hung
>>     the build instead of failing gracefully. It's 05:13AM now and it
>>     seems like my test builds recovered. I'll check the production
>>     environment and see if it's working fine over there as well. I'll
>>     give you an update in here as soon as I know more details.
>>
>>     -Marco
>>
>>     [1]:
>>     http://jenkins.mxnet-ci-dev.amazon-ml.com/job/incubator-mxnet/job/ci-master/
>>
>>     On Fri, May 4, 2018 at 2:59 AM, Jin, Hao <[email protected]> wrote:
>>
>>     > Thanks for fixing the servers! However I found that some of the
>>     > builds are taking an extremely long time (not even starting
>>     > after ~2 hrs):
>>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/18/pipeline/59
>>     > Seems like they are stuck during the setup phase?
>>     > Hao
>>     >
>>     > On 5/3/18, 2:44 PM, "Marco de Abreu" <[email protected]> wrote:
>>     >
>>     >     Alright, we're back up.
>>     >
>>     >     On Thu, May 3, 2018 at 10:47 PM, Marco de Abreu
>>     >     <[email protected]> wrote:
>>     >
>>     >     > Seems like the CI will be down until some other people
>>     >     > turn off their instances...
>>     >     >
>>     >     > Error
>>     >     > We currently do not have sufficient g3.8xlarge capacity in
>>     >     > zones with support for 'gp2' volumes. Our system will be
>>     >     > working on provisioning additional capacity.
>>     >     >
>>     >     > -Marco
>>     >     >
>>     >     > On Thu, May 3, 2018 at 9:40 PM, Jin, Hao <[email protected]> wrote:
>>     >     >
>>     >     >> Thanks a lot Marco!
>>     >     >> Hao
>>     >     >>
>>     >     >> On 5/3/18, 12:02 PM, "Marco de Abreu" <[email protected]> wrote:
>>     >     >>
>>     >     >>     Hello,
>>     >     >>
>>     >     >>     I'm already investigating the issue and it seems to be
>>     >     >>     related to the recently introduced KVStore tests. They
>>     >     >>     tend to hang, leading to the job being forcefully
>>     >     >>     terminated by Jenkins. The problem here is that this
>>     >     >>     does not terminate the underlying Docker containers,
>>     >     >>     leading to unreleased resources.
>>     >     >>
>>     >     >>     As an immediate solution, I will restart all slaves to
>>     >     >>     ensure the CI is running again. After that, I will try
>>     >     >>     to find a solution to detect and release these
>>     >     >>     containers.
>>     >     >>
>>     >     >>     Best regards,
>>     >     >>     Marco
>>     >     >>
>>     >     >>     On Thu, May 3, 2018 at 8:55 PM, Jin, Hao <[email protected]> wrote:
>>     >     >>
>>     >     >>     > I’ve encountered 2 failed GPU builds due to
>>     >     >>     > “initialization error: driver error: failed to
>>     >     >>     > process request”, the links to the failed builds are:
>>     >     >>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/17/pipeline/674
>>     >     >>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/18/pipeline
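
The APT hang described in this thread (a build step with no timeout blocking the whole job) can be guarded against in a generic way. The following is a minimal Python sketch, not the project's actual CI code: `run_with_timeout` and the stand-in command are hypothetical names chosen for illustration, and the timeout values are arbitrary.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it ran longer than timeout_s seconds.

    Hypothetical helper for illustration only.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising,
        # so the stalled step does not keep running in the background.
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for a stalled step (for example a package download that never returns).
    stalled_step = [sys.executable, "-c", "import time; time.sleep(30)"]
    if run_with_timeout(stalled_step, timeout_s=1) is None:
        print("step timed out; failing the build instead of hanging")
```

Note that the timeout only kills the direct child process. If the step launches something that outlives it, such as a Docker container, that still needs separate cleanup, which matches the unreleased-container problem mentioned earlier in the thread.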
