Alright, we're back up.

On Thu, May 3, 2018 at 10:47 PM, Marco de Abreu <[email protected]> wrote:
> Seems like the CI will be down until some other people turn off their
> instances...
>
> Error
> We currently do not have sufficient g3.8xlarge capacity in zones with
> support for 'gp2' volumes. Our system will be working on provisioning
> additional capacity.
>
> -Marco
>
> On Thu, May 3, 2018 at 9:40 PM, Jin, Hao <[email protected]> wrote:
>
>> Thanks a lot Marco!
>> Hao
>>
>> On 5/3/18, 12:02 PM, "Marco de Abreu" <[email protected]> wrote:
>>
>> Hello,
>>
>> I'm already investigating the issue and it seems to be related to the
>> recently introduced KVStore tests. They tend to hang, leading to jobs
>> being forcefully terminated by Jenkins. The problem here is that this
>> does not terminate the underlying Docker containers, leading to
>> unreleased resources.
>>
>> As an immediate solution, I will restart all slaves to ensure the CI is
>> running again. After that, I will try to find a solution to detect and
>> release these containers.
>>
>> Best regards,
>> Marco
>>
>> On Thu, May 3, 2018 at 8:55 PM, Jin, Hao <[email protected]> wrote:
>>
>> > I’ve encountered 2 failed GPU builds due to “initialization error:
>> > driver error: failed to process request”, the links to the failed
>> > builds are:
>> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/17/pipeline/674
>> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/18/pipeline
