You're right, it seems like the Docker builds are hanging. I'm testing the
new auto scaling feature on the test environment [1] and I noticed that all
jobs hung at the exact same spot until 2:40AM German time. It seems like
some APT servers were having problems, and since apt does not apply a
network timeout by default, the builds hung instead of failing gracefully.
It's 05:13AM now and it seems like my test builds have recovered. I'll
check the production environment and see whether it's working fine over
there as well, and I'll give you an update here as soon as I know more
details.
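
For what it's worth, one mitigation would be to bound every apt call in our
CI scripts instead of letting it block forever. The snippet below is only
an untested sketch: wrapping apt-get in a Python helper and the concrete
timeout values are my assumptions, not something we run today.

    import subprocess

    # Assumed limits: 30s per HTTP fetch, 3 retries, 10 minutes overall.
    APT_TIMEOUT_SEC = 600

    def run_apt(args):
        # Fail the build instead of hanging when a mirror stops
        # responding: Acquire::http::Timeout bounds each network fetch,
        # and the subprocess timeout kills apt-get as a last resort.
        cmd = ["apt-get",
               "-o", "Acquire::http::Timeout=30",
               "-o", "Acquire::Retries=3"] + args
        subprocess.run(cmd, check=True, timeout=APT_TIMEOUT_SEC)

    run_apt(["update"])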

-Marco

[1]:
http://jenkins.mxnet-ci-dev.amazon-ml.com/job/incubator-mxnet/job/ci-master/

On Fri, May 4, 2018 at 2:59 AM, Jin, Hao <[email protected]> wrote:

> Thanks for fixing the servers! However, I found that some of the builds
> are taking an extremely long time (not even starting after ~2 hrs):
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/18/pipeline/59
> Seems like they are stuck during the setup phase?
> Hao
>
> On 5/3/18, 2:44 PM, "Marco de Abreu" <[email protected]> wrote:
>
>     Alright, we're back up.
>
>     On Thu, May 3, 2018 at 10:47 PM, Marco de Abreu <[email protected]> wrote:
>
>     > Seems like the CI will be down until some other people turn off their
>     > instances...
>     >
>     > Error
>     > We currently do not have sufficient g3.8xlarge capacity in zones with
>     > support for 'gp2' volumes. Our system will be working on provisioning
>     > additional capacity.
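>     >
>     > Longer term, the auto scaling could react to this instead of just
>     > waiting for capacity, e.g. by falling back to another instance type
>     > when EC2 reports insufficient capacity. This is a rough sketch only;
>     > the helper and the fallback list are my assumptions (g3.8xlarge is
>     > the only type we actually use, and p3.8xlarge is just an example):
>     >
>     >     import boto3
>     >     from botocore.exceptions import ClientError
>     >
>     >     ec2 = boto3.client("ec2")
>     >
>     >     # Assumed fallback order; p3.8xlarge is illustrative only.
>     >     INSTANCE_TYPES = ["g3.8xlarge", "p3.8xlarge"]
>     >
>     >     def launch_slave(ami_id):
>     >         for instance_type in INSTANCE_TYPES:
>     >             try:
>     >                 return ec2.run_instances(ImageId=ami_id,
>     >                                          InstanceType=instance_type,
>     >                                          MinCount=1, MaxCount=1)
>     >             except ClientError as e:
>     >                 if (e.response["Error"]["Code"]
>     >                         != "InsufficientInstanceCapacity"):
>     >                     raise
>     >                 # capacity error: try the next instance type
>     >         raise RuntimeError("no capacity for any configured type")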
>     >
>     > -Marco
>     >
>     >
>     > On Thu, May 3, 2018 at 9:40 PM, Jin, Hao <[email protected]> wrote:
>     >
>     >> Thanks a lot Marco!
>     >> Hao
>     >>
>     >> On 5/3/18, 12:02 PM, "Marco de Abreu" <[email protected]> wrote:
>     >>
>     >>     Hello,
>     >>
>     >>     I'm already investigating the issue, and it seems to be related
>     >>     to the recently introduced KVStore tests. They tend to hang,
>     >>     which leads to the job being forcefully terminated by Jenkins.
>     >>     The problem is that this does not terminate the underlying
>     >>     Docker containers, leaving their resources unreleased.
>     >>
>     >>     As an immediate solution, I will restart all slaves to ensure
>     >>     the CI is running again. After that, I will try to find a
>     >>     solution to detect and release these containers.
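>     >>
>     >>     Something along these lines might work as a cleanup step on the
>     >>     slaves. This is only an untested sketch: the helper name and the
>     >>     "remove everything still running between jobs" policy are my
>     >>     assumptions, relying on each slave running a single job at a
>     >>     time.
>     >>
>     >>         import subprocess
>     >>
>     >>         def release_stale_containers():
>     >>             # Each slave runs one job at a time, so any container
>     >>             # still alive between jobs is assumed to be leaked by
>     >>             # a build that Jenkins killed without tearing down
>     >>             # Docker.
>     >>             ids = subprocess.check_output(
>     >>                 ["docker", "ps", "-aq"],
>     >>                 universal_newlines=True).split()
>     >>             if ids:
>     >>                 subprocess.run(["docker", "rm", "-f"] + ids,
>     >>                                check=True)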
>     >>
>     >>     Best regards,
>     >>     Marco
>     >>
>     >>     On Thu, May 3, 2018 at 8:55 PM, Jin, Hao <[email protected]> wrote:
>     >>
>     >>     > I’ve encountered 2 failed GPU builds due to “initialization
>     >>     > error: driver error: failed to process request”; the links to
>     >>     > the failed builds are:
>     >>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/17/pipeline/674
>     >>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/18/pipeline
>     >>     >
>     >>     >
>     >>
>     >>
>     >>
>     >
>
>
>
