The builds are running now, thanks!

On 5/3/18, 8:16 PM, "Marco de Abreu" <[email protected]> wrote:

    You're right, it seems like the Docker builds are hanging. I'm testing the
    new auto scaling feature on the test environment [1] and I noticed that all
    jobs hung at the exact same spot until 2:40AM German time. It seems like
    some APT servers were having problems and since apt does not have a timeout
    included, it hung the build instead of failing gracefully. It's 05:13AM now
    and it seems like my test builds recovered. I'll check the production
    environment and see if it's working fine over there as well. I'll give you
    an update in here as soon as I know more details.
    
    -Marco
    
    [1]:
    http://jenkins.mxnet-ci-dev.amazon-ml.com/job/incubator-mxnet/job/ci-master/
    
    On Fri, May 4, 2018 at 2:59 AM, Jin, Hao <[email protected]> wrote:
    
    > Thanks for fixing the servers! However, I found that some of the builds are
    > taking an extremely long time (not even starting after ~2 hrs):
    > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/18/pipeline/59
    > Seems like they are stuck during the setup phase?
    > Hao
    >
    > On 5/3/18, 2:44 PM, "Marco de Abreu" <[email protected]> wrote:
    >
    >     Alright, we're back up.
    >
    >     On Thu, May 3, 2018 at 10:47 PM, Marco de Abreu <
    >     [email protected]> wrote:
    >
    >     > Seems like the CI will be down until some other people turn off their
    >     > instances...
    >     >
    >     > Error
    >     > We currently do not have sufficient g3.8xlarge capacity in zones with
    >     > support for 'gp2' volumes. Our system will be working on provisioning
    >     > additional capacity.
    >     >
    >     > -Marco
    >     >
    >     >
    >     > On Thu, May 3, 2018 at 9:40 PM, Jin, Hao <[email protected]> wrote:
    >     >
    >     >> Thanks a lot Marco!
    >     >> Hao
    >     >>
    >     >> On 5/3/18, 12:02 PM, "Marco de Abreu" <[email protected]> wrote:
    >     >>
    >     >>     Hello,
    >     >>
    >     >>     I'm already investigating the issue and it seems to be related to the
    >     >>     recently introduced KVStore tests. They tend to hang, leading to the job
    >     >>     being forcefully terminated by Jenkins. The problem here is that this does
    >     >>     not terminate the underlying Docker containers, leading to unreleased
    >     >>     resources.
    >     >>
    >     >>     As an immediate solution, I will restart all slaves to ensure the CI is
    >     >>     running again. After that, I will try to find a solution to detect and
    >     >>     release these containers.
    >     >>
    >     >>     Best regards,
    >     >>     Marco
    >     >>
    >     >>     On Thu, May 3, 2018 at 8:55 PM, Jin, Hao <[email protected]> wrote:
    >     >>
    >     >>     > I’ve encountered 2 failed GPU builds due to “initialization error:
    >     >>     > driver error: failed to process request”. The links to the failed
    >     >>     > builds are:
    >     >>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10645/17/pipeline/674
    >     >>     > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10533/18/pipeline
    >     >>     >
    >     >>     >
    >     >>
    >     >>
    >     >>
    >     >
    >
    >
    >
    
