Re: CI Update

2019-12-06 Thread Pedro Larroy
Hi all. CI is back to normal after Jake's commit
(https://github.com/apache/incubator-mxnet/pull/16968), so please merge from
master. It would be great if someone could look into the TVM build issues
described earlier in this thread.


Re: CI Update

2019-12-03 Thread Pedro Larroy
Some PRs were experiencing build timeouts in the past. I have diagnosed this
as saturation of the EFS volume holding the compilation cache. Once CI is
back online this problem is very likely to be resolved, and you should not
see any more build timeout issues.
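
For future reference, a rough way to confirm this kind of saturation is to
watch the volume's burst credit balance in CloudWatch. The sketch below is
only illustrative; the file system ID, region, and threshold are
placeholders, not the actual values from our CI account.

# Sketch: check whether the EFS volume backing the compilation cache is
# running low on burst credits (a common cause of sudden I/O slowdowns).
# FILE_SYSTEM_ID, region, and LOW_CREDIT_THRESHOLD are placeholders.
from datetime import datetime, timedelta

import boto3

FILE_SYSTEM_ID = "fs-12345678"            # placeholder, not our real volume
LOW_CREDIT_THRESHOLD = 1_000_000_000_000  # ~1 TB of burst credits, in bytes

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EFS",
    MetricName="BurstCreditBalance",
    Dimensions=[{"Name": "FileSystemId", "Value": FILE_SYSTEM_ID}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Minimum"],
)

points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
if points and points[-1]["Minimum"] < LOW_CREDIT_THRESHOLD:
    print("EFS burst credits are low; builds may be hitting I/O throttling.")
else:
    print("EFS burst credit balance looks healthy.")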


Re: CI Update

2019-12-03 Thread Pedro Larroy
Also, please note that there is a stage building TVM which runs its
compilation serially and takes a long time, which impacts CI turnaround
time:

https://github.com/apache/incubator-mxnet/issues/16962
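
One likely direction is to pass a parallel job count down to the underlying
make invocation. The sketch below is only illustrative; the build directory
is an assumption, not the exact layout of our CI scripts.

# Sketch: run the TVM build step with parallel make jobs instead of the
# default serial invocation. The build directory is an assumed path used
# for illustration only.
import multiprocessing
import subprocess

def build_tvm(build_dir="3rdparty/tvm/build"):
    jobs = multiprocessing.cpu_count()
    # Equivalent to "make -C <build_dir> -j<ncpus>"; the serial stage in the
    # issue above effectively runs with a single job.
    subprocess.check_call(["make", "-C", build_dir, f"-j{jobs}"])

if __name__ == "__main__":
    build_tvm()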

Pedro


Re: CI Update

2019-12-03 Thread Pedro Larroy
Hi MXNet community. We are in the process of updating the base AMIs for CI
with an updated CUDA driver to fix the CI blockage.

We need help from the community to diagnose some of the build errors that
don't seem to be related to the infrastructure.

I have observed the following build failure with TVM when the CUDA driver is
not installed in the container:


https://pastebin.com/bQA0W2U4
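
As a quick sanity check when reproducing this, the sketch below tests whether
the CUDA driver library (libcuda) is visible inside the container before
starting a GPU-enabled TVM build. The library name is the usual one on
Linux; treat this as an assumption, not part of our CI scripts.

# Sketch: verify that the CUDA driver library (libcuda) can be found inside
# the build container. If it is missing, GPU-enabled builds that link against
# the driver are expected to fail, like the paste above.
import ctypes
import ctypes.util

def cuda_driver_available():
    name = ctypes.util.find_library("cuda")  # resolves to libcuda.so.1 if present
    if name is None:
        return False
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False

if __name__ == "__main__":
    if cuda_driver_available():
        print("libcuda found; a GPU-enabled TVM build should be able to link.")
    else:
        print("libcuda not found; expect failures like the one in the paste above.")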

The CentOS GPU builds and tests seem to run with the updated AMI and the
changes to the container.


Thanks.


CI Update

2019-12-02 Thread Pedro Larroy
Small update about CI, which is blocked.

It seems there is an NVIDIA driver compatibility problem between the base
AMI running on the GPU instances and the NVIDIA Docker images that we use
for building and testing.

We are working on a fix by updating the base images, as this does not seem
to be easy to fix by just changing the container.
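
In the meantime, for anyone debugging locally, a rough way to check for this
kind of mismatch is to compare the host driver version reported by
nvidia-smi with the minimum driver that the container's CUDA toolkit needs.
The version table and the CUDA version below are illustrative assumptions,
not the exact values in our images; check NVIDIA's release notes for the
authoritative mapping.

# Sketch: compare the host NVIDIA driver version against the minimum driver
# required by the CUDA toolkit inside the build container. The mapping is a
# small illustrative subset, and the container's CUDA version is an assumption.
import subprocess

MIN_DRIVER_FOR_CUDA = {
    "10.0": 410.48,
    "10.1": 418.39,
    "10.2": 440.33,
}

def host_driver_version():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    ).strip().splitlines()[0]
    parts = out.split(".")              # e.g. "418.87.00" -> 418.87
    return float(".".join(parts[:2]))

def driver_supports(cuda_version):
    return host_driver_version() >= MIN_DRIVER_FOR_CUDA[cuda_version]

if __name__ == "__main__":
    cuda_in_container = "10.1"  # assumption: adjust to the image's CUDA version
    if driver_supports(cuda_in_container):
        print("Host driver looks new enough for the container's CUDA runtime.")
    else:
        print("Host driver is older than the container's CUDA requires; this "
              "matches the AMI/driver mismatch described above.")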

Thanks.

Pedro.