[GitHub] [incubator-mxnet] ma-hei edited a comment on pull request #18445: updating ubuntu_cpu base image to 20.04 to observe failing tests regarding Python 3.8
ma-hei edited a comment on pull request #18445:
URL: https://github.com/apache/incubator-mxnet/pull/18445#issuecomment-656293010

Interestingly, this test is failing before we see the "hanging":

```
[2020-07-09T18:17:49.110Z] [gw1] [ 88%] FAILED tests/python/unittest/test_profiler.py::test_profiler
```

Oh right, @leezu, I just saw your comment above. That is actually happening before test_profiler fails. Looking at the newer test run, I notice the following: the timeout you describe above (from the previous test run) happens at around 3%. In the newer test run I'm also seeing a timeout, but this time it happens at around 51%. However, I think it's still related to the dataloader. In the newer test run we see the same timeout, immediately followed by the log line "PASSED tests/python/unittest/test_gluon_data.py::test_dataloader_context" (so somehow the dataloader test is still passing!?).

```
[2020-07-02T19:44:30.029Z] [gw1] [ 50%] PASSED tests/python/unittest/test_numpy_op.py::test_np_sort[True-int32-shape15-mergesort]
[2020-07-02T19:44:30.286Z] tests/python/unittest/test_numpy_op.py::test_np_sort[True-int32-shape15-heapsort]
[2020-07-02T19:44:30.286Z] [gw1] [ 50%] PASSED tests/python/unittest/test_numpy_op.py::test_np_sort[True-int32-shape15-heapsort]
[2020-07-02T19:44:30.543Z] tests/python/unittest/test_numpy_op.py::test_np_sort[True-int32-shape16-quicksort]
[2020-07-02T19:44:30.798Z] [gw1] [ 51%] PASSED tests/python/unittest/test_numpy_op.py::test_np_sort[True-int32-shape16-quicksort]
[2020-07-02T19:44:30.798Z] Timeout (0:20:00)!
[2020-07-02T19:44:30.798Z] Thread 0x7fd3c0475700 (most recent call first):
... ... ...
[2020-07-02T19:44:39.208Z] [gw0] [ 51%] PASSED tests/python/unittest/test_gluon_data.py::test_dataloader_context
```

We see that the dataloader test was started much earlier (at around 2%):

```
[2020-07-02T19:24:39.154Z] [gw0] [ 2%] PASSED tests/python/unittest/test_gluon_data.py::test_multi_worker
[2020-07-02T19:24:39.154Z] tests/python/unittest/test_gluon_data.py::test_dataloader_context
```

Maybe I can isolate this timeout somehow.
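For reference, the `Timeout (0:20:00)!` line followed by the `Thread 0x... (most recent call first):` dump is the format produced by Python's faulthandler module, which pytest's faulthandler_timeout support builds on. Below is a minimal sketch of how the multi-worker dataloader path could be exercised in isolation with the same kind of timeout dump; the dataset and loop are illustrative assumptions, not the actual test_dataloader_context test:

```
import faulthandler
import sys

import mxnet as mx
from mxnet.gluon.data import ArrayDataset, DataLoader

# Dump the tracebacks of all threads if the loop below has not finished
# within 20 minutes, mirroring the CI timeout.
faulthandler.dump_traceback_later(timeout=20 * 60, file=sys.stderr)

dataset = ArrayDataset(mx.nd.random.uniform(shape=(100, 3)),
                       mx.nd.random.uniform(shape=(100,)))
loader = DataLoader(dataset, batch_size=10, num_workers=4)  # multi-worker path

for _ in range(100):  # iterate repeatedly to provoke a potential hang
    for data, label in loader:
        data.wait_to_read()

faulthandler.cancel_dump_traceback_later()
print("no hang observed")
```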
[GitHub] [incubator-mxnet] ma-hei edited a comment on pull request #18445: updating ubuntu_cpu base image to 20.04 to observe failing tests regarding Python 3.8
ma-hei edited a comment on pull request #18445:
URL: https://github.com/apache/incubator-mxnet/pull/18445#issuecomment-656232421

It seems like the unit tests in the unix-cpu job are failing at this point:

```
[2020-07-02T19:59:32.830Z] tests/python/unittest/test_profiler.py::test_gpu_memory_profiler_gluon
[2020-07-02T19:59:32.830Z] [gw0] [ 89%] SKIPPED tests/python/unittest/test_profiler.py::test_gpu_memory_profiler_gluon
[2020-07-02T19:59:32.830Z] tests/python/unittest/test_recordio.py::test_recordio
[2020-07-02T22:59:39.221Z] Sending interrupt signal to process
[2020-07-02T22:59:44.185Z] 2020-07-02 22:59:39,244 - root - WARN
```

Trying to reproduce it locally.

Note: comparing this with the same job running on Python 3.6, I see that the same test runs much later there (at 99% completion, compared to 89% in this job):

```
[2020-07-08T22:31:40.815Z] tests/python/unittest/test_profiler.py::test_gpu_memory_profiler_gluon
[2020-07-08T22:31:40.815Z] [gw3] [ 99%] SKIPPED tests/python/unittest/test_profiler.py::test_gpu_memory_profiler_gluon
[2020-07-08T22:31:40.815Z] tests/python/unittest/test_recordio.py::test_recordio
[2020-07-08T22:31:40.815Z] [gw3] [ 99%] PASSED tests/python/unittest/test_recordio.py::test_recordio
```
[GitHub] [incubator-mxnet] ma-hei edited a comment on pull request #18445: updating ubuntu_cpu base image to 20.04 to observe failing tests regarding Python 3.8
ma-hei edited a comment on pull request #18445:
URL: https://github.com/apache/incubator-mxnet/pull/18445#issuecomment-653149747

Thanks @leezu, I think I found the underlying cause of the test failure in unittest/onnx/test_node.py::TestNode::test_import_export. In onnx 1.7, the inputs of the Pad operator have changed. We can see this by comparing https://github.com/onnx/onnx/blob/master/docs/Operators.md#Pad to https://github.com/onnx/onnx/blob/master/docs/Changelog.md#Pad-1 (a small illustration of the change follows below). I believe I can fix this test and I'm working on that now. However, the same test will no longer pass with onnx 1.5 after that (but at least we know how to fix it, I guess). I assume the stacktrace you posted above from the unrelated cd job probably has a similar root cause.

Besides that, I'm trying to make pylint happy, but that's the smaller issue I think.
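To make the change concrete, here is a minimal sketch (an illustration, not the MXNet exporter code) of the two forms of the Pad node. Up to opset 10, the paddings and fill value are node attributes; from opset 11 (the schema the onnx 1.7 checker validates against by default) they are additional inputs, which is why a one-input, attribute-style node is rejected:

```
import numpy as np
from onnx import helper, numpy_helper

# Opset <= 10 style: paddings and fill value are attributes on the node.
pad_old = helper.make_node(
    "Pad", inputs=["x"], outputs=["y"],
    mode="constant", pads=[0, 0, 1, 1, 0, 0, 1, 1], value=0.0)

# Opset >= 11 style: paddings (and the optional constant value) are inputs,
# typically provided as initializer tensors in the graph.
pads = numpy_helper.from_array(
    np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=np.int64), name="pads")
pad_new = helper.make_node(
    "Pad", inputs=["x", "pads"], outputs=["y"], mode="constant")
```

This matches the `Node (pad0) has input size 1 not in range [min=2, max=3]` error seen in the test: the exporter still emits the single-input, attribute-style node, while the newer schema expects two or three inputs.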
[GitHub] [incubator-mxnet] ma-hei edited a comment on pull request #18445: updating ubuntu_cpu base image to 20.04 to observe failing tests regarding Python 3.8
ma-hei edited a comment on pull request #18445:
URL: https://github.com/apache/incubator-mxnet/pull/18445#issuecomment-649099499

@Roshrini I found an issue when updating onnx from 1.5.0 to 1.7.0. The issue can be reproduced with Python 3.6. The following code reproduces the issue. Do you have any idea what's going on?

```
import numpy as np
from onnx import TensorProto
from onnx import helper
from onnx import mapping
from mxnet.contrib.onnx.onnx2mx.import_onnx import GraphProto
from mxnet.contrib.onnx.mx2onnx.export_onnx import MXNetGraph
import mxnet as mx

inputshape = (2, 3, 20, 20)
input_tensor = [helper.make_tensor_value_info("input1", TensorProto.FLOAT, shape=inputshape)]
outputshape = (2, 3, 17, 16)
output_tensor = [helper.make_tensor_value_info("output", TensorProto.FLOAT, shape=outputshape)]

onnx_attrs = {'kernel_shape': (4, 5), 'pads': (0, 0), 'strides': (1, 1), 'p': 1}
nodes = [helper.make_node("LpPool", ["input1"], ["output"], **onnx_attrs)]
graph = helper.make_graph(nodes, "test_lppool1", input_tensor, output_tensor)
onnxmodel = helper.make_model(graph)

graph = GraphProto()
ctx = mx.cpu()
sym, arg_params, aux_params = graph.from_onnx(onnxmodel.graph)
metadata = graph.get_graph_metadata(onnxmodel.graph)
input_data = metadata['input_tensor_data']
input_shape = [data[1] for data in input_data]

# Import the ONNX model to an MXNet model, then export it back to ONNX,
# and import it again to verify the result.
params = {}
params.update(arg_params)
params.update(aux_params)
converter = MXNetGraph()
graph_proto = converter.create_onnx_graph_proto(sym, params, in_shape=input_shape,
                                                in_type=mapping.NP_TYPE_TO_TENSOR_TYPE[np.dtype('float32')])
```

The line that throws the error is:

```
graph_proto = converter.create_onnx_graph_proto(sym, params, in_shape=input_shape,
                                                in_type=mapping.NP_TYPE_TO_TENSOR_TYPE[np.dtype('float32')])
```

The error I'm seeing is:

```
  File "/opt/anaconda3/envs/p36/lib/python3.6/site-packages/onnx/checker.py", line 54, in checker
    proto.SerializeToString(), ctx)
onnx.onnx_cpp2py_export.checker.ValidationError: Node (pad0) has input size 1 not in range [min=2, max=3].

==> Context: Bad node spec: input: "input1" output: "pad0" name: "pad0" op_type: "Pad" attribute { name: "mode" s: "constant" type: STRING } attribute { name: "pads" ints: 0 ints: 0 ints: 0 ints: 0 ints: 0 ints: 0 ints: 0 ints: 0 type: INTS } attribute { name: "value" f: 0 type: FLOAT }
```
[GitHub] [incubator-mxnet] ma-hei edited a comment on pull request #18445: updating ubuntu_cpu base image to 20.04 to observe failing tests regarding Python 3.8
ma-hei edited a comment on pull request #18445:
URL: https://github.com/apache/incubator-mxnet/pull/18445#issuecomment-643807772

@leezu I think I got to a state where I can run the unit tests with Python 3.8 and reproduce what is described in the issue ticket https://github.com/apache/incubator-mxnet/issues/18380. I ignored the lint errors for now by adding disable flags to the pylintrc file. As described in https://github.com/apache/incubator-mxnet/issues/18380, we're seeing an issue related to the usage of time.clock() (see the sketch at the end of this comment).

Besides that, I found the following issue: the test tests/python/unittest/onnx/test_node.py::TestNode::test_import_export seems to fail. In the Jenkins job I don't see the error, but when running the test locally with Python 3.8 and onnx 1.7, I'm getting:

```
>       bkd_rep = backend.prepare(onnxmodel, operation='export', backend='mxnet')
tests/python/unittest/onnx/test_node.py:164:
tests/python/unittest/onnx/backend.py:104: in prepare
    sym, arg_params, aux_params = MXNetBackend.perform_import_export(sym, arg_params, aux_params,
tests/python/unittest/onnx/backend.py:62: in perform_import_export
    graph_proto = converter.create_onnx_graph_proto(sym, params, in_shape=input_shape,
python/mxnet/contrib/onnx/mx2onnx/export_onnx.py:308: in create_onnx_graph_proto
...
E   onnx.onnx_cpp2py_export.checker.ValidationError: Node (pad0) has input size 1 not in range [min=2, max=3].
E
E   ==> Context: Bad node spec: input: "input1" output: "pad0" name: "pad0" op_type: "Pad" attribute { name: "mode" s: "constant" type: STRING } attribute { name: "pads" ints: 0 ints: 0 ints: 0 ints: 0 ints: 0 ints: 0 ints: 0 ints: 0 type: INTS } attribute { name: "value" f: 0 type: FLOAT }
../../../Library/Python/3.8/lib/python/site-packages/onnx/checker.py:53: ValidationError
```

I believe this is an issue in onnx 1.7, as it looks exactly like https://github.com/onnx/onnx/issues/2548.

I also found that the test job Python3: MKL-CPU is not running through, which seems to be due to a timeout. I believe this is happening in tests/python/conftest.py, but the log output is not telling me what exactly goes wrong here. Do you have any idea how to reproduce this locally, or how to get better insight into that failure?

I will now look into the following:

- Can I work around the onnx 1.7 related issue?
- Even after aligning pylint and astroid, I'm seeing unexpected linting errors. The linter is telling me that ndarray is a bad class name. Why is that happening?

Btw, I believe that you and @marcoabreu are getting pinged every time I push a commit to this PR. I don't think this PR is actually in a reviewable state yet; its purpose is more to see what breaks when upgrading to Python 3.8.
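On the time.clock() point: time.clock() was deprecated since Python 3.3 and removed in Python 3.8, so any remaining call to it raises AttributeError under 3.8. A minimal sketch of the usual replacement (an illustration, not the actual MXNet call sites):

```
import sys
import time

# time.clock no longer exists on Python 3.8+.
if sys.version_info >= (3, 8):
    assert not hasattr(time, "clock")

# time.perf_counter() (or time.process_time()) is the usual drop-in replacement.
start = time.perf_counter()
_ = sum(i * i for i in range(100000))
elapsed = time.perf_counter() - start
print("elapsed: %.6fs" % elapsed)
```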
[GitHub] [incubator-mxnet] ma-hei edited a comment on pull request #18445: updating ubuntu_cpu base image to 20.04 to observe failing tests regarding Python 3.8
ma-hei edited a comment on pull request #18445:
URL: https://github.com/apache/incubator-mxnet/pull/18445#issuecomment-639837099

In the build job ci/jenkins/mxnet-validation/unix-cpu, the following command was previously failing:

```
ci/build.py --docker-registry mxnetci --platform ubuntu_cpu --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh sanity_check
```

I was able to reproduce the issue locally and fixed it by making additional changes to ci/docker/Dockerfile.build.ubuntu and ci/docker/install/requirements. What I did in those files is the following:

- made python3.8 the default python3 binary by creating a symlink (see the change in ci/docker/Dockerfile.build.ubuntu)
- updated requirement versions in ci/docker/install/requirements, so that `python3 -m pip install -r /work/requirements` in Dockerfile.build.ubuntu runs successfully. I needed to update onnx, Cython and Pillow; the current versions were not installable with python3.8.

I was then able to build the image successfully and to run the ci/build.py command mentioned above successfully.

I'm now wondering if the failure in "continuous build / macosx-x86_64" that I'm seeing below is already a consequence of the onnx update I made (which is necessary in order to update to Python 3.8, which in turn is the goal of this PR). My question is basically: what is each of the CI jobs below doing? In which of the jobs below should I be able to observe the test failures? Also, let me know if you think this is going down the wrong route and I should try something different.
[GitHub] [incubator-mxnet] ma-hei edited a comment on pull request #18445: updating ubuntu_cpu base image to 20.04 to observe failing tests regarding Python 3.8
ma-hei edited a comment on pull request #18445:
URL: https://github.com/apache/incubator-mxnet/pull/18445#issuecomment-637192961

@leezu I was hoping that I could observe the test failures related to Python 3.8 in one of the ci/jenkins/mxnet-validation build jobs. I assume those jobs did not run because the ci/jenkins/mxnet-validation/sanity build failed. Does the failure of the sanity build look related to the python3.8 update I made in Dockerfile.build.ubuntu? To me it looks like the build stalled at the end and was automatically killed.