leezu opened a new pull request #19185: URL: https://github.com/apache/incubator-mxnet/pull/19185
## Description ##

Resubmit of https://github.com/apache/incubator-mxnet/pull/19034, which was temporarily reverted due to oneDNN issues with GCC 8.

@TaoLV can your team help debug / fix the oneDNN issues? When both GCC 8 and oneDNN 1.6.3 are present, we get the following NaN failures:

```
[2020-09-17T17:48:04.979Z] ______________________ test_dc_hybridblock_deferred_init _______________________
[2020-09-17T17:48:04.979Z] [gw0] linux -- Python 3.6.9 /opt/rh/rh-python36/root/usr/bin/python3
[2020-09-17T17:48:04.979Z]
[2020-09-17T17:48:04.979Z]     def test_dc_hybridblock_deferred_init():
[2020-09-17T17:48:04.979Z]         class MyBlock(mx.gluon.HybridBlock):
[2020-09-17T17:48:04.979Z]             def __init__(self):
[2020-09-17T17:48:04.979Z]                 super().__init__()
[2020-09-17T17:48:04.979Z]                 self.dense = mx.gluon.nn.Dense(units=10)
[2020-09-17T17:48:04.979Z]                 self.weight = mx.gluon.Parameter('weight', allow_deferred_init=True)
[2020-09-17T17:48:04.979Z]
[2020-09-17T17:48:04.979Z]             def infer_shape(self, x):
[2020-09-17T17:48:04.979Z]                 self.weight.shape = (x.shape[1], )
[2020-09-17T17:48:04.979Z]
[2020-09-17T17:48:04.979Z]             def forward(self, x):
[2020-09-17T17:48:04.979Z]                 return self.dense(x) + self.weight.data(x.context)
[2020-09-17T17:48:04.979Z]
[2020-09-17T17:48:04.979Z]         net = MyBlock()
[2020-09-17T17:48:04.979Z]         net.initialize()
[2020-09-17T17:48:04.979Z] >       _assert_dc_gluon(_dc_gluon_simple_setup, net, numpy=False)
[2020-09-17T17:48:04.979Z]
[2020-09-17T17:48:04.979Z] tests/python/unittest/test_deferred_compute.py:504:
[2020-09-17T17:48:04.979Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2020-09-17T17:48:04.979Z] tests/python/unittest/test_deferred_compute.py:421: in _assert_dc_gluon
[2020-09-17T17:48:04.979Z]     _all_same(ys_np, ys_hybrid_np)
[2020-09-17T17:48:04.979Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2020-09-17T17:48:04.979Z]
[2020-09-17T17:48:04.979Z] arrays1 = [array([ nan, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z]         0.07460972, -0.08127148, -0.32424796,...33878, -0.10624887,
[2020-09-17T17:48:04.979Z]         0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z]       dtype=float32), ...]
[2020-09-17T17:48:04.979Z] arrays2 = [array([ 0.01286458, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z]         0.07460972, -0.08127148, -0.32424796,...33878, -0.10624887,
[2020-09-17T17:48:04.979Z]         0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z]       dtype=float32), ...]
[2020-09-17T17:48:04.979Z] message = ''
[2020-09-17T17:48:04.979Z]
[2020-09-17T17:48:04.979Z]     def _all_same(arrays1, arrays2, message=''):
[2020-09-17T17:48:04.979Z]         same = all(np.array_equal(a1, a2) for a1, a2 in zip(arrays1, arrays2))
[2020-09-17T17:48:04.979Z]         if not same:
[2020-09-17T17:48:04.979Z] >           raise AssertionError('Arrays not equal ({}):\n{}\n\n{}'.format(message, arrays1, arrays2))
[2020-09-17T17:48:04.979Z] E           AssertionError: Arrays not equal ():
[2020-09-17T17:48:04.979Z] E           [array([ nan, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ nan, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ nan, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32)]
[2020-09-17T17:48:04.979Z] E
[2020-09-17T17:48:04.979Z] E           [array([ 0.01286458, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32), array([ 0.01286458, 0.2107217 , -0.06851891, 0.16233878, -0.10624887,
[2020-09-17T17:48:04.979Z] E                   0.07460972, -0.08127148, -0.32424796, -0.0124862 , -0.1862593 ],
[2020-09-17T17:48:04.979Z] E                 dtype=float32)]
```

Either reverting to GCC 7 (which was done) or reverting the oneDNN update (https://github.com/apache/incubator-mxnet/pull/19180) fixes the issue.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
