pioy commented on pull request #19185:
URL: https://github.com/apache/incubator-mxnet/pull/19185#issuecomment-700603056


   The bug occurs in oneDNN GEMM calculations with small dimensions (1<n<16).
   It is triggered when one of the zmm registers (zmm24, zmm25, zmm26, or 
zmm27) holds values that can be interpreted as NaN just before the gemm kernel 
is called.
   Those values can be left over from other calculations, most likely integer 
operations (as in the case of this PR). 
   I assume that float kernels do not return NaN in a properly configured 
pipeline.
   The NaN values, if they are not overwritten by other AVX-512 kernels, may 
persist for a long time, so they may come from operations that were executed 
much earlier in the pipeline.
   
   As a result of the bug, NaNs propagate to the result array, which may 
terminate the execution of operators.
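
   The mechanism described above can be illustrated with a minimal sketch. It 
does not reproduce the oneDNN kernel or touch zmm registers; it only shows, 
under plain IEEE 754 rules, that an integer bit pattern can read as a float32 
NaN and that such a stale value poisons an accumulation. The helper names 
(int32_bits_as_float32, dot_into) are hypothetical and exist only for this 
example.

```python
import math
import struct


def int32_bits_as_float32(value: int) -> float:
    """Reinterpret the 32 bits of an integer as an IEEE 754 float32."""
    raw = struct.pack("<I", value & 0xFFFFFFFF)
    return struct.unpack("<f", raw)[0]


def dot_into(acc: float, a: list, b: list) -> float:
    """Accumulate the dot product of a and b into acc (hypothetical helper)."""
    for x, y in zip(a, b):
        acc += x * y
    return acc


# int32 -1 is 0xFFFFFFFF: exponent bits all ones, non-zero mantissa, i.e. NaN.
stale = int32_bits_as_float32(-1)

# A zero-initialized accumulator gives the expected dot product.
clean_result = dot_into(0.0, [1.0, 2.0], [3.0, 4.0])

# An accumulator that still holds the stale pattern yields NaN instead.
poisoned_result = dot_into(stale, [1.0, 2.0], [3.0, 4.0])

# Multiplying by zero does not clear a NaN either.
scaled_by_zero = 0.0 * stale
```

   Checking the output with math.isnan(poisoned_result) is a cheap way to 
detect this kind of contamination after the fact, although it says nothing 
about which earlier operation left the value behind.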
   
   The fix has been merged into master/1.8/1.7. It's ready for customer testing.
   See 
https://github.com/oneapi-src/oneDNN/commit/5ce95efe6f5e86cddbf704b637063cd8dc914125.
   There are some other fixes to be merged into the 1.6 branch. The tag v1.6.4 
will be added after those fixes are merged. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

