sxjscience opened a new issue #19045:
URL: https://github.com/apache/incubator-mxnet/issues/19045


   FYI @szha @sandeep-krishnamurthy @pengzhao-intel  @ptrendx @yzhliu   @leezu  
@xidulu @CassiniXu 
   
   Implementing operators is cumbersome and seems to involve more *"engineering than science"*. This is not really the case, especially when it comes to correctly testing the implementation of an operator. Designing practical numerical computation systems involves many stability concerns, and the following are my initial thoughts on the *science* behind testing operators. Feel free to add more.
   
   
   Let's consider the following example, in which you have implemented a new operator and decide to test it.
   
   ```python
   def new_op(*args):
      # implementation of the operator goes here
      return out
   ```
   
   ## How to test gradient
   
   ### The general testing function
   We can use the following way to obtain the gradient of this operator in 
MXNet.
   
   ```python
   # Generate the inputs; we use two inputs as an example
   a = mx.np.random.normal(0, 1, a_shape)
   b = mx.np.random.normal(0, 1, b_shape)
   
   # Test different gradient requirement flags.
   # `add` covers the case in which gradients are accumulated rather than
   # overwritten, e.g., for sparse gradient accumulation in an embedding layer.
   # In this example, we attach one input with "add" and the other with "write".
   a.attach_grad('add')
   # When it's add, we randomly initialize a gradient
   original_a_grad = mx.np.random.normal(0, 1, a_shape)
   a.grad[:] = original_a_grad
   b.attach_grad('write')
   
   # Generate some random head gradients
   out1_grad = mx.np.random.normal(0, 1, out1_shape)
   out2_grad = mx.np.random.normal(0, 1, out2_shape)
   
   with mx.autograd.record():
      out1, out2 = new_op(a, b)
      loss = (out1 * out1_grad + out2 * out2_grad).sum()
   loss.backward()
   
   mx_a_grad = a.grad.asnumpy()
   mx_b_grad = b.grad.asnumpy()
   ```
   
   Then, in order to test it, you may want to obtain some *ground-truth* gradient and compare it with the gradient produced by the autograd engine (the same idea applies even if you are using PyTorch).
   
   ```python
   import numpy.testing as npt
   # `a` was attached with grad_req='add', so its accumulated gradient should
   # equal the ground-truth gradient plus the randomly initialized original value.
   npt.assert_allclose(mx_a_grad, gt_a_grad + original_a_grad, rtol=1E-3, atol=1E-3)
   npt.assert_allclose(mx_b_grad, gt_b_grad, rtol=1E-3, atol=1E-3)
   ```
   
   ### How to obtain the ground-truth gradient?
   There are multiple ways to obtain this *ground truth* gradient and here is 
how science may play an important role. For some simpler operators, you can 
directly implement the gradient via numpy and try to test against that. 
However, another choice is to test with [Finite 
Difference](https://en.wikipedia.org/wiki/Finite_difference), which works even 
if the inner operator is a *black box*. In MXNet, we used the "central 
difference" variant as shown in 
https://github.com/apache/incubator-mxnet/blob/b0c39f7ea983639093c63d7d2486bbef083a55d6/python/mxnet/test_utils.py#L1013-L1023
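
As an illustration, here is a minimal numpy sketch of such a central-difference check (this is not MXNet's actual `numeric_grad` implementation; the test function and tolerances are assumptions for the example):

```python
import numpy as np

def numeric_grad(f, x, eps=1e-4):
    """Approximate df/dx elementwise with the central difference
    (f(x + eps) - f(x - eps)) / (2 * eps)."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        orig = x[idx]
        x[idx] = orig + eps
        f_plus = f(x)
        x[idx] = orig - eps
        f_minus = f(x)
        x[idx] = orig  # restore the perturbed entry
        grad[idx] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad

# Check against the analytic gradient of f(x) = sum(x ** 2), which is 2 * x.
x = np.random.RandomState(0).normal(0, 1, size=(3, 4))
approx = numeric_grad(lambda v: (v ** 2).sum(), x)
np.testing.assert_allclose(approx, 2 * x, rtol=1e-4, atol=1e-4)
```

Since the operator is only evaluated, never differentiated, this check works for any scalar-valued function of the inputs.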
   
   In addition, you can use the same finite-difference technique to test Jacobians, Hessians and general *higher-order* gradients. In fact, using finite differences to approximate the Hessian is the key idea behind [Hessian-free optimization](http://www.cs.toronto.edu/~jmartens/docs/Deep_HessianFree.pdf). You may also refer to Chapter 8.1 of the book [Numerical Optimization](https://link.springer.com/book/10.1007/978-0-387-40065-5).
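
For instance, a Hessian-vector product can be approximated by a central difference of the *gradient* function. The sketch below uses a quadratic form whose exact Hessian is known, so the approximation can be verified (the function and names here are purely illustrative):

```python
import numpy as np

def hessian_vector_product(grad_f, x, v, eps=1e-4):
    """Approximate H(x) @ v with a central difference of the gradient:
    (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)."""
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)

# For f(x) = 0.5 * x.T @ A @ x with symmetric A, grad_f(x) = A @ x and H = A,
# so the exact Hessian-vector product is A @ v.
rng = np.random.RandomState(0)
A = rng.normal(size=(5, 5))
A = 0.5 * (A + A.T)  # symmetrize
x = rng.normal(size=5)
v = rng.normal(size=5)
hv = hessian_vector_product(lambda y: A @ y, x, v)
np.testing.assert_allclose(hv, A @ v, rtol=1e-5, atol=1e-5)
```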
   
   Since random tests may cause flakiness, I think one solution is to always fix the seed when doing any randomized test.
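
As a minimal sketch of what fixing the seed buys us (using numpy's `RandomState` here; in MXNet the analogous call would be `mx.random.seed`):

```python
import numpy as np

def sample_inputs(seed):
    # A seeded generator makes the "random" test inputs deterministic,
    # so any failure can be reproduced exactly.
    rng = np.random.RandomState(seed)
    return rng.normal(0, 1, size=(2, 3))

# Two runs with the same seed draw identical inputs.
np.testing.assert_array_equal(sample_inputs(42), sample_inputs(42))
```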
   
   ### Mixed Data Types
   
   Things are more complicated when your operator needs to support **mixed precision**. For example, assume that you need to test with **fp16**. The obvious consequence is that you cannot run the *Finite Difference* check directly in *fp16*, since the difference of two nearly equal half-precision values is not accurate enough. Thus, I think the appropriate way is to always generate the ground-truth gradient in *fp32*, cast it back to fp16, and then compare the results. (For this part, we may need some further discussion.)
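
As a sketch of that comparison, with `exp` standing in for the real operator and illustrative tolerances (the function names here are assumptions for the example):

```python
import numpy as np

def op_fp16(x):
    # Hypothetical half-precision implementation under test.
    return np.exp(x.astype(np.float16))

def op_fp32_reference(x):
    # Ground truth computed in fp32, then cast down to fp16 for comparison.
    return np.exp(x.astype(np.float32)).astype(np.float16)

x = np.random.RandomState(0).normal(0, 1, size=(4, 4)).astype(np.float16)
# fp16 carries roughly 3 decimal digits of precision, so the tolerances
# must be much looser than in the fp32 tests.
np.testing.assert_allclose(op_fp16(x), op_fp32_reference(x), rtol=1e-2, atol=1e-2)
```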
   
   ## How to test the correctness of **random** operators
   Things become more complicated when your operator involves *randomness*. For example, suppose you implemented an operator that generates Gaussian random variables, and your task is to convince others that it is indeed generating Gaussian samples.
   
   In order to do that, you can conduct **statistical testing**. For example, you can compare the moments of the generated samples with those of the ground-truth distribution (mean and variance for a Gaussian). A more principled way is to run standard statistical tests like the [Kolmogorov-Smirnov Test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) (available as [`scipy.stats.kstest`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html)) or the [Chi-Square Test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html), which help you verify whether the empirical distribution matches the ground-truth distribution.
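
To make this concrete, here is a self-contained sketch of the one-sample KS statistic in plain numpy; in practice you would simply call `scipy.stats.kstest`, and the acceptance threshold below is an illustrative assumption:

```python
import math
import numpy as np

def ks_statistic(samples, cdf):
    # One-sample Kolmogorov-Smirnov statistic: the largest gap between the
    # empirical CDF of the samples and the target CDF.
    x = np.sort(samples)
    n = len(x)
    target = np.array([cdf(v) for v in x])
    ecdf_hi = np.arange(1, n + 1) / n  # ECDF just after each sample point
    ecdf_lo = np.arange(0, n) / n      # ECDF just before each sample point
    return max(np.max(ecdf_hi - target), np.max(target - ecdf_lo))

def normal_cdf(v):
    # CDF of N(0, 1) via the error function.
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

# The sampler under test: numpy's Gaussian generator stands in for the new op.
samples = np.random.RandomState(0).normal(0, 1, size=10000)

# For a correct sampler the statistic is roughly O(1/sqrt(n)); 0.02 is a
# deliberately loose bound for n = 10000 (the 95% critical value is ~0.0136).
assert ks_statistic(samples, normal_cdf) < 0.02
```

A broken sampler (say, uniform instead of Gaussian) would produce a statistic an order of magnitude larger, so the test has real rejection power.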
   
   In the current MXNet, we provide a tool for Chi-Square testing: 
   https://github.com/apache/incubator-mxnet/blob/b0c39f7ea983639093c63d7d2486bbef083a55d6/python/mxnet/test_utils.py#L2104

