dbsxdbsx opened a new issue #14544: Question on gradient calculation of example: reinforcement-learning/parallel_actor_critic
URL: https://github.com/apache/incubator-mxnet/issues/14544
 
 
   I wonder about the inner mechanism of the gradient calculation in this example. Here are some excerpts from model.py:
   ```
   # Imports used by this excerpt.
   from itertools import chain

   import numpy as np
   import scipy.signal
   import mxnet as mx


   class Agent(object):
       def __init__(self, input_size, act_space, config):
           super(Agent, self).__init__()
           self.input_size = input_size
           self.num_envs = config.num_envs
           self.ctx = config.ctx
           self.act_space = act_space
           self.config = config
   
           # Shared network.
           net = mx.sym.Variable('data')
           net = mx.sym.FullyConnected(
               data=net, name='fc1', num_hidden=config.hidden_size, no_bias=True)
           net = mx.sym.Activation(data=net, name='relu1', act_type="relu")
   
           # Policy network.
           policy_fc = mx.sym.FullyConnected(
               data=net, name='policy_fc', num_hidden=act_space, no_bias=True)
           policy = mx.sym.SoftmaxActivation(data=policy_fc, name='policy')
           policy = mx.sym.clip(data=policy, a_min=1e-5, a_max=1 - 1e-5)
           log_policy = mx.sym.log(data=policy, name='log_policy')
           out_policy = mx.sym.BlockGrad(data=policy, name='out_policy')
   
           # Negative entropy.
           neg_entropy = policy * log_policy
           neg_entropy = mx.sym.MakeLoss(
               data=neg_entropy, grad_scale=config.entropy_wt, name='neg_entropy')
   
           # Value network.
           value = mx.sym.FullyConnected(data=net, name='value', num_hidden=1)
   
           self.sym = mx.sym.Group([log_policy, value, neg_entropy, out_policy])
           self.model = mx.mod.Module(self.sym, data_names=('data',),
                                      label_names=None)
   
           self.paralell_num = config.num_envs * config.t_max
           self.model.bind(
               data_shapes=[('data', (self.paralell_num, input_size))],
               label_shapes=None,
               grad_req="write")
   
           self.model.init_params(config.init_func)
   
           optimizer_params = {'learning_rate': config.learning_rate,
                               'rescale_grad': 1.0}
           if config.grad_clip:
               optimizer_params['clip_gradient'] = config.clip_magnitude
   
           self.model.init_optimizer(
               kvstore='local', optimizer=config.update_rule,
               optimizer_params=optimizer_params)
   
       def act(self, ps):
           us = np.random.uniform(size=ps.shape[0])[:, np.newaxis]
           as_ = (np.cumsum(ps, axis=1) > us).argmax(axis=1)
           return as_
   
       def train_step(self, env_xs, env_as, env_rs, env_vs):
           # NOTE(reed): Reshape to set the data shape.
           self.model.reshape([('data', (len(env_xs), self.input_size))])
   
           xs = mx.nd.array(env_xs, ctx=self.ctx)
           as_ = np.array(list(chain.from_iterable(env_as)))
   
           # Compute discounted rewards and advantages.
           advs = []
           gamma, lambda_ = self.config.gamma, self.config.lambda_
           for i in range(len(env_vs)):
               # Compute advantages using Generalized Advantage Estimation;
               # see eqn. (16) of [Schulman 2016].
               delta_t = (env_rs[i] + gamma*np.array(env_vs[i][1:]) -
                          np.array(env_vs[i][:-1]))
               advs.extend(self._discount(delta_t, gamma * lambda_))
   
           # Negative generalized advantage estimations.
           neg_advs_v = -np.asarray(advs)
   
           # NOTE(reed): Only keeping the grads for selected actions.
           neg_advs_np = np.zeros((len(advs), self.act_space), dtype=np.float32)
           neg_advs_np[np.arange(neg_advs_np.shape[0]), as_] = neg_advs_v
           neg_advs = mx.nd.array(neg_advs_np, ctx=self.ctx)
   
           # NOTE(reed): The grads of values is actually negative advantages.
           v_grads = mx.nd.array(self.config.vf_wt * neg_advs_v[:, np.newaxis],
                                 ctx=self.ctx)
   
           data_batch = mx.io.DataBatch(data=[xs], label=None)
           self._forward_backward(data_batch=data_batch,
                                  out_grads=[neg_advs, v_grads])
   
           self._update_params()
   
       def _discount(self, x, gamma):
           return scipy.signal.lfilter([1], [1, -gamma], x[::-1], axis=0)[::-1]
   
       def _forward_backward(self, data_batch, out_grads=None):
           self.model.forward(data_batch, is_train=True)
           self.model.backward(out_grads=out_grads)
   ```
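   For reference, this is how I currently understand the data that `train_step` expects from the rollout loop. The driver below is only my own hypothetical sketch with made-up shapes and a made-up config (the real rollout collection happens elsewhere in the example), and it assumes the full model.py, including the `_update_params` helper that the excerpt above omits, is importable:
   ```
   # Hypothetical driver for train_step, only to illustrate the data layout I assume.
   from types import SimpleNamespace

   import numpy as np
   import mxnet as mx

   from model import Agent  # the example's full model.py, assumed importable

   # Made-up config holding the fields the excerpt above reads; the values are mine.
   config = SimpleNamespace(
       num_envs=2, t_max=3, ctx=mx.cpu(), hidden_size=16,
       entropy_wt=0.01, vf_wt=0.5, gamma=0.99, lambda_=1.0,
       learning_rate=1e-3, grad_clip=False, clip_magnitude=40.0,
       update_rule='adam', init_func=mx.init.Xavier())
   agent = Agent(input_size=4, act_space=2, config=config)

   # One rollout of t_max steps from each of num_envs environments:
   # - env_xs: all observations, flattened over envs and time
   # - env_as / env_rs: one list of actions / rewards per env, of length t_max
   # - env_vs: one list of value estimates per env, of length t_max + 1, so that
   #   env_vs[i][1:] and env_vs[i][:-1] line up in the GAE delta
   env_xs = np.random.randn(config.num_envs * config.t_max, 4).astype(np.float32)
   env_as = [[0, 1, 0], [1, 1, 0]]
   env_rs = [np.ones(config.t_max), np.ones(config.t_max)]
   env_vs = [np.zeros(config.t_max + 1), np.zeros(config.t_max + 1)]
   agent.train_step(env_xs, env_as, env_rs, env_vs)
   ```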
   First, I know that for the actor-critic algorithm the policy advantage should be maximized while the state-value difference should be minimized, and the absolute values of these two quantities are the same.
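   To make this concrete, the objective I have in mind is the usual A2C one (my own notation, not taken from the example), with A_t the advantage and R_t the value target, where, as I understand it, c_v plays the role of `config.vf_wt` and c_e of `config.entropy_wt`:
   ```
   L(\theta) = -\sum_t A_t \log \pi_\theta(a_t \mid s_t)
               + c_v \sum_t \bigl(V_\theta(s_t) - R_t\bigr)^2
               + c_e \sum_t \sum_a \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t)
   ```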
   Then, my questions:
   1. For the policy advantages: in a Gluon version, I know that `log_policy` has to be multiplied by the advantage and the negative result is then treated as the policy loss passed to `autograd.backward` (see the sketch after these questions). But there is no such manipulation here in `train_step`; does that mean `log_policy` is multiplied implicitly?
   2. For the state-value estimation: to minimize the difference, an L1 or L2 loss is usually passed to `autograd.backward` in a Gluon version. But there seems to be a trick used here, as the comment says: "# NOTE(reed): The grads of values is actually negative advantages." Similar to question 1: what exactly is calculated from `v_grads` and the corresponding network output `value` when back-propagating?
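   For comparison, this is the Gluon-style computation I am referring to in questions 1 and 2. It is only my own sketch (`gluon_style_step`, `net`, and `trainer` are hypothetical names, not from this example), assuming `net(xs)` returns `(log_policy, value)` with shapes `(N, act_space)` and `(N, 1)`:
   ```
   from mxnet import autograd, nd

   def gluon_style_step(net, trainer, xs, actions, advantages, returns, vf_wt=0.5):
       """One update in the explicit-loss style I describe above (hypothetical).
       The entropy bonus is omitted for brevity."""
       actions = nd.array(actions)
       advantages = nd.array(advantages)
       returns = nd.array(returns)
       with autograd.record():
           log_policy, value = net(xs)
           # log pi(a_t | s_t) for the actions that were actually taken.
           chosen = nd.pick(log_policy, actions, axis=1)
           # Question 1: advantage-weighted log-prob, negated so that minimizing
           # the loss maximizes the expected advantage.
           policy_loss = -(chosen * advantages).sum()
           # Question 2: plain L2 loss between the value head and the return target.
           value_loss = ((value.reshape((-1,)) - returns) ** 2).sum()
           loss = policy_loss + vf_wt * value_loss
       loss.backward()
       trainer.step(batch_size=xs.shape[0])
   ```
   In the symbolic version above there is no explicit `policy_loss` or `value_loss` like this, only `backward(out_grads=[neg_advs, v_grads])`, which is exactly what I am trying to understand.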
   
    I know this question may be somewhat related to math, and I am sorry for my poor grasp of gradient calculation. I hope someone can give a clear answer, thanks.
