Jason House wrote:

Given that doing one parameter at a time may be less ideal, I don't know if my method would really inherit those properties or not.

Probably not, because the Hessian has significant non-diagonal values. But I expect it would still converge in fewer iterations than MM.


    Still, MM has the advantage that it can be applied in batch mode, thus
    saving a lot of computation.


What do you mean by batch mode? Are you implying something like what's at the top of page 387 in Hunter's paper [1]? I initially interpreted it that way, but I don't really believe that any more. It looks like too much work to be useful.

No, batch mode is the opposite of what is explained on that page. It means all the parameters are updated at the same time, so that the 1/Ej are computed only once.
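
For concreteness, here is a rough sketch of one batch MM iteration for a plain multi-way Bradley-Terry model (one player per team, no feature products). The names and data layout are mine, not the actual implementation:

import numpy as np

def mm_batch_iteration(gamma, games):
    # gamma: strength parameter of every player/feature (illustrative layout)
    # games: list of (winner, participants) pairs; participants is an index array
    wins = np.zeros_like(gamma)      # W_i: number of wins of player i
    denom = np.zeros_like(gamma)     # sum over the games containing i of 1/E_j
    for winner, participants in games:
        E_j = gamma[participants].sum()   # E_j computed once, shared by all participants
        wins[winner] += 1.0
        denom[participants] += 1.0 / E_j
    return wins / denom              # all the gamma_i are updated at the same time

One pass over the data updates every parameter, which is where the saving comes from compared to recomputing the E_j for each parameter separately.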


For each iteration of batch updates, do you pass over the data, compute whatever you can, and then pick a set of mutually exclusive items to update? How bad would it be to update everything, including those on the same team?

The algorithm may not converge. In the beginning, when I had only a few features, I updated everything at the same time, and it worked. But when I added more features, it diverged.


In what I'm envisioning for my method, the computations of 1/Ej would also be shared across all the parameter updates. I'd also keep a running sum of the selection probability and of the square of the selection probability. At the end of the data, I'd apply my equation to compute the updates of all parameters.

If what you do is Newton's method with a diagonal approximation of the Hessian, you may have to add a small negative constant to the terms of that diagonal. One advantage of MM is that it has a convergence proof. It is difficult to get such a guarantee with approximate Newton.
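
As a minimal sketch of one way to read the two paragraphs above (my interpretation, not Jason's actual equation nor Rémi's code): the running sums of the selection probability p and of p^2 give the diagonal of the Hessian in log-strength space, and the damping is the small constant added to that diagonal:

import numpy as np

def damped_diag_newton_iteration(log_gamma, games, damping=0.1):
    # log_gamma: log strength of every player/feature; damping > 0 is the
    # small constant added (with a minus sign) to the Hessian diagonal
    gamma = np.exp(log_gamma)
    grad = np.zeros_like(log_gamma)     # W_i - sum over games of p_i
    sum_p = np.zeros_like(log_gamma)    # running sum of selection probabilities
    sum_p2 = np.zeros_like(log_gamma)   # running sum of their squares
    for winner, participants in games:
        E_j = gamma[participants].sum()        # shared 1/E_j computation
        p = gamma[participants] / E_j          # selection probabilities
        grad[winner] += 1.0
        grad[participants] -= p
        sum_p[participants] += p
        sum_p2[participants] += p * p
    hessian_diag = -(sum_p - sum_p2)           # always <= 0
    return log_gamma - grad / (hessian_diag - damping)

Without the damping term the denominator can get arbitrarily close to zero for rarely selected features, which is one way such an update can blow up.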

Maybe conjugate gradient would work, too (chapter 10.6):
http://www.nrbook.com/a/bookcpdf.php
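
If anyone wants to try that route, here is a toy, self-contained example (the model and data are illustrative only) using an off-the-shelf nonlinear conjugate-gradient routine:

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(log_gamma, games):
    # Negative log-likelihood and gradient of the same toy multi-way
    # Bradley-Terry model as in the sketches above.
    gamma = np.exp(log_gamma)
    value = 0.0
    grad = np.zeros_like(log_gamma)
    for winner, participants in games:
        E_j = gamma[participants].sum()
        p = gamma[participants] / E_j
        value -= log_gamma[winner] - np.log(E_j)
        grad[winner] -= 1.0
        grad[participants] += p
    return value, grad

games = [(0, [0, 1]), (1, [1, 2]), (2, [2, 0])]   # made-up toy data
result = minimize(neg_log_likelihood, x0=np.zeros(3), args=(games,),
                  method="CG", jac=True)

(The strengths are only defined up to a common shift, so in practice one of them is usually fixed.)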

Rémi
_______________________________________________
computer-go mailing list
[email protected]
http://www.computer-go.org/mailman/listinfo/computer-go/
