Re: [computer-go] ELO Ratings of move pattern
I was doing an in-depth review of Hunter's paper over the extended weekend... I think I now see the value of the MM algorithms, and the understand the divergence problem. MM has the nice property that convergence is guaranteed. Newton-Raphson (N-R) carries no such guarantee. The paper includes a simple (83 variable?) example where N-R fails. That doesn't mean that it has no value, but any self-respecting implementation may need to replace N-R steps with an MM steps if N-R fails to increase the objective function (Hunter calls this modified Newton-Raphson) There are a few critical properties of a minorization function. Below I use Q to represent the minorized version, and L to represent to original objective function. 1. It must be guaranteed that Q( old x ) == L( old x ) 2. It must be guaranteed that L( new x ) = Q( new x ) 3. It must be guaranteed that Q( new x ) Q( old x ) Combining those together L( new x ) = Q(new x) Q( old x) = L (old x) ... or L(new x) L(old x)... ensuring forward progression. Properties 12 are a basic requirement for minorized functions. #3 is a criteria on how to select between candidate minorized functions. The math of the new function must remain easy to achieve this. Usually this will mean defining a function that is easy to maximize. In practice this is further translated to mean that the partial derivatives of Q are independent of all other variables being adjusted. Without this, optimizing variables independently (and doing a batch update) breaks property 3 and convergence guarantees are lost. This completely explains the divergence Remi observed and rationalizes why his heuristic is correct (don't update more than one member of any team). That restores the independence that makes finding a true maximum easy. I've convinced myself that exactly replicating Remi's solution is the correct first step in doing the ELO computations. If speed is a factor, my second step may be to compute multiple update strategies (such as N-R) and then use whichever gives the largest jump in performance. On Dec 21, 2007 2:07 PM, Rémi Coulom [EMAIL PROTECTED] wrote: Jason House wrote: For each iteration of batch updates, do you pass over the data computing whatever you can and then pick a set of mutually exclusive items to update? How bad would it be to update everything, including those on the same team? The algorithm may not converge. In the beginning, when I had only a few features, I updated everything at the same time, and it worked. But when I added more features, it diverged. One advantage of MM is that it has proved convergence. It is difficult to get such a guarantee with approximate Newton. ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
On Dec 21, 2007 8:53 AM, Rémi Coulom [EMAIL PROTECTED] wrote: Hi, Minorization-maximization is a simple optimization method, and I agree that it is likely that more efficient algorithms can be applied. Newton's method implies estimating the inverse of the Hessian matrix. Really computing the inverse has a cost cubic in the size of the matrix, so it is not practical with tens of thousands of parameters. Nevertheless, Newton's method can be applied to each parameter one by one. If I understand correctly, this is what Jason proposed. One step of Newton's method on just one parameter has a computational cost very similar to MM, and is much more efficient (it should converge in just one or two iterations). You understand correctly. Mine was based off what I remembered of Newton's method from High School. We didn't mess with Hessian matrices! Hunter's paper [1] does compare MM with the matrix form of the Newton-Raphson method. It shows convergence in far fewer iterations (6 vs. 26 in an example with 82 parameters) Given that doing one parameter at a time may be less ideal, I don't know if my method would really inherit those properties or not. Still, MM has the advantage that it can be applied in batch mode, thus saving a lot of computation. What do you mean by batch mode? Are you implying something like what's at the top of page 387 in Hunter's paper? [1]. I initially it as that, but I don't really believe that any more. It looks like too much work to be useful. For each iteration of batch updates, do you pass over the data computing whatever you can and then pick a set of mutually exclusive items to update? How bad would it be to update everything, including those on the same team? In what I'm envisioning for my method, computations of 1/Ej would also be shared for each computations. I'd also keep a running sum of the selection probability and the square of the selection probability. Once at the end of the data, I'd apply my equation to compute the update of all parameters. [1] http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=bodyid=pdf_1handle=euclid.aos/1079120141 ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
On Dec 21, 2007 8:53 AM, Rémi Coulom [EMAIL PROTECTED] wrote: Hi, Minorization-maximization is a simple optimization method, and I agree that it is likely that more efficient algorithms can be applied. Newton's method implies estimating the inverse of the Hessian matrix. Really computing the inverse has a cost cubic in the size of the matrix, so it is not practical with tens of thousands of parameters. Well, you don't need the exact solution to the system of linear equations, so you can apply an iterative method instead of doing full Gaussian elimination. Well, and I wasn't thinking of having tens of thousands of parameters... How much data do you intend to use to get decent coefficients in a regression with such number of coefficients to be determined? Aren't they going to be extremely noisy? I am sure MM is a perfectly good algorithm for this purpose, but it has the serious down side that I don't understand it. :) I do understand the general idea behind it and how it works in some simple cases, but I don't know enough to adapt it to my particular needs. Nevertheless, Newton's method can be applied to each parameter one by one. If I understand correctly, this is what Jason proposed. One step of Newton's method on just one parameter has a computational cost very similar to MM, and is much more efficient (it should converge in just one or two iterations). Hmmm... That sounds similar to just considering the diagonal part of the Hessian matrix, or something of that sort. If the features are sort of orthogonal in some sense (uncorrelated?), this should be a good approximation, but I don't know how well this would apply to our particular regression problems. Remi, do you dump the training data to a simple text file and you then feed that to a program that does the regression, or does the program compute it on the fly from the database of positions? Álvaro. ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
On Dec 21, 2007 10:03 AM, Álvaro Begué [EMAIL PROTECTED] wrote: I am sure MM is a perfectly good algorithm for this purpose, but it has the serious down side that I don't understand it. :) I do understand the general idea behind it and how it works in some simple cases, but I don't know enough to adapt it to my particular needs. Based on my analysis, the assumptions in MM are not always that accurate. It's partly what made me derive my own method. Of course, I simply used what I knew. Nevertheless, Newton's method can be applied to each parameter one by one. If I understand correctly, this is what Jason proposed. One step of Newton's method on just one parameter has a computational cost very similar to MM, and is much more efficient (it should converge in just one or two iterations). Hmmm... That sounds similar to just considering the diagonal part of the Hessian matrix, or something of that sort. If the features are sort of orthogonal in some sense (uncorrelated?), this should be a good approximation, but I don't know how well this would apply to our particular regression problems. I doubt (very highly) that parameters are uncorrelated. Remi, do you dump the training data to a simple text file and you then feed that to a program that does the regression, or does the program compute it on the fly from the database of positions? I'm not Remi, but I can at least say that I plan to dump my data to a text file. I view that as having a few advantages. It removes logic from my engine that is almost orthogonal to its true purpose and allows me to easily share my tool with others (I plan to release it under GPLv3 when done). I'm expecting the iterative passes over the very large input set to be the limiting factor for the tool. ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
Jason House wrote: Given that doing one parameter at a time may be less ideal, I don't know if my method would really inherit those properties or not. Probably not, because the Hessian has significant non-diagonal values. But I expect it would still converge in less iterations than MM. Still, MM has the advantage that it can be applied in batch mode, thus saving a lot of computation. What do you mean by batch mode? Are you implying something like what's at the top of page 387 in Hunter's paper? [1]. I initially it as that, but I don't really believe that any more. It looks like too much work to be useful. No, batch mode is the contrary of what is explained on that page. It means all the parameters are updated at the same time, so that the 1/Ej are computed only once. For each iteration of batch updates, do you pass over the data computing whatever you can and then pick a set of mutually exclusive items to update? How bad would it be to update everything, including those on the same team? The algorithm may not converge. In the beginning, when I had only a few features, I updated everything at the same time, and it worked. But when I added more features, it diverged. In what I'm envisioning for my method, computations of 1/Ej would also be shared for each computations. I'd also keep a running sum of the selection probability and the square of the selection probability. Once at the end of the data, I'd apply my equation to compute the update of all parameters. If what you do is Newton's method with diagonal approximation of the Hessian, you may have to add a small negative constant to terms of that diagonal. One advantage of MM is that it has proved convergence. It is difficult to get such a guarantee with approximate Newton. Maybe conjugate gradient would work, too (chapter 10.6): http://www.nrbook.com/a/bookcpdf.php Rémi ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
On Dec 5, 2007 4:44 AM, Lars [EMAIL PROTECTED] wrote: I have some questions concernig this paper of Remi: http://remi.coulom.free.fr/Amsterdam2007/MMGoPatterns.pdf @Remi: How many iterations you had used? Anyone of you have similar or other experiences with the algorithm? I seem to have more time to think than to code lately. I believe I've derived an alternate update method. I'd be curious how quickly it converges. Using Minorization-Maximization, Remi derived new gamma = wins/sum(C/E) Using Newton-Raphson method, I derived fractional change = (estimation error) * (fractional uncertainty)^2 new gamma = gamma * (1 + fractional change) Where estimation error = (wins-expected wins) fractional uncertainty = sigma/gamma fractional change = new gamma/gamma - 1 Changing gamma based on both the estimation error and the uncertainty that gamma is correct makes intuitive sense to me. Sigma matches remi's definition. Because strengths are multiplied together, fractional uncertainty makes more sense. I'm avoiding the difficult issue of explaining why, but I do point people to consider two things: 1. If gamma=gamma+sigma, what does that do to team strength? (It multiplies the team strength by 1 + fractional uncertainty) 2. Look at the similar definitions of fractional uncertainty and fractional change. A less intuitive form is: fractional change = (wins-expected wins) / (expected wins^2 - wins) Using Remi's notation, expected wins = gamma*sum(C/E) Regardless of if someone else plays with the alternate equation, I'll be playing with both after the holidays. I'd also be curious what thoughts people have about this. To derive this, I applied the Newton-Raphson method to find where the partial derivative of the log-evidence is zero (AKA, where log-evidence is maximized). If L = log evidence, and *L = 1st partial derivative of L with respect to gamma and **L = 2nd partial derivative of L with respect to gamma, then new gamma - gamma = *L/**L. ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
I was trying to come up with my own algorithm to maximize likelihood and I am having a hard time getting it all in my mind. I managed to write a working algorithm for the case of logistic regression, but it was kind of brittle and I didn't know how to extend it to the softmax case, which is what I wanted. Over lunch I thought of another way of doing it that would be very general and easy to implement. Basically, I can compute the log-likelihood for a particular setting of the weights, and I can compute the gradient and the Jacobian matrix (the derivative with respect to each weight or pair of weights, respectively). Then I can approximate the function I am minimizing using a paraboloid computed from that data, compute the minimum of the paraboloid and take that point to be the new setting of the weights. Has anyone tried something like this? It seems like the natural way to do it to me. Anyway, I would really appreciate if other people that are trying this kind of thing could exchange some test datasets with me to see if I get to similar weights. I think I will make the code available for anyone to use after I get it to be satisfactory enough. Álvaro. On Dec 20, 2007 11:43 AM, Jason House [EMAIL PROTECTED] wrote: On Dec 5, 2007 4:44 AM, Lars [EMAIL PROTECTED] wrote: I have some questions concernig this paper of Remi: http://remi.coulom.free.fr/Amsterdam2007/MMGoPatterns.pdf @Remi: How many iterations you had used? Anyone of you have similar or other experiences with the algorithm? I seem to have more time to think than to code lately. I believe I've derived an alternate update method. I'd be curious how quickly it converges. Using Minorization-Maximization, Remi derived new gamma = wins/sum(C/E) Using Newton-Raphson method, I derived fractional change = (estimation error) * (fractional uncertainty)^2 new gamma = gamma * (1 + fractional change) Where estimation error = (wins-expected wins) fractional uncertainty = sigma/gamma fractional change = new gamma/gamma - 1 Changing gamma based on both the estimation error and the uncertainty that gamma is correct makes intuitive sense to me. Sigma matches remi's definition. Because strengths are multiplied together, fractional uncertainty makes more sense. I'm avoiding the difficult issue of explaining why, but I do point people to consider two things: 1. If gamma=gamma+sigma, what does that do to team strength? (It multiplies the team strength by 1 + fractional uncertainty) 2. Look at the similar definitions of fractional uncertainty and fractional change. A less intuitive form is: fractional change = (wins-expected wins) / (expected wins^2 - wins) Using Remi's notation, expected wins = gamma*sum(C/E) Regardless of if someone else plays with the alternate equation, I'll be playing with both after the holidays. I'd also be curious what thoughts people have about this. To derive this, I applied the Newton-Raphson method to find where the partial derivative of the log-evidence is zero (AKA, where log-evidence is maximized). If L = log evidence, and *L = 1st partial derivative of L with respect to gamma and **L = 2nd partial derivative of L with respect to gamma, then new gamma - gamma = *L/**L. ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
On Dec 20, 2007 11:43 AM, Jason House [EMAIL PROTECTED] wrote: I seem to have more time to think than to code lately. I believe I've derived an alternate update method. Thinking more, I realize I messed up a three things... For one, Newton-Raphson requires new gamma - gamma = -*L/**L instead of new gamma - gamma = *L/**L Another, 1/sigma^2 = -**L instead of 1/sigma^2 = **L Finally, **L = sum[ (selection probability)^2 ] - wins instead of **L = (estimated wins)^2 - wins Using Newton-Raphson method, I derived fractional change = (estimation error) * (fractional uncertainty)^2 new gamma = gamma * (1 + fractional change) A less intuitive form is: fractional change = (wins-expected wins) / (expected wins^2 - wins) Using Remi's notation, expected wins = gamma*sum(C/E) This becomes: fractional change = ( wins - sum(selection probability) ) / ( wins - sum((selection probability)^2) ) selection probability = gamma*C/E ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
On Dec 20, 2007 5:39 PM, Álvaro Begué [EMAIL PROTECTED] wrote: I was trying to come up with my own algorithm to maximize likelihood and I am having a hard time getting it all in my mind. I managed to write a working algorithm for the case of logistic regression, but it was kind of brittle and I didn't know how to extend it to the softmax case, which is what I wanted. Over lunch I thought of another way of doing it that would be very general and easy to implement. Basically, I can compute the log-likelihood for a particular setting of the weights, and I can compute the gradient and the Jacobian matrix (the derivative with respect to each weight or pair of weights, respectively). Then I can approximate the function I am minimizing using a paraboloid computed from that data, compute the minimum of the paraboloid and take that point to be the new setting of the weights. Has anyone tried something like this? It seems like the natural way to do it to me. My knowledge of the Jacobian matrix is limited. http://en.wikipedia.org/wiki/Jacobian seems to imply that your definition of the Jacobian matrix is unusual. To fit a paraboloid, don't you need the 2nd derivative as well? I'd assume you'd fit a parabola in each direction (for each gamma) and then sum them together into one paraboloid Interestingly enough, I believe assuming a paraboloid is equivalent to assume a linear first derivative (which is what I did). Of course, I have no idea how the Jacobian comes into play, so I assume I'm missing a few things. Is it possible to share a bit more of your final method? Anyway, I would really appreciate if other people that are trying this kind of thing could exchange some test datasets with me to see if I get to similar weights. I think I will make the code available for anyone to use after I get it to be satisfactory enough. I'm not that far yet. I'll share when I have it, although I plan to try and use the same starting data set (games) as Remi. I have yet to code calculation of some of the features. ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
On Dec 20, 2007 10:36 PM, Jason House [EMAIL PROTECTED] wrote: On Dec 20, 2007 5:39 PM, Álvaro Begué [EMAIL PROTECTED] wrote: Over lunch I thought of another way of doing it that would be very general and easy to implement. Basically, I can compute the log-likelihood for a particular setting of the weights, and I can compute the gradient and the Jacobian matrix (the derivative with respect to each weight or pair of weights, respectively). Then I can approximate the function I am minimizing using a paraboloid computed from that data, compute the minimum of the paraboloid and take that point to be the new setting of the weights. Has anyone tried something like this? It seems like the natural way to do it to me. My knowledge of the Jacobian matrix is limited. http://en.wikipedia.org/wiki/Jacobian seems to imply that your definition of the Jacobian matrix is unusual. Ooops! It's been too long since I learned these things. What I meant to say was the Hessian matrix ( http://en.wikipedia.org/wiki/Hessian_matrix ), which does contain the second derivatives. You are right that assuming the derivative changes linearly and using a paraboloid are the exact same thing. So we are probably thinking of the same method, although I think I can code it in very elegantly (maybe slow too?). Álvaro. ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
Jason House wrote: On Dec 6, 2007 11:38 AM, Rémi Coulom [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: Jason House wrote: This may serve as a good test of if there is enough data to assign values to the patterns. I did not mention this in my paper, but you can rather easily estimate uncertainty margins around Elo values. This involves computing the second-order derivative of the probability distribution with respect to log(gamma). Since the distribution has a shape that looks very much like a Gaussian, the second-order derivative at the maximum is a good estimation of -1/sigma². That is how I compute confidence intervals in bayeselo. What do you mean by the probability distribution with respect to log(gamma)? Do you mean a plot of the prediction rate with only the gamma of interest varying? No the prediction rate, but the probability of the training data. More precisely, the logarithm of that probability. If you have P(x)=A*exp(-x²/2sigma²), then log(P(x))=log(A)-x²/2sigma², and d²(log(P(x)))/dx²=-1/sigma². This means that, for a Gaussian probability distribution, the second-order derivative directly gives the variance. For distributions that look similar to a Gaussian, the second-order derivative is a good approximation. Rémi ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
On Dec 6, 2007 11:38 AM, Rémi Coulom [EMAIL PROTECTED] wrote: Jason House wrote: This may serve as a good test of if there is enough data to assign values to the patterns. I did not mention this in my paper, but you can rather easily estimate uncertainty margins around Elo values. This involves computing the second-order derivative of the probability distribution with respect to log(gamma). Since the distribution has a shape that looks very much like a Gaussian, the second-order derivative at the maximum is a good estimation of -1/sigma². That is how I compute confidence intervals in bayeselo. What do you mean by the probability distribution with respect to log(gamma)? Do you mean a plot of the prediction rate with only the gamma of interest varying? ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
On Dec 12, 2007 2:59 PM, Rémi Coulom [EMAIL PROTECTED] wrote: Do you mean a plot of the prediction rate with only the gamma of interest varying? No the prediction rate, but the probability of the training data. More precisely, the logarithm of that probability. I still don't know what you mean by this. If you have P(x)=A*exp(-x²/2sigma²), then log(P(x))=log(A)-x²/2sigma², and d²(log(P(x)))/dx²=-1/sigma². This means that, for a Gaussian probability distribution, the second-order derivative directly gives the variance. For distributions that look similar to a Gaussian, the second-order derivative is a good approximation. This part I understand ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
On Dec 12, 2007 3:05 PM, Jason House [EMAIL PROTECTED] wrote: On Dec 12, 2007 2:59 PM, Rémi Coulom [EMAIL PROTECTED] wrote: Do you mean a plot of the prediction rate with only the gamma of interest varying? No the prediction rate, but the probability of the training data. More precisely, the logarithm of that probability. I still don't know what you mean by this. He probably should use the word likelihood instead of probability. http://en.wikipedia.org/wiki/Likelihood_function If you have P(x)=A*exp(-x²/2sigma²), then log(P(x))=log(A)-x²/2sigma², and d²(log(P(x)))/dx²=-1/sigma². This means that, for a Gaussian probability distribution, the second-order derivative directly gives the variance. For distributions that look similar to a Gaussian, the second-order derivative is a good approximation. This part I understand ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
On Dec 12, 2007 3:09 PM, Álvaro Begué [EMAIL PROTECTED] wrote: On Dec 12, 2007 3:05 PM, Jason House [EMAIL PROTECTED] wrote: On Dec 12, 2007 2:59 PM, Rémi Coulom [EMAIL PROTECTED] wrote: Do you mean a plot of the prediction rate with only the gamma of interest varying? No the prediction rate, but the probability of the training data. More precisely, the logarithm of that probability. I still don't know what you mean by this. He probably should use the word likelihood instead of probability. http://en.wikipedia.org/wiki/Likelihood_function Clearly I'm missing something, because I still don't understand. Let's take a simple example of a move is on the 3rd line and has a gamma value of 1.75. What is the equation or sequence of discrete values that I can take the derivative of? Doing conditional probabilities based on move is on 3rd line and move is selected (AKA pure training data) seems to yield a fixed value rather than something approximating a normal distribution. ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
On Dec 12, 2007 3:31 PM, Jason House [EMAIL PROTECTED] wrote: On Dec 12, 2007 3:09 PM, Álvaro Begué [EMAIL PROTECTED] wrote: On Dec 12, 2007 3:05 PM, Jason House [EMAIL PROTECTED] wrote: On Dec 12, 2007 2:59 PM, Rémi Coulom [EMAIL PROTECTED] wrote: Do you mean a plot of the prediction rate with only the gamma of interest varying? No the prediction rate, but the probability of the training data. More precisely, the logarithm of that probability. I still don't know what you mean by this. He probably should use the word likelihood instead of probability. http://en.wikipedia.org/wiki/Likelihood_function Clearly I'm missing something, because I still don't understand. Let's take a simple example of a move is on the 3rd line and has a gamma value of 1.75. What is the equation or sequence of discrete values that I can take the derivative of? We start with a database of games, and we are trying to find a set of gamma values. For a given set of gamma values, we can compute the probability of all the moves happening exactly as they happened in the database. So if the first move is E4 and we had E4 as having a probability of 0.005, we start with that, then we take the next move, and multiply 0.005 by the probability of the second move, etc. By the end of the database, we'll have some number like 3.523E-9308 which is the probability of all of the moves in the database happening. This is the probability of the database if it had been generated by a random process following the probability distributions modeled by the set gamma values. You can see this as a function of the gamma values. This function is usually called likelihood function. In order to pick the best gammas, we choose the ones with the maximum likelihood. Sometimes we use the logarithm of the likelihood instead, which has the interpretation of being minus the amount of information in the database, plus it's not a number with gazillion 0s after the decimal point. Now, around the point where the maximum likelihood happens, you can try to move one of the gammas and see how much it hurts the likelihood. For some features it will hurt a lot, which means that the value has to be very close to the one you computed, or you'll get a bad model, and for some features it will hurt very little, which means that there are other settings of the value that are sort of equivalent. The second derivative of the likelihood (or the log of the likelihood, I don't think it should matter much), will tell you how narrow a peak you are at. Does that make some sense? ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
Jason House wrote: On Dec 12, 2007 3:09 PM, Álvaro Begué [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: On Dec 12, 2007 3:05 PM, Jason House [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: On Dec 12, 2007 2:59 PM, Rémi Coulom [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: Do you mean a plot of the prediction rate with only the gamma of interest varying? No the prediction rate, but the probability of the training data. More precisely, the logarithm of that probability. I still don't know what you mean by this. He probably should use the word likelihood instead of probability. http://en.wikipedia.org/wiki/Likelihood_function Clearly I'm missing something, because I still don't understand. Let's take a simple example of a move is on the 3rd line and has a gamma value of 1.75. What is the equation or sequence of discrete values that I can take the derivative of? Doing conditional probabilities based on move is on 3rd line and move is selected (AKA pure training data) seems to yield a fixed value rather than something approximating a normal distribution. Consider, in the Elo rating analogy, a player with a win and a loss to player whose gamma is 1. There you have P(gamma)=gamma/(1+gamma)², whose maximum is at gamma = 1. It is that probability that I am talking about. It is the probability that is maximized by the MM algorithm. Rémi ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
On Dec 12, 2007 4:27 PM, Álvaro Begué [EMAIL PROTECTED] wrote: Clearly I'm missing something, because I still don't understand. Let's take a simple example of a move is on the 3rd line and has a gamma value of 1.75. What is the equation or sequence of discrete values that I can take the derivative of? We start with a database of games, and we are trying to find a set of gamma values. For a given set of gamma values, we can compute the probability of all the moves happening exactly as they happened in the database. So if the first move is E4 and we had E4 as having a probability of 0.005, we start with that, then we take the next move, and multiply 0.005 by the probability of the second move, etc. By the end of the database, we'll have some number like 3.523E-9308 which is the probability of all of the moves in the database happening. This is the probability of the database if it had been generated by a random process following the probability distributions modeled by the set gamma values. You can see this as a function of the gamma values. This function is usually called likelihood function. In order to pick the best gammas, we choose the ones with the maximum likelihood. Sometimes we use the logarithm of the likelihood instead, which has the interpretation of being minus the amount of information in the database, plus it's not a number with gazillion 0s after the decimal point. Now, around the point where the maximum likelihood happens, you can try to move one of the gammas and see how much it hurts the likelihood. For some features it will hurt a lot, which means that the value has to be very close to the one you computed, or you'll get a bad model, and for some features it will hurt very little, which means that there are other settings of the value that are sort of equivalent. The second derivative of the likelihood (or the log of the likelihood, I don't think it should matter much), will tell you how narrow a peak you are at. Does that make some sense? It makes perfect sense. Thanks. If I do the math right, using the notations of Remi's paper, d^2 log(L) / (dgamma_i)^2 = sum_j [C_ij^2/(C_ij*gamma_i+D_ij)^2 - A_ij^2/(A_ij*gamma_i+B_ij)^2] = 1 / (sigma for ELO of gamma_i)^2 ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
Thanks for the file! This should be very helpful when I try to reproduce results. It looks like you are not taking advantage of symmetries. For instance, 88|0|17.033168 88|1|12.263955 and 164|0|17.388714 164|1|25.862695 Are identical except for swapping the roles of white and black (88 == 1120 in base 4, 164 == 2210 in base 4), and therefore they should have the same scores. If you do take symmetries of this type into consideration, you'll reduce the number of gamma values you have to compute by a factor of almost 16 (8 symmetries in the D4 group, times 2 colors). This should result in faster learning and better estimates. Álvaro. On Dec 6, 2007 4:28 AM, Lars [EMAIL PROTECTED] wrote: While you can find the gamma-values of non-shape patterns in the paper of Remi, I can give you the list of shape-pattern gamma-values. The source-code is mixed with my engine-code, which i don't like to publish. But the algorithm is more or less easy to implement.. I'll try to attach the file to this mail, if it don't work I send it to your address. The file have the following coding: It's a text-file with one pattern each line. Lines look as: Pattern-ID|Move-right(0=black, 1=white)|gamma-value\n The Pattern-IDs have the following coding: In the binary representation of the ID you have 2 Bits for every field. So you have 18 Bits at all. 00 = Empty, 01 = Black, 02 = White, 03 = Edge The pairs are ordered as: 1 2 3 4 5 6 7 8 9 in the ID it is 123456789 While you can rotate and mirror the patterns only one respresentation of the various possible representations is contained in the file. On Dec 5, 2007 3:17 PM, Lars [EMAIL PROTECTED] wrote: Thank you for your explanations! You were right, it was a bug, and now it works really fine. Are you (or Remi) willing to publish any of the following? 1. Source code for extracting ELO ratings for moves 2. Full list of gamma values, including patterns Either of those would help kick start those of us that are just starting to go down a similar road. ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
Oh, I see you have applied the symmetries, but not the swapping of roles. Still, this should probably be done and cut the number of gamma values in half. Álvaro. On Dec 6, 2007 7:13 AM, Álvaro Begué [EMAIL PROTECTED] wrote: Thanks for the file! This should be very helpful when I try to reproduce results. It looks like you are not taking advantage of symmetries. For instance, 88|0|17.033168 88|1|12.263955 and 164|0|17.388714 164|1|25.862695 Are identical except for swapping the roles of white and black (88 == 1120 in base 4, 164 == 2210 in base 4), and therefore they should have the same scores. If you do take symmetries of this type into consideration, you'll reduce the number of gamma values you have to compute by a factor of almost 16 (8 symmetries in the D4 group, times 2 colors). This should result in faster learning and better estimates. Álvaro. On Dec 6, 2007 4:28 AM, Lars [EMAIL PROTECTED] wrote: While you can find the gamma-values of non-shape patterns in the paper of Remi, I can give you the list of shape-pattern gamma-values. The source-code is mixed with my engine-code, which i don't like to publish. But the algorithm is more or less easy to implement.. I'll try to attach the file to this mail, if it don't work I send it to your address. The file have the following coding: It's a text-file with one pattern each line. Lines look as: Pattern-ID|Move-right(0=black, 1=white)|gamma-value\n The Pattern-IDs have the following coding: In the binary representation of the ID you have 2 Bits for every field. So you have 18 Bits at all. 00 = Empty, 01 = Black, 02 = White, 03 = Edge The pairs are ordered as: 1 2 3 4 5 6 7 8 9 in the ID it is 123456789 While you can rotate and mirror the patterns only one respresentation of the various possible representations is contained in the file. On Dec 5, 2007 3:17 PM, Lars [EMAIL PROTECTED] wrote: Thank you for your explanations! You were right, it was a bug, and now it works really fine. Are you (or Remi) willing to publish any of the following? 1. Source code for extracting ELO ratings for moves 2. Full list of gamma values, including patterns Either of those would help kick start those of us that are just starting to go down a similar road. ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
But then you lose information on player-to-move, right? On Dec 6, 2007 7:18 AM, Álvaro Begué [EMAIL PROTECTED] wrote: Oh, I see you have applied the symmetries, but not the swapping of roles. Still, this should probably be done and cut the number of gamma values in half. Álvaro. On Dec 6, 2007 7:13 AM, Álvaro Begué [EMAIL PROTECTED] wrote: Thanks for the file! This should be very helpful when I try to reproduce results. It looks like you are not taking advantage of symmetries. For instance, 88|0|17.033168 88|1|12.263955 and 164|0|17.388714 164|1|25.862695 Are identical except for swapping the roles of white and black (88 == 1120 in base 4, 164 == 2210 in base 4), and therefore they should have the same scores. If you do take symmetries of this type into consideration, you'll reduce the number of gamma values you have to compute by a factor of almost 16 (8 symmetries in the D4 group, times 2 colors). This should result in faster learning and better estimates. Álvaro. On Dec 6, 2007 4:28 AM, Lars [EMAIL PROTECTED] wrote: While you can find the gamma-values of non-shape patterns in the paper of Remi, I can give you the list of shape-pattern gamma-values. The source-code is mixed with my engine-code, which i don't like to publish. But the algorithm is more or less easy to implement.. I'll try to attach the file to this mail, if it don't work I send it to your address. The file have the following coding: It's a text-file with one pattern each line. Lines look as: Pattern-ID|Move-right(0=black, 1=white)|gamma-value\n The Pattern-IDs have the following coding: In the binary representation of the ID you have 2 Bits for every field. So you have 18 Bits at all. 00 = Empty, 01 = Black, 02 = White, 03 = Edge The pairs are ordered as: 1 2 3 4 5 6 7 8 9 in the ID it is 123456789 While you can rotate and mirror the patterns only one respresentation of the various possible representations is contained in the file. On Dec 5, 2007 3:17 PM, Lars [EMAIL PROTECTED] wrote: Thank you for your explanations! You were right, it was a bug, and now it works really fine. Are you (or Remi) willing to publish any of the following? 1. Source code for extracting ELO ratings for moves 2. Full list of gamma values, including patterns Either of those would help kick start those of us that are just starting to go down a similar road. ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
Oh, I didn't notice at first that the player-to-move was encoded seperatly from the pattern shape. On Dec 6, 2007 9:53 AM, Álvaro Begué [EMAIL PROTECTED] wrote: On Dec 6, 2007 9:31 AM, Chris Fant [EMAIL PROTECTED] wrote: But then you lose information on player-to-move, right? No. What I am saying is that it is as urgent for black to play on . . . . . X X O . as it is for white to play in . . . . . O O X . Precisely one way of implementing what I said is by only storing patterns with black to move. If it is white's turn, make turn the 01s into 10s and the 10s into 01s in the pattern and then do the lookup. Some code like this would do: unsigned switch_players_in_pattern(unsigned x){ return ((x0x2)1) | ((x0x1)1); } Álvaro. On Dec 6, 2007 7:18 AM, Álvaro Begué [EMAIL PROTECTED] wrote: Oh, I see you have applied the symmetries, but not the swapping of roles. Still, this should probably be done and cut the number of gamma values in half. Álvaro. On Dec 6, 2007 7:13 AM, Álvaro Begué [EMAIL PROTECTED] wrote: Thanks for the file! This should be very helpful when I try to reproduce results. It looks like you are not taking advantage of symmetries. For instance, 88|0|17.033168 88|1|12.263955 and 164|0|17.388714 164|1|25.862695 Are identical except for swapping the roles of white and black (88 == 1120 in base 4, 164 == 2210 in base 4), and therefore they should have the same scores. If you do take symmetries of this type into consideration, you'll reduce the number of gamma values you have to compute by a factor of almost 16 (8 symmetries in the D4 group, times 2 colors). This should result in faster learning and better estimates. Álvaro. On Dec 6, 2007 4:28 AM, Lars [EMAIL PROTECTED] wrote: While you can find the gamma-values of non-shape patterns in the paper of Remi, I can give you the list of shape-pattern gamma-values. The source-code is mixed with my engine-code, which i don't like to publish. But the algorithm is more or less easy to implement.. I'll try to attach the file to this mail, if it don't work I send it to your address. The file have the following coding: It's a text-file with one pattern each line. Lines look as: Pattern-ID|Move-right(0=black, 1=white)|gamma-value\n The Pattern-IDs have the following coding: In the binary representation of the ID you have 2 Bits for every field. So you have 18 Bits at all. 00 = Empty, 01 = Black, 02 = White, 03 = Edge The pairs are ordered as: 1 2 3 4 5 6 7 8 9 in the ID it is 123456789 While you can rotate and mirror the patterns only one respresentation of the various possible representations is contained in the file. On Dec 5, 2007 3:17 PM, Lars [EMAIL PROTECTED] wrote: Thank you for your explanations! You were right, it was a bug, and now it works really fine. Are you (or Remi) willing to publish any of the following? 1. Source code for extracting ELO ratings for moves 2. Full list of gamma values, including patterns Either of those would help kick start those of us that are just starting to go down a similar road. ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
On Dec 6, 2007 7:13 AM, Álvaro Begué [EMAIL PROTECTED] wrote: 88|0|17.033168 88|1|12.263955 and 164|0|17.388714 164|1|25.862695 Are identical except for swapping the roles of white and black Curiously, the gamma values in your example are way different 17.033168 vs 25.862595 and 12.263955 vs 17.388714 ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
On Dec 6, 2007 10:13 AM, Álvaro Begué [EMAIL PROTECTED] wrote: On Dec 6, 2007 10:06 AM, Jason House [EMAIL PROTECTED] wrote: On Dec 6, 2007 7:13 AM, Álvaro Begué [EMAIL PROTECTED] wrote: 88|0|17.033168 88|1|12.263955 and 164|0|17.388714 164|1|25.862695 Are identical except for swapping the roles of white and black Curiously, the gamma values in your example are way different 17.033168 vs 25.862595 and 12.263955 vs 17.388714 That was exactly my point. Those should theoretically be identical, which means that the difference comes purely from noise. The games that were used for training probably don't have enough examples of these patterns to get a good estimate of their true strength. This may serve as a good test of if there is enough data to assign values to the patterns. Actually, the gamma values are only determined up to a multiplication of all of them by a constant. Because patterns with white to move and patterns with black to move never compete with each other, they may have drifted in such a way that the discrepancy is not as large as it seems (since both ratios 25.862595/17.033168 and 17.388714/12.263955 are similar). That may be a fluke. Other pairs have a much different ratio 0|0|1.463386 0|1|1.337276 Still, forcing those gamma values to be identical seems like the right thing to do. I agree ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
Lars wrote: I have some questions concernig this paper of Remi: http://remi.coulom.free.fr/Amsterdam2007/MMGoPatterns.pdf 1. Which sense make the prior (Section 3.3 in the paper and where is the application? I understand it the way that you put 2 more competitions to each pattern in the minorization-maximization formula. But how does this produce a probability distribution with mean 0 and standard deviation 302?? This probability distribution is in the scale of Elo ratings. The formula is Elo = 400*log10(gamma). The probability distribution of getting one win and one loss against one player of rating zero has standard deviation 302 (and mean zero). Is the prior only necessary to make sure that every pattern have at least one win? It also has a regularization effect that avoids overfitting. That is to say, even if a pattern has, say, one win and ten losses against a player of rating zero, the prior will make it two wins and eleven losses, thus moving the strength of the pattern a little bit towards zero. It avoids high ratings for patterns that have been seen rarely. 2. I had run the algorithm on 400 games (including handicap-games) from the same game-records source Remi used (Section 3.2), but i used an other month. I concidered only 3x3 shape-patterns and simple non-shape pattern including dist-to-boarder, dist-to-prevMove, atari, capture, extension.. After 1 iterations of the algorithm, I got strenght-parameter values completely different to the results of the paper (Table 1). Most of my parameters (including all the dist-to-boarder parameters!) have values less than 1.0. Does anyone have some explanations on this? After 5 iteration it is even worse. Most of the low values (less than 1.0) gets even lower, the high values even higher. This sounds like a bug. The number of iterations depends on features. I do not remember exactly how many iterations it took for mine. In general 2-3 iterations are enough to get very reasonable values, a dozen to have good convergence. Note that the values of some features are relative and not absolute. For instance, distance to the previous move: if you multiply all of them by a constant value, it will not change the likelihood. Nevertheless, if you have no bug in your implementation, you should get values that look like mine. My suggestion would be to test your code with very small amounts of artificial data. For instance, start by two players A and B, and, say A beats B twice and B beats A once. Gammas should converge to gamma_A = 2 * gamma_B. You can make up similar tests with teams. Rémi ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] ELO Ratings of move pattern
On Dec 5, 2007 4:44 AM, Lars [EMAIL PROTECTED] wrote: 2. I had run the algorithm on 400 games (including handicap-games) from the same game-records source Remi used (Section 3.2), but i used an other month. I concidered only 3x3 shape-patterns and simple non-shape pattern including dist-to-boarder, dist-to-prevMove, atari, capture, extension.. After 1 iterations of the algorithm, I got strenght-parameter values completely different to the results of the paper (Table 1). Most of my parameters (including all the dist-to-boarder parameters!) have values less than 1.0. Does anyone have some explanations on this? After 5 iteration it is even worse. Most of the low values (less than 1.0) gets even lower, the high values even higher. Some of Remi's features will score every move on the board. When this occurs, any scalar multiplication of those feature values will work. This seems to apply to both distance to move features and the MC Owner feature. When I get around to doing this (hopefully soon), I'll probably leave out something to better pin the values down. Essentially, omissions are like forcing a value of 1.0. As far as what the values should be in absolute terms, I can't yet comment :( Anyone of you have similar or other experiences with the algorithm? The holidays are slowing me down. I hope to be ready to compare notes in January. I'm really glad to see someone else doing this. Any lessons learned will be incredibly helpful... ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/