Re: [Computer-go] Kas Cup - results and prizes
Just curious... I understand how you update the counters lock-free, but surely you must have a lock to protect adding a new node to the tree? Does this impact scaling at some point? Are there any other infrequent locks like this?

David

-Original Message-
From: computer-go-boun...@dvandva.org [mailto:computer-go-boun...@dvandva.org] On Behalf Of Petr Baudis
Sent: Friday, August 10, 2012 12:47 PM
To: computer-go@dvandva.org
Subject: Re: [Computer-go] Kas Cup - results and prizes

On Fri, Aug 10, 2012 at 09:26:31AM -0700, David Fotland wrote:
> Because my current approach seems to work just as well (or maybe better), and I haven't had time to code up a shared tree and tune it up to validate that assumption. Chaslot's paper indicates perhaps that not having a shared tree is stronger. My guess is that they are about the same, so it's not worth the effort to change.

In Pachi, having a shared tree makes all the difference when scaling up to more threads. See the graph (really awful one, sorry, it's old!) at http://pachi.or.cz/root-vs-shared.png

If you have some information sharing near the root, I imagine it might be similar to Pachi's distributed engine performance (or just slightly better). But that is still far behind in scaling compared to the shared tree in our experience.

P.S.: There are two important things, virtual loss (not necessarily 1 simulation but possibly more) and mainly lockless updates. The latter also means that sane code should be really easy to modify to use a single shared tree instead of multiple trees.

Petr Pasky Baudis
___
Computer-go mailing list
Computer-go@dvandva.org
http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
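The virtual loss idea mentioned in Petr's P.S. can be sketched as follows. This is only a generic, single-threaded illustration of the bookkeeping, not Pachi's code: the `Node` class, the `VIRTUAL_LOSS` constant, the exploration constant `c` and the UCB1 formula are all assumptions made for the example. A thread that descends into a child temporarily counts extra lost visits on it, so that other threads selecting at the same time are steered toward different children; the virtual loss is removed again when the real playout result is backed up.

```python
import math
from dataclasses import dataclass, field

VIRTUAL_LOSS = 1  # assumed value; the thread notes it may be more than 1


@dataclass
class Node:
    visits: int = 0
    wins: float = 0.0
    virtual_losses: int = 0
    children: dict = field(default_factory=dict)


def ucb_score(parent, child, c=1.4):
    # Virtual losses are counted as visits that contributed no wins.
    n = child.visits + child.virtual_losses
    if n == 0:
        return float("inf")
    parent_n = max(1, parent.visits + parent.virtual_losses)
    return child.wins / n + c * math.sqrt(math.log(parent_n) / n)


def select_child(parent):
    # Pick the best child and mark it as "being explored" right away.
    move, child = max(parent.children.items(),
                      key=lambda kv: ucb_score(parent, kv[1]))
    child.virtual_losses += VIRTUAL_LOSS
    return move, child


def backpropagate(path, won):
    # Replace the virtual loss on each visited node by the real result.
    for node in path:
        node.virtual_losses -= VIRTUAL_LOSS
        node.visits += 1
        node.wins += 1.0 if won else 0.0
```

In a real shared-tree engine the counters would be updated with atomic operations; plain Python integers are used here only to show the order of the updates.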
Re: [Computer-go] Kas Cup - results and prizes
David,

> I understand how you update the counters lock-free, but surely you must have a lock to protect adding a new node to the tree?

We have a single atomic test-and-set instruction to expand a node (add children) only once. It is not a blocking mutex. If a thread finds that the node is not yet expanded but some other thread is already allocating new children, it simply goes on, starting a playout at this point.

> Does this impact scaling at some point?

Maybe above 24 cores, but we couldn't measure this. Pachi scales perfectly up to 24 cores in single-machine mode. See the reference given by Pasky earlier in this thread: Fig. 9 of http://pasky.or.cz/go/pachi-tr.pdf

Jean-loup
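The "expand only once, never block" behaviour described above can be illustrated with a small sketch. Python has no hardware test-and-set primitive, so a `threading.Lock` acquired with `blocking=False` stands in for the atomic flag here; the `Node` class and the `try_expand` and `make_children` names are hypothetical and not taken from Pachi.

```python
import threading


class Node:
    def __init__(self):
        # Used purely as a test-and-set flag: acquired once, never released.
        self._expanding = threading.Lock()
        self.children = None

    def try_expand(self, make_children):
        """Expand this node if no other thread has claimed the expansion.

        Returns True for the single thread that wins the flag. Every other
        caller gets False immediately and can start its playout from here
        instead of waiting.
        """
        if not self._expanding.acquire(blocking=False):
            return False
        self.children = make_children()
        return True
```

A caller that gets `False` does not know whether `children` has been filled in yet, which matches the description above: it simply does not descend further on this pass.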
Re: [Computer-go] Kas Cup - results and prizes
Interesting, I'd have thought it would matter quite a bit, especially with higher numbers of threads.

One thing I found (quite a few years back now already) is that you can optimize a lot by doing the following: when one node has so many more wins than the second best that it can't be overtaken even if the second best wins all of the remaining playouts, abort thinking. With a couple of extensions to this general idea (aborting not just when it's impossible, but also when it's very unlikely to be overtaken) I found that a player that does 64K lightweight simulations using this method spends the same time and plays the same level as one that does a fixed 32K simulations. Roughly. The higher the number of simulations, the bigger the savings.

This type of optimization must be much harder with root-level parallelization, so you'd have to factor that in when comparing methods.

Mark

On Aug 10, 2012, at 9:55 PM, David Fotland fotl...@smart-games.com wrote:

> Not much memory overhead. If you look at your tree you will find that most nodes are only visited one or two times. There is a lot of noise in the fringes of the tree, so there are few duplicates. This also means that not sharing most of the tree has no impact on strength.
>
> David
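Mark's stopping rule can be written down in a few lines. The sketch below is an illustration only: the function name and the `max_share` parameter are assumptions, and `max_share` is just one simple way to model the "very unlikely to be overtaken" extension, which the message does not specify.

```python
def can_stop_early(win_counts, remaining_playouts, max_share=1.0):
    """Return True if the leading move can no longer be caught.

    win_counts: wins recorded so far for each candidate move at the root.
    remaining_playouts: playouts still left in the budget for this move.
    max_share: assumed upper bound on the fraction of the remaining
        playouts the runner-up could win. 1.0 is the strict rule
        ("wins all of the remaining playouts"); smaller values give
        a more aggressive, heuristic cutoff.
    """
    if len(win_counts) < 2:
        return True
    best, second = sorted(win_counts, reverse=True)[:2]
    # Strict inequality: a possible tie is not treated as decided.
    return best - second > remaining_playouts * max_share
```

For example, with 600 wins for the best move, 300 for the runner-up and 200 playouts left, the search can stop; with 300 playouts left the runner-up could still draw level, so the strict rule keeps searching.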
Re: [Computer-go] Kas Cup - results and prizes
Sorry, that's not right of course. The 64K version spends on average the same time, but plays 100 Elo stronger. Otherwise there would be no point :)

Mark

On Aug 12, 2012, at 6:07 PM, Mark Boon tesujisoftw...@gmail.com wrote:

> [...] I found that a player that does 64K lightweight simulations using this method spends the same time and plays the same level as one that does a fixed 32K simulations. [...]
Re: [Computer-go] Kas Cup - results and prizes
I've been using this "abort early to save time" idea almost since the beginning. It works fine with root parallelization.

David

-Original Message-
From: computer-go-boun...@dvandva.org [mailto:computer-go-boun...@dvandva.org] On Behalf Of Mark Boon
Sent: Sunday, August 12, 2012 9:07 PM
To: computer-go@dvandva.org
Subject: Re: [Computer-go] Kas Cup - results and prizes

[...] when one node has so many more wins than the second best that it can't be overtaken even if the second best wins all of the remaining playouts, abort thinking. [...] This type of optimization must be much harder with root-level parallelization, so you'd have to factor that in when comparing methods.

Mark
Re: [Computer-go] Kas Cup - results and prizes
I'm happy with MFGO's scaling. I'm running a scaling test now: 4 threads vs. 8 threads, fixed 32K total playouts per move, 19x19, no pondering. Ideally the win rate should be 50%, since the total playouts are the same.

Has anyone tried this kind of scaling experiment, and is willing to share results?

David

-Original Message-
From: computer-go-boun...@dvandva.org [mailto:computer-go-boun...@dvandva.org] On Behalf Of Petr Baudis
Sent: Friday, August 10, 2012 12:47 PM
To: computer-go@dvandva.org
Subject: Re: [Computer-go] Kas Cup - results and prizes

In Pachi, having a shared tree makes all the difference when scaling up to more threads. [...]
Re: [Computer-go] Kas Cup - results and prizes
Yes, root parallelization with some sharing. http://www.personeel.unimaas.nl/G-Chaslot/papers/parallelMCTS.pdf said it was good and I tried it and it works well.

Hardware is really important. But so are really smart playouts. The slower I make my playouts the stronger the program gets.

David

-Original Message-
From: computer-go-boun...@dvandva.org [mailto:computer-go-boun...@dvandva.org] On Behalf Of Peter Drake
Sent: Friday, August 10, 2012 10:45 AM
To: computer-go@dvandva.org
Subject: Re: [Computer-go] Kas Cup - results and prizes

On Thu, Aug 9, 2012 at 11:42 PM, David Fotland fotl...@smart-games.com wrote:
> Or it might be an artifact of the way I do search, since I think I might be the only engine that doesn't use a single shared tree, and the old Many Faces of Go engine is single threaded.

If not a single shared tree, what are you doing? Root parallelism?

I've been wondering why other programs are pulling ahead of Orego, and now I'm starting to suspect the answer may be (in part) hardware.

--
Peter Drake
https://sites.google.com/a/lclark.edu/drake/
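For readers unfamiliar with the term, plain root parallelization can be sketched as below: every thread searches its own private tree, and only the statistics of the root's children are combined when a move has to be chosen. This is a generic illustration and not Many Faces of Go's implementation; in particular the "some sharing" David mentions is not described in the thread and is not modelled. The function names and the move labels in the usage note are made up for the example.

```python
from collections import Counter


def merge_root_visits(per_thread_visits):
    """Sum the root-level visit counts reported by each independent tree.

    per_thread_visits: one mapping per thread, from move to visit count.
    """
    total = Counter()
    for visits in per_thread_visits:
        total.update(visits)  # Counter.update adds counts together
    return total


def best_move(per_thread_visits):
    """Choose the move with the most visits summed over all trees."""
    total = merge_root_visits(per_thread_visits)
    return max(total, key=total.get)
```

With three hypothetical threads reporting `{"D4": 500, "Q16": 300}`, `{"D4": 200, "Q16": 450}` and `{"Q16": 100, "C3": 50}`, the merged counts favour Q16 even though one individual tree preferred D4.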
Re: [Computer-go] Kas Cup - results and prizes
Not much memory overhead. If you look at your tree you will find that most nodes are only visited one or two times. There is a lot of noise in the fringes of the tree, so there are few duplicates. This also means that not sharing most of the tree has no impact on strength.

David

-Original Message-
From: computer-go-boun...@dvandva.org [mailto:computer-go-boun...@dvandva.org] On Behalf Of Michael Williams
Sent: Friday, August 10, 2012 9:42 AM
To: computer-go@dvandva.org
Subject: Re: [Computer-go] Kas Cup - results and prizes

I imagine you can get around the lack of implicit information sharing that you get with a shared tree by explicitly sharing information near the root. But doesn't having separate trees mean a large memory overhead due to duplicate nodes?

On Fri, Aug 10, 2012 at 9:26 AM, David Fotland fotl...@smart-games.com wrote:
> Because my current approach seems to work just as well (or maybe better), and I haven't had time to code up a shared tree and tune it up to validate that assumption. Chaslot's paper indicates perhaps that not having a shared tree is stronger. My guess is that they are about the same, so it's not worth the effort to change.
>
> david
>
> -Original Message-
> From: computer-go-boun...@dvandva.org [mailto:computer-go-boun...@dvandva.org] On Behalf Of Michael Williams
> Sent: Friday, August 10, 2012 12:06 AM
> To: computer-go@dvandva.org
> Subject: Re: [Computer-go] Kas Cup - results and prizes
>
> Why don't you use a shared tree?
>
> On Thu, Aug 9, 2012 at 11:49 PM, David Fotland fotl...@smart-games.com wrote:
>> On an i7-2600 Many Faces does 11.4K pps with 4 threads, and 18.7K with 8 threads, a 64% increase, so the 2600 scales a little better than the 3770, but the 3770 is still a little bit faster.
>>
>> david
>>
>> From: computer-go-boun...@dvandva.org [mailto:computer-go-boun...@dvandva.org] On Behalf Of Erik van der Werf
>> Sent: Thursday, August 09, 2012 4:41 AM
>> To: computer-go@dvandva.org
>> Subject: Re: [Computer-go] Kas Cup - results and prizes
>>
>> I don't have an i7-2600, but I could run oakfoam on the 3930. I just downloaded it and it does compile. If you give me a list of gtp commands to run the benchmark, then I will send you the output back.
>>
>> Erik
>>
>> On Thu, Aug 9, 2012 at 12:38 PM, ds d...@physik.de wrote:
>>> This is very interesting; I get no more than 10% with oakfoam on an i7-2600K. It would be interesting to know whether it is the processor, or whether your code e.g. accesses memory more often instead of cache... Do you have the chance to run your program on an i7-2600? Or do you have too much time and could try https://bitbucket.org/francoisvn/oakfoam/wiki/Home on your i7-3930? If so, I would be very much interested in the number you get at the beginning of a 19x19 game without book :)
>>>
>>> Detlef
>>>
>>> On Thursday, 09.08.2012, at 12:16 +0200, Erik van der Werf wrote:
>>>> On Thu, Aug 9, 2012 at 11:14 AM, Petr Baudis pa...@ucw.cz wrote:
>>>>> On Wed, Aug 08, 2012 at 09:08:47PM +0200, ds wrote:
>>>>>> Hyperthreading does the trick, I have the experience it increases the performance by about 10%. I think this is due to waiting for RAM I/O or things like that
>>>>>
>>>>> Yes. With hyperthreading, performance per thread goes down significantly, but total performance goes up by about 15%. In the Pentium 4 era, hyperthreading did not usually pay off, but with i7, its performance is much better. The basic idea is that there are two instruction pipelines that share the same ALU and other processor units; if one of the pipelines stalls (usually due to memory fetch), the other can use the ALU in the meantime, or the two threads may use different parts of the CPU altogether based on what the instructions do.
>>>>
>>>> 10-15%, really, that low? For my program (on an i7-3930K, going from 6 to 12 threads) it is more in the order of 40% extra simulations per second.
>>>>
>>>> Erik
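The throughput figures quoted above (11.4K playouts per second with 4 threads, 18.7K with 8 threads on the i7-2600) can be checked with a little arithmetic, which also shows the per-thread slowdown Petr describes for hyperthreading. The `scaling` helper is just a name chosen for this sketch.

```python
def scaling(pps_few, threads_few, pps_many, threads_many):
    """Return (total throughput gain, per-thread throughput change) as fractions."""
    total_gain = pps_many / pps_few - 1.0
    per_thread_change = (pps_many / threads_many) / (pps_few / threads_few) - 1.0
    return total_gain, per_thread_change


# Figures reported for Many Faces of Go on an i7-2600 in this thread.
total, per_thread = scaling(11_400, 4, 18_700, 8)
print(f"total: {total:+.0%}, per thread: {per_thread:+.0%}")
# prints "total: +64%, per thread: -18%"
```

So each of the eight hyperthreads runs about 18% slower than one of the four threads did, while the machine as a whole does about 64% more playouts per second.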
Re: [Computer-go] Kas Cup - results and prizes
On Sat, Aug 11, 2012 at 12:46:19AM -0700, David Fotland wrote:
> I'm happy with MFGO's scaling. I'm running a scaling test now, 4 threads vs 8 threads, fixed 32K total playouts per move, 19x19, no pondering. Ideally the win rate should be 50%, since the total playouts are the same. Has anyone tried this kind of scaling experiment, and is willing to share results?

With Pachi, the winrate in this scenario would be 50%; our thread scaling incurs basically no strength loss compared to sequential playouts. This is visible in the Lousy Graph http://pachi.or.cz/root-vs-shared.png as the second triplet of bars (labelled Sequential) vs. the last one, and in Fig. 9 of http://pasky.or.cz/go/pachi-tr.pdf (see text for detailed description of the graph).

Petr Pasky Baudis
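When running a fixed-playout scaling test like the one described above, it helps to know how far from 50% a measured win rate must be before the difference means anything. The sketch below uses a simple normal-approximation interval; the function names are hypothetical and the game counts in the usage note are invented for illustration, not results from this thread.

```python
import math


def winrate_interval(wins, games, z=1.96):
    """Approximate 95% confidence interval for a win rate (normal approximation)."""
    p = wins / games
    half_width = z * math.sqrt(p * (1.0 - p) / games)
    return p - half_width, p + half_width


def consistent_with_even(wins, games):
    """True if a 50% win rate lies inside the approximate 95% interval."""
    low, high = winrate_interval(wins, games)
    return low <= 0.5 <= high
```

For instance, a made-up result of 260 wins in 500 games (52%) is still consistent with equal strength, whereas 300 wins in 500 games (60%) is not.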
Re: [Computer-go] Kas Cup - results and prizes
On Sat, Aug 11, 2012 at 12:52:12AM -0700, David Fotland wrote:
> Yes, root parallelization with some sharing. http://www.personeel.unimaas.nl/G-Chaslot/papers/parallelMCTS.pdf said it was good and I tried it and it works well.

The paper is not so relevant now, since the standard method of most programs is lockless tree parallelization, which is not covered. The locking overhead is quite significant, I'd expect, as locking instructions can AFAIK take hundreds of cycles.

That said, root parallelization outperforming sequential simulations is something I never managed to reproduce and that seems rather surprising to me. It might have something to do with the way priors are done in the tree or some other engine-specific factors.

Petr Pasky Baudis
Re: [Computer-go] Kas Cup - results and prizes
On Aug 11, 2012, at 10:59 AM, Petr Baudis pa...@ucw.cz wrote:
> That said, root parallelization outperforming sequential simulations is something I never managed to reproduce and that seems rather surprising to me. It might have something to do with the way priors are done in the tree or some other engine-specific factors.

When I saw that result, I immediately concluded that the engine locked into specific lines of play prematurely. I can imagine a few exceptional cases, but I don't think root parallelism should ever outperform a single thread doing the equivalent number of simulations.
Re: [Computer-go] Kas Cup - results and prizes
Petr Baudis: 20120811145900.gv19...@machine.or.cz:
> On Sat, Aug 11, 2012 at 12:52:12AM -0700, David Fotland wrote:
>> Yes, root parallelization with some sharing. http://www.personeel.unimaas.nl/G-Chaslot/papers/parallelMCTS.pdf said it was good and I tried it and it works well.
>
> The paper is not so relevant now, since the standard method of most programs is lockless tree parallelization, which is not covered. The locking overhead is quite significant, I'd expect, as locking instructions can AFAIK take hundreds of cycles.

With spin-lock or hardware test-and-set instructions, locking overhead is very small.

> That said, root parallelization outperforming sequential simulations is something I never managed to reproduce and that seems rather surprising to me. It might have something to do with the way priors are done in the tree or some other engine-specific factors.

I believe the IBM Power processor's architecture may have caused the super-linear acceleration.

Hideki
--
Hideki Kato mailto:hideki_ka...@ybb.ne.jp
Re: [Computer-go] Kas Cup - results and prizes
I wonder if spin-lock hurts hyperthreading.

On Sat, Aug 11, 2012 at 7:22 PM, Hideki Kato hideki_ka...@ybb.ne.jp wrote:
> [...] With spin-lock or hardware test-and-set instructions, locking overhead is very small. [...]
Re: [Computer-go] Kas Cup - results and prizes
Michael Williams: CAB0EdYWgs=gjsnt9rmkj8-utlg72nyxnafvd-8iq_pjjf_j...@mail.gmail.com:
> I wonder if spin-lock hurts hyperthreading.

Why do you think so? If a spin-lock accesses memory and waits, another thread simply runs. That's all.

Hideki
--
Hideki Kato mailto:hideki_ka...@ybb.ne.jp
Re: [Computer-go] Kas Cup - results and prizes
Because two hyperthreads some of the same hardware. And some of that hardware is required to do the spinning. Just a thought. Found this with a quick search: http://archives.postgresql.org/pgsql-patches/2003-12/msg00345.php On Sat, Aug 11, 2012 at 8:45 PM, Hideki Kato hideki_ka...@ybb.ne.jp wrote: Michael Williams: CAB0EdYWgs=gjsnt9rmkj8-utlg72nyxnafvd-8iq_pjjf_j...@mail.gmail.com: I wonder if spin-lock hurts hyperthreading. Why do you think so? If a spin-lock accesses memory and waits, simply another thread runs. That's all. Hideki On Sat, Aug 11, 2012 at 7:22 PM, Hideki Kato hideki_ka...@ybb.ne.jp wrote: Petr Baudis: 20120811145900.gv19...@machine.or.cz: On Sat, Aug 11, 2012 at 12:52:12AM -0700, David Fotland wrote: Yes, root parallelization with some sharing. http://www.personeel.unimaas.nl/G-Chaslot/papers/parallelMCTS.pdf said it was good and I tried it and it works well. The paper is not so relevant now, since the standard method of most programs is lockless tree parallelization, which is not covered. The locking overhead is quite significant, I'd expect, as locking instructions can AFAIK take hundreds of cycles. With spin-lock or hardware test-and-set instructions, locking overhead is very small. That said, root parallelization overperforming sequential simulations is something I never managed to reproduce and that seems rather surprising to me. It might have something to do with the way priors are done in the tree or some other engine-specific factors. I believe IBM Power processor's architecture may caused the super-linear acceralaton. 
Re: [Computer-go] Kas Cup - results and prizes
two hyperthreads SHARE some of the same hardware On Sat, Aug 11, 2012 at 9:34 PM, Michael Williams michaelwilliam...@gmail.com wrote: Because two hyperthreads some of the same hardware. And some of that hardware is required to do the spinning. Just a thought. Found this with a quick search: http://archives.postgresql.org/pgsql-patches/2003-12/msg00345.php On Sat, Aug 11, 2012 at 8:45 PM, Hideki Kato hideki_ka...@ybb.ne.jp wrote: Michael Williams: CAB0EdYWgs=gjsnt9rmkj8-utlg72nyxnafvd-8iq_pjjf_j...@mail.gmail.com: I wonder if spin-lock hurts hyperthreading. Why do you think so? If a spin-lock accesses memory and waits, simply another thread runs. That's all. Hideki On Sat, Aug 11, 2012 at 7:22 PM, Hideki Kato hideki_ka...@ybb.ne.jp wrote: Petr Baudis: 20120811145900.gv19...@machine.or.cz: On Sat, Aug 11, 2012 at 12:52:12AM -0700, David Fotland wrote: Yes, root parallelization with some sharing. http://www.personeel.unimaas.nl/G-Chaslot/papers/parallelMCTS.pdf said it was good and I tried it and it works well. The paper is not so relevant now, since the standard method of most programs is lockless tree parallelization, which is not covered. The locking overhead is quite significant, I'd expect, as locking instructions can AFAIK take hundreds of cycles. With spin-lock or hardware test-and-set instructions, locking overhead is very small. That said, root parallelization overperforming sequential simulations is something I never managed to reproduce and that seems rather surprising to me. It might have something to do with the way priors are done in the tree or some other engine-specific factors. I believe IBM Power processor's architecture may caused the super-linear acceralaton. 
Re: [Computer-go] Kas Cup - results and prizes
Michael Williams: cab0edyxgbi9-+su8dh1trfegr6kw4yku7plhgbvzyi-yjww...@mail.gmail.com: Because two hyperthreads some of the same hardware. And some of that hardware is required to do the spinning. Just a thought. Found this with a quick search: http://archives.postgresql.org/pgsql-patches/2003-12/msg00345.php I believe it's too old and does not apply to modern hyperthreading. Hideki On Sat, Aug 11, 2012 at 8:45 PM, Hideki Kato hideki_ka...@ybb.ne.jp wrote: Michael Williams: CAB0EdYWgs=gjsnt9rmkj8-utlg72nyxnafvd-8iq_pjjf_j...@mail.gmail.com: I wonder if spin-lock hurts hyperthreading. Why do you think so? If a spin-lock accesses memory and waits, another thread simply runs. That's all. Hideki On Sat, Aug 11, 2012 at 7:22 PM, Hideki Kato hideki_ka...@ybb.ne.jp wrote: Petr Baudis: 20120811145900.gv19...@machine.or.cz: On Sat, Aug 11, 2012 at 12:52:12AM -0700, David Fotland wrote: Yes, root parallelization with some sharing. http://www.personeel.unimaas.nl/G-Chaslot/papers/parallelMCTS.pdf said it was good and I tried it and it works well. The paper is not so relevant now, since the standard method of most programs is lockless tree parallelization, which is not covered. The locking overhead is quite significant, I'd expect, as locking instructions can AFAIK take hundreds of cycles. With spin-lock or hardware test-and-set instructions, locking overhead is very small. That said, root parallelization outperforming sequential simulations is something I never managed to reproduce and that seems rather surprising to me. It might have something to do with the way priors are done in the tree or some other engine-specific factors. I believe the IBM Power processor's architecture may have caused the super-linear acceleration.
Re: [Computer-go] Kas Cup - results and prizes
On my core i7-3770, 4 threads is 12.5K playouts/sec (19x19, average of first four moves by white), and 8 threads is 19.8K, a 58% increase. This is much higher than I expected. It seems Intel has improved hyperthreading since the last time I tried it. Or it might be an artifact of the way I do search, since I think I might be the only engine that doesn't use a single shared tree, and the old Many Faces of Go engine is single threaded. David -Original Message- From: computer-go-boun...@dvandva.org [mailto:computer-go- boun...@dvandva.org] On Behalf Of Petr Baudis Sent: Thursday, August 09, 2012 6:23 AM To: computer-go@dvandva.org Subject: Re: [Computer-go] Kas Cup - results and prizes On Thu, Aug 09, 2012 at 08:09:55PM +0900, Hideki Kato wrote: Erik van der Werf: CAKkgGrM83_HsQ5Z2HJupkj=gDeh3+4GM- jmlvevtjroufqn...@mail.gmail.com: 10-15%, really, that low? For my program (on an i7-3930K, going from 6 to 12 threads) it is more in the order of 40% extra simulations per second. In general that number highly depends on the code, architecture of the processor (Intel's are usually better than AMD's), memory speed, cache size, use of ALUs, etc. For Zen, the number is also about 40% on both an i7 3930K (6 to 12 threads) and an i7 920 (4 to 8 threads). For Zen, I'm not surprised, since I assume that in simulations, you are matching some larger patterns which involves a lot of time-consuming hash table lookups which is ideal for hyperthreading. Not sure about stv. I think it matters a lot on whether you are matching patterns by explicit test code snippets or by a hash table. I measured the hyperthreading effect about 2 years ago with a lot older Pachi version. I think today, the hyperthreading effect would also be higher, but I cannot test it right now. Pasky, modern processors are much more complicated :). There are more than two sets of general registers, which are used not only for hyperthreading but also register renaming, for example. 
Sure, I just tried to sketch a rough explanation. I did not know that hyperthreading could reduce opportunity for register renaming, though. Petr Pasky Baudis
Re: [Computer-go] Kas Cup - results and prizes
On an i7-2600, Many Faces does 11.4K playouts/sec with 4 threads and 18.7K with 8 threads, a 64% increase, so the 2600 scales a little better than the 3770, but the 3770 is still a little bit faster. david From: computer-go-boun...@dvandva.org [mailto:computer-go-boun...@dvandva.org] On Behalf Of Erik van der Werf Sent: Thursday, August 09, 2012 4:41 AM To: computer-go@dvandva.org Subject: Re: [Computer-go] Kas Cup - results and prizes I don't have an i7-2600, but I could run oakfoam on the 3930. I just downloaded it and it does compile. If you give me a list of gtp commands to run the benchmark, then I will send you the output back. Erik On Thu, Aug 9, 2012 at 12:38 PM, ds d...@physik.de wrote: This is very interesting, I get no more than 10% with oakfoam on an i7-2600K. It would be interesting to know whether it is the processor or whether you, e.g., access memory instead of cache more often due to your code... Do you have the chance to run your program on an i7-2600? Or do you have too much time and could try https://bitbucket.org/francoisvn/oakfoam/wiki/Home on your i7-3930? If so, I would be very much interested in the number you get at the beginning of a 19x19 game without book :) Detlef On Thursday, 09.08.2012, at 12:16 +0200, Erik van der Werf wrote: On Thu, Aug 9, 2012 at 11:14 AM, Petr Baudis pa...@ucw.cz wrote: On Wed, Aug 08, 2012 at 09:08:47PM +0200, ds wrote: Hyperthreading does the trick, in my experience it increases the performance by about 10%. I think this is due to waiting for RAM I/O or things like that Yes. With hyperthreading, performance per thread goes down significantly, but total performance goes up by about 15%. In the Pentium 4 era, hyperthreading did not usually pay off, but with i7, its performance is much better.
The basic idea is that there are two instruction pipelines that share the same ALU and other processor units; if one of the pipelines stalls (usually due to memory fetch), the other can use the ALU in the meantime, or the two threads may use different parts of the CPU altogether based on what the instructions do. 10-15%, really, that low? For my program (on an i7-3930K, going from 6 to 12 threads) it is more in the order of 40% extra simulations per second. Erik
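The percentages quoted in this thread can be checked with a few lines of arithmetic. The sketch below uses the playout rates David reported for the i7-3770 (12.5K and 19.8K) and the i7-2600 (11.4K and 18.7K); the function names are made up for illustration and do not come from any engine.

```python
def total_gain(base_rate, ht_rate):
    """Relative increase in total playouts/sec when hyperthreading is used."""
    return ht_rate / base_rate - 1.0


def per_thread_change(base_rate, base_threads, ht_rate, ht_threads):
    """Relative change in playouts/sec per thread (negative = slower threads)."""
    return (ht_rate / ht_threads) / (base_rate / base_threads) - 1.0


# (rate at 4 threads, 4, rate at 8 threads, 8), in thousands of playouts/sec,
# as reported in the thread.
machines = {
    "i7-3770": (12.5, 4, 19.8, 8),
    "i7-2600": (11.4, 4, 18.7, 8),
}

for name, (base, base_threads, ht, ht_threads) in machines.items():
    print(f"{name}: total {total_gain(base, ht):+.0%}, "
          f"per thread {per_thread_change(base, base_threads, ht, ht_threads):+.0%}")
```

On these numbers each individual thread gets roughly a fifth slower while the total rate rises by about 60%, which matches the description above of per-thread performance going down while total performance goes up.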
Re: [Computer-go] Kas Cup - results and prizes
Why don't you use a shared tree? On Thu, Aug 9, 2012 at 11:49 PM, David Fotland fotl...@smart-games.com wrote: On an i7-2600, Many Faces does 11.4K playouts/sec with 4 threads and 18.7K with 8 threads, a 64% increase, so the 2600 scales a little better than the 3770, but the 3770 is still a little bit faster. david From: computer-go-boun...@dvandva.org [mailto:computer-go-boun...@dvandva.org] On Behalf Of Erik van der Werf Sent: Thursday, August 09, 2012 4:41 AM To: computer-go@dvandva.org Subject: Re: [Computer-go] Kas Cup - results and prizes I don't have an i7-2600, but I could run oakfoam on the 3930. I just downloaded it and it does compile. If you give me a list of gtp commands to run the benchmark, then I will send you the output back. Erik On Thu, Aug 9, 2012 at 12:38 PM, ds d...@physik.de wrote: This is very interesting, I get no more than 10% with oakfoam on an i7-2600K. It would be interesting to know whether it is the processor or whether you, e.g., access memory instead of cache more often due to your code... Do you have the chance to run your program on an i7-2600? Or do you have too much time and could try https://bitbucket.org/francoisvn/oakfoam/wiki/Home on your i7-3930? If so, I would be very much interested in the number you get at the beginning of a 19x19 game without book :) Detlef On Thursday, 09.08.2012, at 12:16 +0200, Erik van der Werf wrote: On Thu, Aug 9, 2012 at 11:14 AM, Petr Baudis pa...@ucw.cz wrote: On Wed, Aug 08, 2012 at 09:08:47PM +0200, ds wrote: Hyperthreading does the trick, in my experience it increases the performance by about 10%. I think this is due to waiting for RAM I/O or things like that Yes. With hyperthreading, performance per thread goes down significantly, but total performance goes up by about 15%. In the Pentium 4 era, hyperthreading did not usually pay off, but with i7, its performance is much better.
The basic idea is that there are two instruction pipelines that share the same ALU and other processor units; if one of the pipelines stalls (usually due to memory fetch), the other can use the ALU in the meantime, or the two threads may use different parts of the CPU altogether based on what the instructions do. 10-15%, really, that low? For my program (on an i7-3930K, going from 6 to 12 threads) it is more in the order of 40% extra simulations per second. Erik
Re: [Computer-go] Kas Cup - results and prizes
Because my current approach seems to work just as well (or maybe better), and I haven't had time to code up a shared tree and tune it up to validate that assumption. Chaslot's paper indicates perhaps that not having a shared tree is stronger. My guess is that they are about the same, so it's not worth the effort to change. david -Original Message- From: computer-go-boun...@dvandva.org [mailto:computer-go-boun...@dvandva.org] On Behalf Of Michael Williams Sent: Friday, August 10, 2012 12:06 AM To: computer-go@dvandva.org Subject: Re: [Computer-go] Kas Cup - results and prizes Why don't you use a shared tree? On Thu, Aug 9, 2012 at 11:49 PM, David Fotland fotl...@smart-games.com wrote: On an i7-2600, Many Faces does 11.4K playouts/sec with 4 threads and 18.7K with 8 threads, a 64% increase, so the 2600 scales a little better than the 3770, but the 3770 is still a little bit faster. david From: computer-go-boun...@dvandva.org [mailto:computer-go-boun...@dvandva.org] On Behalf Of Erik van der Werf Sent: Thursday, August 09, 2012 4:41 AM To: computer-go@dvandva.org Subject: Re: [Computer-go] Kas Cup - results and prizes I don't have an i7-2600, but I could run oakfoam on the 3930. I just downloaded it and it does compile. If you give me a list of gtp commands to run the benchmark, then I will send you the output back. Erik On Thu, Aug 9, 2012 at 12:38 PM, ds d...@physik.de wrote: This is very interesting, I get no more than 10% with oakfoam on an i7-2600K. It would be interesting to know whether it is the processor or whether you, e.g., access memory instead of cache more often due to your code... Do you have the chance to run your program on an i7-2600? Or do you have too much time and could try https://bitbucket.org/francoisvn/oakfoam/wiki/Home on your i7-3930?
If so, I would be very much interested in the number you get at the beginning of a 19x19 game without book :) Detlef On Thursday, 09.08.2012, at 12:16 +0200, Erik van der Werf wrote: On Thu, Aug 9, 2012 at 11:14 AM, Petr Baudis pa...@ucw.cz wrote: On Wed, Aug 08, 2012 at 09:08:47PM +0200, ds wrote: Hyperthreading does the trick, in my experience it increases the performance by about 10%. I think this is due to waiting for RAM I/O or things like that Yes. With hyperthreading, performance per thread goes down significantly, but total performance goes up by about 15%. In the Pentium 4 era, hyperthreading did not usually pay off, but with i7, its performance is much better. The basic idea is that there are two instruction pipelines that share the same ALU and other processor units; if one of the pipelines stalls (usually due to memory fetch), the other can use the ALU in the meantime, or the two threads may use different parts of the CPU altogether based on what the instructions do. 10-15%, really, that low? For my program (on an i7-3930K, going from 6 to 12 threads) it is more in the order of 40% extra simulations per second. Erik
Re: [Computer-go] Kas Cup - results and prizes
I imagine you can get around the lack of implicit information sharing that you get with a shared tree by explicitly sharing information near the root. But doesn't having separate trees mean a large memory overhead due to duplicate nodes? On Fri, Aug 10, 2012 at 9:26 AM, David Fotland fotl...@smart-games.com wrote: Because my current approach seems to work just as well (or maybe better), and I haven't had time to code up a shared tree and tune it up to validate that assumption. Chaslot's paper indicates perhaps that not having a shared tree is stronger. My guess is that they are about the same, so it's not worth the effort to change. david -Original Message- From: computer-go-boun...@dvandva.org [mailto:computer-go-boun...@dvandva.org] On Behalf Of Michael Williams Sent: Friday, August 10, 2012 12:06 AM To: computer-go@dvandva.org Subject: Re: [Computer-go] Kas Cup - results and prizes Why don't you use a shared tree? On Thu, Aug 9, 2012 at 11:49 PM, David Fotland fotl...@smart-games.com wrote: On an i7-2600, Many Faces does 11.4K playouts/sec with 4 threads and 18.7K with 8 threads, a 64% increase, so the 2600 scales a little better than the 3770, but the 3770 is still a little bit faster. david From: computer-go-boun...@dvandva.org [mailto:computer-go-boun...@dvandva.org] On Behalf Of Erik van der Werf Sent: Thursday, August 09, 2012 4:41 AM To: computer-go@dvandva.org Subject: Re: [Computer-go] Kas Cup - results and prizes I don't have an i7-2600, but I could run oakfoam on the 3930. I just downloaded it and it does compile. If you give me a list of gtp commands to run the benchmark, then I will send you the output back. Erik On Thu, Aug 9, 2012 at 12:38 PM, ds d...@physik.de wrote: This is very interesting, I get no more than 10% with oakfoam on an i7-2600K. It would be interesting to know whether it is the processor or whether you, e.g., access memory instead of cache more often due to your code... Do you have the chance to run your program on an i7-2600?
Or do you have too much time and could try https://bitbucket.org/francoisvn/oakfoam/wiki/Home on your i7-3930? If so, I would be very much interested in the number you get at the beginning of a 19x19 game without book :) Detlef On Thursday, 09.08.2012, at 12:16 +0200, Erik van der Werf wrote: On Thu, Aug 9, 2012 at 11:14 AM, Petr Baudis pa...@ucw.cz wrote: On Wed, Aug 08, 2012 at 09:08:47PM +0200, ds wrote: Hyperthreading does the trick, in my experience it increases the performance by about 10%. I think this is due to waiting for RAM I/O or things like that Yes. With hyperthreading, performance per thread goes down significantly, but total performance goes up by about 15%. In the Pentium 4 era, hyperthreading did not usually pay off, but with i7, its performance is much better. The basic idea is that there are two instruction pipelines that share the same ALU and other processor units; if one of the pipelines stalls (usually due to memory fetch), the other can use the ALU in the meantime, or the two threads may use different parts of the CPU altogether based on what the instructions do. 10-15%, really, that low? For my program (on an i7-3930K, going from 6 to 12 threads) it is more in the order of 40% extra simulations per second. Erik
Re: [Computer-go] Kas Cup - results and prizes
On Fri, Aug 10, 2012 at 09:26:31AM -0700, David Fotland wrote: Because my current approach seems to work just as well (or maybe better), and I haven't had time to code up a shared tree and tune it up to validate that assumption. Chaslot's paper indicates perhaps that not having a shared tree is stronger. My guess is that they are about the same, so it's not worth the effort to change. In Pachi, having a shared tree makes all the difference when scaling up to more threads. See the graph (really awful one, sorry, it's old!) at http://pachi.or.cz/root-vs-shared.png If you have some information sharing near the root, I imagine it might be similar to Pachi's distributed engine performance (or just slightly better). But that is still far behind in scaling compared to the shared tree in our experience. P.S.: There are two important things, virtual loss (not necessarily 1 simulation but possibly more) and mainly lockless updates. The latter also means that sane code should be really easy to modify to use a single shared tree instead of multiple trees. Petr Pasky Baudis
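To illustrate the virtual loss idea from the P.S., here is a minimal single-threaded Python sketch. Everything in it is hypothetical and not taken from Pachi: the `Node` class, `select_child`, `backup`, the UCB1 constant, and the `VIRTUAL_LOSS` value of 2 are assumptions for illustration. A real engine would apply these counter updates with atomic operations from several threads; the sketch only simulates two descents in sequence to show the effect on selection.

```python
import math

VIRTUAL_LOSS = 2  # assumed value; the post only says it need not be exactly 1


class Node:
    def __init__(self, n_children=0):
        self.visits = 0
        self.wins = 0
        self.children = [Node() for _ in range(n_children)]


def ucb1(parent, child, c=1.4):
    """Plain UCB1 score; unvisited children are tried first."""
    if child.visits == 0:
        return math.inf
    exploit = child.wins / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore


def select_child(parent):
    """Pick the best child and charge it a virtual loss (visits without wins)."""
    best = max(parent.children, key=lambda ch: ucb1(parent, ch))
    best.visits += VIRTUAL_LOSS
    parent.visits += VIRTUAL_LOSS
    return best


def backup(parent, child, result):
    """Swap the virtual loss for the real playout result (1 = win, 0 = loss)."""
    child.visits += 1 - VIRTUAL_LOSS
    child.wins += result
    parent.visits += 1 - VIRTUAL_LOSS


root = Node(n_children=2)
root.visits = 20
a, b = root.children
a.visits, a.wins = 10, 6
b.visits, b.wins = 10, 5

first = select_child(root)   # "thread 1" descends
second = select_child(root)  # "thread 2" descends before thread 1 has finished
print(first is a, second is b)

backup(root, first, 1)
backup(root, second, 0)
```

With these made-up statistics the first descent takes the better-looking child, and the temporary loss charged to it is enough to send the second descent to the other child, so concurrent threads spread out instead of all following the same path.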
Re: [Computer-go] Kas Cup - results and prizes
Erik van der Werf: CAKkgGrM83_HsQ5Z2HJupkj=gdeh3+4gm-jmlvevtjroufqn...@mail.gmail.com: On Thu, Aug 9, 2012 at 11:14 AM, Petr Baudis pa...@ucw.cz wrote: On Wed, Aug 08, 2012 at 09:08:47PM +0200, ds wrote: Hyperthreading does the trick, in my experience it increases the performance by about 10%. I think this is due to waiting for RAM I/O or things like that Yes. With hyperthreading, performance per thread goes down significantly, but total performance goes up by about 15%. In the Pentium 4 era, hyperthreading did not usually pay off, but with i7, its performance is much better. The basic idea is that there are two instruction pipelines that share the same ALU and other processor units; if one of the pipelines stalls (usually due to memory fetch), the other can use the ALU in the meantime, or the two threads may use different parts of the CPU altogether based on what the instructions do. 10-15%, really, that low? For my program (on an i7-3930K, going from 6 to 12 threads) it is more in the order of 40% extra simulations per second. In general that number depends heavily on the code, the architecture of the processor (Intel's are usually better than AMD's), memory speed, cache size, use of ALUs, etc. For Zen, the number is also about 40% on both an i7 3930K (6 to 12 threads) and an i7 920 (4 to 8 threads). Pasky, modern processors are much more complicated :). There are more than two sets of general registers, which are used not only for hyperthreading but also register renaming, for example. Hideki -- Hideki Kato mailto:hideki_ka...@ybb.ne.jp
Re: [Computer-go] Kas Cup - results and prizes
On Thu, Aug 09, 2012 at 08:09:55PM +0900, Hideki Kato wrote: Erik van der Werf: CAKkgGrM83_HsQ5Z2HJupkj=gdeh3+4gm-jmlvevtjroufqn...@mail.gmail.com: 10-15%, really, that low? For my program (on an i7-3930K, going from 6 to 12 threads) it is more in the order of 40% extra simulations per second. In general that number depends heavily on the code, the architecture of the processor (Intel's are usually better than AMD's), memory speed, cache size, use of ALUs, etc. For Zen, the number is also about 40% on both an i7 3930K (6 to 12 threads) and an i7 920 (4 to 8 threads). For Zen, I'm not surprised, since I assume that in simulations, you are matching some larger patterns, which involves a lot of time-consuming hash table lookups, which is ideal for hyperthreading. Not sure about stv. I think it depends a lot on whether you are matching patterns by explicit test code snippets or by a hash table. I measured the hyperthreading effect about 2 years ago with a much older Pachi version. I think today, the hyperthreading effect would also be higher, but I cannot test it right now. Pasky, modern processors are much more complicated :). There are more than two sets of general registers, which are used not only for hyperthreading but also register renaming, for example. Sure, I just tried to sketch a rough explanation. I did not know that hyperthreading could reduce opportunity for register renaming, though. Petr Pasky Baudis