Re: [Computer-go] Kas Cup - results and prizes

2012-08-14 Thread David Fotland
Just curious...  I understand how you update the counters lock-free, but
surely you must have a lock to protect adding a new node to the tree?  Does
this impact scaling at some point?  Are there any other infrequent locks
like this?

David

 -Original Message-
 From: computer-go-boun...@dvandva.org [mailto:computer-go-
 boun...@dvandva.org] On Behalf Of Petr Baudis
 Sent: Friday, August 10, 2012 12:47 PM
 To: computer-go@dvandva.org
 Subject: Re: [Computer-go] Kas Cup - results and prizes
 
 On Fri, Aug 10, 2012 at 09:26:31AM -0700, David Fotland wrote:
  Because my current approach seems to work just as well (or maybe
  better), and I haven't had time to code up a shared tree and tune it up
  to validate that assumption.  Chaslot's paper indicates perhaps that
  not having a shared tree is stronger.  My guess is that they are about
  the same, so it's not worth the effort to change.
 
 In Pachi, having a shared tree makes all the difference when scaling up
 to more threads. See the graph (really awful one, sorry, it's old!) at
 
   http://pachi.or.cz/root-vs-shared.png
 
 If you have some information sharing near the root, I imagine it might
 be similar to Pachi's distributed engine performance (or just slightly
 better). But that is still far behind in scaling compared to the shared
 tree in our experience.
 
 P.S.: There are two important things, virtual loss (not necessarily 1
 simulation but possibly more) and mainly lockless updates. The latter
 also means that sane code should be really easy to modify to use single
 shared tree instead of multiple trees.
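
 A minimal sketch of the virtual-loss idea mentioned above, not Pachi's
 actual code. The names TreeNode, VIRTUAL_LOSS, descend and backup are
 hypothetical, and the value 3 is only an assumed example of "possibly
 more than 1 simulation". A real engine would do these updates with
 atomic add instructions; plain Python integers are used here only to
 show the bookkeeping in a single thread, and the alternation of
 player perspective between tree levels is ignored.

```python
from dataclasses import dataclass

VIRTUAL_LOSS = 3  # hypothetical setting: how many lost playouts to pretend


@dataclass
class TreeNode:
    visits: int = 0
    wins: int = 0

    def winrate(self) -> float:
        # Unvisited nodes get a neutral value in this sketch.
        return self.wins / self.visits if self.visits else 0.5


def descend(path):
    """Before the playout: add visits without wins, i.e. pretend losses.

    Other threads selecting moves in the meantime see a lower win rate
    on this path and are nudged toward different branches.
    """
    for node in path:
        node.visits += VIRTUAL_LOSS


def backup(path, won):
    """After the playout: replace the pretend losses by the one real result."""
    for node in path:
        node.visits += 1 - VIRTUAL_LOSS
        node.wins += 1 if won else 0
```

 After descend() the node temporarily looks worse than it is; after
 backup() its counters are exactly what one ordinary playout would
 have produced.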
 
   Petr Pasky Baudis
 ___
 Computer-go mailing list
 Computer-go@dvandva.org
 http://dvandva.org/cgi-bin/mailman/listinfo/computer-go



Re: [Computer-go] Kas Cup - results and prizes

2012-08-14 Thread Jean-loup Gailly
David,

 I understand how you update the counters lock-free, but
 surely you must have a lock to protect adding a new node to the tree?

We have a single atomic test-and-set instruction to expand a node (add
children) only once. It is not a blocking mutex. If a thread finds that the
node is not yet expanded but some other thread is already allocating new
children, it simply goes on, starting a playout at this point.
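
The expand-once behaviour described above can be sketched as follows.
This is an illustration only, not Pachi's code: Python has no user-level
atomic test-and-set, so a non-blocking threading.Lock.acquire() stands in
for the atomic flag. The names ExpandableNode, try_expand and
make_children are hypothetical.

```python
import threading


class ExpandableNode:
    def __init__(self):
        # Stand-in for an atomic flag; it is set once and never cleared.
        self._expand_flag = threading.Lock()
        self.children = None

    def try_expand(self, make_children):
        """Return True only for the single thread that wins the expansion.

        acquire(blocking=False) returns immediately in every case, so a
        losing thread never waits; it just carries on with its playout.
        """
        if not self._expand_flag.acquire(blocking=False):
            return False
        self.children = make_children()
        return True
```

A caller that gets False treats the node as a leaf for this iteration
and starts its playout from there, which matches the "simply goes on"
behaviour in the text.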

 Does this impact scaling at some point?

Maybe above 24 cores, but we couldn't measure this. Pachi scales perfectly
up to 24 cores in single-machine mode. See the reference given by Pasky
earlier in this thread: Fig. 9 of http://pasky.or.cz/go/pachi-tr.pdf

Jean-loup

Re: [Computer-go] Kas Cup - results and prizes

2012-08-12 Thread Mark Boon
Interesting, I'd have thought it would matter quite a bit, especially with higher 
numbers of threads.

One thing I found (quite a few years back now already) is that you can optimize 
a lot by doing the following: when one node has so many more wins than the 
second best that it can't be overtaken even if the second best wins all of the 
remaining playouts, abort thinking. With a couple of extensions to this general 
idea (aborting not only when being overtaken is impossible, but also when it is 
merely very unlikely) I found that a player that does 64K lightweight simulations using 
this method spends the same time and plays the same level as one that does a 
fixed 32K simulations. Roughly. The higher the number of simulations, the 
bigger the savings.
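
The stopping rule described above can be sketched like this. It is an
assumed reading of the idea, not the author's implementation: the
function name and the catch_up_fraction parameter are hypothetical.
With catch_up_fraction=1.0 it is the strict "cannot be overtaken" test;
a smaller value assumes the runner-up will win at most that share of
the remaining playouts, which is one simple way to express "very
unlikely".

```python
def can_stop_early(wins, remaining, catch_up_fraction=1.0):
    """Decide whether the search can be aborted.

    wins              -- win counts of the candidate moves at the root
    remaining         -- playouts still left in the budget
    catch_up_fraction -- assumed upper bound on the share of the remaining
                         playouts that the second-best move could win
    """
    if len(wins) < 2:
        return True  # nothing to compare against
    best, second = sorted(wins, reverse=True)[:2]
    return best - second > remaining * catch_up_fraction
```

For example, with wins of 600 and 450 and 200 playouts left, the strict
test keeps searching (the gap of 150 is not larger than 200), while
catch_up_fraction=0.5 stops (150 is larger than 100).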

This type of optimization must be much harder with root-level parallelization, 
so you'd have to factor that in when comparing methods.

Mark

On Aug 10, 2012, at 9:55 PM, David Fotland fotl...@smart-games.com wrote:

 Not much memory overhead.  If you look at your tree you will find that most
 nodes are only visited one or two times.  There is a lot of noise in the
 fringes of the tree, so there are few duplicates.  This also means that not
 sharing most of the tree has no impact on strength.
 
 David
 

Re: [Computer-go] Kas Cup - results and prizes

2012-08-12 Thread Mark Boon
Sorry, that's not right of course. The 64K version spends on average the same 
time, but plays 100 Elo stronger. Otherwise there would be no point :)
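
For a sense of scale, under the standard logistic Elo model a 100-point
difference corresponds to an expected score of roughly 64% for the
stronger side. A small sketch of that conversion (the function name is
made up for illustration):

```python
def elo_expected_score(diff):
    """Expected score of the stronger side for a rating difference `diff`."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))
```

elo_expected_score(100) is about 0.64, and elo_expected_score(0) is 0.5.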

Mark


Re: [Computer-go] Kas Cup - results and prizes

2012-08-12 Thread David Fotland
I've been using this abort early to save time idea almost since the
beginning.  It works fine with root parallelization.

David


Re: [Computer-go] Kas Cup - results and prizes

2012-08-11 Thread David Fotland
I'm happy with MFGO's scaling.  I'm running a scaling test now, 4 threads vs
8 threads, fixed 32K total playouts per move, 19x19, no pondering.  Ideally
the win rate should be 50%, since the total playouts are the same.  Has
anyone tried this kind of scaling experiment, and is willing to share
results?

David



Re: [Computer-go] Kas Cup - results and prizes

2012-08-11 Thread David Fotland
Yes, root parallelization with some sharing.
http://www.personeel.unimaas.nl/G-Chaslot/papers/parallelMCTS.pdf said it
was good and I tried it and it works well.

Hardware is really important.  But so are really smart playouts.  The slower
I make my playouts the stronger the program gets.

David

 -Original Message-
 From: computer-go-boun...@dvandva.org [mailto:computer-go-
 boun...@dvandva.org] On Behalf Of Peter Drake
 Sent: Friday, August 10, 2012 10:45 AM
 To: computer-go@dvandva.org
 Subject: Re: [Computer-go] Kas Cup - results and prizes
 
 On Thu, Aug 9, 2012 at 11:42 PM, David Fotland fotl...@smart-games.com
 wrote:
 
  Or it might be an artifact of the way I do search, since I think I
  might be the only engine that doesn't use a single shared tree, and
  the old Many Faces of Go engine is single threaded.
 
 If not a single shared tree, what are you doing? Root parallelism?
 
 I've been wondering why other programs are pulling ahead of Orego, and
 now I'm starting to suspect the answer may be (in part) hardware.
 
 --
 Peter Drake
 https://sites.google.com/a/lclark.edu/drake/


Re: [Computer-go] Kas Cup - results and prizes

2012-08-11 Thread David Fotland
Not much memory overhead.  If you look at your tree you will find that most
nodes are only visited one or two times.  There is a lot of noise in the
fringes of the tree, so there are few duplicates.  This also means that not
sharing most of the tree has no impact on strength.

David

 -Original Message-
 From: computer-go-boun...@dvandva.org [mailto:computer-go-
 boun...@dvandva.org] On Behalf Of Michael Williams
 Sent: Friday, August 10, 2012 9:42 AM
 To: computer-go@dvandva.org
 Subject: Re: [Computer-go] Kas Cup - results and prizes
 
 I imagine you can get around the lack of implicit information sharing
 that you get with a shared tree by explicitly sharing information near
 the root.
 
 But doesn't having separate trees mean a large memory overhead due to
 duplicate nodes?
 
 
 On Fri, Aug 10, 2012 at 9:26 AM, David Fotland fotl...@smart-games.com
 wrote:
  Because my current approach seems to work just as well (or maybe
  better), and I haven't had time to code up a shared tree and tune it up
  to validate that assumption.  Chaslot's paper indicates perhaps that
  not having a shared tree is stronger.  My guess is that they are about
  the same, so it's not worth the effort to change.
 
  david
 
  -Original Message-
  From: computer-go-boun...@dvandva.org [mailto:computer-go-
  boun...@dvandva.org] On Behalf Of Michael Williams
  Sent: Friday, August 10, 2012 12:06 AM
  To: computer-go@dvandva.org
  Subject: Re: [Computer-go] Kas Cup - results and prizes
 
  Why don't you use a shared tree?
 
 
  On Thu, Aug 9, 2012 at 11:49 PM, David Fotland
  fotl...@smart-games.com
  wrote:
   On an i7-2600 Many Faces does 11.4K pps with 4 threads, and 18.7k
   with
   8 threads, a 64% increase, so the 2600 scales a little better than
   the 3770, but the 3770 is still a little bit faster.
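
   The arithmetic behind the figures above, as a small sketch (the
   function name is made up for illustration). It also shows the
   per-thread slowdown that goes along with the higher total when
   hyperthreading is used.

```python
def thread_scaling(total_before, threads_before, total_after, threads_after):
    """Return (relative change in total pps, relative change in pps per thread)."""
    total_gain = total_after / total_before - 1.0
    per_thread_before = total_before / threads_before
    per_thread_after = total_after / threads_after
    per_thread_change = per_thread_after / per_thread_before - 1.0
    return total_gain, per_thread_change


# Many Faces on the i7-2600: 11.4K pps with 4 threads, 18.7K pps with 8.
gain, per_thread = thread_scaling(11400, 4, 18700, 8)
```

   Here gain rounds to 0.64 (the 64% increase quoted) and per_thread
   rounds to -0.18, i.e. each thread runs about 18% slower while the
   total still goes up.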
  
  
  
   david
  
  
  
   From: computer-go-boun...@dvandva.org
   [mailto:computer-go-boun...@dvandva.org] On Behalf Of Erik van der
   Werf
   Sent: Thursday, August 09, 2012 4:41 AM
  
  
   To: computer-go@dvandva.org
   Subject: Re: [Computer-go] Kas Cup - results and prizes
  
  
  
   I don't have an i7-2600, but I could run oakfoam on the 3930. I
   just downloaded it and it does compile. If you give me a list of
   gtp commands to run the benchmark, then I will send you the output
 back.
  
  
  
   Erik
  
  
  
   On Thu, Aug 9, 2012 at 12:38 PM, ds d...@physik.de wrote:
  
   This is very interesting,
  
   I get no more than 10% with oakfoam on an i7-2600K. It would be
   interesting to know whether it is the processor or whether you, e.g.,
   access memory more often instead of cache due to your code...
  
   Do you have the chance to run your program on an i7-2600? Or do you
   have too much time and could try
   https://bitbucket.org/francoisvn/oakfoam/wiki/Home
   on your i7-3930? If so, I would be very much interested in the
   number you get in the beginning of a 19x19 game without book:)
  
  
   Detlef
  
   On Thursday, 09.08.2012, at 12:16 +0200, Erik van der Werf wrote:
  
   On Thu, Aug 9, 2012 at 11:14 AM, Petr Baudis pa...@ucw.cz wrote:
   On Wed, Aug 08, 2012 at 09:08:47PM +0200, ds wrote:
   Hyperthreading does the trick, I have the experience it increases the
   performance by about 10%. I think this is due to waiting for RAM I/O or
   things like that
  
   Yes. With hyperthreading, performance per thread goes down
   significantly, but total performance goes up by about 15%. In the
   Pentium 4 era, hyperthreading did not usually pay off, but with i7,
   its performance is much better. The basic idea is that there are two
   instruction pipelines that share the same ALU and other processor units;
   if one of the pipelines stalls (usually due to memory fetch), the other
   can use the ALU in the meantime, or the two threads may use different
   parts of the CPU altogether based on what the instructions do.
  
   10-15%, really, that low? For my program (on an i7-3930K, going from
   6 to 12 threads) it is more in the order of 40% extra simulations
   per second.
  
   Erik
  
  
  

Re: [Computer-go] Kas Cup - results and prizes

2012-08-11 Thread Petr Baudis
On Sat, Aug 11, 2012 at 12:46:19AM -0700, David Fotland wrote:
 I'm happy with MFGO's scaling.  I'm running a scaling test now, 4 threads vs
 8 threads, fixed 32K total playouts per move, 19x19, no pondering.  Ideally
 the win rate should be 50%, since the total playouts are the same.  Has
 anyone tried this kind of scaling experiment, and is willing to share
 results?

With Pachi, the winrate in this scenario would be 50%; our thread
scaling incurs basically no strength loss compared to sequential
playouts.
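
When reading the result of such an equal-playouts match, it helps to
check whether the observed win rate is statistically distinguishable
from 50% at all. A minimal sketch using the normal approximation to the
binomial; the function names and the 95% level (z = 1.96) are
assumptions for illustration, not something used in the thread.

```python
import math


def winrate_interval(wins, games, z=1.96):
    """Approximate confidence interval for the true win rate."""
    p = wins / games
    margin = z * math.sqrt(p * (1.0 - p) / games)
    return p - margin, p + margin


def consistent_with_even(wins, games, z=1.96):
    """True if a 50% win rate lies inside the interval."""
    low, high = winrate_interval(wins, games, z)
    return low <= 0.5 <= high
```

With 56 wins out of 100 games the result is still consistent with 50%;
with 560 out of 1000 it is not, so a few hundred games are typically
needed before a small scaling loss becomes visible.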

This is visible in the Lousy Graph http://pachi.or.cz/root-vs-shared.png
as the second triplet of bars (labelled Sequential) vs. the last one,
and in Fig. 9 of http://pasky.or.cz/go/pachi-tr.pdf (see the text there
for a detailed description of the graph).

Petr Pasky Baudis


Re: [Computer-go] Kas Cup - results and prizes

2012-08-11 Thread Petr Baudis
On Sat, Aug 11, 2012 at 12:52:12AM -0700, David Fotland wrote:
 Yes, root parallelization with some sharing.
 http://www.personeel.unimaas.nl/G-Chaslot/papers/parallelMCTS.pdf said it
 was good and I tried it and it works well.

The paper is not so relevant now, since the standard method of most
programs is lockless tree parallelization, which is not covered.
The locking overhead is quite significant, I'd expect, as locking
instructions can AFAIK take hundreds of cycles.

That said, root parallelization outperforming sequential simulations
is something I never managed to reproduce and that seems rather
surprising to me. It might have something to do with the way priors
are done in the tree or some other engine-specific factors.

Petr Pasky Baudis


Re: [Computer-go] Kas Cup - results and prizes

2012-08-11 Thread Jason House
On Aug 11, 2012, at 10:59 AM, Petr Baudis pa...@ucw.cz wrote:

 
 That said, root parallelization outperforming sequential simulations
 is something I never managed to reproduce and that seems rather
 surprising to me. It might have something to do with the way priors
 are done in the tree or some other engine-specific factors.

When I saw that result, I immediately concluded that the engine locked into 
specific lines of play prematurely. I can imagine a few exceptional cases, but 
I don't think root parallelism should ever outperform a single thread doing the 
equivalent number of simulations.


Re: [Computer-go] Kas Cup - results and prizes

2012-08-11 Thread Hideki Kato
Petr Baudis: 20120811145900.gv19...@machine.or.cz:
On Sat, Aug 11, 2012 at 12:52:12AM -0700, David Fotland wrote:
 Yes, root parallelization with some sharing.
 http://www.personeel.unimaas.nl/G-Chaslot/papers/parallelMCTS.pdf said it
 was good and I tried it and it works well.

The paper is not so relevant now, since the standard method of most
programs is lockless tree parallelization, which is not covered.
The locking overhead is quite significant, I'd expect, as locking
instructions can AFAIK take hundreds of cycles.

With spin-lock or hardware test-and-set instructions, locking overhead 
is very small.

That said, root parallelization outperforming sequential simulations
is something I never managed to reproduce and that seems rather
surprising to me. It might have something to do with the way priors
are done in the tree or some other engine-specific factors.

I believe the IBM Power processor's architecture may have caused the 
super-linear acceleration.

Hideki
-- 
Hideki Kato mailto:hideki_ka...@ybb.ne.jp


Re: [Computer-go] Kas Cup - results and prizes

2012-08-11 Thread Michael Williams
I wonder if spin-lock hurts hyperthreading.




Re: [Computer-go] Kas Cup - results and prizes

2012-08-11 Thread Hideki Kato
Michael Williams: 
CAB0EdYWgs=gjsnt9rmkj8-utlg72nyxnafvd-8iq_pjjf_j...@mail.gmail.com:
I wonder if spin-lock hurts hyperthreading.

Why do you think so?  If a spin-lock accesses memory and waits, 
simply another thread runs.  That's all.

Hideki

-- 
Hideki Kato mailto:hideki_ka...@ybb.ne.jp


Re: [Computer-go] Kas Cup - results and prizes

2012-08-11 Thread Michael Williams
Because two hyperthreads some of the same hardware.  And some of that
hardware is required to do the spinning.  Just a thought.

Found this with a quick search:
http://archives.postgresql.org/pgsql-patches/2003-12/msg00345.php




Re: [Computer-go] Kas Cup - results and prizes

2012-08-11 Thread Michael Williams
two hyperthreads SHARE some of the same hardware



Re: [Computer-go] Kas Cup - results and prizes

2012-08-11 Thread Hideki Kato
Michael Williams: 
cab0edyxgbi9-+su8dh1trfegr6kw4yku7plhgbvzyi-yjww...@mail.gmail.com:
Because two hyperthreads some of the same hardware.  And some of that
hardware is required to do the spinning.  Just a thought.

Found this with a quick search:
http://archives.postgresql.org/pgsql-patches/2003-12/msg00345.php

I believe it's too old and does not apply to modern hyperthreading.

Hideki

-- 
Hideki Kato mailto:hideki_ka...@ybb.ne.jp


Re: [Computer-go] Kas Cup - results and prizes

2012-08-10 Thread David Fotland
On my Core i7-3770, 4 threads give 12.5K playouts/sec (19x19, average of
the first four moves by White), and 8 threads give 19.8K, a 58% increase.
This is much higher than I expected.  It seems Intel has improved
hyperthreading since the last time I tried it.  Or it might be an artifact
of the way I do search, since I think I might be the only engine that
doesn't use a single shared tree, and the old Many Faces of Go engine is
single-threaded.

David



Re: [Computer-go] Kas Cup - results and prizes

2012-08-10 Thread David Fotland
On an i7-2600 Many Faces does 11.4K pps with 4 threads, and 18.7K with 8
threads, a 64% increase, so the 2600 scales a little better than the 3770,
but the 3770 is still a little bit faster.

 

david
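
The percentages quoted for the two CPUs can be checked with a few lines of arithmetic. The sketch below uses only the playout rates reported in this thread (12.5K to 19.8K on the i7-3770, 11.4K to 18.7K on the i7-2600); the function name ht_gain is made up for the illustration. It also shows the per-thread slowdown that goes along with the gain in total throughput.

```python
def ht_gain(base_kpps, ht_kpps, base_threads, ht_threads):
    """Return (total gain %, per-thread change %) when enabling hyperthreads.

    base_kpps / ht_kpps are total playout rates in thousands per second,
    measured with base_threads and ht_threads threads respectively.
    """
    total_gain = (ht_kpps / base_kpps - 1.0) * 100.0
    per_thread_before = base_kpps / base_threads
    per_thread_after = ht_kpps / ht_threads
    per_thread_change = (per_thread_after / per_thread_before - 1.0) * 100.0
    return total_gain, per_thread_change


# Figures as reported in the thread (4 threads vs. 8 threads).
i7_3770 = ht_gain(12.5, 19.8, 4, 8)
i7_2600 = ht_gain(11.4, 18.7, 4, 8)

for name, (total, per_thread) in (("i7-3770", i7_3770), ("i7-2600", i7_2600)):
    print(f"{name}: total {total:+.1f}%, per thread {per_thread:+.1f}%")
```

The totals come out at roughly 58% and 64%, matching the figures above, while each individual thread runs about a fifth slower with hyperthreading in use. The absolute rate with 8 threads is still higher on the 3770 (19.8K against 18.7K).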

 

From: computer-go-boun...@dvandva.org
[mailto:computer-go-boun...@dvandva.org] On Behalf Of Erik van der Werf
Sent: Thursday, August 09, 2012 4:41 AM
To: computer-go@dvandva.org
Subject: Re: [Computer-go] Kas Cup - results and prizes

 

I don't have an i7-2600, but I could run oakfoam on the 3930. I just
downloaded it and it does compile. If you give me a list of gtp commands to
run the benchmark, then I will send you the output back.

 

Erik

 

On Thu, Aug 9, 2012 at 12:38 PM, ds d...@physik.de wrote:

This is very interesting.

I get no more than 10% with oakfoam on an i7-2600K. It would be interesting
to know whether it is the processor, or whether your code, e.g., accesses
memory more often instead of cache...

Do you have the chance to run your program on an i7-2600? Or do you have
too much time and could try https://bitbucket.org/francoisvn/oakfoam/wiki/Home
on your i7-3930? If so, I would be very much interested in the number
you get at the beginning of a 19x19 game without book :)


Detlef


Re: [Computer-go] Kas Cup - results and prizes

2012-08-10 Thread Michael Williams
Why don't you use a shared tree?




Re: [Computer-go] Kas Cup - results and prizes

2012-08-10 Thread David Fotland
Because my current approach seems to work just as well (or maybe better),
and I haven't had time to code up a shared tree and tune it up to validate
that assumption.  Chaslot's paper indicates perhaps that not having a shared
tree is stronger.  My guess is that they are about the same, so it's not
worth the effort to change.

david



Re: [Computer-go] Kas Cup - results and prizes

2012-08-10 Thread Michael Williams
I imagine you can get around the lack of implicit information sharing
that you get with a shared tree by explicitly sharing information near
the root.

But doesn't having separate trees mean a large memory overhead due to
duplicate nodes?
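
To make the two points above concrete, here is a rough sketch under stated assumptions. It is not how Many Faces or any other engine mentioned here works: the function names, the (visits, wins) layout, and the use of position hashes to identify nodes are all hypothetical. It shows one simple form of sharing at the root, summing the root statistics of independent trees, and a way to count how many nodes are stored more than once across those trees.

```python
def merge_root_stats(per_tree_stats):
    """Sum (visits, wins) per root move over independently searched trees.

    per_tree_stats: list of dicts mapping move -> (visits, wins).
    """
    merged = {}
    for stats in per_tree_stats:
        for move, (visits, wins) in stats.items():
            v, w = merged.get(move, (0, 0))
            merged[move] = (v + visits, w + wins)
    return merged


def best_move(merged):
    """Pick the move with the most combined visits."""
    return max(merged, key=lambda move: merged[move][0])


def duplicate_node_count(per_tree_positions):
    """Count nodes stored more than once across separate trees.

    per_tree_positions: list of sets of position hashes, one set per tree.
    """
    total = sum(len(positions) for positions in per_tree_positions)
    distinct = len(set().union(*per_tree_positions))
    return total - distinct


# Made-up numbers for two trees searched by two threads.
trees = [
    {"D4": (100, 55), "Q16": (80, 50)},
    {"D4": (60, 30), "Q16": (90, 52)},
]
merged = merge_root_stats(trees)
```

With these made-up numbers the merged counts favour Q16 (170 visits against 160), although the first tree alone would have chosen D4. The duplicate count is the memory a single shared tree would save, assuming both approaches expand the same positions, which need not hold in practice.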




Re: [Computer-go] Kas Cup - results and prizes

2012-08-10 Thread Petr Baudis
On Fri, Aug 10, 2012 at 09:26:31AM -0700, David Fotland wrote:
 Because my current approach seems to work just as well (or maybe better),
 and I haven't had time to code up a shared tree and tune it up to validate
 that assumption.  Chaslot's paper indicates perhaps that not having a shared
 tree is stronger.  My guess is that they are about the same, so it's not
 worth the effort to change.

In Pachi, having a shared tree makes all the difference when scaling up
to more threads. See the graph (really awful one, sorry, it's old!) at

http://pachi.or.cz/root-vs-shared.png

If you have some information sharing near the root, I imagine it might
be similar to Pachi's distributed engine performance (or just slightly
better). But that is still far behind in scaling compared to the shared
tree in our experience.

P.S.: There are two important things: virtual loss (not necessarily 1
simulation but possibly more) and, mainly, lockless updates. The latter
also means that sane code should be really easy to modify to use a single
shared tree instead of multiple trees.

Petr Pasky Baudis
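
As an illustration of the virtual loss idea mentioned in the P.S., here is a minimal single-threaded sketch. It is not Pachi's code: the Node fields, the plain UCB1 formula and the constant 1.4 are assumptions made for the example, and the lockless part (atomic counter updates in a compiled language) is deliberately not shown. A thread that descends into a child temporarily books extra visits with no wins there, so that other threads selecting at the same time are steered to different children; the booking is undone when the real result is backed up.

```python
import math


class Node:
    def __init__(self, visits=0, wins=0.0):
        self.visits = visits
        self.wins = wins
        self.virtual_losses = 0  # visits booked by simulations in flight


def uct_value(child, total_visits, c=1.4):
    """UCB1 value where virtual losses count as visits without wins."""
    n = child.visits + child.virtual_losses
    if n == 0:
        return float("inf")
    mean = child.wins / n
    return mean + c * math.sqrt(math.log(total_visits) / n)


def select_child(children, virtual_loss=1):
    """Pick the best child and book a virtual loss on it."""
    total = sum(ch.visits + ch.virtual_losses for ch in children)
    index = max(range(len(children)),
                key=lambda i: uct_value(children[i], total))
    children[index].virtual_losses += virtual_loss
    return index


def backup(child, result, virtual_loss=1):
    """Remove the virtual loss and record the real playout result."""
    child.virtual_losses -= virtual_loss
    child.visits += 1
    child.wins += result


# Two children with identical statistics: without virtual loss, two
# threads selecting at the same moment would both take the first one.
children = [Node(10, 5.0), Node(10, 5.0)]
first = select_child(children)
second = select_child(children)
```

Here the second selection lands on the other child, because the first child's value was lowered by the virtual loss still booked on it. Passing virtual_loss greater than 1 gives the "possibly more than one simulation" variant.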


Re: [Computer-go] Kas Cup - results and prizes

2012-08-09 Thread Hideki Kato
Erik van der Werf: 
CAKkgGrM83_HsQ5Z2HJupkj=gdeh3+4gm-jmlvevtjroufqn...@mail.gmail.com:
On Thu, Aug 9, 2012 at 11:14 AM, Petr Baudis pa...@ucw.cz wrote:

 On Wed, Aug 08, 2012 at 09:08:47PM +0200, ds wrote:
  Hyperthreading does the trick, I have the experience it increases the
  performance by about 10%. I think this is due to waiting for RAM I/O or
  things like that

 Yes. With hyperthreading, performance per thread goes down
 significantly, but total performance goes up by about 15%. In the
 Pentium 4 era, hyperthreading did not usually pay off, but with i7,
 its performance is much better. The basic idea is that there are two
 instruction pipelines that share the same ALU and other processor units;
 if one of the pipelines stalls (usually due to memory fetch), the other
 can use the ALU in the meantime, or the two threads may use different
 parts of the CPU altogether based on what the instructions do.


10-15%, really, that low? For my program (on an i7-3930K, going from 6 to
12 threads) it is more in the order of 40% extra simulations per second.

In general that number highly depends on the code, architecture of the 
processor (Intel's are usually better than AMD's), memory speed, cache 
size, use of ALUs, etc.  For Zen, the number is also about 40% on both 
an i7 3930K (6 to 12 threads) and an i7 920 (4 to 8 threads).

Pasky, modern processors are much more complicated :).  There are more 
than two sets of general registers, which are used not only for 
hyperthreading but also register renaming, for example.

Hideki
-- 
Hideki Kato mailto:hideki_ka...@ybb.ne.jp


Re: [Computer-go] Kas Cup - results and prizes

2012-08-09 Thread Petr Baudis
On Thu, Aug 09, 2012 at 08:09:55PM +0900, Hideki Kato wrote:
 Erik van der Werf: 
 CAKkgGrM83_HsQ5Z2HJupkj=gdeh3+4gm-jmlvevtjroufqn...@mail.gmail.com:
 10-15%, really, that low? For my program (on an i7-3930K, going from 6 to
 12 threads) it is more in the order of 40% extra simulations per second.
 
 In general that number highly depends on the code, architecture of the 
 processor (Intel's are usually better than AMD's), memory speed, cache 
 size, use of ALUs, etc.  For Zen, the number is also about 40% on both 
 an i7 3930K (6 to 12 threads) and an i7 920 (4 to 8 threads).

For Zen, I'm not surprised, since I assume that in simulations, you are
matching some larger patterns, which involves a lot of time-consuming
hash table lookups, and that is ideal for hyperthreading. Not sure about
stv. I think it depends a lot on whether you are matching patterns by
explicit test code snippets or by a hash table.
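
To illustrate the distinction, here is a small sketch of the two styles. Everything in it is hypothetical (the point encoding, the 3x3 neighbourhood, the function names and the table contents); the thread does not describe how Zen, Pachi or any other engine actually matches patterns. The explicit test is a few comparisons on nearby points, while the table version packs the neighbourhood into a key and looks it up, which in a real engine with a large table tends to be a memory access that can miss the cache.

```python
EMPTY, BLACK, WHITE, EDGE = 0, 1, 2, 3


def neighborhood(board, size, x, y):
    """Return the 8 points around (x, y), row by row, EDGE if off-board."""
    cells = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dx == 0 and dy == 0:
                continue
            nx, ny = x + dx, y + dy
            if 0 <= nx < size and 0 <= ny < size:
                cells.append(board[ny][nx])
            else:
                cells.append(EDGE)
    return tuple(cells)


def pattern_key(cells):
    """Pack the neighbourhood into an integer, 2 bits per point."""
    key = 0
    for c in cells:
        key = key * 4 + c
    return key


def explicit_test(board, x, y):
    """Hard-coded check: black stone above and white stone to the left.

    Assumes (x, y) is not on the top or left edge.
    """
    return board[y - 1][x] == BLACK and board[y][x - 1] == WHITE


def table_lookup(table, board, size, x, y):
    """Hash-table style matching: one key computation, one lookup."""
    return table.get(pattern_key(neighborhood(board, size, x, y)), 0.0)


# Tiny 3x3 example position, indexed as board[y][x].
board = [
    [EMPTY, BLACK, EMPTY],
    [WHITE, EMPTY, EMPTY],
    [EMPTY, EMPTY, EMPTY],
]
cells = neighborhood(board, 3, 1, 1)
table = {pattern_key(cells): 2.5}  # made-up weight for this pattern
```

Both explicit_test(board, 1, 1) and table_lookup(table, board, 3, 1, 1) recognise the position in the example. The difference that matters for hyperthreading is where the time goes: the explicit test is mostly arithmetic and branches, while the lookup waits on memory when the table is large, and that waiting is what the second hardware thread can fill.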

I measured the hyperthreading effect about 2 years ago with a much older
Pachi version. I think the hyperthreading effect would be higher today,
but I cannot test it right now.

 Pasky, modern processors are much more complicated :).  There are more 
 than two sets of general registers, which are used not only for 
 hyperthreading but also register renaming, for example.

Sure, I just tried to sketch a rough explanation. I did not know that
hyperthreading could reduce opportunity for register renaming, though.

Petr Pasky Baudis