Fuming and Petr, I was giving the tree-search some thought over the weekend and made some crude estimates based on my own MCTS implementation. I did not make fresh timing measurements, just added/subtracted some numbers to get to where I want to be. Assuming you want to maintain RAVE and win-rate information, I am guesstimating that for my quasi-heavy playouts I find an overhead of about 17% for the tree-search part (50,000 playouts total). My quasi-heavy playouts run at 20,000 playouts/second, and I am estimating that if you appropriately break the tree-search portion in descend, update, copy, and misc threads, you can get the overhead to come down to, say, 5%. Then, if you completely outsource the playouts to the FPGA, i.e., you make them cost-free (ignoring communication costs), you would be able to boost the playouts twenty-fold, to 400,000, at which point you become CPU limited. Assume further that my coding can be further improved by another 2x (not unreasonable at all), and you get to 800,000.
Thus, with an FPGA that delivers 1,500,000 playout/sec, you have a fairly serious challenge ahead of you if you want to update the tree for individual playouts and maintain RAVE information. Note that I only used 50,000 total playouts to measure the overhead, which is an underestimate if you have this much processing power available. In addition, if you move the tree-part off-FPGA, you may be able to get even faster playouts (e.g., two streams in parallel). Not to mention that FPGA's with 4x faster clocks already exist, some may be bigger, and you may be able to put a number of them in your system. I would say managing this much raw simulation power would require some key new insight (like Lukasz's epsilon trick proposal), or a reduction of effort by bunching things together (such as leaf parallellism). Perhaps the cluster-folks have something valuable to add to this, but I would say that for a single compute system this is unexplored territory. Fuming, are you planning to run a version of your program on KGS/CGOS at some time in the future? Than we can all see it in action. René On Mon, Jun 21, 2010 at 7:13 AM, Fuming Wang <[email protected]> wrote: > > Hi Petr, > > What I mean is that games are simulated in parallel, but are one clock > cycle apart from each other. > > On Sun, Jun 20, 2010 at 8:51 AM, Petr Baudis <[email protected]> wrote: > >> Hi! >> >> On Thu, Jun 17, 2010 at 10:04:58PM +0800, Fuming Wang wrote: >> > > Do all the games have to start synchonously? >> > > >> > > >> > Games are started one clock at a time. >> >> I'm sorry, I don't quite understand that - does that mean that >> different games can start at different times? >> >> > I do not quite understand your description. I'm guessing that you mean >> to >> > have n playouts at a time instead of one playout at a time like the >> regular >> > UCT algorithm. Nonetheless, FPGA based method are very static in >> structure, >> > and can not adapt easily to the dynamic changes of a game tree like a >> > software implemented tree structure. So I don't know how to implement >> UCT >> > type dynamic tree search algorithms in FPGA right now. >> >> I haven't meant the tree search to be done on the FPGA at all, I have >> merely tried to draft a method to effeciently queue games to be played >> on the FPGA next. Are you planning to do the alpha-beta search in the >> FPGA? >> >> > I see now. You were thinking about sending games to the FPGA and using the > simulation results in the main computer. This is not what we are doing > right now, we are doing the tree search in the FPGA. > > Fuming > > _______________________________________________ > Computer-go mailing list > [email protected] > http://dvandva.org/cgi-bin/mailman/listinfo/computer-go >
_______________________________________________ Computer-go mailing list [email protected] http://dvandva.org/cgi-bin/mailman/listinfo/computer-go
