On Tue, Jan 8, 2013 at 4:12 AM, Petr Baudis <[email protected]> wrote:
> Hi!
Thanks for the response! Some more questions; I have marked them with @@@ below.

> > - size of board representation;
>
> The bare minimum of course is two bits per point. Usually,
> an extra edge frame (or half-frame) is added so that you
> do not have to special-case border areas.
> Then again, for efficient implementation, you will also
> need at least some representation of stone chains and
> chain membership information for each point.

Bit masks seem slow to me. I came up with a very efficient board representation which fits all board information into three arrays of integer indexes, plus several fields for the current player color, ko status, etc. I cannot give full details right now, but these arrays hold some intricate linked lists and store information about each board intersection, as well as the list of stone chains together with the count of real liberties for each chain. This representation allows very fast modification during random playouts (because all information is stored in linked lists) and is relatively small: 256 bytes for 9x9, 512 bytes for 13x13 and around 2200 bytes for 19x19 (19x19 has more than 256 positions, so 16-bit offsets are kept inside the arrays instead of 8-bit ones). It also allows pre-sorting the array of free positions according to some RAVE information, so that "useful" moves are randomly selected during playouts more often than "not useful" ones. I'll describe the details of this representation if I don't manage to write a functional Go engine soon. The compactness of this structure is probably less important on a CPU, but I like that it can be updated very fast.

> Then, you'll need to guard against superko by having a hash code
> per point+color and hash map of encountered positions.

@@@ Is it possible to completely ignore superko during playouts and detect it only during UCT tree descent, or perhaps detect only relatively short superko cycles during playouts? I think the probability of a long superko cycle during a random game is very small, and statistically its effect over a long series of random playouts will be insignificant.

> > - size of patterns library used for playout;
>
> A bare minimum for good computer player is 3x3 patterns
> around last move, two bits per point, i.e. 16 bits per pattern.
> (Of course, you can use more bits or slightly larger patterns,
> or use decision tree instead of pattern matching.) You can
> extend this to much larger patterns, more moves, more features
> and so on, but that's not so crucial. You'd have about 8-16
> such patterns (up to isomorphism).

I suspect the strength of random playouts may greatly depend on good patterns, so my initial plan is to use 3x3 and 5x5 patterns where each pattern point can take one of eleven Zobrist hash values: empty, border, ko, or an enemy/friend stone with 1, 2, 3 or >3 liberties.

I plan to squeeze 100000 patterns into 512 KB the following way: patterns are stored in a hash table with a bucket size of 32 bytes, and each bucket can contain up to eight 4-byte patterns. Each pattern has a 40-bit Zobrist hash (I hope that is enough). The first 14 bits of the Zobrist hash select the bucket in the hash table; then the 32 bytes of the bucket are read and the 8 stored patterns are checked. The first 26 bits of each pattern contain the second part of the 40-bit Zobrist hash, and the last 6 bits contain the pattern description (is it a useful pattern or an antipattern, is it a ko threat, a last-game move, a move which is useful during semeai, should the opponent tenuki after this move, etc.). A rough sketch of this lookup is below.
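To make that bucket layout concrete, here is roughly what I have in mind. The names, the choice of the low 14 bits for the bucket index and the "zero means empty slot" convention are just placeholders, not finished code:

#include <stdint.h>

#define PATTERN_BUCKETS    (1 << 14)    /* 16384 buckets                      */
#define ENTRIES_PER_BUCKET 8            /* 8 * 4 bytes = 32 bytes per bucket  */
#define PATTERN_EMPTY      0u           /* unused slot; a real table would
                                           need a safer empty convention      */

typedef struct {
    uint32_t entry[ENTRIES_PER_BUCKET]; /* (26-bit tag << 6) | 6 flag bits    */
} PatternBucket;

__device__ PatternBucket g_patternTable[PATTERN_BUCKETS];  /* 512 KB total,
                                           hopefully resident in L2           */

/* Look up a 40-bit pattern hash; returns the 6 description bits,
   or -1 if the pattern is not stored. */
__device__ int lookup_pattern(uint64_t hash40)
{
    uint32_t bucketIdx = (uint32_t)(hash40 & (PATTERN_BUCKETS - 1));  /* 14 bits */
    uint32_t tag       = (uint32_t)((hash40 >> 14) & 0x3FFFFFFu);     /* 26 bits */

    const PatternBucket *b = &g_patternTable[bucketIdx];
    for (int i = 0; i < ENTRIES_PER_BUCKET; ++i) {
        uint32_t e = b->entry[i];
        if (e != PATTERN_EMPTY && (e >> 6) == tag)
            return (int)(e & 0x3Fu);    /* pattern description bits            */
    }
    return -1;                          /* no matching pattern in the bucket   */
}

With 2^14 buckets of eight 4-byte entries this is exactly 512 KB, matching the L2 budget mentioned further below.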
> > 2) What is a typical data size of structures used in modern
> > Monte-Carlo Go engines?
> > - size of UCT tree node;
>
> You will need at least UCT and RAVE values (#of playouts
> and value or #of wins), move coordinates and say another
> extra byte for some flags, per child, and pointer to its children list.

I have a plan to completely avoid pointers to child nodes. Maybe I don't fully understand something, but if each UCT tree node has a unique Zobrist hash and all nodes reside in a gigantic hash table, then the address of the hash-table bucket holding a specific child node can be computed on the fly by deriving the Zobrist hash of the child position from the parent's Zobrist hash. So it is enough for each node to keep a relatively small header of node-specific information plus an array of UCT values of the child positions, where the index in the array corresponds to coordinates on the board. During tree descent I can select the maximum UCT value from the array, recover the child move coordinates from its position in the array, calculate the Zobrist hash of the child position and jump to the appropriate bucket of the hash table (using the first part of the calculated Zobrist hash as the bucket number), then search for the node inside the bucket using the second part of the hash. The array of UCT values lets me minimize main-memory accesses during descent, because only the single child with the maximum UCT value is touched. So I think it is possible to keep the size of a UCT node below 1 KB for a 19x19 board, assuming each entry in the array of child UCT values fits in 2 bytes (I am not sure 2 bytes is enough for a rounded UCT value, but why not; the exact UCT value will be stored in the child node itself). This implies some recomputation during tree descent, but I don't think it is critical, because most processor time will be spent not in tree descent but in random playouts. While one warp waits for a child UCT node from main memory, the other warps (eight in total) will use the same streaming multiprocessor for playouts. (A rough sketch of such a descent step is further below.)

> > Each of playout will use single warp (32 GPU thread) and typically
> > will use only first thread of warp, but for several task, such as
> > Zobrist hash calculation, pattern matching and relatively good
> > next random playout move selection, all warp threads
> > will be used.
>
> You should also be able to fully utilize all threads for tree node
> selection. On the other hand, I worry a little about memory latency
> when walking the tree.

Since my UCT node will have an array of the children's UCT values, I plan to use a full warp of threads to search that array quickly. Regarding parallel access to child nodes, I want to touch main GPU memory as little as possible, both because it is slow (100x shared memory, as I hear) and because I don't want to degrade the L2 cache, so I plan to touch only one child node at each descent step. As I wrote above, I am not sure that full utilization of threads during tree descent is important, because descent seems bounded mostly by slow main-memory access rather than by calculation speed, and it also seems relatively cheap compared to the heavy random playouts. Also, I want to do not one but several playouts after each tree descent (I do not know whether this is common practice), so tree-walking time seems even more minor compared to playouts.

> The trouble here is the superko detection I think, you
> would need some clever ways to cheat here. If you
> don't care about 9x9 strength but just about large boards

Again, is it really so important to detect long superko cycles during playouts on 9x9?
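To illustrate the pointer-free descent I described above, here is a rough sketch for 13x13. Everything in it (struct layout, names, the bucket count, the simplified hash update that ignores captures and ko) is a placeholder for the idea, not working engine code, and for clarity the child array is scanned by a single thread; spreading the scan over a warp is a separate refinement:

#include <stdint.h>

#define BOARD_POINTS 169                /* 13x13                               */
#define NODE_BUCKETS (1 << 21)          /* ~2 million nodes in GPU memory      */

typedef struct {
    uint64_t hash;                      /* full Zobrist hash of this position  */
    uint32_t visits;                    /* header: playouts, wins, flags, ...  */
    uint32_t wins;
    uint16_t childValue[BOARD_POINTS];  /* rounded UCT value per move, 2 bytes */
} UctNode;                              /* ~360 bytes, well under 512          */

__device__ UctNode   *g_nodeTable;                  /* cudaMalloc'ed hash table */
__constant__ uint64_t c_zobrist[BOARD_POINTS][2];   /* per-point, per-color     */

/* Pick the child with the maximal stored UCT value, recover its move from the
   array index, derive the child's Zobrist hash from the parent's, and return
   the bucket where the child node should live. Searching inside the bucket
   with the remaining hash bits is omitted here. */
__device__ uint32_t descend_step(const UctNode *node, int toPlay, int *bestMove)
{
    int best = 0;
    for (int p = 1; p < BOARD_POINTS; ++p)
        if (node->childValue[p] > node->childValue[best])
            best = p;

    *bestMove = best;
    /* NOTE: a real update must also fold in captured stones and ko changes. */
    uint64_t childHash = node->hash ^ c_zobrist[best][toPlay];
    return (uint32_t)(childHash & (NODE_BUCKETS - 1));
}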
> > I suspect light playout for 13x13 can fit in 1 kilobyte of memory,
>
> Zobrist hashes alone would occupy 2352 bytes: (14*14)*3*4. Maybe
> 2028 if you forego the edge frame (but you'll have to do expensive
> index recomputation).

I want to store the Zobrist hash values not in the 6 KB of shared memory used for the playout, but in the 64 KB of GPU constant memory, which has an 8 KB cache per SM and looks very suitable to me for Zobrist hash storage, because it seems optimized for both potential use cases:

1) When I have randomly selected 32 candidate moves for the playout and then want to check pattern matching for each candidate, each warp thread will simultaneously calculate the Zobrist hash of the pattern corresponding to that thread's candidate move, and at each step all threads will request the same hash value from constant memory (say, "Zobrist hash for the upper-left corner of the pattern if this place is empty").

2) When calculating the Zobrist hash of the whole position, different threads in the warp will request hash values from constant memory which often lie near each other.

I have not examined the full details of GPU constant memory yet, but from what I know it should be very efficient for such a task, and the precious 6 KB of shared memory can be dedicated solely to the board representation and some local variables. (A rough sketch of this layout is further below.)

@@@ By the way, I have already read several times about keeping the edges inside the array together with the positions, to avoid analyzing a position index to determine whether it lies near the board edge, but I just don't understand this. Is it really so slow to check whether a position lies near the edge? It seems to me that many board representations (such as bit masks) imply a much bigger performance hit than the edge check. At the least, I can have a small shared lookup table which tells, for each index, whether it lies near an edge and near which edge exactly.

> And this is 13x13. I think engines that can't work on 19x19 aren't
> really very exciting nowadays, maybe that's just me.

My very efficient (as I hope) board representation takes 256 bytes for 9x9 and 512 bytes for 13x13, but a whole enormous 2200 bytes for 19x19, so my first plan is to avoid 19x19 completely and aim only at domination on 9x9 and 13x13 ;) (at least for now, to fit comfortably in the 6 KB of available memory)

> You don't need much memory for ladder evaluation, I think;
> you need to store move stack only in case of branches,
> which are rare when checking ladders.

@@@ Even if branches are rare, I should probably implement branch analysis at least 1-2 levels deep, or do you think branches are so rare that this is unnecessary? I plan to have some kind of stack inside the 6 KB and just clone the whole 512-byte board representation (for 13x13) onto the top of this "stack" when a branch occurs, and then play the atari sequence to the end to see whether the ladder works. (A sketch of this idea is further below as well.)

> > - Hash table with pattern library must fit in 512 kilobyte of GPU
> > L2 cache (with total L2 cache size 768 kilobyte)
>
> You will also want the cache to hold some popular parts of the tree,
> a "base" board with the current position, etc.

I'll have the remaining 256 kilobytes of cache for this.

> > I hope single UCT node will take around 512 byte for
> > 13x13 board, so GPU memory may hold up to 2 million
> > UCT nodes.
>
> I think this shouldn't be a severe strength limit; if you implement
> some (not necessarily even very aggressive) pruning and store your
> nodes in an efficient manner, you shouldn't have trouble.
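To show what I mean by the constant-memory layout for the pattern hashes, here is a rough sketch of one warp hashing the 3x3 neighbourhoods of its 32 candidate moves, one candidate per thread. The table name, the eleven-state encoding and the board accessor are placeholders; the point is only that the loop index is uniform across the warp, so all 32 threads index the same small row of the constant table at every step:

#include <stdint.h>

#define BOARD_SIZE     13
#define LINE           (BOARD_SIZE + 2)   /* +2 for a one-point border frame    */
#define PATTERN_POINTS 9                  /* 3x3 neighbourhood                  */
#define POINT_STATES   11                 /* empty, border, ko, 4+4 liberty bins */

__constant__ uint64_t c_patternZobrist[PATTERN_POINTS][POINT_STATES];  /* 792 B */

/* Placeholder accessor: assume the playout board stores the eleven-state
   classification directly, one byte per point, with a border frame. */
__device__ int point_state(const uint8_t *board, int x, int y)
{
    return board[(y + 1) * LINE + (x + 1)];
}

/* Each thread of the warp calls this for its own candidate move (cx, cy). */
__device__ uint64_t pattern_hash_3x3(const uint8_t *board, int cx, int cy)
{
    const int dx[PATTERN_POINTS] = { -1, 0, 1, -1, 0, 1, -1, 0, 1 };
    const int dy[PATTERN_POINTS] = { -1, -1, -1, 0, 0, 0, 1, 1, 1 };

    uint64_t h = 0;
    for (int i = 0; i < PATTERN_POINTS; ++i)
        h ^= c_patternZobrist[i][point_state(board, cx + dx[i], cy + dy[i])];
    return h;       /* truncate to 40 bits before the pattern-table lookup */
}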
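And here is roughly what I mean by the ladder "stack": a couple of 512-byte board snapshots in shared memory, pushed only at branch points. Board512 and the helper names are placeholders, not real engine code:

#include <stdint.h>
#include <string.h>

#define BOARD_BYTES  512    /* compact 13x13 board representation              */
#define LADDER_DEPTH 2      /* 1-2 saved branch points, as discussed above     */

typedef struct { uint8_t raw[BOARD_BYTES]; } Board512;

typedef struct {
    Board512 frame[LADDER_DEPTH];   /* snapshots taken at branch points        */
    int      top;                   /* number of snapshots currently in use    */
} LadderStack;

/* Push a copy of the current board before exploring one side of a branch. */
__device__ int ladder_push(LadderStack *st, const Board512 *board)
{
    if (st->top >= LADDER_DEPTH)
        return 0;                   /* too deep: give up on exact reading      */
    memcpy(&st->frame[st->top++], board, BOARD_BYTES);
    return 1;
}

/* Restore the saved board to try the other branch. */
__device__ void ladder_pop(LadderStack *st, Board512 *board)
{
    memcpy(board, &st->frame[--st->top], BOARD_BYTES);
}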
Glad to hear the node budget shouldn't be a limit, thanks a lot for the info.

> If you ever get bored of GPU Go, I think the hardware future lies in two directions:
>
> (i) Xeon Phi (Larrabee). I'm so envious of any Go programmers that get
> their hands on this beauty first.

I was very excited when I heard about Xeon Phi some time ago, but I suspect it will be a bit pricey (at least at first), and I already have two GTX 580s, so for a while I plan to play with those :)

Thanks again for the very interesting comments!
