Re: [computer-go] Bitmap Go revisited and mockup implementation
I only showed the first line. There were many more. Gotta go to work though. René van de Veerdonk wrote: Micheal, Is that all the output you get? Did you notice that the first line with test-error changed ... the second number went from 31 to 32. We did something! But not what I expected. BTW. The program does not normally crash, it just stops after its initial testing is completed and errors were detected. If you actually went from a gazillion line to just one, we have hit a nerve. I you just didn't bother to copy the rest of the output, we made no progress. But, since the output actually changed, it would still be nice to see a few more lines and see the trend. Thanks for helping ... debugging for someone else must not be the best use of your time, René On Tue, Aug 25, 2009 at 8:03 PM, Michael Williams michaelwilliam...@gmail.com mailto:michaelwilliam...@gmail.com wrote: No. The code is now: __forceinline unsigned int bitmap_t::lowest_index () const { const __m128i comp = _mm_cmpeq_epi32 (_mm_setzero_si128 (), board); unsigned long m = _mm_movemask_epi8 (comp) ^ 0x; if (!_BitScanForward (m, m)) return 127; /* offending instruction */ //return 8*m + __lzcnt (board.m128i_u32[m/4]); /* new code follows */ unsigned long n; _BitScanForward (n, board.m128i_u32[m/4]); return 8*m + (32-n); } Debug: testing _BitScanForward (m, 0): 0 1099227377 testing _BitScanForward (m, 1): 1 0 testing _BitScanForward (m, 256): 1 8 testing _BitScanReverse (m, 0): 0 1103042705 testing _BitScanReverse (m, 1): 1 0 testing _BitScanReverse (m, 256): 1 8 test-error [bitmap_t::lowest_index]: 0 32 Release: testing _BitScanForward (m, 0): 0 16843009 testing _BitScanForward (m, 1): 1 0 testing _BitScanForward (m, 256): 1 8 testing _BitScanReverse (m, 0): 0 4 testing _BitScanReverse (m, 1): 1 0 testing _BitScanReverse (m, 256): 1 8 test-error [bitmap_t::lowest_index]: 0 32 René van de Veerdonk wrote: Micheal, Your output is actually correct. The output is undefined when the second argument to _BitScanForward is zero and that is what I am explicitly testing for in the method bitmap_t::lowest_index, which is supposed to return an index to any set bit (despite the name). Your output shows that it returns a number that is off in a very predictable way (it seems that replacing 8*m + n with 8*m + (32-n) on the last line may do the trick for you, but it breaks my local copy). So, now I am positively baffled. Help? René On Tue, Aug 25, 2009 at 7:25 PM, Michael Williams michaelwilliam...@gmail.com mailto:michaelwilliam...@gmail.com mailto:michaelwilliam...@gmail.com mailto:michaelwilliam...@gmail.com wrote: Debug build: testing _BitScanForward (m, 0): 0 842032512 testing _BitScanForward (m, 1): 1 0 testing _BitScanForward (m, 256): 1 8 testing _BitScanReverse (m, 0): 0 840277132 testing _BitScanReverse (m, 1): 1 0 testing _BitScanReverse (m, 256): 1 8 Release build: testing _BitScanForward (m, 0): 0 16843009 testing _BitScanForward (m, 1): 1 0 testing _BitScanForward (m, 256): 1 8 testing _BitScanReverse (m, 0): 0 4 testing _BitScanReverse (m, 1): 1 0 testing _BitScanReverse (m, 256): 1 8 Sucks when they produce different results. René van de Veerdonk wrote: Micheal, Thanks for trying. Could you try one or two more things? (1) Replace the _BitScanForward in the new code to _BitScanReverse. This probably won't help. (2) Add this to bitmapgo.cpp: /* sandbox */ const bool sandbox = true; void test_sandbox () { unsigned long m = 220; unsigned char test_it = _BitScanForward (m, 0); std::cout testing _BitScanForward (m, 0): (int) test_it m \n; test_it = _BitScanForward (m, 1); std::cout testing _BitScanForward (m, 1): (int) test_it m \n; test_it = _BitScanForward (m, 256); std::cout testing _BitScanForward (m, 256): (int) test_it m \n; test_it = _BitScanReverse (m, 0); std::cout testing _BitScanReverse (m, 0): (int) test_it m \n; test_it = _BitScanReverse (m, 1); std::cout testing _BitScanReverse (m, 1): (int) test_it m \n; test_it = _BitScanReverse (m, 256); std::cout testing _BitScanReverse (m, 256):
Re: [computer-go] Bitmap Go revisited and mockup implementation
Hi René, David, Confession: I have not tested 19x19. As you have noted, and others before you over the years, a 19x19 board does not fit in one but three 128-bit registers and there would be a rather big penalty as a result, perhaps (likely?) wiping out all of the benefits of bitmaps. Antoine voiced his disappointment about the speed advantage even on 9x9 in the e-mail to the list that served as my basis. My hope, as Hideko Kato points out, is in longer registers or perhaps being able to port this to a GPU. A 64-bit OS with twice the number registers would also surely help out. Nevertheless, I was surprised that I was able to get to almost the same level of speed as Libego provides. As far as I remember, I was not disappointed by the speed itself on 9x9 boards, but mainly with the following 2 points: a) my feeling is that, as you say, it does not scale very well on 19x19 (on current hardware). I don't think other classical implementations suffer such a big penalty when scaling from 9x9 to 19x19. b) I also felt that this kind of implementation was not very flexible. For instance, I had another classical implementation, running at equivalent speed, but in which local 3x3 pattern matching was almost for free, as well as other more elaborated information. When I started to think about introducing 3x3 local patterns in the bitmap only implementation, it was clear it would not be for free. At that time, my conclusion was that if one only needs pure random play with no intelligence at all during playouts, then bitmap implementation could compete (at least on 9x9). If one needs more elaborated knowledge (such as small patterns matching, or even knowledge about blocks of stones), then pure bitmap implementation is probably not so competitive. I thus gave up with the idea and jumped to more promising ones. Anyway, I'm glad my post has been usefull to you. And I encourage you to improve your implementation and let us know, especially if you have fun. Starting with something and playing with it is a good way to have new ideas, even if this makes your initial ones look less interesting a while after. Best regards, Antoine ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Bitmap Go revisited and mockup implementation
(AMD Phenom CPU) Michael Williams wrote: It came through fine the first time. Gmail probably hid the duplicate text from you for convenience since it saw that it was text that you already sent out. Or something. I can compile the original (9x9) bitmap Go files in Windows 7 x64, but when I run it I get this: test-error [bitmap_t::lowest_index]: 0 31 test-error [bitmap_t::lowest_index]: 1 30 test-error [bitmap_t::lowest_index]: 2 29 test-error [bitmap_t::lowest_index]: 3 28 test-error [bitmap_t::lowest_index]: 4 27 test-error [bitmap_t::lowest_index]: 5 26 test-error [bitmap_t::lowest_index]: 6 25 test-error [bitmap_t::lowest_index]: 7 24 test-error [bitmap_t::lowest_index]: 8 23 test-error [bitmap_t::lowest_index]: 9 22 test-error [bitmap_t::lowest_index]: 10 21 test-error [bitmap_t::lowest_index]: 11 20 test-error [bitmap_t::lowest_index]: 12 19 test-error [bitmap_t::lowest_index]: 13 18 test-error [bitmap_t::lowest_index]: 14 17 test-error [bitmap_t::lowest_index]: 15 16 test-error [bitmap_t::lowest_index]: 16 15 test-error [bitmap_t::lowest_index]: 17 14 test-error [bitmap_t::lowest_index]: 18 13 test-error [bitmap_t::lowest_index]: 19 12 test-error [bitmap_t::lowest_index]: 20 11 test-error [bitmap_t::lowest_index]: 21 10 test-error [bitmap_t::lowest_index]: 22 9 test-error [bitmap_t::lowest_index]: 23 8 test-error [bitmap_t::lowest_index]: 24 7 test-error [bitmap_t::lowest_index]: 25 6 test-error [bitmap_t::lowest_index]: 26 5 test-error [bitmap_t::lowest_index]: 27 4 test-error [bitmap_t::lowest_index]: 28 3 test-error [bitmap_t::lowest_index]: 29 2 test-error [bitmap_t::lowest_index]: 30 1 test-error [bitmap_t::lowest_index]: 31 0 test-error [bitmap_t::lowest_index]: 32 63 test-error [bitmap_t::lowest_index]: 33 62 test-error [bitmap_t::lowest_index]: 34 61 test-error [bitmap_t::lowest_index]: 35 60 test-error [bitmap_t::lowest_index]: 36 59 test-error [bitmap_t::lowest_index]: 37 58 test-error [bitmap_t::lowest_index]: 38 57 test-error [bitmap_t::lowest_index]: 39 56 test-error [bitmap_t::lowest_index]: 40 55 test-error [bitmap_t::lowest_index]: 41 54 test-error [bitmap_t::lowest_index]: 42 53 test-error [bitmap_t::lowest_index]: 43 52 test-error [bitmap_t::lowest_index]: 44 51 test-error [bitmap_t::lowest_index]: 45 50 test-error [bitmap_t::lowest_index]: 46 49 test-error [bitmap_t::lowest_index]: 47 48 test-error [bitmap_t::lowest_index]: 48 47 test-error [bitmap_t::lowest_index]: 49 46 test-error [bitmap_t::lowest_index]: 50 45 test-error [bitmap_t::lowest_index]: 51 44 test-error [bitmap_t::lowest_index]: 52 43 test-error [bitmap_t::lowest_index]: 53 42 test-error [bitmap_t::lowest_index]: 54 41 test-error [bitmap_t::lowest_index]: 55 40 test-error [bitmap_t::lowest_index]: 56 39 test-error [bitmap_t::lowest_index]: 57 38 test-error [bitmap_t::lowest_index]: 58 37 test-error [bitmap_t::lowest_index]: 59 36 test-error [bitmap_t::lowest_index]: 60 35 test-error [bitmap_t::lowest_index]: 61 34 test-error [bitmap_t::lowest_index]: 62 33 test-error [bitmap_t::lowest_index]: 63 32 test-error [bitmap_t::lowest_index]: 64 95 test-error [bitmap_t::lowest_index]: 65 94 test-error [bitmap_t::lowest_index]: 66 93 test-error [bitmap_t::lowest_index]: 67 92 test-error [bitmap_t::lowest_index]: 68 91 test-error [bitmap_t::lowest_index]: 69 90 test-error [bitmap_t::lowest_index]: 70 89 test-error [bitmap_t::lowest_index]: 71 88 test-error [bitmap_t::lowest_index]: 72 87 test-error [bitmap_t::lowest_index]: 73 86 test-error [bitmap_t::lowest_index]: 74 85 test-error [bitmap_t::lowest_index]: 75 84 test-error [bitmap_t::lowest_index]: 76 83 test-error [bitmap_t::lowest_index]: 77 82 test-error [bitmap_t::lowest_index]: 78 81 test-error [bitmap_t::lowest_index]: 79 80 test-error [bitmap_t::lowest_index]: 80 79 test-error-1 [bitmap_t::is_one_bit]: 0 test-error-1 [bitmap_t::is_one_bit]: 1 test-error-1 [bitmap_t::is_one_bit]: 2 test-error-1 [bitmap_t::is_one_bit]: 3 test-error-1 [bitmap_t::is_one_bit]: 4 test-error-1 [bitmap_t::is_one_bit]: 5 test-error-1 [bitmap_t::is_one_bit]: 6 test-error-1 [bitmap_t::is_one_bit]: 7 test-error-1 [bitmap_t::is_one_bit]: 8 test-error-1 [bitmap_t::is_one_bit]: 9 test-error-1 [bitmap_t::is_one_bit]: 10 test-error-1 [bitmap_t::is_one_bit]: 11 test-error-1 [bitmap_t::is_one_bit]: 12 test-error-1 [bitmap_t::is_one_bit]: 13 test-error-1 [bitmap_t::is_one_bit]: 14 test-error-1 [bitmap_t::is_one_bit]: 15 test-error-1 [bitmap_t::is_one_bit]: 16 test-error-1 [bitmap_t::is_one_bit]: 17 test-error-1 [bitmap_t::is_one_bit]: 18 test-error-1 [bitmap_t::is_one_bit]: 19 test-error-1 [bitmap_t::is_one_bit]: 20 test-error-1 [bitmap_t::is_one_bit]: 21 test-error-1 [bitmap_t::is_one_bit]: 22 test-error-1 [bitmap_t::is_one_bit]: 23 test-error-1 [bitmap_t::is_one_bit]: 24 test-error-1 [bitmap_t::is_one_bit]: 25 test-error-1 [bitmap_t::is_one_bit]: 26 test-error-1 [bitmap_t::is_one_bit]: 27 test-error-1 [bitmap_t::is_one_bit]: 28 test-error-1 [bitmap_t::is_one_bit]: 29 test-error-1
Re: [computer-go] Bitmap Go revisited and mockup implementation
Hi Micheal, Thanks for the test (Intel - AMD, Windows XP - Windows 7, x32 - x64, great, three birds with one stone!). So much for portability. But, hooray for test-cases! This may be related to the __lzcnt intrinsic, giving you a completely different result, i.e., 32 - my result. Could you try replacing this function in bitmap.cpp? I believe I tested this alternative during development and it generates only slightly worse performing code. __forceinline unsigned int bitmap_t::lowest_index () const { const __m128i comp = _mm_cmpeq_epi32 (_mm_setzero_si128 (), board); unsigned long m = _mm_movemask_epi8 (comp) ^ 0x; if (!_BitScanForward (m, m)) return 127; /* offending instruction */ //return 8*m + __lzcnt (board.m128i_u32[m/4]); /* new code follows */ unsigned long n; _BitScanForward (n, board.m128i_u32[m/4]); return 8*m + n; } René PS. x64 ... new playground, you could use __lzcnt64 ... so many things to try! On Tue, Aug 25, 2009 at 5:43 PM, Michael Williams michaelwilliam...@gmail.com wrote: (AMD Phenom CPU) Michael Williams wrote: It came through fine the first time. Gmail probably hid the duplicate text from you for convenience since it saw that it was text that you already sent out. Or something. I can compile the original (9x9) bitmap Go files in Windows 7 x64, but when I run it I get this: test-error [bitmap_t::lowest_index]: 0 31 test-error [bitmap_t::lowest_index]: 1 30 test-error [bitmap_t::lowest_index]: 2 29 test-error [bitmap_t::lowest_index]: 3 28 test-error [bitmap_t::lowest_index]: 4 27 test-error [bitmap_t::lowest_index]: 5 26 test-error [bitmap_t::lowest_index]: 6 25 test-error [bitmap_t::lowest_index]: 7 24 test-error [bitmap_t::lowest_index]: 8 23 test-error [bitmap_t::lowest_index]: 9 22 test-error [bitmap_t::lowest_index]: 10 21 test-error [bitmap_t::lowest_index]: 11 20 test-error [bitmap_t::lowest_index]: 12 19 test-error [bitmap_t::lowest_index]: 13 18 test-error [bitmap_t::lowest_index]: 14 17 test-error [bitmap_t::lowest_index]: 15 16 test-error [bitmap_t::lowest_index]: 16 15 test-error [bitmap_t::lowest_index]: 17 14 test-error [bitmap_t::lowest_index]: 18 13 test-error [bitmap_t::lowest_index]: 19 12 test-error [bitmap_t::lowest_index]: 20 11 test-error [bitmap_t::lowest_index]: 21 10 test-error [bitmap_t::lowest_index]: 22 9 test-error [bitmap_t::lowest_index]: 23 8 test-error [bitmap_t::lowest_index]: 24 7 test-error [bitmap_t::lowest_index]: 25 6 test-error [bitmap_t::lowest_index]: 26 5 test-error [bitmap_t::lowest_index]: 27 4 test-error [bitmap_t::lowest_index]: 28 3 test-error [bitmap_t::lowest_index]: 29 2 test-error [bitmap_t::lowest_index]: 30 1 test-error [bitmap_t::lowest_index]: 31 0 test-error [bitmap_t::lowest_index]: 32 63 test-error [bitmap_t::lowest_index]: 33 62 test-error [bitmap_t::lowest_index]: 34 61 test-error [bitmap_t::lowest_index]: 35 60 test-error [bitmap_t::lowest_index]: 36 59 test-error [bitmap_t::lowest_index]: 37 58 test-error [bitmap_t::lowest_index]: 38 57 test-error [bitmap_t::lowest_index]: 39 56 test-error [bitmap_t::lowest_index]: 40 55 test-error [bitmap_t::lowest_index]: 41 54 test-error [bitmap_t::lowest_index]: 42 53 test-error [bitmap_t::lowest_index]: 43 52 test-error [bitmap_t::lowest_index]: 44 51 test-error [bitmap_t::lowest_index]: 45 50 test-error [bitmap_t::lowest_index]: 46 49 test-error [bitmap_t::lowest_index]: 47 48 test-error [bitmap_t::lowest_index]: 48 47 test-error [bitmap_t::lowest_index]: 49 46 test-error [bitmap_t::lowest_index]: 50 45 test-error [bitmap_t::lowest_index]: 51 44 test-error [bitmap_t::lowest_index]: 52 43 test-error [bitmap_t::lowest_index]: 53 42 test-error [bitmap_t::lowest_index]: 54 41 test-error [bitmap_t::lowest_index]: 55 40 test-error [bitmap_t::lowest_index]: 56 39 test-error [bitmap_t::lowest_index]: 57 38 test-error [bitmap_t::lowest_index]: 58 37 test-error [bitmap_t::lowest_index]: 59 36 test-error [bitmap_t::lowest_index]: 60 35 test-error [bitmap_t::lowest_index]: 61 34 test-error [bitmap_t::lowest_index]: 62 33 test-error [bitmap_t::lowest_index]: 63 32 test-error [bitmap_t::lowest_index]: 64 95 test-error [bitmap_t::lowest_index]: 65 94 test-error [bitmap_t::lowest_index]: 66 93 test-error [bitmap_t::lowest_index]: 67 92 test-error [bitmap_t::lowest_index]: 68 91 test-error [bitmap_t::lowest_index]: 69 90 test-error [bitmap_t::lowest_index]: 70 89 test-error [bitmap_t::lowest_index]: 71 88 test-error [bitmap_t::lowest_index]: 72 87 test-error [bitmap_t::lowest_index]: 73 86 test-error [bitmap_t::lowest_index]: 74 85 test-error [bitmap_t::lowest_index]: 75 84 test-error [bitmap_t::lowest_index]: 76 83 test-error [bitmap_t::lowest_index]: 77 82 test-error [bitmap_t::lowest_index]: 78 81 test-error [bitmap_t::lowest_index]: 79 80 test-error [bitmap_t::lowest_index]: 80 79 test-error-1 [bitmap_t::is_one_bit]: 0 test-error-1 [bitmap_t::is_one_bit]: 1 test-error-1
Re: [computer-go] Bitmap Go revisited and mockup implementation
That doesn't seem to have any effect on the results. René van de Veerdonk wrote: Hi Micheal, Thanks for the test (Intel - AMD, Windows XP - Windows 7, x32 - x64, great, three birds with one stone!). So much for portability. But, hooray for test-cases! This may be related to the __lzcnt intrinsic, giving you a completely different result, i.e., 32 - my result. Could you try replacing this function in bitmap.cpp? I believe I tested this alternative during development and it generates only slightly worse performing code. __forceinline unsigned int bitmap_t::lowest_index () const { const __m128i comp = _mm_cmpeq_epi32 (_mm_setzero_si128 (), board); unsigned long m = _mm_movemask_epi8 (comp) ^ 0x; if (!_BitScanForward (m, m)) return 127; /* offending instruction */ //return 8*m + __lzcnt (board.m128i_u32[m/4]); /* new code follows */ unsigned long n; _BitScanForward (n, board.m128i_u32[m/4]); return 8*m + n; } René PS. x64 ... new playground, you could use __lzcnt64 ... so many things to try! On Tue, Aug 25, 2009 at 5:43 PM, Michael Williams michaelwilliam...@gmail.com mailto:michaelwilliam...@gmail.com wrote: (AMD Phenom CPU) Michael Williams wrote: It came through fine the first time. Gmail probably hid the duplicate text from you for convenience since it saw that it was text that you already sent out. Or something. I can compile the original (9x9) bitmap Go files in Windows 7 x64, but when I run it I get this: test-error [bitmap_t::lowest_index]: 0 31 test-error [bitmap_t::lowest_index]: 1 30 test-error [bitmap_t::lowest_index]: 2 29 test-error [bitmap_t::lowest_index]: 3 28 test-error [bitmap_t::lowest_index]: 4 27 test-error [bitmap_t::lowest_index]: 5 26 test-error [bitmap_t::lowest_index]: 6 25 test-error [bitmap_t::lowest_index]: 7 24 test-error [bitmap_t::lowest_index]: 8 23 test-error [bitmap_t::lowest_index]: 9 22 test-error [bitmap_t::lowest_index]: 10 21 test-error [bitmap_t::lowest_index]: 11 20 test-error [bitmap_t::lowest_index]: 12 19 test-error [bitmap_t::lowest_index]: 13 18 test-error [bitmap_t::lowest_index]: 14 17 test-error [bitmap_t::lowest_index]: 15 16 test-error [bitmap_t::lowest_index]: 16 15 test-error [bitmap_t::lowest_index]: 17 14 test-error [bitmap_t::lowest_index]: 18 13 test-error [bitmap_t::lowest_index]: 19 12 test-error [bitmap_t::lowest_index]: 20 11 test-error [bitmap_t::lowest_index]: 21 10 test-error [bitmap_t::lowest_index]: 22 9 test-error [bitmap_t::lowest_index]: 23 8 test-error [bitmap_t::lowest_index]: 24 7 test-error [bitmap_t::lowest_index]: 25 6 test-error [bitmap_t::lowest_index]: 26 5 test-error [bitmap_t::lowest_index]: 27 4 test-error [bitmap_t::lowest_index]: 28 3 test-error [bitmap_t::lowest_index]: 29 2 test-error [bitmap_t::lowest_index]: 30 1 test-error [bitmap_t::lowest_index]: 31 0 test-error [bitmap_t::lowest_index]: 32 63 test-error [bitmap_t::lowest_index]: 33 62 test-error [bitmap_t::lowest_index]: 34 61 test-error [bitmap_t::lowest_index]: 35 60 test-error [bitmap_t::lowest_index]: 36 59 test-error [bitmap_t::lowest_index]: 37 58 test-error [bitmap_t::lowest_index]: 38 57 test-error [bitmap_t::lowest_index]: 39 56 test-error [bitmap_t::lowest_index]: 40 55 test-error [bitmap_t::lowest_index]: 41 54 test-error [bitmap_t::lowest_index]: 42 53 test-error [bitmap_t::lowest_index]: 43 52 test-error [bitmap_t::lowest_index]: 44 51 test-error [bitmap_t::lowest_index]: 45 50 test-error [bitmap_t::lowest_index]: 46 49 test-error [bitmap_t::lowest_index]: 47 48 test-error [bitmap_t::lowest_index]: 48 47 test-error [bitmap_t::lowest_index]: 49 46 test-error [bitmap_t::lowest_index]: 50 45 test-error [bitmap_t::lowest_index]: 51 44 test-error [bitmap_t::lowest_index]: 52 43 test-error [bitmap_t::lowest_index]: 53 42 test-error [bitmap_t::lowest_index]: 54 41 test-error [bitmap_t::lowest_index]: 55 40 test-error [bitmap_t::lowest_index]: 56 39 test-error [bitmap_t::lowest_index]: 57 38 test-error [bitmap_t::lowest_index]: 58 37 test-error [bitmap_t::lowest_index]: 59 36 test-error [bitmap_t::lowest_index]: 60 35 test-error [bitmap_t::lowest_index]: 61 34 test-error [bitmap_t::lowest_index]: 62 33 test-error [bitmap_t::lowest_index]: 63 32 test-error [bitmap_t::lowest_index]: 64 95 test-error [bitmap_t::lowest_index]: 65 94 test-error [bitmap_t::lowest_index]: 66 93 test-error [bitmap_t::lowest_index]: 67 92 test-error
Re: [computer-go] Bitmap Go revisited and mockup implementation
Zach, I can't help the windowsiness, but thanks for taking a look already. Popcounting I believe to be optimal for 32-bit only instructions, especially since I would like to keep the intermediate byte-count to help speed up the random pick function. There is a really neat trick to speed up the final accumulations, but I couldn't find the sse2 instructions to implement it. Bitscanning is indeed awkward the way I implemented it, but I didn't see a better way to do it as there is no compare instruction that yields a boolean as the result. I would be interested to see what you come up with for shifting, I believe that this is where a lot of time is spend. René On Tue, Aug 25, 2009 at 6:20 PM, Zach Wegner zweg...@gmail.com wrote: On an initial look, it seems that the shifting, popcounting, and bitscanning functions can be improved significantly. That's all I looked at closely, perhaps the other board routines could use some reworking too. I'll try my hand at optimizing it if I get the time, but the windowsy code makes it a bit difficult. Zach ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Bitmap Go revisited and mockup implementation
A couple of notes. Some of us on this computer-go forum are also Chess programmers and many of write 64 bit chess programs. I am one of them but I know there are others. My 64 bit chess program runs almost 2X faster if I compile it to run on a 64 bit OS - in other words I'm using real 64 bit values to represent the bit boards.So this really is a big deal - you can expect a pretty good gain. I also toyed with a bit board style go program and started working on one long ago. You still need several 64 bit words to represent the state of each color.I started to write some routines that did shift, popcount, etc as fast as possible. Basically you had to compile the program for each board size, and it figured out how many bits were needed and simulated (using struct) an integer data type that was large enough. About half way through I decided not to proceed - at the time I had some idea that I was more interested in and I never picked it back up - assuming that it would be too slow (but I didn't prove it.) I think you could look at glaurung (the chess program) to find great routines for the common bit operations. I think the author of Glaurung does it about as fast as it can be done and even has assembly code for finding the right-most bit, or something like that. But you could probably learn something from looking at that program. I'm not sure how useful this might be for go, but there are some pretty clever tricks with minimal perfect hashing - that you might find some use for. Perhaps there is a way to make this work for eye-detection or someting. There is some great explanations here: http://chessprogramming.wikispaces.com/I think you want to look under move generation perhaps. Perhaps this cannot be used for GO, but it's fun to read anyway :-) The only issue of course is that you need more than 64 bits for go - but the general principles might be made to work. - Don 2009/8/25 René van de Veerdonk rene.vandeveerd...@gmail.com Antoine, I apologize for misrepresenting your feelings. What I wanted to convey is that your e-mail expressed that with the same amount of computation, other implementation strategies provide more information, so the information gained for the effort put in is disappointing. In other words, bitmap-go had better be much faster to make up for the lack of information gained ... and it wasn't. That's pretty much what you are saying as well, I believe. As are the others in this thread. I think we can all agree on this. Going into this project, I was well aware that I was going to end up with light playouts only, and that heavy playouts is the way to go except perhaps for some special purposes such as ladder readouts. I was also well aware of the scaling argument. But, I hadn't thought that this would be the first thing thrown at my first post, as there are many programmers that seem to be stuck at 9x9. Nonetheless, bitmap-go as a topic keeps resurfacing on this mailing list every once in a while and nobody ever put solid data and a reference implementation behind it. That is what I wanted to accomplish with my mockup. Confession #2: I have been working on my own MCTS implementation using the standard implementation methods that almost everybody else is using. But there is a never-ending laundry-list of things that I would like to include before I would reach reasonable strength (where reasonable is a moving target). In the mean time, there are many others that have demonstrated much more capable programmers than I ever will be. So, by providing this bitmap-go mockup at least I had the feeling that I was contributing something newsworthy to the conversation. This may have never happened if I would wait until my MCTS program is ready. I imagine that there are others on this list in a similar situation. Besides, this was an interesting side-project and perhaps someone else can benefit from it (go for it Brian!). And, yes, it was fun. Okay, enough mesmerizing, on to the main topic. David, Lukasz, I did modify my mockup to do 19x19 bitmap-go (attached). It is a hardcoded solution using arrays of three 128-bit variables. I did not even attempt to optimize this version, so this is not the best possible solution. Nonetheless, here is a comparison: Example output (9x9): = [game] = 30236.1, [moves] = 111.071 [game] = 30249.7, [moves] = 111.068 [game] = 30145.7, [moves] = 111.089 [game] = 30237.7, [moves] = 111.122 [game] = 30210.1, [moves] = 111.101 [game] = 78.0488 kpps, [moves] = 111.023 [game] = 78.0488 kpps, [moves] = 111.045 [game] = 78.0488 kpps, [moves] = 111.046 [game] = 79.0123 kpps, [moves] = 111.131 [game] = 78.0488 kpps, [moves] = 111.082 [legal] 110/51, [pick] 110/74, [play] 106/168, [score] 44, [win] 0.4187 [legal] 111/51, [pick] 111/74, [play] 106/168, [score] 40, [win] 0.4201 [legal] 111/52, [pick] 111/75, [play] 106/168, [score] 42, [win] 0.4276
Re: [computer-go] Bitmap Go revisited and mockup implementation
Mark, I would argue that 10x vs 5x is not a deal-breaker if bitmaps would support your particular goals better. They do provide the advantage of automatically providing you all the liberties of strings, for instance. Requests on how to do that efficiently come by on this mailing list on a regular basis. But I agree that it does not look very good in general. As for merging being time consuming. The merge portion of the play function looks like this: /* merge friendly strings */ bool single_stone_string = true; do { bitmap_t adj_strings = bm_adjacent bm_my_points; while (!adj_strings.is_empty ()) { unsigned int i = adj_strings.lowest_index (); do { i = anchor[i]; } while (i != anchor[i]); anchor[i] = move; single_stone_string = false; adj_strings.clear (string[i]); bm_string.set (string[i]); bm_liberty.set (liberty[i]); } bm_liberty.clear (bm_move); atari.clear (bm_string); } while (false); The difference between 9x9 and 19x19 is in the average length of the linked list anchor[]. The merge itself should be the same work, other than the fact that for 19x19 I now have to perform the work on three elements per bitmap as compared to just one for 9x9. Moving away from a linked list may prove beneficial for 19x19, but than you need to ensure that all the anchors for a string are correctly set during merge. That means looping over all the stones in the string, which I do not believe to be very efficient with bitmaps as it requires cumbersome bit-scanning. René On Tue, Aug 25, 2009 at 6:09 PM, Mark Boon tesujisoftw...@gmail.com wrote: 2009/8/25 René van de Veerdonk rene.vandeveerd...@gmail.com: Nonetheless, bitmap-go as a topic keeps resurfacing on this mailing list every once in a while and nobody ever put solid data and a reference implementation behind it. That is what I wanted to accomplish with my mockup. I think this is interesting information, so I think it's good you tried it. It basically means unless someone has a much smarter idea how to implement bit-sets, the performance is not enough to go through the trouble for most programmers. So, it looks that I was overly pessimistic in terms of performance drop per move, which is a factor of 2.4x (with little effort to optimize for 19x19). But, with 4.1x more moves, this still resulted in a 10x speed penalty. With the provided reference numbers, Libego only drops 4.5-5.0x, indicating that there is almost no performance loss per move (1.2x). Is this difference roughly in line with others expectation? Yes, I'd expect that. The main speed-determining factor in traditional implementations is merging. That depends on the average length of the merge. Obviously this can be much longer on 19x19 as a worst case, but on average I wouldn't expect it to make much difference compared to other board-sizes. Mark ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] Bitmap Go revisited and mockup implementation
Debug build: testing _BitScanForward (m, 0): 0 842032512 testing _BitScanForward (m, 1): 1 0 testing _BitScanForward (m, 256): 1 8 testing _BitScanReverse (m, 0): 0 840277132 testing _BitScanReverse (m, 1): 1 0 testing _BitScanReverse (m, 256): 1 8 Release build: testing _BitScanForward (m, 0): 0 16843009 testing _BitScanForward (m, 1): 1 0 testing _BitScanForward (m, 256): 1 8 testing _BitScanReverse (m, 0): 0 4 testing _BitScanReverse (m, 1): 1 0 testing _BitScanReverse (m, 256): 1 8 Sucks when they produce different results. René van de Veerdonk wrote: Micheal, Thanks for trying. Could you try one or two more things? (1) Replace the _BitScanForward in the new code to _BitScanReverse. This probably won't help. (2) Add this to bitmapgo.cpp: /* sandbox */ const bool sandbox = true; void test_sandbox () { unsigned long m = 220; unsigned char test_it = _BitScanForward (m, 0); std::cout testing _BitScanForward (m, 0): (int) test_it m \n; test_it = _BitScanForward (m, 1); std::cout testing _BitScanForward (m, 1): (int) test_it m \n; test_it = _BitScanForward (m, 256); std::cout testing _BitScanForward (m, 256): (int) test_it m \n; test_it = _BitScanReverse (m, 0); std::cout testing _BitScanReverse (m, 0): (int) test_it m \n; test_it = _BitScanReverse (m, 1); std::cout testing _BitScanReverse (m, 1): (int) test_it m \n; test_it = _BitScanReverse (m, 256); std::cout testing _BitScanReverse (m, 256): (int) test_it m \n; } and add a line something like this in the main program: if (test sandbox) test_sandbox (); /* new line */ if (test verify) test = test_bitmap_t::test (); /* already there */ That does not solve anything, but it may tell me if those functions are doing for you what they do for me. My output is: testing _BitScanForward (m, 0): 0 16843009 testing _BitScanForward (m, 1): 1 0 testing _BitScanForward (m, 256): 1 8 testing _BitScanReverse (m, 0): 0 3432104 testing _BitScanReverse (m, 1): 1 0 testing _BitScanReverse (m, 256): 1 8 Thanks, I really appreciate your help ! René On Tue, Aug 25, 2009 at 6:38 PM, Michael Williams michaelwilliam...@gmail.com mailto:michaelwilliam...@gmail.com wrote: That doesn't seem to have any effect on the results. René van de Veerdonk wrote: Hi Micheal, Thanks for the test (Intel - AMD, Windows XP - Windows 7, x32 - x64, great, three birds with one stone!). So much for portability. But, hooray for test-cases! This may be related to the __lzcnt intrinsic, giving you a completely different result, i.e., 32 - my result. Could you try replacing this function in bitmap.cpp? I believe I tested this alternative during development and it generates only slightly worse performing code. __forceinline unsigned int bitmap_t::lowest_index () const { const __m128i comp = _mm_cmpeq_epi32 (_mm_setzero_si128 (), board); unsigned long m = _mm_movemask_epi8 (comp) ^ 0x; if (!_BitScanForward (m, m)) return 127; /* offending instruction */ //return 8*m + __lzcnt (board.m128i_u32[m/4]); /* new code follows */ unsigned long n; _BitScanForward (n, board.m128i_u32[m/4]); return 8*m + n; } René PS. x64 ... new playground, you could use __lzcnt64 ... so many things to try! On Tue, Aug 25, 2009 at 5:43 PM, Michael Williams michaelwilliam...@gmail.com mailto:michaelwilliam...@gmail.com mailto:michaelwilliam...@gmail.com mailto:michaelwilliam...@gmail.com wrote: (AMD Phenom CPU) Michael Williams wrote: It came through fine the first time. Gmail probably hid the duplicate text from you for convenience since it saw that it was text that you already sent out. Or something. I can compile the original (9x9) bitmap Go files in Windows 7 x64, but when I run it I get this: test-error [bitmap_t::lowest_index]: 0 31 test-error [bitmap_t::lowest_index]: 1 30 test-error [bitmap_t::lowest_index]: 2 29 test-error [bitmap_t::lowest_index]: 3 28 test-error [bitmap_t::lowest_index]: 4 27 test-error [bitmap_t::lowest_index]: 5 26 test-error [bitmap_t::lowest_index]: 6 25 test-error [bitmap_t::lowest_index]: 7 24 test-error [bitmap_t::lowest_index]: 8 23 test-error [bitmap_t::lowest_index]: 9 22 test-error [bitmap_t::lowest_index]: 10 21 test-error [bitmap_t::lowest_index]: 11 20 test-error [bitmap_t::lowest_index]: 12 19 test-error [bitmap_t::lowest_index]: 13 18 test-error [bitmap_t::lowest_index]: 14 17
Re: [computer-go] Bitmap Go revisited and mockup implementation
No. The code is now: __forceinline unsigned int bitmap_t::lowest_index () const { const __m128i comp = _mm_cmpeq_epi32 (_mm_setzero_si128 (), board); unsigned long m = _mm_movemask_epi8 (comp) ^ 0x; if (!_BitScanForward (m, m)) return 127; /* offending instruction */ //return 8*m + __lzcnt (board.m128i_u32[m/4]); /* new code follows */ unsigned long n; _BitScanForward (n, board.m128i_u32[m/4]); return 8*m + (32-n); } Debug: testing _BitScanForward (m, 0): 0 1099227377 testing _BitScanForward (m, 1): 1 0 testing _BitScanForward (m, 256): 1 8 testing _BitScanReverse (m, 0): 0 1103042705 testing _BitScanReverse (m, 1): 1 0 testing _BitScanReverse (m, 256): 1 8 test-error [bitmap_t::lowest_index]: 0 32 Release: testing _BitScanForward (m, 0): 0 16843009 testing _BitScanForward (m, 1): 1 0 testing _BitScanForward (m, 256): 1 8 testing _BitScanReverse (m, 0): 0 4 testing _BitScanReverse (m, 1): 1 0 testing _BitScanReverse (m, 256): 1 8 test-error [bitmap_t::lowest_index]: 0 32 René van de Veerdonk wrote: Micheal, Your output is actually correct. The output is undefined when the second argument to _BitScanForward is zero and that is what I am explicitly testing for in the method bitmap_t::lowest_index, which is supposed to return an index to any set bit (despite the name). Your output shows that it returns a number that is off in a very predictable way (it seems that replacing 8*m + n with 8*m + (32-n) on the last line may do the trick for you, but it breaks my local copy). So, now I am positively baffled. Help? René On Tue, Aug 25, 2009 at 7:25 PM, Michael Williams michaelwilliam...@gmail.com mailto:michaelwilliam...@gmail.com wrote: Debug build: testing _BitScanForward (m, 0): 0 842032512 testing _BitScanForward (m, 1): 1 0 testing _BitScanForward (m, 256): 1 8 testing _BitScanReverse (m, 0): 0 840277132 testing _BitScanReverse (m, 1): 1 0 testing _BitScanReverse (m, 256): 1 8 Release build: testing _BitScanForward (m, 0): 0 16843009 testing _BitScanForward (m, 1): 1 0 testing _BitScanForward (m, 256): 1 8 testing _BitScanReverse (m, 0): 0 4 testing _BitScanReverse (m, 1): 1 0 testing _BitScanReverse (m, 256): 1 8 Sucks when they produce different results. René van de Veerdonk wrote: Micheal, Thanks for trying. Could you try one or two more things? (1) Replace the _BitScanForward in the new code to _BitScanReverse. This probably won't help. (2) Add this to bitmapgo.cpp: /* sandbox */ const bool sandbox = true; void test_sandbox () { unsigned long m = 220; unsigned char test_it = _BitScanForward (m, 0); std::cout testing _BitScanForward (m, 0): (int) test_it m \n; test_it = _BitScanForward (m, 1); std::cout testing _BitScanForward (m, 1): (int) test_it m \n; test_it = _BitScanForward (m, 256); std::cout testing _BitScanForward (m, 256): (int) test_it m \n; test_it = _BitScanReverse (m, 0); std::cout testing _BitScanReverse (m, 0): (int) test_it m \n; test_it = _BitScanReverse (m, 1); std::cout testing _BitScanReverse (m, 1): (int) test_it m \n; test_it = _BitScanReverse (m, 256); std::cout testing _BitScanReverse (m, 256): (int) test_it m \n; } and add a line something like this in the main program: if (test sandbox) test_sandbox (); /* new line */ if (test verify) test = test_bitmap_t::test (); /* already there */ That does not solve anything, but it may tell me if those functions are doing for you what they do for me. My output is: testing _BitScanForward (m, 0): 0 16843009 testing _BitScanForward (m, 1): 1 0 testing _BitScanForward (m, 256): 1 8 testing _BitScanReverse (m, 0): 0 3432104 testing _BitScanReverse (m, 1): 1 0 testing _BitScanReverse (m, 256): 1 8 Thanks, I really appreciate your help ! René On Tue, Aug 25, 2009 at 6:38 PM, Michael Williams michaelwilliam...@gmail.com mailto:michaelwilliam...@gmail.com mailto:michaelwilliam...@gmail.com mailto:michaelwilliam...@gmail.com wrote: That doesn't seem to have any effect on the results. René van de Veerdonk wrote: Hi Micheal, Thanks for the test (Intel - AMD, Windows XP - Windows 7, x32 - x64, great, three birds with one stone!). So much for portability. But, hooray for test-cases! This may be related to the __lzcnt intrinsic, giving you a completely different result, i.e., 32 - my result. Could you try replacing this
Re: [computer-go] Bitmap Go revisited and mockup implementation
David, Confession: I have not tested 19x19. As you have noted, and others before you over the years, a 19x19 board does not fit in one but three 128-bit registers and there would be a rather big penalty as a result, perhaps (likely?) wiping out all of the benefits of bitmaps. Antoine voiced his disappointment about the speed advantage even on 9x9 in the e-mail to the list that served as my basis. My hope, as Hideko Kato points out, is in longer registers or perhaps being able to port this to a GPU. A 64-bit OS with twice the number registers would also surely help out. Nevertheless, I was surprised that I was able to get to almost the same level of speed as Libego provides. The goals for this mockup were very humble: (1) to provide a basic implementation to see what it can do (and if I could do it), and (2) learn how to work with assembly and/or intrinsic functions (I had never done that before). It may not be too hard to try 19x19 and just see what happens. I may attempt this as a follow-up. René van de Veerdonk 2009/8/23 David Fotland fotl...@smart-games.com How much would you lose for 19x19 board? A board representation is not very interesting unless it scales to 19 line boards. David *From:* computer-go-boun...@computer-go.org [mailto: computer-go-boun...@computer-go.org] *On Behalf Of *René van de Veerdonk *Sent:* Sunday, August 23, 2009 1:11 PM *To:* computer-go@computer-go.org *Subject:* [computer-go] Bitmap Go revisited and mockup implementation Hi all, After years of reading this very interesting list, I decided to make my first contribution this week after reading once again about bitmap go. There is no freely available implementation of a bitmap based engine that I know of, so here is a start. Note that I am not a professional programmer, self-taught, and this is a part-time hobby, so do not expect perfect form. The attachment is an attempt to create an efficient implementation of a bitmap based light playout engine, following mostly the recommendations as spelled out by Antoine de Maricourt in his November 2006 message to this list. The application is single threaded and achieves around 75-80 kpps on my Centrino Duo 2.4 GHz Intel labtop. The mockup was developed in C/C++ using the Microsoft Visual Studio 2008 IDE for Windows XP (32-bit) and undoubtedly suffers from portability issues as a result. The 128-bit SSE instructions are called through intrinsic functions, hopefully at least these interfaces are the same for other compilers. The single goal of this mockup was to do some benchmarking, so there is no gtp-interface, but that would be rather easy to add for those skilled in the art. The standard rules are used with one exception, the Brown eye-rule is used. Multi-stone suicide is explicitly forbidden. No superko checks are performed during playouts, and there is no mercy-rule either. I would be particularly interested to know if a 64-bit OS will improve the performance significantly, as it does have a lot less register pressure. Moving to a 64-bit OS will also make some instructions available that allow for further simplifications/efficiency improvements in the code. As will enabling SSE3 or SSE4.1 instructions. One note, the random number generator used is from Agner Fog's public library and was not included in the attachment. You will have to provide a random number generator yourself or download the appropriate version from his well-known website (www.agner.org/optimize) I hope this is the start of a discussion about further improvements that can be made, but you may use this code as you wish with no restrictions or guarantees. See the readme.txt file included in the attachment for more details (also partially included below). Comments and feedback welcome, René van de Veerdonk Excerpts from the readme.txt file: Introduction: = Mockup of a high performance implementation of the game of Go using bitmaps as the principle datastructures. This implementation is limited to measuring the performance of the principle components by playing random playouts from an empty 9x9 board, using Chinese scoring, not allowing pass moves until no other moves are available. Suicide is not allowed, neither for single or multi-stone groups. No checks are implemented for superko, but simple ko's are disallowed explicitly. Some implementation details: Benchmarking is done in test_board.cpp, where the number of playouts (n) and repeats (j) for each test is set. Benchmarking is done starting from an empty 9x9 board until two (three with ko) passes are played in sequence. The benchmark playout follow the same scheme as method board_t::play_random_game, but with timing instrumentation added. Clock counts (cc) are measured using the cc_time_of macro. Total run-time measurement relies on the windows library. For each move during a playout
Re: [computer-go] Bitmap Go revisited and mockup implementation
If an engine (or an engine's playout policy) needs to check the legality of every move before making a selection, this could still be a benefit. René van de Veerdonk wrote: David, Confession: I have not tested 19x19. As you have noted, and others before you over the years, a 19x19 board does not fit in one but three 128-bit registers and there would be a rather big penalty as a result, perhaps (likely?) wiping out all of the benefits of bitmaps. Antoine voiced his disappointment about the speed advantage even on 9x9 in the e-mail to the list that served as my basis. My hope, as Hideko Kato points out, is in longer registers or perhaps being able to port this to a GPU. A 64-bit OS with twice the number registers would also surely help out. Nevertheless, I was surprised that I was able to get to almost the same level of speed as Libego provides. The goals for this mockup were very humble: (1) to provide a basic implementation to see what it can do (and if I could do it), and (2) learn how to work with assembly and/or intrinsic functions (I had never done that before). It may not be too hard to try 19x19 and just see what happens. I may attempt this as a follow-up. René van de Veerdonk 2009/8/23 David Fotland fotl...@smart-games.com mailto:fotl...@smart-games.com How much would you lose for 19x19 board? A board representation is not very interesting unless it scales to 19 line boards. David *From:* computer-go-boun...@computer-go.org mailto:computer-go-boun...@computer-go.org [mailto:computer-go-boun...@computer-go.org mailto:computer-go-boun...@computer-go.org] *On Behalf Of *René van de Veerdonk *Sent:* Sunday, August 23, 2009 1:11 PM *To:* computer-go@computer-go.org mailto:computer-go@computer-go.org *Subject:* [computer-go] Bitmap Go revisited and mockup implementation Hi all, After years of reading this very interesting list, I decided to make my first contribution this week after reading once again about bitmap go. There is no freely available implementation of a bitmap based engine that I know of, so here is a start. Note that I am not a professional programmer, self-taught, and this is a part-time hobby, so do not expect perfect form. The attachment is an attempt to create an efficient implementation of a bitmap based light playout engine, following mostly the recommendations as spelled out by Antoine de Maricourt in his November 2006 message to this list. The application is single threaded and achieves around 75-80 kpps on my Centrino Duo 2.4 GHz Intel labtop. The mockup was developed in C/C++ using the Microsoft Visual Studio 2008 IDE for Windows XP (32-bit) and undoubtedly suffers from portability issues as a result. The 128-bit SSE instructions are called through intrinsic functions, hopefully at least these interfaces are the same for other compilers. The single goal of this mockup was to do some benchmarking, so there is no gtp-interface, but that would be rather easy to add for those skilled in the art. The standard rules are used with one exception, the Brown eye-rule is used. Multi-stone suicide is explicitly forbidden. No superko checks are performed during playouts, and there is no mercy-rule either. I would be particularly interested to know if a 64-bit OS will improve the performance significantly, as it does have a lot less register pressure. Moving to a 64-bit OS will also make some instructions available that allow for further simplifications/efficiency improvements in the code. As will enabling SSE3 or SSE4.1 instructions. One note, the random number generator used is from Agner Fog's public library and was not included in the attachment. You will have to provide a random number generator yourself or download the appropriate version from his well-known website (www.agner.org/optimize http://www.agner.org/optimize) I hope this is the start of a discussion about further improvements that can be made, but you may use this code as you wish with no restrictions or guarantees. See the readme.txt file included in the attachment for more details (also partially included below). Comments and feedback welcome, René van de Veerdonk Excerpts from the readme.txt file: Introduction: = Mockup of a high performance implementation of the game of Go using bitmaps as the principle datastructures. This implementation is limited to measuring the performance of the principle components by playing random playouts from an empty 9x9 board, using Chinese scoring, not allowing pass moves until no other moves are available. Suicide is not allowed, neither for single or multi-stone groups. No checks
Re: [computer-go] Bitmap Go revisited and mockup implementation
2009/8/24 René van de Veerdonk rene.vandeveerd...@gmail.com: David, Confession: I have not tested 19x19. As you have noted, and others before you over the years, a 19x19 board does not fit in one but three 128-bit registers and there would be a rather big penalty as a result, perhaps (likely?) wiping out all of the benefits of bitmaps. Antoine voiced his disappointment about the speed advantage even on 9x9 in the e-mail to the list that served as my basis. My hope, as Hideko Kato points out, is in longer registers or perhaps being able to port this to a GPU. A 64-bit OS with twice the number registers would also surely help out. Nevertheless, I was surprised that I was able to get to almost the same level of speed as Libego provides. Libego provides 105-108 kpps on a single Core2 2.4 GHz on 9x9. and 22-23 kpps on 19x19 Can you test your bitmap implementation on 19x19? Lukasz The goals for this mockup were very humble: (1) to provide a basic implementation to see what it can do (and if I could do it), and (2) learn how to work with assembly and/or intrinsic functions (I had never done that before). It may not be too hard to try 19x19 and just see what happens. I may attempt this as a follow-up. René van de Veerdonk 2009/8/23 David Fotland fotl...@smart-games.com How much would you lose for 19x19 board? A board representation is not very interesting unless it scales to 19 line boards. David From: computer-go-boun...@computer-go.org [mailto:computer-go-boun...@computer-go.org] On Behalf Of René van de Veerdonk Sent: Sunday, August 23, 2009 1:11 PM To: computer-go@computer-go.org Subject: [computer-go] Bitmap Go revisited and mockup implementation Hi all, After years of reading this very interesting list, I decided to make my first contribution this week after reading once again about bitmap go. There is no freely available implementation of a bitmap based engine that I know of, so here is a start. Note that I am not a professional programmer, self-taught, and this is a part-time hobby, so do not expect perfect form. The attachment is an attempt to create an efficient implementation of a bitmap based light playout engine, following mostly the recommendations as spelled out by Antoine de Maricourt in his November 2006 message to this list. The application is single threaded and achieves around 75-80 kpps on my Centrino Duo 2.4 GHz Intel labtop. The mockup was developed in C/C++ using the Microsoft Visual Studio 2008 IDE for Windows XP (32-bit) and undoubtedly suffers from portability issues as a result. The 128-bit SSE instructions are called through intrinsic functions, hopefully at least these interfaces are the same for other compilers. The single goal of this mockup was to do some benchmarking, so there is no gtp-interface, but that would be rather easy to add for those skilled in the art. The standard rules are used with one exception, the Brown eye-rule is used. Multi-stone suicide is explicitly forbidden. No superko checks are performed during playouts, and there is no mercy-rule either. I would be particularly interested to know if a 64-bit OS will improve the performance significantly, as it does have a lot less register pressure. Moving to a 64-bit OS will also make some instructions available that allow for further simplifications/efficiency improvements in the code. As will enabling SSE3 or SSE4.1 instructions. One note, the random number generator used is from Agner Fog's public library and was not included in the attachment. You will have to provide a random number generator yourself or download the appropriate version from his well-known website (www.agner.org/optimize) I hope this is the start of a discussion about further improvements that can be made, but you may use this code as you wish with no restrictions or guarantees. See the readme.txt file included in the attachment for more details (also partially included below). Comments and feedback welcome, René van de Veerdonk Excerpts from the readme.txt file: Introduction: = Mockup of a high performance implementation of the game of Go using bitmaps as the principle datastructures. This implementation is limited to measuring the performance of the principle components by playing random playouts from an empty 9x9 board, using Chinese scoring, not allowing pass moves until no other moves are available. Suicide is not allowed, neither for single or multi-stone groups. No checks are implemented for superko, but simple ko's are disallowed explicitly. Some implementation details: Benchmarking is done in test_board.cpp, where the number of playouts (n) and repeats (j) for each test is set. Benchmarking is done starting from an empty 9x9 board until two (three with ko) passes are played in sequence. The benchmark playout follow the same scheme as method board_t
Re: [computer-go] Bitmap Go revisited and mockup implementation
Lukasz, I tested Libego version 0.125 on my labtop, compiled using Visual Studio 2008 under Windows XP with slight modifications: #ifdef NDEBUG section commented out in testing.h #include time.h added to fast_random.cpp From the command-line with engine -b I observed 67-71 kpps on the same labtop as I used to benchmark the 9x9 bitmap mockup achieving 75-80 kpps. It seems Libego performs much better (going from 70 to 105 kpps is an impressive 50% improvement) with other compilers and OS. For sure it was not developed for MS Visual Studio, so my numbers are not an apples to apples comparison. But I think it is fair to say that both implementations are at least in the same ball-park when compiled as is under Visual Studio in Windows XP. I do not have an explanation for why my results are 50% below the numbers you quoted other than the difference in compiler (inlining?) and OS (32 vs 64-bit?). It would be really interesting to have someone else do a comparison. Given that I was natively working and optimizing in Visual Studio, I am not expecting a 50% improvement for the bitmap mockup (but hopefully it will not decrease 50% either). As for testing on a 19x19 board, this is not as straightforward for me as modifying a single parameter in the program. The basic data-type needs to be changed from a single 128-bit variable to an array of three such elements. This requires a redefinition of most basic operations, in particular the shift operations will take some work as these go across array-elements. In the generated assembly I could already see that there are not enough 128-bit registers for 9x9 in some functions, and this will only get worse with arrays. Dereferencing array elements potentially also add overhead. I am going to give it a try, but do not expect results overnight, as this is a low intensity hobby for me (note that Antoine's e-mail was in November 2006). Optimistically, a 3x speed reduction may be expected, but the additional overhead and and switch-logic could easily push the drop to over 10x unless branching can be minimized. If someone could look over the mockup and suggest ways to reduce the amount of branching there I would really appreciate it. If someone would eventually try to port something like this to a GPU, that would also benefit tremendously from a reduction in branching. I find that I need to be very conscious about branching during coding, as I have observed a tendency to put ever more if, do and while statements in my code. Also, does anybody know if gcc allows the same interface to the SSE-instructions as I have adopted in bitmap.cpp? Thanks, René On Mon, Aug 24, 2009 at 4:18 PM, Łukasz Lew lukasz@gmail.com wrote: 2009/8/24 René van de Veerdonk rene.vandeveerd...@gmail.com: David, Confession: I have not tested 19x19. As you have noted, and others before you over the years, a 19x19 board does not fit in one but three 128-bit registers and there would be a rather big penalty as a result, perhaps (likely?) wiping out all of the benefits of bitmaps. Antoine voiced his disappointment about the speed advantage even on 9x9 in the e-mail to the list that served as my basis. My hope, as Hideko Kato points out, is in longer registers or perhaps being able to port this to a GPU. A 64-bit OS with twice the number registers would also surely help out. Nevertheless, I was surprised that I was able to get to almost the same level of speed as Libego provides. Libego provides 105-108 kpps on a single Core2 2.4 GHz on 9x9. and 22-23 kpps on 19x19 Can you test your bitmap implementation on 19x19? Lukasz The goals for this mockup were very humble: (1) to provide a basic implementation to see what it can do (and if I could do it), and (2) learn how to work with assembly and/or intrinsic functions (I had never done that before). It may not be too hard to try 19x19 and just see what happens. I may attempt this as a follow-up. René van de Veerdonk 2009/8/23 David Fotland fotl...@smart-games.com How much would you lose for 19x19 board? A board representation is not very interesting unless it scales to 19 line boards. David From: computer-go-boun...@computer-go.org [mailto:computer-go-boun...@computer-go.org] On Behalf Of René van de Veerdonk Sent: Sunday, August 23, 2009 1:11 PM To: computer-go@computer-go.org Subject: [computer-go] Bitmap Go revisited and mockup implementation Hi all, After years of reading this very interesting list, I decided to make my first contribution this week after reading once again about bitmap go. There is no freely available implementation of a bitmap based engine that I know of, so here is a start. Note that I am not a professional programmer, self-taught, and this is a part-time hobby, so do not expect perfect form. The attachment is an attempt to create an efficient implementation of a bitmap based light playout engine, following mostly
Re: [computer-go] Bitmap Go revisited and mockup implementation
For many people on this list, Computer Go is a hobby and that means that it is important to do whatever you find interesting and motivating, event if it may not be the best or most promising direction in order to become the next world champion program. There is room for pushing limits in all directions, IMO. Brian Sheppard wrote: Maven was fast enough, but it was clearly the slowest of all major Scrabble programs. Some programs were 5x faster than Maven. Maven was slow because it was a lot more than a move generator: it played the full game with competent positional scoring. Scoring requires much more time than move generation. So why bother optimizing move generation? I am finding that the same lesson applies to Go. But more so because Scrabble is largely first-order whereas Go is recursive. Pebbles averages ~10K trials per turn. Aya is clearly stronger at 10K trials than Pebbles. Fuego is stronger than Aya. Valkyria is stronger than Fuego, and my impression is that Zen trumps us all. Zen could be better than Pebbles + 400 rating points at 10K trials. Now look at scaling. When Fuego runs a machine that is ~20x as fast as mine, then Fuego is comparable to Zen. Valkyria plays in 19x19 tournaments running on a Pentium M, and scores 50% against much more powerful computers. My conclusion is that the analytic skills of Zen and Valkyria are worth a factor of 20, and I expect the advantage will increase with computer speeds. The crucial issue is to support intelligence. Check out David’s description of MFoG, and Remi’s papers about CrazyStone. I am sure those programs pay attention to speed, but the real issue is whether the engine can do the necessary analysis. Magnus’s posts to this list lament that Valkyria is so slow, but Magnus is clearly excited that his engine supports great analysis. So de-emphasize speed. Can you build a great pattern matcher using bit boards? Can you solve life death positions better? Can you read ladders faster or more accurately? Do bit boards make the program easier to program and debug? Pebbles started with a bit board move generator, but I have been adding other knowledge entities, like liberty counts, string iterators, and eyespace detectors. I am convinced that emphasizing domain knowledge structures will pay off. (That being said, I will borrow your intrinsic functions for my 9x9 template specialization. ☺ Every little bit helps.) Best, Brian ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/ ___ computer-go mailing list computer-go@computer-go.org http://www.computer-go.org/mailman/listinfo/computer-go/
RE: [computer-go] Bitmap Go revisited and mockup implementation
Getting similar speed to Libego is very impressive. It’s important to start with a fast basic engine, since it will only get much slower as you add more knowledge. Currently I only get about 19k playouts per second on a 2 core Core2Duo 2.3 GHz, on 9 line empty board. I started at about 55k playouts per second on one core (for a UCT search). The original one only beat gnugo 17% of the time, and the current one beats gnugo 93%, both with 5000 playouts, so in this case at least much slower is much stronger (testing on 9x9, against gnugo 3.7.10, gnugo --mode gtp --chinese-rules --capture-all-dead --level 10 --positional-superko). David From: computer-go-boun...@computer-go.org [mailto:computer-go-boun...@computer-go.org] On Behalf Of René van de Veerdonk Sent: Monday, August 24, 2009 9:23 AM To: computer-go Subject: Re: [computer-go] Bitmap Go revisited and mockup implementation David, Confession: I have not tested 19x19. As you have noted, and others before you over the years, a 19x19 board does not fit in one but three 128-bit registers and there would be a rather big penalty as a result, perhaps (likely?) wiping out all of the benefits of bitmaps. Antoine voiced his disappointment about the speed advantage even on 9x9 in the e-mail to the list that served as my basis. My hope, as Hideko Kato points out, is in longer registers or perhaps being able to port this to a GPU. A 64-bit OS with twice the number registers would also surely help out. Nevertheless, I was surprised that I was able to get to almost the same level of speed as Libego provides. The goals for this mockup were very humble: (1) to provide a basic implementation to see what it can do (and if I could do it), and (2) learn how to work with assembly and/or intrinsic functions (I had never done that before). It may not be too hard to try 19x19 and just see what happens. I may attempt this as a follow-up. René van de Veerdonk 2009/8/23 David Fotland fotl...@smart-games.com How much would you lose for 19x19 board? A board representation is not very interesting unless it scales to 19 line boards. David From: computer-go-boun...@computer-go.org [mailto:computer-go-boun...@computer-go.org] On Behalf Of René van de Veerdonk Sent: Sunday, August 23, 2009 1:11 PM To: computer-go@computer-go.org Subject: [computer-go] Bitmap Go revisited and mockup implementation Hi all, After years of reading this very interesting list, I decided to make my first contribution this week after reading once again about bitmap go. There is no freely available implementation of a bitmap based engine that I know of, so here is a start. Note that I am not a professional programmer, self-taught, and this is a part-time hobby, so do not expect perfect form. The attachment is an attempt to create an efficient implementation of a bitmap based light playout engine, following mostly the recommendations as spelled out by Antoine de Maricourt in his November 2006 message to this list. The application is single threaded and achieves around 75-80 kpps on my Centrino Duo 2.4 GHz Intel labtop. The mockup was developed in C/C++ using the Microsoft Visual Studio 2008 IDE for Windows XP (32-bit) and undoubtedly suffers from portability issues as a result. The 128-bit SSE instructions are called through intrinsic functions, hopefully at least these interfaces are the same for other compilers. The single goal of this mockup was to do some benchmarking, so there is no gtp-interface, but that would be rather easy to add for those skilled in the art. The standard rules are used with one exception, the Brown eye-rule is used. Multi-stone suicide is explicitly forbidden. No superko checks are performed during playouts, and there is no mercy-rule either. I would be particularly interested to know if a 64-bit OS will improve the performance significantly, as it does have a lot less register pressure. Moving to a 64-bit OS will also make some instructions available that allow for further simplifications/efficiency improvements in the code. As will enabling SSE3 or SSE4.1 instructions. One note, the random number generator used is from Agner Fog's public library and was not included in the attachment. You will have to provide a random number generator yourself or download the appropriate version from his well-known website (www.agner.org/optimize) I hope this is the start of a discussion about further improvements that can be made, but you may use this code as you wish with no restrictions or guarantees. See the readme.txt file included in the attachment for more details (also partially included below). Comments and feedback welcome, René van de Veerdonk Excerpts from the readme.txt file: Introduction: = Mockup of a high performance implementation of the game of Go using bitmaps as the principle datastructures. This implementation is limited to measuring
RE: [computer-go] Bitmap Go revisited and mockup implementation
How much would you lose for 19x19 board? A board representation is not very interesting unless it scales to 19 line boards. David From: computer-go-boun...@computer-go.org [mailto:computer-go-boun...@computer-go.org] On Behalf Of René van de Veerdonk Sent: Sunday, August 23, 2009 1:11 PM To: computer-go@computer-go.org Subject: [computer-go] Bitmap Go revisited and mockup implementation Hi all, After years of reading this very interesting list, I decided to make my first contribution this week after reading once again about bitmap go. There is no freely available implementation of a bitmap based engine that I know of, so here is a start. Note that I am not a professional programmer, self-taught, and this is a part-time hobby, so do not expect perfect form. The attachment is an attempt to create an efficient implementation of a bitmap based light playout engine, following mostly the recommendations as spelled out by Antoine de Maricourt in his November 2006 message to this list. The application is single threaded and achieves around 75-80 kpps on my Centrino Duo 2.4 GHz Intel labtop. The mockup was developed in C/C++ using the Microsoft Visual Studio 2008 IDE for Windows XP (32-bit) and undoubtedly suffers from portability issues as a result. The 128-bit SSE instructions are called through intrinsic functions, hopefully at least these interfaces are the same for other compilers. The single goal of this mockup was to do some benchmarking, so there is no gtp-interface, but that would be rather easy to add for those skilled in the art. The standard rules are used with one exception, the Brown eye-rule is used. Multi-stone suicide is explicitly forbidden. No superko checks are performed during playouts, and there is no mercy-rule either. I would be particularly interested to know if a 64-bit OS will improve the performance significantly, as it does have a lot less register pressure. Moving to a 64-bit OS will also make some instructions available that allow for further simplifications/efficiency improvements in the code. As will enabling SSE3 or SSE4.1 instructions. One note, the random number generator used is from Agner Fog's public library and was not included in the attachment. You will have to provide a random number generator yourself or download the appropriate version from his well-known website (www.agner.org/optimize) I hope this is the start of a discussion about further improvements that can be made, but you may use this code as you wish with no restrictions or guarantees. See the readme.txt file included in the attachment for more details (also partially included below). Comments and feedback welcome, René van de Veerdonk Excerpts from the readme.txt file: Introduction: = Mockup of a high performance implementation of the game of Go using bitmaps as the principle datastructures. This implementation is limited to measuring the performance of the principle components by playing random playouts from an empty 9x9 board, using Chinese scoring, not allowing pass moves until no other moves are available. Suicide is not allowed, neither for single or multi-stone groups. No checks are implemented for superko, but simple ko's are disallowed explicitly. Some implementation details: Benchmarking is done in test_board.cpp, where the number of playouts (n) and repeats (j) for each test is set. Benchmarking is done starting from an empty 9x9 board until two (three with ko) passes are played in sequence. The benchmark playout follow the same scheme as method board_t::play_random_game, but with timing instrumentation added. Clock counts (cc) are measured using the cc_time_of macro. Total run-time measurement relies on the windows library. For each move during a playout, the program goes through a sequence of (1) identify all legal moves (~50cc) this follows Brown's standard definition; this definition differs from most Monte-Carlo engines and has a different set of pros and cons; bitmaps allow all moves to be processed in parallel (2) pick a uniformly random move from the available choices (~75cc) relies on a modulus operation; is not 100% uniform to save speed (3) playing it on the board (~170cc) lengthiest method; but rather straightforward Playouts end after 256 moves or whenever two passes are played in succession (three in case of a ko). The routines for each of these basic steps are contained in the file board.cpp. The underlying datastructures are 128-bit bitmaps, for which the sse2 instructions used to manipulate them are contained in the file bitmap.cpp. Further improvements can be expected from using any 64-bit OS that supports several more assembly instructions, and from using sse4.1 instructions. Example output (on a Intel 2.40 GHz Centrino Duo labtop): = [game] = 30469.7