Re: [computer-go] CUDA and GPU Performance
Rene, you're absolutely right, it's completely fishy! But don't worry, your work is not in vain :) I noticed this morning, when I read your mail, that I had included the 9x9 results in my original mail instead of 19x19! Indeed, for 19x19 the results are even worse. Here's a complete rundown:

- 9x9 CPU: 47,000 playouts per core per second
- 9x9 GPU: 170,000 playouts per second
- 19x19 CPU: 9,800 playouts per core per second
- 19x19 GPU: 11,000 playouts per second

I did mention in another mail that the performance difference for 9x9 should be larger, I think. What I didn't realise was that I had reported the 9x9 numbers by mistake!

Additional statistics:

- Processor occupancy for 19x19 was 6% instead of 9%
- Branch divergence was less than half a percent (it was 2% for 9x9). This is perhaps because the larger board size causes more moves to fall onto empty intersections, or fewer moves to require merges/captures.

Christian

René van de Veerdonk wrote:

Christian,

Would you care to provide some more detail on your implementation for the playouts? Your results are very impressive. At 19x19 Go using bit-boards, your implementation is roughly 7x as fast as the bitboard implementation I presented just a few weeks back, and it also outperforms libEgo by about a factor of two.

René

On Wed, Sep 9, 2009 at 2:57 PM, Christian Nentwich <christ...@modeltwozero.com> wrote:

Mark, let me try to add some more context to answer your questions. When I say in my conclusion that it's "not worth it", I mean it's not worth using the GPU to run playout algorithms of the sort that are in use today. There may be many other algorithms that form part of Go engines where the GPU can provide an order-of-magnitude speedup, and still more where the GPU can run in parallel with the CPU to help. In my experiments, a CPU core got 47,000 playouts per second and the GPU 170,000.
But:

- My computer has two cores (so it gets 94,000 playouts with 2 threads)
- My computer's processor (Intel Core 2 Duo 6600) is 3 years old, and far from state of the art
- My graphics card (GeForce 285), on the other hand, is recently purchased and one of the top graphics cards

That means that my old CPU already gets more than half the speed of the GPU. An Intel Nehalem processor would surely beat it, let alone an 8-core system. Bearing in mind the severe drawbacks of the GPU - these are not general-purpose processors, and there is much you can't do on them - this limits their usefulness with this algorithm. Compare this speedup to truly highly parallel algorithms: random number generation, matrix multiplication, Monte Carlo simulation of options (which are highly parallel because there is no branching and little data); you see speedups of 10x to 100x over the CPU with those.

The 9% occupancy may be puzzling, but there is little that can be done about it. This, and the talk about threads and blocks, would take a while to explain, because GPUs don't work like general-purpose CPUs. They are SIMD processors, meaning that each processor can run many threads in parallel on different items of data, but only if *all threads are executing the same instruction*. There is only one instruction decoding stage per processor cycle. If any if-statements or loops diverge, threads will be serialised until they join again. The 9% occupancy is a function of the amount of data needed to perform the task and of the branch divergence (caused by the playouts being different). There is little that can be done about it other than use a completely different algorithm. If you google "CUDA block threads" you will find out more.

In short, the GPU runs like a grid cluster. In each block, 64 threads run in parallel, conceptually. On the actual hardware, in each processor 16 threads from one block will execute, followed by 16 from another (half-warps).
If any threads are blocked (memory reads cost ~400 cycles!) then threads from another block are scheduled instead. So the answer is: yes, there are 64 * 80 threads conceptually, but they're not always scheduled at the same time.

Comments on specific questions below.

"If parallelism is what you're looking for, why not have one thread per move candidate? Use that to collect AMAF statistics. 16Kb is not a lot to work with, so the statistics may have to be shared."

One thread per move candidate is feasible with the architecture I used, since every thread has its own board. I have not implemented AMAF, so I cannot comment on the statistics bit, but the output of your algorithm is typically not in the 16k shared memory anyway; you'd write that to global memory (1GB). Would uniform random playouts be good enough for this, though?
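The AMAF bookkeeping Mark's question implies can be sketched briefly. This is a hedged sketch in plain C standing in for device code: the struct and function names are made up, and on the GPU the increments would be atomicAdd()s into counters in global memory, shared by all playouts.

```c
#include <assert.h>

/* Hypothetical AMAF bookkeeping: one counter pair per intersection.
 * On the GPU these would live in global memory and be bumped with
 * atomicAdd(); a plain array stands in here. All names are
 * illustrative, not taken from Christian's implementation. */
#define POINTS 81 /* 9x9 board */

typedef struct { int plays; int wins; } AmafStat;

/* All-moves-as-first: after a playout, credit every point the player
 * under consideration occupied at any time during the playout. */
void amaf_update(AmafStat *stats, const int *moves, int n, int won) {
    for (int i = 0; i < n; i++) {
        stats[moves[i]].plays += 1; /* atomicAdd(&stats[m].plays, 1) on device */
        stats[moves[i]].wins += won;
    }
}

/* Move value used for selection: win rate over AMAF samples,
 * defaulting to 0.5 for unsampled points. */
double amaf_value(const AmafStat *s, int point) {
    return s[point].plays ? (double)s[point].wins / s[point].plays : 0.5;
}
```

Because many playout threads update the same shared counters, the device version pays the same atomic-serialization cost discussed elsewhere in this thread.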
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
Petr, wow, I didn't expect to see so much experimentation being performed! It's great that you have taken the time to implement this approach, because it now shows people both alternatives for implementing a playout algorithm on the GPU.

I strongly suspect the low performance in the per-intersection case is down to two reasons - please let me know what you think:

- A big proportion of moves place one stone on an empty intersection. In this case 80 of your 81 threads are asleep doing nothing, so GPU time is wasted.
- In quite a few other cases, we get to the old problem with Go: the intersections are not independent. Thus when you process captures or merges, you possibly have to lock 80 of 81 threads.

You can't do much about the first. Did you find a solution for the second? I was thinking for a while about whether it would be possible to come up with a highly parallel algorithm for captures/merges that requires no lock. That's quite hard, so I concentrated on my own experiment instead.

I agree with you that your approach could perform very well in pattern matching scenarios. In fact, your work somewhat strengthens my conclusion that doing just straightforward playouts on the GPU is not the way to go.

Christian

Petr Baudis wrote:

I have written a quick CUDA implementation of the per-intersection GPGPU approach; Christian's e-mail finally made me polish it up to a somewhat presentable form. In this implementation, each GPU thread maintains a single intersection. The implementation uses a 9x9 board (10x11 internally); expanding to 19x19 would probably mean either moving some data to global memory, or splitting the playout of a single board across multiple blocks, or waiting for GPUs with larger shared memory. ;-)

The speed is unfortunately very disappointing; on a GTX260, I see 17,300 playouts per second, with 50 playouts running in parallel.
There are several buts:

- With some further obvious optimizations, I'm thinking it should be possible to get it to at least 22,000 playouts per second easily.
- Bit boards are an obvious thing to try, but I'm quite doubtful they will be an improvement.
- The playouts are uniformly random now, but they have been implemented in a way that makes heavier playouts easy; a precise liberty count is maintained, it should be easy to add pattern matching, and the move selection can deal with arbitrary probabilities for different intersections. (I wonder, what are the standard high-performance numbers for heavier playouts with pattern matching?)
- Christian is playing 2650 games in parallel; I'm playing only 50 games in parallel, so in the meantime the CPU can pick the next games to play from the tree.
- My code already accounts for transferring board images from the CPU to the GPU (a different one for each playout) and getting scores back.
- Christian's GPU seems quite a bit better than mine.
- Apparently I still suck at CUDA optimizations; this has been my first real small CUDA project, and it would be awesome if someone more experienced could look at the code and suggest obvious improvements.

Still, I'm pretty unhappy with the slowness; I wonder how Christian achieved such a high speed. One problem with my approach is that I have to use a lot of atomic intrinsics (some of them could be worked around) and __syncthreads() all the time to ensure that all threads have a consistent board image at each stage.

I think that with pattern matching, the per-intersection approach could shine more, since I can easily match patterns everywhere on the board at once. Still, I wonder if it will become at least on par with good CPU implementations.

On Wed, Sep 09, 2009 at 04:54:23PM +0100, Christian Nentwich wrote:

In other words: no tree search is involved, and this is the lightest possible playout.
The raw numbers are as follows:

- CPU search: 47,000 playouts per CPU core per second, on an Intel Core 2 Duo 6600
- GPU search: 170,000 playouts per second, on an NVidia GeForce 285 card

I still find this quite astonishing; since you consider this experiment a failure anyway, would you mind publishing the source code? :-)

--
Christian Nentwich
Director, Model Two Zero Ltd.
+44-(0)7747-061302
http://www.modeltwozero.com

___
computer-go mailing list
computer-go@computer-go.org
http://www.computer-go.org/mailman/listinfo/computer-go/
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
This is very interesting. Here is a crazy idea; maybe it is the same as Mark's, but I want to take it to its extreme.

Since AMAF values are so helpful, perhaps one can let go of the idea of sequential play following the rules of Go, and basically play moves in parallel on all empty intersections. Compute the new state (do captures), repeat a fixed number of times, and evaluate.

Since I do not understand GPU programming at all, I cannot judge whether this would make things easier to implement. Also, a lot of weird things will happen when you do things in parallel. What happens when two adjacent blocks are captured at the same time? Are both removed? Are there tricks to remove the one that was most likely to be captured to begin with?

-Magnus

Quoting Christian Nentwich <christ...@modeltwozero.com>:

"Petr, wow, I didn't expect to see so much experimentation being performed! It's great that you have taken the time to implement this approach, because this now shows people both alternatives for implementing a playout algorithm on the GPU. [...]"
Re: [computer-go] CUDA and GPU Performance
Interesting stuff. I don't have the skills nor the time to make such experiments myself, but here is a simple idea: when using a bitmap representation of the board, it is quite possible to find all eye-like points with a constant number of bit-shifting operations. That should reduce the number of branches. It might even be possible to check liberties and remove captured stones with a constant number of loopings, at least for most reasonable board positions (the number would have to be unreasonably large for pathological positions with very long and narrow groups).

This still leaves a few places where the code will have to branch, most notably choosing a random legal move. And the ending of the simulation, of course. If we solve all these, it should be possible to run simple playouts in parallel, taking full advantage of the SIMD nature of the GPU. Of course, each playout would be slower than necessary, but maybe that would even out with enough of them done in parallel.

-H

--
Heikki Levanto   "In Murphy We Turst"   heikki (at) lsd (dot) dk
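Heikki's constant-number-of-shifts eye detection is easy to sketch. A hedged toy version in C: an 8x8 board so a single uint64_t suffices (a real 9x9 or 19x19 board needs two or more words, but the shifting idea is identical), and "eye-like" simplified to "empty point all of whose on-board orthogonal neighbours are friendly stones".

```c
#include <stdint.h>

/* Toy 8x8 bitboard, bit i = point (i % 8, i / 8). File masks stop
 * east/west shifts from wrapping across rows. */
#define FILE_A 0x0101010101010101ULL
#define FILE_H 0x8080808080808080ULL
#define RANK_1 0x00000000000000FFULL
#define RANK_8 0xFF00000000000000ULL

static uint64_t north(uint64_t b) { return b << 8; }
static uint64_t south(uint64_t b) { return b >> 8; }
static uint64_t east(uint64_t b)  { return (b & ~FILE_H) << 1; }
static uint64_t west(uint64_t b)  { return (b & ~FILE_A) >> 1; }

/* Simplified "eye-like" points for `own`: empty points whose every
 * on-board orthogonal neighbour is an own stone. Off-board counts as
 * friendly, hence the ORed edge masks. Branch-free, as Heikki says:
 * a fixed number of shifts, ORs and ANDs finds all such points at once. */
uint64_t eye_like(uint64_t own, uint64_t other) {
    uint64_t empty = ~(own | other);
    uint64_t s_ok = north(own) | RANK_1; /* south neighbour own, or off-board */
    uint64_t n_ok = south(own) | RANK_8;
    uint64_t e_ok = west(own)  | FILE_H;
    uint64_t w_ok = east(own)  | FILE_A;
    return empty & s_ok & n_ok & e_ok & w_ok;
}
```

A full eye test would also look at diagonal neighbours to rule out false eyes, which costs only four more shift/mask steps in the same style.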
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
Hi!

On Thu, Sep 10, 2009 at 12:26:06PM +0100, Christian Nentwich wrote:
"I strongly suspect the low performance in the per-intersection case is down to two reasons - please let me know what you think: A big proportion of moves place one stone on an empty intersection. In this case 80 of your 81 threads are asleep doing nothing. This means GPU time is wasted."

This is mostly true, though only through part of the loop body; I think this is the biggest reason my approach is slower than yours, though, and it will be much worse on 19x19. It would be interesting to have N threads running independently and getting contracted to various boards for massive tasks like capturing and merging groups. But that would be quite a complex task, and I'm not sure it is even practical to consider with the current state of the CUDA sandbox.

"In quite a few other cases, we get to the old problem with Go: that the intersections are not independent. Thus when you process captures or merges, you possibly have to lock 80 of 81 threads."

I do no locking; actually, being able to do large captures and merges very quickly was always my primary motivation for even entertaining this way of parallelism. However, I do a lot of atomicAdd()s, which have quite an overhead and in some cases basically just serialize many threads. Mainly:

(i) When counting the score. This is totally unoptimized and O(N) now; it should be trivial to rewrite as a parallel sum in O(log N). This is one of the easy optimizations I mentioned, in fact the most obvious one.

(ii) When recalculating liberties. All intersection threads adding an extra liberty to the same group will get serialized. Avoiding this seems difficult.

--
Petr "Pasky" Baudis
A lot of people have my books on their bookshelves. That's the problem, they need to read them. -- Don Knuth
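The O(log N) parallel sum Petr mentions in (i) is the standard block-wide tree reduction. A hedged single-threaded sketch in C: the inner loop plays the role of all the threads active in one round, and the comment marks where __syncthreads() would sit on the device.

```c
/* Tree reduction: halve the number of active "threads" each round, so
 * after log2(N) rounds score[0] holds the total. N must be a power of
 * two; a real 81-point board would be padded with zeros up to 128. */
#define N 128

int block_sum(int score[N]) {
    for (int stride = N / 2; stride > 0; stride /= 2) {
        for (int i = 0; i < stride; i++)   /* one parallel round */
            score[i] += score[i + stride];
        /* __syncthreads() would go here on the GPU */
    }
    return score[0];
}
```

On the device this replaces 81 serialized atomicAdd()s with 7 synchronized rounds, which is exactly the optimization Petr calls the most obvious one.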
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
On Thu, Sep 10, 2009 at 01:54:49PM +0200, Magnus Persson wrote:
"Since AMAF values are so helpful, perhaps one can let go of the idea of sequential play following the rules of go, and basically play moves in parallel on all empty intersections. [...] What happens when two adjacent blocks are captured at the same time? Are both removed? Are there tricks to remove the one that was most likely to be captured to begin with?"

Well, after one iteration, _all_ blocks should be captured, since there will be no empty intersections, right? (Even if you don't play in eye-like spaces, you will have this situation until some of those actually form.) The easiest heuristic would seem to be to capture the smallest group, but I have a hard time imagining this working well, _and_ I don't see how to choose the smallest group efficiently (other than that, modifying my source to do this should be fairly easy).

--
Petr "Pasky" Baudis
A lot of people have my books on their bookshelves. That's the problem, they need to read them. -- Don Knuth
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
Quoting Petr Baudis <pa...@ucw.cz>:
"Well, after one iteration, _all_ blocks should be captured since there will be no empty intersections, right? [...]"

--
Magnus Persson
Berlin, Germany
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
Sorry for my empty reply to Petr; I misclicked.

When a block is identified, you could also sum up the original strength of the stones covered by the block and use that as a heuristic for what should be captured first.

Also, to avoid filling all liberties: one can do that with pattern matching when the original position is copied to the GPU. For example, if black has an empty triangle, black will rather pass than play that point; thus either white plays there or it remains empty. Basically it means we can impose heavy pruning on the original position, and then copy many copies of that position and evaluate them massively in parallel.

-Magnus

Quoting Petr Baudis <pa...@ucw.cz>:
"Well, after one iteration, _all_ blocks should be captured since there will be no empty intersections, right? [...] The easiest heuristic would seem to be to capture the smallest group, but I have hard time imagining this working well."

--
Magnus Persson
Berlin, Germany
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
On Thu, Sep 10, 2009 at 08:29:31AM -0400, Jason House wrote:
"I've thought of something similar in the past, but with a twist: pre-compute a subset of moves that could be safely played in parallel. Even if you can only play 285 moves in parallel on an empty 19x19, it could still be a huge speed boost."

Hmm, do you have some ideas how this pre-computation could be done in less than O(N)? I'm gonna think about this now during some travelling... ;-)

Petr "Pasky" Baudis
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
Maybe some divide & conquer algorithm?

On 10.09.2009, at 14:43, Petr Baudis wrote:
"Hmm, do you have some ideas how this pre-computation could be done in less than O(N)? I'm gonna think about this now during some travelling... ;-)"
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
"Since AMAF values are so helpful, perhaps one can let go of the idea of sequential play following the rules of go, and basically play moves in parallel on all empty intersections. Compute new state (do captures) and repeat a fixed number of times and evaluate."

two thoughts:

i) how do you determine the color to play for each thread?

ii) if i) becomes unsynchronized with the rules of go, you're basically exploring random boards instead of boards that are related to the game that you're interested in. the board should rapidly converge to something that retains no information about the initial state of the board (assuming that the initial board has stones on it).

s.
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
Somewhat... One could generate a random number (r) and combine it with the move mask (m) as follows:

black moves = m & r
white moves = m & ~r

This has the drawback that the number of black and white moves may not be equal. It can be modified to give an equal number of moves, for example by requiring that each subset of n intersections generates the same number of black and white moves, or by doing some kind of looping until the condition is met. For looping, you could keep some subset of the generated moves in order to guarantee convergence. For example, if there are 14 more white moves than black moves, keep all generated moves except for 14 white moves. Then you only have to pick 14 moves in the next loop iteration.

Sent from my iPhone

On Sep 10, 2009, at 8:43 AM, Petr Baudis <pa...@ucw.cz> wrote:
"Hmm, do you have some ideas how this pre-computation could be done in less than O(N)? I'm gonna think about this now during some travelling... ;-)"
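Jason's mask trick is a few lines of bit arithmetic. A minimal sketch in C, assuming a toy 64-point board so one word holds the mask (a 19x19 board needs six words); the struct and function names are made up, the balancing loop is omitted, and popcount uses the GCC/Clang builtin:

```c
#include <stdint.h>

/* Split a mask m of playable points into black and white moves with
 * one random word r: black gets the points where r has a 1 bit,
 * white gets the rest. One step, no branching. */
typedef struct { uint64_t black, white; } MoveSplit;

MoveSplit split_moves(uint64_t m, uint64_t r) {
    MoveSplit s = { m & r, m & ~r };
    return s;
}

/* The drawback Jason notes: the counts need not be equal. Popcount
 * gives the imbalance that his looping scheme would then correct by
 * re-drawing random bits for some of the surplus moves. */
int imbalance(MoveSplit s) {
    return __builtin_popcountll(s.black) - __builtin_popcountll(s.white);
}
```

By construction every playable point is assigned exactly one colour: black and white masks are disjoint and their union is m.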
Re: [computer-go] CUDA implementation of the per-intersection GPGPU approach
Quoting steve uurtamo <uurt...@gmail.com>:
"i) how do you determine color to play for each thread?"

Random. Only one random bit is necessary for each decision.

"ii) if i) becomes unsynchronized with the rules of go, you're basically exploring random boards instead of boards that are related to the game that you're interested in. the board should rapidly converge to something that retains no information about the initial state of the board."

Usual lightweight random games also create situations that never happen in normal play. But violating the rules will probably make it more biased than a usual light playout. Hard to tell before trying. My feeling is that it might work surprisingly well, but it might get extremely biased in some situations, which perhaps can be fixed by prefiltering the moves that can be played.

Magnus

--
Magnus Persson
Berlin, Germany
[computer-go] CGOS Engine does NOT use time control commands
I always see this message when booting up CGOS. However, my program supports time_left, and I also state this in list_commands. Time control would be much better if I got feedback from the server. How can I use it? Is there a special command I'm missing?

Thanks,
Isaac
Re: [computer-go] CGOS Engine does NOT use time control commands
I think it needs *time_settings*.

On Thu, Sep 10, 2009 at 12:37 PM, Isaac Deutsch <i...@gmx.ch> wrote:
"I always see this message when booting up CGOS. However, my program supports time_left and I also state this in list_commands. [...] Is there a special command I'm missing?"
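For anyone hitting the same message, a hedged sketch of what the two GTP time commands carry once the server starts sending them. The command names and argument order follow the GTP conventions (time_settings: main time, byo-yomi time, byo-yomi stones; time_left: colour, seconds, stones); the handler and struct names here are made up for illustration.

```c
#include <stdio.h>

/* Parsed GTP time state. time_settings arrives once at game start;
 * time_left arrives before each move with the remaining seconds and
 * stones for one colour. Names are illustrative, not from any engine. */
typedef struct {
    int main_time, byo_time, byo_stones; /* from time_settings */
    int left_time, left_stones;          /* from time_left */
    char color[8];
} TimeState;

/* Returns 1 if the line was one of the two time commands. As the
 * thread notes, the engine must also advertise both commands in its
 * list_commands reply, or the server will not send them. */
int handle_time_command(TimeState *t, const char *line) {
    if (sscanf(line, "time_settings %d %d %d",
               &t->main_time, &t->byo_time, &t->byo_stones) == 3)
        return 1;
    if (sscanf(line, "time_left %7s %d %d",
               t->color, &t->left_time, &t->left_stones) == 3)
        return 1;
    return 0;
}
```

A time-managed engine would then budget its playouts per move from left_time and the estimated number of remaining moves.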
Re: [computer-go] CGOS Engine does NOT use time control commands
Thanks, it works!

On 10.09.2009, at 18:49, Go Fast wrote:
"I think it needs time_settings."