* Clark Wierda:

> I don't have specific answers, but I do have some thoughts.
>
> First, the externally parallel has no overhead related to coordination. I
> would expect you to get the result you have of full core utilization and
> nearly perfect scaling.

GNU parallel saves intermediate outputs from sub-jobs running in parallel, so there is still some coordination overhead because that data has to be kept around. It's also written in Perl, and while copying bytes around in bulk with Perl isn't *that* slow, it's still some work the tool has to do. There are also some savings due to internal parallelization, such as fewer GC roots to scan.

> As soon as you have a common resource, you will have the related overhead
> due to management of the concurrent actions. The actual cost will be
> determined by how much contention you have for that resource.
>
> Have you looked at your program using the Go Profiler? I found the output
> quite useful in determining where my program was spending its time.
> Another thing to check is CPU load. Are you saturating the CPU with 6
> threads. If not, you are likely waiting somewhere.

The program appears to be GC-bound, so I doubt that the Go profiler will tell me much more.

One oddity is that CPU utilization never reaches 100% across all cores. It seems that the GC spends a lot of time waiting and switching between threads, using futexes and sched_yield. Hence the high number of context switches.

But I think I have figured it out: the major difference between external and internal parallelization is heap size. With external parallelization, I get 12 processes with around 13 MiB RSS each. With internal parallelization, the RSS of the single process stays at 25 MiB.

I have not found a way to obtain GC summary stats, but based on the traces, it seems that in both cases the heap grows to 4 MiB per process, while the live data seems to be less than 500 KiB. With external parallelization, that's 48 MiB of heap in total, most of which is headroom for the GC to work with. With internal parallelization, the headroom is perhaps around 3.5 MiB.

If I set GOGC=1000, the internal parallelization is about as fast as the external one, and its RSS is well below the combined RSS of the external parallelization.
Context switches are much reduced as well.

So this appears to be a case where the runtime does not expand the heap aggressively enough, and the scheduler and GC somehow stall the program because there is not enough heap available. I would have expected that in this case, more computation time would be dedicated to the garbage collector, but this does not seem to be happening.
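
For anyone who wants to check this on their own program: GC summary numbers can be read from inside the process with runtime.ReadMemStats, and the effect of GOGC=1000 can be reproduced in-process with debug.SetGCPercent. What follows is only a rough sketch, not the program discussed in this thread. The churn function is a hypothetical stand-in workload, and gcSummary, readGCSummary and gcCyclesAt are names made up for illustration. Setting GODEBUG=gctrace=1 in the environment additionally makes the runtime print one line per collection to stderr.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

// gcSummary holds a few counters copied from runtime.MemStats.
type gcSummary struct {
	NumGC      uint32 // completed GC cycles
	HeapAlloc  uint64 // bytes of allocated heap objects
	HeapSys    uint64 // bytes of heap memory obtained from the OS
	NextGC     uint64 // target heap size of the next GC cycle
	PauseTotal uint64 // cumulative stop-the-world pause in nanoseconds
}

// readGCSummary takes a snapshot of the runtime's memory statistics.
func readGCSummary() gcSummary {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return gcSummary{
		NumGC:      m.NumGC,
		HeapAlloc:  m.HeapAlloc,
		HeapSys:    m.HeapSys,
		NextGC:     m.NextGC,
		PauseTotal: m.PauseTotalNs,
	}
}

// sink keeps the allocations in churn from being optimized away.
var sink []byte

// churn is a hypothetical workload: n short-lived 1 KiB allocations
// with almost no live data, so nearly all of the heap is garbage.
func churn(n int) {
	for i := 0; i < n; i++ {
		sink = make([]byte, 1024)
	}
}

// gcCyclesAt runs churn(n) under the given GC percent (the in-process
// equivalent of the GOGC variable) and returns how many GC cycles
// completed during the run. The previous setting is restored on return.
func gcCyclesAt(percent, n int) uint32 {
	old := debug.SetGCPercent(percent)
	defer debug.SetGCPercent(old)
	runtime.GC() // start from a freshly collected heap
	before := readGCSummary()
	churn(n)
	after := readGCSummary()
	return after.NumGC - before.NumGC
}

func main() {
	for _, p := range []int{100, 1000} {
		fmt.Printf("GC percent %d: %d GC cycles\n", p, gcCyclesAt(p, 1<<20))
	}
	s := readGCSummary()
	fmt.Printf("heap alloc %d KiB, heap sys %d KiB, next GC at %d KiB, total pause %d ns\n",
		s.HeapAlloc/1024, s.HeapSys/1024, s.NextGC/1024, s.PauseTotal)
}
```

With so little live data, the higher setting should yield far fewer collections for the same amount of allocation, which is consistent with the reduced GC activity described above. The exact counts depend on the Go version and the machine.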