* Clark Wierda:

> I don't have specific answers, but I do have some thoughts.
>
> First, the externally parallel version has no overhead related to
> coordination.  I would expect you to get the result you have of full
> core utilization and nearly perfect scaling.

GNU parallel saves intermediate outputs from sub-jobs running in
parallel, so there is still some coordination overhead because that
data has to be kept around.  It's also written in Perl, and while
copying bytes around in bulk with Perl isn't *that* slow, it's still
some work the tool has to do.

There are also some savings due to internal parallelization, such as
fewer GC roots to scan.

> As soon as you have a common resource, you will have the related overhead 
> due to management of the concurrent actions.  The actual cost will be 
> determined by how much contention you have for that resource.
>
> Have you looked at your program using the Go Profiler?  I found the output 
> quite useful in determining where my program was spending its time.
> Another thing to check is CPU load.  Are you saturating the CPU with 6 
> threads?  If not, you are likely waiting somewhere.

The program appears to be GC-bound, so I doubt that the Go profiler
will tell me much more.

One oddity is that CPU utilization never reaches 100% across all
cores.  It seems that the GC spends a lot of time waiting and
switching between threads, using futexes and sched_yield, which would
explain the high number of context switches.
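
For anyone who wants to check the context-switch counts from inside
the process, here is a minimal sketch.  It assumes a Unix-like system
where syscall.Getrusage is available; the helper name contextSwitches
is only illustrative and not part of the program discussed here.

```go
package main

import (
	"fmt"
	"syscall"
)

// contextSwitches returns the voluntary and involuntary context-switch
// counts that the kernel has recorded for the current process so far.
// (Hypothetical helper, for illustration only.)
func contextSwitches() (voluntary, involuntary int64, err error) {
	var ru syscall.Rusage
	if err = syscall.Getrusage(syscall.RUSAGE_SELF, &ru); err != nil {
		return 0, 0, err
	}
	return int64(ru.Nvcsw), int64(ru.Nivcsw), nil
}

func main() {
	v, iv, err := contextSwitches()
	if err != nil {
		fmt.Println("getrusage failed:", err)
		return
	}
	fmt.Printf("voluntary: %d, involuntary: %d\n", v, iv)
}
```

Calling this once before and once after the parallel section and
taking the difference gives the number of switches attributable to
that section.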

But I think I have figured it out: The major difference between
external and internal parallelization is heap size.  With external
parallelization, I get 12 processes with around 13 MiB RSS each.  With
internal parallelization, the single process RSS stays at 25 MiB.  I
have not found a way to obtain GC summary stats, but based on the
traces, it seems that in both cases, the heap grows to 4 MiB per
process, while the live data seems to be less than 500 KiB.  With
external parallelization, that's 48 MiB of heap in total, most of
which is headroom for the GC to work with.  With internal
parallelization, the headroom is perhaps around 3.5 MiB.

If I set GOGC=1000, the internal parallelization is about as fast as
the external one, and RSS is still well below the combined RSS of the
externally parallel processes.  Context switches are much reduced as
well.
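
To illustrate what GOGC changes, here is a rough sketch based on the
documented rule that a collection is triggered once freshly allocated
data reaches GOGC percent of the live data.  The heapGoal helper is a
hypothetical name, and the formula ignores any minimum heap size the
runtime may apply (which the 4 MiB observed above suggests), so the
numbers are only indicative.  debug.SetGCPercent is the in-program
counterpart of the GOGC environment variable.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// heapGoal approximates the heap size at which the next collection is
// triggered: live data plus gogc percent of live data.  It ignores any
// minimum heap size enforced by the runtime.
// (Hypothetical helper, for illustration only.)
func heapGoal(liveBytes uint64, gogc int) uint64 {
	return liveBytes + liveBytes*uint64(gogc)/100
}

func main() {
	const live = 500 << 10 // roughly the live data estimated above

	for _, g := range []int{100, 1000} {
		fmt.Printf("GOGC=%d: goal about %d KiB\n", g, heapGoal(live, g)>>10)
	}

	// Same effect as starting the program with GOGC=1000.
	// SetGCPercent returns the previous setting.
	prev := debug.SetGCPercent(1000)
	fmt.Println("previous GC percent:", prev)
}
```

Setting the percentage in the program avoids relying on the
environment, at the cost of hard-coding a value that may only suit
this workload.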

So this appears to be a case where the runtime does not expand the
heap aggressively enough, and the scheduler and GC somehow stall the
program because there is not enough heap available.  I would have
expected that in this case, more computation time would be dedicated
to the garbage collector, but this does not seem to be happening.
