My disappointment with that blog post is that it does not address the
larger issue, which some people here have already clarified. To be clear
on the larger issue and the various points raised about channels:
SMALL BENCHMARKS ARE HARD
Benchmarking is harder than it seems, because computers since the IBM
360/85 (1969), the DEC PDP-11/70 (1975), and a few others before have memory
caches. They can also have multiple CPUs, multiple levels of memory cache,
and since 1964's CDC 6000, multiple threads active in a single CPU (the
basis of the first NVIDIA GeForce GPU and rediscovered by Intel as Jackson
Technology, aka SMT, a few years later).
In a world where computers read, write, and compute multiple things at
once, it is quite difficult to measure the cost of any single thing. Why?
Because if that one thing can be done in some idle part of the computer
while other things are happening, then the effective cost is zero.
Even if you manage to measure that one thing somehow, with sufficient
scaffolding and after disabling your modern CPU's power-saving speed
degrader and myriad other processes and performance modulators (such as
uOp dispatch priority inside the CPU and the like), the problem remains:
how to understand it and extrapolate from it.
To be clear, if it takes a perfectly measured average of 140 ns +/- 11 ns to
do something, and you do 100 of them, it will likely not add 100x that time
or 100x that variance to your application's actual performance rates. Maybe
all 100 of those fit in "empty spots" so the cost is zero. Maybe they
conflict with instruction scheduling, bus activity, or cache/VM activity in
such a way that they add 100000x that time.
This is why microbenchmarks are so hard to understand. They are hard to
write properly, they are hard to measure in a meaningful way, and they are
hard to extrapolate with confidence. (You should do them, but, you should
always keep in mind that the application-level, real-world impact may be 0x
or 100x that much cost.)
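
As a minimal sketch of such a microbenchmark, assuming the standard
testing package is acceptable outside "go test": the helper name
benchChanPair and the choice of operation (one send plus one receive on a
buffered channel, within a single goroutine) are illustrative only. The
number it prints describes this harness on this machine and, per the
caveats above, should not be multiplied out to predict application cost.

```go
package main

import (
	"fmt"
	"testing"
)

// benchChanPair times one send plus one receive on a buffered channel
// within a single goroutine, so no scheduler hand-off is involved.
// (Hypothetical helper; the operation measured is only an example.)
func benchChanPair() testing.BenchmarkResult {
	// Init registers the testing flags; it is needed when calling
	// Benchmark outside "go test" and has no effect if already called.
	testing.Init()
	return testing.Benchmark(func(b *testing.B) {
		ch := make(chan int, 1)
		for i := 0; i < b.N; i++ {
			ch <- i
			<-ch
		}
	})
}

func main() {
	res := benchChanPair()
	// The testing package picks the iteration count N so the timed loop
	// runs long enough to average out; NsPerOp is total time divided by N.
	fmt.Printf("%d iterations, %d ns/op\n", res.N, res.NsPerOp())
}
```

Running it several times, or while the machine is busy, is a cheap way to
see how much the reported figure moves.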
CONCURRENT PROGRAMMING IS HARD
Doing things concurrently generally implies one or more spreading steps
where the code widens one thing to multiple things, and one or more
gathering steps where the multiple things narrow to fewer things,
ultimately to one thing, like knowing when to exit the program. For decades
there have been frequent errors in widening, narrowing, and data and device
conflicts in the concurrent phase. Code that works for years may suddenly
break. Code that looks simple may never work. Natural intuition in such
cases often leads one astray.
One solution is to be afraid of concurrency and avoid it. One is to embrace
it, but with special armor as if handling hot lava or a pit of vipers. A
middle route is to only allow it in such a way that it is tame (lava in
insulated containers, snakes asleep in darkened boxes). One such mild
route--Communicating Sequential Processes--was pioneered by C. A. R. Hoare,
the inventor of Quicksort. Per Brinch Hansen has an excellent book about OS
construction via the method, and Go is one of CSP's direct descendants. Go's
channels, along with select, receive, send, and close operations, are its
SMALL BENCHMARKS OF CONCURRENCY PRIMITIVES IS VERY HARD
It is hard to measure directly in much the same way it is hard to directly
measure curved space-time. Indirect ways are hard too. As above, even when
you can measure them, it is hard to understand what that data says in the
context of your program or any program other than the test harness.
This is where that blog post comes in. To paraphrase, "I think some of Go's
mild, safe mechanisms lack a feature that I wish for, and not only that,
when I use them to emulate some parts of my low-level lava-juggling armor,
they are not as fast. Oh no! Yet, I still love Go." Well, people see that
and seem to miss:
a. Why in the heck would you use high-level magic to emulate low-level
tools? In the case of channels, they already use lava juggling and snake
charming tools hidden safely inside their implementation.
b. How can you compare performance of high level program structuring
elements and low-level viper wrangling tools? Whichever is 'faster' or
'simpler' for the same task is likely a misapplication of one or both.
c. What about the whole notion of making concurrency safe and easy?
Experienced people from Tony Hoare to Rob Pike have seen the light about
hiding the parts of concurrency that are so often tools of self-destruction
in the hands of very good programmers. Why tempt beginners to open that
door? Why tempt anyone?
That's what I think when people comment on that post. Sure, new features
could be considered. Sure, existing tools can be tweaked toward hardware
optimal implementation. Sure, Go provides all the lava and snake tools one
could need in the sync package. But unless your well-designed and
well-implemented application is too inefficient as it scales from 1 to N
CPUs, then why would you think to abandon magic that brings simplicity and
correctness to what was formerly a wasteland of failed efforts and
inscrutable bugs? ("My plane flies 0.0000001% faster without the weight of
my parachute so I leave it behind" is not a well-considered approach.)
How would you know if an application was that inefficient? By benchmarking
THE WHOLE APPLICATION rather than an emulation of a low-level concurrency
Channels are not "expensive"; it is just that they are not free. I can do
3,000,000 channel send/receive pairs per second on my notebook computer. If
each send is a single bit, that's about 366 KB/second safely and easily sent
between communicating processes. If each is a pointer to a 1MB data
structure, then that's 3 TB/sec safely and easily sent between
communicating processes. It is not likely that any application can do 3
million interesting tasks per second (build and send web pages, compute
market conditions and send buy/sell orders, update databases, etc.) on any
computer, much less a four core mobile device on battery power.
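
A rough sketch of how one might count send/receive pairs and redo the
arithmetic above. The function names sendReceivePairs and bytesPerSecond
are hypothetical, and the unbuffered channel between two goroutines is an
assumption about the setup, not a record of how the 3,000,000 figure was
obtained. The measured rate will differ by machine and load.

```go
package main

import (
	"fmt"
	"time"
)

// sendReceivePairs pushes the integers 1..n through an unbuffered channel
// from one goroutine to another. It returns the sum seen by the receiver
// (a check that nothing was lost) and the elapsed wall-clock time.
func sendReceivePairs(n int) (sum int, elapsed time.Duration) {
	ch := make(chan int)
	start := time.Now()
	go func() {
		for i := 1; i <= n; i++ {
			ch <- i
		}
		close(ch)
	}()
	for v := range ch {
		sum += v
	}
	return sum, time.Since(start)
}

// bytesPerSecond converts a rate of send/receive pairs into the data rate
// implied when each send stands for payloadBytes bytes.
func bytesPerSecond(pairsPerSecond, payloadBytes float64) float64 {
	return pairsPerSecond * payloadBytes
}

func main() {
	const n = 1000000
	sum, elapsed := sendReceivePairs(n)
	fmt.Printf("sum=%d, %.0f pairs/second\n", sum, float64(n)/elapsed.Seconds())

	// The figures from the text, at 3,000,000 pairs/second.
	oneBit := bytesPerSecond(3e6, 1.0/8) // 375,000 bytes/second
	oneMB := bytesPerSecond(3e6, 1e6)    // 3e12 bytes/second
	fmt.Printf("1 bit each: %.1f KB/second (1 KB = 1024 bytes)\n", oneBit/1024)
	fmt.Printf("1 MB each:  %.0f TB/second\n", oneMB/1e12)
}
```

The 1 MB case moves only a pointer per send, so the 3 TB/second is the
amount of data handed over, not the amount copied.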
Maybe it is possible to do 5x or 20x that many mutex-protected increments
of an integer using those viper-handling gloves and body armor. But a
computer that is dedicated to updating a single int is questionable, and an
application dominated by it is also questionable.
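
For reference, the mutex-protected increment in question looks roughly
like this. It is a sketch only: lockedCount and its parameters are
hypothetical names, and it makes no claim about how many increments per
second any machine will manage.

```go
package main

import (
	"fmt"
	"sync"
)

// lockedCount starts `workers` goroutines that each add 1 to a shared
// integer `perWorker` times, holding a mutex around every increment.
// It returns the final value once all goroutines have finished.
func lockedCount(workers, perWorker int) int {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		total int
	)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				mu.Lock()
				total++
				mu.Unlock()
			}
		}()
	}
	wg.Wait() // the gathering step: narrow back to one thing
	return total
}

func main() {
	fmt.Println(lockedCount(4, 250000)) // prints 1000000
}
```

Note that the program does nothing except update one int, which is the
point of the paragraph above.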
If anything, I'd rather put a sleep in all the sync primitives and force
the mental discipline to make the application fast DESPITE artificially
slow sync/cond/mutex/wait/... speeds. It is all about high-level design,
choice of algorithms and data structures, and similar issues--that's where
100x gains in performance await. 2x on a mutex is just not interesting to
On Sat, Aug 12, 2017 at 5:34 AM, Jesper Louis Andersen <
> On Fri, Aug 11, 2017 at 2:22 PM Chris Hopkins <cbehopk...@gmail.com>
>> .... The microsecond or so of cost you see I understood was *not* due to
>> there being thousands of operations needed to run the channel, but the
>> latency added by the stall, and scheduler overhead.
> One particular case, which many benchmarks end up doing is that they run a
> single operation through the system which in turn pays all the context
> switching overhead for that operation. But channels pipeline. If you start
> running a million operations, then the switching overhead amortizes over
> the operations if your system is correctly asynchronous and tuned.
> I think most message passing languages add some kind of atomics in order
> to track counters and like stuff without resorting to sending around
> microscopic messages all the time.
> You received this message because you are subscribed to the Google Groups
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to golang-nuts+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
Michael T. Jones