FWIW, I did some further experiments.  Disabling mem-pool entirely (in favor of 
plain malloc/free) brought the run time down to 3:35, vs. 2:57 for the exact same
run without multiplexing.  Somehow we're still not managing contention very
well at this kind of thread count, but the clues and opportunities are becoming 
less obvious.
