Hi Jesper,

Thanks for the good explanation :). The Java optimization techniques can 
explain the difference in performance. I now realize that Java's 
performance in production (a web application) will be worse than in the 
benchmark, because lock elision and coarsening will most likely not be 
possible, or not as beneficial, as they are in the benchmark. I've run the 
Java benchmark with -Djava.compiler=NONE to disable the optimizations, but 
that is not really fair, and performance drops by two orders of magnitude. 

Your suggestion of using multiple stores and giving each goroutine its own 
store would work well when multiple goroutines process data. This library, 
however, is intended to be used by a team that develops web applications: 
we want to measure the durations of certain operations in a request, or 
keep track of the response times of remote systems. The statistics must be 
available for inspection on request. Multiple stores would then be harder 
to achieve; at the least, the implementation would be a bit more complex. 

On Sunday, October 2, 2016 at 9:07:15 PM UTC+2, Jesper Louis Andersen wrote:
>
>
>> This program has a store that holds the data, and a sync.Mutex that 
>> guards concurrent access on reads and writes. This is a snippet of the 
>> locking based implementation:
>>
>> type Store struct {
>>    durations map[string]*Distribution
>>    counters  map[string]int64
>>    samples   map[string]*Distribution
>>
>>    lock *sync.Mutex
>> }
>>
>> If I were to write this, I would have one Store per goroutine. Then at 
> intervals, you send this Store to a central location (over a channel) and 
> tally up the counts. In turn, you avoid locking in the tight loops since 
> each goroutine owns the Store it operates on, until you make a copy of it 
> and send it along. Another approach, which is a bit harder to pull off in Go, is to 
> have a mutex/Store per core. Read operations then take all mutexes for 
> reading, but write operations will almost never compete on the lock, so it 
> becomes free of contention.
>
> The reason this works is that your counters work as CRDT P-counters and 
> max/min are CRDTs on their own. So you don't have to synchronize at all[0]. 
> Also, if you keep the count, the sum and the sum of squares, it is easy to 
> compute the mean and the std. dev. of the data set. All form CRDTs [1].
>
> The key to fast concurrent programming that achieves good parallelism is 
> to avoid having to synchronize at all.
>
>
> (Aside: both of the above schemes have been used by me with success in 
> Erlang, so I know they work in practice as well as in theory).
>
>> I would really like to understand why locking in Go is so much slower than 
>> in Java. I've initially written this program using channels, but that was 
>> much slower than locking. Can somebody please help me out?
>>
> Java's compiler employs a number of optimizations w.r.t. locking:
>
> * Escape analysis and lock elision: If we can prove data is thread-local 
> and never escapes the thread, then we can elide the lock and never take it.
>
> * Lock coarsening: Suppose we take and release the same lock multiple 
> times in a row. Say you roll out the inner loop of your benchmark a bit. 
> Then you can take the lock once, update several times and release the lock. 
> This is called coarsening the lock. If the lock is heavily contended, this 
> may be faster, especially if other optimizations allow us to 
> strength-reduce the updates and fold several of them into a single update, 
> which is exactly what we do once we have unrolled the loop 4-8 times.
>
> There is also the ability to exploit modern CPUs' hardware transactional 
> memory: If you have a coarse lock over a large array, and most updates 
> affect different indexes in the array, then you can run a transaction 
> optimistically under the assumption threads will write to different slots. 
> On a collision, you restart the transaction, this time with normal mutexes. 
> It is sometimes called hardware lock elision, but the actual state of it on 
> modern CPUs eludes me. The extensions were there but I think they were 
> disabled by microcode since there were bugs in them (at least on the 
> Haswell generation of Intel CPUs). Also, I don't think Java has this, yet, 
> but I might be mistaken.
>  
> [0] CRDT: Commutative/Convergent Replicated Data Type.
> [1] Consider just using https://godoc.org/github.com/codahale/hdrhistogram by 
> Coda Hale. HDR Histogram is a nice data structure, invented by Gil Tene, 
> which combines the ideas of Flajolet's HyperLogLog with the structure of 
> floating point representations. Histogram updates are measured in the lower 
> nanoseconds.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to golang-nuts+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
