Hi Jesper, Thanks for the good explanation :). The Java optimization techniques can explain the difference in performance. I now realize that the Java performance in production (a web application) will be worse than in the benchmark because lock elision and coarsening will most likely not be possible or be as beneficial in the benchmark. I've run the Java benchmark with -Djava.compiler=NONE to disable the optimizations, but that is not realy fair and performance drops two orders of magnitude.
Your suggestions of using multiple stores and giving each goroutine his own store would work well when having multiple routines that process data. This library is intended to be used by a team that develops web applications, and we want to measure durations of certain operations in a request or keep track of the response times of remote systems. The statistics must be available for inspection at request Multiple stores would then be more difficult to achieve, at least the implementation would be a little bit more complex. On Sunday, October 2, 2016 at 9:07:15 PM UTC+2, Jesper Louis Andersen wrote: > > >> This program has a store that holds the data, and a sync.Mutex that >> guards concurrent access on reads and writes. This is a snippet of the >> locking based implementation: >> >> type Store struct { >> durations map[string]*Distribution >> counters map[string]int64 >> samples map[string]*Distribution >> >> lock *sync.Mutex >> } >> >> If I were to write this, I would have one Store per goroutine. Then at > intervals, you send this Store to a central location (over a channel) and > tally up the counts. In turn, you avoid locking in the tight loops since > each goroutine owns the Store it operates on, until you make a copy of it > and send it along. approach, which is a bit harder to pull of in Go, is to > have a mutex/Store per core. Read operations then take all mutexes for > reading, but write operations will almost never compete on the lock, so it > becomes free of contention. > > The reason this works is because your counters work as CRDT P-counters and > max/min are CRDTs on their own. So you don't have to synchronize at all[0]. > Also, if you keep the count, the sum and the sum of squares, it is easy to > compute the mean and the std. dev. of the data set. All form CRDTs [1]. > > The key to fast concurrent programming which achieves good parallelism is > to avoid having to synchronize at all. > > > (Aside: both of the above schemes have been used by me with success in > Erlang, so I know they work in practice as well as in theory). > >> I would really like to understand why Locking in Go is so much slower than >> in Java. I've initially written this program using channels, but that was >> much slower than locking. Can somebody please help me out? >> >> Javas compiler employs a number of optimizations w.r.t. Locking: > > * Escape analysis and lock elision: If we can prove data is thread-local > and never escapes the thread, then we can elide the lock and never take it. > > * Lock coarsening: Suppose we take and release the same lock multiple > times in a row. Say you roll out the inner loop of your benchmark a bit. > Then you can take the lock once, update several times and release the lock. > This is called coarsening the lock. If the lock is heavily contended, this > may be faster. Especially if other optimizations allows us to > strength-reduce the updates and fold several of them into a single update. > Which would be exactly what we do once we have unrolled the loop 4-8 times. > > There is also the ability to exploit modern CPUs hardware transactional > memory: If you have a coarse lock over a large array, and most updates > affect different indexes in the array, then you can run a transaction > optimistically under the assumption threads will write to different slots. > On a collision, you restart the transaction, this time with normal mutexes. > It is sometimes called hardware lock elision, but the actual state of it on > modern CPUs eludes me. The extensions were there but I think they were > disabled by microcode since there were bugs in them (At least on the > Haswell generation of Intel CPUs). Also, I don't think Java has this, yet, > but I might be mistaken. > > [0] CRDT: Commutative/Convergent Replicated Data Type. > [1] Consider just using https://godoc.org/github.com/codahale/hdrhistogram by > Coda Hale. HDR Histogram is a nice data structure, invented by Gil Tene, > which combines the ideas of Flajolet's HyperLogLog with the structure of > floating point representations. Histogram updates are measured in the lower > nanoseconds. > > On Sunday, October 2, 2016 at 9:07:15 PM UTC+2, Jesper Louis Andersen wrote: > > >> This program has a store that holds the data, and a sync.Mutex that >> guards concurrent access on reads and writes. This is a snippet of the >> locking based implementation: >> >> type Store struct { >> durations map[string]*Distribution >> counters map[string]int64 >> samples map[string]*Distribution >> >> lock *sync.Mutex >> } >> >> If I were to write this, I would have one Store per goroutine. Then at > intervals, you send this Store to a central location (over a channel) and > tally up the counts. In turn, you avoid locking in the tight loops since > each goroutine owns the Store it operates on, until you make a copy of it > and send it along. approach, which is a bit harder to pull of in Go, is to > have a mutex/Store per core. Read operations then take all mutexes for > reading, but write operations will almost never compete on the lock, so it > becomes free of contention. > > The reason this works is because your counters work as CRDT P-counters and > max/min are CRDTs on their own. So you don't have to synchronize at all[0]. > Also, if you keep the count, the sum and the sum of squares, it is easy to > compute the mean and the std. dev. of the data set. All form CRDTs [1]. > > The key to fast concurrent programming which achieves good parallelism is > to avoid having to synchronize at all. > > > (Aside: both of the above schemes have been used by me with success in > Erlang, so I know they work in practice as well as in theory). > >> I would really like to understand why Locking in Go is so much slower than >> in Java. I've initially written this program using channels, but that was >> much slower than locking. Can somebody please help me out? >> >> Javas compiler employs a number of optimizations w.r.t. Locking: > > * Escape analysis and lock elision: If we can prove data is thread-local > and never escapes the thread, then we can elide the lock and never take it. > > * Lock coarsening: Suppose we take and release the same lock multiple > times in a row. Say you roll out the inner loop of your benchmark a bit. > Then you can take the lock once, update several times and release the lock. > This is called coarsening the lock. If the lock is heavily contended, this > may be faster. Especially if other optimizations allows us to > strength-reduce the updates and fold several of them into a single update. > Which would be exactly what we do once we have unrolled the loop 4-8 times. > > There is also the ability to exploit modern CPUs hardware transactional > memory: If you have a coarse lock over a large array, and most updates > affect different indexes in the array, then you can run a transaction > optimistically under the assumption threads will write to different slots. > On a collision, you restart the transaction, this time with normal mutexes. > It is sometimes called hardware lock elision, but the actual state of it on > modern CPUs eludes me. The extensions were there but I think they were > disabled by microcode since there were bugs in them (At least on the > Haswell generation of Intel CPUs). Also, I don't think Java has this, yet, > but I might be mistaken. > > [0] CRDT: Commutative/Convergent Replicated Data Type. > [1] Consider just using https://godoc.org/github.com/codahale/hdrhistogram by > Coda Hale. HDR Histogram is a nice data structure, invented by Gil Tene, > which combines the ideas of Flajolet's HyperLogLog with the structure of > floating point representations. Histogram updates are measured in the lower > nanoseconds. > > -- You received this message because you are subscribed to the Google Groups "golang-nuts" group. To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/d/optout.