Hello,

At [1] you have a chart of pgbench runs on monster for a range of 5 to 100 threads/clients (for 5, there are 5 threads and 5 clients, so a total of 10 new processes). As you can see, when the number of running processes is greater than the number of cores, the cache-coherent heuristic gives much better results (in some cases by more than about 15%).

I modified my cache-coherent heuristic, for monster, so that it tries to stick a process to a socket, no matter which core. The L3 cache comes into play, and that's why we have these results.

I'm now working on a tunable option for selecting the level at which you want to stick a process (thread level, core level, socket/package level). I will commit it by tomorrow.

Also, this week I cleaned up the code and introduced KTR debug options in place of my old kprintfs. Next week I will come with the same kind of charts for core i3, core i7 (alexh) and dual-xeon (ftigeot).
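
To make the level selection concrete, here is a minimal userland sketch of the idea, not the actual scheduler code. Everything in it is an assumption for illustration: the names stick_level, topo and candidate_mask are hypothetical, the topology is taken to be uniform with CPUs numbered socket by socket and core by core, and at most 64 CPUs are supported.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical levels at which a process is kept near its previous CPU. */
enum stick_level { STICK_THREAD, STICK_CORE, STICK_SOCKET };

/* Hypothetical uniform topology; CPU ids are contiguous within a core
 * and cores are contiguous within a socket. */
struct topo {
	int threads_per_core;
	int cores_per_socket;
};

/*
 * Return a bitmask of the CPUs that share the chosen level with last_cpu.
 * The scheduler would then pick any CPU from this mask.
 */
static uint64_t
candidate_mask(const struct topo *t, int last_cpu, enum stick_level level)
{
	int span, first;
	uint64_t mask;

	assert(last_cpu >= 0 && last_cpu < 64);

	switch (level) {
	case STICK_THREAD:
		span = 1;		/* only the same hardware thread */
		break;
	case STICK_CORE:
		span = t->threads_per_core;	/* siblings on the same core */
		break;
	default:
		/* every CPU in the same socket/package */
		span = t->threads_per_core * t->cores_per_socket;
		break;
	}

	first = (last_cpu / span) * span;	/* lowest CPU id in the group */
	mask = (span >= 64) ? ~0ULL : ((1ULL << span) - 1);
	return mask << first;
}
```

With an assumed topology of 2 threads per core and 4 cores per socket, a process that last ran on cpu 11 gets only cpu 11 at thread level, cpus 10-11 at core level, and cpus 8-15 at socket level. The wider the level, the more freedom the scheduler has to balance load while still landing on a CPU that shares some cache with the previous one.
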

[1] http://leaf.dragonflybsd.org/~mihaic/pgbench_monster.pdf

Mihai