> First of all, thanks for committing part-1 of these changes.  It
> seems you are planning to commit part-3 based on the results of tests
> which Andres is planning to do.  For the remaining part (part-2), today
> I have tried some tests, the results of which are as follows:
> Scale Factor - 3000, Shared_buffer - 8GB
>
> Patch_Ver/Client_Count                                     16      32      64     128
> reduce-replacement-locking.patch + 128 Buf Partitions  157732  229547  271536  245295
> scalable_buffer_eviction_v9.patch                      163762  230753  275147  248309
>
> Scale Factor - 3000, Shared_buffer - 8GB
>
> Patch_Ver/Client_Count                                     16      32      64     128
> reduce-replacement-locking.patch + 128 Buf Partitions  157781  212134  202209  171176
> scalable_buffer_eviction_v9.patch                      160301  213922  208680  172720
>
> The results indicate that in all cases there is a benefit from doing
> part-2 (bgreclaimer).  Though the benefit at this configuration is
> not high, there might be more benefit at some higher configurations
> (scale factor - 10000).  Do you see any merit in pursuing this
> further to accomplish part-2 as well?

Interesting results.  Thanks for gathering this data.

If this is the best we can do with the bgreclaimer, I think the case for
pursuing it is somewhat marginal.  The biggest jump you've got seems to be
at scale factor 3000 with 64 clients, where you picked up about 4%.  4%
isn't nothing, but it's not a lot, either.  On the other hand, this might
not be the best we can do.  There may be further improvements to
bgreclaimer that make the benefit larger.

Backing up a bit, to what extent have we actually solved the problem here?
If we had perfectly removed all of the scalability bottlenecks, what would
we expect to see?  You didn't say which machine this testing was done on,
or how many cores it had, but for example on the IBM POWER7 machine, we
probably wouldn't expect the throughput at 64 clients to be 4 times the
throughput at 16 clients, because up to 16 clients each one can have a full
CPU core, whereas after that and out to 64 each is getting a hardware
thread, which is not quite as good.  Still, we'd expect performance to go
up, or at least not go down.  Your data shows a characteristic performance
knee: between 16 and 32 clients we go up, but then between 32 and 64 we go
down, and between 64 and 128 we go down more.  You haven't got enough data
points there to show very precisely where the knee is, but unless you
tested this on a smaller box than what you have been using, we're certainly
hitting the knee sometime before we run out of physical cores.  That
implies a remaining contention bottleneck.
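
To make the knee concrete, here is a rough sketch (Python; the helper names
are mine, nothing here comes from the patch) that computes the throughput
ratio between adjacent client counts.  It uses the
reduce-replacement-locking.patch + 128 Buf Partitions row from the second
table quoted above, and reports the last client count before throughput
starts to fall.

```python
# Throughput (tps) from the second quoted table, row
# "reduce-replacement-locking.patch + 128 Buf Partitions".
clients = [16, 32, 64, 128]
tps = [157781, 212134, 202209, 171176]


def scaling_steps(clients, tps):
    """Return (from_clients, to_clients, ratio) for each adjacent pair."""
    points = list(zip(clients, tps))
    return [(c0, c1, t1 / t0) for (c0, t0), (c1, t1) in zip(points, points[1:])]


def knee(clients, tps):
    """Last client count before throughput first drops, or None if it never drops."""
    for c0, _c1, ratio in scaling_steps(clients, tps):
        if ratio < 1.0:
            return c0
    return None


if __name__ == "__main__":
    for c0, c1, ratio in scaling_steps(clients, tps):
        print(f"{c0:>3} -> {c1:>3} clients: x{ratio:.2f}")
    print("throughput stops rising after", knee(clients, tps), "clients")
```

With only four data points this can only bracket the knee between two
measured client counts; more points in that range would be needed to
locate it more precisely.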

My results from yesterday were a bit different.  I tested 1 client, 8
clients, and multiples of 16 clients out to 96.  With
reduce-replacement-locking.patch + 128 buffer mapping partitions,
performance continued to rise all the way out to 96 clients.  The increase
definitely wasn't linear, but it went up, not down.  I don't know why this is
different from what you are seeing.  Anyway, there's a little more ambiguity
there about how much contention remains, but my bet is that there is at
least some contention that we could still hope to remove.  We need to
understand where that contention is.  Are the buffer mapping locks still
contended?  Is the new spinlock contended?  Are there other contention
points?  I won't be surprised if it turns out that the contention is on the
new spinlock and that a proper design for bgreclaimer is the best way to
remove that contention .... but I also won't be surprised if it turns out
that there are bigger wins elsewhere.  So I think you should try to figure
out where the remaining contention is first, and then we can discuss what
to do about it.

On another point, I think it would be a good idea to rebase the bgreclaimer
patch over what I committed, so that we have a clean patch against master
to test with.

