On 06/11/2020 10:24, Kyotaro Horiguchi wrote:
> Thank you for the comment!
>
> First off, I thought that I managed to eliminate the degradation
> observed on the previous versions, but significant degradation (1.1%
> slower) is still seen in one case.
One thing to keep in mind with micro-benchmarks like this is that even
completely unrelated code changes can change the layout of the code in
memory, which in turn can affect CPU caching effects in surprising ways.
If you're lucky, you can see 1-5% differences just by adding a function
that's never called, for example, if it happens to move other code in
memory so that some hot codepath or struct gets split across CPU cache
lines. It can be infuriating when benchmarking.
> At Thu, 5 Nov 2020 11:09:09 +0200, Heikki Linnakangas <hlinn...@iki.fi> wrote in
>
> (A) original test patch
> I naively thought that the code path is too short to absorb the cost
> of even a few additional instructions. Indeed, I measured performance
> again with the same patch set on the current master and got more or
> less the same result.
>
> master 8195.58 ms, patched 8817.40 ms: +10.75%
>
> However, I noticed that the additional call was a recursive call, and
> the jmp inserted for the recursive call seems to take significant
> time. After avoiding the recursive call, the difference was reduced
> to +0.96% (master 8268.71 ms, patched 8348.30 ms).
>
> Just the two instructions below are inserted in this case, which looks
> reasonable.
>
>   8720ff <+31>: cmpl   $0xffffffff,0x4ba942(%rip)   # 0xd2ca48
>                                      <catalog_cache_prune_min_age>
>   872106 <+38>: jl     0x872240 <SearchCatCache1+352>   (call to a function)
That's interesting. I think a 1% degradation would be acceptable.
I think we'd like to enable this feature by default though, so the
performance when it's enabled is also very important.
> (C) inserting bare counter-update code without a branch
>
>> Do we actually need a branch there? If I understand correctly, the
>> point is to bump up a usage counter on the catcache entry. You could
>> increment the counter unconditionally, even if the feature is not
>> used, and avoid the branch that way.
> That change causes a 4.9% degradation, which is worse than having a
> branch.
>
> master 8364.54 ms, patched 8666.86 ms (+4.9%)
>
> The additional instructions follow.
>
> + 8721ab <+203>: mov    0x30(%rbx),%eax      # %eax = ct->naccess
> + 8721ae <+206>: mov    $0x2,%edx
> + 8721b3 <+211>: add    $0x1,%eax            # %eax++
> + 8721b6 <+214>: cmove  %edx,%eax            # if %eax == 0 then %eax = 2
>   <original code>
> + 8721bf <+223>: mov    %eax,0x30(%rbx)      # ct->naccess = %eax
> + 8721c2 <+226>: mov    0x4cfe9f(%rip),%rax  # 0xd42068 <catcacheclock>
> + 8721c9 <+233>: mov    %rax,0x38(%rbx)      # ct->lastaccess = %rax
Do you need the "naccess == 2" test? You could always increment the
counter, and in the code that uses naccess to decide what to evict,
treat all values >= 2 the same.

Need to handle integer overflow somehow. Or maybe not: integer overflow
is so infrequent that even if a hot syscache entry gets evicted
prematurely because its naccess count wrapped around to 0, it will
happen so rarely that it won't make any difference in practice.
- Heikki