For DBAs, memory access is the fastest
thing in the world. Disk IO is slow and evil, having the data in memory
is fast and good. Last night I had the opportunity to listen to a
talk by Ulrich Drepper
of Red Hat. This man is obsessed with performance in a way that I’ve
never seen before. His mission in life is to utilize every CPU cycle,
never ever wasting precious CPU waiting for anything - especially not
memory. How can you not like someone like this? Of course the talk was
fascinating, and I was furiously scribbling notes for two hours. The
rest of this post is those notes.
CPUs are fast and are getting faster. Memory, on the other hand, is
slow. Each memory access takes 240 core cycles to complete, which is an
incredible waste of resources.
To make matters worse, there are two types of memory these days: DRAM
and SRAM. DRAM is relatively cheap, takes less energy and less real
estate than SRAM, but it also has a physical limit for its speed - it
needs to discharge and this takes time. SRAM is a huge energy hog, but
it has no theoretical limits on its speed. For energy and real estate
reasons, SRAM can never be used as the machine's main memory, so DRAM
is used.
The current solution for the discrepancy is using CPU Caches. They
are extra bits of fast memory that sit between the CPU and the main
memory with the intention to speed things up. The use of these caches
is transparent to the application developers (although they can do a
lot to optimize cache usage), and requires only minor changes in the
OS, as they are controlled by the CPU and the related chipset.
An important part of this solution is that there are different layers
of caches with different sizes and speeds. Usually we are talking about
two or three such layers, and the closer a layer is to the CPU, the
faster and smaller it is.
The Layers:
CPU Layer (AKA registers) - Cache so fast it takes
less than one core cycle to access it.
L1 - These days there are two L1 caches - one for code
instructions and the other for program data. That's because programs
don't modify their own instructions while running (at least not often),
and separating the read-only instructions makes things faster. If you
use an interpreted language such as Python, your code will be kept in
the data cache and you will lose this optimization.
Accessing L1 cache takes about 3 core cycles, which is fast enough to
be irrelevant. CPUs use pipelines, so by the time the instruction
actually runs the data will already be in the CPU.
L2 - Takes about 14 cycles to access. If you have
multiple cores, this layer will be shared between all of them.
Main memory - As we mentioned before, this takes 240
cycles to get to. This is a huge gap. It turns out that if a machine
needs to be really, really fast, like a backbone router or a Cray, it
will not contain any DRAM at all.
At this point Ulrich talked at length about addressing the memory in
the cache. L2 cache works by physical addresses, just like real memory.
L1 cache works a bit differently. It uses something called cache lines.
Each cache line keeps 32 or 64 bytes of information, which could belong
to different variables. Cache lines are addressed by a hash key and a
tag. The hash key can match up to four different cache lines, and the
correct line is then chosen by comparing tags. Ulrich showed that this
is much faster than direct mapping (where each key matches exactly one
line).
The nerdiest part of the talk is now behind us; from here on Ulrich
talked about different programming practices and how they impact performance
by allowing greater or lesser cache utilization. He did amazing
benchmarks and showed us very cool graphs that demonstrate these ideas
visually. I’ll just summarize the ideas.
Cache usage recommendation for C++ developers:
- Sequential memory access is almost 10 times faster than random
access (no surprise for DB developers, I hope). In your data
structures, this means you should use arrays instead of linked lists.
In your code, it means you should avoid indirect calls. C++ developers
- avoid virtual functions and use templates instead.
- Choose data structures as small as possible. There is a huge speed
penalty for data elements that are larger than a cache line. If they
are larger than a memory page, it gets much worse.
- Pre-fetching data (reading it into the cache before it is actually
used) can help hide much of the cost of main memory access. C
developers can force a pre-fetch by accessing the variable before it is
actually used, or by calling an appropriate assembly instruction.
- Use structures to keep together variables that are used together.
This will help you get them on the same cache line.
- L3 cache is used in high-end servers. A large L3 cache, say 32MB,
can give you tremendous speedups, much better than what a faster CPU
will give you. Which is why servers with a large L3 cache are so
expensive. Ulrich gave an example of having Oracle fit entirely into
the L3 cache, which would make it 10 times faster and would justify the
cost. I burst out laughing, because to me fitting the DB into 32MB of
memory is ridiculous - we need 12G for the SGA! But the rest of the
audience took this seriously, so maybe I missed something.
- Multi-threading. Generally a very bad idea - because if a bit of
memory is in the L1 cache of CPU #1, CPU #2 can't use it until #1
releases it, and then #2 reads it from main memory while invalidating
the cache for #1, because the bit was changed by #2. So, if you use
multi-threading, your expensive L1 cache is wasted. This part reminded
me of the way cache fusion works in RAC.
You can find the slides for the talk here,
and the paper that the slides are based on here.
Ulrich claims that the talk contains only about 1% of the content of
the paper. I encourage you to at least look at the slides, because the
graphs that demonstrate the recommendations make a powerful impact.
According to Ulrich, Red Hat and Oracle are working hard so that our DB
code will be as optimized as humanly possible, to make the most of the
CPU caches and the related speedups. Which is a nice thing to keep in
mind - while we are working hard to optimize our applications, queries,
and the DB itself, someone else is working hard to make sure that
everything below this level is already as optimal as possible.
Well, it's not the SGA, it's the Oracle code itself that goes into the L3 cache. In general, 32MB is plenty for the most active working set of Oracle, although the entire code exceeds 120MB in size in 10g.
This is one of the reasons why code bloat is so detrimental to the performance of modern CPUs: they need to cache the code in L3 or L2 to make it run anywhere near the speed of the CPU clock. Anything else causes a huge performance penalty.
There are also other considerations that were not talked about in that presentation, like code and data alignment to 512-byte or other boundaries. This is extremely important for keeping the Ln caches fed at top speed from (slow) main memory. But I digress.
Thanks for the correction. L3 cache does make more sense in terms of instruction caching.
Ulrich was actually asked about alignment, and I think he explained how some compilers can help with this, but I didn’t take notes of this and therefore have no clue what he actually said.