Kevin Lawton wrote:
> I'll commit your write-cache code mods soon.
Thanks. I attach a version which avoids unneeded seeks and
implements a "sync" option.
>
> If library execution performance is not an issue, I'm still
> interested in memory consumption by the map algorithms. Memory
> use has a way of pushing other critical pieces of data out of
> the processor caches.
Yes, you're right. That is the weak point of the current
map implementation. I'm storing the key combined with the
sector data, which is bad because the cache is polluted while
traversing the tree. I should combine the key with a pointer to
the data instead, but that was slightly more coding; it should
be done in the near future. The nice thing about the map, however,
is that it internally uses a chunk allocator. This means that the
key/value pairs are stored in a number of contiguous memory
regions (probably 4K each, depending on the implementation; I'll
have to look at the STL source), which improves cache performance,
provided I have implemented storing the pointers. (Storing
pointers, however, implies either malloc/free calls -- currently
one malloc is done for every 7 sectors -- or writing my own chunk
allocator; I'll think about it a while.)
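For what it's worth, a minimal sketch of the pointer-storing variant
(names like Sector/put/get are made up here, not taken from
write-cache.cc): only the pointer travels through the tree nodes, so a
lookup no longer drags whole sectors through the cache.

```cpp
#include <map>
#include <cstring>
#include <cassert>

const unsigned SECTOR_SIZE = 512;   // hypothetical; the real size lives in write-cache.cc

struct Sector {
    char data[SECTOR_SIZE];
};

// Key combined with a *pointer* to the data: tree traversal now only
// touches small nodes, the 512-byte payloads stay out of the way.
std::map<unsigned, Sector*> cache;

void put(unsigned lba, const char *buf)
{
    Sector *&s = cache[lba];        // operator[] creates a null pointer slot on first use
    if (s == 0)
        s = new Sector;
    std::memcpy(s->data, buf, SECTOR_SIZE);
}

Sector *get(unsigned lba)
{
    std::map<unsigned, Sector*>::iterator it = cache.find(lba);
    return (it == cache.end()) ? 0 : it->second;
}
```

The downside mentioned above is visible here: every new sector now
costs its own allocation, where the embedded version amortized one
malloc over several sectors.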
An alternative would be to use the Unix dbopen() with the hash
option to create the write-cache. Do you have experience with its
performance?
While talking/thinking about optimal cache strategies,
I have done some pondering on designing a hash implementation
for the dynamic translation jump target lookups.
I think it could be very efficient to have a simple
(but efficient :-) implementation
which uses a 2 or 3 level lookup. The first level maps on the
L1 internal cache, the 2nd level on the L2 cache and perhaps
even a third level which uses main memory.
The first level and second level implementations are just a
few lines of code (minimal impact on the instruction caches).
A simple two-deep cache replacement strategy has proven very
effective on chess hashtable lookups. The idea is that we "reserve"
4 Kbyte of L1 cache (storing 512 lookup values;
we'll have to experiment with the real
sizes). The hashkey calculation would be:
unsigned key= ladr & 0xFF;
the L1 table definition:
struct L1slot {
    unsigned key1, val1;
    unsigned key2, val2;
} L1tab[0x100];    /* 256 slots * 16 bytes = 4 Kbyte */
the probe would be:
struct L1slot *p= &L1tab[key];
if (p->key1 == ladr)    /* compare the full address, not the index */
    return p->val1;
else if (p->key2 == ladr)
    return p->val2;
====
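To make the two-deep scheme concrete, here is a self-contained sketch
of probe plus replacement (function names l1_probe/l1_update are mine,
not from any existing source). On a miss the update demotes the primary
entry to the second slot, so the two most recently used translations
per index survive:

```cpp
#include <cassert>

struct L1slot {
    unsigned key1, val1;
    unsigned key2, val2;
} L1tab[0x100];                       /* 256 slots * 16 bytes = 4 Kbyte */

/* Store the translated value for linear address ladr; the old primary
   entry is demoted to the second slot (two-deep replacement). */
void l1_update(unsigned ladr, unsigned val)
{
    struct L1slot *p = &L1tab[ladr & 0xFF];
    p->key2 = p->key1; p->val2 = p->val1;
    p->key1 = ladr;    p->val1 = val;
}

/* Returns 1 and fills *val on a hit, 0 on a miss (fall through to L2).
   Note: a real version needs a sentinel so that ladr 0 cannot falsely
   hit the zero-initialized table. */
int l1_probe(unsigned ladr, unsigned *val)
{
    struct L1slot *p = &L1tab[ladr & 0xFF];
    if (p->key1 == ladr) { *val = p->val1; return 1; }
    if (p->key2 == ladr) { *val = p->val2; return 1; }
    return 0;
}
```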
When this fails we resort to an L2 cache lookup. (Perhaps we should
skip the L1 cache; that will be a matter of heuristic tuning.)
The basic idea of the L2 cache is to profit from the x86
behaviour of cache bursts. Whenever 1 memory byte is touched and it
is not present in the cache, the cache hardware fills a whole
cache line with a burst. This means that 64 bytes are fetched
(depending on the cache line size of the actual model: Pentium
II/III, Athlon, ...) in 4 bursts of 16 bytes. So although we just
wanted access to 1 byte, we actually get access to the next 63
bytes "for free". Modelled on the above two-deep cache
implementation, we could implement an 8-deep implementation. The
tests on key2 to key8 are guaranteed to access data already in
the cache! The key could be the same, resulting in 256 * 64 =
16 Kbyte of L2 cache use (and 2048 stored lookup values). If this
lookup fails too, we should resort to a smart, i.e. complex, L3
implementation.
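A sketch of the 8-deep variant (again, the names are mine): one slot is
laid out to fill exactly one 64-byte cache line, so after the first key
comparison has possibly missed, the remaining seven are guaranteed hits
on the freshly filled line. Replacement shifts the pairs down, dropping
the oldest:

```cpp
#include <cassert>

/* 8 keys + 8 values of 4 bytes each = 64 bytes: exactly one cache
   line, assuming the table is 64-byte aligned. */
struct L2slot {
    unsigned key[8];
    unsigned val[8];
} L2tab[0x100];                      /* 256 * 64 bytes = 16 Kbyte */

int l2_probe(unsigned ladr, unsigned *val)
{
    struct L2slot *p = &L2tab[ladr & 0xFF];
    for (int i = 0; i < 8; i++)
        if (p->key[i] == ladr) { *val = p->val[i]; return 1; }
    return 0;                        /* miss: resort to the L3 lookup */
}

void l2_update(unsigned ladr, unsigned v)
{
    struct L2slot *p = &L2tab[ladr & 0xFF];
    /* shift the pairs down; the oldest entry, in slot 7, falls out */
    for (int i = 7; i > 0; i--) { p->key[i] = p->key[i-1]; p->val[i] = p->val[i-1]; }
    p->key[0] = ladr; p->val[0] = v;
}
```

The same ladr-0 caveat as for the L1 table applies: a real version
needs a sentinel against false hits in the zero-initialized table.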
The main idea is to have the most recently used lookup values
hardwired this way in e.g. 25% of the L1 and L2 caches.
A necessary precondition is that many more lookups are executed
than hashtable updates, because each update has to be performed 2
or 3 times, depending on the number of levels.
I think it is impossible to get a dynamic translation hash table
"right" the first time, but the L1 and L2 implementations are
simple, so we could easily try out a large number of different
hash strategies.
Tom
write-cache.cc