Hi Rudolf,

Good detailed email. Yes, this is how I think it works, and as far as I know,
the L1 instruction cache also writes to the L2. The L2 is the main reason that
CAR works. I have never been happy with the post_car code; something about it
doesn't seem right, but I have never found what. I do think that more care
needs to be taken with cache enable/disable and the MTRR settings.
Marc

On Fri, May 7, 2010 at 12:10 PM, Rudolf Marek <[email protected]> wrote:
> Hi all,
>
> I examined a bit how this works. It may help to read
> http://en.wikipedia.org/wiki/CPU_cache first and then continue here :)
>
> I was particularly curious because we do a writeback-to-writeback copy of
> data from CAR to RAM (to copy the stack and sysinfo, which must cause L1
> evictions), we do DQS memory training (which writes to RAM during CAR),
> and we use the cache to cache the ROM too.
>
> This means not only L1 is used; we must be using L2 too. Here are some
> notes on why I think it works :)
>
> Here is what I found:
>
> The AMD L2 cache is exclusive, which means it only contains data evicted
> from the L1 caches. In other words, the same data is never in both caches
> at once. I could not find any info on whether this holds for the icache
> too, i.e. whether icache lines get moved to L2 or not. They should, but
> it does not seem to happen during CAR.
>
> L1 data cache:
>   Size: 64KB, 2-way associative.
>   Lines per tag: 1, line size: 64 bytes.
> L1 instruction cache:
>   Size: 64KB, 2-way associative.
>   Lines per tag: 1, line size: 64 bytes.
>
> L2 cache (512KB/core):
>   Size: 1024KB, 16-way associative.
>   Lines per tag: 1, line size: 64 bytes.
>
> Here is the basic math for working out the cache organization:
>
> Line size: how many bytes are stored in one cache line (this exploits
> the spatial locality of data). Here it is 64 bytes, so address bits 5:0
> are the offset.
>
> Index: selects which set of cache lines an address maps to; the number
> of sets tells how many index values there are.
>
> The level of associativity (the number of ways) tells how many addresses
> competing for the same index can be stored in the cache simultaneously.
>
> For L1 we have: 64*1024 / 64 / 2 = 512 sets. We have 2 ways (assoc is
> 2), each way holding 64-byte lines, and the total size is 64KB. The
> index is therefore address bits 14:6.
> The rest of the address is used as the tag (the tag, together with the
> set index, identifies the actual location of the data in memory). One
> can say that all addresses sharing the same bits 14:6 compete for the
> same set. We have an associativity of 2, so any contiguous 2^16 bytes
> (64KB) of addresses will fill the whole cache.
>
> For L2 it is 512KB and the assoc is 16. We have 512KB / 16 ways = 32KB
> per way, and 32KB / 64 bytes = 512 sets, so address bits 14:6 again
> build the index. The rest is the tag.
>
> The CAR idea on AMD is simply to use the cache and never cause an
> eviction from L2 to main memory (which is not functioning yet).
>
> Step 0) Enable the cache and set WB MTRRs for the relevant ranges.
> 1) All lines are invalid; validate them with a dummy read exactly as big
>    as the L1 cache. For the instruction cache an instruction fetch is
>    enough.
> 2) The dummy-read region can now be used to store data - it is simply an
>    arbitrary address range, 0-64KB max.
> 3) Caching of the ROM works too, because:
>
> a) An MTRR for the ROM is set (currently only for part of it). It could
>    be WP type, but we use WB; no harm here because we do not modify any
>    code ;)
> b) The L1 instruction cache is filled from the flash chip directly
>    (remember, L2 is an exclusive cache on AMD).
> c) If L1 instruction cache lines are not evicted into L2, then on a
>    cache miss the L1 line is simply invalidated and refilled from the
>    flash ROM. I tried to check this using performance counters, but
>    there is no counter for it. This is the uninteresting case because it
>    does not complicate anything.
> d) If L1 instruction cache lines do get evicted into L2 (which I don't
>    know to be true), then we can run into the following:
>
> I) No L1 data cache line was evicted into L2 - again not an interesting
>    case, because nothing goes wrong.
> II) We have some L1 data cache lines evicted into L2. This really
>     happens in our CAR!
> print_debug("Copying data from cache to RAM -- switching to use RAM as
> stack...");
> memcopy((void *)((CONFIG_RAMTOP) - CONFIG_DCACHE_RAM_SIZE),
>         (void *)CONFIG_DCACHE_RAM_BASE, CONFIG_DCACHE_RAM_SIZE);
>
> It happens here because we copy from the CAR region to RAM while CAR is
> still running. Both regions are WB, so we must evict some L1 cache lines
> for sure, and performance counters confirm this. You may say this is not
> an issue because RAM is running normally, but for example while we
> resume from S3 we cannot overwrite random memory with our CAR... I think
> these evictions so far happen only here, and things still work. Here is
> why:
>
> We have at most 64KB of dirty data; we can spread it into L2 nicely and
> still have a lot of free space, even on systems where we have only 128KB
> of L2. In this case there are no evictions to the system, because the
> data can stay in L2.
>
> Now let's go back: what if the CPU instruction cache gets evicted into
> L2? Here it could cause problems, because L2 would then hold our L1 data
> and random L1 instruction cache code competing for the space.
>
> I think it works here because dirty data is evicted with the lowest
> priority: if all ways of a set are full, a way with "clean" data is
> invalidated first. This saves the day for us, because it guarantees that
> our L1 data will never fall out of the cache - only if we exceed the L2
> cache size with dirty data.
>
> So far we have examined ROM caching and oversized-L1 handling. But the
> memory training writes to not-yet-initialized RAM. How does that work?
>
> I checked, and the memory write uses an instruction which bypasses the
> caches. The read uses the cache, but invalidates the cache line
> afterwards. Again, because we have at most an L1's worth of dirty data
> and L2 is big enough, it does not spoil the party and nothing gets
> evicted back to the non-functioning memory.
>
> The last thing that worries me is speculative fills, which the CPU can
> do.
> I think they are disabled, because the bit for probe fills is 0. Fam11h,
> which has better-documented use of the L2 for general storage, needs
> some other bits toggled to avoid extra speculation. The Fam10h docs
> describe only L1 CAR, and older families likewise describe L1-only CAR.
> In our code we practically use L2 in all cases.
>
> What we could do is program a performance counter for L2 writebacks to
> the system at the beginning of CAR, and in the CAR disable code check
> that it is still zero. This would tell us whether we did something
> nasty.
>
> We could also avoid the WB-to-WB copy of the CAR area. I tried a WB-to-UC
> copy and we then get 0 evictions from L1, which is fine (I did some
> experiments in January; see the "AMD CAR questions" email).
>
> Uhh, it's a long email, took like an hour to write. Please tell me if
> you think it really works this way.
>
> Thanks,
> Rudolf
>
> --
> coreboot mailing list: [email protected]
> http://www.coreboot.org/mailman/listinfo/coreboot

--
http://se-eng.com

