On Sat, Sep 11, 2010 at 9:34 AM, Scott Duplichan <[email protected]> wrote:
> ]-----Original Message-----
> ]From: [email protected] [mailto:[email protected]] On Behalf Of Arne Georg Gleditsch
> ]Sent: Saturday, September 11, 2010 06:01 AM
> ]To: Scott Duplichan
> ]Cc: 'Marc Jones'; 'Carl-Daniel Hailfinger'; 'Coreboot'
> ]Subject: Re: [coreboot] rfc - gcc builtins and memset memcpy memmove memcmp
> ]
> ]"Scott Duplichan" <[email protected]> writes:
> ]> In this report:
> ]> http://article.gmane.org/gmane.linux.bios/57707,
> ]> Arne may have been encountering the ClLinesToNbDis issue
> ]> (assuming the memset code was running from flash). Switching
> ]> to rep movs would greatly improve performance because unlike
> ]> a byte loop, rep movs loops in microcode which does not cause
> ]> continuous flash memory accesses.
> ]
> ]This was my assumption as well. After fixing the ClLinesToNbDis
> ]setting, I have removed the rep stosb code from my tree, and so far I've
> ]not observed the pathological memset behaviour that caused me to put it
> ]in in the first place. (As mentioned earlier this was never altogether
> ]deterministic, I'm assuming some critical part of the original memset
> ]loop needed to straddle cache lines or something for it to manifest.)
>
> Interesting point about memcpy straddling a cache line boundary. It got
> me thinking about what the DediProg em100 trace function shows when
> booting from SPI flash. With SPI, the SB initially reads a dword at a
> time. If the processor is not caching code, a byte loop memcpy would
> trigger multiple dword reads from the flash chip for every byte copied.
> If BIOS sets SB option PrefetchEnSPIFromHost, then the SB will switch
> to cache line reads, and cache the last line read. Since a byte loop
> memcpy fits in a cache line, it seems conceivable that memcpy performance
> would be good unless the function straddles a cache line boundary. I am
> not sure what the situation is with LPC flash.
>
> Anyway, I noticed coreboot is not setting the AMD SB bit
> PrefetchEnSPIFromHost. For big payloads, setting this bit could cut
> boot time by eliminating overhead when reading big chunks from SPI
> flash memory.

Oh, we should do that.

But that doesn't really explain why gcc doesn't emit a rep stos or rep movs (which should hit the cache). That should be an easy optimization for gcc. It also doesn't address why coreboot has its own functions when we could use the gcc builtins, which should be optimized for the architecture they are built for.

Marc

-- 
http://se-eng.com

-- 
coreboot mailing list: [email protected]
http://www.coreboot.org/mailman/listinfo/coreboot
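
For reference, here is a rough sketch of the three memset variants this thread keeps comparing: a plain byte loop, a rep stosb version, and the gcc builtin. This is not the code from the coreboot tree or from Arne's patch. The function names (memset_byte_loop, memset_rep_stosb, memset_builtin, variants_agree) are made up for illustration, and the inline asm assumes gcc or clang on x86 with the direction flag clear, as the usual ABIs guarantee on function entry. On other targets the rep stosb variant falls back to the byte loop.

```c
#include <stddef.h>
#include <string.h>

/* Variant 1: plain byte loop. Each iteration re-executes the loop
 * instructions, so uncached code running from flash keeps fetching them. */
static void *memset_byte_loop(void *dst, int c, size_t n)
{
    /* volatile stops the compiler from recognizing the loop and
     * replacing it with a call to memset. */
    volatile unsigned char *p = dst;

    while (n--)
        *p++ = (unsigned char)c;
    return dst;
}

/* Variant 2: rep stosb. The repeat count is handled by the CPU, so there
 * is a single instruction to fetch for the whole fill. */
static void *memset_rep_stosb(void *dst, int c, size_t n)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    void *d = dst;

    __asm__ volatile ("rep stosb"
                      : "+D" (d), "+c" (n)   /* destination and count are updated */
                      : "a" (c)              /* fill byte is taken from AL */
                      : "memory");
#else
    memset_byte_loop(dst, c, n);             /* non-x86 fallback */
#endif
    return dst;
}

/* Variant 3: gcc builtin. The compiler may expand it inline or emit a
 * call to memset, depending on size, target and options. */
static void *memset_builtin(void *dst, int c, size_t n)
{
    return __builtin_memset(dst, c, n);
}

/* Hypothetical self-check: fill the same unaligned sub-range with each
 * variant and confirm all three buffers end up identical. Returns 1 if so. */
static int variants_agree(void)
{
    unsigned char a[64], b[64], c[64];

    memset(a, 0xAA, sizeof a);
    memset(b, 0xAA, sizeof b);
    memset(c, 0xAA, sizeof c);

    memset_byte_loop(a + 3, 0x5C, 37);
    memset_rep_stosb(b + 3, 0x5C, 37);
    memset_builtin(c + 3, 0x5C, 37);

    return memcmp(a, b, sizeof a) == 0 && memcmp(a, c, sizeof a) == 0;
}
```

One caveat on the builtin route: gcc is allowed to lower __builtin_memset to an ordinary call to memset, and it expects memset, memcpy, memmove and memcmp to be available even in a freestanding build. So using the builtins does not by itself remove the need for an out-of-line implementation somewhere in the tree.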

