Qingqing,

On 12/8/05 8:07 PM, "Qingqing Zhou" <[EMAIL PROTECTED]> wrote:
> /* prefetch ahead */
> __asm__ __volatile__ (
>     "1: prefetchnta 128(%0)\n"
>     : : "r" (s) : "memory" );

I think prefetch at this granularity is already handled by the optimizer
in recent GNU compilers, and the memory streaming operations in the
Pentium 4 ISA are likewise part of the standard optimizations gcc can do
now.

What I think would be tremendously beneficial is to implement L2 cache
blocking in certain key code paths like sort.  What I mean by "cache
blocking" is performing as many operations as possible on one block of
memory (maybe 128 pages' worth for a 1MB cache), then moving to the next
batch of memory and performing all of the work on that batch, and so on.

The other thing to consider in conjunction with this would be maximizing
use of the instruction cache, increasing use of the parallel functional
units, and minimizing pipeline stalls.  The best way to do that is to
group operations into tighter loops and to separate out the branching.

So instead of structures like this:

    function_row_at_a_time(row)
        if conditions
            do some work
        else if other
            do different work
        else if error
            print error_log

you'd have:

    function_buffer_at_a_time(buffer_of_rows)
        loop on sizeof(buffer_of_rows) / sizeof_L2cache
            do a lot of work on each row
        loop on sizeof(buffer_of_rows) / sizeof_L2cache
            if error
                exit

The ideas in the above optimizations:
- Delay work until a buffer of rows can be gathered
- Increase the "computational intensity" of the loops by putting more
  instructions together
- While in loops doing lots of work, avoid branches / jumps

I've appended a couple of rough C sketches of these ideas below my sig.

- Luke
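
For the prefetch itself: the portable spelling of that hint in modern gcc
is __builtin_prefetch rather than inline asm.  A minimal sketch -- the
copy loop and function name are mine, just to show the builtin; the third
argument (locality 0) is the closest match to prefetchnta:

    #include <stddef.h>

    /*
     * Copy with a software prefetch running ~128 bytes ahead, the same
     * distance as the asm above.  Second argument 0 = prefetch for read,
     * third argument 0 = no temporal locality (like prefetchnta).
     */
    static void
    copy_with_prefetch(char *dst, const char *src, size_t len)
    {
        size_t  i;

        for (i = 0; i < len; i++)
        {
            /* one hint per 64-byte cache line is enough */
            if ((i & 63) == 0)
                __builtin_prefetch(src + i + 128, 0, 0);
            dst[i] = src[i];
        }
    }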
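
And here is a rough C rendering of the buffer-at-a-time shape above.  All
the names here (Row, do_work, process_buffer, the 1MB block size) are
hypothetical, just to show the two-pass structure: one tight loop that
does the work an L2-sized block at a time, and a second loop that hoists
the error branch out of the hot path:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define L2_BYTES    (1024 * 1024)       /* 1MB L2, as in the example */

    typedef struct
    {
        int     key;
        bool    failed;
    } Row;                                  /* hypothetical row type */

    #define BLOCK_ROWS  (L2_BYTES / sizeof(Row))

    /* hypothetical per-row work; sets a flag instead of branching */
    static void
    do_work(Row *r)
    {
        r->failed = (r->key < 0);
    }

    static int
    process_buffer(Row *rows, size_t nrows)
    {
        size_t  base,
                i;

        for (base = 0; base < nrows; base += BLOCK_ROWS)
        {
            size_t  end = base + BLOCK_ROWS;

            if (end > nrows)
                end = nrows;

            /* pass 1: all the work on one L2-sized block, no branching */
            for (i = base; i < end; i++)
                do_work(&rows[i]);

            /* pass 2: check errors while the block is still in cache */
            for (i = base; i < end; i++)
            {
                if (rows[i].failed)
                {
                    fprintf(stderr, "row %zu failed\n", i);
                    return -1;
                }
            }
        }
        return 0;
    }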