>The statement was that L4 was getting this right. Neither the L4 kernel nor the Coyotos kernel use the XMM registers at all aside from saving and restoring them in context switch. They are compiled to avoid floating point as well.

I didn't specifically say L4 was right, just that the L4 kernels were originally mostly assembly and have since removed most of the assembly that was there for speed (e.g. nearly all of the IPC code), which allows more portability, and yet the libs are going the other way. By "wrong" I meant that the two are heading in opposite directions even though performance and portability are important in both cases. What I wrote was:

"It seems wrong that all the guys working on the L4 series OS have got it to the point where they removed ( most of ?) the asm and made it more portable while the libs are going the other way and we are seeing more assembly."

>Contrary to what was said earlier in the thread, the block zero instructions on the scalar unit are quite good on x86/x64.

The only zero instruction I'm aware of is on the SSE registers (though you can obviously use stosd with 0). For set I go by:

  0-2 KB                    mov with a 4x unrolled loop
  2-32 KB                   MMX/SSE with an 8x unroll
  32 KB to 2-3 MB (cache)   rep stosd
  2 or 3 MB and up          MMX/SSE with non-temporal stores

The above are the fastest options for memory writes, and the non-temporal stores have the added advantage of not filling the cache.

Non-SSE code being comparable in speed only applies to stores (which I'm not sure we discussed; we talked more about copy). For copy (read and store) or store and read (buffers), SSE is nearly always better, especially on Core 2 and i7, and with the new 256-bit ymm registers this will continue to be the case.

E.g. for a circular buffer (16-byte aligned) that I have (a store and a read), SSE is already better at 52 bytes (a 4-byte header and 12 ints; mem copy uses SSE over a certain size) and significantly better (a 45% reduction) at 820 bytes.

>Though nowadays the SIMD and Float registers are so universally used that the case for avoiding them in the kernel is pretty weak.

Agree.
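A minimal sketch of the fill-size tiers listed earlier in this message, in C with SSE2 intrinsics. The names (choose_fill_strategy, fill_nontemporal, fill_bytes) and the cache_bytes parameter are hypothetical and only for illustration; the 2 KB and 32 KB thresholds come from the list, while the "2-3 MB" boundary is passed in because it depends on the cache size. Only the non-temporal tier is written out; the other tiers fall back to memset as a stand-in for the hand-tuned loops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

typedef enum {
    FILL_MOV_UNROLL4,      /* 0-2 KB: plain mov, 4x unrolled loop     */
    FILL_SSE_UNROLL8,      /* 2-32 KB: MMX/SSE stores, 8x unroll      */
    FILL_REP_STOS,         /* 32 KB up to cache size: rep stosd       */
    FILL_SSE_NONTEMPORAL   /* beyond cache size: non-temporal stores  */
} fill_strategy;

/* cache_bytes is an assumed tuning knob standing in for "2-3 MB". */
static fill_strategy choose_fill_strategy(size_t n, size_t cache_bytes)
{
    if (n < 2 * 1024)   return FILL_MOV_UNROLL4;
    if (n < 32 * 1024)  return FILL_SSE_UNROLL8;
    if (n < cache_bytes) return FILL_REP_STOS;
    return FILL_SSE_NONTEMPORAL;
}

/* Fill n bytes without pulling the destination into the cache. */
static void fill_nontemporal(void *dst, uint8_t value, size_t n)
{
    assert(dst != NULL || n == 0);
#if defined(__SSE2__)
    uint8_t *p = (uint8_t *)dst;
    /* Byte stores until the pointer is 16-byte aligned. */
    while (n > 0 && ((uintptr_t)p & 15) != 0) { *p++ = value; n--; }
    __m128i v = _mm_set1_epi8((char)value);
    /* Aligned 16-byte streaming stores that bypass the cache. */
    while (n >= 16) {
        _mm_stream_si128((__m128i *)p, v);
        p += 16;
        n -= 16;
    }
    _mm_sfence();          /* order the streaming stores before later stores */
    while (n > 0) { *p++ = value; n--; }   /* tail bytes */
#else
    memset(dst, value, n); /* fallback when SSE2 is not available */
#endif
}

static void fill_bytes(void *dst, uint8_t value, size_t n, size_t cache_bytes)
{
    switch (choose_fill_strategy(n, cache_bytes)) {
    case FILL_SSE_NONTEMPORAL:
        fill_nontemporal(dst, value, n);
        break;
    default:
        /* Stand-in for the mov, SSE and rep stosd tiers. */
        memset(dst, value, n);
        break;
    }
}
```

The thresholds are the ones given above for one class of machines; they would need re-measuring on other hardware.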
Though on x86_64 (not ARM and PPC), rather than thinking of them as SIMD and float registers, I think it's better to consider them 128/256-bit GP registers (there are only a few things they can't do, and these are rare). Whether you need a 128-bit GP register for anything besides load and store is a valid question. For building 16-byte messages in registers, 16-byte flags, bit ops and memory scanning (e.g. GC), 128 bits are useful, and they can do things standard registers can't, e.g. store byte 11 or short 7, store via a mask, etc.

Since the C language doesn't support a _128 type, there is not much opportunity to try this without asm or intrinsics (i.e. a high barrier to use). Can/will BitC support it down the track (and can the compiler understand it)?

Ben
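As a rough illustration of the "store byte 11 or short 7, store via mask" operations mentioned above, here is a small C sketch using SSE2 intrinsics. It assumes "store" means writing a lane of the 128-bit register while building a 16-byte message; the msg16 type and the build_message and store_masked names are hypothetical. SSE2 has no byte insert (that arrives with SSE4.1), so byte 11 is written by updating the high half of 16-bit lane 5. A plain C fallback is included for builds without SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

typedef struct { uint8_t b[16]; } msg16;   /* a 16-byte message */

/* Build a zeroed 16-byte message with byte 11 and 16-bit lane 7 set. */
static msg16 build_message(uint8_t byte11, uint16_t short7)
{
    msg16 out;
#if defined(__SSE2__)
    __m128i v = _mm_setzero_si128();
    v = _mm_insert_epi16(v, short7, 7);          /* lane 7 = bytes 14..15 */
    int w5 = _mm_extract_epi16(v, 5);            /* lane 5 = bytes 10..11 */
    v = _mm_insert_epi16(v, (w5 & 0x00FF) | (byte11 << 8), 5);
    _mm_storeu_si128((__m128i *)out.b, v);
#else
    memset(out.b, 0, sizeof out.b);
    out.b[11] = byte11;
    out.b[14] = (uint8_t)(short7 & 0xFF);        /* little-endian layout */
    out.b[15] = (uint8_t)(short7 >> 8);
#endif
    return out;
}

/* Write only the bytes of src whose mask byte has its top bit set. */
static void store_masked(uint8_t *dst, const msg16 *src, const uint8_t mask[16])
{
    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)src->b);
    __m128i m = _mm_loadu_si128((const __m128i *)mask);
    _mm_maskmoveu_si128(v, m, (char *)dst);      /* byte-masked store */
    _mm_sfence();                                /* the store is non-temporal */
#else
    for (size_t i = 0; i < 16; i++)
        if (mask[i] & 0x80)
            dst[i] = src->b[i];
#endif
}
```

With intrinsics the compiler still sees these as vector operations rather than a native 128-bit integer type, which is the barrier to use raised above.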
_______________________________________________ bitc-dev mailing list [email protected] http://www.coyotos.org/mailman/listinfo/bitc-dev
