On 03.05.2013 16:32, Jose Fonseca wrote:
>
> ----- Original Message -----
>> On 03.05.2013 06:58, Jose Fonseca wrote:
>>>
>>> ----- Original Message -----
>>>> Currently, there's no way to get the high bits of a 32x32
>>>> signed/unsigned integer multiplication with tgsi. However, all of
>>>> d3d10, OpenGL, and OpenCL support that, so we need it as well.
>>>> There are essentially two ways it could be done:
>>>> - a 2-destination instruction returning both high and low bits
>>>>   (this is how it looks in d3d10 and glsl)
>>>> - use the existing umul for the low bits and have another
>>>>   instruction for the high bits (this is how it looks in opencl)
>>>>
>>>> Well, there are other possibilities, but these looked like they'd
>>>> match both APIs and HW reasonably (well, with the exception of
>>>> things like sse2, which would prefer 2x2 32bit inputs and return
>>>> 2x64bit as one reg...).
>>>>
>>>> Actually it's two new instructions because unlike for the low bits
>>>> it matters for the high bits if the source operands are signed or
>>>> unsigned.
>>>>
>>>> Personally I'm favoring two separate instructions for low and high
>>>> bits to not have to deal with multi-destination instructions, but
>>>> if someone makes a strong case for one returning both low and high
>>>> bits I could be convinced otherwise. I think though two
>>>> instructions matches most hw very well (with the exception of
>>>> software renderers and possibly intel graphics but then a good
>>>> backend could certainly recognize this).
>>> Roland,
>>>
>>> I don't know about GPU HW, but I think that what you propose will
>>> forever prevent decent SSE code generation with LLVM.
>>>
>>> Using two separate opcodes for hi/low bits relies on common
>>> sub-expression elimination to merge the two multiplication operations
>>> back into one. But I strongly doubt that even LLVM's optimization
>>> passes will be able to do that.
>>>
>>> Getting the 64-bit results with LLVM will require sign-extending the
>>> source arguments (http://llvm.org/docs/LangRef.html#mul-instruction)
>>> or SSE intrinsics. Either way, the expressions for the low and high
>>> bits will be radically different, so we'll end with two multiplies in
>>> the end -- which I think is simply inadmissible -- TGSI should not
>>> stand in the way of backends generating good code.
>> You can't generate good code either way; this is a deficiency of the
>> sse instruction set.
>> As I've outlined in another email, I think the best you can do with
>> sse41 is:
>> - shuffle both src args (put 2nd/4th elements into 1st/3rd slot)
>> - 2xpmuldq/pmuludq for doing the 32x32->64bit mul for both 1st/3rd and
>>   2nd/4th element
>> - shuffle the high bits into place (I think this needs 3 hw shuffle
>>   instructions)
>> - shuffle the low bits into place (can benefit from shuffles for high
>>   bits, so just one more shuffle)
>>
>> Maybe you can do better with more clever shuffles, but in any case the
>> low bits will always require one (at least) additional shuffle.
>>
>> If you have separate opcodes, everything will be the same, except in
>> the last step you'll just skip that shuffle and instead use the
>> pmulld instruction, which will do exactly what you need for the low
>> bits. Sure, multiplications are more effort for the hw, but hell, it
>> even has the same throughput on most cpus compared to a shuffle, just
>> latency is worse. In any case it would be 8 vs 8 instructions, with
>> just one of them very slightly worse. We have much more optimization
>> opportunities elsewhere than that (I agree that with sse2, which lacks
>> pmulld, it would be worse, but we never particularly cared about that).
> That's the thing -- if we have 32x32->64 opcodes we can fine tune this later.
> If we stick with separate high bit opcodes then that ability is lost (at
> least without coming back and changing TGSI again).
>
>>> So I strongly think this is a bad idea. TGSI has support for multiple
>>> destinations, though we never made much use of it. I see nothing
>>> special about it.
>>>
>>> If you can prove me wrong -- that LLVM can handle merging the
>>> multiplies -- fine. But I do think we have bigger fish to fry, so
>>> I'd prefer we don't put too much time debating this.
>> No, I doubt llvm can merge it (though in theory nothing would prevent it
>> from recognizing the pattern). My guess is it will do scalar extraction,
>> and use the imul/mul instructions (which can return 2x32bit numbers even
>> on 32bit), then combine the vectors back together (most likely element
>> by element). If it actually does it like that, a separate mul for the
>> low bits would in fact be a win, because it would save the 4 reinsertions
>> of the elements at the cost of just one vector mul (llvm uses pmulld
>> just fine). But looking at this that way doesn't really make sense, we
>> need instructions which make sense for everybody and aren't specified to
>> suit one very peculiar implementation.
>> But even if it generates optimal code, fact is that the multiply for
>> getting the low bits is essentially noise in the whole instruction
>> sequence. And who knows, maybe intel will one day add some pmulhd/pmulhud
>> instruction (which just plain makes more sense for vector instruction
>> sets rather than the weird expanding muls).
>> So I really don't see how that will prevent good code from being
>> generated. Yes, it will be one more multiplication (3 instead of 2 if
>> doing everything vectorized) but multiplications are hardly expensive
>> these days. We have much, much more important things to care about.
>>
>> But I'd like to hear from other driver writers. It looked like for
>> radeon and nouveau separate lo/hi instructions would be perfect, but I
>> can't be sure.
>> Intel IGPs OTOH always calculate a 64bit result for 32bit
>> multiplies using the accumulator, so two instructions would indeed be
>> suboptimal - but since it's the same calculation twice an optimizing
>> backend should be able to get rid of the extra calc quite easily.
> Not as easy as if we have the 32x32->64bits.
>
>
> I really think that having an abstraction where an arithmetic operation is
> broken into two operations is inherently bad. It is unnecessarily imposing
> assumptions/restrictions on the backends.
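
For reference, a minimal C sketch of the operation being discussed: one
widening 32x32->64 multiply from which both halves are taken. The names
mul64_parts, umul_wide and imul_wide are hypothetical and only for
illustration; they are not TGSI opcodes or Mesa functions.

```c
#include <stdint.h>

/* Hypothetical result type: both 32-bit halves of one 64-bit product. */
typedef struct { uint32_t lo, hi; } mul64_parts;

/* Unsigned 32x32->64: zero-extend both operands, multiply once. */
static mul64_parts umul_wide(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)a * (uint64_t)b;
    mul64_parts r = { (uint32_t)p, (uint32_t)(p >> 32) };
    return r;
}

/* Signed 32x32->64: sign-extend both operands, multiply once. */
static mul64_parts imul_wide(int32_t a, int32_t b)
{
    int64_t p = (int64_t)a * (int64_t)b;
    uint64_t u = (uint64_t)p;   /* reinterpret the bits to split them */
    mul64_parts r = { (uint32_t)u, (uint32_t)(u >> 32) };
    return r;
}
```

With the same bit patterns as input, umul_wide(0xFFFFFFFE, 3) and
imul_wide(-2, 3) return the same low half but different high halves,
which is why the high bits need signed and unsigned variants while the
low bits do not. Both halves come out of a single multiply, which is
what a two-destination opcode would expose directly.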

I think I'd rather have two destination registers on one instruction for
this reason. Splitting into two instructions at the driver backend level
is much simpler than reassembling a 64-bit integer from two separate
instructions later.

The question is how to distribute the result: low parts to DST[0] and
high parts to DST[1], or low parts to DST[0,1].x,z and high parts to
DST[0,1].y,w. The latter would match how we treat other 64-bit values
right now (doubles/float64).

>
> Jose
> _______________________________________________
> mesa-dev mailing list
> mesa-dev@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/mesa-dev
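
To illustrate the two result layouts from the reply above, here is a
minimal C sketch for a 4-channel unsigned multiply. The type vec4u and
both function names are hypothetical. The mapping of source channels to
channel pairs in the second layout is an assumption, since the message
only says that low parts go to x,z and high parts to y,w.

```c
#include <stdint.h>

/* Hypothetical 4-channel register: c[0..3] = x, y, z, w. */
typedef struct { uint32_t c[4]; } vec4u;

/* Layout A: DST[0] holds the four low halves, DST[1] the four high halves. */
static void umul_wide_split(const vec4u *a, const vec4u *b, vec4u dst[2])
{
    for (int i = 0; i < 4; i++) {
        uint64_t p = (uint64_t)a->c[i] * b->c[i];
        dst[0].c[i] = (uint32_t)p;
        dst[1].c[i] = (uint32_t)(p >> 32);
    }
}

/* Layout B: each 64-bit product fills a channel pair (low in x or z,
 * high in y or w). Assumed mapping: products 0,1 go to DST[0].xy/.zw
 * and products 2,3 go to DST[1].xy/.zw. */
static void umul_wide_paired(const vec4u *a, const vec4u *b, vec4u dst[2])
{
    for (int i = 0; i < 4; i++) {
        uint64_t p = (uint64_t)a->c[i] * b->c[i];
        vec4u *d = &dst[i / 2];      /* which destination register */
        int base = (i % 2) * 2;      /* 0 selects .xy, 2 selects .zw */
        d->c[base]     = (uint32_t)p;
        d->c[base + 1] = (uint32_t)(p >> 32);
    }
}
```

Layout A lets a backend with separate lo/hi multiplies write each
destination with one instruction, while layout B keeps each 64-bit value
contiguous in the way the message describes for doubles.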