----- Original Message -----
> On 03.05.2013 06:58, Jose Fonseca wrote:
> >
> >
> > ----- Original Message -----
> >> Currently, there's no way to get the high bits of a 32x32
> >> signed/unsigned integer multiplication with tgsi. However, all of
> >> d3d10, OpenGL, and OpenCL support that, so we need it as well.
> >> There are essentially two ways it could be done:
> >> - a 2-destination instruction returning both high and low bits
> >>   (this is how it looks in d3d10 and glsl)
> >> - use the existing umul for the low bits and have another
> >>   instruction for the high bits (this is how it looks in opencl)
> >>
> >> Well there are other possibilities but these looked like they'd
> >> match both APIs and HW reasonably (well with the exception of
> >> things like sse2 which would prefer 2x2 32bit inputs and return
> >> 2x64bit as one reg...).
> >>
> >> Actually it's two new instructions because unlike for the low bits
> >> it matters for the high bits if the source operands are signed or
> >> unsigned.
> >>
> >> Personally I'm favoring two separate instructions for low and high
> >> bits to not have to deal with multi-destination instructions, but
> >> if someone makes a strong case for one returning both low and high
> >> bits I could be convinced otherwise. I think though two
> >> instructions matches most hw very well (with the exception of
> >> software renderers and possibly intel graphics but then a good
> >> backend could certainly recognize this).
> >
> > Roland,
> >
> > I don't know about GPU HW, but I think that what you propose will
> > forever prevent decent SSE code generation with LLVM.
> >
> > Using two separate opcodes for hi/low bits relies on common
> > sub-expression elimination to merge the two multiplication operations
> > back into one. But I strongly doubt that even LLVM's optimization
> > passes will be able to do that.
> >
> > Getting the 64-bit results with LLVM will require sign-extending the
> > source arguments (http://llvm.org/docs/LangRef.html#mul-instruction)
> > or SSE intrinsics. Either way, the expressions for the low and high
> > bits will be radically different, so we'll end up with two multiplies
> > in the end -- which I think is simply inadmissible -- TGSI should not
> > stand in the way of backends generating good code.
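To make the point about widening concrete, here is a minimal scalar sketch in C, assuming plain 32-bit operands. The names `mul_parts`, `umul_wide` and `imul_wide` are hypothetical helpers for illustration only; they are not TGSI opcodes or Mesa functions. It shows one 64-bit multiply yielding both halves, and that the low half is the same for signed and unsigned operands while the high half is not.

```c
#include <stdint.h>

/* Hypothetical container for the two halves of a 64-bit product. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} mul_parts;

/* Unsigned 32x32->64: zero-extend both operands, multiply once, split. */
static mul_parts umul_wide(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)a * (uint64_t)b;
    mul_parts r = { (uint32_t)p, (uint32_t)(p >> 32) };
    return r;
}

/* Signed 32x32->64: sign-extend both operands, multiply once, split.
 * The product is reinterpreted as unsigned before shifting so that the
 * split does not depend on how the compiler shifts negative values. */
static mul_parts imul_wide(int32_t a, int32_t b)
{
    int64_t p = (int64_t)a * (int64_t)b;
    uint64_t bits = (uint64_t)p;
    mul_parts r = { (uint32_t)bits, (uint32_t)(bits >> 32) };
    return r;
}
```

With the same bit patterns as input (0xFFFFFFFF and 2), both helpers return the same `lo`, but `hi` differs depending on whether 0xFFFFFFFF is read as 4294967295 or as -1.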
> You can't generate good code either way, this is a deficiency of the
> sse instruction set.
> As I've outlined in another email, I think the best you can do with
> sse41 is:
> - shuffle both src args (put 2nd/4th elements into 1st/3rd slot)
> - 2x pmuldq/pmuludq for doing the 32x32->64bit mul for both 1st/3rd
>   and 2nd/4th element
> - shuffle the high bits into place (I think this needs 3 hw shuffle
>   instructions)
> - shuffle the low bits into place (can benefit from shuffles for high
>   bits, so just one more shuffle)
>
> Maybe you can do better with more clever shuffles, but in any case the
> low bits will always require one (at least) additional shuffle.
>
> If you have separate opcodes, everything will be the same, except in
> the last step you'll just skip that shuffle and instead just use the
> pmulld instruction, which will do exactly what you need for the low
> bits. Sure multiplications are more effort for the hw, but hell it even
> has the same throughput on most cpus compared to a shuffle, just latency
> is worse. In any case it would be 8 vs 8 instructions, with just one
> instruction of them very slightly worse. We have much more optimization
> opportunities elsewhere than that (I agree that with sse2, which lacks
> pmulld, it would be worse, but we never particularly cared about that).

That's the thing -- if we have 32x32->64 opcodes we can fine-tune this
later. If we stick with separate high-bit opcodes then that ability is
lost (at least without coming back and changing TGSI again).

> >
> > So I strongly think this is a bad idea. TGSI has support for multiple
> > destinations, though we never made much use of it. I see nothing
> > special about it.
> >
> > If you can prove me wrong -- that LLVM can handle merging the
> > multiplies -- fine. But I do think we have bigger fish to fry, so
> > I'd prefer we don't put too much time debating this.
>
> No I doubt llvm can merge it (though in theory nothing would prevent it
> from recognizing the pattern).
> My guess is it will do scalar extraction,
> and use the imul/mul instructions (which can return 2x32bit numbers even
> on 32bit), then combine the vectors back together (most likely element
> by element). If it actually does it like that, a separate mul for the
> low bits would be in fact a win, because it would save the 4 reinsertion
> of the elements at the cost of just one vector mul (llvm uses pmulld
> just fine). But looking at this that way doesn't really make sense, we
> need instructions which make sense for everybody and aren't specified to
> suit one very peculiar implementation.
> But even if it generates optimal code, fact is that the multiply for
> getting the low bits is essentially noise in the whole instruction
> sequence. And who knows maybe intel will one day add some pmulhd/pmulhud
> instruction (which just makes plain more sense for vector instruction
> sets rather than the weird expanding muls).
> So I really don't see how that will prevent good code from being
> generated. Yes it will be one more multiplication (3 instead of 2 if
> doing everything vectorized) but multiplications are hardly expensive
> these days. We have much, much more important things to care about.
>
> But I'd like to hear from other driver writers. It looked like for
> radeon and nouveau separate lo/hi instructions would be perfect, but I
> can't be sure. Intel IGPs OTOH always calculate a 64bit result for 32bit
> multiplies using the accumulator, so two instructions would indeed be
> suboptimal - but since it's the same calculation twice an optimizing
> backend should be able to get rid of the extra calc quite easily.

Not as easy as if we have the 32x32->64bits. I really think that having
an abstraction where an arithmetic operation is broken into two
operations is inherently bad. It is unnecessarily imposing
assumptions/restrictions on the backends.
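For reference, the vectorized sequence discussed in this thread (two expanding multiplies, then moving the high halves into place) can be sketched with SSE2 intrinsics roughly as follows. This is only an illustration under stated assumptions: it covers the unsigned case only, it uses 64-bit shifts plus a mask and OR instead of the shuffles mentioned in the quoted text, and `umul_hi_4x32` and `umul_hi_array4` are hypothetical names, not Mesa or LLVM functions. The low 32 bits would come separately, for example from pmulld (`_mm_mullo_epi32`, SSE4.1).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: high 32 bits of four unsigned 32x32 multiplies. */
static __m128i umul_hi_4x32(__m128i a, __m128i b)
{
    /* pmuludq multiplies lanes 0 and 2, giving two 64-bit products. */
    __m128i even = _mm_mul_epu32(a, b);

    /* Move lanes 1 and 3 down into lanes 0 and 2, then multiply again. */
    __m128i odd = _mm_mul_epu32(_mm_srli_epi64(a, 32),
                                _mm_srli_epi64(b, 32));

    /* High halves of the even products, shifted down into lanes 0 and 2. */
    __m128i even_hi = _mm_srli_epi64(even, 32);

    /* High halves of the odd products already sit in lanes 1 and 3;
     * clear the low halves that occupy lanes 0 and 2. */
    __m128i odd_hi = _mm_and_si128(odd, _mm_set_epi32(-1, 0, -1, 0));

    return _mm_or_si128(even_hi, odd_hi);
}

/* Convenience wrapper over plain arrays (unaligned loads and stores). */
static void umul_hi_array4(const uint32_t a[4], const uint32_t b[4],
                           uint32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, umul_hi_4x32(va, vb));
}
```

Note that the low halves of both expanding multiplies are computed and then discarded here, which is the redundancy a combined 32x32->64 opcode would let a backend exploit.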

Jose
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev