On Tue, Jun 13, 2017 at 4:12 PM, Roland Scheidegger <[email protected]> wrote:
> On 13.06.2017 at 15:11, Karol Herbst wrote:
>> On Tue, Jun 13, 2017 at 2:18 PM, Roland Scheidegger <[email protected]>
>> wrote:
>>> On 13.06.2017 at 08:57, Karol Herbst wrote:
>>>> On Tue, Jun 13, 2017 at 2:17 AM, Roland Scheidegger <[email protected]>
>>>> wrote:
>>>>> I am actually also thinking this should be different.
>>>>>
>>>>> e.g. imho MAD means the operation can be either fused or unfused.
>>>>> This is the "traditional" definition of MAD - opencl for instance will
>>>>> follow this too, albeit this isn't mentioned in the gallium docs (it
>>>>> probably should be).
>>>>> (OpenCL says: "Whether or how the product of a * b is rounded and how
>>>>> supernormal or subnormal intermediate products are handled is not
>>>>> defined. mad is intended to be used where speed is preferred over
>>>>> accuracy.")
>>>>> I think doing something different here in gallium can only lead to
>>>>> madness long term - glsl doesn't have mad in the first place, and as far
>>>>> as I can tell d3d10 is ok with fused/unfused mad too (the docs stating
>>>>> "Fused operations (such as mad, dp3) produce results that are no less
>>>>> accurate than the worst possible serial ordering of evaluation of the
>>>>> unfused expansion of the operation.")
>>>>>
>>>>> This means that mul+add cannot be fused anywhere to a mad if precise is
>>>>> specified, and therefore you should never have to worry about doing a
>>>>> fused or unfused mul/add in the driver with a mad - it's enough if you
>>>>> just don't fuse mul+add in the driver itself (if you can't do unfused
>>>>> mad).
>>>>>
>>>>> Roland
>>>>>
>>>>
>>>> Well, there is a TGSI peephole doing this mul+add=>mad optimisation,
>>>> because it isn't wrong, because mad != fma and mul+add==mad, but on
>>>> Fermi+ Nvidia hardware there is no mad, only fma, and because mad != fma,
>>>> we need to split it up again.
>>>>
>>>> So either TGSI doesn't merge it if the instruction is flagged as precise
>>>> (which it is in those tests mentioned), although it is correct, or we
>>>> lower something in the driver, because the instruction isn't supported
>>>> by the hardware all along.
>>>
>>> Yes, I think the TGSI peephole shouldn't merge mul+add to mad with
>>> precise. You say this isn't wrong, but imho it clearly is, because no one
>>> ever said MAD can't be a fused add - it is multiply + add, yes, but
>>> whether there's intermediate rounding or not isn't specified. FWIW gallivm
>>> code also assumes this, and will use llvm.fmuladd for implementation (which
>>> is exactly the same "mul+add" story as opencl mad, and will use fma on
>>> cpus supporting it and separate mul+add otherwise, save some bugs in
>>> older llvm versions apparently).
>>> So we should just clarify that in the tgsi docs - mad is multiply + add,
>>> with undefined intermediate rounding, it can be a fused mul+add or an
>>> unfused one (technically it could also be something in-between I suppose
>>> since the apis just specify the accuracy isn't worse than an unfused
>>> multiply + add). Every driver gets to use what it can do fastest for it,
>>> and because there's no specified intermediate rounding for it, precise
>>> doesn't change anything there.
>>>
>>> That's at least my opinion what TGSI_OPCODE_MAD should be (of course,
>>> older gpus always used unfused mad, but this wasn't a requirement).
>>>
>>> Roland
>>>
>>
>> I think the best idea would be to specify that:
>> TGSI_OPCODE_MAD is unfused mul+add
>> TGSI_OPCODE_FMA is fused mul+add
>>
>> Having TGSI_OPCODE_MAD being unfused and fused adds an ambiguity
>> without providing any advantages imho.
>>
>> This way it's clear what both are. The backend can still decide that it
>> can use FMA to implement TGSI_OPCODE_MAD or that it can't use MAD and
>> splits it up, but then the backend decides and the choice is explicit and
>> respects limitations of the hardware, which Gallium/TGSI doesn't know about.
>
> I just don't agree with that. There's lots of apis which have such an
> ambiguous mad, with precisely the intention of it being as fast as
> possible, with undefined intermediate rounding. I think there's a reason
> that d3d10 mad, opencl mad, llvm fmuladd all are exactly like that. Why
> should tgsi mad be different?
> It exists because you otherwise cannot say you don't want to allow
> unsafe math generally, but are ok if a mad is either fused or not. If
> you require a fused one, use fma. If you require an unfused
> multiply+add, just use mul and add. If you don't care, use mad.
> Granted, arguably with per-instruction precise modifier, mul + add
> without the modifier works as well.
>
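
For reference, the fused/unfused difference being argued about above can be made concrete with a small C sketch. The function names are made up for illustration and are not Mesa or TGSI code. The "fused" variant is only an emulation, assuming that computing in double is good enough: the product of two floats is exact in double, but the final sum is still rounded twice, so this is not bit-identical to a hardware fma for every input.

```c
/* Hypothetical helper: unfused multiply-add. The volatile store forces the
 * product to be rounded to float before the add, and keeps the compiler
 * from contracting the expression into an fma on its own. */
static float mad_unfused(float a, float b, float c)
{
    volatile float prod = a * b;   /* first rounding happens here */
    return prod + c;               /* second rounding happens here */
}

/* Hypothetical helper: emulated fused multiply-add. The product of two
 * floats fits exactly in a double, so the product itself is not rounded.
 * Assumption: rounding the sum in double and then again to float is
 * acceptable for this illustration. */
static float mad_fused_emulated(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}
```

With a = b = 0x1.001p0f (1 + 2^-12) and c = -0x1.002p0f (-(1 + 2^-11)), the exact product is 1 + 2^-11 + 2^-24. The unfused version rounds that to 1 + 2^-11 and returns 0, while the emulated fused version returns 2^-24. Both are valid results for an "either fused or unfused" mad, which is why the choice matters once precise is involved.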
Okay, so I think the sanest thing to do now is to adjust the TGSI peephole
so that it does not merge mul+add into a mad if either the mul or the add
has the precise modifier.

>>
>> Or we remove TGSI_OPCODE_MAD and let the backends do the opts.
> This would be a possibility, but backends might not be prepared for it
> (e.g. I don't think gallivm would let llvm emit fused fmas for mul + add
> sequence). Plus mad being so common makes the tgsi look nicer.
>
> Roland
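
A rough sketch of the peephole guard proposed above, in C. Everything here (the `sketch_` types, the field names, the register numbering) is hypothetical and only stands in for whatever the real TGSI peephole uses; it is meant to show the shape of the check, not the actual Mesa code.

```c
#include <stdbool.h>

/* Hypothetical, simplified instruction representation. */
enum sketch_opcode { SKETCH_OP_MUL, SKETCH_OP_ADD, SKETCH_OP_MAD };

struct sketch_inst {
    enum sketch_opcode opcode;
    bool precise;   /* stand-in for the per-instruction precise modifier */
    int dst;        /* destination register index */
    int src[3];     /* source register indices, -1 if unused */
};

/* Returns true if a MUL followed by an ADD may be merged into a MAD.
 * The merge is refused as soon as either instruction is marked precise,
 * because MAD leaves the intermediate rounding undefined. */
static bool sketch_can_merge_to_mad(const struct sketch_inst *mul,
                                    const struct sketch_inst *add)
{
    if (mul->opcode != SKETCH_OP_MUL || add->opcode != SKETCH_OP_ADD)
        return false;

    if (mul->precise || add->precise)
        return false;

    /* The ADD has to consume the MUL result in one of its two sources. */
    return add->src[0] == mul->dst || add->src[1] == mul->dst;
}
```

A real pass would also have to check that the MUL result has no other users and is not overwritten between the two instructions; that part is left out here.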
