On Wed, Dec 12, 2012 at 5:06 PM, Paul Berry <stereotype...@gmail.com> wrote:
> On 11 December 2012 23:49, Aras Pranckevicius <a...@unity3d.com> wrote:
>>
>>> For the initial implementation I've chosen a strategy that operates
>>> exclusively at the GLSL IR level, so that it doesn't require the
>>> cooperation of the driver back-ends.
>>
>> Wouldn't this negatively affect performance of some GPUs?
>
> I'm glad you asked--I've actually had quite a bit of in-person discussion
> with Eric and Ian about this.
>
> With the i965 back-end, we're expecting a slight performance improvement,
> based on the following reasoning:
>
> - Most of the packing/unpacking operations in the shader will be coalesced
> with other operations by optimization passes, so they won't negatively
> impact performance. This is especially true in the fragment shader, where
> operations are scalarized, so the packing/unpacking should just turn into
> simple scalar copies, and those should be completely eliminated by copy
> propagation. Most programs spend most of their time in the fragment shader
> anyhow, so the performance penalty is already limited to shaders that have a
> smaller contribution to execution time.
>
> - The extra operations we are talking about are register-to-register
> moves--no memory access is involved, and no ALU resources are tied up. So
> there's a pretty small upper limit to the performance penalty even in the
> case where optimization can't eliminate the copy.
>
> - Having packed varyings will mean that the vertex shader spends less time
> writing its output to the VUE, and the fragment shader spends less time
> reading its input from the VUE. We don't know exactly how long these VUE
> reads/writes take (it is difficult to measure them because they are part of
> the process of starting and terminating threads), but it's very likely that
> they take longer than register moves. So the already-small performance
> penalty discussed above is probably offset by a larger performance
> improvement due to more efficient utilization of the VUE.
>
> I can't speak with authority on the inner workings of the other GPUs
> supported by Mesa, but it seems like most of the arguments above are general
> enough to apply to most GPU architectures, not just i965.
>
> Of course, there could be some important factor that I'm missing that makes
> all of this analysis completely wrong and causes varying packing to carry a
> huge penalty on some architectures. If that's the case, I think the best
> way to address the problem is to find an application that is slowed down by
> varying packing and run experiments to understand why.
>
> If worse comes to worst, we could of course modify the varying packing code
> so that it only takes effect when there are so many varyings that there is
> no alternative. But that would carry two disadvantages: it would
> complicate the linker (especially the handling of transform feedback) to
> have to handle both packed and unpacked varying formats, and it would reduce
> test coverage of varying packing to almost nil (since most of our piglit
> tests use a small number of varyings). Because of those disadvantages, and
> the fact that our current understanding leads us to expect a performance
> improvement, I'd like to save this strategy for a last resort.
>
>> Not sure if relevant for Mesa, but e.g. on PowerVR SGX it's really bad to
>> pack two vec2 texture coordinates into a single vec4. That's because var.xy
>> texture read can be "prefetched", whereas var.zw texture read is not
>> prefetched (essentially treated as a dependent texture read), and often
>> causes stalls in the shader execution.
>
> Interesting--I had not thought of that possibility. On i965 all texture
> reads have to be done explicitly by the fragment shader (there is no
> prefetching IIRC), so this penalty doesn't apply. Does anyone know if a
> penalty like this exists in any of Mesa's other back-ends? If so that might
> suggest some good experiments to try. I'm open to revising my opinion if
> someone measures a significant performance degradation, particularly with a
> real-world app.

R300 and R400 support 4 texture indirections (as defined by
ARB_fragment_program). Adding ALU instructions before the first TEX
instruction increases the number of texture indirections by 1, which
might make some shaders impossible to run on that hardware at all. I
think this optimization should be disabled on drivers where the texture
indirection limit is too low.

Marek
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev
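
The texture indirection concern raised above can be illustrated with a rough sketch. It assumes a simplified counting rule: a program starts with one indirection, and a TEX instruction whose coordinate is a temporary already written in the current node starts a new node. This is not the exact wording of ARB_fragment_program and not the r300 driver's logic. The `Inst` type, the register names, and the two example programs are all hypothetical; only the limit of 4 comes from the message above.

```python
from dataclasses import dataclass
from typing import List

# Limit quoted in the message above for R300/R400.
MAX_TEX_INDIRECTIONS = 4


@dataclass
class Inst:
    """Hypothetical, minimal instruction record for this sketch."""
    op: str          # "TEX" or "ALU"
    dst: str         # temporary written by the instruction
    srcs: List[str]  # registers read (varyings "v*" or temporaries "t*")


def count_indirections(program: List[Inst]) -> int:
    """Count texture indirections under the simplified rule stated above.

    Assumption: a TEX that reads a temporary written earlier in the
    current node (by ALU or TEX) begins a new node.
    """
    indirections = 1
    written = set()  # temporaries written so far in the current node
    for inst in program:
        if inst.op == "TEX" and any(src in written for src in inst.srcs):
            indirections += 1
            written = set()  # a new node starts with nothing written
        written.add(inst.dst)
    return indirections


# A chain of four dependent texture reads; the first samples a varying directly.
unpacked = [
    Inst("TEX", "t0", ["v0"]),
    Inst("TEX", "t1", ["t0"]),
    Inst("TEX", "t2", ["t1"]),
    Inst("TEX", "t3", ["t2"]),
]

# Same chain, but the coordinate is first copied out of a packed varying
# by an ALU move, so the first TEX now reads a temporary.
packed = [Inst("ALU", "t4", ["v0"]), Inst("TEX", "t0", ["t4"])] + unpacked[1:]

if __name__ == "__main__":
    for name, prog in (("unpacked", unpacked), ("packed", packed)):
        n = count_indirections(prog)
        verdict = "fits" if n <= MAX_TEX_INDIRECTIONS else "exceeds the limit"
        print(f"{name}: {n} indirections, {verdict}")
```

Under this model the unpacked chain sits exactly at the limit, and the single unpacking move in front of the first TEX pushes the packed version over it, unless a later pass removes the move before the back-end sees it.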