Re: unified shader for layer rendering
2013/10/10 Benoit Jacob jacob.benoi...@gmail.com:
> this is the kind of work that would require very careful performance measurements

Here is a benchmark:
http://people.mozilla.org/~bjacob/webglbranchingbenchmark/webglbranchingbenchmark.html

Some results:
http://people.mozilla.org/~bjacob/webglbranchingbenchmark/webglbranchingbenchmarkresults.txt

Benoit
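For readers who don't want to follow the links: a branching benchmark of this sort compares shader pairs along the lines of the sketch below. This is an illustrative reconstruction, not the linked benchmark's actual source - a specialized shader versus a unified one that selects the same work behind a uniform branch.

// Illustrative reconstruction of the kind of shader pair a branching
// benchmark compares; not the source of the benchmark linked above.
const char* kSpecializedFS =
    "precision mediump float;\n"
    "varying vec2 vTexCoord;\n"
    "uniform sampler2D uTexture;\n"
    "void main() { gl_FragColor = texture2D(uTexture, vTexCoord); }\n";

const char* kUnifiedFS =
    "precision mediump float;\n"
    "varying vec2 vTexCoord;\n"
    "uniform sampler2D uTexture;\n"
    "uniform vec4 uColor;\n"
    // Uniform branch: takes the same direction for every fragment in a draw.
    "uniform bool uTextureEnabled;\n"
    "void main() {\n"
    "  if (uTextureEnabled) gl_FragColor = texture2D(uTexture, vTexCoord);\n"
    "  else                 gl_FragColor = uColor;\n"
    "}\n";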
Re: unified shader for layer rendering
On Friday, October 11, 2013 5:50:05 AM UTC+13, Benoit Girard wrote:
> On Thu, Oct 10, 2013 at 7:59 AM, Andreas Gal andreas@gmail.com wrote:
>> Rationale: switching shaders tends to be expensive.
>
> In my opinion this is the only argument for working on this at the moment.

I think almost the opposite :-) I am not sure it is true that switching shaders is expensive - my understanding is that it is common in games to have many more shaders than we have and to switch them more frequently, and that this is what GPUs are optimised for. Perhaps Dan G can confirm or refute that.

The advantage to me is that we have a single shader and avoid the combinatorial explosion when we add more shaders for things like SVG filters/CSS compositing. That may well be worth a performance hit.

There may also be other options - there is prior art here; the search term is 'shader permutations'. (I looked into this a little 18 months ago, but do not remember whether I found anything useful.) We should explore all the options before jumping on this, I think.

In terms of performance, there is a trade-off here between branching and switching shaders, both of which are 'known' to be slow and both of which are known to have been optimised in some GPUs/drivers. So we need to do some serious investigation to find out where the better perf is. In particular, I think this may be a case where low-end mobile GPUs and very old GPUs (the two areas where we really care about perf) have very different characteristics, so the right answer for B2G may not be the right answer for Firefox on Windows XP.

> I know that roc and nrc have some plans for introducing more shaders which will make a unified shader approach more difficult.

I have not recently been discussing new shaders; perhaps you are thinking of mstange, who is looking at HW implementations of SVG filters?

> [...snip...]
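To make the 'shader permutations' prior art concrete, here is a minimal sketch of the usual technique; the function name and define names are illustrative, not from our code. One source file is specialized at compile time by prepending #defines, so each permutation pays no branch cost at the price of one compiled program per combination.

// Sketch of the 'shader permutations' approach (illustrative names):
// one fragment-shader source, specialized by prepending #define lines.
#include <string>

std::string BuildFragmentSource(bool bgra, bool noAlpha, bool mask,
                                const std::string& baseSource) {
  std::string defines;
  if (bgra)    defines += "#define TEXTURE_BGRA\n";
  if (noAlpha) defines += "#define TEXTURE_NO_ALPHA\n";
  if (mask)    defines += "#define MASK\n";
  // The base source uses #ifdef TEXTURE_BGRA etc. instead of uniform bools,
  // so each permutation compiles to straight-line code with no branches.
  // (If the source carries a #version line, the defines go after it.)
  return defines + baseSource;
}

The catch is exactly the combinatorial explosion mentioned above: N independent features mean up to 2^N compiled programs, which is why engines that use this typically compile and cache permutations lazily.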
Re: unified shader for layer rendering
2013/10/11 Nicholas Cameron nick.r.came...@gmail.com:
> The advantage to me is that we have a single shader and avoid the combinatorial explosion when we add more shaders for things like SVG filters/CSS compositing.
> [...snip...]
> I have not recently been discussing new shaders, perhaps you are thinking of mstange who is looking at HW implementations of SVG filters?

Incidentally, I just looked into the feasibility of implementing constant-time-regardless-of-operands filters (necessary for filter security) in OpenGL shaders, as a similar topic is being discussed at the moment on the WebGL mailing list, and there is a serious problem: newer GPUs (since roughly 2008 for high-end desktop GPUs, since 2013 for high-end mobile GPUs) have IEEE 754-conformant floating point with denormals, and denormals may be slow there too.

https://developer.nvidia.com/content/cuda-pro-tip-flush-denormals-confidence
http://malideveloper.arm.com/engage-with-mali/benchmarking-floating-point-precision-part-iii/

I suggested on the Khronos public_webgl list that one way this could be solved in the future would be to write an OpenGL extension spec forcing flush-to-zero behavior, to avoid denormals. As far as I know, flush-to-zero is currently a CUDA compiler flag but isn't exposed to OpenGL.

The NVIDIA whitepaper above also hints at this only being a problem with multi-instruction functions such as square root and inverse square root (which is already a problem for e.g. lighting filters, which need to normalize a vector), but relying on that would at best be very NVIDIA-specific. In general, denormals are a minority case that requires special handling, so their slowness is rather universal; all the x86 and ARM CPUs that I tested have slow denormals.

Benoit
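The CPU-side slowness Benoit describes is easy to observe. A minimal sketch, assuming x86-64 (so scalar float math goes through SSE and the MXCSR flags apply); the constants and iteration count are arbitrary. The FTZ/DAZ bits toggled here are the CPU analogue of the flush-to-zero behavior a GL extension could mandate.

// Sketch: compare denormal-handling vs flush-to-zero timing on x86-64.
// FTZ flushes denormal results to zero; DAZ treats denormal inputs as zero.
#include <xmmintrin.h>  // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>  // _MM_SET_DENORMALS_ZERO_MODE (SSE3)
#include <chrono>
#include <cstdio>

static double TimeLoop() {
  volatile float x = 1e-38f;  // near the bottom of the normal float range
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < 10000000; ++i) {
    x = x * 0.5f + 1e-40f;  // 1e-40f and the running value are denormal
  }
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF);
  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_OFF);
  printf("denormals honoured: %f s\n", TimeLoop());
  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
  printf("flush-to-zero:      %f s\n", TimeLoop());
  return 0;
}

On hardware with slow denormals, the first timing typically comes out several times slower than the second; with FTZ/DAZ on, both run at full speed but the results silently differ - the trade-off an OpenGL extension would have to spell out.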
unified shader for layer rendering
Hi,

we currently have a zoo of shaders to render layers: RGBALayerProgramType, BGRALayerProgramType, RGBXLayerProgramType, BGRXLayerProgramType, RGBARectLayerProgramType, RGBXRectLayerProgramType, BGRARectLayerProgramType, RGBAExternalLayerProgramType, ColorLayerProgramType, YCbCrLayerProgramType, ComponentAlphaPass1ProgramType, ComponentAlphaPass1RGBProgramType, ComponentAlphaPass2ProgramType, ComponentAlphaPass2RGBProgramType. (I have just eliminated the Copy2D variants, so they are omitted here.)

Next, I would like to replace everything but the YCbCr and ComponentAlpha shaders with one unified shader (attached below).

Rationale: most of our shader programs differ only minimally in cycle count, and we are generally memory bound, not GPU-cycle bound (even on mobile). In addition, GPUs are actually very efficient at branching, as long as the branch is uniform and doesn't change direction per pixel or per vertex (the driver essentially compiles variants and runs the right one). Last but not least, switching shaders tends to be expensive.

Proposed approach: we use a single shader to replace the current 8 layer shaders. I verified with the Mali shader compiler that the shortest path (color layer) is pretty close to the old color shader (it is now 3 cycles, due to the opacity multiplication; it was 1). For a lot of scenes we will be able to render without ever switching shaders, so that should more than make up for the extra cycles, especially since we are memory bound anyway. More uniforms have to be set per shader invocation, but that should be pretty cheap.

I completely dropped the distinction between 2D and 3D masks. The 3D mask path can handle the 2D case, and the cycle savings are minimal and, as mentioned before, irrelevant.

An important advantage is that with this approach we can now easily add additional layer effects to the pipeline without exponentially exploding the number of programs (RGBXRectLayerProgramWithGrayscaleAndWithoutOpacityButMaskAndNotMask3D…). Also, last but not least, this reduces code complexity quite a bit.

Feedback welcome.

Thanks,

Andreas

---

#ifdef GL_ES
#extension GL_OES_EGL_image_external : require
precision mediump float;  // ES fragment shaders need a default float precision
#endif

// Interpolated texture coordinate from the vertex shader.
varying vec2 vTexCoord;

// Base color (rendered if the layer texture is not read).
uniform vec4 uColor;

// Layer texture (disabled for color layers).
uniform bool uTextureEnabled;
uniform vec2 uTexCoordMultiplier;
uniform bool uTextureBGRA;  // Default is RGBA.
uniform bool uTextureNoAlpha;
uniform float uTextureOpacity;
uniform sampler2D uTexture;
#ifdef GL_ES
uniform bool uTextureUseExternalOES;
uniform samplerExternalOES uTextureExternalOES;
#else
uniform bool uTextureUseRect;
uniform sampler2DRect uTextureRect;
#endif

// Masking (optional).
uniform bool uMaskEnabled;
varying vec3 vMaskCoord;
uniform sampler2D uMaskTexture;

void main() {
  vec4 color = uColor;
  if (uTextureEnabled) {
    vec2 texCoord = vTexCoord * uTexCoordMultiplier;
#ifdef GL_ES
    if (uTextureUseExternalOES) {
      color = texture2D(uTextureExternalOES, texCoord);
    } else
#else
    if (uTextureUseRect) {
      color = texture2DRect(uTextureRect, texCoord);
    } else
#endif
    {
      color = texture2D(uTexture, texCoord);
    }
    if (uTextureBGRA) {
      color = color.bgra;
    }
    if (uTextureNoAlpha) {
      color = vec4(color.rgb, 1.0);
    }
    color *= uTextureOpacity;
  }
  if (uMaskEnabled) {
    color *= texture2D(uMaskTexture, vMaskCoord.xy / vMaskCoord.z).r;
  }
  gl_FragColor = color;
}
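To make "more uniforms per invocation, but cheap" concrete, here is a host-side sketch of what a color-layer draw might look like under this proposal. Only the uniform names come from the shader above; the program handle, helper, and draw call are illustrative.

// Hypothetical sketch: per-layer uniform setup with the unified program.
#include <GLES2/gl2.h>  // or the desktop GL equivalent

static GLint Loc(GLuint prog, const char* name) {
  // A real compositor would cache these lookups once per program.
  return glGetUniformLocation(prog, name);
}

void DrawColorLayer(GLuint prog, const float premultipliedRGBA[4]) {
  // Note: no glUseProgram per layer - the unified program stays bound
  // for the whole frame; only uniforms change between draws.
  glUniform4fv(Loc(prog, "uColor"), 1, premultipliedRGBA);
  glUniform1i(Loc(prog, "uTextureEnabled"), GL_FALSE);  // shortest path
  glUniform1i(Loc(prog, "uMaskEnabled"), GL_FALSE);
  glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);  // assumes the layer quad is set up
}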
Re: unified shader for layer rendering
On Thu, Oct 10, 2013 at 7:59 AM, Andreas Gal andreas@gmail.com wrote:
> Rationale: switching shaders tends to be expensive.

In my opinion this is the only argument for working on this at the moment. Particularly now, when we're overwhelmed with high-priority desktop and mobile graphics work, I'd like to see numbers before we consider a change.

I have seen no indication that we get hurt by switching shaders. I suspected it might matter once we have 100s of layers in a single page, but we always fall down for some other reason before this can become a problem. I'd like to be able to answer 'In which use cases would patching this lead to a user-measurable improvement?' before working on this. Right now we have a long list of bugs where we have a clear answer to that question. Patching this lets us check off that we're using the GPU optimally per the GPU best-practice dev guides, and will later help us batch draw calls more aggressively, but I'd like to have data supporting it first.

Also, old Android drivers are a bit touchy with shaders, so I recommend budgeting some dev time for resolving those issues. I know that roc and nrc have some plans for introducing more shaders which would make a unified shader approach more difficult; I'll let them weigh in here. On the flip side, I suspect a single unified shader would be faster to compile than the several shaders we have on the start-up path.
Re: unified shader for layer rendering
I'll pile on what Benoit G said --- this is the kind of work that would require very careful performance measurements before we commit to it.

Also, like Benoit said, we have seen no indication that glUseProgram is hurting us. General GPU wisdom is that switching programs is not per se expensive as long as one is not relinking them, aside from the general performance caveat that applies to any state change - it forces drawing to be split into multiple draw calls - and that caveat applies to updating uniforms too, so we're not escaping it here.

In addition to that, not all GPUs have real branching. My Sandy Bridge Intel chipset has real branching, but older Intel integrated GPUs don't, and I'd be very surprised if all of the mobile GPUs we're currently supporting did. To put this in perspective: in the world of discrete desktop NVIDIA GPUs, real branching was only introduced in the GeForce 6 series. Old, but a lot more advanced than some integrated/mobile devices we still support.

On GPUs that are not capable of actual branching, if...else blocks are implemented by executing all branches and masking the result. On such GPUs, a unified shader would run considerably slower - basically N times slower for N branches. Even on GPUs with branching, each branch has a cost, and we have N of them; so in all cases the unified-shader approach introduces new (at least potential) scalability issues.

So if we wanted to invest in this, we would need to conduct careful benchmarking on a wide range of hardware.

Benoit

2013/10/10 Benoit Girard bgir...@mozilla.com:
> On Thu, Oct 10, 2013 at 7:59 AM, Andreas Gal andreas@gmail.com wrote:
>> Rationale: switching shaders tends to be expensive.
> In my opinion this is the only argument for working on this at the moment. [...snip...]
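Spelled out in scalar form, the masking Benoit describes behaves like the sketch below (A and B are placeholder branch bodies, not real code): every fragment pays for both sides, so the unified shader's cost approaches the sum of all its paths.

// Illustration of branch masking: what "if (p) r = A(x); else r = B(x);"
// effectively costs on a GPU without real branching.
static float A(float x) { return x * 2.0f; }  // placeholder branch body
static float B(float x) { return x + 1.0f; }  // placeholder branch body

float EmulatedBranch(bool p, float x) {
  float a = A(x);    // always evaluated
  float b = B(x);    // always evaluated
  return p ? a : b;  // a per-fragment select (masked move), not a jump
}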
Re: unified shader for layer rendering
2013/10/10 Benoit Jacob jacob.benoi...@gmail.com:
> In addition to that, not all GPUs have real branching. [...snip...] To put this in perspective: in the world of discrete desktop NVIDIA GPUs, real branching was only introduced in the GeForce 6 series.

In fact, even on a GeForce 6, we only get full, real, CPU-like (MIMD) branching in vertex shaders, not in fragment shaders.

http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter34.html

Benoit
Re: unified shader for layer rendering
I do appreciate the fact that it reduces complexity (in addition to fewer state changes). I agree that the decision to dedicate resources to this, rather than to the other high-priority projects in the pipe, should be motivated by some numbers.

Cheers,
Nical

On Thu, Oct 10, 2013 at 11:04 AM, Benoit Jacob jacob.benoi...@gmail.com wrote:
> In fact, even on a GeForce 6, we only get full, real, CPU-like (MIMD) branching in vertex shaders, not in fragment shaders. [...snip...]
Re: unified shader for layer rendering
I didn't see anything in this message that suggested we should drop everything we're doing and start on this right now, but most of the early comments I'm seeing are about exactly that. Let's make that a separate discussion.

If we didn't have all these variations, what would we do? Would we just do one, then add more (and add code complexity) only if we found significant performance improvements? I'd have to go with "probably yes" on that one :) Let's see if we can focus the discussion on what we should have, and then on how we get there; worrying about when and how just gets in the way for now.

--
Milan

On 2013-10-10, at 7:59 , Andreas Gal andreas@gmail.com wrote:
> Hi, we currently have a zoo of shaders to render layers: RGBALayerProgramType, BGRALayerProgramType, RGBXLayerProgramType, BGRXLayerProgramType, [...snip...]
> Feedback welcome. Thanks, Andreas
Re: unified shader for layer rendering
Vlad put in a "let's see if we can cache compiled shaders" bug a few weeks ago, and perhaps that is something we should consider when discussing shaders in general. I didn't know about recompiling when some uniforms change, though; that's good intel.

--
Milan

On 2013-10-10, at 15:13 , Jeff Gilbert jgilb...@mozilla.com wrote:
> I'll also add a note that just because we aren't recompiling doesn't mean the driver isn't. If we change enough (or maybe just the correct) uniforms, this can cause the driver to recompile the shader, which is indeed slow. Trying to unify too many shader types might just tickle this. Some drivers will shoot us a warning via KHR_debug that we can catch when shader recompilation happens.
>
> -Jeff
>
> ----- Original Message -----
> From: Nicolas Silva nical.si...@gmail.com
> To: Benoit Jacob jacob.benoi...@gmail.com
> Cc: Benoit Girard bgir...@mozilla.com, dev-platform@lists.mozilla.org, Andreas Gal andreas@gmail.com
> Sent: Thursday, October 10, 2013 11:23:45 AM
> Subject: Re: unified shader for layer rendering
>
> I do appreciate the fact that it reduces complexity (in addition to fewer state changes). [...snip...]
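Catching those KHR_debug warnings looks roughly like the sketch below, assuming a context created with the debug bit and a header that exposes debug output (it is core in GL 4.3 and GLES 3.2; on older ES the same entry points carry a KHR suffix). The filtering and logging here are illustrative.

// Sketch: listen for driver performance messages via GL debug output.
#include <GLES3/gl32.h>  // debug output is core in ES 3.2
#include <cstdio>

static void GL_APIENTRY OnGLDebug(GLenum source, GLenum type, GLuint id,
                                  GLenum severity, GLsizei length,
                                  const GLchar* message, const void* user) {
  // Driver-side shader recompiles typically surface as performance messages.
  if (type == GL_DEBUG_TYPE_PERFORMANCE) {
    fprintf(stderr, "GL perf warning: %s\n", message);
  }
}

void InstallGLDebugCallback() {
  glEnable(GL_DEBUG_OUTPUT);
  glEnable(GL_DEBUG_OUTPUT_SYNCHRONOUS);  // deliver at the offending call
  glDebugMessageCallback(OnGLDebug, nullptr);
}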