On Fri, Sep 13, 2013 at 2:15 PM, Paul Berry <stereotype...@gmail.com> wrote:
> On 12 September 2013 22:06, Chia-I Wu <olva...@gmail.com> wrote:
>> From: Chia-I Wu <o...@lunarg.com>
>>
>> Consider only the top-left and top-right pixels to approximate DDX in a 2x2
>> subspan, unless the application or the user requests a more accurate
>> approximation.  This results in a less accurate approximation.  However, it
>> improves the performance of Xonotic with Ultra settings by 24.3879% +/-
>> 0.832202% (at 95.0% confidence) on Haswell.  No noticeable image quality
>> difference observed.
>>
>> No piglit gpu.tests regressions (tested with v1)
>>
>> I failed to come up with an explanation for the performance difference, as
>> the change does not affect Ivy Bridge.  If anyone has the insight, please
>> kindly enlighten me.  Performance differences may also be observed on other
>> games that call textureGrad and dFdx.
>>
>> v2: Honor GL_FRAGMENT_SHADER_DERIVATIVE_HINT and add a drirc option.
>>     Update comments.
>
> I'm not entirely comfortable making a change that has a known negative
> impact on computational accuracy (even one that leads to such an impressive
> performance improvement) when we don't have any theories as to why the
> performance improvement happens, or why the improvement doesn't apply to
> Ivy Bridge.  In my experience, making changes to the codebase without
> understanding why they improve things almost always leads to improvements
> that are brittle, since it's likely that the true source of the improvement
> is a coincidence that will be wiped out by some future change (or won't be
> relevant to client programs other than this particular benchmark).  Having
> a theory as to why the performance improvement happens would help us be
> confident that we're applying the right fix under the right circumstances.
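To make the accuracy tradeoff concrete, here is a small numeric model (plain Python, names are mine, not driver code) of the two register-region setups: the accurate path computes one horizontal difference per row of the 2x2 subspan, while the cheaper path broadcasts the top-row difference, ((ss0.tr - ss0.tl)x4), to all four pixels:

```python
# Model of a 2x2 subspan laid out as:
#   tl  tr
#   bl  br
# "Fine" DDX (vstride=2/width=2): a per-row difference.
# "Coarse" DDX (vstride=4/width=4): the top-row difference broadcast to all
# four pixels, as this patch does by default on Haswell.

def ddx_fine(tl, tr, bl, br):
    """Per-row DDX: top row gets tr - tl, bottom row gets br - bl."""
    top = tr - tl
    bottom = br - bl
    return [top, top, bottom, bottom]   # values for (tl, tr, bl, br)

def ddx_coarse(tl, tr, bl, br):
    """Approximate DDX: the top-row difference is reused for the bottom row."""
    top = tr - tl
    return [top, top, top, top]

# For a value that varies linearly across the subspan the two agree exactly:
print(ddx_fine(0.0, 1.0, 0.0, 1.0))    # [1.0, 1.0, 1.0, 1.0]
print(ddx_coarse(0.0, 1.0, 0.0, 1.0))  # [1.0, 1.0, 1.0, 1.0]

# They diverge only when the horizontal rate of change differs between rows:
print(ddx_fine(0.0, 1.0, 0.0, 3.0))    # [1.0, 1.0, 3.0, 3.0]
print(ddx_coarse(0.0, 1.0, 0.0, 3.0))  # [1.0, 1.0, 1.0, 1.0]
```

Note that the coarse result also makes the gradient uniform across the subspan, which is relevant to the sample_d theories below.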
There's another angle to approach this: develop a simple test case that
shows the differing results across a range of computational accuracies, and
run it on the proprietary drivers for the same hardware to determine what
settings they are using.

> For example, here's one theory as to why we might be seeing an
> improvement: perhaps Haswell's sample_d processing is smart enough to
> realize that when all the gradient values within a sub-span are the same,
> that means that all of the sampling for the sub-span will come from the
> same LOD, and that allows it to short-cut some expensive step in the LOD
> calculation.  Perhaps the same improvement isn't seen on Ivy Bridge because
> Ivy Bridge's sample_d processing logic is less sophisticated, so it's
> unable to perform the optimization.  If this is the case, then conditioning
> the optimization on brw->is_haswell (as you've done) makes sense.
>
> Another possible explanation for the Haswell vs Ivy Bridge difference is
> that perhaps Ivy Bridge, being a lower-performing chip, has other
> bottlenecks that make the optimization irrelevant for this particular
> benchmark, but potentially still useful for other benchmarks.  For
> instance, maybe when this benchmark executes on Ivy Bridge, the texture
> that's being sampled from is located in sufficiently distant memory that
> optimizing the sample_d's memory accesses makes no difference, since the
> bottleneck is the speed with which the texture can be read into cache,
> rather than the speed of operation of sample_d.  If this explanation is
> correct, then it might be worth applying the optimization to both Ivy
> Bridge and Haswell (and perhaps Sandy Bridge as well), since it might
> conceivably benefit those other chips when running applications that place
> less cache pressure on the chip.

This scenario is where I'd place my bets, especially given that the numbers
are based on Xonotic.
I benchmarked this patch using Xonotic on Bay Trail, both as is and with
!brw->is_haswell replaced by !brw->is_baytrail.  With the ultra and ultimate
quality levels at medium and high resolutions, the results were all
essentially the same at comparable resolutions and quality levels.  I don't
see any justification for tying this change to Haswell hardware alone;
there are all sorts of reasons why doing that sounds like a big mistake.  In
fact, another _explanation_ to add to your list: maybe there's another
is_haswell test elsewhere in the driver that is responsible for the
performance anomaly.

> Another possible explanation is that Haswell has a bug in its sample_d
> logic which causes it to be slow under some conditions, and this
> lower-accuracy DDX computation happens to work around it.  If that's the
> case, we might want to consider not using sample_d at all on Haswell, and
> instead calculating the LOD in the shader and using sample_l instead.  If
> this is the correct explanation, then that might let us have faster
> performance without sacrificing DDX accuracy.
>
> A final possible explanation for the performance improvement is that
> perhaps for some reason sample_d performs more optimally when the DDX and
> DDY computations have similar accuracies to each other.  Before your patch,
> our computation of DDX was more accurate than DDY; your patch decreases the
> accuracy of DDX to match DDY.  If this explanation is correct, then a
> better solution would probably be to improve the accuracy of DDY to make it
> comparable to DDX, rather than the other way around.
>
> Before we land this patch, can we do some experiments to try to figure out
> which of these explanations (if any) is correct?
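For the sample_l alternative mentioned above, the shader would have to compute the LOD itself from the gradients before issuing the lookup.  A rough numeric sketch of that calculation (the standard log2-of-scaled-gradient-length formula, ignoring anisotropy, bias, and clamping; the function name is mine, and this is illustrative Python, not shader or driver code):

```python
import math

def shader_lod(dudx, dvdx, dudy, dvdy, tex_w, tex_h):
    """LOD a shader could feed to a sample_l-style lookup:
    lambda = log2 of the longer of the two gradient vectors, scaled to
    texel units."""
    rho_x = math.hypot(dudx * tex_w, dvdx * tex_h)  # texels moved per pixel in x
    rho_y = math.hypot(dudy * tex_w, dvdy * tex_h)  # texels moved per pixel in y
    return math.log2(max(rho_x, rho_y, 1e-30))      # guard against log2(0)

# A one-texel-per-pixel mapping on a 256x256 texture lands on LOD 0:
print(shader_lod(1/256, 0.0, 0.0, 1/256, 256, 256))  # 0.0

# Minifying by 4x in both directions selects LOD 2:
print(shader_lod(4/256, 0.0, 0.0, 4/256, 256, 256))  # 2.0
```

Whether replacing sample_d with an explicit LOD plus sample_l is actually a win would of course depend on the per-pixel ALU cost versus whatever the sampler's gradient path is doing.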
>> Signed-off-by: Chia-I Wu <o...@lunarg.com>
>> ---
>>  src/mesa/drivers/dri/i965/brw_context.c   |  1 +
>>  src/mesa/drivers/dri/i965/brw_context.h   |  1 +
>>  src/mesa/drivers/dri/i965/brw_fs_emit.cpp | 40 ++++++++++++++++++++++++-------
>>  src/mesa/drivers/dri/i965/intel_screen.c  |  4 ++++
>>  4 files changed, 38 insertions(+), 8 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_context.c b/src/mesa/drivers/dri/i965/brw_context.c
>> index 4fcc9fb..1cdfb9d 100644
>> --- a/src/mesa/drivers/dri/i965/brw_context.c
>> +++ b/src/mesa/drivers/dri/i965/brw_context.c
>> @@ -470,6 +470,7 @@ brwCreateContext(int api,
>>     brw_draw_init( brw );
>>
>>     brw->precompile = driQueryOptionb(&brw->optionCache, "shader_precompile");
>> +   brw->accurate_derivative = driQueryOptionb(&brw->optionCache, "accurate_derivative");
>>
>>     ctx->Const.ContextFlags = 0;
>>     if ((flags & __DRI_CTX_FLAG_FORWARD_COMPATIBLE) != 0)
>> diff --git a/src/mesa/drivers/dri/i965/brw_context.h b/src/mesa/drivers/dri/i965/brw_context.h
>> index c566bba..8bfc54a 100644
>> --- a/src/mesa/drivers/dri/i965/brw_context.h
>> +++ b/src/mesa/drivers/dri/i965/brw_context.h
>> @@ -964,6 +964,7 @@ struct brw_context
>>     bool always_flush_cache;
>>     bool disable_throttling;
>>     bool precompile;
>> +   bool accurate_derivative;
>>
>>     driOptionCache optionCache;
>>     /** @} */
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> index bfb3d33..69aeab1 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> @@ -540,7 +540,7 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src
>>   *
>>   * arg0: ss0.tl ss0.tr ss0.bl ss0.br ss1.tl ss1.tr ss1.bl ss1.br
>>   *
>> - * and we're trying to produce:
>> + * Ideally, we want to produce:
>>   *
>>   *           DDX                    DDY
>>   * dst: (ss0.tr - ss0.tl)     (ss0.tl - ss0.bl)
>> @@ -556,24 +556,48 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src
>>   *
>>   * For DDX, it ends up being easy: width = 2, horiz=0 gets us the same result
>>   * for each pair, and vertstride = 2 jumps us 2 elements after processing a
>> - * pair. But for DDY, it's harder, as we want to produce the pairs swizzled
>> - * between each other. We could probably do it like ddx and swizzle the right
>> - * order later, but bail for now and just produce
>> + * pair. But the ideal approximation of DDX may impose a huge performance
>> + * cost on sample_d. As such, we favor ((ss0.tr - ss0.tl)x4 (ss1.tr -
>> + * ss1.tl)x4) unless the app or the user requests otherwise.
>> + *
>> + * For DDY, it's harder, as we want to produce the pairs swizzled between each
>> + * other. We could probably do it like ddx and swizzle the right order later,
>> + * but bail for now and just produce
>>   * ((ss0.tl - ss0.bl)x4 (ss1.tl - ss1.bl)x4)
>>   */
>>  void
>>  fs_generator::generate_ddx(fs_inst *inst, struct brw_reg dst, struct brw_reg src)
>>  {
>> +   unsigned vstride, width;
>> +
>> +   /* Produce the accurate result only when requested.  We emit only one
>> +    * instruction in either case, but the problem is the result may affect
>> +    * how fast sample_d executes.
>> +    *
>> +    * Since the performance difference is only observed on Haswell, ignore
>> +    * the hints on other GENs for now.
>> +    */
>> +   if (!brw->is_haswell ||
>> +       brw->ctx.Hint.FragmentShaderDerivative == GL_NICEST ||
>> +       brw->accurate_derivative) {
>> +      vstride = BRW_VERTICAL_STRIDE_2;
>> +      width = BRW_WIDTH_2;
>> +   }
>> +   else {
>> +      vstride = BRW_VERTICAL_STRIDE_4;
>> +      width = BRW_WIDTH_4;
>> +   }
>> +
>>     struct brw_reg src0 = brw_reg(src.file, src.nr, 1,
>>                                   BRW_REGISTER_TYPE_F,
>> -                                 BRW_VERTICAL_STRIDE_2,
>> -                                 BRW_WIDTH_2,
>> +                                 vstride,
>> +                                 width,
>>                                   BRW_HORIZONTAL_STRIDE_0,
>>                                   BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
>>     struct brw_reg src1 = brw_reg(src.file, src.nr, 0,
>>                                   BRW_REGISTER_TYPE_F,
>> -                                 BRW_VERTICAL_STRIDE_2,
>> -                                 BRW_WIDTH_2,
>> +                                 vstride,
>> +                                 width,
>>                                   BRW_HORIZONTAL_STRIDE_0,
>>                                   BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
>>     brw_ADD(p, dst, src0, negate(src1));
>> diff --git a/src/mesa/drivers/dri/i965/intel_screen.c b/src/mesa/drivers/dri/i965/intel_screen.c
>> index eb6515e..ee08ffd 100644
>> --- a/src/mesa/drivers/dri/i965/intel_screen.c
>> +++ b/src/mesa/drivers/dri/i965/intel_screen.c
>> @@ -61,6 +61,10 @@ PUBLIC const char __driConfigOptions[] =
>>     DRI_CONF_SECTION_END
>>     DRI_CONF_SECTION_QUALITY
>>        DRI_CONF_FORCE_S3TC_ENABLE("false")
>> +
>> +      DRI_CONF_OPT_BEGIN_B(accurate_derivative, "false")
>> +         DRI_CONF_DESC(en, "Perform more accurate derivative calculation")
>> +      DRI_CONF_OPT_END
>>     DRI_CONF_SECTION_END
>>     DRI_CONF_SECTION_DEBUG
>>        DRI_CONF_NO_RAST("false")
>> --
>> 1.8.3.1
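For anyone who wants to compare the two paths without rebuilding, the new option should be togglable from ~/.drirc like any other driconf boolean.  Something along these lines (untested sketch, standard driconf syntax; the option name comes from the patch above):

```xml
<driconf>
   <device>
      <application name="Default">
         <option name="accurate_derivative" value="true" />
      </application>
   </device>
</driconf>
```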
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev