On Fri, Sep 13, 2013 at 2:15 PM, Paul Berry <stereotype...@gmail.com> wrote:
> On 12 September 2013 22:06, Chia-I Wu <olva...@gmail.com> wrote:
>> From: Chia-I Wu <o...@lunarg.com>
>>
>> Consider only the top-left and top-right pixels to approximate DDX in a 2x2
>> subspan, unless the application or the user requests a more accurate
>> approximation.  This results in a less accurate approximation.  However, it
>> improves the performance of Xonotic with Ultra settings by 24.3879% +/-
>> 0.832202% (at 95.0% confidence) on Haswell.  No noticeable image quality
>> difference observed.
>>
>> No piglit gpu.tests regressions (tested with v1)
>>
>> I failed to come up with an explanation for the performance difference, as
>> the change does not affect Ivy Bridge.  If anyone has the insight, please
>> kindly enlighten me.  Performance differences may also be observed on other
>> games that call textureGrad and dFdx.
>>
>> v2: Honor GL_FRAGMENT_SHADER_DERIVATIVE_HINT and add a drirc option.
>>     Update comments.
>
> I'm not entirely comfortable making a change that has a known negative
> impact on computational accuracy (even one that leads to such an impressive
> performance improvement) when we don't have any theories as to why the
> performance improvement happens, or why the improvement doesn't apply to
> Ivy Bridge.  In my experience, making changes to the codebase without
> understanding why they improve things almost always leads to improvements
> that are brittle, since it's likely that the true source of the improvement
> is a coincidence that will be wiped out by some future change (or won't be
> relevant to client programs other than this particular benchmark).  Having
> a theory as to why the performance improvement happens would help us be
> confident that we're applying the right fix under the right circumstances.
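To make the accuracy tradeoff concrete, here is a small numeric model (plain Python, names are mine, not driver code) of the two register-region setups: the accurate path computes one horizontal difference per row of the 2x2 subspan, while the cheaper path broadcasts the top-row difference, ((ss0.tr - ss0.tl)x4), to all four pixels:

```python
# Model of a 2x2 subspan laid out as:
#   tl  tr
#   bl  br
# "Fine" DDX (vstride=2/width=2): a per-row difference.
# "Coarse" DDX (vstride=4/width=4): the top-row difference broadcast to all
# four pixels, as this patch does by default on Haswell.

def ddx_fine(tl, tr, bl, br):
    """Per-row DDX: top row gets tr - tl, bottom row gets br - bl."""
    top = tr - tl
    bottom = br - bl
    return [top, top, bottom, bottom]   # values for (tl, tr, bl, br)

def ddx_coarse(tl, tr, bl, br):
    """Approximate DDX: the top-row difference is reused for the bottom row."""
    top = tr - tl
    return [top, top, top, top]

# For a value that varies linearly across the subspan the two agree exactly:
print(ddx_fine(0.0, 1.0, 0.0, 1.0))    # [1.0, 1.0, 1.0, 1.0]
print(ddx_coarse(0.0, 1.0, 0.0, 1.0))  # [1.0, 1.0, 1.0, 1.0]

# They diverge only when the horizontal rate of change differs between rows:
print(ddx_fine(0.0, 1.0, 0.0, 3.0))    # [1.0, 1.0, 3.0, 3.0]
print(ddx_coarse(0.0, 1.0, 0.0, 3.0))  # [1.0, 1.0, 1.0, 1.0]
```

Note that the coarse result also makes the gradient uniform across the subspan, which is relevant to the sample_d theories below.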
There's another angle to approach this: develop a simple test case that
shows the differing results across a range of computational accuracies, and
run it on the proprietary drivers for the same hardware to determine what
settings they are using.

> For example, here's one theory as to why we might be seeing an
> improvement: perhaps Haswell's sample_d processing is smart enough to
> realize that when all the gradient values within a sub-span are the same,
> that means that all of the sampling for the sub-span will come from the
> same LOD, and that allows it to short-cut some expensive step in the LOD
> calculation.  Perhaps the same improvement isn't seen on Ivy Bridge because
> Ivy Bridge's sample_d processing logic is less sophisticated, so it's
> unable to perform the optimization.  If this is the case, then conditioning
> the optimization on brw->is_haswell (as you've done) makes sense.
>
> Another possible explanation for the Haswell vs Ivy Bridge difference is
> that perhaps Ivy Bridge, being a lower-performing chip, has other
> bottlenecks that make the optimization irrelevant for this particular
> benchmark, but potentially still useful for other benchmarks.  For
> instance, maybe when this benchmark executes on Ivy Bridge, the texture
> that's being sampled from is located in sufficiently distant memory that
> optimizing the sample_d's memory accesses makes no difference, since the
> bottleneck is the speed with which the texture can be read into cache,
> rather than the speed of operation of sample_d.  If this explanation is
> correct, then it might be worth applying the optimization to both Ivy
> Bridge and Haswell (and perhaps Sandy Bridge as well), since it might
> conceivably benefit those other chips when running applications that place
> less cache pressure on the chip.

This scenario is where I'd place my bets, especially given that the numbers
are based on Xonotic.
I benchmarked this patch using Xonotic on Bay Trail, both as is and with
!brw->is_haswell replaced by !brw->is_baytrail.  With the ultra and ultimate
quality levels at medium and high resolutions, the results were all
essentially the same at comparable resolutions and quality levels.  I don't
see any justification for tying this change to Haswell hardware alone;
there are all sorts of reasons why doing that sounds like a big mistake.  In
fact, another _explanation_ to add to your list: maybe there's another
is_haswell test elsewhere in the driver that is responsible for the
performance anomaly.

> Another possible explanation is that Haswell has a bug in its sample_d
> logic which causes it to be slow under some conditions, and this
> lower-accuracy DDX computation happens to work around it.  If that's the
> case, we might want to consider not using sample_d at all on Haswell, and
> instead calculating the LOD in the shader and using sample_l instead.  If
> this is the correct explanation, then that might let us have faster
> performance without sacrificing DDX accuracy.
>
> A final possible explanation for the performance improvement is that
> perhaps for some reason sample_d performs more optimally when the DDX and
> DDY computations have similar accuracies to each other.  Before your patch,
> our computation of DDX was more accurate than DDY; your patch decreases the
> accuracy of DDX to match DDY.  If this explanation is correct, then a
> better solution would probably be to improve the accuracy of DDY to make it
> comparable to DDX, rather than the other way around.
>
> Before we land this patch, can we do some experiments to try to figure out
> which of these explanations (if any) is correct?
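For the sample_l alternative mentioned above, the shader would have to compute the LOD itself from the gradients before issuing the lookup.  A rough numeric sketch of that calculation (the standard log2-of-scaled-gradient-length formula, ignoring anisotropy, bias, and clamping; the function name is mine, and this is illustrative Python, not shader or driver code):

```python
import math

def shader_lod(dudx, dvdx, dudy, dvdy, tex_w, tex_h):
    """LOD a shader could feed to a sample_l-style lookup:
    lambda = log2 of the longer of the two gradient vectors, scaled to
    texel units."""
    rho_x = math.hypot(dudx * tex_w, dvdx * tex_h)  # texels moved per pixel in x
    rho_y = math.hypot(dudy * tex_w, dvdy * tex_h)  # texels moved per pixel in y
    return math.log2(max(rho_x, rho_y, 1e-30))      # guard against log2(0)

# A one-texel-per-pixel mapping on a 256x256 texture lands on LOD 0:
print(shader_lod(1/256, 0.0, 0.0, 1/256, 256, 256))  # 0.0

# Minifying by 4x in both directions selects LOD 2:
print(shader_lod(4/256, 0.0, 0.0, 4/256, 256, 256))  # 2.0
```

Whether replacing sample_d with an explicit LOD plus sample_l is actually a win would of course depend on the per-pixel ALU cost versus whatever the sampler's gradient path is doing.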
>> Signed-off-by: Chia-I Wu <o...@lunarg.com>
>> ---
>>  src/mesa/drivers/dri/i965/brw_context.c   |  1 +
>>  src/mesa/drivers/dri/i965/brw_context.h   |  1 +
>>  src/mesa/drivers/dri/i965/brw_fs_emit.cpp | 40 ++++++++++++++++++++++++-------
>>  src/mesa/drivers/dri/i965/intel_screen.c  |  4 ++++
>>  4 files changed, 38 insertions(+), 8 deletions(-)
>>
>> diff --git a/src/mesa/drivers/dri/i965/brw_context.c b/src/mesa/drivers/dri/i965/brw_context.c
>> index 4fcc9fb..1cdfb9d 100644
>> --- a/src/mesa/drivers/dri/i965/brw_context.c
>> +++ b/src/mesa/drivers/dri/i965/brw_context.c
>> @@ -470,6 +470,7 @@ brwCreateContext(int api,
>>     brw_draw_init( brw );
>>
>>     brw->precompile = driQueryOptionb(&brw->optionCache, "shader_precompile");
>> +   brw->accurate_derivative = driQueryOptionb(&brw->optionCache, "accurate_derivative");
>>
>>     ctx->Const.ContextFlags = 0;
>>     if ((flags & __DRI_CTX_FLAG_FORWARD_COMPATIBLE) != 0)
>> diff --git a/src/mesa/drivers/dri/i965/brw_context.h b/src/mesa/drivers/dri/i965/brw_context.h
>> index c566bba..8bfc54a 100644
>> --- a/src/mesa/drivers/dri/i965/brw_context.h
>> +++ b/src/mesa/drivers/dri/i965/brw_context.h
>> @@ -964,6 +964,7 @@ struct brw_context
>>     bool always_flush_cache;
>>     bool disable_throttling;
>>     bool precompile;
>> +   bool accurate_derivative;
>>
>>     driOptionCache optionCache;
>>     /** @} */
>> diff --git a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> index bfb3d33..69aeab1 100644
>> --- a/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> +++ b/src/mesa/drivers/dri/i965/brw_fs_emit.cpp
>> @@ -540,7 +540,7 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src
>>   *
>>   * arg0: ss0.tl ss0.tr ss0.bl ss0.br ss1.tl ss1.tr ss1.bl ss1.br
>>   *
>> - * and we're trying to produce:
>> + * Ideally, we want to produce:
>>   *
>>   *           DDX                    DDY
>>   * dst: (ss0.tr - ss0.tl)     (ss0.tl - ss0.bl)
>> @@ -556,24 +556,48 @@ fs_generator::generate_tex(fs_inst *inst, struct brw_reg dst, struct brw_reg src
>>   *
>>   * For DDX, it ends up being easy: width = 2, horiz=0 gets us the same result
>>   * for each pair, and vertstride = 2 jumps us 2 elements after processing a
>> - * pair. But for DDY, it's harder, as we want to produce the pairs swizzled
>> - * between each other. We could probably do it like ddx and swizzle the right
>> - * order later, but bail for now and just produce
>> + * pair. But the ideal approximation of DDX may impose a huge performance
>> + * cost on sample_d. As such, we favor ((ss0.tr - ss0.tl)x4 (ss1.tr -
>> + * ss1.tl)x4) unless the app or the user requests otherwise.
>> + *
>> + * For DDY, it's harder, as we want to produce the pairs swizzled between each
>> + * other. We could probably do it like ddx and swizzle the right order later,
>> + * but bail for now and just produce
>>   * ((ss0.tl - ss0.bl)x4 (ss1.tl - ss1.bl)x4)
>>   */
>>  void
>>  fs_generator::generate_ddx(fs_inst *inst, struct brw_reg dst, struct brw_reg src)
>>  {
>> +   unsigned vstride, width;
>> +
>> +   /* Produce the accurate result only when requested.  We emit only one
>> +    * instruction in either case, but the problem is the result may affect
>> +    * how fast sample_d executes.
>> +    *
>> +    * Since the performance difference is only observed on Haswell, ignore
>> +    * the hints on other GENs for now.
>> +    */
>> +   if (!brw->is_haswell ||
>> +       brw->ctx.Hint.FragmentShaderDerivative == GL_NICEST ||
>> +       brw->accurate_derivative) {
>> +      vstride = BRW_VERTICAL_STRIDE_2;
>> +      width = BRW_WIDTH_2;
>> +   }
>> +   else {
>> +      vstride = BRW_VERTICAL_STRIDE_4;
>> +      width = BRW_WIDTH_4;
>> +   }
>> +
>>     struct brw_reg src0 = brw_reg(src.file, src.nr, 1,
>>                                   BRW_REGISTER_TYPE_F,
>> -                                 BRW_VERTICAL_STRIDE_2,
>> -                                 BRW_WIDTH_2,
>> +                                 vstride,
>> +                                 width,
>>                                   BRW_HORIZONTAL_STRIDE_0,
>>                                   BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
>>     struct brw_reg src1 = brw_reg(src.file, src.nr, 0,
>>                                   BRW_REGISTER_TYPE_F,
>> -                                 BRW_VERTICAL_STRIDE_2,
>> -                                 BRW_WIDTH_2,
>> +                                 vstride,
>> +                                 width,
>>                                   BRW_HORIZONTAL_STRIDE_0,
>>                                   BRW_SWIZZLE_XYZW, WRITEMASK_XYZW);
>>     brw_ADD(p, dst, src0, negate(src1));
>> diff --git a/src/mesa/drivers/dri/i965/intel_screen.c b/src/mesa/drivers/dri/i965/intel_screen.c
>> index eb6515e..ee08ffd 100644
>> --- a/src/mesa/drivers/dri/i965/intel_screen.c
>> +++ b/src/mesa/drivers/dri/i965/intel_screen.c
>> @@ -61,6 +61,10 @@ PUBLIC const char __driConfigOptions[] =
>>     DRI_CONF_SECTION_END
>>     DRI_CONF_SECTION_QUALITY
>>        DRI_CONF_FORCE_S3TC_ENABLE("false")
>> +
>> +      DRI_CONF_OPT_BEGIN_B(accurate_derivative, "false")
>> +         DRI_CONF_DESC(en, "Perform more accurate derivative calculation")
>> +      DRI_CONF_OPT_END
>>     DRI_CONF_SECTION_END
>>     DRI_CONF_SECTION_DEBUG
>>        DRI_CONF_NO_RAST("false")
>> --
>> 1.8.3.1
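For anyone who wants to compare the two paths without rebuilding, the new option should be togglable from ~/.drirc like any other driconf boolean.  Something along these lines (untested sketch, standard driconf syntax; the option name comes from the patch above):

```xml
<driconf>
   <device>
      <application name="Default">
         <option name="accurate_derivative" value="true" />
      </application>
   </device>
</driconf>
```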
_______________________________________________
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/mesa-dev