Am 06.01.2016 um 17:31 schrieb Jose Fonseca: > On 06/01/16 16:26, Jose Fonseca wrote: >> On 06/01/16 00:06, srol...@vmware.com wrote: >>> From: Roland Scheidegger <srol...@vmware.com> >>> >>> The trick here is to recognize that in the c + n * dcdx calculations, >>> not only can the lower FIXED_ORDER bits not change (as the dcdx values >>> have those all zero) but that this means the sign bit of the >>> calculations >>> cannot be different as well, that is >>> sign(c + n*dcdx) == sign((c >> FIXED_ORDER) + n*(dcdx >> FIXED_ORDER)). >>> That shaves off more than enough bits to never require 64bit masks. >>> A shifted plane c value could still easily exceed 32 bits, however >>> since we >>> throw out planes which are trivial accept even before binning (and >>> similarly >>> don't even get to see tris for which there was a trivial reject >>> plane)) this >>> is never a problem. >>> The idea isnt't all that revolutionary, in fact something similar was >>> tried >>> ages ago (9773722c2b09d5f0615a47cecf4347859474dc56) back when the >>> values were >>> only 32 bit anyway. I believe now it didn't quite work then because the >>> adjustment needed for testing trivial reject / partial masks wasn't >>> handled >>> correctly. >>> This still keeps the separate 32/64 bit paths for now, as the 32 bit >>> one still >>> looks minimally simpler (and also because if we'd pass in dcdx/dcdy/eo >>> unscaled >>> from setup which would be a good reason to ditch the 32 bit path, we'd >>> need to >>> change the special-purpose rasterization functions for small tris). >>> >>> This passes piglit triangle-rasterization (-fbo -auto -max_size >>> -subpixelbits 8). It still fails triangle-rasterization-overdraw >>> -max_size >>> (no change, fails everything at position 2048 - interestingly for >>> softpipe, >>> nvidia maxwell 1 blob, and amd evergreen open-source drivers the test >>> fails >>> as well but at 4096 - seems like we're missing a float mantissa bit >>> somewhere!). >> >> I don't think that's how the test is supposed to be run. >> >> If you do an apitrace, you'll see the test creates a fbo with 1000x1000, >> a viewport with 16Kx16K, and does a readpixels of 4Kx4K... > > The problem is that the generic "-fbo" option is not useful for this, as > we can't reliably resize it after the fact. > > Take a look at tests/general/triangle-rasterization.cpp -- it has a > different option "-use_fbo" that creates its own fbo. OK I was running that the wrong way too I think. This one still passes with -max_size -use_fbo -subpixelbits 8 (takes _forever_ though - all due to convert_ubyte in readpixel path...)
triangle-rasterization-overdraw with just -auto passes. The max_size parameter is a bit confusing since it won't do anything at all without -fbo as piglit_width/height will just get overwritten to window_width/height (and with fbo it will just fail badly). Increasing the window size manually to 8192/8192 won't really work neither as the size will be cut down to screen size. However, increasing this and then use -fbo actually does the right thing. And passes. Would be nice if piglit could pick up those size parameters _after_ piglit_init... Roland > Jose > > >> >> Jose >> >>> --- >>> src/gallium/drivers/llvmpipe/lp_rast_tri.c | 84 >>> +++++++++---------- >>> src/gallium/drivers/llvmpipe/lp_rast_tri_tmp.h | 107 >>> +++++++++++++++++++++---- >>> 2 files changed, 133 insertions(+), 58 deletions(-) >>> >>> diff --git a/src/gallium/drivers/llvmpipe/lp_rast_tri.c >>> b/src/gallium/drivers/llvmpipe/lp_rast_tri.c >>> index c9b9221..a4dd6ef 100644 >>> --- a/src/gallium/drivers/llvmpipe/lp_rast_tri.c >>> +++ b/src/gallium/drivers/llvmpipe/lp_rast_tri.c >>> @@ -64,43 +64,43 @@ block_full_16(struct lp_rasterizer_task *task, >>> } >>> >>> static inline unsigned >>> -build_mask_linear(int64_t c, int64_t dcdx, int64_t dcdy) >>> +build_mask_linear(int32_t c, int32_t dcdx, int32_t dcdy) >>> { >>> unsigned mask = 0; >>> >>> - int64_t c0 = c; >>> - int64_t c1 = c0 + dcdy; >>> - int64_t c2 = c1 + dcdy; >>> - int64_t c3 = c2 + dcdy; >>> - >>> - mask |= ((c0 + 0 * dcdx) >> FIXED_SHIFT) & (1 << 0); >>> - mask |= ((c0 + 1 * dcdx) >> FIXED_SHIFT) & (1 << 1); >>> - mask |= ((c0 + 2 * dcdx) >> FIXED_SHIFT) & (1 << 2); >>> - mask |= ((c0 + 3 * dcdx) >> FIXED_SHIFT) & (1 << 3); >>> - mask |= ((c1 + 0 * dcdx) >> FIXED_SHIFT) & (1 << 4); >>> - mask |= ((c1 + 1 * dcdx) >> FIXED_SHIFT) & (1 << 5); >>> - mask |= ((c1 + 2 * dcdx) >> FIXED_SHIFT) & (1 << 6); >>> - mask |= ((c1 + 3 * dcdx) >> FIXED_SHIFT) & (1 << 7); >>> - mask |= ((c2 + 0 * dcdx) >> FIXED_SHIFT) & (1 << 8); >>> - mask |= ((c2 + 1 * dcdx) >> FIXED_SHIFT) & (1 << 9); >>> - mask |= ((c2 + 2 * dcdx) >> FIXED_SHIFT) & (1 << 10); >>> - mask |= ((c2 + 3 * dcdx) >> FIXED_SHIFT) & (1 << 11); >>> - mask |= ((c3 + 0 * dcdx) >> FIXED_SHIFT) & (1 << 12); >>> - mask |= ((c3 + 1 * dcdx) >> FIXED_SHIFT) & (1 << 13); >>> - mask |= ((c3 + 2 * dcdx) >> FIXED_SHIFT) & (1 << 14); >>> - mask |= ((c3 + 3 * dcdx) >> FIXED_SHIFT) & (1 << 15); >>> + int32_t c0 = c; >>> + int32_t c1 = c0 + dcdy; >>> + int32_t c2 = c1 + dcdy; >>> + int32_t c3 = c2 + dcdy; >>> + >>> + mask |= ((c0 + 0 * dcdx) >> 31) & (1 << 0); >>> + mask |= ((c0 + 1 * dcdx) >> 31) & (1 << 1); >>> + mask |= ((c0 + 2 * dcdx) >> 31) & (1 << 2); >>> + mask |= ((c0 + 3 * dcdx) >> 31) & (1 << 3); >>> + mask |= ((c1 + 0 * dcdx) >> 31) & (1 << 4); >>> + mask |= ((c1 + 1 * dcdx) >> 31) & (1 << 5); >>> + mask |= ((c1 + 2 * dcdx) >> 31) & (1 << 6); >>> + mask |= ((c1 + 3 * dcdx) >> 31) & (1 << 7); >>> + mask |= ((c2 + 0 * dcdx) >> 31) & (1 << 8); >>> + mask |= ((c2 + 1 * dcdx) >> 31) & (1 << 9); >>> + mask |= ((c2 + 2 * dcdx) >> 31) & (1 << 10); >>> + mask |= ((c2 + 3 * dcdx) >> 31) & (1 << 11); >>> + mask |= ((c3 + 0 * dcdx) >> 31) & (1 << 12); >>> + mask |= ((c3 + 1 * dcdx) >> 31) & (1 << 13); >>> + mask |= ((c3 + 2 * dcdx) >> 31) & (1 << 14); >>> + mask |= ((c3 + 3 * dcdx) >> 31) & (1 << 15); >>> >>> return mask; >>> } >>> >>> >>> static inline void >>> -build_masks(int64_t c, >>> - int64_t cdiff, >>> - int64_t dcdx, >>> - int64_t dcdy, >>> - unsigned *outmask, >>> - unsigned *partmask) >>> +build_masks(int32_t c, >>> + int32_t cdiff, >>> + int32_t dcdx, >>> + int32_t dcdy, >>> + unsigned *outmask, >>> + unsigned *partmask) >>> { >>> *outmask |= build_mask_linear(c, dcdx, dcdy); >>> *partmask |= build_mask_linear(c + cdiff, dcdx, dcdy); >>> @@ -168,12 +168,12 @@ lp_rast_triangle_32_3_4(struct >>> lp_rasterizer_task *task, >>> >>> >>> static inline void >>> -build_masks_32(int c, >>> - int cdiff, >>> - int dcdx, >>> - int dcdy, >>> - unsigned *outmask, >>> - unsigned *partmask) >>> +build_masks_sse(int c, >>> + int cdiff, >>> + int dcdx, >>> + int dcdy, >>> + unsigned *outmask, >>> + unsigned *partmask) >>> { >>> __m128i cstep0 = _mm_setr_epi32(c, c+dcdx, c+dcdx*2, c+dcdx*3); >>> __m128i xdcdy = _mm_set1_epi32(dcdy); >>> @@ -214,7 +214,7 @@ build_masks_32(int c, >>> >>> >>> static inline unsigned >>> -build_mask_linear_32(int c, int dcdx, int dcdy) >>> +build_mask_linear_sse(int c, int dcdx, int dcdy) >>> { >>> __m128i cstep0 = _mm_setr_epi32(c, c+dcdx, c+dcdx*2, c+dcdx*3); >>> __m128i xdcdy = _mm_set1_epi32(dcdy); >>> @@ -474,8 +474,15 @@ lp_rast_triangle_32_3_4(struct lp_rasterizer_task >>> *task, >>> #endif >>> >>> >>> +#ifdef PIPE_ARCH_SSE >>> +#define BUILD_MASKS(c, cdiff, dcdx, dcdy, omask, pmask) >>> build_masks_sse((int)c, (int)cdiff, dcdx, dcdy, omask, pmask) >>> +#define BUILD_MASK_LINEAR(c, dcdx, dcdy) >>> build_mask_linear_sse((int)c, dcdx, dcdy) >>> +#else >>> #define BUILD_MASKS(c, cdiff, dcdx, dcdy, omask, pmask) >>> build_masks(c, cdiff, dcdx, dcdy, omask, pmask) >>> #define BUILD_MASK_LINEAR(c, dcdx, dcdy) build_mask_linear(c, dcdx, >>> dcdy) >>> +#endif >>> + >>> +#define RASTER_64 1 >>> >>> #define TAG(x) x##_1 >>> #define NR_PLANES 1 >>> @@ -512,12 +519,7 @@ lp_rast_triangle_32_3_4(struct lp_rasterizer_task >>> *task, >>> #define NR_PLANES 8 >>> #include "lp_rast_tri_tmp.h" >>> >>> -#ifdef PIPE_ARCH_SSE >>> -#undef BUILD_MASKS >>> -#undef BUILD_MASK_LINEAR >>> -#define BUILD_MASKS(c, cdiff, dcdx, dcdy, omask, pmask) >>> build_masks_32((int)c, (int)cdiff, dcdx, dcdy, omask, pmask) >>> -#define BUILD_MASK_LINEAR(c, dcdx, dcdy) build_mask_linear_32((int)c, >>> dcdx, dcdy) >>> -#endif >>> +#undef RASTER_64 >>> >>> #define TAG(x) x##_32_1 >>> #define NR_PLANES 1 >>> diff --git a/src/gallium/drivers/llvmpipe/lp_rast_tri_tmp.h >>> b/src/gallium/drivers/llvmpipe/lp_rast_tri_tmp.h >>> index e0aea94..05d2242 100644 >>> --- a/src/gallium/drivers/llvmpipe/lp_rast_tri_tmp.h >>> +++ b/src/gallium/drivers/llvmpipe/lp_rast_tri_tmp.h >>> @@ -50,9 +50,15 @@ TAG(do_block_4)(struct lp_rasterizer_task *task, >>> int j; >>> >>> for (j = 0; j < NR_PLANES; j++) { >>> - mask &= ~BUILD_MASK_LINEAR(c[j] - 1, >>> - -plane[j].dcdx, >>> - plane[j].dcdy); >>> +#ifdef RASTER_64 >>> + mask &= ~BUILD_MASK_LINEAR(((c[j] - 1) >> (int64_t)FIXED_ORDER), >>> + -plane[j].dcdx >> FIXED_ORDER, >>> + plane[j].dcdy >> FIXED_ORDER); >>> +#else >>> + mask &= ~BUILD_MASK_LINEAR((c[j] - 1), >>> + -plane[j].dcdx, >>> + plane[j].dcdy); >>> +#endif >>> } >>> >>> /* Now pass to the shader: >>> @@ -79,17 +85,33 @@ TAG(do_block_16)(struct lp_rasterizer_task *task, >>> partmask = 0; /* outside one or more trivial >>> accept planes */ >>> >>> for (j = 0; j < NR_PLANES; j++) { >>> +#ifdef RASTER_64 >>> + int32_t dcdx = -plane[j].dcdx >> FIXED_ORDER; >>> + int32_t dcdy = plane[j].dcdy >> FIXED_ORDER; >>> + const int32_t cox = plane[j].eo >> FIXED_ORDER; >>> + const int32_t ei = (dcdy + dcdx - cox) << 2; >>> + const int32_t cox_s = cox << 2; >>> + const int32_t co = (int32_t)(c[j] >> (int64_t)FIXED_ORDER) + >>> cox_s; >>> + int32_t cdiff; >>> + cdiff = ei - cox_s + ((int32_t)((c[j] - 1) >> >>> (int64_t)FIXED_ORDER) - >>> + (int32_t)(c[j] >> (int64_t)FIXED_ORDER)); >>> + dcdx <<= 2; >>> + dcdy <<= 2; >>> +#else >>> const int64_t dcdx = -IMUL64(plane[j].dcdx, 4); >>> const int64_t dcdy = IMUL64(plane[j].dcdy, 4); >>> const int64_t cox = IMUL64(plane[j].eo, 4); >>> - const int64_t ei = plane[j].dcdy - plane[j].dcdx - >>> (int64_t)plane[j].eo; >>> + const int32_t ei = plane[j].dcdy - plane[j].dcdx - >>> (int64_t)plane[j].eo; >>> const int64_t cio = IMUL64(ei, 4) - 1; >>> + int32_t co, cdiff; >>> + co = c[j] + cox; >>> + cdiff = cio - cox; >>> +#endif >>> >>> - BUILD_MASKS(c[j] + cox, >>> - cio - cox, >>> - dcdx, dcdy, >>> - &outmask, /* sign bits from c[i][0..15] + cox */ >>> - &partmask); /* sign bits from c[i][0..15] + cio */ >>> + BUILD_MASKS(co, cdiff, >>> + dcdx, dcdy, >>> + &outmask, /* sign bits from c[i][0..15] + cox */ >>> + &partmask); /* sign bits from c[i][0..15] + cio */ >>> } >>> >>> if (outmask == 0xffff) >>> @@ -179,14 +201,65 @@ TAG(lp_rast_triangle)(struct lp_rasterizer_task >>> *task, >>> c[j] = plane[j].c + IMUL64(plane[j].dcdy, y) - >>> IMUL64(plane[j].dcdx, x); >>> >>> { >>> - const int64_t dcdx = -IMUL64(plane[j].dcdx, 16); >>> - const int64_t dcdy = IMUL64(plane[j].dcdy, 16); >>> - const int64_t cox = IMUL64(plane[j].eo, 16); >>> - const int64_t ei = plane[j].dcdy - plane[j].dcdx - >>> (int64_t)plane[j].eo; >>> - const int64_t cio = IMUL64(ei, 16) - 1; >>> - >>> - BUILD_MASKS(c[j] + cox, >>> - cio - cox, >>> +#ifdef RASTER_64 >>> + /* >>> + * Strip off lower FIXED_ORDER bits. Note that those bits from >>> + * dcdx, dcdy, eo are always 0 (by definition). >>> + * c values, however, are not. This means that for every >>> + * addition of the form c + n*dcdx the lower FIXED_ORDER >>> bits will >>> + * NOT change. And those bits are not relevant to the sign >>> bit (which >>> + * is only what we need!) that is, >>> + * sign(c + n*dcdx) == sign((c >> FIXED_ORDER) + n*(dcdx >> >>> FIXED_ORDER)) >>> + * This means we can get away with using 32bit math for the >>> most part. >>> + * Only tricky part is the -1 adjustment for cdiff. >>> + */ >>> + int32_t dcdx = -plane[j].dcdx >> FIXED_ORDER; >>> + int32_t dcdy = plane[j].dcdy >> FIXED_ORDER; >>> + const int32_t cox = plane[j].eo >> FIXED_ORDER; >>> + const int32_t ei = (dcdy + dcdx - cox) << 4; >>> + const int32_t cox_s = cox << 4; >>> + const int32_t co = (int32_t)(c[j] >> (int64_t)FIXED_ORDER) + >>> cox_s; >>> + int32_t cdiff; >>> + /* >>> + * Plausibility check to ensure the 32bit math works. >>> + * Note that within a tile, the max we can move the edge >>> function >>> + * is essentially dcdx * TILE_SIZE + dcdy * TILE_SIZE. >>> + * TILE_SIZE is 64, dcdx/dcdy are nominally 21 bit (for 8192 >>> max size >>> + * and 8 subpixel bits), I'd be happy with 2 bits more too >>> (1 for >>> + * increasing fb size to 16384, the required d3d11 value, >>> another one >>> + * because I'm not quite sure we can't be _just_ above the >>> max value >>> + * here). This gives us 30 bits max - hence if c would >>> exceed that here >>> + * that means the plane is either trivial reject for the >>> whole tile >>> + * (in which case the tri will not get binned), or trivial >>> accept for >>> + * the whole tile (in which case plane_mask will not include >>> it). >>> + */ >>> + assert((c[j] >> (int64_t)FIXED_ORDER) > (int32_t)0xb0000000 && >>> + (c[j] >> (int64_t)FIXED_ORDER) < (int32_t)0x3fffffff); >>> + /* >>> + * Note the fixup part is constant throughout the tile - >>> thus could >>> + * just calculate this and avoid _all_ 64bit math in >>> rasterization >>> + * (except exactly this fixup calc). >>> + * In fact theoretically could move that even to setup, >>> albeit that >>> + * seems tricky (pre-bin certainly can have values larger >>> than 32bit, >>> + * and would need to communicate that fixup value through). >>> + * And if we want to support msaa, we'd probably don't want >>> to do the >>> + * downscaling in setup in any case... >>> + */ >>> + cdiff = ei - cox_s + ((int32_t)((c[j] - 1) >> >>> (int64_t)FIXED_ORDER) - >>> + (int32_t)(c[j] >> >>> (int64_t)FIXED_ORDER)); >>> + dcdx <<= 4; >>> + dcdy <<= 4; >>> +#else >>> + const int32_t dcdx = -plane[j].dcdx << 4; >>> + const int32_t dcdy = plane[j].dcdy << 4; >>> + const int32_t cox = plane[j].eo << 4; >>> + const int32_t ei = plane[j].dcdy - plane[j].dcdx - >>> (int32_t)plane[j].eo; >>> + const int32_t cio = (ei << 4) - 1; >>> + int32_t co, cdiff; >>> + co = c[j] + cox; >>> + cdiff = cio - cox; >>> +#endif >>> + BUILD_MASKS(co, cdiff, >>> dcdx, dcdy, >>> &outmask, /* sign bits from c[i][0..15] + >>> cox */ >>> &partmask); /* sign bits from c[i][0..15] + >>> cio */ >>> >> > _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/mesa-dev