On Fri, Aug 13, 2010 at 11:35 PM, Keith Whitwell <kei...@vmware.com> wrote:
> On Fri, 2010-08-13 at 08:09 -0700, Chia-I Wu wrote:
>> On Fri, Aug 13, 2010 at 10:51 PM, Keith Whitwell <kei...@vmware.com> wrote:
>> > On Fri, 2010-08-13 at 07:46 -0700, Chia-I Wu wrote:
>> >> On Fri, Aug 13, 2010 at 10:14 PM, Keith Whitwell <kei...@vmware.com>
>> >> wrote:
>> >> > On Fri, 2010-08-13 at 07:04 -0700, Chia-I Wu wrote:
>> >> >> Hi,
>> >> >>
>> >> >> There are two primitive transformations in gallium draw module.  In
>> >> >> varray, primitives are "split"ted.  When a primitive has more vertices
>> >> >> than the middle end can handle, varray splits the primitive and calls
>> >> >> the middle end multiple times.
>> >> >>
>> >> >> In vcache, primitives are "decompose"d.  More advanced primitives are
>> >> >> decomposed into one of point, line(_adj), or triangle(_adj).
>> >> >> Similarly, vcache may call the middle end multiple times to flush its
>> >> >> internal buffer.  In some cases, vcache passes the primitives through
>> >> >> without decomposing or splitting, as can be seen in vcache_check_run.
>> >> >>
>> >> >> The issue with vcache is that it has to decompose a primitive
>> >> >> differently depending on the provoking convention, as explained in
>> >> >>
>> >> >>   http://lists.freedesktop.org/archives/mesa-dev/2010-August/001797.html
>> >> >>
>> >> >> It becomes a problem when GS is active.
>> >> >>
>> >> >> My proposal is to make vcache split instead of decompose.  Because
>> >> >> varray only splits and vcache has a pass-through path, the rest of the
>> >> >> workflow already has to support all primitive types.  Switching from
>> >> >> decompose to split does not require a big change to the rest of the
>> >> >> workflow.
>> >> >>
>> >> >> But then vcache will look a lot like varray, only with indexed
>> >> >> primitive support.
>> >> >> It leads me to a new frontend that replaces both
>> >> >> varray and vcache: vsplit
>> >> >>
>> >> >>   http://cgit.freedesktop.org/~olv/mesa/log/?h=draw-vsplit
>> >> >>
>> >> >> vsplit is based on varray.  It uses some code from vcache to support
>> >> >> indexed primitives.  When vcache decomposes, flags are set to
>> >> >> indicate whether the stipple counter should be reset or whether some
>> >> >> edge of a triangle should be omitted in unfilled mode.  The segments
>> >> >> of a split primitive have flags for similar purposes too:
>> >> >>
>> >> >>   DRAW_SPLIT_AFTER   More segments to come after this one
>> >> >>   DRAW_SPLIT_BEFORE  There are preceding segments
>> >> >>
>> >> >> These flags are set by vsplit and the middle ends pass them to the
>> >> >> other stages.  Therefore, the run methods of middle ends are augmented
>> >> >> to take the flags.
>> >> >>
>> >> >> To summarize, vsplit
>> >> >>
>> >> >>  - fixes GS when (flatshade && flatshade_first) is on
>> >> >>  - never sends more vertices than the middle end claims to handle
>> >> >>  - is faster than vcache: split instead of decompose, no get_elt
>> >> >>    calls
>> >> >>  - no longer uses the higher bits of draw_elts for stipple/edge flags
>> >> >>
>> >> >> Suggestions?
>> >> >
>> >> >
>> >> > Hi - I haven't looked at the patches yet, but a couple of questions:
>> >> >
>> >> > How does this interact with the draw_pipe_* code - which requires
>> >> > decomposed primitives?
>> >> draw_pipe.c decomposes the primitives.  That code was already there
>> >> because it has to support varray and vcache_check_run, which do not
>> >> decompose.
>> >
>> > OK.
>> >
>> >> > How does this cope with indexed rendering where the vertex buffers
>> >> > themselves are too large (for hardware or some other entity)?  Eg.
>> >> > imagine the hardware could cope with up to 64k vertices, and you have a
>> >> > drawelements call randomly referencing vertices in range 0..128k ?
>> >> Vertex fetching happens in the middle end so the range of the indices
>> >> is not a problem.  Though vsplit guarantees that it never calls the
>> >> middle end with more vertices than the middle end claims to support
>> >> (as returned by draw_pt_middle_end::prepare).  The limit is usually
>> >> decided by the size of the buffer for vertex emitting.
>> >
>> > I guess I'm wondering how it does this.  If the middle end says it
>> > supports 64k vertices, and the vertex element looks like
>> >
>> >   [0, 128k, 64k, 32k, 96k, 16k, 1, ... ]
>> >
>> > what gets sent?  (Sorry, I still haven't looked at the code, you could
>> > well have addressed this).
>> I see.  The frontend would set
>>
>>   fetch_elts = [0, 128k, 64k, 32k, 96k, 16k, 1, ... ]
>>   draw_elts  = [0, 1, 2, 3, 4, 5, 6, ...]
>>
>> fetch_elts is processed by the middle end and it will fetch the given
>> vertices.  draw_elts will be passed to draw_emit or the pipeline.  It
>> is the new index buffer, which indexes into the fetched vertices.
>>
>> It is actually the same as vcache.  So when fetch_elts is
>>
>>   [0, 128k, 64k, 64k, 128k, 16k, ...],
>>
>> draw_elts would be set to
>>
>>   [0, 1, 2, 2, 1, 3, ...]
>>
>> The number of elements to fetch (and shade) is minimized.
>
> Thanks Chia-I, I've taken a look at the code & this makes sense - the
> fetch/draw cache is still there, but specialized into 4 versions for
> each element type.  And it seems like you take some steps not to hit it
> unnecessarily.
>
> I'm coming up to speed on it though, so a couple more questions - for
> fan primitives, it seems like you always end up in the segment_cache
> code -- is that true, or is there a fastpath I missed?  In particular,
> if the whole fan fits within the limits of the middle end, will it still
> end up going through the cache?
Yes, if it exceeds vsplit's limit (SEGMENT_SIZE).
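
As a rough illustration of the fetch_elts/draw_elts remapping quoted above, here is a minimal sketch in C.  The names (remap_sketch, remap_sketch_build, SKETCH_MAX_ELTS) and the linear search are assumptions made up for this example; this is not the draw module's actual cache code, only the same idea in miniature.

```c
#include <assert.h>

/* Hypothetical capacity for this sketch only. */
#define SKETCH_MAX_ELTS 1024

struct remap_sketch {
   unsigned fetch_elts[SKETCH_MAX_ELTS];      /* unique vertices to fetch */
   unsigned short draw_elts[SKETCH_MAX_ELTS]; /* indices into fetch_elts */
   unsigned fetch_count;
   unsigned draw_count;
};

/* Build fetch_elts/draw_elts from an index buffer.
 * Returns 0 on success, -1 if the input does not fit in the sketch buffers.
 */
static int
remap_sketch_build(struct remap_sketch *r, const unsigned *ib, unsigned count)
{
   unsigned i, j;

   assert(r && ib);

   r->fetch_count = 0;
   r->draw_count = 0;

   if (count > SKETCH_MAX_ELTS)
      return -1;

   for (i = 0; i < count; i++) {
      /* linear search for clarity; a real cache would use a faster lookup */
      for (j = 0; j < r->fetch_count; j++) {
         if (r->fetch_elts[j] == ib[i])
            break;
      }

      /* first time this vertex is seen: schedule it for fetching */
      if (j == r->fetch_count)
         r->fetch_elts[r->fetch_count++] = ib[i];

      /* the draw index refers to the slot in fetch_elts, so it stays small */
      r->draw_elts[r->draw_count++] = (unsigned short) j;
   }

   return 0;
}
```

With the index buffer from the second example above (128k written as 131072 and so on), the sketch fetches four unique vertices and produces draw_elts = [0, 1, 2, 2, 1, 3].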
> Actually it looks like this happens in an early-out at the bottom of the
> patch:
>
>
> +   /* no splitting required */
> +   if (count <= max_count_simple) {
> +      SEGMENT_SIMPLE(0x0, start, count);
> +   }
>
>
> where max_count_simple is either
>
>    vsplit->max_vertices
> or
>    vsplit->segment_size (for indexed primitives)
>
> These in turn are generated as:
>
> +   middle->prepare(middle, vsplit->prim, opt, &vsplit->max_vertices);
> +
> +   vsplit->segment_size = MIN2(SEGMENT_SIZE, vsplit->max_vertices);
>
> and SEGMENT_SIZE is 1024.
>
>
> So any indexed primitive where the number of vertices (or is it number
> of indices) exceeds 1024, will end up on the cache path?
> I know this used to be true as well -- just wondering if there is a way
> to improve on this...
For indexed primitives, max_count_simple is set to the segment size
(<= 1024) because the middle end expects draw_elts to be of type ushort.
vsplit needs to use its internal fixed-size buffer when index_size != 2.
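
To make the limit selection quoted above concrete, here is a minimal sketch in C.  SEGMENT_SIZE and MIN2 follow the quoted snippet; the helper names sketch_max_count_simple and sketch_needs_split are hypothetical and exist only for this example, they are not functions in the draw module.

```c
#include <assert.h>

#define SEGMENT_SIZE 1024                     /* value quoted in this thread */
#define MIN2(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical helper: the largest count that is drawn without splitting.
 * max_vertices stands for the value the middle end reports from prepare().
 */
static unsigned
sketch_max_count_simple(unsigned max_vertices, int indexed)
{
   unsigned segment_size;

   assert(max_vertices > 0);   /* precondition of this sketch only */

   segment_size = MIN2(SEGMENT_SIZE, max_vertices);

   /* indexed draws are bounded by the fixed-size internal buffer */
   return indexed ? segment_size : max_vertices;
}

/* Hypothetical helper: nonzero when the primitive has to be split. */
static int
sketch_needs_split(unsigned count, unsigned max_vertices, int indexed)
{
   return count > sketch_max_count_simple(max_vertices, indexed);
}
```

Under these assumptions, a middle end that reports 64k vertices still gives an indexed limit of 1024, so an indexed draw of 2000 elements is split while a non-indexed draw of the same size is not.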
The limit may be lifted for index_size == 2.  The attached patch should
relax the limit (untested, as it is getting late here :-).

Another approach that comes to mind is to make the internal buffer
dynamically sized, and to turn SEGMENT_SIZE into a large upper bound on
that dynamic size.

-- 
o...@lunarg.com
commit 59ef2404b50b24a281ff3999fa3538d0b7b425b8
Author: Chia-I Wu <o...@lunarg.com>
Date:   Sat Aug 14 00:05:28 2010 +0800

    blah

diff --git a/src/gallium/auxiliary/draw/draw_pt_vsplit_tmp.h b/src/gallium/auxiliary/draw/draw_pt_vsplit_tmp.h
index efeaa56..b2c2813 100644
--- a/src/gallium/auxiliary/draw/draw_pt_vsplit_tmp.h
+++ b/src/gallium/auxiliary/draw/draw_pt_vsplit_tmp.h
@@ -44,10 +44,23 @@ CONCAT(vsplit_segment_fast_, ELT_TYPE)(struct vsplit_frontend *vsplit,
    const unsigned max_index = draw->pt.user.max_index;
    const int elt_bias = draw->pt.user.eltBias;
    unsigned fetch_start, fetch_count;
-   const ushort *draw_elts;
+   const ushort *draw_elts = NULL;
    unsigned i;
 
-   assert(icount <= vsplit->segment_size);
+   /* use the ib directly */
+   if (min_index == 0 && sizeof(ib[0]) == sizeof(draw_elts[0])) {
+      draw_elts = (const ushort *) ib;
+
+      for (i = 0; i < icount; i++) {
+         ELT_TYPE idx = ib[istart + i];
+         assert(idx >= min_index && idx <= max_index);
+      }
+   }
+   else {
+      /* have to go through vsplit->draw_elts */
+      if (icount > vsplit->segment_size)
+         return FALSE;
+   }
 
    /* this is faster only when we fetch less elements than the normal path */
    if (max_index - min_index > icount - 1)
@@ -65,14 +78,7 @@ CONCAT(vsplit_segment_fast_, ELT_TYPE)(struct vsplit_frontend *vsplit,
    fetch_start = min_index + elt_bias;
    fetch_count = max_index - min_index + 1;
 
-   if (min_index == 0 && sizeof(ib[0]) == sizeof(draw_elts[0])) {
-      for (i = 0; i < icount; i++) {
-         ELT_TYPE idx = ib[istart + i];
-         assert(idx >= min_index && idx <= max_index);
-      }
-      draw_elts = (const ushort *) ib;
-   }
-   else {
+   if (!draw_elts) {
       if (min_index == 0) {
          for (i = 0; i < icount; i++) {
             ELT_TYPE idx = ib[istart + i];
@@ -170,12 +176,6 @@ CONCAT(vsplit_segment_simple_, ELT_TYPE)(struct vsplit_frontend *vsplit,
                                          unsigned istart,
                                          unsigned icount)
 {
-   /* the primitive is not splitted */
-   if (!(flags)) {
-      if (CONCAT(vsplit_segment_fast_, ELT_TYPE)(vsplit,
-               flags, istart, icount))
-         return;
-   }
    CONCAT(vsplit_segment_cache_, ELT_TYPE)(vsplit,
          flags, istart, icount, FALSE, 0, FALSE, 0);
 }
@@ -213,6 +213,9 @@ CONCAT(vsplit_segment_fan_, ELT_TYPE)(struct vsplit_frontend *vsplit,
    const unsigned max_count_loop = vsplit->segment_size - 1;                \
    const unsigned max_count_fan = vsplit->segment_size;
 
+#define SEGMENT_FAST(flags, istart, icount)                                 \
+   CONCAT(vsplit_segment_fast_, ELT_TYPE)(vsplit, flags, istart, icount)
+
 #else /* ELT_TYPE */
 
 static void
diff --git a/src/gallium/auxiliary/draw/draw_split_tmp.h b/src/gallium/auxiliary/draw/draw_split_tmp.h
index 40ab0b7..129bd5c 100644
--- a/src/gallium/auxiliary/draw/draw_split_tmp.h
+++ b/src/gallium/auxiliary/draw/draw_split_tmp.h
@@ -52,6 +52,12 @@ FUNC(FUNC_VARS)
           max_count_loop >= first + incr &&
           max_count_fan >= first + incr);
 
+#ifdef SEGMENT_FAST
+   /* optional fast path */
+   if (SEGMENT_FAST(0x0, start, count))
+      return;
+#endif
+
    /* no splitting required */
    if (count <= max_count_simple) {
       SEGMENT_SIMPLE(0x0, start, count);
@@ -166,6 +172,7 @@ FUNC(FUNC_VARS)
 #undef FUNC_VARS
 #undef LOCAL_VARS
 
+#undef SEGMENT_FAST
 #undef SEGMENT_SIMPLE
 #undef SEGMENT_LOOP
 #undef SEGMENT_FAN