Interesting problem. It's similar to the performance issues that we see in
FlightGear.

On Fri, Jan 21, 2011 at 10:13 PM, Jean-Sébastien Guay <
[email protected]> wrote:

> Hi all,
>
> I thought I had a pretty firm grasp on what to optimize given a certain set
> of scene stats, but I've optimized what I can and I'm still getting little
> improvement in results. So I'll explain my situation here and hope you guys
> have some good suggestions. Sorry if this is a long message, but I prefer to
> give all the relevant data now rather than get asked later.
>
> The whole scene is about a 200m x 200m square (apart from the ocean and
> skydome but these are not significant, I have
>
...

> The other thing is that there are a lot of dynamic objects, so there are a
> lot of transforms. But I can't change this, it's part of our simulation.
>
As an aside, you may need to change your approach to the dynamic objects
from a "naive" scene graph to one where the geometry of the dynamic objects
is instanced and/or coalesced.

> So, after doing some optimization (removing redundant groups, building
> texture atlases where possible, merging geodes and geometry, generating
> triangle strips, most of which I did with the osgUtil::Optimizer), I get the
> following stats, which I'll talk about a bit later:
>
> Scene stats:
> StateSets     1345
> Groups         392
> Transforms     672
> Geodes         992
> Geometry       992
> Vertices    139859
> Primitives   87444
>
This is a big scene graph, especially compared to the small amount of
geometry that is in the scene.

> Camera stats:
> State graphs       1282
> Drawables          2151
> PrimitiveSets     73953
> Triangles          3538
> Tri. Strips      211091
> Tri. Fans            16
> Quads             11526
> Quad Strips         534
> Total primitives 226705
>
In addition to the geometry problems you attack later on, the state graph
number is very high, as is the drawable number. "State graphs" is a measure
of the number of unique graphics states set up during the rendering
traversal.


> And, both in our simulator and in osgViewer, for the same scene and same
> viewpoint, I get:
>
> FPS: ~35
> Cull: 5.4ms
> Draw: 19ms
> GPU: 19ms
>
> This is on a pretty good machine: Core i7 920, GeForce GTX 260.
>
The cull time seems high to me. I think this is caused by the expense of
traversing a big scene graph every frame.

Draw time is very long, and GPU time probably suffers as a result. Ideally
you want to shoot for a tiny draw time ( < 5ms) and let GPU time extend well
past the end of the draw traversal, even overlapping with the next frame's
update and cull traversals. The relationship between draw and GPU time can
be hard to predict. I've found that it is very rare to have a GPU time that
is less than draw time, and I think this is caused by the GPU stalling,
waiting on OpenGL commands from the CPU.

> First of all, the stats above tell me that the "Primitives" part of the
> scene stats refers to primitive sets, not just primitives... Since the
> camera stats tell me there are over 226000 primitives in the current view.
>
> As you can see, the number of primitiveSets is very high. If I understand
> correctly, each PrimitiveSet will result in an OpenGL draw call, and since
> my draw time is what's high now, I would want to reduce that (since I'm
> currently at about 3 primitives per primitiveSet on average). If I remove
> triangle strip generation from the optimizer options, the stats become:
>
That is basically true. When you use display lists, the draw calls for a
Drawable are all included in one display list, but this still results in
OpenGL command overhead, as far as I can tell.

>
> Scene stats:
> StateSets     1345
> Groups         392
> Transforms     672
> Geodes         992
> Geometry       992
> Vertices    190392
> Primitives   51197
>
> Camera stats:
> State graphs       1254
> Drawables          2117
> PrimitiveSets      4899
> Triangles         17122
> Tri. Strips         191
> Tri. Fans          7212
> Quads            106464
> Quad Strips         534
> Total primitives 131523
>
> This indicates to me that the tristrip visitor in the optimizer does a
> pretty bad job. I looked at an .osg dump, and it seems to generate a
> separate strip for each quad (so one strip for 4 vertices) which is
> ridiculous... But that's a subject for another day.
>
Yeah, I think we're advising against the tristripper these days.

> When I disabled the tristripper, you can see a massive decrease in the
> number of primitiveSets (and even in the number of primitives), however
> there was no significant change in the frame rate and timings. I don't
> understand this. I would have expected, with more primitives per
> primitiveSet (I'm now at about 26 prims per primSet on average, as opposed
> to around 3 before) and much less draw calls, that the draw time would have
> been much lower. That's not what happens in practice.
>
There are two factors here. 26 prims is still about 2 orders of magnitude
below the ideal. You have all those drawables, and you will need to do a lot
of work to reduce that number, both with the Optimizer and perhaps by
reorganizing your graph.

But there's also that big state graph number; that could be the bottleneck.
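
To put the primitive-set figures in perspective, here is a small Python
sketch that only redoes the arithmetic on the camera stats quoted in this
thread. The dictionary keys and the function name are made up for the
illustration; nothing here touches OSG.

```python
# Figures copied from the two sets of camera stats quoted in this thread.
camera_stats = {
    "with tristripper": {"primitive_sets": 73953, "primitives": 226705},
    "without tristripper": {"primitive_sets": 4899, "primitives": 131523},
}

def prims_per_set(stats):
    """Average number of primitives submitted per primitive set."""
    return stats["primitives"] / stats["primitive_sets"]

averages = {name: prims_per_set(s) for name, s in camera_stats.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.1f} primitives per primitive set")
```

That comes out at roughly 3 and 27 primitives per primitive set, so even the
better case leaves each draw call carrying very little work.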

> My previous attempts at optimizing (using the osgUtil::Optimizer) were also
> centered around lowering the number of primitives (by creating texture
> atlases and sharing state so the merging of geodes and geometry objects gave
> good results). And even though that also lowered the numbers (I started at
> around 2215 Geodes and 2521 Geometry objects in the same scene, compare that
> to 992 each now), it also had underwhelming results in practice.
>
> Clearly there are more than one primitiveSet per Geometry in the above
> stats. What I see in the dumped .osg file, is there is often things like:
>
>          PrimitiveSets 4
>          {
>            DrawArrays TRIANGLES 0 12
>            DrawArrays QUADS 12 152
>            DrawArrays TRIANGLES 164 12
>            DrawArrays QUADS 176 152
>          }
>
> I would expect, by reordering the vertex/color/normal/texCoord data, I
> would be able to get only 2 primitiveSets there, one TRIANGLES and one
> QUADS. Am I wrong? Why does the osgUtil::Optimizer not do this already when
> merging Geometry objects? I expect because it's easier not to do it, but
> still, it gives sub-optimal results...
>
You can get one primitive set there; quads are just two triangles. A happy
side effect of the INDEX_MESH | VERTEX_POSTTRANSFORM | VERTEX_PRETRANSFORM
combo of optimizers, which are relatively new and not part of the default
set, is that they combine multiple primitive sets into one. You could try
those, even if their primary purpose -- optimizing cache use -- won't do
much for these small meshes.
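
To illustrate the "quads are just two triangles" point, here is a rough
Python sketch that flattens the four DrawArrays records from your dump into
a single triangle index list. It is not what the optimizers do internally,
just the idea; the function name and the (mode, first, count) tuples are
invented for the example.

```python
def to_triangle_indices(primitive_sets):
    """Flatten DrawArrays-style (mode, first, count) records into one index list."""
    indices = []
    for mode, first, count in primitive_sets:
        if mode == "TRIANGLES":
            # Already triangles: reference the vertices in order.
            indices.extend(range(first, first + count - count % 3))
        elif mode == "QUADS":
            for q in range(first, first + count - count % 4, 4):
                # Quad (q, q+1, q+2, q+3) -> triangles (q, q+1, q+2) and (q, q+2, q+3)
                indices.extend((q, q + 1, q + 2, q, q + 2, q + 3))
        else:
            raise ValueError(f"unsupported mode: {mode}")
    return indices

# The four primitive sets from the dumped .osg fragment quoted above.
dumped = [
    ("TRIANGLES", 0, 12),
    ("QUADS", 12, 152),
    ("TRIANGLES", 164, 12),
    ("QUADS", 176, 152),
]
merged = to_triangle_indices(dumped)
print(len(merged), "indices,", len(merged) // 3, "triangles in one primitive set")
```

The vertex arrays stay where they are; only the indexing changes, which is
why no reordering of the vertex/color/normal/texCoord data is needed.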

>
> Of course I can't do that for strips or fans, unless I insert new vertices
> to restart the strip. Again this is something that could be done, but might
> bring diminishing returns in my case given that my own scene contains many
> more triangles and quads than strips and fans (when I turn off
> tristripping).
>
Dumping all the strips and fans into one big indexed DrawElements gives
better results these days, and the new optimizers do that.
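
For the record, no restart vertices are needed once everything is indexed.
A minimal Python sketch of expanding strips and fans into plain triangle
indices follows; the helper names are hypothetical and this is only meant
to show the index pattern, not the optimizer's actual code.

```python
def strip_to_triangles(strip):
    """Expand a triangle strip's vertex indices into independent triangles."""
    triangles = []
    for i in range(len(strip) - 2):
        a, b, c = strip[i], strip[i + 1], strip[i + 2]
        # Every second triangle in a strip has reversed winding; swap to keep it consistent.
        triangles.extend((a, b, c) if i % 2 == 0 else (b, a, c))
    return triangles

def fan_to_triangles(fan):
    """Expand a triangle fan's vertex indices into independent triangles."""
    triangles = []
    for i in range(1, len(fan) - 1):
        # Every triangle of a fan shares the first vertex.
        triangles.extend((fan[0], fan[i], fan[i + 1]))
    return triangles

# One 5-vertex strip and one 4-vertex fan end up in a single index list.
index_list = strip_to_triangles([0, 1, 2, 3, 4]) + fan_to_triangles([5, 6, 7, 8])
print(index_list)
```

Any number of strips and fans can be appended to the same list this way, at
the cost of a longer index array.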

> So, first of all, am I on the right track trying to reduce the number of
> primitiveSets? Do you think on current hardware, disabling tristripping is a
> good idea?
>
> Why, when disabling tristripping which reduced the number of primitiveSets
> from 73953 to 4899, didn't I see an increase in performance?
>
That number is still very big. But there's also the large state graph
number.

>
> Is there some other way to find out what's going on and seeing what I can
> improve to increase the performance? I've tried running our app in
> gDEBugger, which tipped me off that I was batching poorly when using
> triangle strips (about 3 prims per primitiveSet as I said above). Turning
> off triangle strips improved the situation (as gDEBugger sees it), but not
> by that much, which is probably coherent with what I'm seeing in practice,
> but I'm no closer to finding out what to improve next. What is not mergeable
> now is like that because of different settings in StateSets (backface
> culling on vs off, can't use texture atlas because the wrap mode is set to
> REPEAT, etc.), so I don't think osgUtil::Optimizer can help me improve the
> situation further...
>
I haven't found good tools on Linux for attacking this sort of thing. On
Windows the NVidia PerfTools, or PerfDisplay, or whatever it's called,
should give you a clue. I usually proceed by dumping the scene graph to a
.osg, like you have done, and poring over it with a text editor :/.

>
> I have looked at video memory usage by the way, and I'm fine in that
> respect, so I don't think I'm getting any thrashing or paging between video
> RAM and main RAM at runtime. Also, I'm using display lists for most of the
> objects in the scene, I tried using Vertex Buffer Objects and it actually
> slowed it down.
>
> I should also mention that these results are obtained using
> osgShadow::LightSpacePerspectiveShadowMap. I can run the dumped .osg file
> with
>
>  osgshadow --lispsm --noUpdate --mapres 2048 <dumped_file>.osg
>
> and I get the results above, which are pretty similar to our simulator. If
> I run the same data file in plain osgViewer without shadows, it runs at a
> solid 60Hz, with stats and timings:
>
Hah, you saved the best for last :)

> Scene stats:
> StateSets     1345
> Groups         392
> Transforms     672
> Geodes         992
> Geometry       992
> Vertices    190392
> Primitives   51197
>
> Camera stats:
> State graphs        321
> Drawables           810
> PrimitiveSets      1774
> Triangles          7243
> Tri. Strips          85
> Tri. Fans          2508
> Quads             39370
> Quad Strips         178
> Total primitives  49384
>
> FPS: 60
> Cull: 1.7ms
> Draw: 8ms
> GPU: 6.8ms
>
> (that's the no tristrips version, so compare these stats to the second set
> of stats from the top, not the first)
>
> I would have expected most numbers there to be half what they were with
> shadows enabled, but as you can see they're consistently less than half, so
> shadows added more than a 100% overhead... Note that even if it added
> exactly 100% overhead, I would still be at 16ms draw, which is too much, but
> I'm just mentioning it in case it may prompt some other suggestions.
>
> I'm not sure I could send my whole scene to everyone on the list, but I
> might be able to send it to someone if they want to see firsthand. Just the
> bare .osg file without any textures and without ocean and skydome shows the
> problem adequately well.
>
> Thanks in advance for any suggestions you might have. I really need to
> improve this, and I've been working for a while already with only a small
> improvement to show for my time...
>
Some final thoughts:
You are traversing the scene twice, so reducing overhead in the traversal
is doubly important.
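
As a quick check on how much the shadow pass adds, this Python sketch divides
the camera stats you quoted with shadows (no tristrips) by the ones without.
It is only arithmetic on the numbers in this thread; the variable names are
made up.

```python
# Camera stats quoted in this thread, both from the no-tristrips version.
with_shadows = {
    "State graphs": 1254,
    "Drawables": 2117,
    "PrimitiveSets": 4899,
    "Total primitives": 131523,
}
without_shadows = {
    "State graphs": 321,
    "Drawables": 810,
    "PrimitiveSets": 1774,
    "Total primitives": 49384,
}

# Ratio of each counter with shadows to the same counter without.
ratios = {key: with_shadows[key] / without_shadows[key] for key in with_shadows}

for key, ratio in ratios.items():
    print(f"{key}: x{ratio:.2f} with shadows")
```

Every counter more than doubles, and the state graph count nearly
quadruples, which matches your observation that shadows cost more than 100%.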

Don't forget about the cull time. Draw can't start until cull is finished,
so if you can knock 1ms off the cull time you are making good headway.
Reducing the number of groups, geodes, drawables to the minimum should be
the goal there.

You need to attack the state graph number. This may be challenging. By way
of an example, look at the textures that are repeating, and therefore
defeating your atlas efforts, and think about expanding them.

Keep the figure of 1000 verts per drawable (and primitive set) in mind. It
might be hard to get there, but it's a worthy goal.
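
To get a feel for the distance to that goal, here is a back-of-the-envelope
Python sketch using the scene stats from the no-tristrips run. The variable
names are made up, and it assumes (unrealistically) that all vertices could
be packed evenly into drawables.

```python
import math

# Scene stats quoted in this thread (no-tristrips version).
scene_vertices = 190392
scene_geometries = 992

# The rule-of-thumb target mentioned above.
target_verts_per_drawable = 1000

# Current average number of vertices per Geometry.
current_average = scene_vertices / scene_geometries

# Number of drawables if every one held the target number of vertices.
drawables_at_target = math.ceil(scene_vertices / target_verts_per_drawable)

print(f"current: {current_average:.0f} verts per Geometry")
print(f"at target: about {drawables_at_target} drawables instead of {scene_geometries}")
```

Differing state will keep you from reaching that in practice, but it shows
the order of magnitude of merging that would be needed.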

Tim

> J-S
> --
> ______________________________________________________
> Jean-Sebastien Guay    [email protected]
>                               http://www.cm-labs.com/
>                        http://whitestar02.webhop.org/
> _______________________________________________
> osg-users mailing list
> [email protected]
> http://lists.openscenegraph.org/listinfo.cgi/osg-users-openscenegraph.org
>