Hi, all,
This is a general request to the community for some advice and
expertise. This is a bit lengthy, but if you can spare a few minutes to
look over this and send along your thoughts, we would really appreciate it.
I've been working on a project for NIST recently. For background, you
might find this osg-users thread from December 2010 useful:
http://thread.gmane.org/gmane.comp.graphics.openscenegraph.user/63954/focus=64014
Briefly, an OSG-based test application loads a scene and displays it in
a window on one or more screens (Single Viewer, multiple slave cameras,
one GPU and context per screen). The problem was that a single screen
would draw the scene at a given frame rate, but as additional screens
were added, the frame rate would drop significantly (427 fps on one
screen, 396 fps on two, 291 fps on four). Given that the scene is identical
and static, four contexts on four GPUs should be able to draw at, or very
nearly at, the same rate as one screen on one GPU.
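For concreteness, the viewer setup looks roughly like the following sketch. This is not the actual test application; the threading model and one-context-per-screen structure match what's described above, but the screen count, window sizes, and file name are placeholders:

#include <osg/Camera>
#include <osg/GraphicsContext>
#include <osgDB/ReadFile>
#include <osgViewer/Viewer>

int main()
{
    osgViewer::Viewer viewer;
    viewer.setThreadingModel(
        osgViewer::Viewer::CullThreadPerCameraDrawThreadPerContext);

    const unsigned int numScreens = 4;      // one GPU/context per screen
    for (unsigned int i = 0; i < numScreens; ++i)
    {
        osg::ref_ptr<osg::GraphicsContext::Traits> traits =
            new osg::GraphicsContext::Traits;
        traits->screenNum = i;              // contexts on :0.0 through :0.3
        traits->x = 0;  traits->y = 0;
        traits->width = 1280;  traits->height = 1024;  // placeholder size
        traits->windowDecoration = false;
        traits->doubleBuffer = true;

        osg::ref_ptr<osg::GraphicsContext> gc =
            osg::GraphicsContext::createGraphicsContext(traits.get());

        osg::ref_ptr<osg::Camera> camera = new osg::Camera;
        camera->setGraphicsContext(gc.get());
        camera->setViewport(0, 0, traits->width, traits->height);

        // Each slave camera inherits the master's view and projection.
        viewer.addSlave(camera.get());
    }

    viewer.setSceneData(osgDB::readNodeFile("testex.ive"));
    return viewer.run();
}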
Based on initial tests (including running a non-OSG, OpenGL-based
program), we first suspected that the problem had to do with thread
contention inside of OSG itself. To get to the bottom of this, I added
a per-thread logging mechanism to OpenThreads::PThread, which allowed
each thread to output to a unique file. Using this, I added some timing
code at various places in the rendering process. I started by timing
the various Operations at a high level, then dug into the individual
function calls, placing timing code at the important points
along the way. I eventually drilled all the way down to
PrimitiveSet::draw(), and at that level, it became obvious that the code
that was taking all of the time (and not scaling well to multiple
screens) was the OpenGL draw call itself. For example, on one screen a
given PrimitiveSet would take 0.05 ms, and on two screens, it would take
0.11 ms on the first screen and 0.12 ms on the second (roughly twice the
amount of time).
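The instrumentation amounted to a scoped timer around each call of interest, with each thread writing to its own log stream. Something along these lines (a sketch; the class name and the log plumbing are illustrative, not the actual code):

#include <osg/Timer>
#include <ostream>

// Measures the wall-clock time of the enclosing scope and writes it to
// the calling thread's private log stream on destruction.
class ScopedTimer
{
public:
    ScopedTimer(std::ostream& log, const char* label)
        : _log(log), _label(label),
          _start(osg::Timer::instance()->tick()) {}

    ~ScopedTimer()
    {
        osg::Timer_t end = osg::Timer::instance()->tick();

        // delta_m() converts the tick difference to milliseconds.
        _log << _label << ": "
             << osg::Timer::instance()->delta_m(_start, end) << " ms\n";
    }

private:
    std::ostream& _log;
    const char*   _label;
    osg::Timer_t  _start;
};

// Usage, e.g. inside PrimitiveSet::draw():
//     ScopedTimer t(perThreadLog, "PrimitiveSet::draw");
//     ... the actual glDrawElements()/glDrawArrays() call ...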
At this point it was beginning to look like it wasn't OSG's fault. To
be sure, I decided to write a pure multithreaded OpenGL program from
scratch. I tried to keep the rendering structure the same as OSG
(without the scene graph structure, or the update and cull traversals,
of course). I wrote enough of a .osg file loader to load the same data
with the same structure and produce the same OpenGL command stream per
frame as OSG does (verified with
gDEBugger). Once this was complete, we saw a similar lack of
scalability as additional screens were added (708 fps on one screen, 702
fps on two, 376 fps on four).
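The thread structure of that program is roughly the skeleton below. This is only a sketch: the real program creates one GLX context per screen, makes it current on its thread, and replays the recorded command stream, all of which is elided here:

#include <pthread.h>
#include <cstdio>

static const int NUM_SCREENS = 4;
static pthread_barrier_t frameBarrier;

struct ScreenInfo { int screenNum; };

void* drawThread(void* arg)
{
    ScreenInfo* info = static_cast<ScreenInfo*>(arg);

    // Real code: open display ":0.<screenNum>", create a GLX context,
    // and make it current on this thread before entering the loop.

    for (int frame = 0; frame < 60; ++frame)
    {
        pthread_barrier_wait(&frameBarrier);  // all screens start together

        // Real code: issue the same draw calls OSG produces for this
        // screen, then swap buffers.
        std::printf("screen %d frame %d\n", info->screenNum, frame);
    }
    return 0;
}

int main()
{
    pthread_barrier_init(&frameBarrier, 0, NUM_SCREENS);

    pthread_t threads[NUM_SCREENS];
    ScreenInfo info[NUM_SCREENS];
    for (int i = 0; i < NUM_SCREENS; ++i)
    {
        info[i].screenNum = i;
        pthread_create(&threads[i], 0, drawThread, &info[i]);
    }
    for (int i = 0; i < NUM_SCREENS; ++i)
        pthread_join(threads[i], 0);

    pthread_barrier_destroy(&frameBarrier);
    return 0;
}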
At this point, I started looking for something else to blame. Examining
the data set itself, I discovered that it was composed of about 5500
triangle strips, none of which were longer than 112 vertices (the data
set had about 600,000 vertices total). There were only about 10
different StateSets in the scene, so state changes weren't a problem.
After some digging, I found the MeshOptimizers portion of
osgUtil::Optimizer, and based on a message I found from Jean-Sébastien,
I tried a pass of VERTEX_PRETRANSFORM | INDEX_MESH |
VERTEX_POSTTRANSFORM, followed by another pass of MERGE_GEODES |
MERGE_GEOMETRY. This reduced the number of draw calls from around 5500
to 9, and completely eliminated the scalability problem for both the OSG
test program, and the pure OpenGL program. This leads me to believe
that bus contention was causing the lack of scalability. As more
screens were added, the thousands of draw calls required by the
unoptimized data set couldn't fit within the bus bandwidth, effectively
causing the draw calls to take longer. The optimized data, requiring
only 9 draw calls per screen, could easily fit.
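In code, the two passes look like this (a minimal sketch, assuming the scene is already loaded; the option flags are exactly the ones named above):

#include <osgDB/ReadFile>
#include <osgUtil/Optimizer>

osg::ref_ptr<osg::Node> root = osgDB::readNodeFile("testex.ive");

osgUtil::Optimizer optimizer;

// First pass: rebuild the ~5500 short triangle strips as indexed
// meshes, reordered for the vertex cache.
optimizer.optimize(root.get(),
                   osgUtil::Optimizer::VERTEX_PRETRANSFORM |
                   osgUtil::Optimizer::INDEX_MESH |
                   osgUtil::Optimizer::VERTEX_POSTTRANSFORM);

// Second pass: merge Geodes and Geometry so only a handful of draw
// calls remain (9 in our case).
optimizer.reset();
optimizer.optimize(root.get(),
                   osgUtil::Optimizer::MERGE_GEODES |
                   osgUtil::Optimizer::MERGE_GEOMETRY);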
To demonstrate this effectively, I created a variation of the original
OSG test program that would render the original test data for a
given time period, then run a configurable set of optimization passes,
and then render the optimized data for a given time period. It then
reported the pre-optimization and post-optimization frame rates. We
also tried amplifying the data by essentially instancing it 8 times and
64 times (not using instanced drawing, just drawing multiple copies of
the same data). We ran the new OSG test app, applying the same
optimizations as mentioned above, as well as running the same data (both
unoptimized and optimized) through the pure OpenGL program. I've
attached the timings for two complete runs.
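The amplified data sets were built by hanging copies of the same subgraph under transforms, along these lines (a sketch; the grid construction matches the testex_2x2x2 naming in the attached table, but the spacing and helper are assumed):

#include <osg/Group>
#include <osg/MatrixTransform>

// Builds a 2x2x2 grid of transforms over the same subgraph (8x the
// draw calls). Applying this twice yields the 64x data set
// (testex_2x2x2_2x2x2).
osg::ref_ptr<osg::Group> amplify(osg::Node* model, float spacing)
{
    osg::ref_ptr<osg::Group> group = new osg::Group;
    for (int x = 0; x < 2; ++x)
        for (int y = 0; y < 2; ++y)
            for (int z = 0; z < 2; ++z)
            {
                osg::ref_ptr<osg::MatrixTransform> xform =
                    new osg::MatrixTransform(
                        osg::Matrix::translate(x * spacing,
                                               y * spacing,
                                               z * spacing));
                xform->addChild(model);  // same subgraph, not a copy
                group->addChild(xform.get());
            }
    return group;
}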
The 1x and 8x data sets appear to scale well in all cases, except for
one anomalous case where the fps drops from 76.77 fps on three screens
to 53.26 on four (the subsequent run only drops to 70.89, so this may
not be a real problem). The 64x data sets are more interesting. The
OpenGL program clearly scales well (almost perfectly, in fact) on the
optimized data. The OSG program doesn't scale as well, dropping to 8
fps on four screens vs. 11 fps on one.
So, here are our questions. Does it make sense that bus contention
would be causing the lack of scalability? Are the mesh optimizations
mentioned above the most effective way to solve the problem? Are there
any cases where the mesh optimizations wouldn't be sufficient, and
additional steps would need to be taken (I briefly mentioned state
changes above, which could be problematic; is there anything else)? Why doesn't
the 64x data set seem to scale as well as the 1x and 8x data sets (does
this indicate that the bottleneck has moved from the bus to somewhere else)?
Any thoughts on these issues or other thoughts you could provide would
be very valuable.
Thanks, all!
--"J"
Jason Daly
University of Central Florida
Institute for Simulation and Training
====================================================================================================================
January 23, 2012
Running on Tylium: QuadroPlex D2 (four FX5800)
export __GL_SYNC_TO_VBLANK=0
export OSG_SERIALIZE_DRAW_DISPATCH=OFF
export OSG_THREADING=CullThreadPerCameraDrawThreadPerContext
use ./run_1.6 ID Command
From log files, using collective times
(all values are frames per second)

                                                                            1/23/2012          1/24/2012
ID     Command line                                                     pre-opt  post-opt  pre-opt  post-opt
-----  ---------------------------------------------------------------  -------  --------  -------  --------
OSGa0  osgMultiCardOpt_v1.4 -e 60 -o 60 0 testex.ive                     427.64    522.63   491.82    523.42
OSGa1  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 testex.ive                   396.95    527.63   362.84    528.38
OSGa2  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2 testex.ive                 349.54    527.51   358.56    523.79
OSGa3  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2 3 testex.ive               291.72    520.12   306.78    518.70
GLUa0  multithreadTest_v1.6 -t 60 0 testex.osg                           708.58             707.98
GLUa1  multithreadTest_v1.6 -t 60 0 1 testex.osg                         702.55             702.43
GLUa2  multithreadTest_v1.6 -t 60 0 1 2 testex.osg                       602.16             596.58
GLUa3  multithreadTest_v1.6 -t 60 0 1 2 3 testex.osg                     376.69             391.55
GLOa0  multithreadTest_v1.6 -t 60 0 testex.opt.osg                                 708.75             708.62
GLOa1  multithreadTest_v1.6 -t 60 0 1 testex.opt.osg                               702.96             703.22
GLOa2  multithreadTest_v1.6 -t 60 0 1 2 testex.opt.osg                             699.22             699.12
GLOa3  multithreadTest_v1.6 -t 60 0 1 2 3 testex.opt.osg                           696.27             696.23
OSGb0  osgMultiCardOpt_v1.4 -e 60 -o 60 0 testex_2x2x2.ive                59.98     76.44    59.98     76.70
OSGb1  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 testex_2x2x2.ive              53.30     77.41    52.79     77.78
OSGb2  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2 testex_2x2x2.ive            45.00     76.77    44.02     77.29
OSGb3  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2 3 testex_2x2x2.ive          42.10     53.26    42.08     70.89
GLUb0  multithreadTest_v1.6 -t 60 0 testex_2x2x2.osg                     107.89             107.88
GLUb1  multithreadTest_v1.6 -t 60 0 1 testex_2x2x2.osg                   107.75             107.74
GLUb2  multithreadTest_v1.6 -t 60 0 1 2 testex_2x2x2.osg                  85.80              87.65
GLUb3  multithreadTest_v1.6 -t 60 0 1 2 3 testex_2x2x2.osg                44.54              45.00
GLOb0  multithreadTest_v1.6 -t 60 0 testex_2x2x2.opt.osg                           107.71             107.63
GLOb1  multithreadTest_v1.6 -t 60 0 1 testex_2x2x2.opt.osg                         107.52             107.51
GLOb2  multithreadTest_v1.6 -t 60 0 1 2 testex_2x2x2.opt.osg                       107.46             107.48
GLOb3  multithreadTest_v1.6 -t 60 0 1 2 3 testex_2x2x2.opt.osg                     107.42             107.42
OSGc0  osgMultiCardOpt_v1.4 -e 60 -o 60 0 testex_2x2x2_2x2x2.ive           7.14     10.87     6.51     10.82
OSGc1  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 testex_2x2x2_2x2x2.ive         6.32      9.53     5.99      9.56
OSGc2  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2 testex_2x2x2_2x2x2.ive       5.32      8.78     5.69      8.93
OSGc3  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2 3 testex_2x2x2_2x2x2.ive     5.64      7.31     5.70      8.05
GLUc0  multithreadTest_v1.6 -t 60 0 testex_2x2x2_2x2x2.osg                 7.99               7.93
GLUc1  multithreadTest_v1.6 -t 60 0 1 testex_2x2x2_2x2x2.osg               1.04               0.96
GLUc2  multithreadTest_v1.6 -t 60 0 1 2 testex_2x2x2_2x2x2.osg             0.59               0.99
GLUc3  multithreadTest_v1.6 -t 60 0 1 2 3 testex_2x2x2_2x2x2.osg           0.94               0.48
GLOc0  multithreadTest_v1.6 -t 60 0 testex_2x2x2_2x2x2.opt.osg                      14.73              14.69
GLOc1  multithreadTest_v1.6 -t 60 0 1 testex_2x2x2_2x2x2.opt.osg                    14.74              14.69
GLOc2  multithreadTest_v1.6 -t 60 0 1 2 testex_2x2x2_2x2x2.opt.osg                  14.74              14.69
GLOc3  multithreadTest_v1.6 -t 60 0 1 2 3 testex_2x2x2_2x2x2.opt.osg                14.74              14.69