Hi, all,

This is a general request to the community for some advice and expertise. This is a bit lengthy, but if you can spare a few minutes to look over this and send along your thoughts, we would really appreciate it.

I've been working on a project for NIST recently. For background, you might find this osg-users thread from December 2010 useful:

http://thread.gmane.org/gmane.comp.graphics.openscenegraph.user/63954/focus=64014


Briefly, an OSG-based test application loads a scene and displays it in a window on one or more screens (single Viewer, multiple slave cameras, one GPU and context per screen). The problem was that a single screen would draw the scene at a given frame rate, but as additional screens were added, the frame rate would drop significantly (427 fps on one screen, 396 fps on two, 291 fps on four). Given that the scene is identical and static, four contexts on four GPUs should be able to draw at, or at least very nearly at, the same rate as one screen on one GPU.
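For reference, the camera setup in the test application follows the usual single-viewer, one-slave-per-screen pattern, roughly like the minimal sketch below (this is not the actual NIST code; the screen count, window size, and lack of per-screen projection offsets are placeholders):

    // Minimal sketch: one slave camera, context, and screen per GPU.
    #include <osgViewer/Viewer>
    #include <osgDB/ReadFile>

    int main()
    {
        osgViewer::Viewer viewer;
        osg::ref_ptr<osg::Node> scene = osgDB::readNodeFile("testex.ive");

        const unsigned int numScreens = 4;   // placeholder: one GPU/context per screen
        for (unsigned int i = 0; i < numScreens; ++i)
        {
            osg::ref_ptr<osg::GraphicsContext::Traits> traits =
                new osg::GraphicsContext::Traits;
            traits->screenNum = i;
            traits->x = 0; traits->y = 0;
            traits->width = 1280; traits->height = 1024;   // placeholder size
            traits->windowDecoration = false;
            traits->doubleBuffer = true;

            osg::ref_ptr<osg::GraphicsContext> gc =
                osg::GraphicsContext::createGraphicsContext(traits.get());

            osg::ref_ptr<osg::Camera> camera = new osg::Camera;
            camera->setGraphicsContext(gc.get());
            camera->setViewport(new osg::Viewport(0, 0, traits->width, traits->height));

            // Each slave draws the same static scene (no projection offset here).
            viewer.addSlave(camera.get());
        }

        viewer.setSceneData(scene.get());
        viewer.setThreadingModel(
            osgViewer::Viewer::CullThreadPerCameraDrawThreadPerContext);
        return viewer.run();
    }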

Based on initial tests (including running a non-OSG, OpenGL-based program), we first suspected that the problem had to do with thread contention inside OSG itself. To get to the bottom of this, I added a per-thread logging mechanism to OpenThreads::PThread, which allowed each thread to output to a unique file. Using this, I added timing code at various places in the rendering process. I started by timing the various Operations at a high level, then dug into the individual function calls, placing timing code at the important points along the way. I eventually drilled all the way down to PrimitiveSet::draw(), and at that level it became obvious that the code taking all of the time (and not scaling well to multiple screens) was the OpenGL draw call itself. For example, on one screen a given PrimitiveSet would take 0.05 ms, while on two screens it would take 0.11 ms on the first screen and 0.12 ms on the second (roughly twice as long).
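To give an idea of the shape of that instrumentation, the timing looked roughly like the sketch below (the real code sits inside the OSG draw traversal and writes to a per-thread log file; the function name here is just illustrative):

    // Sketch of the per-draw-call timing used to narrow the problem down.
    // In the real test this surrounded the GL dispatch inside PrimitiveSet::draw().
    #include <osg/Timer>
    #include <fstream>

    void timedDraw(std::ofstream& threadLog)   // one log file per draw thread
    {
        osg::Timer_t start = osg::Timer::instance()->tick();

        // ... the actual OpenGL draw call goes here ...

        osg::Timer_t end = osg::Timer::instance()->tick();
        threadLog << "draw took "
                  << osg::Timer::instance()->delta_m(start, end)
                  << " ms" << std::endl;
    }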

At this point it was beginning to look like it wasn't OSG's fault. To be sure, I decided to write a pure multithreaded OpenGL program from scratch. I tried to keep the rendering structure the same as OSG's (without the scene graph, or the update and cull traversals, of course). I wrote enough of a .osg file loader that I could load the same data with the same structure and produce the same OpenGL command stream when drawing a frame as OSG does (verified with gDEBugger). Once this was complete, we saw a similar lack of scalability as additional screens were added (708 fps on one screen, 702 fps on two, 376 fps on four).
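The standalone program has roughly the structure sketched below: one draw thread per screen, each bound to its own context, released together each frame. The sketch borrows OpenThreads purely for brevity, and makeContextCurrent(), drawSceneWithGL(), and swapBuffers() are placeholders for the actual GLX setup and the GL stream that mirrors OSG's:

    // Rough shape of the standalone multithreaded GL test (placeholder helpers).
    #include <OpenThreads/Thread>
    #include <OpenThreads/Barrier>

    class ScreenDrawThread : public OpenThreads::Thread
    {
    public:
        ScreenDrawThread(int screenNum, OpenThreads::Barrier* frameBarrier)
            : _screenNum(screenNum), _frameBarrier(frameBarrier) {}

        virtual void run()
        {
            // Placeholder: bind this thread to its own context on screen _screenNum.
            makeContextCurrent(_screenNum);

            while (!testCancel())
            {
                _frameBarrier->block();   // wait for the start of the frame
                drawSceneWithGL();        // issue the same GL stream OSG would
                swapBuffers(_screenNum);
            }
        }

    private:
        int _screenNum;
        OpenThreads::Barrier* _frameBarrier;
    };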

At this point, I started looking for something else to blame. Examining the data set itself, I discovered that it was composed of about 5500 triangle strips, none of which were longer than 112 vertices (the data set had about 600,000 vertices total). There were only about 10 different StateSets in the scene, so state changes aren't a problem. After some digging, I found the MeshOptimizers portion of osgUtil::Optimizer, and based on a message I found from Jean-Sébastien, I tried a pass of VERTEX_PRETRANSFORM | INDEX_MESH | VERTEX_POSTTRANSFORM, followed by another pass of MERGE_GEODES | MERGE_GEOMETRY. This reduced the number of draw calls from around 5500 to 9, and completely eliminated the scalability problem for both the OSG test program and the pure OpenGL program. This leads me to believe that bus contention was causing the lack of scalability: as more screens were added, the thousands of draw calls required by the unoptimized data set couldn't fit within the available bus bandwidth, effectively causing the draw calls to take longer. The optimized data, requiring only 9 draw calls per screen, could easily fit.
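Concretely, the two optimizer passes look roughly like this (a minimal sketch using the option combinations mentioned above; file names and error handling are placeholders):

    // Two osgUtil::Optimizer passes: re-index/re-order the meshes, then merge
    // geodes and geometry to cut the draw call count.
    #include <osgUtil/Optimizer>
    #include <osgDB/ReadFile>
    #include <osgDB/WriteFile>

    int main()
    {
        osg::ref_ptr<osg::Node> scene = osgDB::readNodeFile("testex.ive");

        osgUtil::Optimizer optimizer;
        optimizer.optimize(scene.get(),
                           osgUtil::Optimizer::VERTEX_PRETRANSFORM |
                           osgUtil::Optimizer::INDEX_MESH |
                           osgUtil::Optimizer::VERTEX_POSTTRANSFORM);

        optimizer.reset();
        optimizer.optimize(scene.get(),
                           osgUtil::Optimizer::MERGE_GEODES |
                           osgUtil::Optimizer::MERGE_GEOMETRY);

        osgDB::writeNodeFile(*scene, "testex.opt.osg");
        return 0;
    }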

To demonstrate this effectively, I created a variation of the original OSG test program that would render the original test data for a given time period, then run a configurable set of optimization passes, and then render the optimized data for a given time period. It then reported the pre-optimization and post-optimization frame rates. We also tried amplifying the data by essentially instancing it 8 times and 64 times (not using instanced drawing, just drawing multiple copies of the same data). We ran the new OSG test app, applying the same optimizations mentioned above, and also ran the same data (both unoptimized and optimized) through the pure OpenGL program. I've attached the timings for two complete runs.
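The pre-/post-optimization numbers in the attached tables come from a loop of roughly this shape (a sketch only; the -e and -o options in the command lines set the two durations, and runForSeconds() is an illustrative name):

    // Sketch of the timed render runs used to compute the reported FPS values.
    #include <osgViewer/Viewer>
    #include <osg/Timer>

    double runForSeconds(osgViewer::Viewer& viewer, double seconds)
    {
        unsigned int frames = 0;
        osg::Timer_t start = osg::Timer::instance()->tick();
        while (osg::Timer::instance()->delta_s(start,
                   osg::Timer::instance()->tick()) < seconds)
        {
            viewer.frame();
            ++frames;
        }
        double elapsed = osg::Timer::instance()->delta_s(start,
                             osg::Timer::instance()->tick());
        return frames / elapsed;   // average frames per second over the run
    }

    // double preFps  = runForSeconds(viewer, 60.0);   // unoptimized data
    // ... apply the optimizer passes shown above ...
    // double postFps = runForSeconds(viewer, 60.0);   // optimized data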

The 1x and 8x data sets appear to scale well in all cases, except for one anomalous case where the frame rate drops from 76.77 fps on three screens to 53.26 fps on four (the subsequent run only drops to 70.89, so this may not be a real problem). The 64x data sets are more interesting. The OpenGL program clearly scales well (almost perfectly, in fact) on the optimized data. The OSG program doesn't scale as well, dropping to 8 fps on four screens vs. 11 fps on one.

So, here are our questions. Does it make sense that bus contention would be causing the lack of scalability? Are the mesh optimizations mentioned above the most effective way to solve the problem? Are there any cases where the mesh optimizations wouldn't be sufficient and additional steps would need to be taken (I briefly mentioned state changes above, which could be problematic; is there anything else)? Why doesn't the 64x data set seem to scale as well as the 1x and 8x data sets (does this indicate that the bottleneck has moved from the bus to somewhere else)?

Any thoughts on these issues, or any other insight you could provide, would be very valuable.

Thanks, all!

--"J"

Jason Daly
University of Central Florida
Institute for Simulation and Training

====================================================================================================================

January 23, 2012
Running on Tylium: QuadroPlex D2 (four FX5800)


export __GL_SYNC_TO_VBLANK=0
export OSG_SERIALIZE_DRAW_DISPATCH=OFF
export OSG_THREADING=CullThreadPerCameraDrawThreadPerContext

use ./run_1.6 ID Command

From log files, use collective times. All values are frames per second.

                                                                          1/23/2012            1/24/2012
ID    Command line:                                                     pre-opt  post-opt    pre-opt  post-opt
----- --------------------------------------------------------------   -------  --------    -------  --------
OSGa0 osgMultiCardOpt_v1.4 -e 60 -o 60 0       testex.ive                427.64    522.63     491.82    523.42
OSGa1 osgMultiCardOpt_v1.4 -e 60 -o 60 0 1     testex.ive                396.95    527.63     362.84    528.38
OSGa2 osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2   testex.ive                349.54    527.51     358.56    523.79
OSGa3 osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2 3 testex.ive                291.72    520.12     306.78    518.70

GLUa0 multithreadTest_v1.6 -t 60 0       testex.osg                      708.58               707.98
GLUa1 multithreadTest_v1.6 -t 60 0 1     testex.osg                      702.55               702.43
GLUa2 multithreadTest_v1.6 -t 60 0 1 2   testex.osg                      602.16               596.58
GLUa3 multithreadTest_v1.6 -t 60 0 1 2 3 testex.osg                      376.69               391.55

GLOa0 multithreadTest_v1.6 -t 60 0       testex.opt.osg                            708.75               708.62
GLOa1 multithreadTest_v1.6 -t 60 0 1     testex.opt.osg                            702.96               703.22
GLOa2 multithreadTest_v1.6 -t 60 0 1 2   testex.opt.osg                            699.22               699.12
GLOa3 multithreadTest_v1.6 -t 60 0 1 2 3 testex.opt.osg                            696.27               696.23


OSGb0 osgMultiCardOpt_v1.4 -e 60 -o 60 0       testex_2x2x2.ive           59.98     76.44      59.98     76.70
OSGb1 osgMultiCardOpt_v1.4 -e 60 -o 60 0 1     testex_2x2x2.ive           53.30     77.41      52.79     77.78
OSGb2 osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2   testex_2x2x2.ive           45.00     76.77      44.02     77.29
OSGb3 osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2 3 testex_2x2x2.ive           42.10     53.26      42.08     70.89

GLUb0 multithreadTest_v1.6 -t 60 0       testex_2x2x2.osg                107.89               107.88
GLUb1 multithreadTest_v1.6 -t 60 0 1     testex_2x2x2.osg                107.75               107.74
GLUb2 multithreadTest_v1.6 -t 60 0 1 2   testex_2x2x2.osg                 85.80                87.65
GLUb3 multithreadTest_v1.6 -t 60 0 1 2 3 testex_2x2x2.osg                 44.54                45.00

GLOb0 multithreadTest_v1.6 -t 60 0       testex_2x2x2.opt.osg                      107.71               107.63
GLOb1 multithreadTest_v1.6 -t 60 0 1     testex_2x2x2.opt.osg                      107.52               107.51
GLOb2 multithreadTest_v1.6 -t 60 0 1 2   testex_2x2x2.opt.osg                      107.46               107.48
GLOb3 multithreadTest_v1.6 -t 60 0 1 2 3 testex_2x2x2.opt.osg                      107.42               107.42


OSGc0 osgMultiCardOpt_v1.4 -e 60 -o 60 0       testex_2x2x2_2x2x2.ive      7.14     10.87       6.51     10.82
OSGc1 osgMultiCardOpt_v1.4 -e 60 -o 60 0 1     testex_2x2x2_2x2x2.ive      6.32      9.53       5.99      9.56
OSGc2 osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2   testex_2x2x2_2x2x2.ive      5.32      8.78       5.69      8.93
OSGc3 osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2 3 testex_2x2x2_2x2x2.ive      5.64      7.31       5.70      8.05

GLUc0 multithreadTest_v1.6 -t 60 0       testex_2x2x2_2x2x2.osg            7.99                 7.93
GLUc1 multithreadTest_v1.6 -t 60 0 1     testex_2x2x2_2x2x2.osg            1.04                 0.96
GLUc2 multithreadTest_v1.6 -t 60 0 1 2   testex_2x2x2_2x2x2.osg            0.59                 0.99
GLUc3 multithreadTest_v1.6 -t 60 0 1 2 3 testex_2x2x2_2x2x2.osg            0.94                 0.48

GLOc0 multithreadTest_v1.6 -t 60 0       testex_2x2x2_2x2x2.opt.osg                 14.73                14.69
GLOc1 multithreadTest_v1.6 -t 60 0 1     testex_2x2x2_2x2x2.opt.osg                 14.74                14.69
GLOc2 multithreadTest_v1.6 -t 60 0 1 2   testex_2x2x2_2x2x2.opt.osg                 14.74                14.69
GLOc3 multithreadTest_v1.6 -t 60 0 1 2 3 testex_2x2x2_2x2x2.opt.osg                 14.74                14.69
