Hi, all,
This is a general request to the community for some advice and
expertise. This is a bit lengthy, but if you can spare a few minutes to
look over this and send along your thoughts, we would really appreciate it.
I've been working on a project for NIST recently. For background, you
might find this osg-users thread from December 2010 useful:
http://thread.gmane.org/gmane.comp.graphics.openscenegraph.user/63954/focus=64014
Briefly, an OSG-based test application loads a scene and displays it in
a window on one or more screens (Single Viewer, multiple slave cameras,
one GPU and context per screen). The problem was that a single screen
would draw the scene at a given frame rate, but as additional screens
were added, the frame rate would drop significantly (427 fps on one
screen, 396 fps on two, 291 fps on four). Given that the scene is identical
and static, four contexts on four GPUs should be able to draw at, or very
nearly at, the same rate as one screen on one GPU.
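For concreteness, the viewer setup looks roughly like the following sketch. This is not the actual test application; the threading model and one-context-per-screen structure match what's described above, but the screen count, window sizes, and file name are placeholders:

#include <osg/Camera>
#include <osg/GraphicsContext>
#include <osgDB/ReadFile>
#include <osgViewer/Viewer>

int main()
{
    osgViewer::Viewer viewer;
    viewer.setThreadingModel(
        osgViewer::Viewer::CullThreadPerCameraDrawThreadPerContext);

    const unsigned int numScreens = 4;      // one GPU/context per screen
    for (unsigned int i = 0; i < numScreens; ++i)
    {
        osg::ref_ptr<osg::GraphicsContext::Traits> traits =
            new osg::GraphicsContext::Traits;
        traits->screenNum = i;              // contexts on :0.0 through :0.3
        traits->x = 0;  traits->y = 0;
        traits->width = 1280;  traits->height = 1024;  // placeholder size
        traits->windowDecoration = false;
        traits->doubleBuffer = true;

        osg::ref_ptr<osg::GraphicsContext> gc =
            osg::GraphicsContext::createGraphicsContext(traits.get());

        osg::ref_ptr<osg::Camera> camera = new osg::Camera;
        camera->setGraphicsContext(gc.get());
        camera->setViewport(0, 0, traits->width, traits->height);

        // Each slave camera inherits the master's view and projection.
        viewer.addSlave(camera.get());
    }

    viewer.setSceneData(osgDB::readNodeFile("testex.ive"));
    return viewer.run();
}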
Based on initial tests (including running a non-OSG, OpenGL-based
program), we first suspected that the problem had to do with thread
contention inside of OSG itself. To get to the bottom of this, I added
a per-thread logging mechanism to OpenThreads::PThread, which allowed
each thread to output to a unique file. Using this, I added some timing
code at various places in the rendering process. I started by timing
the various Operations at a high level, then dug into the individual
function calls, placing timing code at the important points
along the way. I eventually drilled all the way down to
PrimitiveSet::draw(), and at that level, it became obvious that the code
that was taking all of the time (and not scaling well to multiple
screens) was the OpenGL draw call itself. For example, on one screen a
given PrimitiveSet would take 0.05 ms, and on two screens, it would take
0.11 ms on the first screen and 0.12 ms on the second (roughly twice the
amount of time).
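The instrumentation amounted to a scoped timer around each call of interest, with each thread writing to its own log stream. Something along these lines (a sketch; the class name and the log plumbing are illustrative, not the actual code):

#include <osg/Timer>
#include <ostream>

// Measures the wall-clock time of the enclosing scope and writes it to
// the calling thread's private log stream on destruction.
class ScopedTimer
{
public:
    ScopedTimer(std::ostream& log, const char* label)
        : _log(log), _label(label),
          _start(osg::Timer::instance()->tick()) {}

    ~ScopedTimer()
    {
        osg::Timer_t end = osg::Timer::instance()->tick();

        // delta_m() converts the tick difference to milliseconds.
        _log << _label << ": "
             << osg::Timer::instance()->delta_m(_start, end) << " ms\n";
    }

private:
    std::ostream& _log;
    const char*   _label;
    osg::Timer_t  _start;
};

// Usage, e.g. inside PrimitiveSet::draw():
//     ScopedTimer t(perThreadLog, "PrimitiveSet::draw");
//     ... the actual glDrawElements()/glDrawArrays() call ...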
At this point it was beginning to look like it wasn't OSG's fault. To
be sure, I decided to write a pure multithreaded OpenGL program from
scratch. I tried to keep the rendering structure the same as OSG
(without the scene graph structure, or the update and cull traversals,
of course). I wrote enough of a .osg file loader to load the same data
with the same structure and produce the same OpenGL command stream per
frame as OSG does (verified with
gDEBugger). Once this was complete, we saw a similar lack of
scalability as additional screens were added (708 fps on one screen, 702
fps on two, 376 fps on four).
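The thread structure of that program is roughly the skeleton below. This is only a sketch: the real program creates one GLX context per screen, makes it current on its thread, and replays the recorded command stream, all of which is elided here:

#include <pthread.h>
#include <cstdio>

static const int NUM_SCREENS = 4;
static pthread_barrier_t frameBarrier;

struct ScreenInfo { int screenNum; };

void* drawThread(void* arg)
{
    ScreenInfo* info = static_cast<ScreenInfo*>(arg);

    // Real code: open display ":0.<screenNum>", create a GLX context,
    // and make it current on this thread before entering the loop.

    for (int frame = 0; frame < 60; ++frame)
    {
        pthread_barrier_wait(&frameBarrier);  // all screens start together

        // Real code: issue the same draw calls OSG produces for this
        // screen, then swap buffers.
        std::printf("screen %d frame %d\n", info->screenNum, frame);
    }
    return 0;
}

int main()
{
    pthread_barrier_init(&frameBarrier, 0, NUM_SCREENS);

    pthread_t threads[NUM_SCREENS];
    ScreenInfo info[NUM_SCREENS];
    for (int i = 0; i < NUM_SCREENS; ++i)
    {
        info[i].screenNum = i;
        pthread_create(&threads[i], 0, drawThread, &info[i]);
    }
    for (int i = 0; i < NUM_SCREENS; ++i)
        pthread_join(threads[i], 0);

    pthread_barrier_destroy(&frameBarrier);
    return 0;
}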
At this point, I started looking for something else to blame. Examining
the data set itself, I discovered that it was composed of about 5500
triangle strips, none of which were longer than 112 vertices (the data
set had about 600,000 vertices total). There were only about 10
different StateSets in the scene, so state changes weren't a problem.
After some digging, I found the MeshOptimizers portion of
osgUtil::Optimizer, and based on a message I found from Jean-Sébastien,
I tried a pass of VERTEX_PRETRANSFORM | INDEX_MESH |
VERTEX_POSTTRANSFORM, followed by another pass of MERGE_GEODES |
MERGE_GEOMETRY. This reduced the number of draw calls from around 5500
to 9, and completely eliminated the scalability problem for both the OSG
test program, and the pure OpenGL program. This leads me to believe
that bus contention was causing the lack of scalability. As more
screens were added, the thousands of draw calls required by the
unoptimized data set couldn't fit within the bus bandwidth, effectively
causing the draw calls to take longer. The optimized data, requiring
only 9 draw calls per screen, could easily fit.
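In code, the two passes look like this (a minimal sketch, assuming the scene is already loaded; the option flags are exactly the ones named above):

#include <osgDB/ReadFile>
#include <osgUtil/Optimizer>

osg::ref_ptr<osg::Node> root = osgDB::readNodeFile("testex.ive");

osgUtil::Optimizer optimizer;

// First pass: rebuild the ~5500 short triangle strips as indexed
// meshes, reordered for the vertex cache.
optimizer.optimize(root.get(),
                   osgUtil::Optimizer::VERTEX_PRETRANSFORM |
                   osgUtil::Optimizer::INDEX_MESH |
                   osgUtil::Optimizer::VERTEX_POSTTRANSFORM);

// Second pass: merge Geodes and Geometry so only a handful of draw
// calls remain (9 in our case).
optimizer.reset();
optimizer.optimize(root.get(),
                   osgUtil::Optimizer::MERGE_GEODES |
                   osgUtil::Optimizer::MERGE_GEOMETRY);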
To demonstrate this effectively, I created a variation of the original
OSG test program that would render the original test data for a
given time period, then run a configurable set of optimization passes,
and then render the optimized data for a given time period. It then
reported the pre-optimization and post-optimization frame rates. We
also tried amplifying the data by essentially instancing it 8 times and
64 times (not using instanced drawing, just drawing multiple copies of
the same data). We ran the new OSG test app, applying the same
optimizations as mentioned above, as well as running the same data (both
unoptimized and optimized) through the pure OpenGL program. I've
attached the timings for two complete runs.
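The amplified data sets were built by hanging copies of the same subgraph under transforms, along these lines (a sketch; the grid construction matches the testex_2x2x2 naming in the attached table, but the spacing and helper are assumed):

#include <osg/Group>
#include <osg/MatrixTransform>

// Builds a 2x2x2 grid of transforms over the same subgraph (8x the
// draw calls). Applying this twice yields the 64x data set
// (testex_2x2x2_2x2x2).
osg::ref_ptr<osg::Group> amplify(osg::Node* model, float spacing)
{
    osg::ref_ptr<osg::Group> group = new osg::Group;
    for (int x = 0; x < 2; ++x)
        for (int y = 0; y < 2; ++y)
            for (int z = 0; z < 2; ++z)
            {
                osg::ref_ptr<osg::MatrixTransform> xform =
                    new osg::MatrixTransform(
                        osg::Matrix::translate(x * spacing,
                                               y * spacing,
                                               z * spacing));
                xform->addChild(model);  // same subgraph, not a copy
                group->addChild(xform.get());
            }
    return group;
}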
The 1x and 8x data sets appear to scale well in all cases, except for
one anomalous case where the fps drops from 76.77 fps on three screens
to 53.26 on four (the subsequent run only drops to 70.89, so this may
not be a real problem). The 64x data sets are more interesting. The
OpenGL program clearly scales well (almost perfectly, in fact) on the
optimized data. The OSG program doesn't scale as well, dropping to 8
fps on four screens vs. 11 fps on one.
So, here are our questions. Does it make sense that bus contention
would be causing the lack of scalability? Are the mesh optimizations
mentioned above the most effective way to solve the problem? Are there
any cases where the mesh optimizations wouldn't be sufficient, and
additional steps would need to be taken (I briefly mentioned state
changes above, which could be problematic; is there anything else)? Why doesn't
the 64x data set seem to scale as well as the 1x and 8x data sets (does
this indicate that the bottleneck has moved from the bus to somewhere else)?
Any thoughts on these issues or other thoughts you could provide would
be very valuable.
Thanks, all!
--"J"
Jason Daly
University of Central Florida
Institute for Simulation and Training
====================================================================================================================
January 23, 2012
Running on Tylium: QuadroPlex D2 (four FX5800)
export __GL_SYNC_TO_VBLANK=0
export OSG_SERIALIZE_DRAW_DISPATCH=OFF
export OSG_THREADING=CullThreadPerCameraDrawThreadPerContext
use ./run_1.6 ID Command
From log files, using collective times
(all values are frames per second)

                                                                            1/23/2012          1/24/2012
ID     Command line                                                     pre-opt  post-opt  pre-opt  post-opt
-----  ---------------------------------------------------------------  -------  --------  -------  --------
OSGa0  osgMultiCardOpt_v1.4 -e 60 -o 60 0 testex.ive                     427.64    522.63   491.82    523.42
OSGa1  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 testex.ive                   396.95    527.63   362.84    528.38
OSGa2  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2 testex.ive                 349.54    527.51   358.56    523.79
OSGa3  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2 3 testex.ive               291.72    520.12   306.78    518.70
GLUa0  multithreadTest_v1.6 -t 60 0 testex.osg                           708.58             707.98
GLUa1  multithreadTest_v1.6 -t 60 0 1 testex.osg                         702.55             702.43
GLUa2  multithreadTest_v1.6 -t 60 0 1 2 testex.osg                       602.16             596.58
GLUa3  multithreadTest_v1.6 -t 60 0 1 2 3 testex.osg                     376.69             391.55
GLOa0  multithreadTest_v1.6 -t 60 0 testex.opt.osg                                 708.75             708.62
GLOa1  multithreadTest_v1.6 -t 60 0 1 testex.opt.osg                               702.96             703.22
GLOa2  multithreadTest_v1.6 -t 60 0 1 2 testex.opt.osg                             699.22             699.12
GLOa3  multithreadTest_v1.6 -t 60 0 1 2 3 testex.opt.osg                           696.27             696.23
OSGb0  osgMultiCardOpt_v1.4 -e 60 -o 60 0 testex_2x2x2.ive                59.98     76.44    59.98     76.70
OSGb1  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 testex_2x2x2.ive              53.30     77.41    52.79     77.78
OSGb2  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2 testex_2x2x2.ive            45.00     76.77    44.02     77.29
OSGb3  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2 3 testex_2x2x2.ive          42.10     53.26    42.08     70.89
GLUb0  multithreadTest_v1.6 -t 60 0 testex_2x2x2.osg                     107.89             107.88
GLUb1  multithreadTest_v1.6 -t 60 0 1 testex_2x2x2.osg                   107.75             107.74
GLUb2  multithreadTest_v1.6 -t 60 0 1 2 testex_2x2x2.osg                  85.80              87.65
GLUb3  multithreadTest_v1.6 -t 60 0 1 2 3 testex_2x2x2.osg                44.54              45.00
GLOb0  multithreadTest_v1.6 -t 60 0 testex_2x2x2.opt.osg                           107.71             107.63
GLOb1  multithreadTest_v1.6 -t 60 0 1 testex_2x2x2.opt.osg                         107.52             107.51
GLOb2  multithreadTest_v1.6 -t 60 0 1 2 testex_2x2x2.opt.osg                       107.46             107.48
GLOb3  multithreadTest_v1.6 -t 60 0 1 2 3 testex_2x2x2.opt.osg                     107.42             107.42
OSGc0  osgMultiCardOpt_v1.4 -e 60 -o 60 0 testex_2x2x2_2x2x2.ive           7.14     10.87     6.51     10.82
OSGc1  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 testex_2x2x2_2x2x2.ive         6.32      9.53     5.99      9.56
OSGc2  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2 testex_2x2x2_2x2x2.ive       5.32      8.78     5.69      8.93
OSGc3  osgMultiCardOpt_v1.4 -e 60 -o 60 0 1 2 3 testex_2x2x2_2x2x2.ive     5.64      7.31     5.70      8.05
GLUc0  multithreadTest_v1.6 -t 60 0 testex_2x2x2_2x2x2.osg                 7.99               7.93
GLUc1  multithreadTest_v1.6 -t 60 0 1 testex_2x2x2_2x2x2.osg               1.04               0.96
GLUc2  multithreadTest_v1.6 -t 60 0 1 2 testex_2x2x2_2x2x2.osg             0.59               0.99
GLUc3  multithreadTest_v1.6 -t 60 0 1 2 3 testex_2x2x2_2x2x2.osg           0.94               0.48
GLOc0  multithreadTest_v1.6 -t 60 0 testex_2x2x2_2x2x2.opt.osg                      14.73              14.69
GLOc1  multithreadTest_v1.6 -t 60 0 1 testex_2x2x2_2x2x2.opt.osg                    14.74              14.69
GLOc2  multithreadTest_v1.6 -t 60 0 1 2 testex_2x2x2_2x2x2.opt.osg                  14.74              14.69
GLOc3  multithreadTest_v1.6 -t 60 0 1 2 3 testex_2x2x2_2x2x2.opt.osg                14.74              14.69