On Mon, Nov 08, 2004 at 11:32:24AM -0800, Ian Romanick wrote:
| This is something I've been thinking about ever since I saw the 
| profiling tools in Nvidia's drivers at SIGGRAPH.  There's a LOT of 
| information that would be useful to get out of the driver about performance

Have you taken a look at the SGIX_instruments extension?  It provides a
framework that's intended for gathering profiling information
asynchronously.  The idea was that you'd add separate extensions that
defined the actual instrumentation (SGIX_ir_instrument1 was an early
example).

I searched my archives for things I'd written on this subject in the
past.  The following is probably the most comprehensive summary.  Some
of it is out of date now, or has implications for hardware design (which
is out of our control), but some of it still looks useful.

Allen



Purposes of Instrumentation

        Tuning
                Analyzing the app or database to improve overall
                performance and/or rendering quality.  Typically done
                during the development phase.  Examples:  determining
                what percentage of triangles are clipped, or how well
                texture memory is utilized.

        Load Monitoring
                Gathering information to modify the behavior of the app
                or the structure of the database dynamically, to
                maintain a constant frame-rate.  Typically done in
                real-time by production apps.  Example:  determining
                how much time is spent in geometric processing and how
                much time in pixel-fill, in order to choose object
                level-of-detail.

        Debugging/Testing
                Graphics systems are extremely complex, and their behavior
                isn't always predictable.  We can anticipate a need for
                machine-specific instrumentation in order to understand
                surprisingly high or low performance of an application,
                or for use during driver development.
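
        As a rough illustration of the load-monitoring case, an app
        might compare the geometry and fill times measured for the last
        frame against its frame budget and adjust level-of-detail.  The
        names, the budget parameter, and the one-step policy below are
        assumptions made for this sketch; none of them come from an
        existing extension.

```c
#include <assert.h>

/* Hypothetical per-frame timings, as an app might read them back from
 * timing instruments (all in microseconds). */
struct frame_times {
    unsigned geometry_usec;  /* time spent in geometric processing */
    unsigned fill_usec;      /* time spent in pixel fill */
};

/* Return a new level-of-detail index (0 = most detailed).
 * Assumed policy: if the slower stage exceeds the frame budget and that
 * stage is geometry, coarsen LOD by one step; if both stages are under
 * half the budget, refine by one step; otherwise leave LOD unchanged.
 * When fill is the bottleneck, changing geometric LOD is unlikely to
 * help, so LOD is left alone in that case. */
int choose_lod(struct frame_times t, unsigned budget_usec, int lod, int max_lod)
{
    assert(lod >= 0 && lod <= max_lod);
    unsigned slowest = t.geometry_usec > t.fill_usec ? t.geometry_usec
                                                     : t.fill_usec;
    if (slowest > budget_usec && t.geometry_usec >= t.fill_usec)
        return lod < max_lod ? lod + 1 : lod;
    if (slowest < budget_usec / 2)
        return lod > 0 ? lod - 1 : lod;
    return lod;
}
```

        A production app would probably smooth the timings over several
        frames before acting on them, to avoid oscillating between
        levels.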

Infrastructure

        The SGIX_instruments extension provides scaffolding for
        pipeline instrumentation.  The framework allows the app to:

                Specify a buffer into which measurements will be
                delivered (asynchronously) by the pipe.

                Enable/disable an arbitrary collection of instruments.

                Start/stop/snapshot measurements by the
                currently-enabled set of instruments.

                Label a measurement with a user-selectable marker.

                Poll or wait for completion of a particular measurement.

        We must write one or more new extensions to define instruments
        that fit into the SGIX_instruments framework.  This outline
        sketches some of the instruments that might be appropriate.

        Since some measurements are performed by real-time apps, it's
        important to keep the overhead low.  The asynchronous delivery
        scheme helps with this, but it's also desirable to keep other
        issues in mind (for example, avoid flushing the pipe if at all
        possible).
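
        The flow described above (supply a buffer, enable instruments,
        start, stop with a marker, poll for completion) can be modeled
        with a toy, single-threaded mock.  Every name below (mock_*,
        INSTR_*) is hypothetical and chosen for this sketch; it is not
        the SGIX_instruments entry-point list, and the "pipe" is just a
        function the caller invokes by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instrument bits for this mock. */
enum { INSTR_FRAGMENTS = 1 << 0, INSTR_PRIMITIVES = 1 << 1 };

struct mock_pipe {
    int *buffer;                /* app-supplied result buffer */
    size_t capacity;            /* buffer size, in ints */
    size_t written;             /* ints delivered so far */
    unsigned enabled;           /* bitmask of enabled instruments */
    int pending_marker;         /* measurement in flight, or -1 */
    int done_marker;            /* last delivered measurement, or -1 */
    int fragments, primitives;  /* running counters */
    int snap_fragments, snap_primitives;  /* values captured at stop */
};

void mock_set_buffer(struct mock_pipe *p, int *buf, size_t n)
{
    p->buffer = buf; p->capacity = n; p->written = 0;
    p->pending_marker = p->done_marker = -1;
}

void mock_start(struct mock_pipe *p) { p->fragments = p->primitives = 0; }

/* Capture the counters and tag them; delivery happens later. */
void mock_stop(struct mock_pipe *p, int marker)
{
    assert(marker >= 0);        /* negative values mean "none" here */
    p->snap_fragments = p->fragments;
    p->snap_primitives = p->primitives;
    p->pending_marker = marker;
}

/* Stands in for the pipe finishing its work some time after mock_stop(). */
void mock_pipe_deliver(struct mock_pipe *p)
{
    if (p->pending_marker < 0) return;
    size_t needed = 1 + !!(p->enabled & INSTR_FRAGMENTS)
                      + !!(p->enabled & INSTR_PRIMITIVES);
    assert(p->written + needed <= p->capacity);
    p->buffer[p->written++] = p->pending_marker;
    if (p->enabled & INSTR_FRAGMENTS)  p->buffer[p->written++] = p->snap_fragments;
    if (p->enabled & INSTR_PRIMITIVES) p->buffer[p->written++] = p->snap_primitives;
    p->done_marker = p->pending_marker;
    p->pending_marker = -1;
}

/* Non-blocking poll: returns 1 and stores the marker if a measurement
 * has been delivered since the last poll, else returns 0. */
int mock_poll(struct mock_pipe *p, int *marker)
{
    if (p->done_marker < 0) return 0;
    *marker = p->done_marker;
    p->done_marker = -1;
    return 1;
}
```

        The point of the model is that mock_stop() returns immediately
        and the app only learns the result through mock_poll(), so
        nothing in the measurement path forces a pipe flush.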

Suggested Instruments

        Rendering Statistics

                Number of bytes of data sent to pipe
                Number of bytes of data sent from pipe
                        These are used to identify data transfer
                        bottlenecks arising from geometry-path
                        commands, pixel-path commands, and texture
                        management.

                Number of geometric primitives sent to pipe
                Number of geometric primitives trivially accepted or rejected
                Number of geometric primitives subjected to 3D clipping
                Number of geometric primitives resulting from 3D clipping
                Number of geometric primitives face-culled
                Number of matrix ops sent to pipe
                        These measure culling effectiveness and
                        determine the cause of geometry-processing
                        bottlenecks (e.g., too many vertices, too much
                        clipping, or too many attribute changes).

                Number of DrawPixels commands sent to pipe
                Number of Bitmap commands sent to pipe
                Number of ReadPixels commands sent to pipe
                Number of CopyPixels commands sent to pipe
                        Together with the data transfer statistics,
                        these help determine whether pixel-oriented
                        apps are running into data transfer or pixel
                        operation setup bottlenecks.

                Number of MakeCurrent/MakeCurrentRead commands executed
                        This should help determine when apps are using
                        more than the optimal number of contexts, and
                        thus causing an inordinate number of context
                        switches.

                Number of fragments generated, for each rasterizer
                Number of fragments passing depth test, for each rasterizer
                        Together with other statistics, these help
                        estimate average triangle size, depth
                        complexity, and effectiveness of depth
                        sorting.

                Open Issues:
                        Is there a way to track the number of bytes
                        processed by CopyPixels-style operations?
                        These aren't accounted for by the transfers
                        to and from the pipe.
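
                To make the intended use of these counters concrete,
                here is a minimal sketch of a few derived figures.  The
                struct layout and function names are hypothetical, and
                the fragment counts are assumed to have already been
                summed over all rasterizers.

```c
#include <assert.h>

/* Hypothetical snapshot of a few of the counters listed above. */
struct render_counts {
    unsigned long prims_sent;          /* primitives sent to pipe */
    unsigned long prims_clipped;       /* primitives subjected to 3D clipping */
    unsigned long frags_generated;     /* summed over all rasterizers */
    unsigned long frags_passed_depth;  /* summed over all rasterizers */
};

/* Percentage of primitives that went through 3D clipping. */
double percent_clipped(const struct render_counts *c)
{
    assert(c->prims_sent > 0);
    return 100.0 * (double)c->prims_clipped / (double)c->prims_sent;
}

/* Average depth complexity: fragments generated per window pixel. */
double depth_complexity(const struct render_counts *c,
                        unsigned long window_pixels)
{
    assert(window_pixels > 0);
    return (double)c->frags_generated / (double)window_pixels;
}

/* Fraction of generated fragments that passed the depth test.  At a
 * given depth complexity, a lower value suggests more effective
 * front-to-back sorting (more fragments rejected by the depth test). */
double depth_pass_ratio(const struct render_counts *c)
{
    assert(c->frags_generated > 0);
    return (double)c->frags_passed_depth / (double)c->frags_generated;
}
```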

        Texture Statistics

                Number of texture binds performed
                        Pinpoints an important attribute-change
                        bottleneck.

                Number of TexImage/TexSubImage commands
                Number of CopyTexImage/CopyTexSubImage commands
                Number of texture downloads initiated by texture manager
                Number of GetTexImage commands
                Number of texture uploads initiated by texture manager
                        Together with other stats, determines cost of
                        texture management operations.

                Texture memory utilization
                        Initial/Max/Min/Final fraction of texture memory
                        in use over the measurement interval.

                Open Issues:
                        Number of texture fetches, per rasterizer?

        Timing Measurements

                Return these times for all commands appearing between
                two ``bracketing'' commands issued by the app:

                        Host CPU time (usecs)
                        Geometry processing time, total for vector
                                and scalar units (usecs)
                        Rasterization processing time, for each
                                rasterizer (usecs)
                        Wall clock time (usecs)

                Note that the above measurements should reflect the
                ``useful work'' performed by the associated pipe
                stages; they should be repeatable no matter what is in
                the pipe before the first bracketing command is issued
                and no matter what is placed in the pipe after the
                second bracketing command is issued.  (Thus, counting
                FIFO full/empty states isn't sufficient.)
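
                Given a set of bracketed times like the ones above, an
                app or tool could make a first guess at the limiting
                stage.  The sketch below assumes the rasterizers run in
                parallel, so the slowest one bounds the rasterization
                stage; that assumption, the struct, and the names are
                mine, not part of any proposal.

```c
#include <assert.h>

#define MAX_RASTERIZERS 8   /* arbitrary limit for this sketch */

/* Hypothetical result record for one pair of bracketing commands. */
struct bracket_times {
    unsigned host_usec;
    unsigned geometry_usec;
    unsigned raster_usec[MAX_RASTERIZERS];  /* one entry per rasterizer */
    int num_rasterizers;
    unsigned wall_usec;
};

enum stage { STAGE_HOST, STAGE_GEOMETRY, STAGE_RASTER };

/* Return the stage with the largest "useful work" time.  Ties go to
 * the earlier stage in the pipe. */
enum stage limiting_stage(const struct bracket_times *t)
{
    assert(t->num_rasterizers >= 0 && t->num_rasterizers <= MAX_RASTERIZERS);
    unsigned raster_max = 0;
    for (int i = 0; i < t->num_rasterizers; i++)
        if (t->raster_usec[i] > raster_max)
            raster_max = t->raster_usec[i];
    if (t->host_usec >= t->geometry_usec && t->host_usec >= raster_max)
        return STAGE_HOST;
    if (t->geometry_usec >= raster_max)
        return STAGE_GEOMETRY;
    return STAGE_RASTER;
}
```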

Instruments NOT Recommended

        Number of FIFO high-water interrupts
                Not sure this is needed.  Provided we do a good job of
                accounting for time spent in each stage of the pipe,
                that accounting should be of more use than the raw
                number of interrupts, and interpreting it should
                involve less system-dependent code.
                
        Number of graphics context switches
                Superseded by recording the number of MakeCurrent
                commands (which should be more useful on a per-context
                basis than the global number of context switches per
                pipe).

        Number of geometric primitives scissored
                See note under Issues/Resolutions below.

        Number of bytes transferred due to DrawPixels/Bitmap commands
        Number of bytes transferred due to ReadPixels commands
        Number of bytes transferred due to CopyPixels commands
        Number of bytes of texture data transferred as a result
            of TexImage, CopyTexImage, GetTexImage, etc.
                These seem reasonable, but I suspect we'll get adequate
                bang-for-the-buck just by counting the number of bytes
                transferred to/from the pipe.  (Tracking bytes
                transferred for Copy* operations is an open issue.)

        Coarse Z-culling stats of some kind?
                My current guess is that if we can provide statistics
                on number of fragments generated and the number of
                fragments passing the depth test, it's unlikely we'll
                need more stats on coarse Z-culling.

Issues/Resolutions

        In principle, the application can handle some of the
        measurements described above (counting the number of times a
        given command is executed, for example).  Should we bother
        implementing instruments to capture such measurements?

                I believe we should.  Although it makes good design
                sense to avoid duplicating what's easily accomplished
                in the apps, there are two problems with requiring
                users to make measurements on their own:

                (1) Doing so could require wholesale changes to source
                code.  (Consider what would be needed to handle display
                lists correctly.)  It's unlikely many users would do
                this.

                (2) Users typically don't have access to the source code
                for high-level libraries that issue OpenGL commands, so
                requiring source code changes makes it impractical for
                them to measure the commands executed by those libraries.

        Why not use a library like GLS or a utility like ogldebug to
        trace OpenGL commands and make such measurements?

                Good arguments have been made for this, but I'm not
                completely convinced.

                In some cases, using GLS or ogldebug mitigates the
                problems mentioned above.  For example, it would be
                easier to maintain counts of the number of times a
                command is executed, since no access to source code is
                needed.  (Handling display lists correctly seems
                possible, though it would require a good bit of work,
                especially for shared dlists.)

                There are problems merging the results of counts from
                the tracing utilities with timing measurements made by
                other instruments.  The tracing utilities would need to
                interpret the instrumentation commands to know when to
                start and stop counting.  The counts wouldn't be
                available to the application under test, so it couldn't
                make on-the-fly decisions based on them.

                Also, in many cases I suspect it's more work to put
                this functionality into the tracing utilities than it
                is to fold the functionality into the instruments.
                Counting pixel and texture commands might be
                accomplished with just a few lines of microcode, for
                example.

        It's difficult to measure the number of scissored geometric
        primitives, because a primitive may be scissored in one
        rasterizer but not in others.  Determining which primitives
        have been scissored essentially requires tagging each primitive
        so that the status from all rasterizers can be combined
        meaningfully.

                Good point.  That statistic has been dropped from the
                current proposal.
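
                The over-counting problem can be shown with a small
                sketch.  It assumes a hypothetical per-primitive tag
                (a bitmask with one bit per rasterizer), which is
                exactly the bookkeeping the hardware would have to
                carry; the names are made up for illustration.

```c
#include <assert.h>

/* scissored[i] is a bitmask: bit r is set if rasterizer r scissored
 * primitive i.  Both the layout and the names are hypothetical. */

/* What untagged per-rasterizer counters would add up to: a primitive
 * scissored in two rasterizers is counted twice. */
unsigned sum_of_rasterizer_counts(const unsigned *scissored, int n,
                                  int num_rast)
{
    assert(num_rast > 0 && num_rast <= 32);
    unsigned total = 0;
    for (int i = 0; i < n; i++)
        for (int r = 0; r < num_rast; r++)
            total += (scissored[i] >> r) & 1u;
    return total;
}

/* What tagging allows: each primitive is counted once if any
 * rasterizer scissored it. */
unsigned tagged_count(const unsigned *scissored, int n)
{
    unsigned total = 0;
    for (int i = 0; i < n; i++)
        total += scissored[i] != 0;
    return total;
}
```

                For three primitives with masks 0x3, 0x0, and 0x1 on a
                two-rasterizer pipe, the summed counters report 3 while
                only 2 primitives were actually scissored.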

        It would be worthwhile to consider instruments that would help
        debug performance problems, but would not necessarily be
        exposed for general use.  (A count of the number of cycles for
        which each type of memory request [texture, video, command
        fifo, etc.] stalls, for example.)

                Yes.  The proposal now mentions a ``Debugging/Testing''
                category of instruments.

        Beware of adding readable hardware counters, particularly when
        they affect multiple blocks of logic and software (consider
        testability, new special command packets that would be
        required, context switching, etc.).

                True.  Not all of these instruments will be practical.

        For multiple geometry engines, some measurements will need to be
        maintained on a per-GE basis.  The extension spec must reflect
        this (as it must reflect the existence of multiple rasterizers).

