Joseph Benden
Thu, 01 Jan 2009 18:47:49 -0800
GPU Audio Codec Transcoding within Asterisk PBX
===============================================

Abstract
--------

This article describes a failed attempt at using GPU technology to
optimize transcoding within the Asterisk PBX system. GPU technologies,
such as those offered by nVidia, push algorithmic processing from the
CPU to specialized processors called GPUs. An algorithm is compiled and
transferred to a GPU, where it performs floating-point and/or integer
math very quickly across parallel threads of execution.

The goal of this project was to use nVidia's high-end GPU offering, the
Tesla C1060, to perform a large number of transcoding operations. The
Tesla C1060 is a PCIe x16 card with a peak processing capability of
933 GFLOPS. It is supported on Linux and Windows operating systems.

A Project Failure
-----------------

The project started with transcoding g.711u to signed linear and the
reverse. It was thought that this pair would reasonably represent other
transcoding operations.

The nVidia GPUs utilize a C API called CUDA, which gives any program
access to the power of the GPU. Some important aspects of CUDA are:

- CUDA requires that, in a multi-threaded application, the same thread
  operate all aspects of the GPU. Multiple threads may create many
  different contexts with CUDA; however, performance will decrease as
  they contend with each other.

- CUDA GPU threads are secondary to graphics on nVidia graphics cards.
  If a CUDA-capable nVidia graphics card is also used for display, all
  graphics handling trumps any CUDA operations.

- CUDA recommends GPU threads minimally execute in groups of 32 to 64,
  with the optimal number being 256.

- Memory blocks of at least 16 by 16 by 4 or 4,096 bytes. Blocks of
  memory are referred to as cells/blocks (?), which have four vectors
  each.

- Memory should be properly aligned and page locked.

In attempting this project, the following information was learned:

- Memory copy overhead and latency.
  CUDA recommends that memory buffers coming from the application be
  transferred to the GPU in a staged approach to maximize the parallel
  activity of the GPU. While this is a perfectly reasonable
  recommendation, it introduces additional latency in our real-time
  processing environment.

  It is important to remember that these latencies are a trade-off: if
  the algorithm's improvement in execution time offsets the introduced
  memory latency, then it is completely reasonable to go this route.
  This would be perfectly acceptable for video transcoding, because
  there is more data to process and the algorithms may be more complex.
  This is not the case for audio transcoding.

Architectural Aspects
---------------------

In order to maximize the number of simultaneous transcoding operations,
Asterisk PBX would require a separate thread of execution to handle all
CUDA operations. All transcode requests must be queued from the channel
threads onto a circular queue (with an implementation specifically
chosen to minimize thread contention, e.g. wait-free or lock-free). The
CUDA thread would then be able to coalesce multiple waiting transcodes
into a single processing request to the GPU.

It would be wisest to implement dynamic transcoding back-engines, such
that when the transcoding thread in Asterisk PBX is ready to coalesce,
it takes the current count of pending transcode operations and uses it
to select which engine to use. In testing with audio streams, the
following ordering was found to hold:

  Single-threaded typical transcode algorithm <=
  SIMD transcode algorithm <=
  CUDA transcode algorithm

For any given transcode, there is a point at which one of the above
implementations is best suited. By creating a standalone tool, these
values can be measured for each hardware environment and properly
configured for any individual machine.

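The engine selection described above could be sketched roughly as
follows. This is only an illustration, not code from the project: the
enum, the struct, and the function select_engine() are hypothetical
names, and the crossover thresholds are assumed to come from the
standalone measurement tool rather than from any known figures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical back-engine identifiers; not part of the Asterisk API. */
enum transcode_engine {
    ENGINE_SCALAR, /* plain single-threaded C loop */
    ENGINE_SIMD,   /* SIMD-optimized loop on the CPU */
    ENGINE_CUDA    /* coalesced batch handed to the GPU */
};

/* Crossover points, assumed to be measured per machine by a benchmark. */
struct engine_thresholds {
    size_t simd_min; /* smallest batch where SIMD beats the scalar loop */
    size_t cuda_min; /* smallest batch where CUDA beats SIMD */
};

/* Choose an engine from the number of transcodes waiting in the queue. */
static enum transcode_engine
select_engine(const struct engine_thresholds *t, size_t pending)
{
    assert(t != NULL && t->simd_min <= t->cuda_min);

    if (pending >= t->cuda_min)
        return ENGINE_CUDA;
    if (pending >= t->simd_min)
        return ENGINE_SIMD;
    return ENGINE_SCALAR;
}
```

The CUDA thread would call select_engine() once per coalescing pass,
passing the number of requests it just drained from the circular queue.
Keeping the thresholds in a plain struct means they can be loaded from
configuration per machine.
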
Finally, because of the architecture, the structure placed onto the
circular queue could contain both the source codec and the destination
codec. If two calls are bridged and Asterisk PBX does not need
information from the stream, it would be possible to transcode directly
from call A's codec to call B's codec. This would also be more
efficient, because the request pushed to CUDA is a single request for
two separate CUDA "kernels" (functions, in C terminology) to execute,
with only a single memory transfer overhead. Contrast this with having
to transcode call A's codec to signed linear, then signed linear to
call B's codec, with multiple memory buffer transfers and multiple
passes through the circular queue.

Other Thoughts
--------------

While GPU transcoding is not optimal at this time for audio
transcoding, using SIMD instructions to optimize hot-spots within
codecs is a completely worthwhile investment. SIMD instructions give
instant benefits in the areas of code where GPU processing was thought
to give a marked improvement, because SIMD has immediate and quick
access to the memory regions holding the buffers to transcode.

GPU usage will become important if Asterisk PBX deals with video
streams. It is at that point that the topic should be brought into the
light again.

References
----------

nVidia Tesla C1060
http://www.nvidia.com/object/product_tesla_c1060_us.html

CUDA
http://www.nvidia.com/cuda

SIMD
http://en.wikipedia.org/wiki/SIMD

About the Author
----------------

Joseph Benden is the owner of Thralling Penguin LLC. Thralling Penguin
designs, develops, and extends software technologies for the most
demanding business applications, as well as offering VoIP Consulting
services.