A GPU is blazingly fast only for algorithms that execute the same instruction simultaneously on a massive array of numbers, a.k.a. SIMD (Single Instruction, Multiple Data). A multi-core Intel/AMD CPU is a MIMD machine (Multiple Instructions, Multiple Data); therefore, algorithms with inherently divergent code paths run more efficiently on MIMD hardware, while SIMD-friendly algorithms are a better fit for today's GPUs. The CUDA interface pretends to be MIMD, but that is VERY misleading: once the threads of a warp fall out of sync, performance drops toward the speed of a single thread - a dramatic slowdown.

Ray tracing is one example where a MIMD architecture is required to run efficiently. The best multi-core CPU ray-tracing implementations dramatically outperform the best GPU ray tracing for large triangle counts, and the bigger the count, the bigger the gap. GPU fans like to compare a single Core 2 against a GeForce 285 for ray tracing, and as a rule the CPU code in those comparisons is just a port of the GPU shaders written by the authors of the GPU implementation; however, what is optimal for CPU ray tracing is not optimal for the GPU, and vice versa. Ray-tracing experts know these GPU/SIMD limitations all too well, which is why NVIDIA worries about its future 3D position in a world where ray tracing is going to rule. Larrabee is going to be a MIMD/SIMD hybrid closer to clean MIMD, while the G300 is a hybrid closer to SIMD. Actually, even today the G200 is not pure SIMD - it has several independent SIMD units - but it is still really good only for SIMD-friendly algorithms.
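To make the warp-divergence point concrete, here is a minimal CUDA sketch (the kernel and variable names are invented for illustration, not taken from any real ray tracer): when the 32 threads of a warp disagree on a data-dependent branch, the hardware executes both sides one after the other with part of the warp masked off, so the warp runs at a fraction of its peak rate.

// Minimal sketch of warp divergence (illustrative only).
__global__ void shade(const float *hit_t, float *color, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Data-dependent branch: neighbouring threads of one 32-wide warp may
    // take different sides, so both paths execute serially with the
    // inactive threads masked off - the "single thread speed" effect.
    if (hit_t[i] > 0.0f)
        color[i] = expf(-hit_t[i]);   // "hit" path
    else
        color[i] = 0.0f;              // "miss" path
}

In a real ray tracer the branches are much deeper (traversal, shading, secondary rays), so this serialization cost compounds.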
Stefan

On Jul 31, 6:20 pm, Peter Teoh <htmldevelo...@gmail.com> wrote:
> http://www.ddj.com/hpc-high-performance-computing/218100902
> CUDA, Supercomputing for the Masses: Part 13 (Page 1 of 4)
>
> Rob Farber
>
> Using texture memory in CUDA
>
> Rob Farber is a senior scientist at Pacific Northwest National Laboratory. He has worked in massively parallel computing at several national laboratories and as co-founder of several startups. He can be reached at rmfar...@gmail.com.
>
> In CUDA, Supercomputing for the Masses: Part 12 of this article series on CUDA, I took a quick detour to discuss some of the paradigm-changing features of the latest CUDA Toolkit 2.2 release. This article resumes the discussion of "texture memory" which I began in Part 11 of this series. In addition, this installment includes information on the new CUDA Toolkit 2.2 texture capability that allows some programs to eliminate extra copies by providing the ability to write to global memory on the GPU that has a 2D texture bound to it.
>
> From a C-programmer's perspective, texture memory provides an unusual combination of cache memory (separate from register, global, and shared memory), local processing capability (separate from the scalar processors), and a way to interact with the display capabilities of the GPU. This article focuses on the cache and local processor capabilities of texture memory while the next column will discuss how to perform viewable graphic operations with the GPU.
>
> Don't be put off from using texture memory because it is different and has many options. The use of texture memory can improve performance for both bandwidth and latency limited programs. For example, some programs can exceed the maximum theoretical memory bandwidth of the underlying global memory through judicious use of the texture memory cache. While the latency of a texture cache reference is generally the same as DRAM, there are some special cases that can deliver data with slightly less than 100 cycles of latency. As usual in CUDA, the use of many threads can hide memory access latency regardless if texture cache or global memory is being accessed.
>
> For CUDA programmers, the most salient points about using texture memory as a cache are: it is optimized for 2D spatial locality, very small (effectively about 8KB per multiprocessor), and can provide a performance benefit by having all the threads in a warp access nearby locations in the texture (as demonstrated in Cache-Efficient Numerical Algorithms using Graphics Hardware). Another tip from the forums is to pack data up if you can because a single float4 texture read is faster than four separate float texture reads.
>
> One ingenious mapping of a random-access data structure to texture memory has been implemented by the CUDA-EC software. In the CUDA code, NVIDIA implements a Bloom filter to test for set membership. The CUDA-EC software is available for free download at http://cuda-ec.sourceforge.net/.
>
> The CUDA Toolkit 2.2 introduced the ability to write to 2D textures bound to pitch linear memory on the GPU that has a texture bound to it. In other words, the data within the texture can be updated within a kernel running on the GPU. This is a very nice feature because it allows many codes to better utilize the caching behavior of texture memory while also eliminating copies.
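As a concrete aside on the float4 packing tip above, here is a minimal sketch using the texture reference API of that CUDA generation (the names texData, sum4, and run are invented for illustration): one float4 fetch pulls 16 bytes through the texture cache instead of issuing four separate float fetches.

#include <cuda_runtime.h>

// Texture reference bound to linear device memory holding float4 elements.
texture<float4, 1, cudaReadModeElementType> texData;

__global__ void sum4(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = tex1Dfetch(texData, i);   // single cached 16-byte read
        out[i] = v.x + v.y + v.z + v.w;
    }
}

// Host side (error checking omitted): bind the device array, run, unbind.
void run(float4 *d_in, float *d_out, int n)
{
    cudaBindTexture(0, texData, d_in, n * sizeof(float4));
    sum4<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaUnbindTexture(texData);
}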
> One common example that immediately springs to mind are calculations that require two passes through the data: one to calculate a value (such as a mean or maximum) and a second pass to update the data in place. Such calculations are common when changing the data range or calculating probabilities. The use of an updatable texture can potentially speed these types of calculations.
>
> The cuBLAS library uses texture memory for many of the single-pass calculations (sasum, sdot, etc.). However, comments in the source code indicate that texture memory should not be used for vectors that are short or those that are aligned and have unit stride and thus have nicely coalesced behavior. (The source for the cuBLAS library and cuFFT are available for those who have signed up as NVIDIA developers.)
>
> Texture cache is part of each TPC, here short for "Thread Processing Cluster" since I am discussing operations in compute mode. (TPC stands for "Texture Processing Cluster" in graphics mode, which I don't address in this article.) Each TPC contains multiple streaming multiprocessors and a single texture cache. It is important to note that in the GTX 200 series, the texture cache supports three SMs (Streaming Multiprocessors) per TPC while the G80/G92 architecture only supports two.
>
> Figure 1 depicts a high-level view of the GeForce GTX 280 GPU in parallel computing mode: A hardware-based thread scheduler at the top manages scheduling threads across the TPCs, which includes the texture caches and memory interface units. The elements indicated as "atomic" refer to the ability to perform atomic read-modify-write operations to memory. For more information, please see the GeForce GTX 200 GPU Technical Brief.
>
> Figure 1: High-level view of GTX 280 architecture (Courtesy NVIDIA).
>
> Figure 2 represents a lower-level view of a single TPC. Note that TF stands for "Texture Filtering" and IU is the abbreviation for "Instruction Unit".
>
> Figure 2: Lower-level view of a single GTX 280 TPC (Courtesy NVIDIA).
>
> Textures are bound to global memory and can provide both cache and some processing capabilities. How the global memory was created dictates some of the capabilities the texture can provide. For this reason, it is important to distinguish between three memory types that can be bound to a texture:
>
> Table 1: Distinguishing between memory types.
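To tie the Toolkit 2.2 feature above to code, here is a hypothetical sketch (the texture texField, the kernel scale_rows, and the helper run are invented for illustration, not taken from the article or cuBLAS): pitch linear memory is bound to a 2D texture with cudaBindTexture2D, and a kernel reads each element through the texture cache and writes the updated value back to the same global memory in place.

#include <cuda_runtime.h>

// 2D texture reference bound to pitch linear device memory.
texture<float, 2, cudaReadModeElementType> texField;

__global__ void scale_rows(float *data, size_t pitch, int width, int height, float s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Cached read; the texture cache is not kept coherent with writes
        // made in the same kernel, so each thread touches only its own element.
        float v = tex2D(texField, x + 0.5f, y + 0.5f);
        float *row = (float *)((char *)data + y * pitch);   // same memory the texture is bound to
        row[x] = s * v;                                      // in-place update
    }
}

// Host side (error checking and data initialization omitted).
void run(int width, int height, float s)
{
    float *d_data;
    size_t pitch;
    cudaMallocPitch((void **)&d_data, &pitch, width * sizeof(float), height);

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaBindTexture2D(0, texField, d_data, desc, width, height, pitch);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    scale_rows<<<grid, block>>>(d_data, pitch, width, height, s);

    cudaUnbindTexture(texField);
    cudaFree(d_data);
}

In the two-pass pattern the article describes, a first kernel would compute the mean or maximum through the texture, and a kernel like this one would then rescale the data in place without any extra device-to-device copy.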