Thank you for the comment. This is interesting. Since these chips are not so common, I don't think any open-source OS runs on these processors yet. So the personal computers from Nvidia must be running some proprietary OS... am I right? Or are they running some form of modified Linux kernel and, better still, using x86-compatible opcodes?
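To make the warp-divergence point in the quoted message below concrete, here is a minimal Python sketch of a toy cost model. It assumes a warp of 32 threads that has to execute every distinct branch taken by any of its threads, one branch after another, with the remaining threads masked off. The names `warp_cycles` and `utilization` and the flat 100-cycle branch cost are made up for illustration; real hardware scheduling is more involved.

```python
WARP_SIZE = 32  # assumed warp width


def warp_cycles(branches, cost):
    """Cycles for one warp: each distinct branch taken is executed serially."""
    return sum(cost[b] for b in set(branches))


def utilization(branches, cost):
    """Fraction of lane-cycles that do useful work (1.0 means no lane is idle)."""
    useful = sum(cost[b] for b in branches)
    return useful / (warp_cycles(branches, cost) * len(branches))


# Hypothetical workload: every branch costs 100 cycles.
cost = {b: 100 for b in range(WARP_SIZE)}

coherent = [0] * WARP_SIZE                    # all threads take the same path
four_way = [t % 4 for t in range(WARP_SIZE)]  # threads split over 4 paths
fully_divergent = list(range(WARP_SIZE))      # every thread on its own path

for name, branches in [("coherent", coherent),
                       ("four_way", four_way),
                       ("fully_divergent", fully_divergent)]:
    print(name, warp_cycles(branches, cost), utilization(branches, cost))
# coherent 100 1.0
# four_way 400 0.25
# fully_divergent 3200 0.03125
```

In the fully divergent case the warp does one thread's worth of useful work per cycle, which is the "speed of a single thread" situation described in the message below.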
On 8/2/09, stefanba...@yahoo.com <stefanba...@yahoo.com> wrote:
>
> A GPU is blazingly fast only for algorithms that execute the same instruction simultaneously on a massive array of numbers, a.k.a. SIMD (Single Instruction, Multiple Data). A multi-core Intel/AMD CPU is a MIMD machine (Multiple Instructions, Multiple Data). Algorithms with inherently divergent code paths therefore run more efficiently on MIMD, and SIMD-friendly algorithms are a better fit for today's GPUs. The CUDA interface pretends to be MIMD, but that is VERY misleading: once the threads of a warp are out of sync, performance drops to the speed of a single thread, a dramatic slowdown.
>
> Ray tracing is one example where a MIMD architecture is required to run efficiently. The best multi-core CPU ray-tracing implementations dramatically outperform the best GPU ray tracers for large numbers of triangles, and the more triangles, the bigger the gap. GPU fans tend to compare a single Core 2 with a GeForce 285 for ray tracing, and as a rule the CPU code is just a port of the GPU shaders, done by the authors of the GPU ray-tracing implementation. However, what is optimal for CPU ray tracing is not optimal for the GPU, and vice versa. Ray-tracing experts know these GPU/SIMD limitations all too well, so NVIDIA worries about its future position in 3D, where ray tracing is going to rule.
>
> Larrabee is going to be a kind of MIMD/SIMD hybrid closer to clean MIMD, and G300 is a kind of hybrid closer to SIMD. Actually, even today's G200 is not pure SIMD: it has several independent SIMD units, but it is still really good only for SIMD-friendly algorithms.
>
> Stefan
>
> On Jul 31, 6:20 pm, Peter Teoh <htmldevelo...@gmail.com> wrote:
> > http://www.ddj.com/hpc-high-performance-computing/218100902
> >
> > CUDA, Supercomputing for the Masses: Part 13 (Page 1 of 4)
> >
> > Rob Farber
> >
> > Using texture memory in CUDA
> >
> > Rob Farber is a senior scientist at Pacific Northwest National Laboratory.
> > He has worked in massively parallel computing at several national laboratories and as co-founder of several startups. He can be reached at rmfar...@gmail.com.
> >
> > In CUDA, Supercomputing for the Masses: Part 12 of this article series on CUDA, I took a quick detour to discuss some of the paradigm-changing features of the latest CUDA Toolkit 2.2 release. This article resumes the discussion of "texture memory" which I began in Part 11 of this series. In addition, this installment includes information on the new CUDA Toolkit 2.2 texture capability that allows some programs to eliminate extra copies by providing the ability to write to global memory on the GPU that has a 2D texture bound to it.
> >
> > From a C programmer's perspective, texture memory provides an unusual combination of cache memory (separate from register, global, and shared memory), local processing capability (separate from the scalar processors), and a way to interact with the display capabilities of the GPU. This article focuses on the cache and local processor capabilities of texture memory while the next column will discuss how to perform viewable graphic operations with the GPU.
> >
> > Don't be put off from using texture memory because it is different and has many options. The use of texture memory can improve performance for both bandwidth-limited and latency-limited programs. For example, some programs can exceed the maximum theoretical memory bandwidth of the underlying global memory through judicious use of the texture memory cache. While the latency of a texture cache reference is generally the same as DRAM, there are some special cases that can deliver data with slightly less than 100 cycles of latency. As usual in CUDA, the use of many threads can hide memory access latency regardless of whether texture cache or global memory is being accessed.
> > For CUDA programmers, the most salient points about using texture memory as a cache are: it is optimized for 2D spatial locality, it is very small (effectively about 8KB per multiprocessor), and it can provide a performance benefit when all the threads in a warp access nearby locations in the texture (as demonstrated in Cache-Efficient Numerical Algorithms using Graphics Hardware). Another tip from the forums is to pack data up if you can, because a single float4 texture read is faster than four separate float texture reads.
> >
> > One ingenious mapping of a random-access data structure to texture memory has been implemented by the CUDA-EC software. In the CUDA code, NVIDIA implements a Bloom filter to test for set membership. The CUDA-EC software is available for free download at http://cuda-ec.sourceforge.net/.
> >
> > The CUDA Toolkit 2.2 introduced the ability to write to 2D textures bound to pitch linear memory on the GPU. In other words, the data within the texture can be updated within a kernel running on the GPU. This is a very nice feature because it allows many codes to better utilize the caching behavior of texture memory while also eliminating copies. One common example that immediately springs to mind is calculations that require two passes through the data: one to calculate a value (such as a mean or maximum) and a second pass to update the data in place. Such calculations are common when changing the data range or calculating probabilities. The use of an updatable texture can potentially speed up these types of calculations.
> >
> > The cuBLAS library uses texture memory for many of the single-pass calculations (sasum, sdot, etc.).
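For readers who have not met the Bloom filter mentioned above, here is a minimal Python sketch of the data structure itself, not of CUDA-EC's code. The class name, the sizes, and the SHA-256-based hashing are assumptions chosen for illustration. The bit array is the kind of randomly accessed lookup table that the article describes being mapped to texture memory.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: set membership with possible false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # packed bit array, all zero

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every bit for this item is set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter()
for word in ("warp", "texture", "kernel"):
    bf.add(word)

print("texture" in bf)  # True: an added item is always found
```

A lookup can report an item that was never added (a false positive) but never misses one that was, so a negative answer is always reliable. Each lookup touches a few scattered bits, which is why the access pattern is random rather than coalesced.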
> > However, comments in the source code indicate that texture memory should not be used for vectors that are short or those that are aligned and have unit stride and thus have nicely coalesced behavior. (The source for the cuBLAS library and cuFFT is available for those who have signed up as NVIDIA developers.)
> >
> > Texture cache is part of each TPC, here short for "Thread Processing Cluster" since I am discussing operations in compute mode. (TPC stands for "Texture Processing Cluster" in graphics mode, which I don't address in this article.) Each TPC contains multiple streaming multiprocessors and a single texture cache. It is important to note that in the GTX 200 series, the texture cache supports three SMs (Streaming Multiprocessors) per TPC while the G80/G92 architecture only supports two.
> >
> > Figure 1 depicts a high-level view of the GeForce GTX 280 GPU in parallel computing mode: A hardware-based thread scheduler at the top manages scheduling threads across the TPCs, which include the texture caches and memory interface units. The elements indicated as "atomic" refer to the ability to perform atomic read-modify-write operations to memory. For more information, please see the GeForce GTX 200 GPU Technical Brief.
> >
> > Figure 1: High-level view of GTX 280 architecture (Courtesy NVIDIA).
> >
> > Figure 2 represents a lower-level view of a single TPC. Note that TF stands for "Texture Filtering" and IU is the abbreviation for "Instruction Unit".
> >
> > Figure 2: Lower-level view of a single GTX 280 TPC (Courtesy NVIDIA).
> >
> > Textures are bound to global memory and can provide both cache and some processing capabilities. How the global memory was created dictates some of the capabilities the texture can provide.
> > For this reason, it is important to distinguish between three memory types that can be bound to a texture:
> >
> > Table 1: Distinguishing between memory types.

-- 
Regards,
Peter Teoh