Thank you for the comment. This is interesting. Since these chips
are not so common, I don't think there is any open-source OS running
on these processors yet. So the personal computers from Nvidia must
be running some proprietary OS, am I right? Or are they running some
form of modified Linux kernel, and, perhaps better still,
x86-compatible opcodes?

On 8/2/09, stefanba...@yahoo.com <stefanba...@yahoo.com> wrote:
>
>
> A GPU is blazingly fast only for algorithms that execute the same
> instruction simultaneously over a massive array of numbers - a.k.a. SIMD
> (Single Instruction, Multiple Data). A multi-core Intel/AMD CPU is a MIMD
> machine (Multiple Instructions, Multiple Data); therefore, algorithms
> with inherently divergent code paths run more efficiently on MIMD, and
> SIMD-friendly algorithms are a better fit for today's GPU. The CUDA
> interface pretends to be MIMD, but that is very misleading: once the
> threads of a warp fall out of sync, performance drops to the speed of a
> single thread - a dramatic slowdown. Ray tracing is one example where a
> MIMD architecture is required to run it efficiently. The best multi-core
> CPU ray-tracing implementations dramatically outperform the best GPU
> ray tracing for large numbers of triangles, and the bigger the number,
> the bigger the gap. GPU fans tend to compare a single Core 2 with a
> GeForce 285 for ray tracing, and as a rule the CPU code is just a port
> of the GPU shaders done by the authors of the GPU ray-tracing
> implementation; however, what is optimal for CPU ray tracing is not
> optimal for the GPU, and vice versa. Ray-tracing experts know these
> GPU/SIMD limitations only too well, so NVIDIA worries about its
> position in tomorrow's 3D market, where ray tracing is going to rule.
> Larrabee is going to be a MIMD/SIMD hybrid closer to clean MIMD, and
> the G300 is a hybrid closer to SIMD. Actually, even today's G200 is not
> pure SIMD - it has several independent SIMD units - but it is still
> really good only for SIMD-friendly algorithms.
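>
> As a rough, hypothetical sketch of what "out of sync" means (the kernel
> and its names are invented purely for illustration), a data-dependent
> branch like the one below forces the hardware to serialize the two
> paths within each warp:
>
>     __global__ void divergent(const float *in, float *out, int n)
>     {
>         int i = blockIdx.x * blockDim.x + threadIdx.x;
>         if (i >= n) return;
>
>         /* Threads of the same 32-wide warp may take different branches
>            here, so the warp executes path A and then path B serially;
>            with 32 distinct paths it degrades toward the speed of a
>            single thread. */
>         if (in[i] > 0.0f)
>             out[i] = sqrtf(in[i]);       /* path A */
>         else
>             out[i] = -in[i] * in[i];     /* path B */
>     }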
>
> Stefan
>
>
> On Jul 31, 6:20 pm, Peter Teoh <htmldevelo...@gmail.com> wrote:
> > http://www.ddj.com/hpc-high-performance-computing/218100902
> > CUDA, Supercomputing for the Masses: Part 13 (Page 1 of 4)
> >
> > Rob Farber
> >
> > Using texture memory in CUDA
> >
> > Rob Farber is a senior scientist at Pacific Northwest National Laboratory. 
> > He has worked in massively parallel computing at several national 
> > laboratories and as co-founder of several startups. He can be reached 
> > at rmfar...@gmail.com.
> >
> >
> >
> > In CUDA, Supercomputing for the Masses: Part 12 of this article series on 
> > CUDA, I took a quick detour to discuss some of the paradigm-changing 
> > features of the latest CUDA Toolkit 2.2 release. This article resumes the 
> > discussion of "texture memory" which I began in Part 11 of this series. In 
> > addition, this installment includes information on the new CUDA Toolkit 2.2 
> > texture capability that allows some programs to eliminate extra copies by 
> > providing the ability to write to global memory on the GPU that has a 2D 
> > texture bound to it.
> >
> >
> >
> > From a C-programmer's perspective, texture memory provides an unusual 
> > combination of cache memory (separate from register, global, and shared 
> > memory), local processing capability (separate from the scalar processors), 
> > and a way to interact with the display capabilities of the GPU. This 
> > article focuses on the cache and local processor capabilities of texture 
> > memory while the next column will discuss how to perform viewable graphic 
> > operations with the GPU.
> >
> >
> >
> > Don't be put off from using texture memory because it is different and has 
> > many options. The use of texture memory can improve performance for both 
> > bandwidth-limited and latency-limited programs. For example, some programs 
> > can exceed the maximum theoretical memory bandwidth of the underlying global 
> > memory through judicious use of the texture memory cache. While the latency 
> > of a texture cache reference is generally the same as DRAM, there are some 
> > special cases that can deliver data with slightly less than 100 cycles of 
> > latency. As usual in CUDA, the use of many threads can hide memory access 
> > latency regardless of whether texture cache or global memory is being accessed.
> >
> >
> >
> > For CUDA programmers, the most salient points about using texture memory as 
> > a cache are: it is optimized for 2D spatial locality, it is very small 
> > (effectively about 8KB per multiprocessor), and it can provide a performance 
> > benefit by having all the threads in a warp access nearby locations in the 
> > texture (as demonstrated in Cache-Efficient Numerical Algorithms using 
> > Graphics Hardware). Another tip from the forums is to pack data up if you 
> > can, because a single float4 texture read is faster than four 
> > separate float texture reads.
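> >
> > As a rough, untested sketch of that packing tip (the texture name and
> > kernel are invented for illustration), a single float4 fetch replaces
> > four separate float fetches:
> >
> >     texture<float4, 1, cudaReadModeElementType> texPacked;   // assumed name
> >
> >     __global__ void scale4(float4 *out, float s, int n)
> >     {
> >         int i = blockIdx.x * blockDim.x + threadIdx.x;
> >         if (i < n) {
> >             float4 v = tex1Dfetch(texPacked, i);   // one cached 16-byte read
> >             out[i] = make_float4(v.x * s, v.y * s, v.z * s, v.w * s);
> >         }
> >     }
> >     // Host side (sketch): cudaBindTexture(NULL, texPacked, d_in, n * sizeof(float4));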
> >
> >
> >
> > One ingenious mapping of a random-access data structure to texture memory 
> > has been implemented by the CUDA-EC software. In the CUDA code, NVIDIA 
> > implements a Bloom filter to test for set membership. The CUDA-EC software is 
> > available for free download at http://cuda-ec.sourceforge.net/.
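> >
> > As a rough, hypothetical sketch of the idea (the hash functions and
> > names are invented for illustration, not taken from CUDA-EC), the bit
> > array of a Bloom filter can be bound to a texture and probed with
> > random-access, cached reads:
> >
> >     texture<unsigned int, 1> texBits;   // filter bits packed into uints (assumed name)
> >
> >     __device__ bool maybeContains(unsigned int key, int nbits)
> >     {
> >         // Two toy hash functions; a real filter uses more and better ones.
> >         unsigned int h1 = (key * 2654435761u) % nbits;
> >         unsigned int h2 = ((key ^ (key >> 16)) * 2246822519u) % nbits;
> >         unsigned int w1 = tex1Dfetch(texBits, h1 / 32);   // cached random access
> >         unsigned int w2 = tex1Dfetch(texBits, h2 / 32);
> >         return ((w1 >> (h1 % 32)) & 1u) && ((w2 >> (h2 % 32)) & 1u);
> >     }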
> >
> >
> >
> > The CUDA Toolkit 2.2 introduced the ability to write to pitch linear memory 
> > on the GPU that has a 2D texture bound to it. In other 
> > words, the data within the texture can be updated within a kernel running 
> > on the GPU. This is a very nice feature because it allows many codes to 
> > better utilize the caching behavior of texture memory while also 
> > eliminating copies. One common example that immediately springs to mind is 
> > calculations that require two passes through the data: one to calculate a 
> > value (such as a mean or maximum) and a second pass to update the data in 
> > place. Such calculations are common when changing the data range or 
> > calculating probabilities. The use of an updatable texture can potentially 
> > speed up these types of calculations.
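> >
> > As a rough, untested sketch of this pattern (all names and the
> > mean-subtraction example are assumed for illustration, not taken from
> > the toolkit), a kernel can read each element through the 2D texture and
> > write the updated value straight back into the same pitch linear memory:
> >
> >     texture<float, 2, cudaReadModeElementType> texData;   // assumed name
> >
> >     __global__ void subtractMean(float *data, size_t pitch,
> >                                  int width, int height, float mean)
> >     {
> >         int x = blockIdx.x * blockDim.x + threadIdx.x;
> >         int y = blockIdx.y * blockDim.y + threadIdx.y;
> >         if (x < width && y < height) {
> >             float v = tex2D(texData, x, y);               // cached read
> >             float *row = (float *)((char *)data + y * pitch);
> >             row[x] = v - mean;                            // in-place update, no extra copy
> >         }
> >     }
> >
> >     // Host side (error checking omitted):
> >     //   cudaMallocPitch((void **)&d_data, &pitch, width * sizeof(float), height);
> >     //   cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
> >     //   cudaBindTexture2D(NULL, texData, d_data, desc, width, height, pitch);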
> >
> >
> >
> > The cuBLAS library uses texture memory for many of the single-pass 
> > calculations (sasum, sdot, etc.). However, comments in the source code 
> > indicate that texture memory should not be used for vectors that are short 
> > or those that are aligned and have unit stride and thus have nicely 
> > coalesced behavior. (The source for the cuBLAS library and cuFFT is 
> > available to those who have signed up as NVIDIA developers.)
> >
> >
> >
> > Texture cache is part of each TPC, here short for "Thread Processing 
> > Cluster" since I am discussing operations in compute mode. (TPC stands for 
> > "Texture Processing Cluster" in graphics mode, which I don't address in this 
> > article.) Each TPC contains multiple streaming multiprocessors and a single 
> > texture cache. It is important to note that in the GTX 200 series, the 
> > texture cache supports three SMs (Streaming Multiprocessors) per TPC, while 
> > the G80/G92 architecture only supports two.
> >
> >
> >
> > Figure 1 depicts a high-level view of the GeForce GTX 280 GPU in parallel 
> > computing mode: A hardware-based thread scheduler at the top manages 
> > scheduling threads across the TPCs, which includes the texture caches and 
> > memory interface units. The elements indicated as "atomic" refer to the 
> > ability to perform atomic read-modify-write operations to memory. For more 
> > information, please see the GeForce GTX 200 GPU Technical Brief.
> >
> > Figure 1: High-level view of GTX 280 Architecture (Courtesy NVIDIA).
> >
> >
> >
> >
> >
> > Figure 2 represents a lower-level view of a single TPC. Note that TF stands 
> > for "Texture Filtering" and IU is the abbreviation for "Instruction Unit".
> >
> > Figure 2: Lower-level view of a single GTX 280 TPC (Courtesy NVIDIA).
> >
> >
> >
> >
> >
> > Textures are bound to global memory and can provide both cache and some 
> > processing capabilities. How the global memory was created dictates some of 
> > the capabilities the texture can provide. For this reason, it is important 
> > to distinguish between three memory types that can be bound to a texture:
> >
> > Table 1: Distinguishing between memory types.
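> >
> > (As an illustrative, untested sketch only: the three cases are
> > presumably linear memory, pitch linear memory, and CUDA arrays, each
> > with its own allocation and binding call; all names and sizes below
> > are invented.)
> >
> >     texture<float, 1> texLin;     // linear memory       -> read with tex1Dfetch()
> >     texture<float, 2> texPitch;   // pitch linear memory -> read with tex2D()
> >     texture<float, 2> texArr;     // CUDA array          -> read with tex2D(), plus filtering/wrapping
> >
> >     void bindAll(int n, int w, int h)
> >     {
> >         float *d_lin, *d_pitch;  cudaArray *d_arr;  size_t pitch;
> >         cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
> >
> >         cudaMalloc((void **)&d_lin, n * sizeof(float));
> >         cudaBindTexture(NULL, texLin, d_lin, n * sizeof(float));
> >
> >         cudaMallocPitch((void **)&d_pitch, &pitch, w * sizeof(float), h);
> >         cudaBindTexture2D(NULL, texPitch, d_pitch, desc, w, h, pitch);
> >
> >         cudaMallocArray(&d_arr, &desc, w, h);
> >         cudaBindTextureToArray(texArr, d_arr, desc);
> >     }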
> >
> >
> >  090623cuda13_f1.gif (109K)
> >
> >  090623cuda13_f2.gif (37K)
> >
> >  090623cuda13_t1.gif (13K)
> >


-- 
Regards,
Peter Teoh
