[linuxkernelnewbies] Re: Dr. Dobb's | CUDA, Supercomputing for the Masses: Part 13 | June 23, 2009

stefanba...@yahoo.com Sun, 02 Aug 2009 17:02:23 -0700

>Since these chips are not so common

Larrabee and G300 (GT300) are Intel's and NVIDIA's next-generation GPU
projects. Intel is open about the Larrabee architecture, and it is clear
that Larrabee is very different from today's GPU architecture. It is
closer to a massively multi-core, hyper-threaded PC; each core/thread
appears as a truly independent scalar processor (well, superscalar,
more accurately). Information about GT300 is more speculative and not
very reliable, so some of NVIDIA's "leaks" stating that it is going to
be a truly MIMD architecture sound to me more like marketing BS, similar
to the CUDA/MIMD hype. I definitely expect it to be much closer to MIMD,
yet I doubt that it will match Larrabee in terms of MIMD performance.
Still, I would not be surprised if Larrabee turns out to be slower than
G300 in existing games.


>I don't think there exists any open-source OS
> running on these processors yet.

Well, they are not available yet (expected next year, or maybe at the
end of this one), but Larrabee seems likely to be able to run, let's
say, Linux (x86) with minor modifications, and I'm sure that is not the
case for G300 (still, there is too little reliable information about
G300 to treat my personal opinion as indisputable fact).

>So the personal computers from
> Nvidia must be running some proprietary OS.....am I right?

"must be running" sounds more like a wishful joke ;o)))

--sb


On Aug 2, 1:52 am, Peter Teoh <htmldevelo...@gmail.com> wrote:
> Thank you for the comment.   This is interesting.   Since these chips
> are not so common, I don't think there exists any open-source OS
> running on these processors yet.   So the personal computers from
> Nvidia must be running some proprietary OS.....am I right?   Or are
> these running some form of modified Linux kernel, and perhaps better
> still, x86-compatible opcodes?
>
> On 8/2/09, stefanba...@yahoo.com <stefanba...@yahoo.com> wrote:
>
> > A GPU is blazingly fast only for algorithms that execute the same
> > instruction simultaneously on a massive array of numbers, a.k.a. SIMD
> > (Single Instruction, Multiple Data). A multi-core Intel/AMD CPU is a
> > MIMD machine (Multiple Instructions, Multiple Data); therefore,
> > algorithms with inherently divergent code paths run more efficiently
> > on MIMD, and SIMD-friendly algorithms are a better fit for today's
> > GPUs. The CUDA interface pretends to be MIMD, but that is VERY
> > misleading: once the threads of a warp are out of sync, performance
> > drops to the speed of a single thread, a dramatic slowdown. Ray
> > tracing is one example where a MIMD architecture is required to run
> > efficiently. The best multi-core CPU ray-tracing implementations
> > dramatically outperform the best GPU ray tracers for large numbers of
> > triangles, and the larger the number, the bigger the gap. GPU fans
> > tend to compare a single Core 2 with a GeForce 285 for ray tracing,
> > and as a rule the CPU code is just a port of GPU shaders written by
> > the authors of the GPU ray-tracing implementation; however, what is
> > optimal for CPU ray tracing is not optimal for the GPU, and vice
> > versa. Ray-tracing experts know these GPU/SIMD limitations all too
> > well, so NVIDIA worries about its future position in 3D, where ray
> > tracing is going to rule. Larrabee is going to be a kind of MIMD/SIMD
> > hybrid closer to pure MIMD, and G300 is a kind of hybrid closer to
> > SIMD. Actually, even today's G200 is not pure SIMD: it has several
> > independent SIMD units, but it is still really good only for
> > SIMD-friendly algorithms.
>
> > Stefan
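
The warp-divergence point in the message above can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not a measurement of any real GPU: it assumes a warp of 32 threads, that divergent code paths within a warp are executed one after another, and that an idealized MIMD machine gives every thread its own core. The function names, path labels, and the 100-cycle cost are made up for illustration.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA hardware


def simt_warp_cycles(paths, cost):
    """Cycles for one warp when each distinct code path is run in turn.

    paths: the (hypothetical) code path label taken by each thread
    cost:  cycles needed to execute each path once
    """
    return sum(cost[p] for p in set(paths))


def mimd_cycles(paths, cost):
    """Cycles when every thread runs on its own core (idealized MIMD)."""
    return max(cost[p] for p in paths)


# Assumption: every path costs the same 100 cycles.
cost = {p: 100 for p in range(WARP_SIZE)}

coherent = [0] * WARP_SIZE          # all threads follow the same path
divergent = list(range(WARP_SIZE))  # every thread follows its own path

coherent_simt = simt_warp_cycles(coherent, cost)
divergent_simt = simt_warp_cycles(divergent, cost)
divergent_mimd = mimd_cycles(divergent, cost)

print(coherent_simt, divergent_simt, divergent_mimd)  # 100 3200 100
```

In this model a fully divergent warp takes 32 times as long as a coherent one, which is the "speed of a single thread" worst case described above; real kernels usually fall somewhere in between.
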
>
> > On Jul 31, 6:20 pm, Peter Teoh <htmldevelo...@gmail.com> wrote:
> > > http://www.ddj.com/hpc-high-performance-computing/218100902
> > > CUDA, Supercomputing for the Masses: Part 13 (Page 1 of 4)
>
> > > Rob Farber
>
> > > Using texture memory in CUDA
>
> > > Rob Farber is a senior scientist at Pacific Northwest National 
> > > Laboratory. He has worked in massively parallel computing at several 
> > > national laboratories and as co-founder of several startups. He can be 
> > > reached at rmfar...@gmail.com.
>
> > > In CUDA, Supercomputing for the Masses: Part 12 of this article series on 
> > > CUDA, I took a quick detour to discuss some of the paradigm-changing 
> > > features of the latest CUDA Toolkit 2.2 release. This article resumes the 
> > > discussion of "texture memory" which I began in Part 11 of this series. In 
> > > addition, this installment includes information on the new CUDA Toolkit 
> > > 2.2 texture capability that allows some programs to eliminate extra 
> > > copies by providing the ability to write to global memory on the GPU that 
> > > has a 2D texture bound to it.
>
> > > From a C-programmer's perspective, texture memory provides an unusual 
> > > combination of cache memory (separate from register, global, and shared 
> > > memory), local processing capability (separate from the scalar 
> > > processors), and a way to interact with the display capabilities of the 
> > > GPU. This article focuses on the cache and local processor capabilities 
> > > of texture memory while the next column will discuss how to perform 
> > > viewable graphic operations with the GPU.
>
> > > Don't be put off from using texture memory because it is different and 
> > > has many options. The use of texture memory can improve performance for 
> > > both bandwidth- and latency-limited programs. For example, some programs 
> > > can exceed the maximum theoretical memory bandwidth of the underlying 
> > > global memory through judicious use of the texture memory cache. While 
> > > the latency of a texture cache reference is generally the same as DRAM, 
> > > there are some special cases that can deliver data with slightly less 
> > > than 100 cycles of latency. As usual in CUDA, the use of many threads can 
> > > hide memory access latency regardless of whether texture cache or global 
> > > memory is being accessed.
>
> > > For CUDA programmers, the most salient points about using texture memory 
> > > as a cache are: it is optimized for 2D spatial locality, very small 
> > > (effectively about 8KB per multiprocessor), and can provide a performance 
> > > benefit by having all the threads in a warp access nearby locations in 
> > > the texture (as demonstrated in Cache-Efficient Numerical Algorithms using 
> > > Graphics Hardware). Another tip from the forums is to pack data up if you 
> > > can because a single float4 texture read is faster than four 
> > > separate float texture reads.
>
> > > One ingenious mapping of a random-access data structure to texture memory 
> > > has been implemented by the CUDA-EC software. In the CUDA code, NVIDIA 
> > > implements a Bloom filter to test for set membership. The CUDA-EC software 
> > > is available for free download at http://cuda-ec.sourceforge.net/.
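
For readers who have not met a Bloom filter, here is a minimal CPU-side sketch of the idea. It is not the CUDA-EC code and makes no claim about how that software lays out its data; the class name, the table size, the number of hashes, and the sample keys are all assumptions chosen for illustration. The bit table is the randomly accessed structure, which is the kind of data the article describes as being mapped to texture memory.

```python
import hashlib


class BloomFilter:
    """Set-membership test that may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes table positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for key in ["ACGT", "TTGA", "GGCA"]:  # hypothetical sample keys
    bf.add(key)

print(bf.might_contain("ACGT"))  # True, because the key was added
```

Each lookup touches a few scattered table entries, so the access pattern is random rather than coalesced.
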
>
> > > The CUDA Toolkit 2.2 introduced the ability to write to pitch linear 
> > > memory on the GPU that has a 2D texture bound to it. In 
> > > other words, the data within the texture can be updated within a kernel 
> > > running on the GPU. This is a very nice feature because it allows many 
> > > codes to better utilize the caching behavior of texture memory while also 
> > > eliminating copies. One common example that immediately springs to mind 
> > > is a calculation that requires two passes through the data: one to 
> > > calculate a value (such as a mean or maximum) and a second pass to update 
> > > the data in place. Such calculations are common when changing the data 
> > > range or calculating probabilities. The use of an updatable texture can 
> > > potentially speed these types of calculations.
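
The two-pass pattern described above looks like this on the CPU. It is a plain Python sketch of the access pattern only, with a made-up function name and sample data; it does not use textures or any CUDA API.

```python
def normalize_in_place(values):
    """Rescale values so that the largest element becomes 1.0."""
    if not values:
        return values
    # Pass 1: reduce the data to a single value (here, the maximum).
    peak = max(values)
    if peak == 0:
        return values  # nothing to rescale
    # Pass 2: revisit every element and update it in place.
    for i in range(len(values)):
        values[i] = values[i] / peak
    return values


data = [2.0, 4.0, 8.0]
normalize_in_place(data)
print(data)  # [0.25, 0.5, 1.0]
```

Both passes read the same array, and the second pass writes back to it, which is the case where cached reads plus in-place updates avoid an extra copy.
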
>
> > > The cuBLAS library uses texture memory for many of the single-pass 
> > > calculations (sasum, sdot, etc.). However, comments in the source code 
> > > indicate that texture memory should not be used for vectors that are 
> > > short or those that are aligned and have unit stride and thus have nicely 
> > > coalesced behavior. (The source for the cuBLAS library and cuFFT is 
> > > available for those who have signed up as NVIDIA developers.)
>
> > > Texture cache is part of each TPC, here short for "Thread Processing 
> > > Cluster" since I am discussing operations in compute mode. (TPC stands for 
> > > "Texture Processing Cluster" in graphics mode, which I don't address in 
> > > this article.) Each TPC contains multiple streaming multiprocessors and a 
> > > single texture cache. It is important to note that in the GTX 200 series, 
> > > the texture cache serves three SMs (Streaming Multiprocessors) per TPC, 
> > > while in the G80/G92 architecture it serves only two.
>
> > > Figure 1 depicts a high-level view of the GeForce GTX 280 GPU in parallel 
> > > computing mode: A hardware-based thread scheduler at the top manages 
> > > scheduling threads across the TPCs, which include the texture caches and 
> > > memory interface units. The elements indicated as "atomic" refer to the 
> > > ability to perform atomic read-modify-write operations to memory. For 
> > > more information, please see the GeForce GTX 200 GPU Technical Brief.
>
> > > Figure 1: High-level view of GTX 280 Architecture (Courtesy NVIDIA).
>
> > > Figure 2 represents a lower-level view of a single TPC. Note that TF 
> > > stands for "Texture Filtering" and IU is the abbreviation for 
> > > "Instruction Unit".
>
> > > Figure 2: Lower-level view of a single GTX 280 TPC (Courtesy NVIDIA).
>
> > > Textures are bound to global memory and can provide both cache and some 
> > > processing capabilities. How the global memory was created dictates some 
> > > of the capabilities the texture can provide. For this reason, it is 
> > > important to distinguish between three memory types that can be bound to 
> > > a texture:
>
> > > Table 1: Distinguishing between memory types.
>
> --
> Regards,
> Peter Teoh
