Novel Architectures

As microprocessor manufacturers find it increasingly difficult to keep up with Moore's law, the HPC community are beginning to look seriously at the performance gains promised by a number of novel architectures. This page gives an overview of some of these architectures and presents some early performance comparisons between them and more conventional processors.

FPGAs

These reconfigurable processors require significantly less power than conventional processors and could significantly increase compute density in HPC systems. Because FPGAs provide access to completely reconfigurable logic, the potential performance increases they offer are huge: speedups of well over 100x over conventional processors have been reported for certain applications. The trouble is that not all applications are well suited to FPGAs. This is particularly true of applications that make heavy use of double precision floating point, because basic double precision floating point cores consume large amounts of logic. The devices are also very difficult to program efficiently without extensive hardware design experience. A number of HPC vendors, including Cray and SGI, now produce systems designed to accommodate FPGAs as co-processors. For more information about FPGAs visit our FPGA page.

Cell

The Cell Broadband Engine from IBM/Toshiba/Sony has been designed primarily for the Sony PlayStation 3 games console and will therefore be produced in very large volumes. The hope is that this will make the Cell an affordable option for large HPC systems. The Cell processor itself is made up of nine processors operating on a shared, coherent memory. The first generation of Cell has a single Power Architecture-based control processor (PPU) and eight SIMD Synergistic Processor Units (SPUs), but different configurations are likely to emerge.

IBM's Cell Broadband Engine resource center can be found at http://www-128.ibm.com/developerworks/power/cell/
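As a loose analogy for the one-control-processor, eight-worker layout described above, the sketch below has a controlling thread split an in-place vector scaling across eight worker threads that all see the same list. It is not Cell SDK code and says nothing about how SPUs are really programmed; the names NUM_WORKERS, scale_chunk and control_scale are hypothetical and exist only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 8  # assumption: stands in for the eight SPUs of a first-generation Cell


def scale_chunk(data, start, stop, factor):
    """Worker task: scale one contiguous slice of the shared list in place."""
    for i in range(start, stop):
        data[i] *= factor


def control_scale(data, factor, workers=NUM_WORKERS):
    """Control task: partition the index range and hand one slice to each worker."""
    n = len(data)
    # Integer arithmetic gives contiguous, non-overlapping slices covering 0..n.
    bounds = [(w * n // workers, (w + 1) * n // workers) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(scale_chunk, data, lo, hi, factor) for lo, hi in bounds]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return data
```

Because the slices do not overlap, the workers never write to the same element, so no locking is needed in this sketch.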

ClearSpeed

ClearSpeed produces the Advance Accelerator PCI-X and PCIe boards, which work by offloading compute-intensive math library routines called by applications running on the host processor. ClearSpeed's website reports that its CSX600 co-processor provides 25 GFLOPS of sustained single or double precision floating point performance while dissipating a maximum of 10 Watts (25 Watts per board). The CSX600 is a system-on-a-chip (SoC) with predefined functionality that, unlike an FPGA, cannot be reconfigured. What the chip loses to FPGAs in flexibility it more than makes up for in usability: applications that already use standard math libraries (level 3 BLAS and FFTW) should work on these cards without the need to port code.

More recently ClearSpeed have introduced so-called CATS units which have twelve boards packed into one 1U server. We've got two CATS attached to our cseem64t cluster.
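The figures quoted above can be turned into a rough efficiency estimate. The sketch below uses only the vendor numbers given on this page (25 GFLOPS and 10 Watts per CSX600, 25 Watts per board, twelve boards per CATS unit); the variable and function names are made up for illustration, and the CATS figure is simply the sum of the quoted board maxima, not a measured value, and excludes the host server itself.

```python
# Vendor-reported figures quoted on this page (not measured here).
CHIP_GFLOPS = 25.0     # sustained GFLOPS per CSX600
CHIP_WATTS = 10.0      # maximum power per CSX600
BOARD_WATTS = 25.0     # maximum power per board
BOARDS_PER_CATS = 12   # boards in one 1U CATS unit


def gflops_per_watt(gflops, watts):
    """Simple performance-per-watt ratio."""
    return gflops / watts


# Efficiency of a single chip at its quoted maximum power.
chip_efficiency = gflops_per_watt(CHIP_GFLOPS, CHIP_WATTS)  # 2.5 GFLOPS per Watt

# Upper bound on the power drawn by the boards in one CATS unit.
cats_board_power_watts = BOARD_WATTS * BOARDS_PER_CATS  # 300.0 Watts
```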

General-Purpose computation on GPUs

With the increasing programmability of commodity graphics processing units (GPUs), these chips are now considered to be useful for performing more than the specific graphics computations for which they were designed. They are now seen by some as capable coprocessors, useful for a variety of applications including scientific computing.

http://www.gpgpu.org/ catalogs the current and historical use of GPUs for general-purpose computation.

DGEMM Performance

This bar chart provides an indication of DGEMM performance in Gflop/s for a number of conventional and novel architectures.

The Cell (simulation), Virtex II Pro FPGA and Cray X1E vector processor all achieve in the region of 15 Gflop/s sustained DGEMM performance. This is two to three times the performance offered by current Itanium and Opteron processors. It is likely that the latest Xilinx Virtex 4 and Virtex 5 FPGAs would significantly outperform the Virtex II Pro (approximately a 3x speedup on the largest chips). The ClearSpeed CSX600 processor sustains 25 Gflop/s, roughly 1.7 times the DGEMM performance of Cell, but you would expect this from a dedicated floating point co-processor when compared with more general-purpose chips. Finally, the Cell+ is an optimized version of the Cell architecture proposed by a team at Lawrence Berkeley National Laboratory in the US. Simulations based on a performance model for the Cell+ indicated that it could achieve 51 Gflop/s for DGEMM.

Data for the Cell and Cell+ are performance predictions taken from "The Potential of the Cell Processor for Scientific Computing", Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick, May 2006. http://www.cs.berkeley.edu/~samw/projects/cell/CF06.pdf
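For readers unfamiliar with how a DGEMM Gflop/s figure is derived, the following is a minimal sketch, assuming the conventional count of 2*m*n*k floating point operations for multiplying an m x k matrix by a k x n matrix. The function names are hypothetical, and the pure Python multiply is only a stand-in for a tuned BLAS DGEMM, so the rate it reports will be far below any figure quoted on this page.

```python
import time


def dgemm_flops(m, n, k):
    """Conventional flop count for C = A * B with A (m x k) and B (k x n):
    one multiply and one add per inner-loop step."""
    return 2 * m * n * k


def naive_dgemm(a, b):
    """Reference matrix multiply on lists of lists of Python floats (doubles)."""
    b_columns = list(zip(*b))  # transpose b so each column is a tuple
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns] for row in a]


def measure_gflops(n=64):
    """Time one n x n multiply and convert the result to Gflop/s."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    start = time.perf_counter()
    naive_dgemm(a, b)
    elapsed = time.perf_counter() - start
    return dgemm_flops(n, n, n) / elapsed / 1e9
```

Calling measure_gflops() with a larger n gives a steadier timing; a real benchmark would also repeat the run several times and report the best or median rate.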

[linuxkernelnewbies] DisCo - The Distributed Computing Group at Daresbury Laboratory
