http://www.eecg.toronto.edu/pact/workshops.html
The Seventeenth International Conference on Parallel Architectures and Compilation Techniques (PACT 2008)
Workshops and Tutorials
SATURDAY, OCTOBER 25

Morning:
- TUTORIAL: SimFlex and ProtoFlex: Fast, Accurate, and Flexible Simulation of Multicore Systems. Eric Chung ([EMAIL PROTECTED]), Mike Ferdman, Nikos Hardavellas
- TUTORIAL: Introducing Microthreading and its Programming Model. Thomas Bernard ([EMAIL PROTECTED]), Mike Lankamp ([EMAIL PROTECTED]), Chris Jesshope ([EMAIL PROTECTED]), Universiteit van Amsterdam. CANCELLED
- WORKSHOP: Workshop on Parallel Architectures and Bioinspired Algorithms. J. Ignacio Hidalgo ([EMAIL PROTECTED]), Universidad Complutense de Madrid

Afternoon:
- TUTORIAL: Programming Models and Compiler Optimizations for GPUs and Multi-Core Processors
- TUTORIAL: Productive Parallel Programming in PGAS. IBM

SUNDAY, OCTOBER 26

Morning:
- WORKSHOP: MEDEA: Workshop on MEmory performance: DEaling with Applications, systems and architecture. Sandro Bartolini ([EMAIL PROTECTED]), Università degli Studi di Siena; Pierfrancesco Foglia ([EMAIL PROTECTED]), Università di Pisa
- TUTORIAL: Transactional Memory. Presenters: Yang Ni, Adam Welc, Tatiana Shpeisman

Afternoon:
- WORKSHOP: WoSPS: Workshop on Soft Processor Systems
Descriptions and Links:
Tutorial #1:
SimFlex and ProtoFlex: Fast, Accurate, and Flexible Simulation of
Multi-Core Systems
http://www.ece.cmu.edu/~simflex
Computer architects have long relied on software simulation to evaluate the functionality and performance of architectural innovations. Unfortunately, modern cycle-accurate simulators are several orders of magnitude slower than real hardware, and the growing levels of hardware integration increase simulation complexity even further. In addition, conventional simulators are optimized for speed at the expense of code flexibility and maintainability. In this tutorial, we present the SimFlex and ProtoFlex family of simulation tools for fast, accurate, and flexible simulation of uniprocessor, multi-core and distributed shared-memory systems. SimFlex achieves fast simulation turnaround while ensuring representative results by leveraging the SMARTS simulation sampling framework. At the same time, its component-based design allows for easy composition of complex multi-core and multiprocessor systems. ProtoFlex is an FPGA-accelerated simulation technology that complements SimFlex by enabling full-system functional simulation of multiprocessor and multi-core systems at speeds one to two orders of magnitude faster than software tools.

In this tutorial, we first introduce attendees to the SMARTS simulation sampling approach. We present relevant background from statistics and compare and contrast statistical sampling with other sampling proposals. Second, we present the design, implementation and use of the Flexus simulator suite. Flexus is a family of component-based C++ architecture simulators that implement timing-accurate models of multi-core and multiprocessor systems. We give attendees hands-on experience with SimFlex/TraceFlex, a Flexus model for fast functional and memory system simulation; SimFlex/OoO, a Flexus model for cycle-accurate simulation; and Flexus's statistical managers and sampling tools. Finally, we present a hands-on technology preview of ProtoFlex. We give attendees the opportunity to compile, execute and profile multithreaded applications on a real operating system running on a BEE2 FPGA platform.
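The statistical machinery behind sampling-based simulation can be illustrated with a small sketch (this is not SimFlex code; the per-interval CPI values, parameters, and function names are hypothetical): instead of simulating every interval in detail, estimate the mean CPI from a small random sample of intervals and report a confidence interval for the estimate.

```python
import random
import statistics

def sample_mean_ci(population, n_samples, z=1.96, seed=0):
    """Estimate the population mean from a simple random sample,
    returning (mean, half-width of the ~95% confidence interval)."""
    rng = random.Random(seed)
    sample = rng.sample(population, n_samples)
    mean = statistics.mean(sample)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(sample) / (n_samples ** 0.5)
    return mean, z * sem

# Hypothetical per-interval CPI measurements from a long-running workload.
rng = random.Random(42)
cpi_per_interval = [rng.gauss(1.5, 0.2) for _ in range(100_000)]

# Detailed simulation of only 1% of the intervals already gives a
# tight estimate of the overall CPI.
est, half_width = sample_mean_ci(cpi_per_interval, n_samples=1_000)
true_mean = statistics.mean(cpi_per_interval)
print(f"estimate {est:.3f} +/- {half_width:.3f}, true {true_mean:.3f}")
```

The point of the sketch is the trade-off sampling exploits: the confidence half-width shrinks with the square root of the sample size, so a small, well-chosen sample bounds the error of a much cheaper simulation.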
Tutorial #2:
Introducing Microthreading and its Programming Model
At the University of Amsterdam, our research group developed a disruptive parallel programming model called the SANE Virtual Processor, or SVP model [1] [2] [3], which captures all the parallelism in the program in order to enable ideal performance speedup. The model is also deadlock-free and deterministic, and provides dynamic, adaptive concurrency management.

This half-day tutorial will introduce the audience to the SVP model. Following this, we will present an instantiation of the model that implements its actions as instructions in a microgrid of DRISC processors [4], and then the μTC language and its compiler, which is based on GCC. We will also show how we aim to parallelize C by translating C to μTC. Finally, we present the results we have achieved so far using our compiler and emulator, which show scalable speedup with the number of processors used across many orders of magnitude.
[1]
K. Bousias, L. Guang, C.R. Jesshope, M. Lankamp (2008), Implementation
and Evaluation of a Microthread Architecture, Submitted to: Journal of
Systems Architecture
[2] C. R. Jesshope (2007), A model for the design and programming of multicores, to be published in: Advances in Parallel Computing, IOS Press, Amsterdam
[3] C. R. Jesshope (2007), SVP and μTC - A dynamic model of concurrency
and its implementation as a compiler target, Report (unpublished)
[4] T. Bernard, K. Bousias, L. Guang, C. R. Jesshope, M. Lankamp, M. W.
van Tol and L. Zhang (2008), A general model of concurrency and its
implementation as many-core dynamic RISC processors, SAMOS 08
Tutorial #3:
Programming Models and Compiler Optimizations for GPUs and Multi-Core
Processors
On-chip parallelism with multiple cores is now ubiquitous. Because of power and cooling constraints, recent performance improvements in both general-purpose and special-purpose processors have come primarily from increased on-chip parallelism rather than increased clock rates. Parallelism is therefore of considerable interest to a much broader group than developers of parallel applications for high-end supercomputers. Several programming environments have recently emerged in response to the need to develop applications for GPUs, the Cell processor, and multi-core processors from AMD, IBM, Intel, etc. As commodity computing platforms all go parallel, programming these platforms to attain high performance has become an extremely important issue. There has been considerable recent interest in two complementary approaches:

- developing programming models that explicitly expose the programmer to parallelism; and
- compiler optimization frameworks that automatically transform sequential programs for parallel execution.
This
tutorial will provide an introductory survey covering both these
aspects. In contrast to conventional multicore architectures, GPUs and
the Cell processor have to exploit parallelism while managing the
physical memory on the processor (since there are no caches) by
explicitly orchestrating the movement of data between large off-chip
memories and the limited on-chip memory. This tutorial will address the
issue of explicit memory management in detail.
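The explicit memory management described above can be sketched in plain Python (a stand-in only: real GPU or Cell code would use DMA transfers or on-chip shared memory, and the buffer size and function names here are invented). The pattern is the same, though: a computation walks a large off-chip array tile by tile, explicitly copying each tile into a small local buffer before any arithmetic touches it.

```python
TILE = 256  # capacity of the hypothetical on-chip buffer, in elements

def staged_sum_of_squares(data):
    """Sum of squares computed tile by tile: each tile is explicitly
    copied into a small local buffer (standing in for on-chip memory)
    before being used, mimicking software-managed scratchpads."""
    total = 0
    for start in range(0, len(data), TILE):
        local = data[start:start + TILE]    # explicit "DMA" into the buffer
        total += sum(x * x for x in local)  # compute only on staged data
    return total

data = list(range(1000))
assert staged_sum_of_squares(data) == sum(x * x for x in data)
```

The compiler optimizations surveyed in the tutorial automate exactly this orchestration: choosing tile sizes that fit the on-chip memory and scheduling the copies to overlap with computation.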
Tutorial #4:
Productive Parallel Programming in PGAS
http://domino.research.ibm.com/comm/research_projects.nsf/pages/xlupc.confs.html
Partitioned
Global Address Space (PGAS) programming languages offer an attractive,
high-productivity programming model for parallel programming. PGAS
languages, such as Unified Parallel C (UPC), combine the simplicity of
shared-memory programming with the efficiency of the message-passing
paradigm. The efficiency is obtained through a combination of factors:
programmers declare how the data is partitioned and distributed between
threads and use the SPMD programming model to define work; compilers
can use the data annotations to optimize accesses and communication. We
have demonstrated that UPC applications can outperform MPI applications
on large-scale machines, such as BlueGene/L.
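The data-partitioning idea can be sketched as follows (a hypothetical illustration, not UPC or XLUPC code; the function names are invented): under a block-cyclic layout like UPC's `shared [block]` arrays, each SPMD thread can compute which elements it owns and restrict its work to those, which is what lets the compiler turn local accesses into cheap loads and stores.

```python
def owner(i, block, nthreads):
    """Thread that owns element i of a block-cyclically distributed
    array with block size `block` (cf. UPC's `shared [block]` layout)."""
    return (i // block) % nthreads

def my_indices(tid, n, block, nthreads):
    """Indices of an n-element array that thread `tid` owns: the
    'work where the data lives' rule of SPMD/PGAS programs."""
    return [i for i in range(n) if owner(i, block, nthreads) == tid]

# 10 elements, block size 2, 2 threads: blocks [0,1] [2,3] [4,5]
# [6,7] [8,9] are dealt out to threads 0, 1, 0, 1, 0 in turn.
assert my_indices(0, 10, 2, 2) == [0, 1, 4, 5, 8, 9]
assert my_indices(1, 10, 2, 2) == [2, 3, 6, 7]
```
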
In
this tutorial we shall present our work on the IBM's XLUPC Compiler. We
will discuss language issues, compiler optimizations for PGAS
languages, runtime trade-offs for scalability and performance results
obtained on a number of benchmarks and applications. Attendants should
not only gain a better understanding of parallel programming, but also
learn about compiler and system limitations. The expected outcome is
that programmers will be able to code their applications such that
performance optimization opportunities are exposed and exploited.
Tutorial #5:
Transactional Memory
Transactions
have recently emerged as a promising alternative to lock-based
synchronization. The tutorial will cover a range of topics related to
transactional memory spanning from the description of high-level
language constructs and their semantics to the low-level details of
specific algorithms used to support efficient execution of these
constructs. We will take a programming systems view of transactional
memory and walk the audience through each layer of the system starting
from the top-level programmer's view of transactional memory and
working down to the implementation level. We show how transactional
memory can avoid the problems of lock-based synchronization such as
deadlock and poor scalability when lock-based software modules are
composed. We discuss how transactional constructs can be added to
languages, such as C/C++ or Java, as an alternative to current
synchronization constructs. We present software strategies for
implementing transactional memory and show how to leverage compiler
optimizations to reduce its overheads. We also describe our experience writing transactional applications and present experimental results comparing their performance with that of lock-based versions. Finally, we discuss advanced topics related to the semantics of transactional language constructs, including isolation levels and integration with language memory models.
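The optimistic execution style underlying many transactional memory implementations can be sketched as follows (an illustrative model only, not Intel's TM compiler or runtime; `Cell`, `cas` and `atomic_add` are invented names): each "transaction" reads shared state, computes, and attempts to commit with a compare-and-swap, retrying from the start on conflict instead of holding a lock.

```python
import threading

class Cell:
    """A single shared word with a compare-and-swap primitive
    (modelled here with a lock; hardware provides CAS directly)."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        return self._value

    def cas(self, expected, new):
        """Atomically set the value to `new` iff it equals `expected`."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def atomic_add(cell, delta):
    """Optimistic 'transaction': read, compute, try to commit;
    on conflict (CAS failure) retry from the beginning."""
    while True:
        old = cell.load()
        if cell.cas(old, old + delta):
            return

counter = Cell(0)
threads = [threading.Thread(target=lambda: [atomic_add(counter, 1)
                                            for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.load())  # 4000: no increment is lost despite the races
```

Because a failed commit simply retries, the composability problems of locks (such as deadlock from inconsistent lock ordering across modules) do not arise in this style, which is the property the tutorial develops in depth.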
Yang Ni is a Research Scientist
in Intel's Programming Systems Lab. He has been working on programming
languages for platforms ranging from mobile devices to chip multiprocessors.
His current research focuses on transactional memory. He is a major
contributor to the Intel C/C++ TM compiler. Yang received his Ph.D. in
Computer Science from
Adam Welc is a Research Scientist
in Intel's Programming Systems Lab. His work is in the area of
programming language design and implementation, with specific interests
in concurrency control, compiler and run-time system optimizations,
transactional processing as well as architectural support for
programming languages and applications. Adam received the Master of
Science in Computer Science from Poznan University of Technology,
Poland, in July 1999. He continued his graduate studies at
Tatiana Shpeisman is a Research Scientist
in Intel's Programming Systems Lab. Her general research interest lies
in finding ways to simplify software development while improving
program efficiency. Her current research focuses on the semantics of
transactional memory. In the past, she worked on dynamic compilation
for managed runtime environments, IPF code generation and compiler
support for sparse matrix computations. She holds Ph.D. in Computer
Science from
Workshop #1:
Workshop on Parallel Architectures and Bioinspired Algorithms
http://atcadmin.dacya.ucm.es/bioinspired/
Parallel computer architecture and bioinspired algorithms have been coming together in recent years. On one hand, applying bioinspired algorithms to difficult problems has shown that they demand substantial computation power and communication technology, and parallel architectures and distributed systems have offered an interesting alternative to their sequential counterparts. On the other hand, and perhaps more interestingly for the computer architecture community, bioinspired algorithms comprise a family of heuristics that can help optimize a wide range of tasks required for parallel and distributed architectures to work efficiently. Genetic Algorithms (GAs), Genetic Programming (GP), Ant Colony Algorithms (ACAs) and Simulated Annealing (SA) are nowadays helping computer designers advance computer architecture, while improvements in parallel architectures are making it possible to run compute-intensive bioinspired algorithms to solve other difficult problems. The literature contains several evolutionary solutions to design problems such as partitioning and place-and-route, which enable technology improvements. Researchers have also used this kind of meta-heuristic to optimize computer architectures, balance computational load, tune instruction code, and address other related problems. Nevertheless, any effort to strengthen the relationship between the two fields would be very welcome in the community. This workshop will gather scientists, engineers, and practitioners to share and exchange their experiences, discuss challenges, and report state-of-the-art and in-progress research on all aspects of two questions: What can bioinspired algorithms do for parallel computer architectures? And what can parallel computer architectures do for bioinspired algorithms?
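As a concrete illustration of the kind of heuristic the workshop covers, here is a minimal genetic algorithm for the toy OneMax problem (maximize the number of 1 bits in a string); all parameters and function names are invented for illustration, and real architecture-design uses would substitute a domain-specific fitness function such as partitioning cost or placement quality.

```python
import random

def onemax_ga(bits=32, pop_size=20, generations=60, seed=1):
    """Toy genetic algorithm for OneMax: tournament selection,
    one-point crossover, and per-bit mutation. Returns the best
    fitness (number of 1 bits) in the final population."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(bits)] for _ in range(pop_size)]
    fitness = sum  # fitness of a bitstring = its number of ones

    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            # Tournament selection: the fitter of two random individuals.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            cut = rng.randrange(1, bits)           # one-point crossover
            child = p1[:cut] + p2[cut:]
            for i in range(bits):                  # per-bit mutation
                if rng.random() < 1.0 / bits:
                    child[i] ^= 1
            nxt.append(child)
        pop = nxt
    return max(fitness(ind) for ind in pop)

best = onemax_ga()
print(best)  # best fitness found; the optimum is 32
```
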
Workshop #2:
WoSPS: Workshop on Soft Processor Systems
http://www.eecg.toronto.edu/wosps08/
Processors implemented in programmable logic, called soft processors, are becoming increasingly important in both industry and academia. FPGA-based processors provide an easy way for software programmers to target FPGAs without having to write in a hardware description language; hence designers of FPGA-based embedded systems are increasingly including soft processors in their designs. Soft processors will also likely play an important role in FPGA-based co-processors for high-performance computing. Furthermore, academics are embracing FPGA-based processors as the foundation of systems for faster architectural simulation. In all cases, we need to develop a deeper understanding of processor and multiprocessor architecture for this new medium.
This workshop will serve
as a forum for academia and industry to discuss and present challenges,
ideas, and recent developments in soft processors, soft
multiprocessors, application-specific soft processors, and
soft-processor-based accelerators and architectural simulation
platforms.
Workshop #3:
MEDEA: Workshop on MEmory performance: DEaling with Applications,
Systems and Architecture
http://garga.iet.unipi.it/medea08/
MEDEA aims to continue the high level of interest of the previous editions, held with the PACT conference since 2000.

Due to the ever-increasing gap between CPU and memory speed, there is always great interest in evaluating and proposing processor, multiprocessor, CMP, multi-core and system architectures that deal with the "memory wall" and wire-delay problems. At the same time, a modular high-level design is becoming more and more attractive as a way to reduce design costs.
In this scenario, design
solutions and their corresponding performance are shaped by the
combined pressure of a) technological opportunities and limitations, b)
features and organization of system architecture and c) critical
requirements of specific application domains. Evaluating and
controlling the effects on the memory subsystem (e.g. caches,
interconnection, bus, memory, coherence) of any architectural proposal
is extremely important both from the performance (e.g. bandwidth,
latency, predictability) and power (e.g. static, dynamic,
manageability) points of view.
In particular, the emerging trend of single-chip multi-core solutions will push towards new design principles for memory hierarchies and interconnection networks, especially when the design is aimed at building systems with a high number of cores (many-core instead of multi-core systems), which aim to scale performance and power efficiency in a variety of application domains.

From a slightly different point of view, the mutual interaction between the application's behavior and the system on which it executes determines the figures of merit of the memory subsystem and, therefore, pushes towards specific solutions. In addition, it can suggest specific compile- or link-time tunings for adapting the application to the features of the target architecture.
In the overall picture,
power consumption requirements are increasingly important cross-cutting
issues and raise specific challenges.
Typical architectural
choices of interest include, but are not limited to, single processors,
chip and board multiprocessors, SoC, traditional and tiled/clustered
architectures, multithreaded or VLIW architectures with emphasis on
single-chip design, massive parallelism designs, heterogeneous
architectures, architectures equipped with application-domain
accelerators as well as endowed with reconfigurable modules.
Application domains encompass embedded (e.g. multimedia, mobile,
automotive, automation, medical), commercial (e.g. Web, DB,
multimedia), scientific and networking applications, security, etc. Emerging network-on-chip infrastructures and transactional memory may suggest new solutions and issues.
The MEDEA workshop wants to continue to be a forum for academic and industrial people to meet, discuss and exchange their ideas, experience and solutions in the design and evaluation of architectures for embedded, commercial and general/special-purpose systems, taking memory issues into account both directly and indirectly.

Proceedings of the workshop will be published under an ACM ISBN and will also appear in the ACM Digital Library. As in previous years, a selection of papers will be considered for publication in Transactions on HiPEAC (http://www.hipeac.net/journal).
The format of the
workshop includes the presentation of selected papers and discussion
after each presentation.