Author: Hakan Ardo <[email protected]>
Branch: extradoc
Changeset: r4585:4887f7fc2e99
Date: 2012-08-15 19:43 +0200
http://bitbucket.org/pypy/extradoc/changeset/4887f7fc2e99/
Log: describe the scimark benchmarks
diff --git a/talk/dls2012/licm.pdf b/talk/dls2012/licm.pdf
index d0e3ca21bc58e605bbf333d46f6acdc18de2a29d..d44c6adbc9741258517f595c681611569b3e9240
GIT binary patch
[cut]
diff --git a/talk/dls2012/paper.tex b/talk/dls2012/paper.tex
--- a/talk/dls2012/paper.tex
+++ b/talk/dls2012/paper.tex
@@ -63,7 +63,7 @@
\newboolean{showcomments}
-\setboolean{showcomments}{true}
+\setboolean{showcomments}{false}
\ifthenelse{\boolean{showcomments}}
{\newcommand{\nb}[2]{
\fbox{\bfseries\sffamily\scriptsize#1}
@@ -931,8 +931,9 @@
we see improvements in several cases. The ideal loop for this optimization
is short and contains numerical calculations with no failing guards and no
external calls. Larger loops involving many operations on complex objects
-typically benefit less from it. Loop peeling never makes runtime performance worse, in
-the worst case the peeled loop is exactly the same as the preamble. Therefore we
+typically benefit less from it. Loop peeling never makes the generated code worse; in
+the worst case the peeled loop is exactly the same as the preamble.
+Therefore we
chose to present benchmarks of small numeric kernels where loop peeling can show
its use.
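As a rough illustration of the transformation discussed above (a minimal
Python-level sketch; PyPy performs the peeling on traces, and all names
here are ours, not the paper's):

    def expensive_invariant(a):
        # stands in for loop-invariant work the optimizer can hoist
        return sum(a)

    def original(a, n):
        s = 0
        for i in range(n):
            t = expensive_invariant(a)  # recomputed every iteration
            s += t * i
        return s

    def peeled(a, n):
        s = 0
        if n > 0:
            # preamble: the first iteration, peeled off the loop
            t = expensive_invariant(a)  # now computed only once
            s += t * 0
        for i in range(1, n):
            # peeled loop: reuses the invariant value t; in the worst
            # case this loop is identical to the preamble
            s += t * i
        return s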
@@ -983,30 +984,30 @@
\subsection{Python}
The Python interpreter of the RPython framework is a complete Python
version 2.7 compatible interpreter. A set of numerical
-calculations were implemented in both Python and in C and their
+calculations were implemented in Python, C and Lua and their
runtimes are compared in Figure~\ref{fig:benchmarks}. The benchmarks are
\begin{itemize}
-\item {\bf sqrt}: approximates the square root of $y$. The approximation is
+\item {\bf sqrt}$\left(T\right)$: approximates the square root of $y$. The approximation is
initialized to $x_0=y/2$ and the benchmark consists of a single loop updating this
approximation using $x_i = \left( x_{i-1} + y/x_{i-1} \right) / 2$ for $1\leq i < 10^8$.
Only the latest calculated value $x_i$ is kept alive as a local variable within the loop.
There are three different versions of this benchmark where $x_i$
- is represented with different type of objects: int's, float's and
+ is represented by different types of objects, $T$: ints, floats and
Fix16s. The latter, Fix16, is a custom class that implements
fixpoint arithmetic with 16 bits of precision. In Python there is only
a single implementation of the benchmark that gets specialized
depending on the class of its input argument, $y$, while in C,
there are three different implementations.
-\item {\bf conv3}: one-dimensional convolution with fixed kernel-size $3$. A single loop
+\item {\bf conv3}$\left(n\right)$: one-dimensional convolution with fixed kernel-size $3$. A single loop
is used to calculate a vector ${\bf b} = \left(b_1, \cdots, b_n\right)$ from a vector
${\bf a} = \left(a_1, \cdots, a_n\right)$ and a kernel ${\bf k} = \left(k_1, k_2, k_3\right)$ using
$b_i = k_3 a_i + k_2 a_{i+1} + k_1 a_{i+2}$ for $1 \leq i \leq n$. Both the output vector, $\bf b$,
and the input vectors, $\bf a$ and $\bf k$, are allocated prior to running the benchmark. It is executed
with $n=10^5$ and $n=10^6$.
-\item {\bf conv5}: one-dimensional convolution with fixed kernel-size $5$. Similar to conv3, but with
+\item {\bf conv5}$\left(n\right)$: one-dimensional convolution with fixed kernel-size $5$. Similar to conv3, but with
${\bf k} = \left(k_1, k_2, k_3, k_4, k_5\right)$. The enumeration of the elements in $\bf k$ is still
hardcoded into the implementation, making the benchmark consist of a single loop too.
-\item {\bf conv3x3}: two-dimensional convolution with kernel of fixed
+\item {\bf conv3x3}$\left(n\right)$: two-dimensional convolution with kernel of fixed
size $3 \times 3$ using a custom class to represent two-dimensional
arrays. It is implemented as two nested loops that iterate over the elements of the
$n\times n$ output matrix ${\bf B} = \left(b_{i,j}\right)$ and calculate each element from the input matrix
@@ -1021,12 +1022,12 @@
\end{equation}
for $1 \leq i \leq n$ and $1 \leq j \leq n$.
The memory for storing the matrices is again allocated outside the benchmark and $n=1000$ was used.
-\item {\bf dilate3x3}: two-dimensional dilation with kernel of fixed
+\item {\bf dilate3x3}$\left(n\right)$: two-dimensional dilation with kernel of fixed
size $3 \times 3$. This is similar to convolution but instead of
summing over the terms in Equation~\ref{eq:convsum}, the maximum over those terms is taken. That places an
external call to a max function within the loop that prevents some
of the optimizations.
-\item {\bf sobel}: a low-level video processing algorithm used to
+\item {\bf sobel}$\left(n\right)$: a low-level video processing algorithm used to
locate edges in an image. It calculates the gradient magnitude
using Sobel derivatives. A Sobel x-derivative, $D_x$, of an $n \times n$ image, ${I}$, is formed
by convolving ${I}$ with
@@ -1050,11 +1051,31 @@
on top of a custom two-dimensional array class.
It is
a straightforward implementation providing two-dimensional
-indexing with out of bounds checks. For the C implementations it is
+indexing with out-of-bounds checks and
+data stored in row-major order.
+For the C implementations it is
implemented as a C++ class. The other benchmarks are implemented in
plain C. All the benchmarks except sqrt operate on C double-precision floating
point numbers, both in the Python and the C code.
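For concreteness, a minimal Python sketch of the float version of the sqrt
kernel described above (the function name is ours; the real benchmark runs
$10^8$ iterations and also has int and Fix16 variants):

    def sqrt_approx(y, iterations=10**8):
        x = y / 2.0                # x_0 = y / 2
        for _ in range(1, iterations):
            x = (x + y / x) / 2.0  # x_i = (x_{i-1} + y/x_{i-1}) / 2
        return x

    # A single Python implementation is specialized on the type of y,
    # e.g. sqrt_approx(1234.0) exercises the float version.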
+In addition we ported the
+SciMark\footnote{\texttt{http://math.nist.gov/scimark2/}} benchmarks to
+Python, and compared their runtimes with the already existing Lua and C implementations.
+This port was performed after the release of the PyPy version used to run the benchmarks, which means that
+these benchmarks have not influenced the PyPy implementation.
+SciMark consists of:
+
+\begin{itemize}
+\item {\bf SOR}$\left(n, c\right)$: Jacobi successive over-relaxation on an $n\times n$ grid repeated $c$ times.
+The same custom two-dimensional array class as described above is used to represent
+the grid (a minimal sketch of this kernel follows the list).
+\item {\bf SparseMatMult}$\left(n, z, c\right)$: matrix multiplication between an $n\times n$ sparse matrix,
+stored in compressed-row format, and a full storage vector, stored in a normal array. The matrix has $z$ non-zero elements and the calculation is repeated $c$ times.
+\item {\bf MonteCarlo}$\left(n\right)$: Monte Carlo integration by generating $n$ points uniformly distributed over the unit square and computing the ratio of those within the unit circle.
+\item {\bf LU}$\left(n, c\right)$: computes the LU factorization of an $n \times n$ matrix. The rows of the matrix are shuffled, which makes the previously used two-dimensional array class unsuitable. Instead a list of arrays is used to represent the matrix. The calculation is repeated $c$ times.
+\item {\bf FFT}$\left(n, c\right)$: Fast Fourier Transform of a vector with $n$ elements, represented as an array, repeated $c$ times.
+\end{itemize}
+
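As referenced in the SOR item above, a minimal Python sketch of that kernel
(using plain nested lists instead of the paper's array class; the update
formula follows the standard SciMark SOR inner loop, which is our assumption,
not spelled out in the text):

    def sor(G, omega, c):
        # over-relaxation sweeps over the interior points of the n x n
        # grid G (a list of lists of floats), repeated c times
        n = len(G)
        for _ in range(c):
            for i in range(1, n - 1):
                for j in range(1, n - 1):
                    G[i][j] = (0.25 * omega * (G[i - 1][j] + G[i + 1][j] +
                                               G[i][j - 1] + G[i][j + 1]) +
                               (1.0 - omega) * G[i][j])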
Benchmarks were run on an Intel i7 M620 @2.67GHz with 4M cache and 8G of RAM
using Ubuntu Linux 11.04 in 32bit mode.
The machine was otherwise unoccupied. We use the following software