Re: Implementing let-over-lambda chapter 7 badger network for sorting

Christian Schuhegger Sun, 08 Aug 2010 06:31:29 -0700

Hello,

after my first attempts with clojure I needed to evaluate the clojure
results against pure Java results.


First to make my report complete here are my OS and JVM details. I run
my tests on an Ubuntu 64bit 10.04 on Intel i7 3GHz hardware with a
custom compiled kernel using the BFS scheduler.
-- snip start --
Linux i7 2.6.34-custom-201005231602 #1 SMP PREEMPT Sun May 23 16:06:01
CEST 2010 x86_64 GNU/Linux

java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

-Xmx512m -Xms512m -server -XX:+DisableExplicitGC -XX:+AggressiveOpts -
XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=60 -
XX:ParallelGCThreads=5 -XX:MaxGCPauseMillis=10 -XX:+UseParNewGC -XX:
+CMSParallelRemarkEnabled -XX:+CMSParallelSurvivorRemarkEnabled -XX:
+UseNUMA -XX:+UseTLAB -XX:+BindGCTaskThreadsToCPUs -XX:
+UseGCTaskAffinity -XX:-DontCompileHugeMethods
-- snip end --

What I did is to convert my clojure program into a Java source code
generator that create a sort implementation like this:

-- snip start --
import java.util.Comparator;

public class BatcherSortNetwork50 {

        private static <T> void swap(T[] input, int i1, int i2) {
                T tmp = input[i1];input[i1] = input[i2];input[i2] = tmp;
        }

        private static <T> void cas(T[] input, int i1, int i2, Comparator<T>
comparator) {
                if(1 == comparator.compare(input[i1],input[i2])) swap(input, i1,
i2);
        }

        public static <T> T[] sort(T[] input, Comparator<T> comparator) {
                cas(input, 0, 32, comparator);
                cas(input, 1, 33, comparator);
                cas(input, 2, 34, comparator);
...
-- snip end --

In addition I use the benchmark code from ellipticgroup described
here:
http://ellipticgroup.com/html/benchmarkingArticle.html
http://www.ibm.com/developerworks/java/library/j-benchmark1.html
http://www.ibm.com/developerworks/library/j-benchmark2/index.html

The test always works the same:
1) initially I create an array of random items with length 50.
2) then I iterate over the following block:
3a) clone the array
3b) sort the array

For all tests I use the same initial random array, e.g. all sort
implementations have the same piece of work to perform.

The measurements show that the cloning of the array takes roughly
60ns (e.g. this is negligible).

The Arrays.sort implementation needs 2.628us.

The batcher sort implementation needs 2.895us.

All in all the two versions are  more or less equal. But that was not
satisfactory, because in the end I wanted to get a performance
improvement and not an implementation that is more or less on par with
the provided generic implementation.

Therefore I modified the clojure source code generator in such a way
that it creates C code. I compared the performance of the qsort libc
implementation against my batcher implementation. The code in C looks
like this:

-- snip start --
static int compare_doubles (const double *x,const double *y) {
  if (*x > *y)
    return 1;
  else if (*x < *y)
    return -1;
  else
    return 0;
}

static void batcher_sort_50(double a[]) {
  double tmp;
  if(1 == compare_doubles(&(a[0]),&(a[32]))) {tmp = a[0]; a[0] =
a[32]; a[32]=a[0];}
  if(1 == compare_doubles(&(a[1]),&(a[33]))) {tmp = a[1]; a[1] =
a[33]; a[33]=a[1];}
...
-- snip end --

The C code is not 100% comparable to the Java code, because in C I
only sort double arrays and in Java I compare Object
arrays. Nevertheless the results look much more promising:

qsort       : 2.880us
batcher sort: 0.315us

Finally! I have my factor of 10 performance improvement that I was
looking for. The qsort implementation seems to be on the same level as
the Java Arrays.sort() implementation but the C version of my batcher
sort is much faster. I would be interested to hear about ideas on how
to explain why the Java batcher version cannot keep up with the C
version?

Also interesting to note:
Initially my Java implementation of the batcher sort looked the same
as the C version, e.g. I did not call the cas() method but had one
large message doing everything "inline". In that case my batcher sort
was a factor of 10 slower than the Arrays.sort() implementation and
when looking at the output of "-XX:+PrintCompilation" I saw a message:
  "COMPILE SKIPPED: out of nodes parsing method"
The reason why my first Java version was slower had something to do
with not compiling the method, even if I explicitly said
"-XX:-DontCompileHugeMethods". No idea what happened here.

I used the Sun Studio collect/analyzer pair to identify the root cause
why the Clojure version cannot keep up with the native Java
implementation. I saw several calls to the reflection API in the
different call stacks even if *warn-on-reflection* did not warn me
about anything in my code. I have to dig deeper here.

What follows is the complete output of the ellipticgroup benchmark
classes plus my Callable implementations for the different tests. If
anybody is interested I can provide full source code of my tests via
mail.

Initially I used a loop count of 1 so that "dead code optimizations"
could not simply throw away the loop, but the results were more or
less the same as with a loop count of 10000.

--
loopcount = 10000;

        public static class ArrayCloneCallable implements Callable<Quote[]> {

                protected final Quote[] quotes;

                public ArrayCloneCallable(Quote[] quotes) {
                        this.quotes = quotes;
                }

                public Quote[] call() throws Exception {
                        int i = 0;
                        Quote[] quotes1 = null;
                        for(i=0; i < loopcount; i++) {
                                quotes1 = quotes.clone();
                        }
                        return quotes1;
                }

        }

clone: action statistics: first = 321.989 ns, mean = 61.017 ns (CI
deltas: -74.780 ps, +145.937 ps), sd = 1.809 us (CI deltas: -690.644
ns, +978.646 ns) WARNING: EXECUTION TIMES HAVE EXTREME OUTLIERS, SD
VALUES MAY BE INACCURATE
        ----------
        --the action statistics were calculated from block statistics
        --each block measured 2048 task executions
        --the user says that task internally performs m = 10000
actions
        --then the number of actions per block measurement is a =
20480000
        --block statistics: mean = 1.250 s (CI deltas: -1.531 ms,
+2.989 ms), sd = 8.185 ms (CI deltas: -3.125 ms, +4.429 ms)
        --the forumla used to convert block statistics to action
statistics (mean scales as 1/a, sd scales as 1/sqrt(a)) assumes that
the action execution times are iid
        ----------
        --each confidence interval (CI) is reported as either +-
deltas from the point estimate, or as a closed interval ([x, y])
        --each confidence interval has confidence level = 0.95
        ----------
        --EXECUTION TIMES APPEAR TO HAVE OUTLIERS
        --this was determined using the boxplot algorithm with median
= 1.247 s, interquantileRange = 1.678 ms
        --6 are EXTREME (on the high side): #54 = 1.254 s, #55 = 1.254
s, #56 = 1.274 s, #57 = 1.275 s, #58 = 1.278 s, #59 = 1.289 s
        --1 are mild (on the high side): #53 = 1.251 s
        ----------
        --sd results have unknown validity (the environmental noise
test was skipped)
        ----------
        --action sd values ALMOST CERTAINLY GROSSLY INFLATED by
outliers
        --they cause at least 99.9697752636478% of the measured
VARIANCE according to a equi-valued outlier model
        --model quantities: a = 2.048E7, muB = 1.2496361766166666,
sigmaB = 0.008184720159601516, muA = 6.101739143636067E-8, sigmaA =
1.8085847271776477E-6, tMin = 0.0, muGMin = 3.0508695718180334E-8,
sigmaG = 7.627173929545084E-9, cMax1 = 23284, cMax2 = 5826, cMax =
5826, cOutMin = 5826, varOutMin = 6.696939664767769E-5, muG(cOutMin) =
3.0509159217517505E-8, U(cOutMin) = 1.0727537627935279E-4
--

--
        public static class BatcherSortCallable implements Callable<Quote[]>
{

                protected final Quote[] quotes;

                public BatcherSortCallable(Quote[] quotes) {
                        this.quotes = quotes;
                }

                public Quote[] call() throws Exception {
                        int i = 0;
                        Quote[] quotes1 = null;
                        for(i=0; i < loopcount; i++) {
                                quotes1 = quotes.clone();
                                BatcherSortNetwork50.sort(quotes1, qc);
                        }

                        return quotes1;
                }

        }

javaSort: action statistics: first = 69.175 us, mean = 2.628 us (CI
deltas: -629.044 ps, +626.206 ps), sd = 1.988 us (CI deltas: -272.189
ns, +379.098 ns) WARNING: SD VALUES MAY BE INACCURATE
        ----------
        --the action statistics were calculated from block statistics
        --each block measured 64 task executions
        --the user says that task internally performs m = 10000
actions
        --then the number of actions per block measurement is a =
640000
        --block statistics: mean = 1.682 s (CI deltas: -402.588 us,
+400.772 us), sd = 1.591 ms (CI deltas: -217.751 us, +303.278 us)
        --the forumla used to convert block statistics to action
statistics (mean scales as 1/a, sd scales as 1/sqrt(a)) assumes that
the action execution times are iid
        ----------
        --each confidence interval (CI) is reported as either +-
deltas from the point estimate, or as a closed interval ([x, y])
        --each confidence interval has confidence level = 0.95
        ----------
        --sd results have unknown validity (the environmental noise
test was skipped)
        ----------
        --action sd values ALMOST CERTAINLY GROSSLY INFLATED by
outliers
        --they cause at least 67.8818670249685% of the measured
VARIANCE according to a equi-valued outlier model
        --model quantities: a = 640000.0, muB = 1.6822165199500003,
sigmaB = 0.0015906297382019161, muA = 2.6284633124218756E-6, sigmaA =
1.9882871727523954E-6, tMin = 0.0, muGMin = 1.3142316562109378E-6,
sigmaG = 3.2855791405273445E-7, cMax1 = 408539, cMax2 = 197190, cMax =
197190, cOutMin = 197190, varOutMin = 1.7174811296527663E-6,
muG(cOutMin) = 1.3142344203126216E-6, U(cOutMin) =
5.5796966189531335E-6
--


--
        public static class JavaDefaultSortCallable implements
Callable<Quote[]> {

                protected final Quote[] quotes;

                public JavaDefaultSortCallable(Quote[] quotes) {
                        this.quotes = quotes;
                }

                public Quote[] call() throws Exception {
                        int i = 0;
                        Quote[] quotes1 = null;
                        for(i=0; i < loopcount; i++) {
                                quotes1 = quotes.clone();
                                Arrays.sort(quotes1, qc);
                        }
                        return quotes1;
                }

        }

batcherSort: action statistics: first = 13.157 us, mean = 2.895 us (CI
deltas: -507.973 ps, +637.506 ps), sd = 1.784 us (CI deltas: -365.894
ns, +670.226 ns) WARNING: execution times have mild outliers,
execution times may have serial correlation, SD VALUES MAY BE
INACCURATE
        ----------
        --the action statistics were calculated from block statistics
        --each block measured 64 task executions
        --the user says that task internally performs m = 10000
actions
        --then the number of actions per block measurement is a =
640000
        --block statistics: mean = 1.853 s (CI deltas: -325.103 us,
+408.004 us), sd = 1.427 ms (CI deltas: -292.715 us, +536.181 us)
        --the forumla used to convert block statistics to action
statistics (mean scales as 1/a, sd scales as 1/sqrt(a)) assumes that
the action execution times are iid
        ----------
        --each confidence interval (CI) is reported as either +-
deltas from the point estimate, or as a closed interval ([x, y])
        --each confidence interval has confidence level = 0.95
        ----------
        --EXECUTION TIMES APPEAR TO HAVE OUTLIERS
        --this was determined using the boxplot algorithm with median
= 1.853 s, interquantileRange = 1.725 ms
        --2 are mild (on the high side): #58 = 1.857 s, #59 = 1.859 s
        ----------
        --EXECUTION TIMES HAVE SERIAL CORRELATION
        --1 of the 15 autocorrelation function coefficients (r[k])
that were computed are expected to fall outside their 95% CI
        --but found these 3: r[1] is 2.3544099434807064 sigma above
its mean, r[2] is 2.718366760543633 sigma above its mean, r[3] is
2.0643815453057113 sigma above its mean
        --the 95% CI for the r[k] was calculated as mean +- 1.96 sigma
(i.e. a Gaussian distribution was assumed)
        ----------
        --sd results have unknown validity (the environmental noise
test was skipped)
        ----------
        --action sd values ALMOST CERTAINLY GROSSLY INFLATED by
outliers
        --they cause at least 58.240036949263654% of the measured
VARIANCE according to a equi-valued outlier model
        --model quantities: a = 640000.0, muB = 1.8530796451166673,
sigmaB = 0.0014274778063804268, muA = 2.8954369454947925E-6, sigmaA =
1.7843472579755334E-6, tMin = 0.0, muGMin = 1.4477184727473963E-6,
sigmaG = 3.6192961818684907E-7, cMax1 = 465280, cMax2 = 257877, cMax =
257877, cOutMin = 257877, varOutMin = 1.18675309071405E-6,
muG(cOutMin) = 1.4477211539948363E-6, U(cOutMin) =
5.040667041220033E-6
--

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Re: Implementing let-over-lambda chapter 7 badger network for sorting

Reply via email to