I can't offer definite answers to your questions, but I can suggest a few issues you should consider:

1. Merge sort doesn't parallelize all that well--when the blocks are small, the parallelization overhead is large compared with the productive work to be done, and when the blocks are large there are only a few of them and the final merges are inherently sequential, so the amount of parallelism available is limited. Quicksort and quickersort, of course, suffer from the same issue. The end result is that your timings will depend heavily on your hardware, your software, and the properties of the particular data set you use for testing. (See the first sketch after this list.)

2. You need to account for I/O buffering (not only by your operating system in RAM, but also by your disk controller)--after the first set of I/O operations, your data may already be sitting in buffers, so subsequent uses may retrieve it from those buffers rather than from the disk itself. Similarly, you also have to take paging and cache effects into account, which could make the first run much slower than immediately subsequent runs.

3. A better benchmark would be provided by a counting sort (sometimes called a rank or enumeration sort), which does parallelize well: it is O(n * (n/k)), where k is the number of processors and n is the number of elements to be sorted. A major advantage of using a counting sort for benchmarking is that it runs slowly enough to make it relatively easy to compare sequential and parallel timings. (See the second sketch after this list.)

4. Depending on your system defaults, there may also be memory allocation issues that need to be taken into account (which could also easily cause the first run to be considerably slower than subsequent runs made immediately after the first).
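
To make point 1 concrete, here is a rough, untested sketch of a depth-limited parallel merge sort using par and pseq from Control.Parallel; the depth threshold and the spine-forcing helper are my own additions, not anything taken from your code:

    import Control.Parallel (par, pseq)

    -- Depth-limited parallel merge sort: spark one half, evaluate the
    -- other in the current thread, then merge.  Below the threshold we
    -- fall back to the purely sequential version, so the sparks stay
    -- coarse-grained and the per-spark overhead is amortised.
    parMergeSort :: Ord a => Int -> [a] -> [a]
    parMergeSort _ []  = []
    parMergeSort _ [x] = [x]
    parMergeSort depth xs
      | depth <= 0 = merge (parMergeSort 0 front) (parMergeSort 0 back)
      | otherwise  =
          let l = parMergeSort (depth - 1) front
              r = parMergeSort (depth - 1) back
          in  forceList l `par` (forceList r `pseq` merge l r)
      where
        (front, back) = splitAt (length xs `div` 2) xs

    -- Walk the spine so the spark does real work; on its own, par only
    -- evaluates to weak head normal form (the first cons cell).
    forceList :: [a] -> ()
    forceList []     = ()
    forceList (_:xs) = forceList xs

    -- Plain sequential merge of two sorted lists.
    merge :: Ord a => [a] -> [a] -> [a]
    merge [] ys = ys
    merge xs [] = xs
    merge xxs@(x:xs) yys@(y:ys)
      | x <= y    = x : merge xs yys
      | otherwise = y : merge xxs ys

Only the top few levels of the recursion create sparks, so each spark represents a large block of work; how many levels are worth sparking is exactly the hardware- and data-dependent question I mean.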
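
And for point 3, a sketch of the counting (rank) sort I have in mind, parallelised with Control.Parallel.Strategies; I'm assuming parListChunk and rseq here (older versions of the library spell the latter rwhnf), and the chunk size of 1000 is arbitrary:

    import Control.Parallel.Strategies (parListChunk, rseq, using)
    import Data.Array (array, elems)

    -- Rank ("counting") sort: each element's final position is the
    -- number of elements that must precede it.  Ties are broken by
    -- index, so the ranks form a permutation of 0..n-1.
    rankSort :: Ord a => [a] -> [a]
    rankSort xs = elems (array (0, n - 1) (zip ranks xs))
      where
        n     = length xs
        ixs   = zip [0 :: Int ..] xs
        ranks = map rankOf ixs `using` parListChunk 1000 rseq
        -- Count the elements that should come before (i, x).
        rankOf (i, x) =
          length [ () | (j, y) <- ixs, y < x || (y == x && j < i) ]

Because every rank is computed independently from the whole input, the O(n * n) work splits evenly across k processors (roughly n/k ranks each), which is what makes it a convenient yardstick for sequential-versus-parallel comparisons.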



Murray Gross
Brooklyn College



On Sat, 19 Apr 2008, Andrew Coppin wrote:

OK, so just for fun, I decided to try implementing a parallel merge sort using the seq and par combinators. My plan was to generate some pseudo-random data and time how long it takes to sort it. To try to account for lazy evaluation, what the program actually does is this:

1. Write the input data to disk without any sorting. (This ought to force it to be fully evaluated.)
2. Sort and save the data to disk 8 times. (So I can average the runtimes.)

This is done with two data sets - one with 1 million items, and another with 2 million. Each data set is run through both the purely sequential algorithm and a simple parallel one. (Split the list in half, merge-sort each half in parallel, and then merge them.)
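
In outline, the driver for each run looks something like the following sketch (simplified, and not the exact code: Data.List.sort stands in for the merge sort being measured, and the generator, seed and file names are just placeholders):

    import Control.Exception (evaluate)
    import Data.List         (sort)
    import Data.Time.Clock   (diffUTCTime, getCurrentTime)
    import System.Random     (mkStdGen, randoms)

    -- Time one run of a sort, write the result to disk, and print the
    -- wall-clock time.  Forcing the length of the result makes sure we
    -- time the sort itself rather than the creation of a thunk.
    timeSort :: ([Int] -> [Int]) -> [Int] -> IO ()
    timeSort sorter input = do
      start  <- getCurrentTime
      sorted <- evaluate (sorter input)
      _      <- evaluate (length sorted)
      stop   <- getCurrentTime
      writeFile "sorted.txt" (unlines (map show sorted))
      print (diffUTCTime stop start)

    main :: IO ()
    main = do
      let input = take 1000000 (randoms (mkStdGen 42)) :: [Int]
      -- Step 1: write the unsorted data to disk to force evaluation.
      writeFile "unsorted.txt" (unlines (map show input))
      -- Step 2: sort and save eight times so the runtimes can be averaged.
      mapM_ (\_ -> timeSort sort input) [1 .. 8 :: Int]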

The results of this little benchmark utterly defy comprehension. Allow me to enumerate:

Weird thing #1: The first time you sort the data, it takes a few seconds. The other 7 times, it takes a split second - roughly 100x faster. Wuh?

Weird thing #2: The parallel version runs *faster* than the sequential one in all cases - even with SMP disabled! (We're only talking a few percent faster, but still.)

Weird thing #3: Adding the "-threaded" compiler option makes *everything* run a few percent faster. Even with only 1 OS thread.

Weird thing #4: Adding "-N2" makes *everything* slow down a few percent. In particular, Task Manager shows only one CPU core in use.

Adding more than 2 OS threads makes everything slow down even further - but that's hardly surprising.

Can anybody explain any of this behaviour? I have no idea what I'm benchmarking, but it certainly doesn't appear to be the performance of a parallel merge sort!

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
