GitHub user aarondav opened a pull request:
https://github.com/apache/spark/pull/1502
SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage
### Why and what?
Currently, the AppendOnlyMap performs an "in-place" sort by converting its
array of [key, value, key, value] pairs into a an array of [(key, value), (key,
value)] pairs. However, this causes us to allocate many Tuple2 objects, which
come at a nontrivial overhead.
This patch adds a Sorter API, intended for in memory sorts, which simply
ports the Java OpenJDK 6 implementation of Arrays.sort() (which uses a merge
sort) and abstracts the interface in a way which introduces no more than 1
virtual function invocation of overhead at each abstraction point.
Please compare our port of the Java 6 sort with the original
implementation: http://www.diffchecker.com/kh9ufcqo
### Memory implications
An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements
within its underlying array. Each of these elements is 4 bytes wide in a
[compressed
OOP](https://wikis.oracle.com/display/HotSpotInternals/CompressedOops) system,
which is the default.
Today's approach immediately allocates N Tuple2 objects, which take up 24N
bytes in total (exposed via YourKit), and undergoes a Java sort. The Java 6
version immediately copies the entire array (4N bytes here), while the Java 7
version has a worst-case allocation of half the array (2N bytes).
This results in a sorting overhead of 24N + 4N = 28N bytes (for Java 6).
The Sorter does not require allocating any tuples, but since it uses the
Java 6 merge sort algorithm, it does copy the entire array (and that is the
entire array, not just the half needed for Tuples).
This results in a sorting overhead of 8N bytes.
Thus, we have reduced the overhead of the sort by roughly 20 bytes times
the number of elements.
### Performance implications
As the destructiveSortedIterator is used for spilling in an
ExternalAppendOnlyMap, the purpose of this patch is to provide stability by
reducing memory usage rather than improve performance.
Indeed, this PR implements Java 6's merge sort rather than the Java 7
Timsort, which is much more performant. A future optimization is to port the
Timsort over, which the SortDataFormat API should support with minimal changes.
Nevertheless, here are the results of a microbenchmark that sorted 25
million, randomly distributed (Float, Int) pairs. The Java Arrays.sort() tests
were run **only on the keys**, and thus moved less data. Our current
implementation is called "Tuple-sort using Arrays.sort()".
<table>
<tr><th>Test</th><th>First run (JDK6)</th><th>Average of 10
(JDK6)</th><th>First run (JDK7)</th><th>Average of 10 (JDK7)</th></tr>
<tr><td>primitive Arrays.sort()</td><td>3216 ms</td><td>1190
ms</td><td>2724 ms</td><td>131 ms (!!)</td></tr>
<tr><td>Arrays.sort()</td><td>18564 ms</td><td>2006 ms</td><td>13201
ms</td><td>878 ms</td></tr>
<tr><td>Tuple-sort using Arrays.sort()</td><td>31813 ms</td><td>3550
ms</td><td>20990 ms</td><td>1919 ms</td></tr>
<tr><td><b>KV-array using Sorter</b></td><td></td><td></td><td><b>18232
ms</b></td><td><b>2030 ms</b></td></tr>
<tr><td>Microbenchmarks are stupid</td><td></td><td></td><td>25708
ms</td><td>2400 ms</td></tr>
</table>
Note that the final test was the same as "KV-array using Sorter", but with
a second implementation of SortDataFormat loaded in the JVM (presumably causing
a de-opt for the virtual function call shortcircuit).
The results show that this Sorter performs exactly as expected -- it is
about as fast as the Java 6 Arrays.sort() (which shares the same algorithm),
but is significantly faster than the Tuple-sort on Java 6.
The Java 7 Timsort provided a huge speedup in this benchmark, suggesting
that using it instead would result in roughly a 2x speedup. However, the
Tuple-based approach is still not significantly faster despite the much better
algorithm.
In short, this patch should significantly improve performance for users
running Java 6, and provide a minor performance degradation for users running
Java 7.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/aarondav/spark sort
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1502.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1502
----
commit 6307338402e7872368bf61993b1be4ed7302cb6e
Author: Aaron Davidson <[email protected]>
Date: 2014-07-19T19:21:29Z
SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage
Currently, the AppendOnlyMap performs an "in-place" sort by converting
its array of [key, value, key, value] pairs into a an array of
[(key, value), (key, value)] pairs. However, this causes us to allocate
many Tuple2 objects, which come at a nontrivial overhead.
This patch adds a Sorter API, intended for in memory sorts, which simply
ports the Java OpenJDK 6 implementation of Arrays.sort() (which uses
a merge sort) and abstracts the interface in a way which introduces no more
than 1 virtual function invocation of overhead at each abstraction point.
Please compare our port of the Java 6 sort with the original implementation:
http://www.diffchecker.com/kh9ufcqo
=== Memory implications ===
An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements
within its underlying array. Each of these elements is 4 bytes wide in a
[compressed
OOP](https://wikis.oracle.com/display/HotSpotInternals/CompressedOops) system,
which is the default.
Today's approach immediately allocates N Tuple2 objects, which take up 24N
bytes
in total (exposed via YourKit), and undergoes a Java sort. The Java 6
version
immediately copies the entire array (4N bytes here), while the Java 7
version has a worst-case allocation of half the array (2N bytes).
This results in a sorting overhead of 24N + 4N = 28N bytes (for Java 6).
The Sorter does not require allocating any tuples, but since it uses the
Java 6
merge sort algorithm, it does copy the entire array (and that is the entire
array,
not just the half needed for Tuples).
This results in a sorting overhead of 8N bytes.
Thus, we have reduced the overhead of the sort by roughly 20 bytes times the
number of elements.
=== Performance implications ===
As the destructiveSortedIterator is used for spilling in an
ExternalAppendOnlyMap,
the purpose of this patch is to provide stability by reducing memory usage
rather
than improve performance.
Indeed, this PR implements Java 6's merge sort rather than the Java 7
Timsort, which
is much more performant. A future optimization is to port the Timsort over,
which the
SortDataFormat API should support with minimal changes.
Nevertheless, here are the results of a microbenchmark that sorted 25
million, randomly
distributed (Float, Int) pairs. The Java Arrays.sort() tests were run
**only on the keys**,
and thus moved less data. Our current implementation is called "Tuple-sort
using Arrays.sort()".
<table>
<tr><th>Java version</th><th>Test</th><th>First run</th><th>Average of
10</th></tr>
<tr><td>6</td><td>primitive Arrays.sort()</td><td>3216 ms</td><td>1190
ms</td></tr>
<tr><td>6</td><td>Arrays.sort()</td><td>18564 ms</td><td>2006 ms</td></tr>
<tr><td>6</td><td>Tuple-sort using Arrays.sort()</td><td>31813
ms</td><td>3550 ms</td></tr>
<tr><td>7</td><td>primitive Arrays.sort()</td><td>2724 ms</td><td>131 ms
(!!)</td></tr>
<tr><td>7</td><td>Arrays.sort()</td><td>13201 ms</td><td>878 ms</td></tr>
<tr><td>7</td><td>Tuple-sort using Arrays.sort()</td><td>20990
ms</td><td>1919 ms</td></tr>
<tr><td>7</td><td>**KV-sort using Sorter**</td><td>**18232
ms**</td><td>**2030 ms**</td></tr>
<tr><td>7</td><td>Microbenchmarks are stupid</td><td>25708 ms</td><td>2400
ms</td></tr>
</table>
Note that the final test was the same as KV-sort using Sorter, but with a
second
impelementation of SortDataFormat loaded in the JVM (presumably causing a
de-opt
for the virtual function call shortcircuit).
The results show that this Sorter performs exactly as expected -- it is
about as fast as the
Java 6 Arrays.sort() (which shares the same algorithm), but is
significantly faster than
the Tuple-sort on Java 6.
The Java 7 Timsort provided a huge speedup in this benchmark, suggesting
that using it instead
would result in roughly a 2x speedup. However, the Tuple-based approach is
still not
significantly faster despite the much better algorithm.
In short, this patch should significantly improve performance on users
running Java 6, and
provide a minor performance degradation for users running Java 7.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---