GitHub user darabos opened a pull request:
https://github.com/apache/spark/pull/276
Do not re-use objects in the EdgePartition/EdgeTriplet iterators.
This avoids a silent data corruption issue
(https://spark-project.atlassian.net/browse/SPARK-1188) and has no performance
impact in my measurements. It also simplifies the code. As far as I can tell,
the object re-use was nothing more than premature optimization.
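To illustrate the class of bug, here is a minimal, self-contained sketch. It does not use the actual GraphX classes: MutableEdge, reusingIterator and freshIterator are hypothetical names invented for this example. An iterator that hands out the same mutable object on every call to next() looks fine when each element is consumed immediately, but silently corrupts the data as soon as the elements are retained, for example by toList.

```scala
// Hypothetical stand-in for an edge object; NOT the real GraphX Edge class.
final class MutableEdge(var srcId: Long = 0L, var dstId: Long = 0L, var attr: Int = 0)

object IteratorReuseDemo {
  private val data = Array((1L, 2L, 10), (2L, 3L, 20), (3L, 4L, 30))

  // Re-uses a single mutable object for every element (the problematic pattern).
  def reusingIterator: Iterator[MutableEdge] = new Iterator[MutableEdge] {
    private val edge = new MutableEdge()
    private var pos = 0
    def hasNext: Boolean = pos < data.length
    def next(): MutableEdge = {
      val (s, d, a) = data(pos)
      edge.srcId = s
      edge.dstId = d
      edge.attr = a
      pos += 1
      edge // same instance every time, overwritten on the next call
    }
  }

  // Allocates a fresh object per element, so retained elements stay valid.
  def freshIterator: Iterator[MutableEdge] =
    data.iterator.map { case (s, d, a) => new MutableEdge(s, d, a) }

  def main(args: Array[String]): Unit = {
    // All three list entries point at the same object, which holds the last edge.
    println(reusingIterator.toList.map(_.attr)) // List(30, 30, 30)
    // Each entry is its own object.
    println(freshIterator.toList.map(_.attr))   // List(10, 20, 30)
  }
}
```

Note that reusingIterator.map(_.attr).toList would still give the right answer, because each element is read before the next one overwrites it. That is what makes this kind of corruption silent: it only shows up when the objects themselves are collected or buffered.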
I benchmarked all of the included changes and measured no performance
difference. I am not sure where to put the benchmarks. Does Spark
not have a benchmark suite?
This is an example benchmark I did:
    test("benchmark") {
      val builder = new EdgePartitionBuilder[Int]
      for (i <- 1 to 10000000) {
        builder.add(i.toLong, i.toLong, i)
      }
      val p = builder.toEdgePartition
      p.map(_.attr + 1).iterator.toList
    }
It ran for 10 seconds both before and after this change.
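For anyone who wants to repeat a measurement like this outside a test runner, a wall-clock timing helper could look like the sketch below. The names TimingSketch and timed are made up for this example and are not part of Spark, and the workload is a plain-Scala stand-in rather than the EdgePartition benchmark above.

```scala
object TimingSketch {
  // Runs `body` once and returns its result together with the
  // elapsed wall-clock time in milliseconds.
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1e6
    (result, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload (an assumption for illustration only).
    // Long is used so the sum does not overflow an Int.
    val (sum, ms) = timed { (1L to 1000000L).map(_ + 1).sum }
    println(f"sum=$sum, elapsed=$ms%.1f ms")
  }
}
```

A single cold run like this includes JIT warm-up, so repeating the timed block several times and comparing the later runs gives a fairer before/after comparison.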
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/darabos/spark spark-1188
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/276.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #276
----
commit c55f52fffa79f0ee227367a555172f6cb4ce5cee
Author: Daniel Darabos <[email protected]>
Date: 2014-03-31T10:58:05Z
    Tests that reproduce the problems from SPARK-1188.
commit 0182f2b329b2bb6e6ca8c41245f09db83b71908b
Author: Daniel Darabos <[email protected]>
Date: 2014-03-31T10:58:37Z
    Do not re-use objects in the EdgePartition/EdgeTriplet iterators. This
    avoids a silent data corruption issue (SPARK-1188) and has no performance
    impact in my measurements. It also simplifies the code.
----