GitHub user darabos opened a pull request:

    https://github.com/apache/spark/pull/276

    Do not re-use objects in the EdgePartition/EdgeTriplet iterators.

    This avoids a silent data corruption issue
    (https://spark-project.atlassian.net/browse/SPARK-1188) and has no
    performance impact in my measurements. It also simplifies the code. As far
    as I can tell, the object re-use was nothing but premature optimization.
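
    To illustrate the hazard, here is a minimal, Spark-free sketch. It is not
    the GraphX code: `Edge`, `reusingIterator`, and `freshIterator` are
    hypothetical stand-ins that only mimic the pattern of an iterator
    returning one mutable object over and over.

```scala
// Hypothetical stand-in for a mutable edge object; not the real GraphX class.
final class Edge(var srcId: Long, var dstId: Long, var attr: Int)

object IteratorReuseDemo {
  // Re-using iterator: mutates and returns the SAME Edge instance on every next().
  def reusingIterator(n: Int): Iterator[Edge] = new Iterator[Edge] {
    private val edge = new Edge(0L, 0L, 0)
    private var i = 0
    def hasNext: Boolean = i < n
    def next(): Edge = {
      i += 1
      edge.srcId = i.toLong
      edge.dstId = i.toLong
      edge.attr = i
      edge
    }
  }

  // Allocating iterator: returns a fresh Edge for every element.
  def freshIterator(n: Int): Iterator[Edge] =
    Iterator.range(1, n + 1).map(i => new Edge(i.toLong, i.toLong, i))

  def main(args: Array[String]): Unit = {
    // Materializing first keeps n references to one object, so every
    // element shows the values written by the last call to next().
    println(reusingIterator(3).toList.map(_.attr)) // List(3, 3, 3)
    // Reading each element before advancing happens to work, which is
    // why the corruption is silent.
    println(reusingIterator(3).map(_.attr).toList) // List(1, 2, 3)
    // With one object per element, both orders give the same answer.
    println(freshIterator(3).toList.map(_.attr))   // List(1, 2, 3)
  }
}
```

    In this sketch, any caller that buffers elements (toList, toArray, sorting,
    caching) sees only the last value, while purely streaming callers see
    correct data.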
    
    I benchmarked all the included changes and measured no performance
    difference. I am not sure where to put the benchmarks. Does Spark have a
    benchmark suite?
    
    This is an example benchmark I did:
    
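    test("benchmark") {
      // Build a partition of 10 million edges (i -> i) with attribute i.
      val builder = new EdgePartitionBuilder[Int]
      for (i <- 1 to 10000000) {
        builder.add(i.toLong, i.toLong, i)
      }
      val p = builder.toEdgePartition
      // Map every edge attribute, then force full iteration with toList.
      p.map(_.attr + 1).iterator.toList
    }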
    
    It ran for 10 seconds both before and after this change.
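
    For anyone who wants to repeat a before/after comparison like this, a
    wall-clock helper along the following lines may be enough. It is only a
    sketch: `timed` and `BenchmarkSketch` are hypothetical names, the workload
    is a stand-in for the EdgePartition code, and a single run does not
    account for JIT warm-up or GC noise.

```scala
object BenchmarkSketch {
  // Hypothetical helper: runs `body` once and returns its result together
  // with the elapsed wall-clock time in seconds.
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    val seconds = (System.nanoTime() - start) / 1e9
    (result, seconds)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload; the real benchmark body would go here instead.
    val (sum, seconds) = timed {
      Iterator.range(1, 1000001).map(_ + 1L).sum
    }
    println(f"sum=$sum, took $seconds%.3f s")
  }
}
```

    Running the same body several times and comparing the later runs would
    give a steadier number than one measurement.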

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/darabos/spark spark-1188

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/276.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #276
    
----
commit c55f52fffa79f0ee227367a555172f6cb4ce5cee
Author: Daniel Darabos <[email protected]>
Date:   2014-03-31T10:58:05Z

    Tests that reproduce the problems from SPARK-1188.

commit 0182f2b329b2bb6e6ca8c41245f09db83b71908b
Author: Daniel Darabos <[email protected]>
Date:   2014-03-31T10:58:37Z

    Do not re-use objects in the EdgePartition/EdgeTriplet iterators. This
    avoids a silent data corruption issue (SPARK-1188) and has no performance
    impact in my measurements. It also simplifies the code.

----

