Github user rxin commented on the pull request:
https://github.com/apache/spark/pull/276#issuecomment-39150913
Thanks for sending this in. This is something that is extremely confusing in
GraphX and we need to fix it. However, I am not sure that taking out the object
reuse in edges is the right way to fix the problem.
This is actually hard to test in micro benchmarks, because you rarely get
GCs in micro benchmarks. In the cases where an edge/triplet is returned by an
iterator, the JVM's escape analysis cannot prove the object stays within scope,
so it cannot do on-stack allocation. As a result, lots of temporary objects get
allocated on the heap.
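To make the trade-off concrete, here is a minimal, self-contained sketch of the reuse pattern in question (the `Edge` and `ReusingIterator` names here are hypothetical stand-ins, not the actual GraphX classes): the iterator mutates and hands back a single shared instance, so callers that retain references see only the last element unless they copy.

```scala
// Hypothetical stand-in for GraphX's Edge: a small mutable record.
class Edge(var srcId: Long = 0L, var dstId: Long = 0L) {
  def copy(): Edge = new Edge(srcId, dstId)
}

// Iterator that reuses one Edge instance across all next() calls,
// avoiding a heap allocation per element.
class ReusingIterator(pairs: Array[(Long, Long)]) extends Iterator[Edge] {
  private val edge = new Edge()   // the single reused instance
  private var pos = 0
  def hasNext: Boolean = pos < pairs.length
  def next(): Edge = {
    edge.srcId = pairs(pos)._1
    edge.dstId = pairs(pos)._2
    pos += 1
    edge                          // same object returned every time
  }
}

object ReuseDemo {
  def main(args: Array[String]): Unit = {
    val pairs = Array((1L, 2L), (3L, 4L))
    // Retaining the reused object: every array slot aliases the same Edge,
    // so all slots reflect the final mutation (3, 4).
    val retained = new ReusingIterator(pairs).toArray
    println(retained.map(e => (e.srcId, e.dstId)).toSeq)
    // Copying on the way out yields independent, correct elements.
    val copied = new ReusingIterator(pairs).map(_.copy()).toArray
    println(copied.map(e => (e.srcId, e.dstId)).toSeq)
  }
}
```

The copy-on-return variant is exactly the shape of fix suggested below: reuse internally for speed, but never let the shared instance escape to user code.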
Allocations of short-lived objects are supposed to be cheap. The most
expensive objects to GC are medium-lived objects, which survive young gen
collections and get promoted, only to die later in the old gen. However, the
more temporary objects we allocate, the more frequently young gen GCs happen,
and the more frequent young gen GCs are, the more likely random objects are to
become medium-lived.
Maybe a better way to fix this is to leave the object reuse as is, but in
all places where we return the object to the user, make sure we return a
copy.
I just looked at the code, and I think we can accomplish that by just
adding a copy to EdgeRDD.compute, i.e.
```scala
override def compute(part: Partition, context: TaskContext): Iterator[Edge[ED]] = {
  firstParent[(PartitionID, EdgePartition[ED])]
    .iterator(part, context)
    .next._2.iterator
    .map(_.copy())
}
```