Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/276#issuecomment-39150913
  
    Thanks for sending this in. This behavior is extremely confusing in 
GraphX and we need to fix it. However, I am not sure that taking out the object 
reuse in edges is the right way to fix the problem. 
    
    This is actually hard to measure in micro benchmarks, because you rarely 
trigger GCs in micro benchmarks. In the cases where an edge/triplet is returned 
by an iterator, the JVM's escape analysis cannot prove the object stays within 
scope, so it cannot do on-stack allocation. As a result, lots of temporary 
objects get allocated on the heap.
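    As a purely illustrative sketch of that allocation cost (using a made-up 
`Edge` class, not the real `org.apache.spark.graphx.Edge`):
    ```scala
    // Illustrative only: a stand-in for the real Edge class.
    case class Edge(srcId: Long, dstId: Long)

    // The "no reuse" style: a fresh Edge is allocated for every element the
    // iterator yields. Each allocation is short-lived, but none of them can
    // be stack-allocated because the object escapes through the iterator.
    def allocatingIterator(pairs: Array[(Long, Long)]): Iterator[Edge] =
      pairs.iterator.map { case (s, d) => Edge(s, d) }

    val edges = allocatingIterator(Array((1L, 2L), (3L, 4L))).toArray
    // edges(0) and edges(1) are distinct heap objects
    ```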
    
    Allocations of short-lived objects are supposed to be cheap; the most 
expensive objects to GC are medium-lived ones. However, the more temporary 
objects we allocate, the more frequently young-gen GCs happen, and the more 
frequently young-gen GCs happen, the more likely random objects are to survive 
a collection and become medium-lived. 
    
    Maybe a better way to fix this is to leave the object reuse as is, but in 
all the places where we return the object to the user, make sure we return a 
copy.
    
    I just looked at the code, and I think we can accomplish that by just 
adding a copy in EdgeRDD.compute, i.e.
    ```scala
    override def compute(part: Partition, context: TaskContext): Iterator[Edge[ED]] = {
      firstParent[(PartitionID, EdgePartition[ED])].iterator(part, context).next._2.iterator.map(_.copy())
    }
    ```
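    For contrast, here is a tiny sketch (again with a made-up mutable `Edge`, 
not the real GraphX one) of why mapping `_.copy()` over a reusing iterator is 
enough: without the copy, every element the caller sees aliases the one reused 
instance; with it, each element is an independent snapshot.
    ```scala
    // Illustrative only: a mutable stand-in for a reused Edge object.
    case class Edge(var srcId: Long = 0L, var dstId: Long = 0L)

    // The reuse style: one Edge instance is mutated and handed out repeatedly.
    def reusingIterator(pairs: Array[(Long, Long)]): Iterator[Edge] = {
      val edge = Edge()
      pairs.iterator.map { case (s, d) =>
        edge.srcId = s
        edge.dstId = d
        edge // same instance every time
      }
    }

    // Without a copy, materializing the iterator exposes the aliasing:
    val aliased = reusingIterator(Array((1L, 2L), (3L, 4L))).toArray
    // aliased(0) eq aliased(1), and both now hold the last values (3, 4)

    // With map(_.copy()), each element is an independent object:
    val copied = reusingIterator(Array((1L, 2L), (3L, 4L))).map(_.copy()).toArray
    ```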

