[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

mallman Wed, 19 Apr 2017 17:29:59 -0700

Github user mallman commented on the issue:

    https://github.com/apache/spark/pull/15125
  
    > I think we want to bring GraphFrames to feature/performance parity with 
GraphX - @mallman would love to understand the challenges you have run into. 
Better yet, would be great to get some issues created to track them
    
    Regarding the scale/performance of GraphX vis-a-vis GraphFrames, I can 
speak from our experience with the connected components algorithm.
    
    As you know, there are two implementations of the connected components 
algorithm in the GraphFrames project. There's an implementation which 
"piggybacks" on GraphX. And there's an implementation that does not use GraphX.
    
    We don't really need any vertex or edge attributes when computing connected 
components. Any such attributes are strictly overhead. We found that using 
vertex and edge attributes of type `Boolean` and value `null` provides the least 
overhead. I would expect piggybacking on top of GraphX cannot scale better than 
using GraphX itself. And we just didn't get performance parity using 
GraphFrames in this way.
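    
    To illustrate why attributes are strictly overhead here, the following is a 
minimal single-machine sketch of connected components using union-find. It is 
not the GraphX or GraphFrames implementation, and the function name 
`connected_components` is hypothetical. It only shows that the computation needs 
nothing beyond vertex ids and edge endpoints. Each vertex is labeled with the 
smallest vertex id in its component, which is also the labeling convention 
GraphX documents for its own connected components result.

```python
def connected_components(vertex_ids, edges):
    """Label each vertex with the smallest vertex id in its component.

    vertex_ids: iterable of comparable, hashable ids
    edges: iterable of (src, dst) pairs; no vertex or edge attributes needed
    """
    vertex_ids = list(vertex_ids)
    # Every vertex starts out as the root of its own singleton component.
    parent = {v: v for v in vertex_ids}

    def find(v):
        # Walk up to the root, shortening the path as we go (path halving).
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        a, b = find(src), find(dst)
        if a == b:
            continue
        # Always keep the smaller id as the root, so the root of a
        # component is the minimum id in that component.
        if a < b:
            parent[b] = a
        else:
            parent[a] = b

    return {v: find(v) for v in vertex_ids}


if __name__ == "__main__":
    labels = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
    print(labels)
```

    The only state carried per vertex is a single id, which is why any 
attribute payload attached to vertices or edges adds cost without contributing 
to the result.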
    
    GraphX uses custom implementations of the RDD interface with fast point 
indices for lookups and joins. By contrast, the Dataset interface is closed to 
extension by clients, and that's by design.
    
    Considering the problem that way, I think that bringing a Dataset-based 
graph library to performance parity with an RDD-based graph library will be 
quite challenging. This is especially true in cases where the client doesn't 
even need vertex or edge attributes.
    
    I think that to even get to performance parity, Spark SQL needs to include 
support for some kind of columnar indices. But even if GraphFrames implements a 
better algorithm for connected components than that in GraphX, would that 
algorithm perform better in GraphX if it were ported to that codebase?
    
    We'd love to use something like GraphFrames pervasively, as it does provide 
a much more convenient interface when we do use vertex or edge attributes. In 
fact, before we discovered GraphFrames we made quite a lot of headway into 
building our own graph library of the sort. However, we found that the overhead 
incurred by the DataFrame approach we took was untenable.
    
    We use a sort of hybrid approach. We do everything except graph 
computations with dataframes. We convert to RDDs for graph computations.
    
    Cheers.



