Github user mallman commented on the issue:
https://github.com/apache/spark/pull/15125
> I think we want to bring GraphFrames to feature/performance parity with
> GraphX - @mallman would love to understand the challenges you have run into.
> Better yet, would be great to get some issues created to track them.

Regarding the scale/performance of GraphX vis-a-vis GraphFrames, I can
speak from our experience with the connected components algorithm.
As you know, there are two implementations of the connected components
algorithm in the GraphFrames project. There's an implementation which
"piggybacks" on GraphX. And there's an implementation that does not use GraphX.
We don't really need any vertex or edge attributes when computing connected
components. Any such attributes are strictly overhead. We found that using
vertex and edge attributes of type `Boolean` and value `null` provides the
least overhead. I would not expect piggybacking on top of GraphX to scale
better than using GraphX itself, and we did not get performance parity using
GraphFrames in this way.
GraphX uses custom implementations of the RDD interface with fast point
indices for lookups and joins. By contrast, the Dataset interface is closed to
extension by clients, and that's by design.
Considering the problem that way, I think that bringing a Dataset-based
graph library to performance parity with an RDD-based graph library will be
quite challenging. This is especially true in cases where the client doesn't
even need vertex or edge attributes.
I think that to even get to performance parity, Spark SQL needs to include
support for some kind of columnar indices. But even if GraphFrames implements a
better algorithm for connected components than the one in GraphX, wouldn't that
algorithm perform better still in GraphX if it were ported to that codebase?
We'd love to use something like GraphFrames pervasively, as it does provide
a much more convenient interface when we do use vertex or edge attributes. In
fact, before we discovered GraphFrames we made quite a lot of headway into
building our own graph library of that sort. However, we found that the
overhead incurred by the DataFrame approach we took was untenable.
So we use a sort of hybrid approach: we do everything except graph
computations with DataFrames, and we convert to RDDs for graph computations.
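
A minimal sketch of what such a hybrid pipeline could look like, assuming an edge DataFrame with two `Long` columns. The column names `src` and `dst`, the object and method names, and the `false` placeholder attributes are illustrative assumptions; the placeholders only stand in for the attribute-free setup described earlier and are not a claim about the exact representation that gave the least overhead.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

object HybridConnectedComponents {

  // `edges` is assumed to have two Long columns named "src" and "dst"
  // (hypothetical names chosen for this sketch).
  def run(spark: SparkSession, edges: DataFrame): DataFrame = {
    import spark.implicits._

    // Leave the DataFrame API only for the graph computation itself.
    val edgeRdd: RDD[Edge[Boolean]] = edges
      .select($"src", $"dst")
      .as[(Long, Long)]
      .rdd
      .map { case (src, dst) => Edge(src, dst, false) } // placeholder attribute

    // Vertices are derived from the edges; each gets the same placeholder value.
    val graph: Graph[Boolean, Boolean] =
      Graph.fromEdges(edgeRdd, defaultValue = false)

    // GraphX labels every vertex with the lowest vertex id in its component.
    val components: RDD[(Long, Long)] = graph.connectedComponents().vertices

    // Return to the DataFrame API for everything downstream.
    components.toDF("id", "component")
  }
}
```

The result can be joined back to an attribute-carrying vertex DataFrame on `id`, so the attributes never have to pass through the graph computation.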
Cheers.