Hey all,

I’ve often found that my Spark programs run much more stably with a higher 
number of partitions, and many of the graphs I deal with come as a few 
hundred large part files. Would it make sense to add a parameter to GraphLoader, 
defaulting to false, that sets the shuffle argument of coalesce in GraphX, or 
is there a good reason it was left out? 
I’ve been using this patch myself for a couple of weeks.
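
For context, here is a minimal sketch of the behaviour the flag toggles. It is not GraphLoader code: the object name, the local master, and the partition counts (4, 2, 16) are assumptions for illustration, and sc.parallelize stands in for sc.textFile. As far as I understand RDD.coalesce, without a shuffle it can merge partitions but not add any, while shuffle = true redistributes the records and can also raise the count.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical demo object, not part of GraphX.
object CoalesceShuffleSketch {

  /** Returns the partition counts after three coalesce calls on a 4-partition RDD. */
  def partitionCounts(sc: SparkContext): (Int, Int, Int) = {
    // Stand-in for sc.textFile(path, minEdgePartitions): an RDD with 4 partitions.
    val lines = sc.parallelize(1 to 1000, 4)

    // shuffle = false (the current default): merging down to 2 partitions works ...
    val narrowed = lines.coalesce(2)
    // ... but asking for more partitions than exist leaves the count at 4.
    val notWidened = lines.coalesce(16)

    // shuffle = true (what the patch exposes): records are redistributed,
    // so the partition count can grow to 16.
    val widened = lines.coalesce(16, shuffle = true)

    (narrowed.partitions.length, notWidened.partitions.length, widened.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for illustration only.
    val conf = new SparkConf().setMaster("local[2]").setAppName("coalesce-shuffle-sketch")
    val sc = new SparkContext(conf)
    try {
      val (narrowed, notWidened, widened) = partitionCounts(sc)
      println(s"coalesce(2):                  $narrowed partitions")
      println(s"coalesce(16):                 $notWidened partitions")
      println(s"coalesce(16, shuffle = true): $widened partitions")
    } finally {
      sc.stop()
    }
  }
}
```

With the patch applied, the call site would presumably look something like GraphLoader.edgeListFile(sc, path, minEdgePartitions = 400, shuffle = true), where path and 400 are placeholder values.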

—Jeff

diff --git a/graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala b/graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala
index f4c7936..b2f9e9c 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala
@@ -58,13 +58,14 @@ object GraphLoader extends Logging {
       canonicalOrientation: Boolean = false,
       minEdgePartitions: Int = 1,
       edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
-      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
+      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
+      shuffle: Boolean = false)
     : Graph[Int, Int] =
   {
     val startTime = System.currentTimeMillis

     // Parse the edge data table directly into edge partitions
-    val lines = sc.textFile(path, minEdgePartitions).coalesce(minEdgePartitions)
+    val lines = sc.textFile(path, minEdgePartitions).coalesce(minEdgePartitions, shuffle)
     val edges = lines.mapPartitionsWithIndex { (pid, iter) =>
       val builder = new EdgePartitionBuilder[Int, Int]
       iter.foreach { line =>
