Thanks Ankur,

Built it from git and it works great.

I have another issue now. I am trying to process a huge graph with about 20 
billion edges in GraphX. I only load the file, compute connected components, 
and persist the result right back to disk. When working with subgraphs (~50M 
edges) this works well, but on the whole graph it seems to choke on the graph 
construction part.
Can you advise on how to tune Spark's memory parameters for this task?
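
For reference, here is roughly what the job looks like. The master settings, 
paths, partition count, and memory value below are placeholders rather than 
my real values:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.GraphLoader

    // Placeholder settings -- not the real values
    val conf = new SparkConf()
      .setAppName("ConnectedComponents")
      .set("spark.executor.memory", "30g")
    val sc = new SparkContext(conf)

    // Load the edge list; minEdgePartitions sets the initial number of
    // edge partitions (and hence the parallelism of graph construction)
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges",
      minEdgePartitions = 400)

    // Compute connected components and write the vertex -> component
    // assignment straight back to disk
    val cc = graph.connectedComponents()
    cc.vertices.saveAsTextFile("hdfs:///path/to/cc-output")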

Thanks,
Alex

From: Ankur Dave [mailto:ankurd...@gmail.com]
Sent: Thursday, May 22, 2014 6:59 PM
To: user@spark.apache.org
Subject: Re: GraphX partition problem

The fix will be included in Spark 1.0, but if you just want to apply the fix to 
0.9.1, here's a hotfixed version of 0.9.1 that only includes PR #367: 
https://github.com/ankurdave/spark/tree/v0.9.1-handle-empty-partitions. You can 
clone and build this.
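
Something like the following should work (assuming the usual sbt assembly 
build for 0.9.x):

    git clone https://github.com/ankurdave/spark.git
    cd spark
    git checkout v0.9.1-handle-empty-partitions
    sbt/sbt assembly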

Ankur <http://www.ankurdave.com/>

On Thu, May 22, 2014 at 4:53 AM, Zhicharevich, Alex 
<azhicharev...@ebay.com> wrote:
Hi,

I’m running a simple connected components job using GraphX (version 0.9.1).

My input comes from an HDFS text file partitioned into 400 parts. When I run 
the code on a single part or a small number of files (like 20), it runs fine. 
As soon as I try to read more files (more than 30), I get an error and the 
job fails.
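
For what it's worth, I select the subset of parts with a path glob. The path 
below is a placeholder, and it assumes the usual part-00000 ... part-00399 
naming:

    // Placeholder path; the glob picks up only the first 20 parts
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges/part-000[0-1]*")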
From looking at the logs I see the following exception:

        java.util.NoSuchElementException: End of stream
                at org.apache.spark.util.NextIterator.next(NextIterator.scala:83)
                at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:29)
                at org.apache.spark.graphx.impl.RoutingTable$$anonfun$1.apply(RoutingTable.scala:52)
                at org.apache.spark.graphx.impl.RoutingTable$$anonfun$1.apply(RoutingTable.scala:51)
                at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:456)

From searching the web, I see it’s a known issue with GraphX,
here: https://github.com/apache/spark/pull/367
and here: https://github.com/apache/spark/pull/497

Are there any stable releases that include this fix? Should I clone the git 
repo and build it myself? How would you advise me to deal with this issue?

Thanks,
Alex
