Re: Graphx seems to be broken while Creating a large graph(6B nodes in my case)

2014-08-23 Thread Jeffrey Picard
I’m seeing this issue also. I have a graph with 5828339535 vertices and 
7398447992 edges; graph.numVertices returns 1533266498, while graph.numEdges is 
correct and returns 7398447992. I am also having an issue that I’m beginning to 
suspect is caused by the same underlying problem, where connected components 
stops after one iteration and returns an incorrect graph.
On Aug 22, 2014, at 8:43 PM, npanj nitinp...@gmail.com wrote:

 While creating a graph with 6B nodes and 12B edges, I noticed that
 *'numVertices' api returns an incorrect result*; 'numEdges' reports the correct
 number. A few times (with a different dataset, ~2.5B nodes) I have also
 noticed that numVertices is returned as a negative number, so I suspect there
 is some overflow (maybe we are using an Int for some field?).
 
 Environment: standalone mode running on EC2, using the latest code from the
 master branch, up to commit db56f2df1b8027171da1b8d2571d1f2ef1e103b6.
 
 Here are some details of the experiments I have done so far: 
 1. Input: numNodes=6101995593 ; noEdges=12163784626
 Graph returns: numVertices=1807028297 ; numEdges=12163784626
 2. Input : numNodes=*2157586441* ; noEdges=2747322705
 Graph Returns: numVertices=*-2137380855* ; numEdges=2747322705
 3. Input: numNodes=1725060105 ; noEdges=204176821
 Graph returns: numVertices=1725060105 ; numEdges=2041768213
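 The numbers above are consistent with a plain 32-bit truncation: narrowing
 each input vertex count from a long to an int reproduces the returned values
 exactly. A minimal, self-contained check (a hypothetical illustration of the
 suspected overflow, not GraphX code):

```java
public class VertexCountOverflow {
    public static void main(String[] args) {
        // Input vertex counts from experiments 1 and 2 above.
        long[] inputs = {6101995593L, 2157586441L};
        for (long n : inputs) {
            // Narrowing to int keeps only the low 32 bits, which is
            // exactly what storing the count in an Int field would do.
            System.out.println(n + " truncated to int = " + (int) n);
        }
        // Prints:
        // 6101995593 truncated to int = 1807028297
        // 2157586441 truncated to int = -2137380855
    }
}
```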
 
 
 You can find code to reproduce this bug here:
 https://gist.github.com/npanj/92e949d86d08715bf4bf
 
 (I have also filed this jira ticket:
 https://issues.apache.org/jira/browse/SPARK-3190)
 
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Graphx-seems-to-be-broken-while-Creating-a-large-graph-6B-nodes-in-my-case-tp7966.html
 Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org
 





Re: Spark Contribution

2014-08-23 Thread Nicholas Chammas
That sounds like a good idea.

Continuing along those lines, what do people think of moving the
contributing page entirely from the wiki to GitHub? It feels like the right
place for it since GitHub is where we take contributions, and it also lets
people make improvements to it.

Nick


On Saturday, August 23, 2014, Sean Owen so...@cloudera.com wrote:

 Can I ask a related question, since I have a PR open to touch up
 README.md as we speak (SPARK-3069)?

 If this text is in a file called CONTRIBUTING.md, then it will cause a
 link to appear on the pull request screen, inviting people to review
 the contribution guidelines:

 https://github.com/blog/1184-contributing-guidelines

 This is mildly important as the project wants to make it clear that
 you agree that your contribution is licensed under the AL2, since
 there is no formal ICLA.

 How about I propose moving the text to CONTRIBUTING.md, with a pointer
 in README.md? Or keep it in both places?
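
 A sketch of what that split might look like (hypothetical file contents,
 not the project's actual wording): a CONTRIBUTING.md stating the licensing
 expectation, plus a short pointer left in README.md.

```
<!-- CONTRIBUTING.md (hypothetical sketch) -->
## Contributing to Spark

Contributions via GitHub pull requests are welcome. By submitting a
patch, you agree that it is licensed under the Apache License 2.0.
See the "Contributing to Spark" wiki page for the full guidelines.

<!-- pointer in README.md (hypothetical sketch) -->
Please review [CONTRIBUTING.md](CONTRIBUTING.md) before opening a pull request.
```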

 On Sat, Aug 23, 2014 at 1:08 AM, Reynold Xin r...@databricks.com wrote:
  Great idea. Added the link
  https://github.com/apache/spark/blob/master/README.md
 
 
 
  On Thu, Aug 21, 2014 at 4:06 PM, Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
  We should add this link to the readme on GitHub btw.
 
  On Thursday, August 21, 2014, Henry Saputra henry.sapu...@gmail.com wrote:
 
   The Apache Spark wiki on how to contribute should be a great place to
   start:
  
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
  
   - Henry
  
   On Thu, Aug 21, 2014 at 3:25 AM, Maisnam Ns maisnam...@gmail.com wrote:
Hi,
   
Can someone help me with some links on how to contribute to Spark?
   
Regards
mns
  
  
  
 



Mesos/Spark Deadlock

2014-08-23 Thread Gary Malouf
I just wanted to bring up a significant Mesos/Spark issue that makes the
combo difficult to use for teams larger than 4-5 people.  It's covered in
https://issues.apache.org/jira/browse/MESOS-1688.  My understanding is that
Spark's use of executors in fine-grained mode behaves very differently from
many of the other common Mesos frameworks.


Re: Mesos/Spark Deadlock

2014-08-23 Thread Gary Malouf
Hi Matei,

We have an analytics team that uses the cluster on a daily basis.  They use
two types of 'run modes':

1) For running actual queries, they set spark.executor.memory to
something between 4 and 8 GB of RAM per worker.

2) A shell that takes a minimal amount of memory on workers (128 MB), for
prototyping a larger query.  This lets them avoid taking up RAM on the
cluster when they do not really need it.

We see the deadlocks when there are a few shells open in either mode.  Given
our usage patterns, coarse-grained mode would be a challenge, as we would have
to constantly remind people to kill their shells as soon as their queries
finish.

Am I correct in viewing Mesos in coarse-grained mode as being similar to
Spark Standalone's cpu allocation behavior?




On Sat, Aug 23, 2014 at 7:16 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hey Gary, just as a workaround, note that you can use Mesos in
 coarse-grained mode by setting spark.mesos.coarse=true. Then it will hold
 onto CPUs for the duration of the job.

 Matei

 On August 23, 2014 at 7:57:30 AM, Gary Malouf (malouf.g...@gmail.com)
 wrote:

 I just wanted to bring up a significant Mesos/Spark issue that makes the
 combo difficult to use for teams larger than 4-5 people. It's covered in
 https://issues.apache.org/jira/browse/MESOS-1688. My understanding is
 that
 Spark's use of executors in fine-grained mode behaves very differently from
 many of the other common Mesos frameworks.
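

As a concrete sketch of the workaround Matei describes (assuming it is applied
via spark-defaults.conf; the same property could equally be set on a SparkConf
at runtime, and the spark.cores.max value here is a purely hypothetical cap):

```
# spark-defaults.conf -- hold onto CPUs for the duration of the job
spark.mesos.coarse   true
# optionally cap how many cores a coarse-grained job grabs
spark.cores.max      16
```

The trade-off is the one implied above: coarse-grained mode avoids the
deadlock, but an idle shell keeps its cores until it is killed.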