Question about master memory requirement and GraphX pagerank performance !

2015-07-07 Thread Khaled Ammar
hours. There is one that was taking 4+ hours, and its input is 400+ GB. I must be doing something wrong. Any comments? -- Thanks, -Khaled Ammar www.khaledammar.com

Re: GraphX Synth Benchmark

2015-07-09 Thread Khaled Ammar
Hi, I am not a Spark expert, but I found that passing a small partitions value might help. Try the option "--numEPart=$partitions", where partitions=3 (number of workers) or at most 3*40 (total number of cores). Thanks, -Khaled On Thu, Jul 9, 2015 at 11:37 AM, AshutoshRaghuvanshi < ashutos

Re: Graphx

2016-03-11 Thread Khaled Ammar
This is an interesting discussion. I have had some success running GraphX on large graphs with more than a billion edges, using clusters of different sizes up to 64 machines. However, the performance goes down when I double the cluster size to reach 128 machines of r3.xlarge. Does anyone have exper

GraphX replication factor

2016-04-05 Thread Khaled Ammar
Hi, I wonder if it is possible to figure out the replication factor used in GraphX partitioning from its log files. -- Thanks, -Khaled
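For reference, the replication factor is commonly defined as the average number of edge partitions in which a vertex appears. Below is a minimal pure-Python sketch of that definition, not GraphX code: the function names, the toy edge list, and the `by_source` placement rule are assumptions made for illustration (GraphX's own partition strategies place edges differently), and it does not address what GraphX writes to its log files.

```python
from collections import defaultdict


def replication_factor(edges, num_parts, part_of_edge):
    """Average number of edge partitions in which each vertex appears."""
    parts_of_vertex = defaultdict(set)
    for src, dst in edges:
        # Each edge lives in exactly one partition; both endpoints
        # must therefore be present (replicated) in that partition.
        p = part_of_edge(src, dst, num_parts)
        parts_of_vertex[src].add(p)
        parts_of_vertex[dst].add(p)
    if not parts_of_vertex:
        return 0.0
    total_copies = sum(len(ps) for ps in parts_of_vertex.values())
    return total_copies / len(parts_of_vertex)


def by_source(src, dst, num_parts):
    # Hypothetical placement rule: assign each edge by its source id.
    return src % num_parts


# Toy edge list, assumed for illustration only.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
rf = replication_factor(edges, 2, by_source)
print(rf)
```

A value of 1.0 means no vertex is replicated; larger values mean more copies of each vertex have to be kept in sync across partitions.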

Re: RDD Partitions not distributed evenly to executors

2016-04-05 Thread Khaled Ammar
I have a similar experience. Using 32 machines, I can see that the number of tasks (partitions) assigned to executors (machines) is not even. Moreover, the distribution changes every stage (iteration). I wonder why Spark needs to move partitions around anyway, should not the scheduler reduce network

Performance issues in SSSP using GraphX

2015-10-30 Thread Khaled Ammar
Hi all, I see an interesting behavior from GraphX while running SSSP. I use the stand-alone mode with 16+1 machines, each with 30GB memory and 4 cores. The dataset is 63GB. However, the input for some stages is huge, about 16 TB! The computation takes a very long time. I stopped it. For your info

Why some executors are lazy?

2015-11-03 Thread Khaled Ammar
Hi, I'm using the most recent Spark version on a standalone setup of 16+1 machines. While running GraphX workloads, I found that some executors are lazy: they *rarely* participate in computation. This causes some other executors to do their work. This behavior is consistent in all iterations and

What does "write time" means exactly in Spark UI?

2015-11-03 Thread Khaled Ammar
Hi, I wonder what "write time" means exactly. I run GraphX workloads and noticed that the main bottleneck in most stages is one or two tasks that take too long in "write time" and delay the whole job. Enabling speculation helps a little, but I am still interested to know how to fix that. I use MEMORY_O

Re: Why some executors are lazy?

2015-11-04 Thread Khaled Ammar
? Thanks, -Khaled On Wed, Nov 4, 2015 at 7:21 AM, Adrian Tanase wrote: > If some of the operations required involve shuffling and partitioning, it > might mean that the data set is skewed to specific partitions which will > create hot spotting on certain executors. > > -adrian > &g

"Master: got disassociated, removing it."

2015-11-05 Thread Khaled Ammar
Hi, I am using GraphX on standalone Spark 1.5.1 in a medium-sized cluster (64+1). I could execute PageRank with a large number of iterations on this cluster. However, when I run SSSP, it always fails at iteration 23 or 24. This always happens after about 11 minutes. Note that PageRank takes more than that

GraphX stopped without finishing and with no ERRORs !

2015-11-18 Thread Khaled Ammar
Hi all, I have a problem running some algorithms on GraphX. Occasionally, it stops running without any errors. The task state is FINISHED, but the executors' state is KILLED. However, I can see that one job is not finished yet. It took too much time (minutes) while every job/iteration was typica

Performance questions regarding Spark 1.3 standalone mode

2015-07-24 Thread Khaled Ammar
Hi all, I have a standalone Spark cluster set up on EC2 machines. I did the setup manually without the ec2 scripts. I have two questions about Spark/GraphX performance: 1) When I run the PageRank example, the storage tab does not show that all RDDs are cached. Only one RDD is 100% cached, but the

Fwd: Performance questions regarding Spark 1.3 standalone mode

2015-07-27 Thread Khaled Ammar
Hi all, I wonder if anyone has an explanation for this behavior. Thank you, -Khaled -- Forwarded message -- From: Khaled Ammar Date: Fri, Jul 24, 2015 at 9:35 AM Subject: Performance questions regarding Spark 1.3 standalone mode To: user@spark.apache.org Hi all, I have a

NaN in GraphX PageRank answer

2015-08-18 Thread Khaled Ammar
Hi all, I was trying to use GraphX to compute PageRank and found that the PageRank value for several vertices is NaN. I am using Spark 1.3. Any idea how to fix that? -- Thanks, -Khaled

Basic GraphX deployment and usage question

2015-03-16 Thread Khaled Ammar
Hi, I'm very new to Spark and GraphX. I downloaded and configured Spark on a cluster, which uses Hadoop 1.x. The master UI shows all workers. The example command "run-example SparkPi" works fine and completes successfully. I'm interested in GraphX. Although the documentation says it is built-in w