Re: Learning GraphX Questions

2015-02-19 Thread Takeshi Yamamuro
Hi,

Vertices are simply hash-partitioned by spark.HashPartitioner, so
you can easily calculate the partition IDs yourself.
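
For example, under that assumption you can predict where a given vertex
lands. This is a sketch of the default HashPartitioner behavior, not a
GraphX API, and it only holds while the vertex RDD keeps its default hash
partitioner:

import org.apache.spark.HashPartitioner
import org.apache.spark.graphx._

// Build a partitioner with the same number of partitions as the vertex RDD.
val partitioner = new HashPartitioner(graph.vertices.partitions.length)
val vid: VertexId = 4L
println(partitioner.getPartition(vid))  // partition id this vertex hashes to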

Also, you can run the following lines to check the IDs:

import org.apache.spark.graphx._

graph.vertices.mapPartitionsWithIndex { (pid, iter) =>
  val vids = Array.newBuilder[VertexId]
  for (d <- iter) vids += d._1
  Iterator((pid, vids.result))
}
.map(d => s"PID:${d._1} IDs:${d._2.toSeq.toString}")
.collect
.foreach(println)
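
On a small test graph this prints one line per partition, e.g. something
like PID:0 IDs:WrappedArray(2, 4); the exact assignment depends on each
vertex ID's hash.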

On Thu, Feb 19, 2015 at 12:31 AM, Matthew Bucci mrbucci...@gmail.com
wrote:

 Thanks for all the responses so far! I have started to understand the
 system more, but I just had another question while I was going along. Is
 there a way to check the individual partitions of an RDD? For example, if I
 had a graph with vertices a,b,c,d and it was split into 2 partitions could
 I check which vertices belonged in partition 1 and partition 2?

 Thank You,
 Matthew Bucci

-- 
---
Takeshi Yamamuro


Re: Learning GraphX Questions

2015-02-18 Thread Matthew Bucci
Thanks for all the responses so far! I have started to understand the
system more, but I just had another question while I was going along. Is
there a way to check the individual partitions of an RDD? For example, if I
had a graph with vertices a,b,c,d and it was split into 2 partitions could
I check which vertices belonged in partition 1 and partition 2?

Thank You,
Matthew Bucci



Learning GraphX Questions

2015-02-13 Thread Matthew Bucci
Hello, 

I was looking at GraphX as I believe it can be useful in my research on
temporal data and I had a number of questions about the system:

1) How do you actually run programs in GraphX? At the moment I've been doing
everything live through the shell, but I'd obviously like to be able to work
on it by writing and running scripts. 

2) Is there a way to check the status of the partitions of a graph? For
example, I want to determine, for starters, whether the number of partitions
requested is always created: if I ask for 8 partitions but only have 4
cores, what happens?

3) Would I be able to partition by vertex instead of edges, even if I had to
write it myself? I know partitioning by edges is favored in a majority of
the cases, but for the sake of research I'd like to be able to do both.

4) Is there a better way to time processes outside of using built-in unix
timing through the logs or something?

Thank you very much for your insight,
Matthew Bucci




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Learning GraphX Questions

2015-02-13 Thread Ankur Dave
At 2015-02-13 12:19:46 -0800, Matthew Bucci mrbucci...@gmail.com wrote:
 1) How do you actually run programs in GraphX? At the moment I've been doing
 everything live through the shell, but I'd obviously like to be able to work
 on it by writing and running scripts.

You can create your own projects that build against Spark and GraphX through a 
Maven dependency [1], then run those applications using the bin/spark-submit 
script included with Spark [2].

These guides assume you already know how to do this using your preferred build 
tool (SBT or Maven). In short, here's how to do it with SBT:

1. Install SBT locally (`brew install sbt` on OS X).

2. Inside your project directory, create a build.sbt file listing Spark and 
GraphX as dependencies, as in [3] (a minimal sketch follows the references 
below).

3. Run `sbt package` in a shell.

4. Pass the JAR in your_project_dir/target/scala-2.10/ to bin/spark-submit.

[1] 
http://spark.apache.org/docs/latest/programming-guide.html#linking-with-spark
[2] http://spark.apache.org/docs/latest/submitting-applications.html
[3] https://gist.github.com/ankurdave/1fb7234d8affb3a2e4f4
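
For reference, a minimal build.sbt in the spirit of [3] might look like this
(the version numbers are assumptions; match them to your Spark installation):

name := "graphx-example"

version := "0.1"

scalaVersion := "2.10.4"

// "provided" because bin/spark-submit supplies Spark itself at runtime.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % "1.2.1" % "provided",
  "org.apache.spark" %% "spark-graphx" % "1.2.1" % "provided"
)

Step 4 would then be something like the following, where YourMainClass is a
placeholder for your application's entry point:

bin/spark-submit --class YourMainClass target/scala-2.10/graphx-example_2.10-0.1.jar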

 2) Is there a way to check the status of the partitions of a graph? For
 example, I want to determine, for starters, whether the number of partitions
 requested is always created: if I ask for 8 partitions but only have 4
 cores, what happens?

You can look at `graph.vertices` and `graph.edges`, which are both RDDs, so you 
can, for example, inspect `graph.vertices.partitions`.
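
For example, in the shell:

println(graph.vertices.partitions.length)  // number of vertex partitions
println(graph.edges.partitions.length)     // number of edge partitions

As for requesting 8 partitions on 4 cores: you still get 8 partitions; Spark
simply runs the resulting 8 tasks in waves across the 4 available cores.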

 3) Would I be able to partition by vertex instead of edges, even if I had to
 write it myself? I know partitioning by edges is favored in a majority of
 the cases, but for the sake of research I'd like to be able to do both.

If you pass PartitionStrategy.EdgePartition1D, this will partition edges by 
their source vertices, so all edges with the same source will be 
co-partitioned, and the communication pattern will be similar to 
vertex-partitioned (edge-cut) systems like Giraph.
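
For example, a minimal sketch using the standard partitionBy entry point:

import org.apache.spark.graphx._

// Repartition edges so that edges sharing a source vertex are co-located.
val bySource = graph.partitionBy(PartitionStrategy.EdgePartition1D)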

 4) Is there a better way to time processes outside of using built-in unix
 timing through the logs or something?

I think the options are Unix timing, log file timestamp parsing, looking at the 
web UI, or writing timing code within your program (System.currentTimeMillis 
and System.nanoTime).
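
For the last option, a small helper is often enough. This is a hypothetical
time() function, not a Spark API; note the action at the end, since Spark is
lazy and you need one to force the computation you want to measure:

def time[A](label: String)(body: => A): A = {
  val start = System.nanoTime
  val result = body
  println(s"$label took ${(System.nanoTime - start) / 1e9} s")
  result
}

// e.g. time a PageRank run; count is the action that forces evaluation
time("pagerank") { graph.pageRank(0.001).vertices.count }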

Ankur

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org