Costs of transformations

2015-06-09 Thread Vijayasarathy Kannan
Is it possible to bound the costs of operations such as flatMap() and collect() based on the size of the RDDs?

Re: Reading large files

2015-05-06 Thread Vijayasarathy Kannan
is that the order of records between parts is not preserved, so I have to do sortBy afterwards. Alexander On Wednesday, May 06, 2015 at 10:38 AM, Vijayasarathy Kannan kvi...@vt.edu wrote: Hi, Is there a way to read

Reading large files

2015-05-06 Thread Vijayasarathy Kannan
Hi, Is there a way to read a large file in a parallel/distributed way? I have a single large binary file which I currently read on the driver program and then distribute to executors (using groupBy(), etc.). I want to know if there's a way to make the executors each read a specific/unique
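
If the binary file has fixed-length records, Spark can let each executor read its own splits of the file instead of funneling everything through the driver. A rough sketch, not from the thread; the HDFS path, the 8-byte record length, and reading each record as a single Long are assumptions:

    import java.nio.ByteBuffer
    import org.apache.spark.{SparkConf, SparkContext}

    object ReadBinaryInParallel {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("read-binary-in-parallel"))

        // Each executor reads its own splits of the file; nothing is read on the driver.
        // The second argument is the fixed record size in bytes (8 here is an assumption).
        val records = sc.binaryRecords("hdfs:///data/large.bin", 8)

        // Parse each 8-byte record, here as a single big-endian Long.
        val values = records.map(bytes => ByteBuffer.wrap(bytes).getLong)
        println(values.count())

        sc.stop()
      }
    }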

Spark JVM default memory

2015-05-04 Thread Vijayasarathy Kannan
Starting the master with /sbin/start-master.sh creates a JVM with only 512MB of memory. How do I change this default amount of memory? Thanks, Vijay
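
For what it's worth, the daemon started by sbin/start-master.sh and the application JVMs are sized separately. A sketch of the application-side settings (the 2g/4g values are just examples); the master/worker daemon heap itself is controlled by SPARK_DAEMON_MEMORY in conf/spark-env.sh:

    import org.apache.spark.{SparkConf, SparkContext}

    // Application memory is set per application, via SparkConf or spark-submit flags
    // (--driver-memory, --executor-memory). The standalone master/worker daemons are
    // separate JVMs whose heap comes from SPARK_DAEMON_MEMORY in conf/spark-env.sh.
    val conf = new SparkConf()
      .setAppName("memory-settings")
      .set("spark.executor.memory", "4g") // per-executor heap (example value)
      .set("spark.driver.memory", "2g")   // only takes effect if set before the driver JVM starts
    val sc = new SparkContext(conf)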

Re: Spark JVM default memory

2015-05-04 Thread Vijayasarathy Kannan
(sometimes 1G, sometimes 512M, etc.) On Mon, May 4, 2015 at 6:57 PM, Mohammed Guller moham...@glassbeam.com wrote: Did you confirm through the Spark UI how much memory is getting allocated to your application on each worker? Mohammed

Re: Spark JVM default memory

2015-05-04 Thread Vijayasarathy Kannan
these changes: http://spark.apache.org/docs/latest/configuration.html On Mon, May 4, 2015 at 2:24 PM, Vijayasarathy Kannan kvi...@vt.edu wrote: Starting the master with /sbin/start-master.sh creates a JVM with only 512MB of memory. How to change this default amount of memory? Thanks, Vijay

Complexity of transformations in Spark

2015-04-26 Thread Vijayasarathy Kannan
What is the complexity of transformations and actions in Spark, such as groupBy(), flatMap(), collect(), etc.? What attributes (such as the number of partitions) do we need to factor in while analyzing code that uses these operations?
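
Cost mostly hinges on whether an operation shuffles and on how the data is partitioned, so it helps to be able to inspect those attributes. A small sketch, not from the thread; the file path and the 64-partition hint are made up:

    import org.apache.spark.SparkContext

    // map/flatMap are narrow (per-partition, no shuffle); groupBy/groupByKey shuffle all
    // values for a key into one partition; collect() additionally moves data to the driver.
    def inspect(sc: SparkContext): Unit = {
      val edges = sc.textFile("hdfs:///data/edges.txt", 64)
      println(edges.partitions.length)   // number of partitions, i.e. tasks per stage
      val grouped = edges.map(line => (line.split("\t")(0), line)).groupByKey()
      println(grouped.partitioner)       // the HashPartitioner introduced by the shuffle
    }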

Error running Spark on Cloudera

2015-04-08 Thread Vijayasarathy Kannan
I am trying to run a Spark application using spark-submit on a cluster using Cloudera manager. I get the error Exception in thread main java.io.IOException: Error in creating log directory: file:/user/spark/applicationHistory//app-20150408094126-0008 Adding the below lines in
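
The directory in that message comes from the event-log settings, so one thing to check (an assumption on my part, not the thread's resolution) is where spark.eventLog.dir points and whether that path exists and is writable:

    import org.apache.spark.SparkConf

    // If spark.eventLog.dir points at a local "file:" path that does not exist on the
    // submitting node, application-history logging can fail with this kind of IOException.
    // Pointing it at an existing, writable location (HDFS here) is one way out.
    val conf = new SparkConf()
      .setAppName("cloudera-app")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///user/spark/applicationHistory")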

Re: Reading a large file (binary) into RDD

2015-04-03 Thread Vijayasarathy Kannan
, short, long, etc.? If you could post a gist with an example of the kind of file and how it should look once read in that would be useful! - jeremyfreeman.net @thefreemanlab On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote: Thanks for the reply

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Vijayasarathy Kannan
On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote: What are some efficient ways to read a large file into RDDs? For example, have several executors read a specific/unique portion of the file and construct RDDs. Is this possible to do in Spark? Currently, I am doing a line

Reading a large file (binary) into RDD

2015-04-02 Thread Vijayasarathy Kannan
What are some efficient ways to read a large file into RDDs? For example, have several executors read a specific/unique portion of the file and construct RDDs. Is this possible to do in Spark? Currently, I am doing a line-by-line read of the file at the driver and constructing the RDD.

Re: Reading a large file (binary) into RDD

2015-04-02 Thread Vijayasarathy Kannan
! - jeremyfreeman.net @thefreemanlab On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote: Thanks for the reply. Unfortunately, in my case, the binary file is a mix of short and long integers. Is there any other way that could be of use here? My current method happens to have
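
If the shorts and longs follow a fixed per-record layout, binaryRecords can still carve the file up for the executors and a ByteBuffer can pull the fields apart. A sketch under that assumption; the 2-byte-short-plus-8-byte-long layout, the path, and the byte order are all guesses:

    import java.nio.ByteBuffer
    import org.apache.spark.SparkContext

    case class Record(id: Short, value: Long)

    // Hypothetical layout: each record is a 2-byte short followed by an 8-byte long,
    // i.e. a fixed 10 bytes, so every record lands whole inside one Array[Byte].
    def readMixed(sc: SparkContext) =
      sc.binaryRecords("hdfs:///data/mixed.bin", 10).map { bytes =>
        val buf = ByteBuffer.wrap(bytes)  // big-endian by default; call buf.order(...) if not
        Record(buf.getShort(), buf.getLong())
      }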

Re: Unable to run Spark application

2015-04-01 Thread Vijayasarathy Kannan
That is failing too, with sbt.ResolveException: unresolved dependency: org.apache.spark#spark-network-common_2.10;1.2.1 On Wed, Apr 1, 2015 at 1:24 PM, Marcelo Vanzin van...@cloudera.com wrote: Try sbt assembly instead. On Wed, Apr 1, 2015 at 10:09 AM, Vijayasarathy Kannan kvi...@vt.edu

Unable to run Spark application

2015-04-01 Thread Vijayasarathy Kannan
Why do I get "Failed to find Spark assembly JAR. You need to build Spark before running this program."? I downloaded spark-1.2.1.tgz from the downloads page and extracted it. When I do sbt package inside my application, it works fine. But when I try to run my application, I get the above

Re: Unable to run Spark application

2015-04-01 Thread Vijayasarathy Kannan
be missing? On Wed, Apr 1, 2015 at 1:32 PM, Vijayasarathy Kannan kvi...@vt.edu wrote: That is failing too, with sbt.ResolveException: unresolved dependency: org.apache.spark#spark-network-common_2.10;1.2.1 On Wed, Apr 1, 2015 at 1:24 PM, Marcelo Vanzin van...@cloudera.com wrote: Try sbt

Problems with spark.akka.frameSize

2015-03-19 Thread Vijayasarathy Kannan
Hi, I am encountering the following error with a Spark application. Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 0:0 was 11257268 bytes, which exceeds max allowed: spark.akka.frameSize (10485760 bytes) - reserved (204800 bytes).
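
Two usual ways out, sketched below and not taken from the thread: raise spark.akka.frameSize (the value is in MB), or, more often, stop capturing the large driver-side object in the task closure and broadcast it instead. bigLookup is a made-up stand-in for whatever is inflating the task:

    import org.apache.spark.{SparkConf, SparkContext}

    // Option 1: raise the frame size; the setting is in megabytes.
    val conf = new SparkConf().setAppName("framesize").set("spark.akka.frameSize", "128")
    val sc = new SparkContext(conf)

    // Option 2 (usually the real fix): an 11 MB serialized task generally means a large
    // driver-side object got pulled into the closure; ship it once as a broadcast variable.
    val bigLookup: Map[Long, String] = Map(1L -> "a")  // stand-in for the large object
    val bcast = sc.broadcast(bigLookup)
    sc.parallelize(1L to 1000L)
      .map(k => bcast.value.getOrElse(k, "missing"))
      .count()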

Issues with SBT and Spark

2015-03-19 Thread Vijayasarathy Kannan
My current simple.sbt is: name := "SparkEpiFast" version := "1.0" scalaVersion := "2.11.4" libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.2.1" % "provided" libraryDependencies += "org.apache.spark" % "spark-graphx_2.11" % "1.2.1" % "provided" When I do sbt package, it compiles successfully.
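
A build file sketch (not the thread's resolution) using the %% form, which makes sbt pick the artifact matching scalaVersion so the two cannot silently drift apart:

    // simple.sbt -- a sketch, not the thread's resolution.
    name := "SparkEpiFast"

    version := "1.0"

    scalaVersion := "2.11.4"

    // %% appends the Scala binary version (_2.11) automatically, keeping the
    // dependency in lockstep with scalaVersion above.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"   % "1.2.1" % "provided",
      "org.apache.spark" %% "spark-graphx" % "1.2.1" % "provided"
    )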

Question on RDD groupBy and executors

2015-03-17 Thread Vijayasarathy Kannan
Hi, I am doing a groupBy on an EdgeRDD like this, val groupedEdges = graph.edges.groupBy[VertexId](func0) while(true) { val info = groupedEdges.flatMap(func1).collect.foreach(func2) } The groupBy distributes the data to different executors on different nodes in the cluster. Given a key K (a
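
Which executor ends up hosting a given key is the scheduler's call, but the key-to-partition mapping can at least be pinned down with an explicit partitioner. A sketch; the 64-partition HashPartitioner and the Int edge/vertex attribute types are assumptions:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.graphx.Graph

    // groupBy with an explicit partitioner makes key -> partition deterministic
    // (hash of the source vertex id modulo 64); the scheduler still decides which
    // executor runs each partition's tasks.
    def groupBySource(graph: Graph[Int, Int]) =
      graph.edges.groupBy(e => e.srcId, new HashPartitioner(64))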

PRNG in Scala

2015-03-03 Thread Vijayasarathy Kannan
Hi, What pseudo-random-number generator does scala.util.Random use?
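
As far as I know, scala.util.Random simply delegates to java.util.Random, i.e. a 48-bit linear congruential generator; when that is not good enough it can be constructed around SecureRandom. A quick sketch:

    // scala.util.Random wraps java.util.Random (a 48-bit LCG) by default.
    val fast = new scala.util.Random(42L)                                // seeded LCG
    val strong = new scala.util.Random(new java.security.SecureRandom()) // crypto-grade source
    println(s"${fast.nextInt(100)} ${strong.nextDouble()}")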

Re: Iterating on RDDs

2015-02-27 Thread Vijayasarathy Kannan
As you suggested, I tried saving the grouped RDD and persisting it in memory before the iterations begin. The performance seems to be much better now. My previous comment that the run times doubled was based on a wrong observation. Thanks. On Fri, Feb 27, 2015 at 10:27 AM, Vijayasarathy Kannan kvi

Re: Iterating on RDDs

2015-02-27 Thread Vijayasarathy Kannan
, Vijayasarathy Kannan kvi...@vt.edu wrote: Hi, I have the following use case. (1) I have an RDD of edges of a graph (say R). (2) do a groupBy on R (by say source vertex) and call a function F on each group. (3) collect the results from Fs and do some computation (4) repeat the above steps until

Iterating on RDDs

2015-02-26 Thread Vijayasarathy Kannan
Hi, I have the following use case. (1) I have an RDD of edges of a graph (say R). (2) do a groupBy on R (by say source vertex) and call a function F on each group. (3) collect the results from the Fs and do some computation. (4) repeat the above steps until some criterion is met. In (2), the groups
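
Per the follow-up higher up in this thread, persisting the grouped RDD before the loop removed the slowdown; without it, the groupBy shuffle is recomputed on every iteration. A rough sketch, with edgeRdd, F, and converged standing in for the thread's unnamed pieces:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // Persist the grouped edges once; the loop then reuses the cached groups
    // instead of redoing the groupBy shuffle on every iteration.
    def iterate(edgeRdd: RDD[(Long, Long)],
                F: (Long, Iterable[(Long, Long)]) => Double,
                converged: Array[Double] => Boolean): Unit = {
      val grouped = edgeRdd.groupBy(_._1).persist(StorageLevel.MEMORY_ONLY) // group by source vertex
      var done = false
      while (!done) {
        val results = grouped.map { case (src, edges) => F(src, edges) }.collect()
        done = converged(results)                                           // driver-side step (3)
      }
      grouped.unpersist()
    }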

Re: Not able to update collections

2015-02-24 Thread Vijayasarathy Kannan
I am a beginner to Scala/Spark. Could you please elaborate on how to make an RDD of the results of func() and collect it? On Tue, Feb 24, 2015 at 2:27 PM, Sean Owen so...@cloudera.com wrote: They aren't the same 'lst'. One is on your driver. It gets copied to executors when the tasks are executed.

Re: Not able to update collections

2015-02-24 Thread Vijayasarathy Kannan
...flatMap(func) This returns an RDD that basically has the list you are trying to build, I believe. You can collect() to the driver but beware if it is a huge data set. If you really just mean to count the results, you can count() instead On Tue, Feb 24, 2015 at 7:35 PM, Vijayasarathy Kannan kvi
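
A concrete version of what the reply describes; the lines RDD and the split on spaces are placeholder data:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("collections"))
    val lines = sc.parallelize(Seq("a b", "c", "d e f"))

    // Mutating a driver-side ListBuffer inside foreach/map only updates the executors'
    // copies of it. Build the values with flatMap instead, then collect or count.
    val words = lines.flatMap(_.split(" "))  // the RDD holding what the list was meant to hold
    val onDriver = words.collect()           // fine for small results; huge ones will not fit
    println(onDriver.mkString(", "))
    println(words.count())                   // or just count on the cluster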

Re: RDD groupBy

2015-02-23 Thread Vijayasarathy Kannan
You are right. I was looking at the wrong logs. I ran it on my local machine and saw that the println actually wrote the vertexIds. I was then able to find the same in the executors' logs in the remote machine. Thanks for the clarification. On Mon, Feb 23, 2015 at 2:00 PM, Sean Owen

Re: Processing graphs

2015-02-18 Thread Vijayasarathy Kannan
wrote: Hi Kannan, I am not sure I have understood what your question is exactly, but maybe the reduceByKey or reduceByKeyLocally functionality is better suited to your need. Best, Yifan LI On 17 Feb 2015, at 17:37, Vijayasarathy Kannan kvi...@vt.edu wrote: Hi, I am working on a Spark

Processing graphs

2015-02-17 Thread Vijayasarathy Kannan
Hi, I am working on a Spark application that processes graphs and I am trying to do the following: - group the vertices (key: a vertex, value: the set of its outgoing edges) - distribute each key to a separate process and process it (like a mapper) - reduce the results back at the main process. Does
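
Along the lines of the reply above, a sketch of the group/process/reduce-back shape with reduceByKey; the (source, destination) pair encoding and the out-degree count are just placeholders for the real per-vertex processing:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("process-graph"))
    val edges = sc.parallelize(Seq((1L, 2L), (1L, 3L), (2L, 3L)))  // (source, destination)

    // "Mapper" side: key by source vertex and reduce per key on the executors.
    val perVertex = edges
      .map { case (src, _) => (src, 1L) }
      .reduceByKey(_ + _)               // combines map-side before shuffling

    // "Reduce back at the main process": only the per-vertex results reach the driver.
    val atDriver = perVertex.collectAsMap()
    println(atDriver)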