Is it possible to bound the costs of operations such as flatMap() and collect() based
on the size of the RDDs?
The problem is that the order of records between parts is not preserved, so I
have to do a sortBy afterwards.
Alexander
*From:* Vijayasarathy Kannan [mailto:kvi...@vt.edu]
*Sent:* Wednesday, May 06, 2015 10:38 AM
*To:* user@spark.apache.org
*Subject:* Reading large files
Hi,
Is there a way to read a large file in a parallel/distributed way? I have a
single large binary file which I currently read on the driver program and
then distribute to the executors (using groupBy(), etc.). I want to know if
there's a way to make the executors each read a specific/unique portion of the file.
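One way this is sometimes done, as a minimal sketch: if the file sits on HDFS (or another filesystem every executor can reach) and consists of fixed-size records, sc.binaryRecords lets each executor read its own splits, so the driver never has to hold the data. The 8-byte record length and the Long interpretation below are assumptions, not details from this thread:

import java.nio.ByteBuffer
import org.apache.spark.SparkContext

def readFixedLengthRecords(sc: SparkContext, path: String) = {
  val recordLength = 8                             // assumed record size in bytes
  sc.binaryRecords(path, recordLength)             // RDD[Array[Byte]], one element per record
    .map(bytes => ByteBuffer.wrap(bytes).getLong)  // e.g. treat each record as a Long
}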
Starting the master with /sbin/start-master.sh creates a JVM with only
512MB of memory. How can I change this default amount of memory?
Thanks,
Vijay
(sometimes 1G, sometimes 512M, etc.)
On Mon, May 4, 2015 at 6:57 PM, Mohammed Guller moham...@glassbeam.com
wrote:
Did you confirm through the Spark UI how much memory is getting
allocated to your application on each worker?
Mohammed
*From:* Vijayasarathy Kannan [mailto:kvi...@vt.edu]
*Sent
You can change it by making these changes:
http://spark.apache.org/docs/latest/configuration.html
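For reference, a hedged example of the kind of change that page describes (the 2g/8g values below are illustrative, not recommendations): the heap for the standalone master and worker daemons themselves is controlled by SPARK_DAEMON_MEMORY in conf/spark-env.sh, which is separate from the memory handed out to executors.

# conf/spark-env.sh
export SPARK_DAEMON_MEMORY=2g    # heap for the standalone master/worker daemons (default: 512m)
export SPARK_WORKER_MEMORY=8g    # separate setting: memory a worker can allocate to executors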
On Mon, May 4, 2015 at 2:24 PM, Vijayasarathy Kannan kvi...@vt.edu
wrote:
Starting the master with /sbin/start-master.sh creates a JVM with only
512MB of memory. How can I change this default amount of memory?
Thanks,
Vijay
What is the complexity of transformations and actions in Spark, such as
groupBy(), flatMap(), collect(), etc.?
What attributes (such as the number of partitions) do we need to factor in
while analyzing code that uses these operations?
I am trying to run a Spark application using spark-submit on a cluster
managed by Cloudera Manager. I get the error:
Exception in thread "main" java.io.IOException: Error in creating log
directory: file:/user/spark/applicationHistory//app-20150408094126-0008
Adding the below lines in
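The lines referred to above are cut off in this excerpt. As a hedged guess at the kind of settings typically involved (the paths below are assumptions, not recovered from the original message), the event-log/history directory is set in conf/spark-defaults.conf and must point at a directory that exists and is writable by the submitting user:

spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///user/spark/applicationHistory
spark.history.fs.logDirectory    hdfs:///user/spark/applicationHistory

The file: scheme in the error suggests the path is being resolved on the local filesystem, so creating that local directory or switching to an hdfs:// URI is the usual kind of fix.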
Is it all a single numeric type, e.g. int, short, long, etc.?
If you could post a gist with an example of the kind of file and how it
should look once read in, that would be useful!
-
jeremyfreeman.net
@thefreemanlab
On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:
What are some efficient ways to read a large file into RDDs?
For example, have several executors read a specific/unique portion of the
file and construct RDDs. Is this possible to do in Spark?
Currently, I am doing a line-by-line read of the file at the driver and
constructing the RDD.
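If the file is (or can be treated as) line-oriented text on HDFS or another shared filesystem, a minimal sketch of the distributed alternative to reading at the driver (the path and partition count are made up):

// Assumes an existing SparkContext named sc.
val lines  = sc.textFile("hdfs:///data/large-input.txt", minPartitions = 64)
val parsed = lines.map(_.split('\t'))   // each executor parses only its own partitions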
On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:
Thanks for the reply. Unfortunately, in my case, the binary file is a mix
of short and long integers. Is there any other way that could be of use here?
My current method happens to have
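A minimal sketch for that case, under assumptions the thread does not confirm: every record has the same fixed layout (here one short followed by one long, 10 bytes), the data is big-endian, and the file is on a path every executor can read. With a fixed layout, a record-oriented read still works even though the field types are mixed:

import java.nio.{ByteBuffer, ByteOrder}
import org.apache.spark.SparkContext

case class Rec(s: Short, l: Long)                  // hypothetical record shape

def readMixedRecords(sc: SparkContext, path: String) = {
  val recordLength = 2 + 8                         // one Short + one Long per record (assumed)
  sc.binaryRecords(path, recordLength).map { bytes =>
    val buf = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN)
    Rec(buf.getShort, buf.getLong)
  }
}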
That is failing too, with
sbt.ResolveException: unresolved dependency:
org.apache.spark#spark-network-common_2.10;1.2.1
On Wed, Apr 1, 2015 at 1:24 PM, Marcelo Vanzin van...@cloudera.com wrote:
Try sbt assembly instead.
On Wed, Apr 1, 2015 at 10:09 AM, Vijayasarathy Kannan kvi...@vt.edu
Why do I get "Failed to find Spark assembly JAR. You need to build Spark
before running this program."?
I downloaded spark-1.2.1.tgz from the downloads page and extracted it.
When I do sbt package inside my application, it works fine. But when I
try to run my application, I get the above error. What could be missing?
Hi,
I am encountering the following error with a Spark application.
Exception in thread "main" org.apache.spark.SparkException:
Job aborted due to stage failure:
Serialized task 0:0 was 11257268 bytes, which exceeds max allowed:
spark.akka.frameSize (10485760 bytes) - reserved (204800 bytes).
My current simple.sbt is:
name := "SparkEpiFast"
version := "1.0"
scalaVersion := "2.11.4"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.2.1" % "provided"
libraryDependencies += "org.apache.spark" % "spark-graphx_2.11" % "1.2.1" % "provided"
When I do sbt package, it compiles successfully.
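On the frameSize error above: a hedged sketch of the two usual remedies, raising spark.akka.frameSize (the value is in MB in Spark 1.x; 128 is an arbitrary choice) and, often more importantly, broadcasting large read-only data instead of capturing it in a task closure. All names and values below are illustrative, not from the original post:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("FrameSizeSketch")
  .set("spark.akka.frameSize", "128")          // default is 10 (MB) in Spark 1.x
val sc = new SparkContext(conf)

// Broadcast once per executor instead of serializing this map into every task.
val bigTable   = (1 to 100000).map(i => i -> i.toString).toMap
val bigTableBc = sc.broadcast(bigTable)

val hits = sc.parallelize(1 to 1000)
  .map(i => bigTableBc.value.getOrElse(i, ""))
  .count()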
Hi,
I am doing a groupBy on an EdgeRDD like this,
val groupedEdges = graph.edges.groupBy[VertexId](func0)
while (true) {
  val info = groupedEdges.flatMap(func1).collect.foreach(func2)
}
The groupBy distributes the data to different executors on different nodes
in the cluster.
Given a key K (a
Hi,
What pseudo-random-number generator does scala.util.Random use?
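As a small aside that may help: scala.util.Random is a thin wrapper around java.util.Random, a 48-bit linear congruential generator, and it can be constructed over any java.util.Random instance if different properties are needed, e.g.:

val lcg    = new scala.util.Random()                                   // wraps java.util.Random (LCG)
val strong = new scala.util.Random(new java.security.SecureRandom())  // swap in another generator
val x = strong.nextDouble()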
As you suggested, I tried saving the grouped RDD and persisting it in
memory before the iterations begin. The performance seems to be much better
now.
My previous comment that the run times doubled was based on a wrong observation.
Thanks.
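A minimal sketch of that change, reusing the names from the snippet quoted earlier (graph, func0, func1, func2) and adding an assumed termination check, since while(true) never exits:

import org.apache.spark.graphx.VertexId
import org.apache.spark.storage.StorageLevel

// Persist the grouped RDD once so every iteration reuses it instead of redoing the groupBy.
val groupedEdges = graph.edges.groupBy[VertexId](func0).persist(StorageLevel.MEMORY_ONLY)

var done = false
while (!done) {
  val info = groupedEdges.flatMap(func1).collect()
  info.foreach(func2)
  done = shouldStop(info)   // hypothetical stopping condition
}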
On Fri, Feb 27, 2015 at 10:27 AM, Vijayasarathy Kannan kvi
Hi,
I have the following use case.
(1) I have an RDD of edges of a graph (say R).
(2) do a groupBy on R (by, say, source vertex) and call a function F on each
group.
(3) collect the results from the Fs and do some computation
(4) repeat the above steps until some criterion is met
In (2), the groups
I am a beginner to Scala/Spark. Could you please elaborate on how to make
an RDD of the results of func() and collect them?
On Tue, Feb 24, 2015 at 2:27 PM, Sean Owen so...@cloudera.com wrote:
They aren't the same 'lst'. One is on your driver. It gets copied to
executors when the tasks are executed.
...flatMap(func)
This returns an RDD that basically has the list you are trying to
build, I believe.
You can collect() to the driver but beware if it is a huge data set.
If you really just mean to count the results, you can count() instead
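A minimal sketch of that suggestion (the RDD and function names are placeholders, not the original code):

val results = edges.flatMap(func)        // the "list" lives as an RDD on the executors
val total   = results.count()            // aggregate on the cluster; nothing large hits the driver
val sample  = results.take(100)          // or pull back only a bounded number of elements
// results.collect() brings everything to the driver -- only safe for small outputs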
On Tue, Feb 24, 2015 at 7:35 PM, Vijayasarathy Kannan kvi
You are right. I was looking at the wrong logs. I ran it on my local
machine and saw that the println actually wrote the vertexIds. I was then
able to find the same in the executors' logs in the remote machine.
Thanks for the clarification.
On Mon, Feb 23, 2015 at 2:00 PM, Sean Owen
wrote:
Hi Kannan,
I am not sure I have understood what your question is exactly, but maybe
the reduceByKey or reduceByKeyLocally functionality is better suited to your needs.
Best,
Yifan LI
On 17 Feb 2015, at 17:37, Vijayasarathy Kannan kvi...@vt.edu wrote:
Hi,
I am working on a Spark application that processes graphs and I am trying
to do the following.
- group the vertices (key: vertex, value: set of its outgoing edges)
- distribute each key to separate processes and process them (like a mapper)
- reduce the results back at the main process
Does
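A minimal sketch of the reduceByKey suggestion above, reusing graph from the earlier snippet and using an out-degree count as an illustrative stand-in for the real per-vertex computation (which the thread does not show):

import org.apache.spark.SparkContext._   // needed for reduceByKey on pair RDDs in Spark 1.x

val perSourceCounts = graph.edges
  .map(e => (e.srcId, 1L))     // key each edge by its source vertex
  .reduceByKey(_ + _)          // partial sums are combined map-side before the shuffle, unlike groupBy

// reduceByKeyLocally(_ + _) would instead return the result as a Map on the driver.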