On 04/12/2014 06:35 PM, Xiaoli Li wrote:
Hi Guillaume,
This sounds like a good idea to me. I am a newbie here. Could
you further explain how you will determine which clusters to
keep? According to the distance between each
@YouPeng, @Aaron
many thanks for the memory-setting hint.
That solved the issue, just increased it to the default value of 512MB
thanks, Gerd
On 14 April 2014 03:22, YouPeng Yang yypvsxf19870...@gmail.com wrote:
Hi
The 512MB is the default amount of memory that each executor needs, and
For starters, caching may or may not be persisted on disk, but checkpointing will be.
Also, caching is generic; checkpointing is specific to streaming.
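For illustration, here is a minimal sketch of the difference (just a sketch, assuming a local master and a writable /tmp/checkpoints directory):

import org.apache.spark.{SparkConf, SparkContext}

// cache() keeps the RDD in executor memory and recomputes lost partitions from lineage;
// checkpoint() writes the RDD to reliable storage and truncates the lineage.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("cache-vs-checkpoint"))
sc.setCheckpointDir("/tmp/checkpoints")

val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.cache()        // in-memory only
rdd.checkpoint()   // persisted under the checkpoint dir once an action runs

rdd.count()        // the first action materializes both the cache and the checkpoint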
On Apr 14, 2014 7:51 AM, David Thomas dt5434...@gmail.com wrote:
What is the difference between checkpointing and caching an RDD?
A lot of your time is being spent in garbage collection (second image).
Maybe your dataset doesn't easily fit into memory? Can you reduce the
number of new objects created in myFun?
How big are your heap sizes?
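For example (just a sketch with made-up data, spark-shell sc assumed): if the per-record work in myFun allocates something expensive, creating it once per partition instead of once per record keeps the garbage collector much quieter.

import java.text.SimpleDateFormat

val input = sc.parallelize(Seq("2014-04-14", "2014-04-15"))

val parsed = input.mapPartitions { iter =>
  val fmt = new SimpleDateFormat("yyyy-MM-dd")   // created once per partition, reused for every record
  iter.map(s => fmt.parse(s).getTime)
}
parsed.count()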
Another observation is that in the 4th image some of your RDDs are massive
and some
Hi, I am trying to cache 2 GB of data and to implement the following procedure.
In order to cache it I did as follows. Is it necessary to cache rdd2 since
rdd1 is already cached?
rdd1 = textFile(hdfs...).cache()
rdd2 = rdd1.filter(userDefinedFunc1).cache()
rdd3 =
I have a fairly large job (5e9 records, ~1600 partitions), wherein on a given
stage it looks like for the first half of the tasks everything runs in
process_local mode at ~10s/partition. Then, from halfway through,
everything starts running in node_local mode and takes 10x as long or more.
I
Hi Praveen
What is your config for spark.local.dir?
Do all your workers have this dir, and do all workers have the right permissions on this
dir?
I think this is the reason for your error.
Wisely Chen
On Mon, Apr 14, 2014 at 9:29 PM, Praveen R prav...@sigmoidanalytics.com wrote:
Had the error below.
Configuration comes from spark-ec2 setup script, sets spark.local.dir to
use /mnt/spark, /mnt2/spark.
The setup actually worked for quite some time, and then on one of the nodes there
were some disk errors such as
mv: cannot remove
`/mnt2/spark/spark-local-20140409182103-c775/09/shuffle_1_248_0': Read-only
It seems you can imitate RDD.top()'s implementation: for each partition,
you get the number of records and the total sum of the keys, and in the final
result handler you add all the sums together and add the record counts
together; then you can get the mean (the arithmetic mean, I mean).
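Something like this sketch (spark-shell sc assumed; for simplicity the values being averaged are just doubles):

val data = sc.parallelize(Seq(3.0, 5.0, 10.0))

val (totalSum, totalCount) = data
  .mapPartitions { iter =>
    var sum = 0.0
    var count = 0L
    iter.foreach { v => sum += v; count += 1 }
    Iterator((sum, count))                          // one (sum, count) partial per partition
  }
  .reduce { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }

val mean = if (totalCount > 0) totalSum / totalCount else 0.0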
On Tue, Apr
Can you try adding this to your spark-env file and sync to all hosts
export MASTER=spark://hadoop-pg-5.cluster:7077
On Sat, Apr 12, 2014 at 6:50 PM, ge ko koenig@gmail.com wrote:
Hi,
I'm starting to use Spark and have installed Spark within CDH5 using
ClouderaManager.
I set up one
I haven't tried it before, but it seems Ganglia could be the tool.
On Wed, Apr 2, 2014 at 6:40 PM, yxzhao yxz...@ualr.edu wrote:
Hi All,
I am interested in measuring the total network I/O, CPU and memory
consumed by a Spark job. I tried to find the related information in the logs and
Web UI. But
Has there been any thought to adding a tail() method to RDD? It would
be really handy to skip over the first item in an RDD when it contains
header information. Even better would be a drop(int) function that
would allow you to skip over several lines of header information. Our
attempts to
I am confused about process local and node local, too.
In my current understanding of Spark, one application typically only has
one executor on one node. However, node local means your data is on the
same host, but in a different executor.
This further means node local is the same as
We have similar needs but IIRC, I came to the conclusion that this would
only work on ordered RDDs, and then you would still have to figure out
which partition is the first one. I ended up deciding it would be best to
just drop the header lines from a Scala iterator before creating an RDD
based on
Hi to all,
in my application I read objects that are not serializable, and I cannot
modify the sources.
So I tried to do a workaround creating a dummy class that extends the
unmodifiable one but implements serializable.
All attributes of the parent class are Lists of objects (some of them are
Hi Ian,
When you run your packaged application, are you adding its jar file to
the SparkContext (by calling the addJar() method)?
That will distribute the code to all the worker nodes. The failure
you're seeing seems to indicate the worker nodes do not have access to
your code.
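For reference, a rough sketch of the two ways to do it (paths here are just placeholders):

import org.apache.spark.SparkContext

// Pass the jar list to the SparkContext constructor...
val sc = new SparkContext(
  "spark://masternodeip:7077",
  "Simple App",
  "/usr/local/pkg/spark",
  List("target/scala-2.10/simple-project_2.10-1.0.jar"))

// ...or add it after the context is created.
sc.addJar("target/scala-2.10/simple-project_2.10-1.0.jar")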
On Mon, Apr 14,
Hi Joe,
If you cache rdd1 but not rdd2, any time you need rdd2's result, it
will have to be computed. It will use rdd1's cached data, but it will
have to compute its result again.
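In other words (a small sketch, with a made-up HDFS path and filter):

val rdd1 = sc.textFile("hdfs:///some/path").cache()
val rdd2 = rdd1.filter(_.contains("error"))   // not cached

rdd2.count()   // materializes rdd1's cache, then runs the filter
rdd2.count()   // runs the filter again, but starts from the cached rdd1 instead of HDFS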
On Mon, Apr 14, 2014 at 5:32 AM, Joe L selme...@yahoo.com wrote:
Hi I am trying to cache 2Gbyte data and to
Hi Marcelo, thanks for answering.
That didn't seem to help. I have the following now:
val sc = new SparkContext("spark://masternodeip:7077",
  "Simple App", "/usr/local/pkg/spark",
  List("target/scala-2.10/simple-project_2.10-1.0.jar"))
Hi Ian,
On Mon, Apr 14, 2014 at 11:30 AM, Ian Bonnycastle ibo...@gmail.com wrote:
val sc = new SparkContext("spark://masternodeip:7077",
  "Simple App", "/usr/local/pkg/spark",
  List("target/scala-2.10/simple-project_2.10-1.0.jar"))
Hmmm... does
Hi Marcelo,
Changing it to null didn't make any difference at all. /usr/local/pkg/spark
is also on all the nodes... it has to be in order to get all the nodes up
and running in the cluster. Also, I'm confused by what you mean with
"That's most probably the class that implements the closure you're
I'm looking at the Tuning Guide suggestion to use Kryo instead of default
serialization. My questions:
Does pyspark use Java serialization by default, as Scala spark does? If
so, then...
can I use Kryo with pyspark instead? The instructions say I should
register my classes with the Kryo
Thanks Aaron.
From: Aaron Davidson ilike...@gmail.com
Reply-To: user@spark.apache.org
Date: Monday, April 14, 2014 at 10:30 AM
To: user@spark.apache.org
Subject: Re: Spark resilience
Master and slave are somewhat overloaded terms in the Spark ecosystem (see
the glossary:
Has anyone used Cython closures with Spark? We have a large investment in
Python code that we don't want to port to Scala. Curious about any
performance issues with the interop between the Scala engine and the Cython
closures. I believe it is sockets on the driver and pipe on the executors?
You can use mapPartitionsWithIndex and look at the partition index (0 will be
the first partition) to decide whether to skip the first line.
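For example (a quick sketch, spark-shell sc assumed and the path is a placeholder):

val lines = sc.textFile("hdfs:///data/with_header.csv")

val withoutHeader = lines.mapPartitionsWithIndex { (index, iter) =>
  if (index == 0) iter.drop(1) else iter   // the header is the first line of partition 0
}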
Matei
On Apr 14, 2014, at 8:50 AM, Ethan Jewett esjew...@gmail.com wrote:
We have similar needs but IIRC, I came to the conclusion that this would only
Could you please elaborate on how drivers can be restarted automatically?
Thanks,
On Mon, Apr 14, 2014 at 10:30 AM, Aaron Davidson ilike...@gmail.com wrote:
Master and slave are somewhat overloaded terms in the Spark ecosystem (see
the glossary:
Thanks Eugen for the reply. Could you explain to me why I have the
problem? Why doesn't my serialization work?
On Apr 14, 2014 6:40 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote:
Hi,
as an easy workaround you can enable Kryo serialization
http://spark.apache.org/docs/latest/configuration.html
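For reference, a minimal sketch of that Kryo setup (MyUnserializableClass is just a stand-in for the third-party class in question):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoRegistrator

class MyUnserializableClass(val id: Int)          // stand-in for the class you cannot modify

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyUnserializableClass])
  }
}

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")
val sc = new SparkContext(conf)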
Hi,
Using the spark-shell, I can't use sc.parallelize to get an RDD.
Looks like a bug.
scala> sc.parallelize(Array("a", "s", "d"))
java.lang.NullPointerException
  at <init>(<console>:17)
  at <init>(<console>:22)
  at <init>(<console>:24)
  at <init>(<console>:26)
  at <init>(<console>:28)
  at <init>(<console>:30)
Sure. As you have pointed out, those classes don't implement Serializable, and
Spark uses Java serialization by default (when you do collect, the data from
the workers will be serialized, collected by the driver and then
deserialized on the driver side). Kryo (as most other decent serialization
libs)
OK, that's fair enough. But why do things work up to the collect? During map
and filter, are objects not serialized?
On Apr 15, 2014 12:31 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote:
Sure. As you have pointed out, those classes don't implement Serializable, and
Spark uses Java serialization by default
Does this happen at a low event rate for that topic as well, or only at a
high volume rate?
TD
On Wed, Apr 9, 2014 at 11:24 PM, gaganbm gagan.mis...@gmail.com wrote:
I am really at my wits' end here.
I have different Streaming contexts, let's say 2, both listening to the same
Kafka topics. I
Nope, those operations are lazy, meaning they will create the RDDs but won't
trigger any computation. The computation is launched by actions such as
collect, count, save to HDFS, etc. And even if they were not lazy, no
serialization would happen. Serialization occurs only when data will be
transferred
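A tiny example of the laziness (spark-shell sc assumed):

val rdd1 = sc.parallelize(1 to 1000000)
val rdd2 = rdd1.map(_ * 2)           // no job launched yet
val rdd3 = rdd2.filter(_ % 3 == 0)   // still no job
val result = rdd3.count()            // this action triggers the actual computation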
Hi Guillaume,
Thanks for your explanation. It helps me a lot. I will try it.
Xiaoli
Hi. I found the AmpCamp exercises to be a nice way to get started with Spark.
However, they require Amazon EC2 access. Has anyone put together any VM or
Docker scripts to have the same environment locally to work through those labs?
It'll be really helpful. Thanks.
Hi,
I'm trying to figure out how to join two RDDs with different key types and
appreciate any suggestions.
Say I have two RDDS:
ipToUrl of type (IP, String)
ipRangeToZip of type (IPRange, String)
How can I join/cogroup these two RDDs together to produce a new RDD of type
(IP, (String,
Brad: did you ever manage to figure this out? We're experiencing similar
problems, and have also found that the memory limitations supplied to the
Java side of PySpark don't limit how much memory Python can consume (which
makes sense).
Have you profiled the datasets you are trying to join? Is
Spark can actually launch multiple executors on the same node if you configure
it that way, but if you haven’t done that, this might mean that some tasks are
reading data from the cache, and some from HDFS. (In the HDFS case Spark will
only report it as NODE_LOCAL since HDFS isn’t tied to a
Kryo won’t make a major impact on PySpark because it just stores data as byte[]
objects, which are fast to serialize even with Java. But it may be worth a try
— you would just set spark.serializer and not try to register any classes. What
might make more impact is storing data as
Are your IPRanges all on nice, even CIDR-format ranges? E.g. 192.168.0.0/16 or
10.0.0.0/8?
If the range is always an even subnet mask and not split across subnets,
I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and then
joining the two RDDs. The expansion would be at most 32x if
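A rough sketch of that idea (entirely illustrative: it assumes IPs fit in a Long and an IPRange is modeled as (maskedBaseAddress, prefixLength); the sample records are made up):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

val ipToUrl: RDD[(Long, String)] =
  sc.parallelize(Seq((3232235521L, "http://example.com")))   // 192.168.0.1

val ipRangeToZip: RDD[((Long, Int), String)] =
  sc.parallelize(Seq(((3232235520L, 24), "94105")))          // 192.168.0.0/24

// Expand each IP into every possible enclosing CIDR range (/0 through /32), then join.
val expanded = ipToUrl.flatMap { case (ip, url) =>
  (0 to 32).map { len =>
    val mask = (~0L << (32 - len)) & 0xFFFFFFFFL
    ((ip & mask, len), (ip, url))
  }
}

val ipToUrlAndZip: RDD[(Long, (String, String))] =
  expanded.join(ipRangeToZip).map { case (_, ((ip, url), zip)) => (ip, (url, zip)) }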
Hi Spark users,
I've always done all my Spark work in Scala, but occasionally people ask
about Python and its performance impact vs the same algorithm
implementation in Scala.
Has anyone done tests to measure the difference?
Anecdotally I've heard Python is a 40% slowdown but that's entirely
Hi Joe,
You need to figure out which RDD is used most frequently. In your case, rdd2
and rdd3 are filtered results of rdd1, so usually they are relatively smaller
than rdd1, and it would be more reasonable to cache rdd2 and/or rdd3
if rdd1 is not referenced elsewhere.
Say rdd1 takes 10G and rdd2 takes 1G
Hi, all
In order to understand the memory usage of Spark, I did the following test:
val size = 1024 * 1024
val array = new Array[Int](size)
for (i <- 0 until size) {
  array(i) = i
}
val a = sc.parallelize(array).cache() /* 4MB */
val b = a.mapPartitions { c => {
  val d = c.toArray
  val e = new
Nevermind -- I'm like 90% sure the problem is that I'm importing stuff that
declares a SparkContext as sc. If it's not, I'll report back.
On Mon, Apr 14, 2014 at 2:55 PM, Walrus theCat walrusthe...@gmail.comwrote:
Hi,
Using the spark-shell, I can't use sc.parallelize to get an RDD.
Looks like
It's likely the Ints are getting boxed at some point along the journey
(perhaps starting with parallelize()). I could definitely see boxed Ints
being 7 times larger than primitive ones.
If you wanted to be very careful, you could try making an RDD[Array[Int]],
where each element is simply a
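Something along these lines (a sketch, spark-shell sc assumed):

val size = 1024 * 1024
val chunks = 8
val chunked: org.apache.spark.rdd.RDD[Array[Int]] =
  sc.parallelize(0 until chunks).map { chunk =>
    Array.tabulate(size / chunks)(i => chunk * (size / chunks) + i)   // primitive ints, no boxing inside the array
  }
chunked.cache()
chunked.count()   // materialize the cache so its size shows up in the web UI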
Launching drivers inside the cluster was a feature added in 0.9, for
standalone cluster mode:
http://spark.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster
Note the supervise flag, which will cause the driver to be restarted if
it fails. This is a rather
Cool! It's pretty rare to actually get logs from a wild hardware failure.
The problem is as you said, that the executor keeps failing, but the worker
doesn't get the hint, so it keeps creating new, bad executors.
However, this issue should not have caused your cluster to fail to start
up. In the
I was wondering whether using fewer partitions for RDDs could help Spark performance and
reduce shuffling? Is it true?
Thanks for your help, Davidson!
I modified
val a: RDD[Int] = sc.parallelize(array).cache()
to keep val a an RDD of Int, but it has the same result.
Another question:
the JVM and Spark memory are located in different parts of system memory; the Spark
code is executed in JVM memory, and a malloc operation like val e =
Hi Andrew,
I'm putting together some benchmarks for PySpark vs Scala. I'm focusing on
ML algorithms, as I'm particularly curious about the relative performance of
MLlib in Scala vs the Python MLlib API vs pure Python implementations.
Will share real results as soon as I have them, but roughly,