Re: Huge matrix

2014-04-14 Thread Guillaume Pitel
On 04/12/2014 06:35 PM, Xiaoli Li wrote: Hi Guillaume, This sounds like a good idea to me. I am a newbie here. Could you further explain how you will determine which clusters to keep? According to the distance between each

Re: Master registers itself at startup?

2014-04-14 Thread Gerd Koenig
@YouPeng, @Aaron many thanks for the memory-setting hint. That solved the issue; I just increased it to the default value of 512MB. Thanks, Gerd On 14 April 2014 03:22, YouPeng Yang yypvsxf19870...@gmail.com wrote: Hi The 512MB is the default memory size which each executor needs. and
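
A minimal sketch of the memory setting discussed above, assuming a SparkConf-based setup; the master URL and application name are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Give each executor the 512MB default mentioned in the thread
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")   // placeholder master URL
      .setAppName("MemoryExample")             // placeholder app name
      .set("spark.executor.memory", "512m")
    val sc = new SparkContext(conf)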

Re: Checkpoint Vs Cache

2014-04-14 Thread Mayur Rustagi
For starters, caching may or may not be persisted to disk, but checkpointing will be. Also, cache is generic; checkpointing is specific to streaming. On Apr 14, 2014 7:51 AM, David Thomas dt5434...@gmail.com wrote: What is the difference between checkpointing and caching an RDD?
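
A small sketch contrasting the two on a plain RDD, assuming hypothetical HDFS paths; persist() keeps data recoverable from lineage and may or may not touch disk depending on the storage level, while checkpoint() always writes to stable storage:

    import org.apache.spark.storage.StorageLevel

    // Caching: recomputable from lineage; MEMORY_AND_DISK spills to disk only if needed
    val cached = sc.textFile("hdfs:///data/input").persist(StorageLevel.MEMORY_AND_DISK)

    // Checkpointing: always materialized to the checkpoint directory
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")
    val derived = cached.map(_.length)
    derived.checkpoint()
    derived.count()   // an action triggers both the cache and the checkpoint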

Re: Incredible slow iterative computation

2014-04-14 Thread Andrew Ash
A lot of your time is being spent in garbage collection (second image). Maybe your dataset doesn't easily fit into memory? Can you reduce the number of new objects created in myFun? How big are your heap sizes? Another observation is that in the 4th image some of your RDDs are massive and some

Proper caching method

2014-04-14 Thread Joe L
Hi, I am trying to cache 2 GB of data and to implement the following procedure. In order to cache it I did as follows: Is it necessary to cache rdd2 since rdd1 is already cached? rdd1 = textFile(hdfs...).cache() rdd2 = rdd1.filter(userDefinedFunc1).cache() rdd3 =

process_local vs node_local

2014-04-14 Thread Nathan Kronenfeld
I have a fairly large job (5E9 records, ~1600 partitions) wherein, on a given stage, it looks like for the first half of the tasks everything runs in process_local mode in ~10s/partition. Then, from halfway through, everything starts running in node_local mode and takes 10x as long or more. I

Re: Lost an executor error - Jobs fail

2014-04-14 Thread giive chen
Hi Praveen, What is your config for spark.local.dir? Do all your workers have this dir, and do all workers have the right permission on this dir? I think this is the reason for your error. Wisely Chen On Mon, Apr 14, 2014 at 9:29 PM, Praveen R prav...@sigmoidanalytics.com wrote: Had below error

Re: Lost an executor error - Jobs fail

2014-04-14 Thread Praveen R
Configuration comes from the spark-ec2 setup script, which sets spark.local.dir to use /mnt/spark, /mnt2/spark. The setup actually worked for quite some time, and then on one of the nodes there were some disk errors such as mv: cannot remove `/mnt2/spark/spark-local-20140409182103-c775/09/shuffle_1_248_0': Read-only

Re: Use combineByKey and StatCount

2014-04-14 Thread dachuan
It seems you can imitate RDD.top()'s implementation. For each partition, you get the number of records and the total sum of the keys; in the final result handler, you add all the sums together and all the record counts together, and then you can get the mean (I mean, the arithmetic mean). On Tue, Apr
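
A minimal sketch of the sum-and-count approach described above, using combineByKey; the key/value types and sample data are assumptions:

    import org.apache.spark.SparkContext._

    // Per key: build (sum, count) pairs, merge them across partitions, then divide
    val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))
    val means = pairs.combineByKey(
        (v: Double) => (v, 1L),                                        // create combiner
        (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1L), // merge a value
        (x: (Double, Long), y: (Double, Long)) => (x._1 + y._1, x._2 + y._2)) // merge combiners
      .mapValues { case (sum, count) => sum / count }
    means.collect()   // e.g. Array((a,2.0), (b,2.0))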

Re: cannot exec. job: TaskSchedulerImpl: Initial job has not accepted any resources

2014-04-14 Thread Praveen R
Can you try adding this to your spark-env file and syncing it to all hosts: export MASTER=spark://hadoop-pg-5.cluster:7077 On Sat, Apr 12, 2014 at 6:50 PM, ge ko koenig@gmail.com wrote: Hi, I'm starting to use Spark and have installed Spark within CDH5 using ClouderaManager. I set up one

Re: Measure the Total Network I/O, Cpu and Memory Consumed by Spark Job

2014-04-14 Thread dachuan
I haven't tried it before, but it seems Ganglia could be the tool. On Wed, Apr 2, 2014 at 6:40 PM, yxzhao yxz...@ualr.edu wrote: Hi All, I am interested in measuring the total network I/O, CPU and memory consumed by a Spark job. I tried to find the related information in the logs and Web UI. But

RDD.tail()

2014-04-14 Thread Philip Ogren
Has there been any thought to adding a tail() method to RDD? It would be really handy to skip over the first item in an RDD when it contains header information. Even better would be a drop(int) function that would allow you to skip over several lines of header information. Our attempts to

Re: process_local vs node_local

2014-04-14 Thread dachuan
I am confused about the process local and node local, too. In my current understanding of Spark, one application typically only has one executor in one node. However, node local means your data is in the same host, but in a different executor. This further means node local is the same with

Re: RDD.tail()

2014-04-14 Thread Ethan Jewett
We have similar needs but IIRC, I came to the conclusion that this would only work on ordered RDDs, and then you would still have to figure out which partition is the first one. I ended up deciding it would be best to just drop the header lines from a Scala iterator before creating an RDD based on

RDD collect help

2014-04-14 Thread Flavio Pompermaier
Hi to all, in my application I read objects that are not serializable because I cannot modify the sources. So I tried to do a workaround creating a dummy class that extends the unmodifiable one but implements serializable. All attributes of the parent class are Lists of objects (some of them are

Re: reduceByKey issue in example wordcount (scala)

2014-04-14 Thread Marcelo Vanzin
Hi Ian, When you run your packaged application, are you adding its jar file to the SparkContext (by calling the addJar() method)? That will distribute the code to all the worker nodes. The failure you're seeing seems to indicate the worker nodes do not have access to your code. On Mon, Apr 14,

Re: Proper caching method

2014-04-14 Thread Marcelo Vanzin
Hi Joe, If you cache rdd1 but not rdd2, any time you need rdd2's result, it will have to be computed. It will use rdd1's cached data, but it will have to compute its result again. On Mon, Apr 14, 2014 at 5:32 AM, Joe L selme...@yahoo.com wrote: Hi I am trying to cache 2Gbyte data and to

Re: reduceByKey issue in example wordcount (scala)

2014-04-14 Thread Ian Bonnycastle
Hi Marcelo, thanks for answering. That didn't seem to help. I have the following now: val sc = new SparkContext("spark://masternodeip:7077", "Simple App", "/usr/local/pkg/spark", List("target/scala-2.10/simple-project_2.10-1.0.jar"))

Re: reduceByKey issue in example wordcount (scala)

2014-04-14 Thread Marcelo Vanzin
Hi Ian, On Mon, Apr 14, 2014 at 11:30 AM, Ian Bonnycastle ibo...@gmail.com wrote: val sc = new SparkContext("spark://masternodeip:7077", "Simple App", "/usr/local/pkg/spark", List("target/scala-2.10/simple-project_2.10-1.0.jar")) Hmmm... does

Re: reduceByKey issue in example wordcount (scala)

2014-04-14 Thread Ian Bonnycastle
Hi Marcelo, Changing it to null didn't make any difference at all. /usr/local/pkg/spark is also on all the nodes... it has to be in order to get all the nodes up and running in the cluster. Also, I'm confused by what you mean with That's most probably the class that implements the closure you're

using Kryo with pyspark?

2014-04-14 Thread Diana Carroll
I'm looking at the Tuning Guide suggestion to use Kryo instead of default serialization. My questions: Does pyspark use Java serialization by default, as Scala spark does? If so, then... can I use Kryo with pyspark instead? The instructions say I should register my classes with the Kryo

Re: Spark resilience

2014-04-14 Thread Ian Ferreira
Thanks Aaron. From: Aaron Davidson ilike...@gmail.com Reply-To: user@spark.apache.org Date: Monday, April 14, 2014 at 10:30 AM To: user@spark.apache.org Subject: Re: Spark resilience Master and slave are somewhat overloaded terms in the Spark ecosystem (see the glossary:

Pyspark with Cython

2014-04-14 Thread Ian Ferreira
Has anyone used Cython closures with Spark? We have a large investment in Python code that we don't want to port to Scala. Curious about any performance issues with the interop between the Scala engine and the Cython closures. I believe it is sockets on the driver and pipes on the executors?

Re: RDD.tail()

2014-04-14 Thread Matei Zaharia
You can use mapPartitionsWithIndex and look at the partition index (0 will be the first partition) to decide whether to skip the first line. Matei On Apr 14, 2014, at 8:50 AM, Ethan Jewett esjew...@gmail.com wrote: We have similar needs but IIRC, I came to the conclusion that this would only
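
A sketch of the mapPartitionsWithIndex approach, assuming the header occupies only the first line of partition 0 and a hypothetical input path:

    val lines = sc.textFile("hdfs:///data/with-header.csv")
    // Skip the first line of the first partition only
    val noHeader = lines.mapPartitionsWithIndex { (index, iter) =>
      if (index == 0) iter.drop(1) else iter
    }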

Re: Spark resilience

2014-04-14 Thread Manoj Samel
Could you please elaborate how drivers can be restarted automatically ? Thanks, On Mon, Apr 14, 2014 at 10:30 AM, Aaron Davidson ilike...@gmail.com wrote: Master and slave are somewhat overloaded terms in the Spark ecosystem (see the glossary:

Re: RDD collect help

2014-04-14 Thread Flavio Pompermaier
Thanks Eugen for the reply. Could you explain to me why I have the problem? Why doesn't my serialization work? On Apr 14, 2014 6:40 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote: Hi, as an easy workaround you can enable Kryo serialization http://spark.apache.org/docs/latest/configuration.html

can't sc.paralellize in Spark 0.7.3 spark-shell

2014-04-14 Thread Walrus theCat
Hi, Using the spark-shell, I can't sc.parallelize to get an RDD. Looks like a bug. scala> sc.parallelize(Array("a", "s", "d")) java.lang.NullPointerException at <init>(<console>:17) at <init>(<console>:22) at <init>(<console>:24) at <init>(<console>:26) at <init>(<console>:28) at <init>(<console>:30)

Re: RDD collect help

2014-04-14 Thread Eugen Cepoi
Sure. As you have pointed out, those classes don't implement Serializable, and Spark uses Java serialization by default (when you do collect, the data from the workers will be serialized, collected by the driver, and then deserialized on the driver side). Kryo (as most other decent serialization libs)
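
A sketch of enabling Kryo as suggested, using the spark.serializer setting and a custom KryoRegistrator; the registrator and the wrapped class are placeholders standing in for the unmodifiable classes from the question:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    class SomeThirdPartyClass                      // stand-in for the non-serializable class
    class MyRegistrator extends KryoRegistrator {  // hypothetical registrator
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[SomeThirdPartyClass])
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator")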

Re: RDD collect help

2014-04-14 Thread Flavio Pompermaier
Ok, that's fair enough. But why do things work up to the collect? During map and filter, are objects not serialized? On Apr 15, 2014 12:31 AM, Eugen Cepoi cepoi.eu...@gmail.com wrote: Sure. As you have pointed out, those classes don't implement Serializable and Spark uses Java serialization by default

Re: Strange behaviour of different SSCs with same Kafka topic

2014-04-14 Thread Tathagata Das
Does this happen at low event rate for that topic as well, or only for a high volume rate? TD On Wed, Apr 9, 2014 at 11:24 PM, gaganbm gagan.mis...@gmail.com wrote: I am really at my wits' end here. I have different Streaming contexts, lets say 2, and both listening to same Kafka topics. I

Re: RDD collect help

2014-04-14 Thread Eugen Cepoi
Nope, those operations are lazy, meaning it will create the RDDs but won't trigger any action. The computation is launched by operations such as collect, count, save to HDFS etc. And even if they were not lazy, no serialization would happen. Serialization occurs only when data will be transferred
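
A small sketch of the laziness described above: transformations only record the computation, and nothing runs (or gets serialized and shipped) until an action is invoked:

    val rdd = sc.parallelize(1 to 1000)
    val mapped = rdd.map(_ * 2)               // lazy: nothing computed yet
    val filtered = mapped.filter(_ % 3 == 0)  // still lazy
    val n = filtered.count()                  // action: this triggers the computation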

Re: Huge matrix

2014-04-14 Thread Xiaoli Li
Hi Guillaume, Thanks for your explanation. It helps me a lot. I will try it. Xiaoli

AmpCamp exercise in a local environment

2014-04-14 Thread Nabeel Memon
Hi. I found the AmpCamp exercises to be a nice way to get started with Spark. However they require Amazon EC2 access. Has anyone put together any VM or Docker scripts to set up the same environment locally to work through those labs? It'll be really helpful. Thanks.

How to cogroup/join pair RDDs with different key types?

2014-04-14 Thread Roger Hoover
Hi, I'm trying to figure out how to join two RDDs with different key types and appreciate any suggestions. Say I have two RDDS: ipToUrl of type (IP, String) ipRangeToZip of type (IPRange, String) How can I join/cogroup these two RDDs together to produce a new RDD of type (IP, (String,

Re: trouble with join on large RDDs

2014-04-14 Thread Harry Brundage
Brad: did you ever manage to figure this out? We're experiencing similar problems, and have also found that the memory limitations supplied to the Java side of PySpark don't limit how much memory Python can consume (which makes sense). Have you profiled the datasets you are trying to join? Is

Re: process_local vs node_local

2014-04-14 Thread Matei Zaharia
Spark can actually launch multiple executors on the same node if you configure it that way, but if you haven’t done that, this might mean that some tasks are reading data from the cache, and some from HDFS. (In the HDFS case Spark will only report it as NODE_LOCAL since HDFS isn’t tied to a

Re: using Kryo with pyspark?

2014-04-14 Thread Matei Zaharia
Kryo won’t make a major impact on PySpark because it just stores data as byte[] objects, which are fast to serialize even with Java. But it may be worth a try — you would just set spark.serializer and not try to register any classes. What might make more impact is storing data as

Re: How to cogroup/join pair RDDs with different key types?

2014-04-14 Thread Andrew Ash
Are your IPRanges all on nice, even CIDR-format ranges? E.g. 192.168.0.0/16 or 10.0.0.0/8? If the range is always an even subnet mask and not split across subnets, I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and then joining the two RDDs. The expansion would be at most 32x if
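
A rough sketch of the flatMap-then-join idea, assuming the ranges can be written as CIDR strings; each IP is expanded to one candidate key per possible mask length and joined against the ranges keyed the same way. The helper functions and sample data are illustrative only:

    import org.apache.spark.SparkContext._

    def ipToInt(ip: String): Int =
      ip.split('.').foldLeft(0)((acc, octet) => (acc << 8) | octet.toInt)
    def rangeKey(addr: Int, mask: Int): (Int, Int) =
      (if (mask == 0) 0 else addr & (-1 << (32 - mask)), mask)

    // Stand-ins for the two RDDs from the question
    val ipToUrl = sc.parallelize(Seq(("10.1.2.3", "http://example.com/a")))
    val ipRangeToZip = sc.parallelize(Seq(("10.0.0.0/8", "94103")))

    // Expand each IP to every prefix it could fall in (one key per mask length)
    val urlByRange = ipToUrl.flatMap { case (ip, url) =>
      val addr = ipToInt(ip)
      (0 to 32).map(mask => (rangeKey(addr, mask), (ip, url)))
    }
    val zipByRange = ipRangeToZip.map { case (cidr, zip) =>
      val Array(base, mask) = cidr.split('/')
      (rangeKey(ipToInt(base), mask.toInt), zip)
    }
    val joined = urlByRange.join(zipByRange)
      .map { case (_, ((ip, url), zip)) => (ip, (url, zip)) }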

Scala vs Python performance differences

2014-04-14 Thread Andrew Ash
Hi Spark users, I've always done all my Spark work in Scala, but occasionally people ask about Python and its performance impact vs the same algorithm implementation in Scala. Has anyone done tests to measure the difference? Anecdotally I've heard Python is a 40% slowdown but that's entirely

Re: Proper caching method

2014-04-14 Thread Cheng Lian
Hi Joe, You need to make sure which RDD is used most frequently. In your case, rdd2 and rdd3 are filtered results of rdd1, so usually they are relatively smaller than rdd1, and it would be more reasonable to cache rdd2 and/or rdd3 if rdd1 is not referenced elsewhere. Say rdd1 takes 10G, rdd2 takes 1G
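
A sketch of the pattern described above, assuming rdd1 is used only as a parent (the path and filter predicates are placeholders): cache the smaller, reused results rather than the large source:

    val rdd1 = sc.textFile("hdfs:///big/input")            // large, only a parent
    val rdd2 = rdd1.filter(_.contains("foo")).cache()      // small, reused below
    val rdd3 = rdd1.filter(_.contains("bar")).cache()      // small, reused below
    rdd2.count()
    rdd2.count()   // second action hits the cache instead of re-reading rdd1
    rdd3.count()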

storage.MemoryStore estimated size 7 times larger than real

2014-04-14 Thread wxhsdp
Hi all, in order to understand the memory usage of Spark, I did the following test: val size = 1024*1024 val array = new Array[Int](size) for(i <- 0 until size) { array(i) = i } val a = sc.parallelize(array).cache() /*4MB*/ val b = a.mapPartitions{ c => { val d = c.toArray val e = new

Re: can't sc.paralellize in Spark 0.7.3 spark-shell

2014-04-14 Thread Walrus theCat
Nevermind -- I'm like 90% sure the problem is that I'm importing stuff that declares a SparkContext as sc. If it's not, I'll report back. On Mon, Apr 14, 2014 at 2:55 PM, Walrus theCat walrusthe...@gmail.comwrote: Hi, Using the spark-shell, I can't sc.parallelize to get an RDD. Looks like

Re: storage.MemoryStore estimated size 7 times larger than real

2014-04-14 Thread Aaron Davidson
It's likely the Ints are getting boxed at some point along the journey (perhaps starting with parallelize()). I could definitely see boxed Ints being 7 times larger than primitive ones. If you wanted to be very careful, you could try making an RDD[Array[Int]], where each element is simply a
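
A sketch of the RDD[Array[Int]] idea, keeping the ints in primitive arrays so only a handful of array objects are tracked instead of a million boxed Ints; the sizes and partition count are illustrative:

    val size = 1024 * 1024
    val array = Array.tabulate(size)(i => i)
    // Four chunks, each a primitive Int array, instead of 1M boxed elements
    val chunks = sc.parallelize(array.grouped(size / 4).toSeq, 4).cache()
    chunks.count()   // materialize the cache, then compare the estimated size in the UI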

Re: Spark resilience

2014-04-14 Thread Aaron Davidson
Launching drivers inside the cluster was a feature added in 0.9, for standalone cluster mode: http://spark.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster Note the supervise flag, which will cause the driver to be restarted if it fails. This is a rather

Re: Lost an executor error - Jobs fail

2014-04-14 Thread Aaron Davidson
Cool! It's pretty rare to actually get logs from a wild hardware failure. The problem is as you said, that the executor keeps failing, but the worker doesn't get the hint, so it keeps creating new, bad executors. However, this issue should not have caused your cluster to fail to start up. In the

shuffle vs performance

2014-04-14 Thread Joe L
I was wondering whether fewer partitions in an RDD could help Spark performance and reduce shuffling? Is that true? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/shuffle-vs-performance-tp4255.html Sent from the Apache Spark User List mailing list archive at

Re: storage.MemoryStore estimated size 7 times larger than real

2014-04-14 Thread wxhsdp
Thanks for your help, Davidson! I modified it to val a: RDD[Int] = sc.parallelize(array).cache() to keep val a an RDD of Int, but I get the same result. Another question: JVM and Spark memory are located in different parts of system memory, and the Spark code is executed in JVM memory; malloc operations like val e =

Re: Scala vs Python performance differences

2014-04-14 Thread Jeremy Freeman
Hi Andrew, I'm putting together some benchmarks for PySpark vs Scala. I'm focusing on ML algorithms, as I'm particularly curious about the relative performance of MLlib in Scala vs the Python MLlib API vs pure Python implementations. Will share real results as soon as I have them, but roughly,