Re: Spark resilience

2014-04-15 Thread Aaron Davidson
1. Spark prefers to run tasks where the data is, but it is able to move cached data between executors if no cores are available where the data is initially cached (which is often much faster than recomputing the data from scratch). The result is that data is automatically spread out across the cluster…

Re: Why these operations are slower than the equivalent on Hadoop?

2014-04-15 Thread Eugen Cepoi
Yes, the second example does that. It transforms all the points of a partition into a single element, the skyline, so reduce will run on the skylines of two partitions and not on single points. On 16 Apr 2014, 06:47, "Yanzhe Chen" wrote: > Eugen, > > Thanks for your tip and I do want to merge…

Re: Proper caching method

2014-04-15 Thread Arpit Tak
Hi Cheng, Is it possible to delete or replicate an RDD? > rdd1 = textFile("hdfs...").cache() > > rdd2 = rdd1.filter(userDefinedFunc1).cache() > rdd3 = rdd1.filter(userDefinedFunc2).cache() Let me reframe the above question: if rdd1 is around 50G and after filtering it comes to around, say, 4G, then to incre…

Could I improve Spark performance partitioning elements in a RDD?

2014-04-15 Thread Joe L
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Could-I-improve-Spark-performance-partitioning-elements-in-a-RDD-tp4320.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark resilience

2014-04-15 Thread Arpit Tak
1. If we add more executors to the cluster and the data is already cached in the system (the RDDs are already there), will the job run on the new executors even though the RDDs are not present on them? If yes, how is the performance on the new executors? 2. What is the replication factor in…

groupByKey(None) returns partitions according to the keys?

2014-04-15 Thread Joe L
I was wondering if groupByKey returns 2 partitions in the below example? >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)]) >>> sorted(x.groupByKey().collect()) [('a', [1, 1]), ('b', [1])] -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/groupByKey-None-ret…
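How keys map to partitions can be sketched in plain Python. This mimics hash partitioning rather than PySpark's internals; the point is that the partition count comes from numPartitions (or the default parallelism), not from the number of distinct keys:

```python
# Sketch only: simulate hash partitioning of key-value pairs.
def hash_partitioner(key, num_partitions):
    return hash(key) % num_partitions

pairs = [("a", 1), ("b", 1), ("a", 1)]
num_partitions = 2
partitions = [[] for _ in range(num_partitions)]
for key, value in pairs:
    partitions[hash_partitioner(key, num_partitions)].append((key, value))

# All pairs with the same key land in the same partition, but the
# partition count stays num_partitions no matter how many keys exist.
```

So with 2 keys and numPartitions = 2 you may happen to get one key per partition, but that is a property of the hash, not a guarantee.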

what is the difference between element and partition?

2014-04-15 Thread Joe L
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/what-is-the-difference-between-element-and-partition-tp4317.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Why these operations are slower than the equivalent on Hadoop?

2014-04-15 Thread Yanzhe Chen
Eugen, Thanks for your tip and I do want to merge the result of a partition with another one, but I am still not quite clear how to do it. Say the original data RDD has 32 partitions; since mapPartitions won't change the number of partitions, it will remain at 32 partitions, each of which contains…

RE: JMX with Spark

2014-04-15 Thread Shao, Saisai
Hi Paul, could you please paste your metrics.conf so that we can find the problem, if you are still having issues. Thanks Jerry From: Parviz Deyhim [mailto:pdey...@gmail.com] Sent: Wednesday, April 16, 2014 9:10 AM To: user@spark.apache.org Subject: Re: JMX with Spark home directory or $home/co…

Re: StackOverflow Error when run ALS with 100 iterations

2014-04-15 Thread Xiaoli Li
Thanks a lot for your information. It really helps me. On Tue, Apr 15, 2014 at 7:57 PM, Cheng Lian wrote: > Probably this JIRA issue solves your > problem. When running with a large iteration number, the lineage > DAG of ALS becomes very d…

Re: partitioning of small data sets

2014-04-15 Thread YouPeng Yang
Hi, actually you can set the partition number yourself by changing the 'spark.default.parallelism' property. Otherwise, Spark will use the default, defaultParallelism. For local mode, defaultParallelism = totalCores. For local-cluster mode, defaultParallelism = math.max(totalCores…
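The rules quoted above can be sketched as a small function. This is illustrative only; the original message is truncated at math.max(totalCores, so the floor of 2 below is an assumption, not a quote from Spark's source:

```python
def default_parallelism(mode: str, total_cores: int) -> int:
    """Sketch of the default-parallelism rules described in the thread."""
    if mode == "local":
        return total_cores
    if mode == "local-cluster":
        # The original message is cut off at math.max(totalCores, ...);
        # a floor of 2 is assumed here for illustration.
        return max(total_cores, 2)
    raise ValueError(f"unknown mode: {mode}")
```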

Re: java.net.SocketException: Network is unreachable while connecting to HBase

2014-04-15 Thread amit
In the worker logs i can see, 14/04/16 01:02:47 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@xx:10548] -> [akka.tcp://sparkExecutor@xx:16041]: Error [Association failed with [akka.tcp://sparkExecutor@xx:16041]] [ akka.remote.EndpointAssociationException: Association f

Re: JMX with Spark

2014-04-15 Thread Parviz Deyhim
home directory or $home/conf directory? works for me with metrics.properties hosted under conf dir. On Tue, Apr 15, 2014 at 6:08 PM, Paul Schooss wrote: > Has anyone got this working? I have enabled the properties for it in the > metrics.conf file and ensure that it is placed under spark's home

JMX with Spark

2014-04-15 Thread Paul Schooss
Has anyone got this working? I have enabled the properties for it in the metrics.conf file and ensure that it is placed under spark's home directory. Any ideas why I don't see spark beans ?

Re: StackOverflow Error when run ALS with 100 iterations

2014-04-15 Thread Cheng Lian
Probably this JIRA issue solves your problem. When running with a large iteration number, the lineage DAG of ALS becomes very deep, and both DAGScheduler and the Java serializer may overflow because they are implemented in a recursive way. You may resort…

Re: groupByKey returns a single partition in a RDD?

2014-04-15 Thread wxhsdp
groupByKey has a numPartitions parameter you can set to determine the partition count; if not set, the generated RDD has the same number of partitions as the previous one. Joe L wrote > I want to apply the following transformations to 60Gbyte data on 7 nodes > with 10Gbyte memory. And I am wondering i…

Re: storage.MemoryStore estimated size 7 times larger than real

2014-04-15 Thread wxhsdp
thank you so much, Davidson. yes, you are right: in both sbt and spark-shell, the result of my code is 28MB; it's irrelevant to numSlices. yesterday I got a result of 4.2MB in spark-shell because I removed the array initialization out of laziness :) for (i <- 0 until size) { array(i) = i } --

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-04-15 Thread giive chen
Hi Prasad Sorry for missing your reply. https://gist.github.com/thegiive/10791823 Here it is. Wisely Chen On Fri, Apr 4, 2014 at 11:57 PM, Prasad wrote: > Hi Wisely, > Could you please post your pom.xml here. > > Thanks > > > > -- > View this message in context: > http://apache-spark-user-list

RE: Multi-tenant?

2014-04-15 Thread Ian Ferreira
Thanks Matei! Sent from my Windows Phone From: Matei Zaharia Sent: 4/15/2014 7:14 PM To: user@spark.apache.org Subject: Re: Multi-tenant? Yes, both things can happen. Take a look at http://spark.apa…

Re: Multi-tenant?

2014-04-15 Thread Matei Zaharia
Yes, both things can happen. Take a look at http://spark.apache.org/docs/latest/job-scheduling.html, which includes scheduling concurrent jobs within the same driver. Matei On Apr 15, 2014, at 4:08 PM, Ian Ferreira wrote: > What is the support for multi-tenancy in Spark. > > I assume more th

Re: Can't run a simple spark application with 0.9.1

2014-04-15 Thread Paul Schooss
I am a dork please disregard this issue. I did not have the slaves correctly configured. This error is very misleading On Tue, Apr 15, 2014 at 11:21 AM, Paul Schooss wrote: > Hello, > > Currently I deployed 0.9.1 spark using a new way of starting up spark > > exec start-stop-daemon --start -

java.net.SocketException: Network is unreachable while connecting to HBase

2014-04-15 Thread amit karmakar
I am getting a java.net.SocketException: Network is unreachable whenever i do a count on one of my tables. If i just do a take(1), i see the task status as killed on the master UI but i get back the results. My driver runs on my local system which is accessible over the public internet and connects

Multi-tenant?

2014-04-15 Thread Ian Ferreira
What is the support for multi-tenancy in Spark? I assume more than one driver can share the same cluster, but can a driver run two jobs in parallel?

Re: How to stop system info output in spark shell

2014-04-15 Thread Nicholas Chammas
Thanks for the tip. Been wondering myself how one would do that. On Tue, Apr 15, 2014 at 5:04 AM, Wei Da wrote: > The solution: > Edit /opt/spark-0.9.0-incubating-bin-hadoop2/conf/log4j.properties, > changing > Spark's output to WARN. Done! > > Refer to: > > https://github.com/amplab-extras/Spa

Re: partitioning of small data sets

2014-04-15 Thread Nicholas Chammas
Looking at the Python version of textFile(), shouldn't it be "max(self.defaultParallelism, 2)"? If the default parallelism is, say, 4, wouldn't we want to use that for minSplits instead of 2? On Tu…
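The difference the question turns on is easy to see with plain Python:

```python
default_parallelism = 4

# min keeps small files at the floor of 2 splits regardless of cluster size
min_splits_current = min(default_parallelism, 2)   # stays at 2

# max would instead scale minSplits up with the default parallelism
min_splits_proposed = max(default_parallelism, 2)  # becomes 4
```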

Re: Scala vs Python performance differences

2014-04-15 Thread Nicholas Chammas
I'd also be interested in seeing such a benchmark. On Tue, Apr 15, 2014 at 9:25 AM, Ian Ferreira wrote: > This would be super useful. Thanks. > > On 4/15/14, 1:30 AM, "Jeremy Freeman" wrote: > > >Hi Andrew, > > > >I'm putting together some benchmarks for PySpark vs Scala. I'm focusing on > >ML

StackOverflow Error when run ALS with 100 iterations

2014-04-15 Thread Xiaoli Li
Hi, I am testing ALS using 7 nodes. Each node has 4 cores and 8G memory. The ALS program cannot run even with a very small training set (about 91 lines) due to a StackOverflow error when I set the number of iterations to 100. I think the problem may be caused by the updateFeatures method, which up…

Problem with KryoSerializer

2014-04-15 Thread yh18190
Hi, I have a problem when I want to use the Spark KryoSerializer, extending the KryoRegistrator class to register custom classes in order to create objects. I am getting the following exception when I run the following program. Please let me know what could be the problem… ] (run-main) org.apache.spark.SparkEx…

Shark: class java.io.IOException: Cannot run program "/bin/java"

2014-04-15 Thread ge ko
Hi, after starting the shark-shell via /opt/shark/shark-0.9.1/bin/shark-withinfo -skipRddReload I receive lots of output, including the exception that /bin/java cannot be executed. But it is linked to /usr/bin/java ?!?! root#>ls -al /bin/java lrwxrwxrwx 1 root root 13 15. Apr 21:45 /bin/java

Re: can't sc.paralellize in Spark 0.7.3 spark-shell

2014-04-15 Thread Walrus theCat
Thank you! On Tue, Apr 15, 2014 at 11:29 AM, Aaron Davidson wrote: > This is probably related to the Scala bug that :cp does not work: > https://issues.scala-lang.org/browse/SI-6502 > > > On Tue, Apr 15, 2014 at 11:21 AM, Walrus theCat wrote: > >> Actually altering the classpath in the REPL c…

Re: can't sc.paralellize in Spark 0.7.3 spark-shell

2014-04-15 Thread Aaron Davidson
This is probably related to the Scala bug that :cp does not work: https://issues.scala-lang.org/browse/SI-6502 On Tue, Apr 15, 2014 at 11:21 AM, Walrus theCat wrote: > Actually altering the classpath in the REPL causes the provided > SparkContext to disappear: > > scala> sc.parallelize(List(1,2,

Re: Why these operations are slower than the equivalent on Hadoop?

2014-04-15 Thread Eugen Cepoi
It depends on your algorithm but I guess that you probably should use reduce (the code probably doesn't compile but it shows you the idea). val result = data.reduce { case (left, right) => skyline(left ++ right) } Or in the case you want to merge the result of a partition with another one you c
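Eugen's idea can be simulated without Spark, using plain lists as stand-in partitions and a toy 2-D skyline (dominance = smaller in both coordinates; the real skyline function may differ):

```python
from functools import reduce

def skyline(points):
    """Toy 2-D skyline: keep points not dominated by any other point
    (minimizing both coordinates)."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)]

# Two "partitions" of points, each already reduced to its local skyline,
# as mapPartitions would produce.
part1 = skyline([(1, 5), (2, 4), (3, 6)])
part2 = skyline([(2, 3), (4, 2), (5, 5)])

# reduce merges pairs of partial skylines, never raw points.
result = reduce(lambda left, right: skyline(left + right), [part1, part2])
```

The key point of the thread: because each partition is first collapsed to its local skyline, reduce only ever compares small partial results, not every raw point.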

Can't run a simple spark application with 0.9.1

2014-04-15 Thread Paul Schooss
Hello, Currently I deployed 0.9.1 spark using a new way of starting up spark exec start-stop-daemon --start --pidfile /var/run/spark.pid --make-pidfile --chuid ${SPARK_USER}:${SPARK_GROUP} --chdir ${SPARK_HOME} --exec /usr/bin/java -- -cp ${CLASSPATH} -Dcom.sun.management.jmxremote.authentica

Re: can't sc.paralellize in Spark 0.7.3 spark-shell

2014-04-15 Thread Walrus theCat
Actually altering the classpath in the REPL causes the provided SparkContext to disappear: scala> sc.parallelize(List(1,2,3)) res0: spark.RDD[Int] = ParallelCollectionRDD[0] at parallelize at :13 scala> :cp /root Added '/root'. Your new classpath is: ":/root/jars/aspectjrt.jar:/root/jars/aspectj

Re: How to cogroup/join pair RDDs with different key types?

2014-04-15 Thread Roger Hoover
I'm thinking of creating a union type for the key so that IPRange and IP types can be joined. On Tue, Apr 15, 2014 at 10:44 AM, Roger Hoover wrote: > Andrew, > > Thank you very much for your feedback. Unfortunately, the ranges are not > of predictable size but you gave me an idea of how to hand

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-04-15 Thread anant
I've received the same error with Spark built using Maven. It turns out that mesos-0.13.0 depends on protobuf-2.4.1 which is causing the clash at runtime. Protobuf included by Akka is shaded and doesn't cause any problems. The solution is to update the mesos dependency to 0.18.0 in spark's pom.xml
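The fix described amounts to a one-line version bump in Spark's pom.xml. The snippet below is a sketch of what that dependency entry would look like (the Maven coordinates are the usual Mesos artifact, assumed rather than quoted from the post):

```xml
<!-- In Spark's pom.xml: bump Mesos so it no longer drags in protobuf 2.4.1 -->
<dependency>
  <groupId>org.apache.mesos</groupId>
  <artifactId>mesos</artifactId>
  <version>0.18.0</version>
</dependency>
```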

Re: How to cogroup/join pair RDDs with different key types?

2014-04-15 Thread Roger Hoover
Andrew, Thank you very much for your feedback. Unfortunately, the ranges are not of predictable size but you gave me an idea of how to handle it. Here's what I'm thinking: 1. Choose a number of partitions, n, over the IP space 2. Preprocess the IPRanges, splitting any of them that cross partition boundaries…
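Step 2 of Roger's plan, splitting ranges at partition boundaries, might look like this if IPs are treated as plain integers and partitions are equal-width slices of the IP space (both assumptions for illustration):

```python
IP_SPACE = 2 ** 32  # treat IPv4 addresses as plain integers

def split_range(start, end, num_partitions):
    """Split the inclusive range [start, end] so that no piece crosses
    a partition boundary of the equally-sliced IP space."""
    width = IP_SPACE // num_partitions
    pieces = []
    while start <= end:
        # last address of the partition slice that `start` falls in
        boundary = (start // width + 1) * width - 1
        pieces.append((start, min(end, boundary)))
        start = boundary + 1
    return pieces
```

After this preprocessing, every (sub)range lives entirely inside one partition, so range keys and IP keys can share the same partitioner.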

Re: Why these operations are slower than the equivalent on Hadoop?

2014-04-15 Thread Cheng Lian
Your Spark solution first reduces partial results into a single partition, computes the final result, and then collects it to the driver side. This involves a shuffle and two waves of network traffic. Instead, you can directly collect the partial results to the driver and then compute the final result o…
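Cheng's pattern — reduce each partition to a small partial result, collect, and finish on the driver — can be simulated without Spark (plain lists stand in for partitions, and a per-partition max stands in for the real combine step):

```python
partitions = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]

# Each "executor" reduces its partition to one small partial result
# (a per-partition max here; the skyline of a partition in the thread).
partials = [max(part) for part in partitions]

# Driver side: combine the collected partials locally -- no shuffle needed,
# only one wave of network traffic (the collect).
final = max(partials)
```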

scheduler question

2014-04-15 Thread Mohit Jaggi
Hi Folks, I have some questions about how Spark scheduler works: - How does Spark know how many resources a job might need? - How does it fairly share resources between multiple jobs? - Does it "know" about data and partition sizes and use that information for scheduling? Mohit.

Re: partitioning of small data sets

2014-04-15 Thread Matei Zaharia
Yup, one reason it’s 2 actually is to give people a similar experience to working with large files, in case their code doesn’t deal well with the file being partitioned. Matei On Apr 15, 2014, at 9:53 AM, Aaron Davidson wrote: > Take a look at the minSplits argument for SparkContext#textFile

Streaming job having Cassandra query : OutOfMemoryError

2014-04-15 Thread sonyjv
Hi All, I am desperately looking for some help. My cluster is 6 nodes having dual core and 8GB ram each. Spark version running on the cluster is spark-0.9.0-incubating-bin-cdh4. I am getting OutOfMemoryError when running a Spark Streaming job (non-streaming version works fine) which queries Cass

Why these operations are slower than the equivalent on Hadoop?

2014-04-15 Thread Yanzhe Chen
Hi all, As in a previous thread, I am asking how to implement a divide-and-conquer algorithm (skyline) in Spark. Here is my current solution: val data = sc.textFile(…).map(line => line.split(",").map(_.toDouble)) val result = data.mapPartitions(points => skyline(points.toArray).iterator).coales…

Re: partitioning of small data sets

2014-04-15 Thread Aaron Davidson
Take a look at the minSplits argument for SparkContext#textFile [1] -- the default value is 2. You can simply set this to 1 if you'd prefer not to split your data. [1] http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext On Tue, Apr 15, 2014 at 8:44 AM, Diana Car

Re: storage.MemoryStore estimated size 7 times larger than real

2014-04-15 Thread Aaron Davidson
Ah, I think I can see where your issue may be coming from. In spark-shell, the MASTER is "local[*]", which just means it uses a pre-set number of cores. This distinction only matters because the default number of slices created from sc.parallelize() is based on the number of cores. So when you run

Re: Spark resilience

2014-04-15 Thread Manoj Samel
Thanks Aaron, this is useful ! - Manoj On Mon, Apr 14, 2014 at 8:12 PM, Aaron Davidson wrote: > Launching drivers inside the cluster was a feature added in 0.9, for > standalone cluster mode: > http://spark.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster

partitioning of small data sets

2014-04-15 Thread Diana Carroll
I loaded a very tiny file into Spark -- 23 lines of text, 2.6kb. Given the size, and that it is a single file, I assumed it would only be in a single partition. But when I cache it, I can see in the Spark App UI that it actually splits it into two partitions. Is this cor…

Re: Scala vs Python performance differences

2014-04-15 Thread Ian Ferreira
This would be super useful. Thanks. On 4/15/14, 1:30 AM, "Jeremy Freeman" wrote: >Hi Andrew, > >I'm putting together some benchmarks for PySpark vs Scala. I'm focusing on >ML algorithms, as I'm particularly curious about the relative performance >of >MLlib in Scala vs the Python MLlib API vs pur

Re: standalone vs YARN

2014-04-15 Thread Surendranauth Hiraman
Prashant, In another email thread several weeks ago, it was mentioned that YARN support is considered beta until Spark 1.0. Is that not the case? -Suren On Tue, Apr 15, 2014 at 8:38 AM, Prashant Sharma wrote: > Hi Ishaaq, > > answers inline from what I know, I had like to be corrected though.

Re: standalone vs YARN

2014-04-15 Thread Prashant Sharma
Hi Ishaaq, answers inline from what I know; I'd like to be corrected though. On Tue, Apr 15, 2014 at 5:58 PM, ishaaq wrote: > Hi all, > I am evaluating Spark to use here at my work. > > We have an existing Hadoop 1.x install which I am planning to upgrade to > Hadoop > 2.3. > > This is not reall…

standalone vs YARN

2014-04-15 Thread ishaaq
Hi all, I am evaluating Spark for use here at my work. We have an existing Hadoop 1.x install which I am planning to upgrade to Hadoop 2.3. I am trying to work out whether I should install YARN or simply set up a Spark standalone cluster. We already use ZooKeeper so it isn't a problem to set up HA…

Re: RDD collect help

2014-04-15 Thread Flavio Pompermaier
Ok thanks for the help! Best, Flavio On Tue, Apr 15, 2014 at 12:43 AM, Eugen Cepoi wrote: > Nope, those operations are lazy, meaning it will create the RDDs but won't > trigger any "action". The computation is launched by operations such as > collect, count, save to HDFS etc. And even if they

Re: storage.MemoryStore estimated size 7 times larger than real

2014-04-15 Thread wxhsdp
sorry, davidson, i don't get the point. what's the essential difference between our codes? /*my code*/ val array = new Array[Int](size) val a = sc.parallelize(array).cache() /*4MB*/ /*your code*/ val numSlices = 8 val arr = Array.fill[Array[Int]](numSlices) { new Array[Int](size / numSlices) } v…

Spark program thows OutOfMemoryError

2014-04-15 Thread Qin Wei
Hi, all My spark program always gives me the error "java.lang.OutOfMemoryError: Java heap space" in my standalone cluster, here is my code: object SimCalcuTotal { def main(args: Array[String]) { val sc = new SparkContext("spark://192.168.2.184:7077", "Sim Calcu Total", "/usr/local/spark

Re: How to stop system info output in spark shell

2014-04-15 Thread Wei Da
The solution: Edit /opt/spark-0.9.0-incubating-bin-hadoop2/conf/log4j.properties, changing Spark's output to WARN. Done! Refer to: https://github.com/amplab-extras/SparkR-pkg/blob/master/pkg/src/src/main/resources/log4j.properties#L8 eduardocalfaia wrote > Have you already tried in conf/log4j.
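The change amounts to one line in conf/log4j.properties (the exact category name can vary across Spark versions; this follows the linked example):

```properties
# conf/log4j.properties -- quiet Spark's console output down to warnings
log4j.rootCategory=WARN, console
```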

Re: Comparing GraphX and GraphLab

2014-04-15 Thread Qi Song
Hi Debasish, I found that PageRank on LiveJournal cost less than 100 seconds for GraphX on your EC2 setup. But when I ran the example (LiveJournalPageRank) you provided on my machines with the same LiveJournal dataset, it took more than 10 minutes. Following are some details: Environment: 8 machines, each with 2*I…

groupByKey returns a single partition in a RDD?

2014-04-15 Thread Joe L
I want to apply the following transformations to 60Gbyte data on 7 nodes with 10Gbyte memory. And I am wondering if the groupByKey() function returns an RDD with a single partition for each key? If so, what will happen if the size of the partition doesn't fit into that particular node? rdd = sc.textFil…

Re: storage.MemoryStore estimated size 7 times larger than real

2014-04-15 Thread Aaron Davidson
Hey, I was talking about something more like: val size = 1024 * 1024 val numSlices = 8 val arr = Array.fill[Array[Int]](numSlices) { new Array[Int](size / numSlices) } val rdd = sc.parallelize(arr, numSlices).cache() val size2 = rdd.map(_.length).sum() assert(size2 == size)
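The several-fold gap discussed in this thread comes from per-object overhead when many small boxed objects are stored instead of one primitive array. A rough plain-Python analogue (not Spark's SizeEstimator):

```python
import sys
from array import array

n = 100_000
boxed = list(range(n))         # one full object per element, plus a pointer
packed = array("i", range(n))  # contiguous 4-byte ints

# Count the container plus every element object for the boxed version;
# the packed array stores its elements inline, so one getsizeof suffices.
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
packed_bytes = sys.getsizeof(packed)

# boxed_bytes comes out several times larger than packed_bytes, the same
# shape of effect as MemoryStore's estimate vs. the raw data size.
```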