Re: Update on Pig on Spark initiative
This is really exciting! Thanks so much for this work, I think you've guaranteed Pig's continued vitality. On Wednesday, August 27, 2014, Matei Zaharia matei.zaha...@gmail.com wrote: Awesome to hear this, Mayur! Thanks for putting this together. Matei On August 27, 2014 at 10:04:12 PM, Mayur Rustagi (mayur.rust...@gmail.com) wrote: Hi, We have migrated Pig functionality on top of Spark, passing 100% of the e2e success cases in the Pig test suite. That means UDFs, joins, and other functionality are working quite nicely. We are in the process of merging with Apache Pig trunk (something that should happen over the next 2 weeks). Meanwhile, if you are interested in giving it a go, you can try it at https://github.com/sigmoidanalytics/spork This contains all the major changes but may not have all the patches required for 100% e2e; if you are trying it out, let me know about any issues you face. A whole bunch of folks contributed to this: Julien Le Dem (Twitter), Praveen R (Sigmoid Analytics), Akhil Das (Sigmoid Analytics), Bill Graham (Twitter), Dmitriy Ryaboy (Twitter), Kamal Banga (Sigmoid Analytics), Anish Haldiya (Sigmoid Analytics), Aniket Mokashi (Google), Greg Owen (DataBricks), Amit Kumar Behera (Sigmoid Analytics), Mahesh Kalakoti (Sigmoid Analytics), not to mention the Spark and Pig communities. Regards Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
Re: Submitting multiple files pyspark
Hi Cheng, You specify extra Python files through --py-files. For example: bin/spark-submit [your other options] --py-files helper.py main_app.py -Andrew 2014-08-27 22:58 GMT-07:00 Chengi Liu chengi.liu...@gmail.com: Hi, I have two files: main_app.py and helper.py. main_app.py calls some functions in helper.py. I want to use spark-submit to submit a job, but how do I specify helper.py? Basically, how do I specify multiple files in Spark? Thanks
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
Hi, I'm using a cluster with 5 nodes that each use 8 cores and 10GB of RAM. Basically I'm creating a dictionary from text, i.e. giving each word that occurs more than n times in all texts a unique identifier. The essential part of the code looks like this:

var texts = ctx.sql("SELECT text FROM table LIMIT 1500").map(_.head.toString).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
var dict2 = texts.flatMap(_.split(" ").map(_.toLowerCase())).repartition(80)
dict2 = dict2.filter(s => s.startsWith("http") == false)
dict2 = dict2.filter(s => s.startsWith("@") == false)
dict2 = dict2.map(removePunctuation(_)) // removes .,?!:; in strings (single words)
dict2 = dict2.groupBy(identity).filter(_._2.size > 10).keys // only keep entries that occur more than n times
var dict3 = dict2.zipWithIndex
var dictM = dict3.collect.toMap
var count = dictM.size

If I use only 10M texts, it works. With 15M texts as above I get the following error. It occurs after the dictM.size operation, but due to laziness there isn't any computing happening before that.

14/08/27 22:36:29 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
14/08/27 22:36:29 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 2028, idp11.foo.bar, PROCESS_LOCAL, 921 bytes)
14/08/27 22:36:29 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on idp11.foo.bar:36295 (size: 9.4 KB, free: 10.4 GB)
14/08/27 22:36:30 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 2 to sp...@idp11.foo.bar:33925
14/08/27 22:36:30 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 1263 bytes
14/08/27 22:37:06 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 2028, idp11.foo.bar): java.lang.OutOfMemoryError: Requested array size exceeds VM limit
java.util.Arrays.copyOf(Arrays.java:3230)
java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
...

I'm fine with spilling to disk if my program runs out of memory, but is there anything to prevent this error without changing the Java memory settings? (Assume those are at the physical maximum.) Kind regards, Simon
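A side note on the last two lines above, offered only as a sketch and under an assumption the thread does not confirm (namely that the collected map itself is not needed downstream): if only the size of the dictionary is wanted on the driver, it can be computed on the cluster without collecting the whole map.

    var dict3 = dict2.zipWithIndex   // RDD[(String, Long)], stays distributed on the executors
    val count = dict3.count()        // only a single Long is returned to the driver
    // dict3 can later be joined against other RDDs instead of being collected into a driver-side Map.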
Using unshaded akka in Spark driver
I am building (yet another) job server for Spark using the Play! framework, and it seems like Play's akka dependency conflicts with Spark's shaded akka dependency. Using SBT, I can force Play to use akka 2.2.3 (unshaded), but I haven't been able to figure out how to exclude the com.typesafe.akka dependencies altogether and introduce the shaded akka dependencies instead. I was wondering what can potentially go wrong if I use the unshaded version in the job server? The job server is responsible for creating the SparkContext/StreamingContext that connects to the Spark cluster (using the spark.master config), and these contexts are then provided to jobs that create RDDs/DStreams and perform computations on them. I am guessing that in this case the job server would act as a Spark driver.
Re: Visualizing stage task dependency graph
I'm working on this patch to visualize stages: https://github.com/apache/spark/pull/2077 Phuoc Do On Mon, Aug 4, 2014 at 10:12 PM, Zongheng Yang zonghen...@gmail.com wrote: I agree that this is definitely useful. One related project I know of is Sparkling [1] (also see the talk at Spark Summit 2014 [2]), but it'd be great (and I imagine somewhat challenging) to visualize the *physical execution* graph of a Spark job. [1] http://pr01.uml.edu/ [2] http://spark-summit.org/2014/talk/sparkling-identification-of-task-skew-and-speculative-partition-of-data-for-spark-applications On Mon, Aug 4, 2014 at 8:55 PM, rpandya r...@iecommerce.com wrote: Is there a way to visualize the task dependency graph of an application, during or after its execution? The list of stages on port 4040 is useful, but still quite limited. For example, I've found that if I don't cache() the result of one expensive computation, it will get repeated 4 times, but it is not easy to trace through exactly why. Ideally, what I would like for each stage is: - the individual tasks and their dependencies - the various RDD operators that have been applied - the full stack trace for the stage barrier, the task, and the lambdas used (often the RDDs are manipulated inside layers of code, so the immediate file/line# is not enough) Any suggestions? Thanks, Ravi -- Phuoc Do https://vida.io/dnprock
Key-Value Operations
Hi, I have an RDD of key-value pairs. Now I want to find the key whose value has the largest number of elements. How should I do that? Basically I want to select the key for which the number of items in its value is the largest. Thank You
Re: Spark Streaming: DStream - zipWithIndex
If you just want an arbitrary unique id attached to each record in a DStream (no ordering etc.), then why not generate and attach a UUID to each record? On Wed, Aug 27, 2014 at 4:18 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: I see an issue here. If rdd.id is 1000 then rdd.id * 1e9.toLong would be BIG. I wish there was a DStream mapPartitionsWithIndex. On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote: You can use the RDD id as the seed, which is unique in the same Spark context. Suppose none of the RDDs would contain more than 1 billion records. Then you can use rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid) Just a hack .. On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: So, I guess zipWithUniqueId will be similar. Is there a way to get a unique index? On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote: No. The indices start at 0 for every RDD. -Xiangrui On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Hello, If I do: DStream transform { rdd.zipWithIndex.map { ... } } Is the index guaranteed to be unique across all RDDs here? Thanks, -Soumitra.
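For reference, a minimal sketch of the UUID idea mentioned above (assuming a DStream[String] called lines; the names are illustrative only):

    import java.util.UUID

    // Attach a random, globally unique id to every record; no ordering guarantees.
    val withIds = lines.map(record => (UUID.randomUUID().toString, record))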
sbt package assembly run spark examples
hi guys, can someone explain or give a stupid user like me a link where I can get the right usage of sbt and Spark in order to run the examples as a stand-alone app? I got to the point of running the app by

sbt run path-to-the-data

but still get some error, because I probably didn't tell the app that the master is local (--master local). In the SparkContext method in the BinaryClassification.scala program it is set by

val conf = new SparkConf().setAppName(s"BinaryClassification with $params")

so... how to adapt the code? In the docs it is written

val sc = new SparkContext("local", "Simple App", YOUR_SPARK_HOME, List("target/scala-2.10/simple-project_2.10-1.0.jar"))

I got the following error:

sbt run ~/git/spark/data/mllib/sample_binary_classification_data.txt
[info] Set current project to Simple Project (in build file:/home/filip/spark-sample/)
[info] Running BinaryClassification ~/git/spark/data/mllib/sample_binary_classification_data.txt
[error] (run-main-0) org.apache.spark.SparkException: A master URL must be set in your configuration
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:166)
at BinaryClassification$.run(BinaryClassification.scala:107)
at BinaryClassification$$anonfun$main$1.apply(BinaryClassification.scala:99)
at BinaryClassification$$anonfun$main$1.apply(BinaryClassification.scala:98)
at scala.Option.map(Option.scala:145)
at BinaryClassification$.main(BinaryClassification.scala:98)
at BinaryClassification.main(BinaryClassification.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
[trace] Stack trace suppressed: run last compile:run for the full output.
java.lang.RuntimeException: Nonzero exit code: 1
at scala.sys.package$.error(package.scala:27)
[trace] Stack trace suppressed: run last compile:run for the full output.
[error] (compile:run) Nonzero exit code: 1
[error] Total time: 7 s, completed Aug 28, 2014 11:04:44 AM
Re: sbt package assembly run spark examples
got it when I read the class reference https://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.SparkConf

conf.setMaster("local[2]")

sets the master to local with 2 threads. but still get some warnings, and the result (see below) is also not right, i think

ps: by the way ... first i ran it like this

sbt run ~/git/spark/data/mllib/sample_binary_classification_data.txt

but the app didn't find the file, because it started at the local directory and pointed to /home/filip/spark-sample/~/git/spark/data/mllib/sample_binary_classification_data.txt any explanation???

ps2: i guess many of us from the user side have problems with scala, sbt and the class lib. has any of you a suggestion how i can overcome this? it is pretty time consuming trial and error :-(

14/08/28 11:47:37 INFO ui.SparkUI: Started SparkUI at http://filip-VirtualBox.localdomain:4040
14/08/28 11:47:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/08/28 11:47:41 WARN snappy.LoadSnappy: Snappy native library not loaded
Training: 84, test: 16.
14/08/28 11:47:47 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
14/08/28 11:47:47 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
Test areaUnderPR = 1.0.
Test areaUnderROC = 1.0.
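A minimal sketch of setting the master programmatically when launching via plain sbt run (the app name and thread count here are illustrative, not the exact code of the example):

    import org.apache.spark.{SparkConf, SparkContext}

    // Run locally with 2 worker threads instead of passing --master on the command line.
    val conf = new SparkConf()
      .setAppName("BinaryClassification")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)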
Spark SQL : how to find element where a field is in a given set
Hi all, What is the expression that I should use with the Spark SQL DSL if I need to retrieve data with a field in a given set? For example: I have the following schema case class Person(name: String, age: Int) and I need to do something like: personTable.where('name in Seq("foo", "bar")) ? Cheers. Jaonary
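Not an answer for the DSL form itself, but a minimal sketch of one workaround, under the assumption that the data is also available as a plain RDD of the case class (people here is an assumed RDD[Person], not something from the question):

    val wanted = Set("foo", "bar")
    // Plain RDD filter: keep only the people whose name is in the given set.
    val matches = people.filter(p => wanted.contains(p.name))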
Re: Trying to run SparkSQL over Spark Streaming
Thanks for the reply. Sorry I could not ask more earlier. Trying to use a parquet file is not working at all.

case class Rec(name: String, pv: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
val d1 = sc.parallelize(Array(("a", 10), ("b", 3))).map(e => Rec(e._1, e._2))
d1.saveAsParquetFile("p1.parquet")
val d2 = sc.parallelize(Array(("a", 10), ("b", 3), ("c", 5))).map(e => Rec(e._1, e._2))
d2.saveAsParquetFile("p2.parquet")
val f1 = sqlContext.parquetFile("p1.parquet")
val f2 = sqlContext.parquetFile("p2.parquet")
f1.registerAsTable("logs")
f2.insertInto("logs")

is giving the error: java.lang.AssertionError: assertion failed: No plan for InsertIntoTable Map(), false same as the one that occurred while trying to insert from an RDD into a table. So I guess inserting into parquet tables is also not supported?
Re: How to join two PairRDD together?
Maybe you can refer to the sliding method of RDD, but right now it's an mllib-private method. Look at org.apache.spark.mllib.rdd.RDDFunctions. 2014-08-26 12:59 GMT+08:00 Vida Ha v...@databricks.com: Can you paste the code? It's unclear to me how/when the out of memory is occurring without seeing the code. On Sun, Aug 24, 2014 at 11:37 PM, Gefei Li gefeili.2...@gmail.com wrote: Hello everyone, I am transplanting a clustering algorithm to the Spark platform, and I have met a problem that has been confusing me for a long time, can someone help me? I have a PairRDD[Integer, Integer] named patternRDD, in which the key represents a number and the value stores information about the key. I want to use two of the VALUEs to calculate a Kendall number, and if the number is greater than 0.6, then output the two KEYs. I have tried to transform the PairRDD to an RDD[Tuple2[Integer, Integer]], add a common key zero to every element, join the two together to get a PairRDD[0, Iterable[Tuple2[Tuple2[key1, value1], Tuple2[key2, value2]]]], and then use the values() method and map the keys out, but it gives me an out of memory error. I think the out of memory error is caused by the few entries of my RDD, but I have no idea how to solve it. Can you help me? Regards, Gefei Li
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
Hi,triedmvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree dep.txtAttached the dep. txt for your information.[WARNING] [WARNING] Some problems were encountered while building the effective settings [WARNING] Unrecognised tag: 'mirrors' (position: START_TAG seen .../mirror\n --\n\n\n mirrors... @161:12) @ /opt/apache-maven-3.1.1/conf/settings.xml, line 161, column 12 [WARNING] [INFO] Scanning for projects... [INFO] [INFO] [INFO] Building Spark Project Examples 1.0.2 [INFO] [INFO] [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ spark-examples_2.10 --- [INFO] org.apache.spark:spark-examples_2.10:jar:1.0.2 [INFO] +- org.apache.spark:spark-core_2.10:jar:1.0.2:provided [INFO] | +- org.apache.hadoop:hadoop-client:jar:2.4.1:provided [INFO] | | +- org.apache.hadoop:hadoop-common:jar:2.4.1:provided [INFO] | | | \- org.apache.hadoop:hadoop-auth:jar:2.4.1:provided [INFO] | | +- org.apache.hadoop:hadoop-hdfs:jar:2.4.1:provided [INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.4.1:provided [INFO] | | | +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.4.1:provided [INFO] | | | | +- org.apache.hadoop:hadoop-yarn-client:jar:2.4.1:provided [INFO] | | | | | \- com.sun.jersey:jersey-client:jar:1.9:provided [INFO] | | | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.4.1:provided [INFO] | | | \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.4.1:provided [INFO] | | +- org.apache.hadoop:hadoop-yarn-api:jar:2.4.1:provided [INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.4.1:provided [INFO] | | | \- org.apache.hadoop:hadoop-yarn-common:jar:2.4.1:provided [INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.4.1:provided [INFO] | | \- org.apache.hadoop:hadoop-annotations:jar:2.4.1:provided [INFO] | +- net.java.dev.jets3t:jets3t:jar:0.9.0:runtime [INFO] | | +- org.apache.httpcomponents:httpclient:jar:4.1.2:compile [INFO] | | +- org.apache.httpcomponents:httpcore:jar:4.1.2:compile [INFO] | | \- com.jamesmurty.utils:java-xmlbuilder:jar:0.4:runtime [INFO] | +- org.apache.curator:curator-recipes:jar:2.4.0:provided [INFO] | | \- org.apache.curator:curator-framework:jar:2.4.0:provided [INFO] | | \- org.apache.curator:curator-client:jar:2.4.0:provided [INFO] | +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:provided [INFO] | | +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:provided [INFO] | | +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:provided [INFO] | | | +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:provided [INFO] | | | \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:provided [INFO] | | \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:provided [INFO] | | \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:provided [INFO] | |\- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:provided [INFO] | +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:provided [INFO] | +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile [INFO] | +- com.google.guava:guava:jar:14.0.1:compile [INFO] | +- org.apache.commons:commons-lang3:jar:3.3.2:provided [INFO] | +- com.google.code.findbugs:jsr305:jar:1.3.9:provided [INFO] | +- org.slf4j:slf4j-api:jar:1.7.5:compile [INFO] | +- org.slf4j:jul-to-slf4j:jar:1.7.5:provided [INFO] | +- org.slf4j:jcl-over-slf4j:jar:1.7.5:provided [INFO] | +- log4j:log4j:jar:1.2.17:compile [INFO] | +- org.slf4j:slf4j-log4j12:jar:1.7.5:compile [INFO] | +- 
com.ning:compress-lzf:jar:1.0.0:provided [INFO] | +- org.xerial.snappy:snappy-java:jar:1.0.5:compile [INFO] | +- com.twitter:chill_2.10:jar:0.3.6:provided [INFO] | | \- com.esotericsoftware.kryo:kryo:jar:2.21:provided [INFO] | | +- com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:provided [INFO] | | +- com.esotericsoftware.minlog:minlog:jar:1.2:provided [INFO] | | \- org.objenesis:objenesis:jar:1.2:provided [INFO] | +- com.twitter:chill-java:jar:0.3.6:provided [INFO] | +- commons-net:commons-net:jar:2.2:compile [INFO] | +- org.spark-project.akka:akka-remote_2.10:jar:2.2.3-shaded-protobuf:provided [INFO] | | +- org.spark-project.akka:akka-actor_2.10:jar:2.2.3-shaded-protobuf:compile [INFO] | | | \- com.typesafe:config:jar:1.0.2:compile [INFO] | | +- io.netty:netty:jar:3.6.6.Final:provided [INFO] | | +- org.spark-project.protobuf:protobuf-java:jar:2.4.1-shaded:compile [INFO] | | \- org.uncommons.maths:uncommons-maths:jar:1.2.2a:provided [INFO] | +- org.spark-project.akka:akka-slf4j_2.10:jar:2.2.3-shaded-protobuf:provided [INFO] | +- org.scala-lang:scala-library:jar:2.10.4:compile
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
I see 0.98.5 in dep.txt You should be good to go. On Thu, Aug 28, 2014 at 3:16 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, tried mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree dep.txt Attached the dep. txt for your information. Regards Arthur On 28 Aug, 2014, at 12:22 pm, Ted Yu yuzhih...@gmail.com wrote: I forgot to include '-Dhadoop.version=2.4.1' in the command below. The modified command passed. You can verify the dependence on hbase 0.98 through this command: mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree dep.txt Cheers On Wed, Aug 27, 2014 at 8:58 PM, Ted Yu yuzhih...@gmail.com wrote: Looks like the patch given by that URL only had the last commit. I have attached pom.xml for spark-1.0.2 to SPARK-1297 You can download it and replace examples/pom.xml with the downloaded pom I am running this command locally: mvn -Phbase-hadoop2,hadoop-2.4,yarn -DskipTests clean package Cheers On Wed, Aug 27, 2014 at 7:57 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, Thanks. Tried [patch -p1 -i 1893.patch](Hunk #1 FAILED at 45.) Is this normal? Regards Arthur patch -p1 -i 1893.patch patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej patching file examples/pom.xml Hunk #1 FAILED at 54. Hunk #2 FAILED at 72. Hunk #3 succeeded at 122 (offset -49 lines). 2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej patching file docs/building-with-maven.md patching file examples/pom.xml Hunk #1 succeeded at 122 (offset -40 lines). Hunk #2 succeeded at 195 (offset -40 lines). On 28 Aug, 2014, at 10:53 am, Ted Yu yuzhih...@gmail.com wrote: Can you use this command ? patch -p1 -i 1893.patch Cheers On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, I tried the following steps to apply the patch 1893 but got Hunk FAILED, can you please advise how to get thru this error? or is my spark-1.0.2 source not the correct one? Regards Arthur wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgz cd spark-1.0.2 wget https://github.com/apache/spark/pull/1893.patch patch 1893.patch patching file pom.xml Hunk #1 FAILED at 45. Hunk #2 FAILED at 110. 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej patching file pom.xml Hunk #1 FAILED at 54. Hunk #2 FAILED at 72. Hunk #3 FAILED at 171. 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej can't find file to patch at input line 267 Perhaps you should have used the -p or --strip option? 
The text leading up to this was: -- | |From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001 |From: tedyu yuzhih...@gmail.com |Date: Mon, 11 Aug 2014 15:57:46 -0700 |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add | description to building-with-maven.md | |--- | docs/building-with-maven.md | 3 +++ | 1 file changed, 3 insertions(+) | |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md |index 672d0ef..f8bcd2b 100644 |--- a/docs/building-with-maven.md |+++ b/docs/building-with-maven.md -- File to patch: On 28 Aug, 2014, at 10:24 am, Ted Yu yuzhih...@gmail.com wrote: You can get the patch from this URL: https://github.com/apache/spark/pull/1893.patch BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml Cheers On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, Thank you so much!! As I am new to Spark, can you please advise the steps about how to apply this patch to my spark-1.0.2 source folder? Regards Arthur On 28 Aug, 2014, at 10:13 am, Ted Yu yuzhih...@gmail.com wrote: See SPARK-1297 The pull request is here: https://github.com/apache/spark/pull/1893 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: (correction: Compilation Error: Spark 1.0.2 with HBase 0.98” , please ignore if duplicated) Hi, I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with HBase 0.98, My steps: wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgz cd spark-1.0.2 edit project/SparkBuild.scala, set HBASE_VERSION // HBase version; set as appropriate. val HBASE_VERSION = 0.98.2 edit pom.xml with following values hadoop.version2.4.1/hadoop.version protobuf.version2.5.0/protobuf.version yarn.version${hadoop.version}/yarn.version hbase.version0.98.5/hbase.version zookeeper.version3.4.6/zookeeper.version hive.version0.13.1/hive.version SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly but it
Spark-submit not running
The file is compiling properly, but when I try to run the jar file using spark-submit, it is giving some errors. I am running Spark locally and have downloaded a pre-built version of Spark named "For Hadoop 2 (HDP2, CDH5)". I don't know if it is a dependency problem, but I don't want to have Hadoop on my system. The error says: 14/08/28 12:34:36 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. Vineet
Re: Spark Streaming checkpoint recovery causes IO re-execution
Not sure if this makes sense, but maybe it would be nice to have a kind of flag available within the code that tells me if I'm running in a normal situation or during a recovery. To better explain this, let's consider the following scenario: I am processing data, let's say from a Kafka stream, and I am updating a database based on the computations. During the recovery I don't want to update the database again (for many reasons, let's just assume that), but I want my system to be in the same status as before; thus I would like to know if my code is running for the first time or during a recovery, so I can avoid updating the database again. More generally, I want to know this in case I'm interacting with external entities.
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
Hi, I tried to start Spark but failed: $ ./sbin/start-all.sh starting org.apache.spark.deploy.master.Master, logging to /mnt/hadoop/spark-1.0.2/sbin/../logs/spark-edhuser-org.apache.spark.deploy.master.Master-1-m133.out failed to launch org.apache.spark.deploy.master.Master: Failed to find Spark assembly in /mnt/hadoop/spark-1.0.2/assembly/target/scala-2.10/ $ ll assembly/ total 20 -rw-rw-r--. 1 hduser hadoop 11795 Jul 26 05:50 pom.xml -rw-rw-r--. 1 hduser hadoop 507 Jul 26 05:50 README drwxrwxr-x. 4 hduser hadoop 4096 Jul 26 05:50 src Regards Arthur On 28 Aug, 2014, at 6:19 pm, Ted Yu yuzhih...@gmail.com wrote: I see 0.98.5 in dep.txt You should be good to go. On Thu, Aug 28, 2014 at 3:16 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, tried mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree dep.txt Attached the dep. txt for your information. Regards Arthur On 28 Aug, 2014, at 12:22 pm, Ted Yu yuzhih...@gmail.com wrote: I forgot to include '-Dhadoop.version=2.4.1' in the command below. The modified command passed. You can verify the dependence on hbase 0.98 through this command: mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree dep.txt Cheers On Wed, Aug 27, 2014 at 8:58 PM, Ted Yu yuzhih...@gmail.com wrote: Looks like the patch given by that URL only had the last commit. I have attached pom.xml for spark-1.0.2 to SPARK-1297 You can download it and replace examples/pom.xml with the downloaded pom I am running this command locally: mvn -Phbase-hadoop2,hadoop-2.4,yarn -DskipTests clean package Cheers On Wed, Aug 27, 2014 at 7:57 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, Thanks. Tried [patch -p1 -i 1893.patch](Hunk #1 FAILED at 45.) Is this normal? Regards Arthur patch -p1 -i 1893.patch patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej patching file examples/pom.xml Hunk #1 FAILED at 54. Hunk #2 FAILED at 72. Hunk #3 succeeded at 122 (offset -49 lines). 2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej patching file docs/building-with-maven.md patching file examples/pom.xml Hunk #1 succeeded at 122 (offset -40 lines). Hunk #2 succeeded at 195 (offset -40 lines). On 28 Aug, 2014, at 10:53 am, Ted Yu yuzhih...@gmail.com wrote: Can you use this command ? patch -p1 -i 1893.patch Cheers On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, I tried the following steps to apply the patch 1893 but got Hunk FAILED, can you please advise how to get thru this error? or is my spark-1.0.2 source not the correct one? Regards Arthur wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgz cd spark-1.0.2 wget https://github.com/apache/spark/pull/1893.patch patch 1893.patch patching file pom.xml Hunk #1 FAILED at 45. Hunk #2 FAILED at 110. 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej patching file pom.xml Hunk #1 FAILED at 54. Hunk #2 FAILED at 72. Hunk #3 FAILED at 171. 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej can't find file to patch at input line 267 Perhaps you should have used the -p or --strip option? 
The text leading up to this was: -- | |From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001 |From: tedyu yuzhih...@gmail.com |Date: Mon, 11 Aug 2014 15:57:46 -0700 |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add | description to building-with-maven.md | |--- | docs/building-with-maven.md | 3 +++ | 1 file changed, 3 insertions(+) | |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md |index 672d0ef..f8bcd2b 100644 |--- a/docs/building-with-maven.md |+++ b/docs/building-with-maven.md -- File to patch: On 28 Aug, 2014, at 10:24 am, Ted Yu yuzhih...@gmail.com wrote: You can get the patch from this URL: https://github.com/apache/spark/pull/1893.patch BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml Cheers On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, Thank you so much!! As I am new to Spark, can you please advise the steps about how to apply this patch to my spark-1.0.2 source folder? Regards Arthur On 28 Aug, 2014, at 10:13 am, Ted Yu yuzhih...@gmail.com wrote: See SPARK-1297 The pull request is here: https://github.com/apache/spark/pull/1893 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: (correction: Compilation Error: Spark 1.0.2 with HBase 0.98” , please ignore if duplicated)
how to filter value in spark
fileA = 1 2 3 4, one number per line, saved in /sparktest/1/
fileB = 3 4 5 6, one number per line, saved in /sparktest/2/
I want to get 3 and 4.

var a = sc.textFile("/sparktest/1/").map((_, 1))
var b = sc.textFile("/sparktest/2/").map((_, 1))
a.filter(param => b.lookup(param._1).length > 0).map(_._1).foreach(println)

Error thrown: scala.MatchError: null, PairRDDFunctions.lookup...
Re: Spark-submit not running
You need to set HADOOP_HOME. Is Spark officially supposed to work on Windows or not at this stage? I know the build doesn't quite yet. On Thu, Aug 28, 2014 at 11:37 AM, Hingorani, Vineet vineet.hingor...@sap.com wrote: The file is compiling properly but when I try to run the jar file using spark-submit, it is giving some errors. I am running spark locally and have downloaded a pre-built version of Spark named “For Hadoop 2 (HDP2, CDH5)”. AI don’t know if it is a dependency problem but I don’t want to have Hadoop in my system. The error says: 14/08/28 12:34:36 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. Vineet
Re: how to filter value in spark
On 08/28/2014 07:20 AM, marylucy wrote: fileA = 1 2 3 4, one number per line, saved in /sparktest/1/ fileB = 3 4 5 6, one number per line, saved in /sparktest/2/ I want to get 3 and 4 var a = sc.textFile("/sparktest/1/").map((_, 1)) var b = sc.textFile("/sparktest/2/").map((_, 1)) a.filter(param => b.lookup(param._1).length > 0).map(_._1).foreach(println) Error thrown: scala.MatchError: null, PairRDDFunctions.lookup... The issue is nesting of the b RDD inside a transformation of the a RDD. Consider using intersection, it's more idiomatic: a.intersection(b).foreach(println) but note that intersection will remove duplicates. best, matt
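A minimal sketch of that suggestion, assuming the goal is just the values common to both text files (paths as in the question):

    val a = sc.textFile("/sparktest/1/")
    val b = sc.textFile("/sparktest/2/")
    // Elements appearing in both files; intersection also removes duplicates.
    a.intersection(b).foreach(println)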
how to specify columns in groupby
Hi all, I have an RDD which has values in the format id,date,cost. I want to group the elements based on the id and date columns and get the sum of the cost for each group. Can somebody tell me how to do this? Thanks Regards, Meethu M
Re: How to join two PairRDD together?
It sounds like you are adding the same key to every element, and joining, in order to accomplish a full cartesian join? I can imagine doing it that way would blow up somewhere. There is a cartesian() method to do this maybe more efficiently. However if your data set is large, this sort of algorithm for computing Kendall's tau is going to be very slow since it's N^2 and would create an unspeakably large shuffle. There are faster algorithms for this statistic. Also consider sampling your data and computing the join over a small sample to estimate the statistic. On Thu, Aug 28, 2014 at 11:15 AM, Yanbo Liang yanboha...@gmail.com wrote: Maybe you can refer sliding method of RDD, but it's right now mllib private method. Look at org.apache.spark.mllib.rdd.RDDFunctions. 2014-08-26 12:59 GMT+08:00 Vida Ha v...@databricks.com: Can you paste the code? It's unclear to me how/when the out of memory is occurring without seeing the code. On Sun, Aug 24, 2014 at 11:37 PM, Gefei Li gefeili.2...@gmail.com wrote: Hello everyone, I am transplanting a clustering algorithm to spark platform, and I meet a problem confusing me for a long time, can someone help me? I have a PairRDDInteger, Integer named patternRDD, which the key represents a number and the value stores an information of the key. And I want to use two of the VALUEs to calculate a kendall number, and if the number is greater than 0.6, then output the two KEYs. I have tried to transform the PairRDD to a RDDTuple2Integer, Integer, and add a common key zero to them, and join two together then get a PairRDD0, IterableTuple2Tuple2key1, value1, Tuple2key2, value2, and tried to use values() method and map the keys out, but it gives me an out of memory error. I think the out of memory error is caused by the few entries of my RDD, but I have no idea how to solve it. Can you help me? Regards, Gefei Li
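A minimal sketch of the cartesian() suggestion above; kendallTau is a placeholder for whatever correlation function is actually used, and the 0.6 threshold is the one from the original question:

    // patternRDD: RDD[(Int, V)] where V is whatever type holds the per-key information.
    val keyPairs = patternRDD.cartesian(patternRDD)
      .filter { case ((k1, _), (k2, _)) => k1 < k2 }                  // skip self-pairs and mirrored duplicates
      .filter { case ((_, v1), (_, v2)) => kendallTau(v1, v2) > 0.6 } // keep strongly correlated pairs
      .map { case ((k1, _), (k2, _)) => (k1, k2) }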
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
0.98.2 is not an HBase version, but 0.98.2-hadoop2 is: http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.hbase%22%20AND%20a%3A%22hbase%22 On Thu, Aug 28, 2014 at 2:54 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with HBase 0.98, My steps: wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgz cd spark-1.0.2 edit project/SparkBuild.scala, set HBASE_VERSION // HBase version; set as appropriate. val HBASE_VERSION = 0.98.2 edit pom.xml with following values hadoop.version2.4.1/hadoop.version protobuf.version2.5.0/protobuf.version yarn.version${hadoop.version}/yarn.version hbase.version0.98.5/hbase.version zookeeper.version3.4.6/zookeeper.version hive.version0.13.1/hive.version SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly but it fails because of UNRESOLVED DEPENDENCIES hbase;0.98.2 Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or should I set HBASE_VERSION back to “0.94.6? Regards Arthur [warn] :: [warn] :: UNRESOLVED DEPENDENCIES :: [warn] :: [warn] :: org.apache.hbase#hbase;0.98.2: not found [warn] :: sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: not found at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217) at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126) at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125) at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116) at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116) at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104) at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51) at sbt.IvySbt$$anon$3.call(Ivy.scala:60) at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98) at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81) at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102) at xsbt.boot.Using$.withResource(Using.scala:11) at xsbt.boot.Using$.apply(Using.scala:10) at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62) at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52) at xsbt.boot.Locks$.apply0(Locks.scala:31) at xsbt.boot.Locks$.apply(Locks.scala:28) at sbt.IvySbt.withDefaultLogger(Ivy.scala:60) at sbt.IvySbt.withIvy(Ivy.scala:101) at sbt.IvySbt.withIvy(Ivy.scala:97) at sbt.IvySbt$Module.withModule(Ivy.scala:116) at sbt.IvyActions$.update(IvyActions.scala:125) at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1170) at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1168) at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1191) at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1189) at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35) at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1193) at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1188) at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45) at sbt.Classpaths$.cachedUpdate(Defaults.scala:1196) at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1161) at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1139) at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47) at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42) at sbt.std.Transform$$anon$4.work(System.scala:64) at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237) at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237) at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18) at sbt.Execute.work(Execute.scala:244) at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237) at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237) at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:160) at sbt.CompletionService$$anon$2.call(CompletionService.scala:30) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) [error] (examples/*:update) sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: not found [error] Total time: 270 s, completed Aug 28, 2014 9:42:05 AM
Re: how to specify columns in groupby
For your reference:

val d1 = textFile.map(line => {
  val fileds = line.split(",")
  ((fileds(0), fileds(1)), fileds(2).toDouble)
})
val d2 = d1.reduceByKey(_ + _)
d2.foreach(println)

2014-08-28 20:04 GMT+08:00 MEETHU MATHEW meethu2...@yahoo.co.in: Hi all, I have an RDD which has values in the format id,date,cost. I want to group the elements based on the id and date columns and get the sum of the cost for each group. Can somebody tell me how to do this? Thanks Regards, Meethu M
Re: Key-Value Operations
If you mean your values are all a Seq or similar already, then you just take the top 1 ordered by the size of the value: rdd.top(1)(Ordering.by(_._2.size)) On Thu, Aug 28, 2014 at 9:34 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have a RDD of key-value pairs. Now I want to find the key for which the values has the largest number of elements. How should I do that? Basically I want to select the key for which the number of items in values is the largest. Thank You
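A small usage sketch of that approach (the sample data here is made up purely for illustration):

    val rdd = sc.parallelize(Seq(("a", Seq(1, 2)), ("b", Seq(1, 2, 3)), ("c", Seq(1))))
    // Take the single pair whose value has the most elements, then keep just its key.
    val biggestKey = rdd.top(1)(Ordering.by(_._2.size)).head._1   // "b"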
RE: Spark-submit not running
How can I set HADOOP_HOME if I am running the Spark on my local machine without anything else? Do I have to install some other pre-built file? I am on Windows 7 and Spark’s official site says that it is available on Windows, I added Java path in the PATH variable. Vineet From: Sean Owen [mailto:so...@cloudera.com] Sent: Donnerstag, 28. August 2014 13:49 To: Hingorani, Vineet Cc: user@spark.apache.org Subject: Re: Spark-submit not running You need to set HADOOP_HOME. Is Spark officially supposed to work on Windows or not at this stage? I know the build doesn't quite yet. On Thu, Aug 28, 2014 at 11:37 AM, Hingorani, Vineet vineet.hingor...@sap.commailto:vineet.hingor...@sap.com wrote: The file is compiling properly but when I try to run the jar file using spark-submit, it is giving some errors. I am running spark locally and have downloaded a pre-built version of Spark named “For Hadoop 2 (HDP2, CDH5)”. AI don’t know if it is a dependency problem but I don’t want to have Hadoop in my system. The error says: 14/08/28 12:34:36 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. Vineet
Re: Spark-submit not running
Can you copy the exact spark-submit command that you are running? You should be able to run it locally without installing hadoop. Here is an example on how to run the job locally. # Run application locally on 8 cores ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master local[8] \ /path/to/examples.jar \ 100 On Aug 28, 2014, at 7:54 AM, Hingorani, Vineet vineet.hingor...@sap.com wrote: How can I set HADOOP_HOME if I am running the Spark on my local machine without anything else? Do I have to install some other pre-built file? I am on Windows 7 and Spark’s official site says that it is available on Windows, I added Java path in the PATH variable. Vineet From: Sean Owen [mailto:so...@cloudera.com] Sent: Donnerstag, 28. August 2014 13:49 To: Hingorani, Vineet Cc: user@spark.apache.org Subject: Re: Spark-submit not running You need to set HADOOP_HOME. Is Spark officially supposed to work on Windows or not at this stage? I know the build doesn't quite yet. On Thu, Aug 28, 2014 at 11:37 AM, Hingorani, Vineet vineet.hingor...@sap.com wrote: The file is compiling properly but when I try to run the jar file using spark-submit, it is giving some errors. I am running spark locally and have downloaded a pre-built version of Spark named “For Hadoop 2 (HDP2, CDH5)”.AI don’t know if it is a dependency problem but I don’t want to have Hadoop in my system. The error says: 14/08/28 12:34:36 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. Vineet
Re: Spark-submit not running
Yes, but I think at the moment there is still a dependency on Hadoop even when not using it. See https://issues.apache.org/jira/browse/SPARK-2356 On Thu, Aug 28, 2014 at 2:14 PM, Guru Medasani gdm...@outlook.com wrote: Can you copy the exact spark-submit command that you are running? You should be able to run it locally without installing hadoop. Here is an example on how to run the job locally. # Run application locally on 8 cores ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master local[8] \ /path/to/examples.jar \ 100
SPARK on YARN, containers fails
Hi there, I'm trying to run the JavaSparkPi example on YARN with master = yarn-client, but I have a problem. It runs smoothly when submitting the application, and the first container for the Application Master works too. When the job is starting and there are some tasks to do, I'm getting this warning on the console (I'm using the Windows cmd if this makes any difference): WARN cluster.YarnClientClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory When I check the logs for the container with the Application Master, it is launching containers for executors properly, then goes with: INFO YarnAllocationHandler: Completed container container_1409217202587_0003_01_02 (state: COMPLETE, exit status: 1) INFO YarnAllocationHandler: Container marked as failed: container_1409217202587_0003_01_02 And tries to re-launch them. In the failed container log there is only this: Error: Could not find or load main class pwd..sp...@gbv06758291.my.secret.address.net:63680.user.CoarseGrainedScheduler Any hints? I run YARN on 1 machine with 16GB RAM and 2 cores, so it's definitely not a case of propagation of the spark assembly jar, because without it the first container with the Application Master would fail too.
Re: Spark Streaming checkpoint recovery causes IO re-execution
Can you clarify the scenario:

val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint(checkpointDirectory)
val stream = KafkaUtils.createStream(...)
val wordCounts = lines.flatMap(_.split(" ")).map(x => (x, 1L))
val wordDstream = wordCounts.updateStateByKey[Int](updateFunc)
wordDstream.foreachRDD( rdd => rdd.foreachPartition( /* sync state to external store */ ) )

My impression is that during recovery from checkpoint, your wordDstream would be in the state that it was before the crash + 1 batch interval forward when you get to the foreachRDD part -- even if re-creating the pre-crash RDD is really slow. So if your driver goes down at 10:20 and you restart at 10:30, I thought at the time of the DB write wordDstream would have exactly the state of 10:20 + 10 seconds' worth of aggregated stream data? I don't really understand what you mean by "Upon metadata checkpoint recovery (before the data checkpoint occurs)", but it sounds like you're observing the same DB write happening twice? I don't have any advice for you, but I am interested in understanding better what happens in the recovery scenario, so just trying to clarify what you observe. On Thu, Aug 28, 2014 at 6:42 AM, GADV giulio_devec...@yahoo.com wrote: Not sure if this makes sense, but maybe it would be nice to have a kind of flag available within the code that tells me if I'm running in a normal situation or during a recovery. To better explain this, let's consider the following scenario: I am processing data, let's say from a Kafka stream, and I am updating a database based on the computations. During the recovery I don't want to update the database again (for many reasons, let's just assume that), but I want my system to be in the same status as before, thus I would like to know if my code is running for the first time or during a recovery so I can avoid updating the database again. More generally I want to know this in case I'm interacting with external entities.
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
I didn't see that problem. Did you run this command ? mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests clean package Here is what I got: TYus-MacBook-Pro:spark-1.0.2 tyu$ sbin/start-all.sh starting org.apache.spark.deploy.master.Master, logging to /Users/tyu/spark-1.0.2/sbin/../logs/spark-tyu-org.apache.spark.deploy.master.Master-1-TYus-MacBook-Pro.local.out localhost: ssh: connect to host localhost port 22: Connection refused TYus-MacBook-Pro:spark-1.0.2 tyu$ vi logs/spark-tyu-org.apache.spark.deploy.master.Master-1-TYus-MacBook-Pro.local.out TYus-MacBook-Pro:spark-1.0.2 tyu$ jps 11563 Master 11635 Jps TYus-MacBook-Pro:spark-1.0.2 tyu$ ps aux | grep 11563 tyu 11563 0.7 0.8 196 142444 s003 S 6:52AM 0:02.72 /Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/bin/java -cp ::/Users/tyu/spark-1.0.2/conf:/Users/tyu/spark-1.0.2/assembly/target/scala-2.10/spark-assembly-1.0.2-hadoop2.4.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip TYus-MacBook-Pro.local --port 7077 --webui-port 8080 TYus-MacBook-Pro:spark-1.0.2 tyu$ ls -l assembly/target/scala-2.10/spark-assembly-1.0.2-hadoop2.4.1.jar -rw-r--r-- 1 tyu staff 121182305 Aug 27 21:13 assembly/target/scala-2.10/spark-assembly-1.0.2-hadoop2.4.1.jar Cheers On Thu, Aug 28, 2014 at 3:42 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I tried to start Spark but failed: $ ./sbin/start-all.sh starting org.apache.spark.deploy.master.Master, logging to /mnt/hadoop/spark-1.0.2/sbin/../logs/spark-edhuser-org.apache.spark.deploy.master.Master-1-m133.out failed to launch org.apache.spark.deploy.master.Master: Failed to find Spark assembly in /mnt/hadoop/spark-1.0.2/assembly/target/scala-2.10/ $ ll assembly/ total 20 -rw-rw-r--. 1 hduser hadoop 11795 Jul 26 05:50 pom.xml -rw-rw-r--. 1 hduser hadoop 507 Jul 26 05:50 README drwxrwxr-x. 4 hduser hadoop 4096 Jul 26 05:50 *src* Regards Arthur On 28 Aug, 2014, at 6:19 pm, Ted Yu yuzhih...@gmail.com wrote: I see 0.98.5 in dep.txt You should be good to go. On Thu, Aug 28, 2014 at 3:16 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, tried mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree dep.txt Attached the dep. txt for your information. Regards Arthur On 28 Aug, 2014, at 12:22 pm, Ted Yu yuzhih...@gmail.com wrote: I forgot to include '-Dhadoop.version=2.4.1' in the command below. The modified command passed. You can verify the dependence on hbase 0.98 through this command: mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree dep.txt Cheers On Wed, Aug 27, 2014 at 8:58 PM, Ted Yu yuzhih...@gmail.com wrote: Looks like the patch given by that URL only had the last commit. I have attached pom.xml for spark-1.0.2 to SPARK-1297 You can download it and replace examples/pom.xml with the downloaded pom I am running this command locally: mvn -Phbase-hadoop2,hadoop-2.4,yarn -DskipTests clean package Cheers On Wed, Aug 27, 2014 at 7:57 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, Thanks. Tried [patch -p1 -i 1893.patch](Hunk #1 FAILED at 45.) Is this normal? Regards Arthur patch -p1 -i 1893.patch patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej patching file examples/pom.xml Hunk #1 FAILED at 54. Hunk #2 FAILED at 72. Hunk #3 succeeded at 122 (offset -49 lines). 
2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej patching file docs/building-with-maven.md patching file examples/pom.xml Hunk #1 succeeded at 122 (offset -40 lines). Hunk #2 succeeded at 195 (offset -40 lines). On 28 Aug, 2014, at 10:53 am, Ted Yu yuzhih...@gmail.com wrote: Can you use this command ? patch -p1 -i 1893.patch Cheers On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, I tried the following steps to apply the patch 1893 but got Hunk FAILED, can you please advise how to get thru this error? or is my spark-1.0.2 source not the correct one? Regards Arthur wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgz cd spark-1.0.2 wget https://github.com/apache/spark/pull/1893.patch patch 1893.patch patching file pom.xml Hunk #1 FAILED at 45. Hunk #2 FAILED at 110. 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej patching file pom.xml Hunk #1 FAILED at 54. Hunk #2 FAILED at 72. Hunk #3 FAILED at 171. 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej can't find file to patch at input line 267 Perhaps you should have used the -p or --strip option? The text leading up to this was: -- | |From
repartitioning an RDD yielding imbalance
I've got an RDD where each element is a long string (a whole document). I'm using pyspark, so some of the handy partition-handling functions aren't available, and I count the number of elements in each partition with:

def count_partitions(id, iterator):
    c = sum(1 for _ in iterator)
    yield (id, c)

rdd.mapPartitionsWithSplit(count_partitions).collectAsMap()

This returns the following: {0: 866, 1: 1158, 2: 828, 3: 876} But if I do: rdd.repartition(8).mapPartitionsWithSplit(count_partitions).collectAsMap() I get {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 3594, 7: 134} Why this strange redistribution of elements? I'm obviously misunderstanding how Spark does the partitioning -- is it a problem with having a list of strings as an RDD? Help very much appreciated! Thanks, Rok
Re: Graphx: undirected graph support
A bit like the analogy of a linked list versus a doubly linked list: it might introduce overhead in terms of memory usage, but you could use two directed edges to stand in for an undirected edge.
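A minimal sketch of that idea in GraphX, assuming the undirected edges are available as an RDD of vertex-id pairs (the sample edges and the default vertex attribute are illustrative only):

    import org.apache.spark.graphx.{Graph, VertexId}
    import org.apache.spark.rdd.RDD

    val undirectedEdges: RDD[(VertexId, VertexId)] = sc.parallelize(Seq((1L, 2L), (2L, 3L)))
    // Materialize each undirected edge as two directed edges, one per direction.
    val bothDirections = undirectedEdges.union(undirectedEdges.map { case (a, b) => (b, a) })
    val graph = Graph.fromEdgeTuples(bothDirections, 1)   // 1 is the default vertex attribute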
Re: Spark-submit not running
Thanks Sean. Looks like there is a workaround as per the JIRA https://issues.apache.org/jira/browse/SPARK-2356 . http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7. Maybe that's worth a shot? On Aug 28, 2014, at 8:15 AM, Sean Owen so...@cloudera.com wrote: Yes, but I think at the moment there is still a dependency on Hadoop even when not using it. See https://issues.apache.org/jira/browse/SPARK-2356 On Thu, Aug 28, 2014 at 2:14 PM, Guru Medasani gdm...@outlook.com wrote: Can you copy the exact spark-submit command that you are running? You should be able to run it locally without installing hadoop. Here is an example on how to run the job locally. # Run application locally on 8 cores ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master local[8] \ /path/to/examples.jar \ 100
RE: Spark-submit not running
Thank you Sean and Guru for giving the information. Btw I have to put this line, but where should I add it? In my Scala file, below where the other ‘import …’ lines are written? System.setProperty("hadoop.home.dir", "d:\\winutil\\") Thank you Vineet From: Sean Owen [mailto:so...@cloudera.com] Sent: Donnerstag, 28. August 2014 13:49 To: Hingorani, Vineet Cc: user@spark.apache.org Subject: Re: Spark-submit not running You need to set HADOOP_HOME. Is Spark officially supposed to work on Windows or not at this stage? I know the build doesn't quite yet. On Thu, Aug 28, 2014 at 11:37 AM, Hingorani, Vineet vineet.hingor...@sap.com wrote: The file is compiling properly but when I try to run the jar file using spark-submit, it is giving some errors. I am running Spark locally and have downloaded a pre-built version of Spark named “For Hadoop 2 (HDP2, CDH5)”. I don’t know if it is a dependency problem but I don’t want to have Hadoop on my system. The error says: 14/08/28 12:34:36 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. Vineet
Re: Spark-submit not running
You should set this as early as possible in your program, before other code runs. On Thu, Aug 28, 2014 at 3:27 PM, Hingorani, Vineet vineet.hingor...@sap.com wrote: Thank you Sean and Guru for giving the information. Btw I have to put this line, but where should I add it? In my scala file below other ‘import …’ lines are written? System.setProperty(hadoop.home.dir, d:\\winutil\\) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Spark-submit not running
The following error is given when I try to add this line: [info] Set current project to Simple Project (in build file:/C:/Users/D062844/Desktop/Hand sOnSpark/Install/spark-1.0.2-bin-hadoop2/) [info] Compiling 1 Scala source to C:\Users\D062844\Desktop\HandsOnSpark\Install\spark-1.0 .2-bin-hadoop2\target\scala-2.10\classes... [error] C:\Users\D062844\Desktop\HandsOnSpark\Install\spark-1.0.2-bin-hadoop2\src\main\sca la\SimpleApp.scala:1: expected class or object definition [error] System.setProperty(hadoop.home.dir, c:\\winutil\\) [error] ^ [error] one error found [error] (compile:compile) Compilation failed Vineet -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Donnerstag, 28. August 2014 16:30 To: Hingorani, Vineet Cc: user@spark.apache.org Subject: Re: Spark-submit not running You should set this as early as possible in your program, before other code runs. On Thu, Aug 28, 2014 at 3:27 PM, Hingorani, Vineet vineet.hingor...@sap.com wrote: Thank you Sean and Guru for giving the information. Btw I have to put this line, but where should I add it? In my scala file below other ‘import …’ lines are written? System.setProperty(hadoop.home.dir, d:\\winutil\\)
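For reference: the compile error above comes from placing the statement at the top level of the source file, and in Scala, statements have to live inside a class or object. A minimal sketch of one possible placement, assuming a simple standalone app; the winutils path is a placeholder, not something verified in this thread:

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Runs before any Spark/Hadoop code touches the Hadoop libraries.
    // The path is a placeholder for wherever winutils.exe lives locally.
    System.setProperty("hadoop.home.dir", "d:\\winutil\\")

    val sc = new SparkContext(new SparkConf().setAppName("Simple Project"))
    // ... application logic ...
    sc.stop()
  }
}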
Change delimiter when collecting SchemaRDD
Hi All, Is there any way to change the delimiter from being a comma? Some of the strings in my data contain commas as well, making it very difficult to parse the results. Yadid
Print to spark log
Hi all, Just wondering if there is a way to use logging to print some additional info to the Spark logs (similar to debug in Scalding). Thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Print-to-spark-log-tp13035.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: CUDA in spark, especially in MLlib?
Breeze author David also has a github project on cuda binding in scalado you prefer using java or scala ? On Aug 27, 2014 2:05 PM, Frank van Lankvelt f.vanlankv...@onehippo.com wrote: you could try looking at ScalaCL[1], it's targeting OpenCL rather than CUDA, but that might be close enough? cheers, Frank 1. https://github.com/ochafik/ScalaCL On Wed, Aug 27, 2014 at 7:33 PM, Wei Tan w...@us.ibm.com wrote: Thank you all. Actually I was looking at JCUDA. Function wise this may be a perfect solution to offload computation to GPU. Will see how performance it will be, especially with the Java binding. Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J. Watson Research Center *http://researcher.ibm.com/person/us-wtan* http://researcher.ibm.com/person/us-wtan From:Chen He airb...@gmail.com To:Antonio Jesus Navarro ajnava...@stratio.com, Cc:Matei Zaharia matei.zaha...@gmail.com, user@spark.apache.org, Wei Tan/Watson/IBM@IBMUS Date:08/27/2014 11:03 AM Subject:Re: CUDA in spark, especially in MLlib? -- JCUDA can let you do that in Java *http://www.jcuda.org* http://www.jcuda.org/ On Wed, Aug 27, 2014 at 1:48 AM, Antonio Jesus Navarro *ajnava...@stratio.com* ajnava...@stratio.com wrote: Maybe this would interest you: CPU and GPU-accelerated Machine Learning Library: *https://github.com/BIDData/BIDMach* https://github.com/BIDData/BIDMach 2014-08-27 4:08 GMT+02:00 Matei Zaharia *matei.zaha...@gmail.com* matei.zaha...@gmail.com: You should try to find a Java-based library, then you can call it from Scala. Matei On August 26, 2014 at 6:58:11 PM, Wei Tan (*w...@us.ibm.com* w...@us.ibm.com) wrote: Hi I am trying to find a CUDA library in Scala, to see if some matrix manipulation in MLlib can be sped up. I googled a few but found no active projects on Scala+CUDA. Python is supported by CUDA though. Any suggestion on whether this idea makes any sense? Best regards, Wei -- Amsterdam - Oosteinde 11, 1017 WT Amsterdam Boston - 1 Broadway, Cambridge, MA 02142 US +1 877 414 4776 (toll free) Europe +31(0)20 522 4466 www.onehippo.com
SPARK-1297 patch error (spark-1297-v4.txt )
Hi, I have just tried to apply the patch of SPARK-1297: https://issues.apache.org/jira/browse/SPARK-1297 There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt respectively. When applying the 2nd one, I got Hunk #1 FAILED at 45 Can you please advise how to fix it in order to make the compilation of Spark Project Examples success? (Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2) Regards Arthur patch -p1 -i spark-1297-v4.txt patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej below is the content of examples/pom.xml.rej: +++ examples/pom.xml @@ -45,6 +45,39 @@ /dependency /dependencies /profile +profile + idhbase-hadoop2/id + activation +property + namehbase.profile/name + valuehadoop2/value +/property + /activation + properties +protobuf.version2.5.0/protobuf.version +hbase.version0.98.4-hadoop2/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile +profile + idhbase-hadoop1/id + activation +property + name!hbase.profile/name +/property + /activation + properties +hbase.version0.98.4-hadoop1/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile + /profiles dependencies This caused the related compilation failed: [INFO] Spark Project Examples FAILURE [0.102s]
Re: SPARK-1297 patch error (spark-1297-v4.txt )
I attached patch v5 which corresponds to the pull request. Please try again. On Thu, Aug 28, 2014 at 9:50 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I have just tried to apply the patch of SPARK-1297: https://issues.apache.org/jira/browse/SPARK-1297 There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt respectively. When applying the 2nd one, I got Hunk #1 FAILED at 45 Can you please advise how to fix it in order to make the compilation of Spark Project Examples success? (Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2) Regards Arthur patch -p1 -i spark-1297-v4.txt patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej below is the content of examples/pom.xml.rej: +++ examples/pom.xml @@ -45,6 +45,39 @@ /dependency /dependencies /profile +profile + idhbase-hadoop2/id + activation +property + namehbase.profile/name + valuehadoop2/value +/property + /activation + properties +protobuf.version2.5.0/protobuf.version +hbase.version0.98.4-hadoop2/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile +profile + idhbase-hadoop1/id + activation +property + name!hbase.profile/name +/property + /activation + properties +hbase.version0.98.4-hadoop1/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile + /profiles dependencies This caused the related compilation failed: [INFO] Spark Project Examples FAILURE [0.102s]
Re: SPARK-1297 patch error (spark-1297-v4.txt )
Hi, patch -p1 -i spark-1297-v5.txt can't find file to patch at input line 5 Perhaps you used the wrong -p or --strip option? The text leading up to this was: -- |diff --git docs/building-with-maven.md docs/building-with-maven.md |index 672d0ef..f8bcd2b 100644 |--- docs/building-with-maven.md |+++ docs/building-with-maven.md -- File to patch: Please advise Regards Arthur On 29 Aug, 2014, at 12:50 am, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I have just tried to apply the patch of SPARK-1297: https://issues.apache.org/jira/browse/SPARK-1297 There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt respectively. When applying the 2nd one, I got Hunk #1 FAILED at 45 Can you please advise how to fix it in order to make the compilation of Spark Project Examples success? (Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2) Regards Arthur patch -p1 -i spark-1297-v4.txt patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej below is the content of examples/pom.xml.rej: +++ examples/pom.xml @@ -45,6 +45,39 @@ /dependency /dependencies /profile +profile + idhbase-hadoop2/id + activation +property + namehbase.profile/name + valuehadoop2/value +/property + /activation + properties +protobuf.version2.5.0/protobuf.version +hbase.version0.98.4-hadoop2/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile +profile + idhbase-hadoop1/id + activation +property + name!hbase.profile/name +/property + /activation + properties +hbase.version0.98.4-hadoop1/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile + /profiles dependencies This caused the related compilation failed: [INFO] Spark Project Examples FAILURE [0.102s]
New SparkR mailing list, JIRA
Hi I'd like to announce a couple of updates to the SparkR project. In order to facilitate better collaboration for new features and development we have a new mailing list and issue tracker for SparkR. - The new JIRA is hosted at https://sparkr.atlassian.net/browse/SPARKR/ and we have migrated all existing Github issues to the JIRA. Please submit any bugs / improvements to this JIRA going forward. - There is a new mailing list sparkr-...@googlegroups.com that will be used for design discussions for new features and development related issues. We will still be answering user issues on the Apache Spark mailing lists. Please let me know if you have any questions. Thanks Shivaram
Converting a DStream's RDDs to SchemaRDDs
Hi Folks, I'd like to find out tips on how to convert the RDDs inside a Spark Streaming DStream to a set of SchemaRDDs. My DStream contains JSON data pushed over from Kafka, and I'd like to use SparkSQL's JSON import function (i.e. jsonRDD) to register the JSON dataset as a table, and perform queries on it. Here's a code snippet of my latest attempt (in Scala): … val sc = new SparkContext(conf) val ssc = new StreamingContext("local", this.getClass.getName, Seconds(1)) ssc.checkpoint("checkpoint") val stream = KafkaUtils.createStream(ssc, "localhost:2181", "group", Map("topic" -> 10)).map(_._2) val sql = new SQLContext(sc) stream.foreachRDD(rdd => { if (rdd.count > 0) { // message received val sqlRDD = sql.jsonRDD(rdd) sqlRDD.printSchema() } else { println("No message received") } }) … This compiles and runs when I submit it to Spark (local-mode); however, I never seem to be able to successfully see a schema printed on my console, via the "sqlRDD.printSchema()" method when Kafka is streaming my JSON messages to the "topic" topic name. I know my JSON is valid and my Kafka connection works fine, I've been able to print the stream messages in their raw format, just not as SchemaRDDs. Any tips? Suggestions? Thanks much, --- Rishi Verma NASA Jet Propulsion Laboratory California Institute of Technology - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark webUI - application details page
Hi All, @Andrew Thanks for the tips. I just built the master branch of Spark last night, but am still having problems viewing history through the standalone UI. I dug into the Spark job events directories as you suggested, and I see at a minimum 'SPARK_VERSION_1.0.0' and 'EVENT_LOG_1'; for applications that call 'sc.stop()' I also see 'APPLICATION_COMPLETE'. The version and application complete files are empty; the event log file contains the information one would need to repopulate the web UI. The follow may be helpful in debugging this: -Each job directory (e.g. '/tmp/spark-events/testhistoryjob-1409246088110') and the files within are owned by the user who ran the job with permissions 770. This prevents the 'spark' user from accessing the contents. -When I make a directory and contents accessible to the spark user, the history server (invoked as 'sbin/start-history-server.sh /tmp/spark-events') is able to display the history, but the standalone web UI still produces the following error: 'No event logs found for application HappyFunTimes in file:///tmp/spark-events/testhistoryjob-1409246088110. Did you specify the correct logging directory?' -Incase it matters, I'm running pyspark. Do you know what may be causing this? When you attempt to reproduce locally, who do you observe owns the files in /tmp/spark-events? best, -Brad On Tue, Aug 26, 2014 at 8:51 AM, SK skrishna...@gmail.com wrote: I have already tried setting the history server and accessing it on master-url:18080 as per the link. But the page does not list any completed applications. As I mentioned in my previous mail, I am running Spark in standalone mode on the cluster (as well as on my local machine). According to the link, it appears that the history server is required only in mesos or yarn mode, not in standalone mode. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p12834.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark Streaming checkpoint recovery causes IO re-execution
Hi Yana, The fact is that the DB writing happens at the node level and not at the Spark level. One of the benefits of Spark's distributed computing nature is enabling IO distribution as well. For example, it is much faster to have the nodes write to Cassandra instead of having everything collected at the driver and sending the writes from there. The problem is with the node computations, which get redone upon recovery. If these lambda functions send events to other systems, those events would get resent upon re-computation, causing overall system instability. Hope this helps you understand the problem. tnks, Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-checkpoint-recovery-causes-IO-re-execution-tp12568p13043.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SPARK-1297 patch error (spark-1297-v4.txt )
bq. Spark 1.0.2 For the above release, you can download pom.xml attached to the JIRA and place it in examples directory I verified that the build against 0.98.4 worked using this command: mvn -Dhbase.profile=hadoop2 -Phadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests clean package Patch v5 is @ level 0 - you don't need to use -p1 in the patch command. Cheers On Thu, Aug 28, 2014 at 9:50 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I have just tried to apply the patch of SPARK-1297: https://issues.apache.org/jira/browse/SPARK-1297 There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt respectively. When applying the 2nd one, I got Hunk #1 FAILED at 45 Can you please advise how to fix it in order to make the compilation of Spark Project Examples success? (Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2) Regards Arthur patch -p1 -i spark-1297-v4.txt patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej below is the content of examples/pom.xml.rej: +++ examples/pom.xml @@ -45,6 +45,39 @@ /dependency /dependencies /profile +profile + idhbase-hadoop2/id + activation +property + namehbase.profile/name + valuehadoop2/value +/property + /activation + properties +protobuf.version2.5.0/protobuf.version +hbase.version0.98.4-hadoop2/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile +profile + idhbase-hadoop1/id + activation +property + name!hbase.profile/name +/property + /activation + properties +hbase.version0.98.4-hadoop1/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile + /profiles dependencies This caused the related compilation failed: [INFO] Spark Project Examples FAILURE [0.102s]
Re: spark.files.userClassPathFirst=true Not Working Correctly
Sorry for the extremely late reply. It turns out that the same error occurred when running on yarn. However, I recently updated my project to depend on cdh5 and the issue I was having disappeared and I am no longer setting the userClassPathFirst to true. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-files-userClassPathFirst-true-Not-Working-Correctly-tp11917p13045.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Low Level Kafka Consumer for Spark
@bharat- overall, i've noticed a lot of confusion about how Spark Streaming scales - as well as how it handles failover and checkpointing, but we can discuss that separately. there's actually 2 dimensions to scaling here: receiving and processing. *Receiving* receiving can be scaled out by submitting new DStreams/Receivers to the cluster as i've done in the Kinesis example. in fact, i purposely chose to submit multiple receivers in my Kinesis example because i feel it should be the norm and not the exception - particularly for partitioned and checkpoint-capable streaming systems like Kafka and Kinesis. it's the only way to scale. a side note here is that each receiver running in the cluster will immediately replicates to 1 other node for fault-tolerance of that specific receiver. this is where the confusion lies. this 2-node replication is mainly for failover in case the receiver dies while data is in flight. there's still chance for data loss as there's no write ahead log on the hot path, but this is being addressed. this in mentioned in the docs here: https://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving *Processing* once data is received, tasks are scheduled across the Spark cluster just like any other non-streaming task where you can specify the number of partitions for reduces, etc. this is the part of scaling that is sometimes overlooked - probably because it works just like regular Spark, but it is worth highlighting. Here's a blurb in the docs: https://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-processing the other thing that's confusing with Spark Streaming is that in Scala, you need to explicitly import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions in order to pick up the implicits that allow DStream.reduceByKey and such (versus DStream.transform(rddBatch = rddBatch.reduceByKey()) in other words, DStreams appear to be relatively featureless until you discover this implicit. otherwise, you need to operate on the underlying RDD's explicitly which is not ideal. the Kinesis example referenced earlier in the thread uses the DStream implicits. side note to all of this - i've recently convinced my publisher for my upcoming book, Spark In Action, to let me jump ahead and write the Spark Streaming chapter ahead of other more well-understood libraries. early release is in a month or so. sign up @ http://sparkinaction.com if you wanna get notified. shameless plug that i wouldn't otherwise do, but i really think it will help clear a lot of confusion in this area as i hear these questions asked a lot in my talks and such. and i think a clear, crisp story on scaling and fault-tolerance will help Spark Streaming's adoption. hope that helps! -chris On Wed, Aug 27, 2014 at 6:32 PM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: I agree. This issue should be fixed in Spark rather rely on replay of Kafka messages. Dib On Aug 28, 2014 6:45 AM, RodrigoB rodrigo.boav...@aspect.com wrote: Dibyendu, Tnks for getting back. I believe you are absolutely right. We were under the assumption that the raw data was being computed again and that's not happening after further tests. This applies to Kafka as well. The issue is of major priority fortunately. 
Regarding your suggestion, I would maybe prefer to have the problem resolved within Spark's internals since once the data is replicated we should be able to access it once more and not having to pool it back again from Kafka or any other stream that is being affected by this issue. If for example there is a big amount of batches to be recomputed I would rather have them done distributed than overloading the batch interval with huge amount of Kafka messages. I do not have yet enough know how on where is the issue and about the internal Spark code so I can't really how much difficult will be the implementation. tnks, Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p12966.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
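For reference, a minimal sketch of the implicit-import point Chris mentions above, assuming a Spark 1.x Scala application; the socket source and five-second batch are placeholders, not taken from the thread:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._                // pair-RDD implicits (pre-1.3 style)
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // toPairDStreamFunctions and friends

object ImplicitsSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("ImplicitsSketch"), Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999) // placeholder source

    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1L))

    // With the StreamingContext implicits in scope, DStream[(K, V)] gets reduceByKey directly.
    val counts = pairs.reduceByKey(_ + _)

    // The equivalent without relying on the DStream implicits: drop to the batch RDDs.
    val countsViaTransform = pairs.transform(rdd => rdd.reduceByKey(_ + _))

    counts.print()
    countsViaTransform.print()
    ssc.start()
    ssc.awaitTermination()
  }
}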
Re: SPARK-1297 patch error (spark-1297-v4.txt )
Hi Ted, I downloaded pom.xml to examples directory. It works, thanks!! Regards Arthur [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [2.119s] [INFO] Spark Project Core SUCCESS [1:27.100s] [INFO] Spark Project Bagel ... SUCCESS [10.261s] [INFO] Spark Project GraphX .. SUCCESS [31.332s] [INFO] Spark Project ML Library .. SUCCESS [35.226s] [INFO] Spark Project Streaming ... SUCCESS [39.135s] [INFO] Spark Project Tools ... SUCCESS [6.469s] [INFO] Spark Project Catalyst SUCCESS [36.521s] [INFO] Spark Project SQL . SUCCESS [35.488s] [INFO] Spark Project Hive SUCCESS [35.296s] [INFO] Spark Project REPL SUCCESS [18.668s] [INFO] Spark Project YARN Parent POM . SUCCESS [0.583s] [INFO] Spark Project YARN Stable API . SUCCESS [15.989s] [INFO] Spark Project Assembly SUCCESS [11.497s] [INFO] Spark Project External Twitter SUCCESS [8.777s] [INFO] Spark Project External Kafka .. SUCCESS [9.688s] [INFO] Spark Project External Flume .. SUCCESS [10.411s] [INFO] Spark Project External ZeroMQ . SUCCESS [9.511s] [INFO] Spark Project External MQTT ... SUCCESS [8.451s] [INFO] Spark Project Examples SUCCESS [1:40.240s] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 8:33.350s [INFO] Finished at: Fri Aug 29 01:58:00 HKT 2014 [INFO] Final Memory: 82M/1086M [INFO] On 29 Aug, 2014, at 1:36 am, Ted Yu yuzhih...@gmail.com wrote: bq. Spark 1.0.2 For the above release, you can download pom.xml attached to the JIRA and place it in examples directory I verified that the build against 0.98.4 worked using this command: mvn -Dhbase.profile=hadoop2 -Phadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests clean package Patch v5 is @ level 0 - you don't need to use -p1 in the patch command. Cheers On Thu, Aug 28, 2014 at 9:50 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I have just tried to apply the patch of SPARK-1297: https://issues.apache.org/jira/browse/SPARK-1297 There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt respectively. When applying the 2nd one, I got Hunk #1 FAILED at 45 Can you please advise how to fix it in order to make the compilation of Spark Project Examples success? (Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2) Regards Arthur patch -p1 -i spark-1297-v4.txt patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej below is the content of examples/pom.xml.rej: +++ examples/pom.xml @@ -45,6 +45,39 @@ /dependency /dependencies /profile +profile + idhbase-hadoop2/id + activation +property + namehbase.profile/name + valuehadoop2/value +/property + /activation + properties +protobuf.version2.5.0/protobuf.version +hbase.version0.98.4-hadoop2/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile +profile + idhbase-hadoop1/id + activation +property + name!hbase.profile/name +/property + /activation + properties +hbase.version0.98.4-hadoop1/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile + /profiles dependencies This caused the related compilation failed: [INFO] Spark Project Examples FAILURE [0.102s]
org.apache.hadoop.io.compress.SnappyCodec not found
Hi, I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and HBase. With default setting in Spark 1.0.2, when trying to load a file I got Class org.apache.hadoop.io.compress.SnappyCodec not found Can you please advise how to enable snappy in Spark? Regards Arthur scala inFILE.first() java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.RDD.take(RDD.scala:983) at org.apache.spark.rdd.RDD.first(RDD.scala:1015) at $iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24) at init(console:26) at .init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at 
java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106) ... 55 more Caused by: java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.SnappyCodec not found. at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135) at
Re: CUDA in spark, especially in MLlib?
Thank you Debasish. I am fine with either Scala or Java. I would like to get a quick evaluation on the performance gain, e.g., ALS on GPU. I would like to try whichever library does the business :) Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J. Watson Research Center http://researcher.ibm.com/person/us-wtan From: Debasish Das debasish.da...@gmail.com To: Frank van Lankvelt f.vanlankv...@onehippo.com, Cc: Matei Zaharia matei.zaha...@gmail.com, user user@spark.apache.org, Antonio Jesus Navarro ajnava...@stratio.com, Chen He airb...@gmail.com, Wei Tan/Watson/IBM@IBMUS Date: 08/28/2014 12:20 PM Subject:Re: CUDA in spark, especially in MLlib? Breeze author David also has a github project on cuda binding in scalado you prefer using java or scala ? On Aug 27, 2014 2:05 PM, Frank van Lankvelt f.vanlankv...@onehippo.com wrote: you could try looking at ScalaCL[1], it's targeting OpenCL rather than CUDA, but that might be close enough? cheers, Frank 1. https://github.com/ochafik/ScalaCL On Wed, Aug 27, 2014 at 7:33 PM, Wei Tan w...@us.ibm.com wrote: Thank you all. Actually I was looking at JCUDA. Function wise this may be a perfect solution to offload computation to GPU. Will see how performance it will be, especially with the Java binding. Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J. Watson Research Center http://researcher.ibm.com/person/us-wtan From:Chen He airb...@gmail.com To:Antonio Jesus Navarro ajnava...@stratio.com, Cc:Matei Zaharia matei.zaha...@gmail.com, user@spark.apache.org, Wei Tan/Watson/IBM@IBMUS Date:08/27/2014 11:03 AM Subject:Re: CUDA in spark, especially in MLlib? JCUDA can let you do that in Java http://www.jcuda.org On Wed, Aug 27, 2014 at 1:48 AM, Antonio Jesus Navarro ajnava...@stratio.com wrote: Maybe this would interest you: CPU and GPU-accelerated Machine Learning Library: https://github.com/BIDData/BIDMach 2014-08-27 4:08 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com: You should try to find a Java-based library, then you can call it from Scala. Matei On August 26, 2014 at 6:58:11 PM, Wei Tan (w...@us.ibm.com) wrote: Hi I am trying to find a CUDA library in Scala, to see if some matrix manipulation in MLlib can be sped up. I googled a few but found no active projects on Scala+CUDA. Python is supported by CUDA though. Any suggestion on whether this idea makes any sense? Best regards, Wei -- Amsterdam - Oosteinde 11, 1017 WT Amsterdam Boston - 1 Broadway, Cambridge, MA 02142 US +1 877 414 4776 (toll free) Europe +31(0)20 522 4466 www.onehippo.com
Re: org.apache.hadoop.io.compress.SnappyCodec not found
Hi, my check native result: hadoop checknative 14/08/29 02:54:51 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version 14/08/29 02:54:51 INFO zlib.ZlibFactory: Successfully loaded initialized native-zlib library Native library checking: hadoop: true /mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64/libhadoop.so zlib: true /lib64/libz.so.1 snappy: true /mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64/libsnappy.so.1 lz4:true revision:99 bzip2: false Any idea how to enable or disable snappy in Spark? Regards Arthur On 29 Aug, 2014, at 2:39 am, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and HBase. With default setting in Spark 1.0.2, when trying to load a file I got Class org.apache.hadoop.io.compress.SnappyCodec not found Can you please advise how to enable snappy in Spark? Regards Arthur scala inFILE.first() java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.RDD.take(RDD.scala:983) at org.apache.spark.rdd.RDD.first(RDD.scala:1015) at $iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24) at init(console:26) at .init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936) at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.reflect.InvocationTargetException at
org.apache.spark.examples.xxx
hey guys i still try to get used to compile and run the example code why does the run_example code submit the class with an org.apache.spark.examples in front of the class itself? probably a stupid question but i would be glad some one of you explains by the way.. how was the spark...example...jar file build? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-spark-examples-xxx-tp13052.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Print to spark log
I'm not sure if this is the case, but basic monitoring is described here: https://spark.apache.org/docs/latest/monitoring.html If it comes to something more sophisticated I was for example able to save some messages into local logs and view them in YARN UI via http by editing spark source code (use logInfo() method inside where needed) and build assembly again with maven. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Print-to-spark-log-tp13035p13053.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
What happens if I have a function like a PairFunction but which might return multiple values
In many cases when I work with MapReduce, my mapper might take a single value and map it to multiple keys, and the reducer might also take a single key and emit multiple values. I don't think functions like flatMap and reduceByKey will work here - or are there tricks I am not aware of?
Re: Spark webUI - application details page
I was able to recently solve this problem for standalone mode. For this mode, I did not use a history server. Instead, I set spark.eventLog.dir (in conf/spark-defaults.conf) to a directory in hdfs (basically this directory should be in a place that is writable by the master and accessible globally to all the nodes). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p13055.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
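For reference, a small sketch of setting the same event-log properties programmatically instead of in conf/spark-defaults.conf; the HDFS directory is a placeholder and, as noted above, needs to be writable by the application and readable by the master (or history server):

import org.apache.spark.{SparkConf, SparkContext}

object EventLogSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("EventLogSketch")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///shared/spark-events") // placeholder path
    val sc = new SparkContext(conf)
    // ... job ...
    sc.stop() // a clean stop is what writes the APPLICATION_COMPLETE marker
  }
}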
Re: OutofMemoryError when generating output
Hi, Thanks for the response. I tried to use countByKey. But I am not able to write the output to console or to a file. Neither collect() nor saveAsTextFile() work for the Map object that is generated after countByKey(). val x = sc.textFile(baseFile).map { line => val fields = line.split("\t") (fields(11), fields(6)) // extract (month, user_id) }.distinct().countByKey() x.saveAsTextFile(...) // does not work; generates an error that saveAsTextFile is not defined for a Map object Is there a way to convert the Map object to an object that I can output to console and to a file? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/OutofMemoryError-when-generating-output-tp12847p13056.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
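No reply appears in this thread, but for reference, one possible sketch: countByKey() returns a plain Scala Map on the driver rather than an RDD, which is why saveAsTextFile is not defined on it. This assumes the map fits comfortably in driver memory and reuses x and sc from the snippet above; file names and paths are placeholders:

import java.io.PrintWriter

// Print to the console from the driver.
x.foreach { case (month, count) => println(s"$month\t$count") }

// Write a local file from the driver.
val out = new PrintWriter("month_counts.tsv")
try x.foreach { case (month, count) => out.println(s"$month\t$count") }
finally out.close()

// Or turn the map back into an RDD if output on HDFS is needed.
sc.parallelize(x.toSeq).saveAsTextFile("month_counts_out")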
Re: Print to spark log
Thanks for the reply. I was looking for something for the case when the code runs outside of the Spark framework itself - if I declare a SparkContext or an RDD, can that print some messages in the log? The problem I have is that if I print something from the Scala object that runs the Spark app, it does not print to the console for some reason. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Print-to-spark-log-tp13035p13057.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
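For reference, a small sketch of one way to emit extra messages from application code using log4j directly (which Spark's own logging goes through), without modifying Spark's source; the logger name and messages are placeholders, and where the output lands depends on the log4j.properties in use:

import org.apache.log4j.Logger
import org.apache.spark.{SparkConf, SparkContext}

object LoggingSketch {
  val log = Logger.getLogger("myapp.LoggingSketch") // placeholder logger name

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LoggingSketch"))
    log.info("driver-side message: starting job") // lands in the driver's log output

    val total = sc.parallelize(1 to 100).map { x =>
      // Messages logged inside closures run on executors, so they show up in the
      // executor logs (web UI / YARN), not on the driver's console.
      if (x % 50 == 0) Logger.getLogger("myapp.LoggingSketch").info(s"processing $x")
      x
    }.reduce(_ + _)

    log.info(s"driver-side message: total = $total")
    sc.stop()
  }
}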
Re: Spark Streaming: DStream - zipWithIndex
Yes, that is an option. I started with a function of batch time, and index to generate id as long. This may be faster than generating UUID, with added benefit of sorting based on time. - Original Message - From: Tathagata Das tathagata.das1...@gmail.com To: Soumitra Kumar kumar.soumi...@gmail.com Cc: Xiangrui Meng men...@gmail.com, user@spark.apache.org Sent: Thursday, August 28, 2014 2:19:38 AM Subject: Re: Spark Streaming: DStream - zipWithIndex If just want arbitrary unique id attached to each record in a dstream (no ordering etc), then why not create generate and attach an UUID to each record? On Wed, Aug 27, 2014 at 4:18 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: I see a issue here. If rdd.id is 1000 then rdd.id * 1e9.toLong would be BIG. I wish there was DStream mapPartitionsWithIndex. On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote: You can use RDD id as the seed, which is unique in the same spark context. Suppose none of the RDDs would contain more than 1 billion records. Then you can use rdd.zipWithUniqueId().mapValues(uid = rdd.id * 1e9.toLong + uid) Just a hack .. On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: So, I guess zipWithUniqueId will be similar. Is there a way to get unique index? On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote: No. The indices start at 0 for every RDD. -Xiangrui On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Hello, If I do: DStream transform { rdd.zipWithIndex.map { Is the index guaranteed to be unique across all RDDs here? } } Thanks, -Soumitra. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark Streaming: DStream - zipWithIndex
But then if you want to generate ids that are unique across ALL the records that you are going to see in a stream (which can be potentially infinite), then you definitely need a number space larger than long :) TD On Thu, Aug 28, 2014 at 12:48 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Yes, that is an option. I started with a function of batch time, and index to generate id as long. This may be faster than generating UUID, with added benefit of sorting based on time. - Original Message - From: Tathagata Das tathagata.das1...@gmail.com To: Soumitra Kumar kumar.soumi...@gmail.com Cc: Xiangrui Meng men...@gmail.com, user@spark.apache.org Sent: Thursday, August 28, 2014 2:19:38 AM Subject: Re: Spark Streaming: DStream - zipWithIndex If just want arbitrary unique id attached to each record in a dstream (no ordering etc), then why not create generate and attach an UUID to each record? On Wed, Aug 27, 2014 at 4:18 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: I see a issue here. If rdd.id is 1000 then rdd.id * 1e9.toLong would be BIG. I wish there was DStream mapPartitionsWithIndex. On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote: You can use RDD id as the seed, which is unique in the same spark context. Suppose none of the RDDs would contain more than 1 billion records. Then you can use rdd.zipWithUniqueId().mapValues(uid = rdd.id * 1e9.toLong + uid) Just a hack .. On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: So, I guess zipWithUniqueId will be similar. Is there a way to get unique index? On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote: No. The indices start at 0 for every RDD. -Xiangrui On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Hello, If I do: DStream transform { rdd.zipWithIndex.map { Is the index guaranteed to be unique across all RDDs here? } } Thanks, -Soumitra.
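For reference, a rough sketch of the batch-time-plus-index idea discussed above. It packs the batch time into the high bits of a Long and a per-batch index into the low 20 bits, so ids sort by time; the 20-bit field is an assumption that the ids zipWithUniqueId assigns within a single batch stay below roughly one million, and as noted above a Long cannot cover a truly unbounded stream without some such bound:

import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Ids are unique as long as the per-batch ids stay below 2^20 (~1M).
def withTimeBasedIds[T: ClassTag](stream: DStream[T]): DStream[(Long, T)] =
  stream.transform { (rdd: RDD[T], time: Time) =>
    rdd.zipWithUniqueId().map { case (record, idx) =>
      ((time.milliseconds << 20) | (idx & 0xFFFFFL), record)
    }
  }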
Q on downloading spark for standalone cluster
Hello there, I've a basic question on the downloadthat which option I need to downloadfor standalone cluster. I've a private cluster of three machineson Centos. When I click on download it shows me following: Download Spark The latest release is Spark 1.0.2, released August 5, 2014 (release notes) http://spark.apache.org/releases/spark-release-1-0-2.html (git tag) https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f Pre-built packages: * For Hadoop 1 (HDP1, CDH3): find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/spark-1.0.2-bin-hadoop1.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop1.tgz * For CDH4: find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/spark-1.0.2-bin-cdh4.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-cdh4.tgz * For Hadoop 2 (HDP2, CDH5): find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz Pre-built packages, third-party (NOTE: may include non ASF-compatible licenses): * For MapRv3: direct file download (external) http://package.mapr.com/tools/apache-spark/1.0.2/spark-1.0.2-bin-mapr3.tgz * For MapRv4: direct file download (external) http://package.mapr.com/tools/apache-spark/1.0.2/spark-1.0.2-bin-mapr4.tgz From the above it looks like that I've to donwload Hadoop or CDH4 first in order to use Spark ? I've a standalone cluster and my data size is also like hundreds of Gig or close to Terabyte. I don't get it that which one I need to download from the above list. Could some one assist me that which one I need to download for standalone cluster and for big data foot print ? or Hadoop is needed or mandatory for using Spark? that's not the understanding I've. My understanding is that you can use spark with Hadoop if you like from yarn2 but you could use spark standalone also without hadoop. Please assist. I'm confused ! -Sanjeev - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Converting a DStream's RDDs to SchemaRDDs
Try using local[n] with n 1, instead of local. Since receivers take up 1 slot, and local is basically 1 slot, there is no slot left to process the data. That's why nothing gets printed. TD On Thu, Aug 28, 2014 at 10:28 AM, Verma, Rishi (398J) rishi.ve...@jpl.nasa.gov wrote: Hi Folks, I’d like to find out tips on how to convert the RDDs inside a Spark Streaming DStream to a set of SchemaRDDs. My DStream contains JSON data pushed over from Kafka, and I’d like to use SparkSQL’s JSON import function (i.e. jsonRDD) to register the JSON dataset as a table, and perform queries on it. Here’s a code snippet of my latest attempt (in Scala): … val sc = new SparkContext(conf) val ssc = new StreamingContext(local, this.getClass.getName, Seconds(1)) ssc.checkpoint(checkpoint) val stream = KafkaUtils.createStream(ssc, localhost:2181, “group, Map(“topic - 10)).map(_._2) val sql = new SQLContext(sc) stream.foreachRDD(rdd = { if (rdd.count 0) { // message received val sqlRDD = sql.jsonRDD(rdd) sqlRDD.printSchema() } else { println(“No message received) } }) … This compiles and runs when I submit it to Spark (local-mode); however, I never seem to be able to successfully see a schema printed on my console, via the “sqlRDD.printSchema()” method when Kafka is streaming my JSON messages to the “topic” topic name. I know my JSON is valid and my Kafka connection works fine, I’ve been able to print the stream messages in their raw format, just not as SchemaRDDs. Any tips? Suggestions? Thanks much, --- Rishi Verma NASA Jet Propulsion Laboratory California Institute of Technology - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
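For reference, the one-line change to the snippet above that this suggestion implies, reusing the surrounding code from Rishi's example; "local[2]" is just the minimum, one slot for the Kafka receiver and at least one for processing the batches:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// At least 2 local slots: the receiver occupies one, batch processing needs the rest.
val ssc = new StreamingContext("local[2]", this.getClass.getName, Seconds(1))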
Re: repartitioning an RDD yielding imbalance
On Thu, Aug 28, 2014 at 7:00 AM, Rok Roskar rokros...@gmail.com wrote: I've got an RDD where each element is a long string (a whole document). I'm using pyspark so some of the handy partition-handling functions aren't available, and I count the number of elements in each partition with: def count_partitions(id, iterator): c = sum(1 for _ in iterator) yield (id,c) rdd.mapPartitionsWithSplit(count_partitions).collectAsMap() This returns the following: {0: 866, 1: 1158, 2: 828, 3: 876} But if I do: rdd.repartition(8).mapPartitionsWithSplit(count_partitions).collectAsMap() I get {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 3594, 7: 134} Why this strange redistribution of elements? I'm obviously misunderstanding how spark does the partitioning -- is it a problem with having a list of strings as an RDD? This imbalance was introduce by BatchedDeserializer. By default, Python elements in RDD are serialized by pickle in batch (1024 elements in one batch), so in the view of Scala, it only see one or two element of Array[Byte] in the RDD, then imbalance happened. To fix this, you could change the default batchSize to 10 (or less) or reserialize your RDD as in unbatched mode, for example: sc = SparkContext(batchSize=10) rdd = sc.textFile().repartition(8) OR rdd._reserialize(PickleSerializer()).repartition(8) PS: _reserialize() is not an public API, so it may be changed in the future. Davies Help vey much appreciated! Thanks, Rok - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Q on downloading spark for standalone cluster
If you aren't using Hadoop, I don't think it matters which you download. I'd probably just grab the Hadoop 2 package. Out of curiosity, what are you using as your data store? I get the impression most Spark users are using HDFS or something built on top. On Thu, Aug 28, 2014 at 4:07 PM, Sanjeev Sagar sanjeev.sa...@mypointscorp.com wrote: Hello there, I've a basic question on the downloadthat which option I need to downloadfor standalone cluster. I've a private cluster of three machineson Centos. When I click on download it shows me following: Download Spark The latest release is Spark 1.0.2, released August 5, 2014 (release notes) http://spark.apache.org/releases/spark-release-1-0-2.html (git tag) https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h= 8fb6f00e195fb258f3f70f04756e07c259a2351f Pre-built packages: * For Hadoop 1 (HDP1, CDH3): find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/ spark-1.0.2-bin-hadoop1.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop1.tgz * For CDH4: find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/ spark-1.0.2-bin-cdh4.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-cdh4.tgz * For Hadoop 2 (HDP2, CDH5): find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/ spark-1.0.2-bin-hadoop2.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz Pre-built packages, third-party (NOTE: may include non ASF-compatible licenses): * For MapRv3: direct file download (external) http://package.mapr.com/tools/apache-spark/1.0.2/ spark-1.0.2-bin-mapr3.tgz * For MapRv4: direct file download (external) http://package.mapr.com/tools/apache-spark/1.0.2/ spark-1.0.2-bin-mapr4.tgz From the above it looks like that I've to donwload Hadoop or CDH4 first in order to use Spark ? I've a standalone cluster and my data size is also like hundreds of Gig or close to Terabyte. I don't get it that which one I need to download from the above list. Could some one assist me that which one I need to download for standalone cluster and for big data foot print ? or Hadoop is needed or mandatory for using Spark? that's not the understanding I've. My understanding is that you can use spark with Hadoop if you like from yarn2 but you could use spark standalone also without hadoop. Please assist. I'm confused ! -Sanjeev - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io
Re: Where to save intermediate results?
I assume your on-demand calculations are a streaming flow? If your data aggregated from batch isn't too large, maybe you should just save it to disk; when your streaming flow starts you can read the aggregations back from disk and perhaps just broadcast them. Though I guess you'd have to restart your streaming flow when these aggregations are updated. For something more sophisticated, maybe look at Redis http://redis.io/ or some distributed database? Your ETL can update that store, and your on-demand job can query it. On Thu, Aug 28, 2014 at 4:30 PM, huylv huy.le...@insight-centre.org wrote: Hi, I'm building a system for near real-time data analytics. My plan is to have an ETL batch job which calculates aggregations running periodically. User queries are then parsed for on-demand calculations, also in Spark. Where are the pre-calculated results supposed to be saved? I mean, after finishing aggregations, the ETL job will terminate, so caches are wiped out of memory. How can I use these results to calculate on-demand queries? Or more generally, could you please give me a good way to organize the data flow and jobs in order to achieve this? I'm new to Spark so sorry if this might sound like a dumb question. Thank you. Huy -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Where-to-save-intermediate-results-tp13062.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io
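For reference, a rough sketch of the "save to disk, read back and broadcast" option, assuming the aggregations fit comfortably in memory on the driver; the paths and the tab-separated key/value layout are placeholders, not something from the thread:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD implicits (pre-1.3 style)
import org.apache.spark.rdd.RDD

// Batch ETL side: persist aggregations as simple tab-separated key/value lines.
def saveAggregates(aggregates: RDD[(String, Double)]): Unit =
  aggregates.map { case (k, v) => s"$k\t$v" }.saveAsTextFile("hdfs:///aggregates/latest")

// Query/streaming side: load the latest aggregates at startup and broadcast them.
def loadAndBroadcast(sc: SparkContext) = {
  val aggMap = sc.textFile("hdfs:///aggregates/latest")
    .map { line => val Array(k, v) = line.split("\t"); (k, v.toDouble) }
    .collectAsMap()
  sc.broadcast(aggMap) // tasks read the broadcast's .value instead of hitting storage
}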
Re: What happens if I have a function like a PairFunction but which might return multiple values
To emulate a Mapper, flatMap() is exactly what you want. Since it flattens, it means you return an Iterable of values instead of 1 value. That can be a Collection containing many values, or 1, or 0. For a reducer, to really reproduce what a Reducer does in Java, I think you will need groupByKey() followed by flatMapValues(). In many cases when I work with Map Reduce my mapper or my reducer might take a single value and map it to multiple keys - The reducer might also take a single key and emit multiple values I don't think that functions like flatMap and reduceByKey will work or are there tricks I am not aware of - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
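For reference, a small sketch of these two points in Scala (the question is about the Java API, but the shape is the same); the word-count-style data is just an illustration:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD implicits (pre-1.3 style)

def mapperReducerStyle(sc: SparkContext): Unit = {
  val lines = sc.parallelize(Seq("a b a", "c", ""))

  // "Mapper": each input record may produce zero, one, or many key/value pairs.
  val pairs = lines.flatMap { line =>
    if (line.isEmpty) Nil
    else line.split(" ").map(word => (word, 1)).toSeq
  }

  // "Reducer": all values for a key at once, emitting zero or more outputs per key.
  val perKey = pairs.groupByKey().flatMapValues { values =>
    val total = values.sum
    Seq(total, values.size) // multiple output values for a single key
  }

  perKey.collect().foreach(println)
}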
RE: Q on downloading spark for standalone cluster
Hello Daniel, If you’re not using Hadoop then why you want to grab the Hadoop package? CDH5 will download all the Hadoop packages and cloudera manager too. Just curious what happen if you start spark on EC2 cluster, what it choose for the data store as default? -Sanjeev From: Daniel Siegmann [mailto:daniel.siegm...@velos.io] Sent: Thursday, August 28, 2014 2:04 PM To: Sagar, Sanjeev Cc: user@spark.apache.org Subject: Re: Q on downloading spark for standalone cluster If you aren't using Hadoop, I don't think it matters which you download. I'd probably just grab the Hadoop 2 package. Out of curiosity, what are you using as your data store? I get the impression most Spark users are using HDFS or something built on top. On Thu, Aug 28, 2014 at 4:07 PM, Sanjeev Sagar sanjeev.sa...@mypointscorp.commailto:sanjeev.sa...@mypointscorp.com wrote: Hello there, I've a basic question on the downloadthat which option I need to downloadfor standalone cluster. I've a private cluster of three machineson Centos. When I click on download it shows me following: Download Spark The latest release is Spark 1.0.2, released August 5, 2014 (release notes) http://spark.apache.org/releases/spark-release-1-0-2.html (git tag) https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f Pre-built packages: * For Hadoop 1 (HDP1, CDH3): find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/spark-1.0.2-bin-hadoop1.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop1.tgz * For CDH4: find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/spark-1.0.2-bin-cdh4.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-cdh4.tgz * For Hadoop 2 (HDP2, CDH5): find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz Pre-built packages, third-party (NOTE: may include non ASF-compatible licenses): * For MapRv3: direct file download (external) http://package.mapr.com/tools/apache-spark/1.0.2/spark-1.0.2-bin-mapr3.tgz * For MapRv4: direct file download (external) http://package.mapr.com/tools/apache-spark/1.0.2/spark-1.0.2-bin-mapr4.tgz From the above it looks like that I've to donwload Hadoop or CDH4 first in order to use Spark ? I've a standalone cluster and my data size is also like hundreds of Gig or close to Terabyte. I don't get it that which one I need to download from the above list. Could some one assist me that which one I need to download for standalone cluster and for big data foot print ? or Hadoop is needed or mandatory for using Spark? that's not the understanding I've. My understanding is that you can use spark with Hadoop if you like from yarn2 but you could use spark standalone also without hadoop. Please assist. I'm confused ! -Sanjeev - To unsubscribe, e-mail: user-unsubscr...@spark.apache.orgmailto:user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.orgmailto:user-h...@spark.apache.org -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.iomailto:daniel.siegm...@velos.io W: www.velos.iohttp://www.velos.io
Re: Failed to run runJob at ReceiverTracker.scala
Do you see this error right in the beginning or after running for sometime? The root cause seems to be that somehow your Spark executors got killed, which killed receivers and caused further errors. Please try to take a look at the executor logs of the lost executor to find what is the root cause that caused the executor to fail. TD On Thu, Aug 28, 2014 at 3:54 PM, Tim Smith secs...@gmail.com wrote: Hi, Have a Spark-1.0.0 (CDH5) streaming job reading from kafka that died with: 14/08/28 22:28:15 INFO DAGScheduler: Failed to run runJob at ReceiverTracker.scala:275 Exception in thread Thread-59 14/08/28 22:28:15 INFO YarnClientClusterScheduler: Cancelling stage 2 14/08/28 22:28:15 INFO DAGScheduler: Executor lost: 5 (epoch 4) 14/08/28 22:28:15 INFO BlockManagerMasterActor: Trying to remove executor 5 from BlockManagerMaster. 14/08/28 22:28:15 INFO BlockManagerMaster: Removed 5 successfully in removeExecutor org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0:0 failed 4 times, most recent failure: TID 6481 on host node-dn1-1.ops.sfdc.net failed for unknown reason Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Any insights into this error? Thanks, Tim
Re: DStream repartitioning, performance tuning processing
If you are repartitioning to 8 partitions, and your node happen to have at least 4 cores each, its possible that all 8 partitions are assigned to only 2 nodes. Try increasing the number of partitions. Also make sure you have executors (allocated by YARN) running on more than two nodes if you want to use all 11 nodes in your yarn cluster. If you are using Spark 1.x, then you dont need to set the ttl for running Spark Streaming. In case you are using older version, why do you want to reduce it? You could reduce it, but it does increase the risk of the premature cleaning, if once in a while things get delayed by 20 seconds. I dont see much harm in keeping the ttl at 60 seconds (a bit of extra garbage shouldnt hurt performance). TD On Thu, Aug 28, 2014 at 3:16 PM, Tim Smith secs...@gmail.com wrote: Hi, In my streaming app, I receive from kafka where I have tried setting the partitions when calling createStream or later, by calling repartition - in both cases, the number of nodes running the tasks seems to be stubbornly stuck at 2. Since I have 11 nodes in my cluster, I was hoping to use more nodes. I am starting the job as: nohup spark-submit --class logStreamNormalizer --master yarn log-stream-normalizer_2.10-1.0.jar --jars spark-streaming-kafka_2.10-1.0.0.jar,kafka_2.10-0.8.1.1.jar,zkclient-0.3.jar,metrics-core-2.2.0.jar,json4s-jackson_2.10-3.2.10.jar --executor-memory 30G --spark.cleaner.ttl 60 --executor-cores 8 --num-executors 8 normRunLog-6.log 2normRunLogError-6.log echo $! run-6.pid My main code is: val sparkConf = new SparkConf().setAppName(SparkKafkaTest) val ssc = new StreamingContext(sparkConf,Seconds(5)) val kInMsg = KafkaUtils.createStream(ssc,node-nn1-1:2181/zk_kafka,normApp,Map(rawunstruct - 16)) val propsMap = Map(metadata.broker.list - node-dn1-6:9092,node-dn1-7:9092,node-dn1-8:9092, serializer.class - kafka.serializer.StringEncoder, producer.type - async, request.required.acks - 1) val to_topic = normStruct val writer = new KafkaOutputService(to_topic, propsMap) if (!configMap.keySet.isEmpty) { //kInMsg.repartition(8) val outdata = kInMsg.map(x=normalizeLog(x._2,configMap)) outdata.foreachRDD((rdd,time) = { rdd.foreach(rec = { writer.output(rec) }) } ) } ssc.start() ssc.awaitTermination() In terms of total delay, with a 5 second batch, the delays usually stay under 5 seconds, but sometimes jump to ~10 seconds. As a performance tuning question, does this mean, I can reduce my cleaner ttl from 60 to say 25 (still more than double of the peak delay)? Thanks Tim
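To make the repartitioning suggestion concrete, a minimal sketch against the code above (128 is an assumption, roughly 2x of 8 executors x 8 cores; note that DStream.repartition returns a new DStream, so the commented-out kInMsg.repartition(8) by itself has no effect):

// repartition the received Kafka stream before the map, and use the result
val repartitioned = kInMsg.repartition(128)
val outdata = repartitioned.map(x => normalizeLog(x._2, configMap))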
Re: Kinesis receiver spark streaming partition
great question, wei. this is very important to understand from a performance perspective. and this extends is beyond kinesis - it's for any streaming source that supports shards/partitions. i need to do a little research into the internals to confirm my theory. lemme get back to you! -chris On Tue, Aug 26, 2014 at 11:37 AM, Wei Liu wei@stellarloyalty.com wrote: We are exploring using Kinesis and spark streaming together. I took at a look at the kinesis receiver code in 1.1.0. I have a question regarding kinesis partition spark streaming partition. It seems to be pretty difficult to align these partitions. Kinesis partitions a stream of data into shards, if we follow the example, we will have multiple kinesis receivers reading from the same stream in spark streaming. It seems like kinesis workers will coordinate among themselves and assign shards to themselves dynamically. For a particular shard, it can be consumed by different kinesis workers (thus different spark workers) dynamically (not at the same time). Blocks are generated based on time intervals, RDD are created based on blocks. RDDs are partitioned based on blocks. At the end, the data for a given shard will be spread into multiple blocks (possible located on different spark worker nodes). We will probably need to group these data again for a given shard and shuffle data around to achieve the same partition we had in Kinesis. Is there a better way to achieve this to avoid data reshuffling? Thanks, Wei
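As a rough illustration of the regrouping the post describes (this is not what the stock Kinesis receiver does; it assumes records have already been tagged with their shard id, e.g. by a custom receiver or by embedding the id in the payload):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._            // pair-RDD functions in Spark 1.x
import org.apache.spark.streaming.dstream.DStream

// One shuffle per batch to co-locate all records of a shard in one partition.
def regroupByShard(tagged: DStream[(String, Array[Byte])],
                   numPartitions: Int): DStream[(String, Array[Byte])] =
  tagged.transform(rdd => rdd.partitionBy(new HashPartitioner(numPartitions)))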
Re: org.apache.hadoop.io.compress.SnappyCodec not found
Hi, If change my etc/hadoop/core-site.xml from property nameio.compression.codecs/name value org.apache.hadoop.io.compress.SnappyCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec /value /property to property nameio.compression.codecs/name value org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.SnappyCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec /value /property and run the test again, I found this time it cannot find org.apache.hadoop.io.compress.GzipCodec scala inFILE.first() java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.RDD.take(RDD.scala:983) at org.apache.spark.rdd.RDD.first(RDD.scala:1015) at $iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24) at init(console:26) at .init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at 
org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at
Re: Failed to run runJob at ReceiverTracker.scala
Appeared after running for a while. I re-ran the job and this time, it crashed with: 14/08/29 00:18:50 WARN ReceiverTracker: Error reported by receiver for stream 0: Error in block pushing thread - java.net.SocketException: Too many open files Shouldn't the failed receiver get re-spawned on a different worker? On Thu, Aug 28, 2014 at 4:12 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Do you see this error right in the beginning or after running for sometime? The root cause seems to be that somehow your Spark executors got killed, which killed receivers and caused further errors. Please try to take a look at the executor logs of the lost executor to find what is the root cause that caused the executor to fail. TD On Thu, Aug 28, 2014 at 3:54 PM, Tim Smith secs...@gmail.com wrote: Hi, Have a Spark-1.0.0 (CDH5) streaming job reading from kafka that died with: 14/08/28 22:28:15 INFO DAGScheduler: Failed to run runJob at ReceiverTracker.scala:275 Exception in thread Thread-59 14/08/28 22:28:15 INFO YarnClientClusterScheduler: Cancelling stage 2 14/08/28 22:28:15 INFO DAGScheduler: Executor lost: 5 (epoch 4) 14/08/28 22:28:15 INFO BlockManagerMasterActor: Trying to remove executor 5 from BlockManagerMaster. 14/08/28 22:28:15 INFO BlockManagerMaster: Removed 5 successfully in removeExecutor org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0:0 failed 4 times, most recent failure: TID 6481 on host node-dn1-1.ops.sfdc.net failed for unknown reason Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Any insights into this error? Thanks, Tim
Re: DStream repartitioning, performance tuning processing
TD - Apologies, didn't realize I was replying to you instead of the list. What does numPartitions refer to when calling createStream? I read an earlier thread that seemed to suggest that numPartitions translates to partitions created on the Spark side? http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201407.mbox/%3ccaph-c_o04j3njqjhng5ho281mqifnf3k_r6coqxpqh5bh6a...@mail.gmail.com%3E Actually, I re-tried with 64 numPartitions in createStream and that didn't work. I will manually set repartition to 64/128 and see how that goes. Thanks. On Thu, Aug 28, 2014 at 5:42 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Having 16 partitions in KafkaUtils.createStream does not translate to the RDDs in Spark / Spark Streaming having 16 partitions. Repartition is the best way to distribute the received data between all the nodes, as long as there are sufficient number of partitions (try setting it to 2x the number cores given to the application). Yeah, in 1.0.0, ttl should be unnecessary. On Thu, Aug 28, 2014 at 5:17 PM, Tim Smith secs...@gmail.com wrote: On Thu, Aug 28, 2014 at 4:19 PM, Tathagata Das tathagata.das1...@gmail.com wrote: If you are repartitioning to 8 partitions, and your node happen to have at least 4 cores each, its possible that all 8 partitions are assigned to only 2 nodes. Try increasing the number of partitions. Also make sure you have executors (allocated by YARN) running on more than two nodes if you want to use all 11 nodes in your yarn cluster. If you look at the code, I commented out the manual re-partitioning to 8. Instead, I am created 16 partitions when I call createStream. But I will increase the partitions to, say, 64 and see if I get better parallelism. If you are using Spark 1.x, then you dont need to set the ttl for running Spark Streaming. In case you are using older version, why do you want to reduce it? You could reduce it, but it does increase the risk of the premature cleaning, if once in a while things get delayed by 20 seconds. I dont see much harm in keeping the ttl at 60 seconds (a bit of extra garbage shouldnt hurt performance). I am running 1.0.0 (CDH5) so ttl setting is redundant? But you are right, unless I have memory issues, more aggressive pruning won't help. Thanks, Tim TD On Thu, Aug 28, 2014 at 3:16 PM, Tim Smith secs...@gmail.com wrote: Hi, In my streaming app, I receive from kafka where I have tried setting the partitions when calling createStream or later, by calling repartition - in both cases, the number of nodes running the tasks seems to be stubbornly stuck at 2. Since I have 11 nodes in my cluster, I was hoping to use more nodes. I am starting the job as: nohup spark-submit --class logStreamNormalizer --master yarn log-stream-normalizer_2.10-1.0.jar --jars spark-streaming-kafka_2.10-1.0.0.jar,kafka_2.10-0.8.1.1.jar,zkclient-0.3.jar,metrics-core-2.2.0.jar,json4s-jackson_2.10-3.2.10.jar --executor-memory 30G --spark.cleaner.ttl 60 --executor-cores 8 --num-executors 8 normRunLog-6.log 2normRunLogError-6.log echo $! 
run-6.pid My main code is: val sparkConf = new SparkConf().setAppName(SparkKafkaTest) val ssc = new StreamingContext(sparkConf,Seconds(5)) val kInMsg = KafkaUtils.createStream(ssc,node-nn1-1:2181/zk_kafka,normApp,Map(rawunstruct - 16)) val propsMap = Map(metadata.broker.list - node-dn1-6:9092,node-dn1-7:9092,node-dn1-8:9092, serializer.class - kafka.serializer.StringEncoder, producer.type - async, request.required.acks - 1) val to_topic = normStruct val writer = new KafkaOutputService(to_topic, propsMap) if (!configMap.keySet.isEmpty) { //kInMsg.repartition(8) val outdata = kInMsg.map(x=normalizeLog(x._2,configMap)) outdata.foreachRDD((rdd,time) = { rdd.foreach(rec = { writer.output(rec) }) } ) } ssc.start() ssc.awaitTermination() In terms of total delay, with a 5 second batch, the delays usually stay under 5 seconds, but sometimes jump to ~10 seconds. As a performance tuning question, does this mean, I can reduce my cleaner ttl from 60 to say 25 (still more than double of the peak delay)? Thanks Tim
Re: Change delimiter when collecting SchemaRDD
The comma is just the way the default toString works for Row objects. Since SchemaRDDs are also RDDs, you can do arbitrary transformations on the Row objects that are returned. For example, if you'd rather the delimiter was '|': sql(SELECT * FROM src).map(_.mkString(|)).collect() On Thu, Aug 28, 2014 at 7:58 AM, yadid ayzenberg ya...@media.mit.edu wrote: Hi All, Is there any way to change the delimiter from being a comma ? Some of the strings in my data contain commas as well, making it very difficult to parse the results. Yadid
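A slightly fuller sketch of the same idea (the table name src and the output path are hypothetical; Row behaves like a Seq here, so any delimiter works):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val rows = hiveContext.sql("SELECT * FROM src")

// pick whatever delimiter the downstream parser expects, e.g. a tab
val tabSeparated = rows.map(_.mkString("\t"))

tabSeparated.collect().foreach(println)     // inspect locally
tabSeparated.saveAsTextFile("out_tsv")      // or write out to a (hypothetical) path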
Memory statistics in the Application detail UI
Hi, I am using a cluster where each node has 16GB (this is the executor memory). After I complete an MLlib job, the executor tab shows the following: Memory: 142.6 KB Used (95.5 GB Total) and individual worker nodes have the Memory Used values as 17.3 KB / 8.6 GB (this is different for different nodes). What does the second number signify (i.e. 8.6 GB and 95.5 GB)? If 17.3 KB was used out of the total memory of the node, should it not be 17.3 KB/16 GB? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Memory-statistics-in-the-Application-detail-UI-tp13082.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: org.apache.hadoop.io.compress.SnappyCodec not found
Hi, I fixed the issue by copying libsnappy.so to Java ire. Regards Arthur On 29 Aug, 2014, at 8:12 am, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, If change my etc/hadoop/core-site.xml from property nameio.compression.codecs/name value org.apache.hadoop.io.compress.SnappyCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec /value /property to property nameio.compression.codecs/name value org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.SnappyCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec /value /property and run the test again, I found this time it cannot find org.apache.hadoop.io.compress.GzipCodec scala inFILE.first() java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.RDD.take(RDD.scala:983) at org.apache.spark.rdd.RDD.first(RDD.scala:1015) at $iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24) at init(console:26) at .init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at 
org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
The concurrent model of spark job/stage/task
hi, guys I am trying to understand how Spark works in the concurrent model. I read the following from https://spark.apache.org/docs/1.0.2/job-scheduling.html quote Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users). I searched everywhere but could not find: 1. How do I start 2 or more jobs in one Spark driver, in Java code? I wrote 2 actions in the code, but the jobs are still staged at index 0, 1, 2, 3... It looks like they run sequentially. 2. Do the stages run concurrently? They are always numbered in order 0, 1, 2, 3... as I observed on the Spark stage UI. 3. Can I retrieve the data out of an RDD, e.g. populate a POJO myself and compute on it? Thanks in advance, guys. 35597...@qq.com
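To answer question 1 concretely, here is a minimal Scala sketch (the same pattern applies from Java threads); the FAIR scheduler setting is optional and only affects how concurrent jobs share resources:

import org.apache.spark.{SparkConf, SparkContext}

object ConcurrentJobs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("concurrent-jobs")
      .set("spark.scheduler.mode", "FAIR")   // optional; the default is FIFO
    val sc = new SparkContext(conf)

    val a = sc.parallelize(1 to 1000000).map(_ * 2)
    val b = sc.parallelize(1 to 1000000).map(_ + 1)

    // Each action (count) is one "job". Submitting the actions from separate
    // threads lets the scheduler run them concurrently; from a single thread
    // they would be submitted one after the other, which is why the stages
    // appear to run sequentially.
    val t1 = new Thread(new Runnable { def run(): Unit = println("count A: " + a.count()) })
    val t2 = new Thread(new Runnable { def run(): Unit = println("count B: " + b.count()) })
    t1.start(); t2.start()
    t1.join(); t2.join()

    // Question 3: yes - actions like collect() or take(n) return plain local
    // objects on the driver, which you can copy into your own POJOs.
    val firstTen = a.take(10)
    sc.stop()
  }
}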
Problem using accessing HiveContext
Hi, While using HiveContext. If hive table created as test_datatypes(testbigint bigint, ss bigint ) select below works fine. For create table test_datatypes(testbigint bigint, testdec decimal(5,2) ) scala val dataTypes=hiveContext.hql(select * from test_datatypes) 14/08/28 21:18:44 INFO parse.ParseDriver: Parsing command: select * from test_datatypes 14/08/28 21:18:44 INFO parse.ParseDriver: Parse Completed 14/08/28 21:18:44 INFO analysis.Analyzer: Max iterations (2) reached for batch MultiInstanceRelations 14/08/28 21:18:44 INFO analysis.Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences java.lang.IllegalArgumentException: Error: ',', ':', or ';' expected at position 14 from 'bigint:decimal(5,2)' [0:bigint, 6::, 7:decimal, 14:(, 15:5, 16:,, 17:2, 18:)] at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:312) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:716) at org.apache.hadoop.hive.serde2.lazy.LazyUtils.extractColumnInfo(LazyUtils.java:364) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initSerdeParams(LazySimpleSerDe.java:288) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:187) at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:218) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:272) at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:175) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:991) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:924) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:58) at org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:143) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:122) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:122) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:122) at org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:149) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$2.applyOrElse(Analyzer.scala:83) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$2.applyOrElse(Analyzer.scala:81) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at 
scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156) Same exception happens with create table test_datatypes(testbigint bigint, testdate date ) ... Thanks, Igor.
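A possible workaround sketch, not a confirmed fix: the exception comes from the Hive serde bundled with this Spark build failing to parse the parameterized decimal(5,2) type, so declaring the column without precision/scale (or as double) may avoid the parse failure:

hiveContext.hql("create table test_datatypes2 (testbigint bigint, testdec decimal)")
// or store the value as double and handle rounding in the query/application:
hiveContext.hql("create table test_datatypes3 (testbigint bigint, testdec double)")
val rows = hiveContext.hql("select testbigint, testdec from test_datatypes3")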
FW: Reference Accounts Large Node Deployments
Anyone? No customers using streaming at scale? From: Steve Nunez snu...@hortonworks.com Date: Wednesday, August 27, 2014 at 9:08 To: user@spark.apache.org user@spark.apache.org Subject: Reference Accounts Large Node Deployments All, Does anyone have specific references to customers, use cases and large-scale deployments of Spark Streaming? By 'large scale' I mean both throughput and number of nodes. I'm attempting an objective comparison of Streaming and Storm and while this data is known for Storm, there appears to be little for Spark Streaming. If you know of any such deployments, please post them here because I am sure I'm not the only one wondering about this. If customer confidentiality prevents mentioning them by name, consider identifying them by industry, e.g. 'telco doing X with streaming using Y nodes'. Any information at all will be welcome. I'll feed back a summary and/or update a wiki page once I collate the information. Cheers, - Steve -- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
Odd saveAsSequenceFile bug
Hey Sparkies... I have an odd bug. I am running Spark 0.9.2 on Amazon EC2 machines as a job (i.e. not in REPL) After a bunch of processing, I tell spark to save my rdd to S3 using: rdd.saveAsSequenceFile(uri,codec) That line of code hangs. By hang I mean (a) Spark stages UI shows no update on that task succeeded (b) Pushing into that stage shows No task have reported metrics yet (c) Ganglia shows the cpu, network access etc at nil (d) No error logs on master or slave. The system just does nothing. What makes it weirder is if I modify the code to either.. (1) rdd.coalesce(41, true) // i.e. increase the num partitions by 1 OR (2) rdd.coalesce(39, false) // decrease by 1 with no shuffle It runs through like a charm... ?? Any ideas for debugging? shay
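For reference, a sketch of the workaround described above (same rdd, uri and codec as in the post; why the unmodified 40-partition RDD hangs is still the open question):

// forcing a coalesce/repartition immediately before the save unblocked the job
val repartitioned = rdd.coalesce(41, shuffle = true)   // or rdd.coalesce(39, shuffle = false)
repartitioned.saveAsSequenceFile(uri, codec)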
RE: Sorting Reduced/Groupd Values without Explicit Sorting
Hi list, Any change on this one? I think I have seen a lot of work being done on this lately but I am unable to forge a working solution from jira tickets. Any example would be highly appreciated. Tomas -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sorting-Reduced-Groupd-Values-without-Explicit-Sorting-tp8508p13088.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
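In the meantime, a simple (if not shuffle-efficient) way to get sorted values per key, assuming the values for any single key fit in one executor's memory - this is an explicit in-memory sort, not the shuffle-based secondary sort the thread is asking about:

import org.apache.spark.SparkContext._

val pairs = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2)))
val sortedGroups = pairs.groupByKey().mapValues(_.toSeq.sorted)
sortedGroups.collect().foreach(println)   // per key the values come back sorted: a -> 1,2,3 and b -> 2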
Re: OutofMemoryError when generating output
Yeah, saveAsTextFile is an RDD specific method. If you really want to use that method, just turn the map into an RDD: `sc.parallelize(x.toSeq).saveAsTextFile(...)` Reading through the api-docs will present you many more alternate solutions! Best, Burak - Original Message - From: SK skrishna...@gmail.com To: u...@spark.incubator.apache.org Sent: Thursday, August 28, 2014 12:45:22 PM Subject: Re: OutofMemoryError when generating output Hi, Thanks for the response. I tried to use countByKey. But I am not able to write the output to console or to a file. Neither collect() nor saveAsTextFile() work for the Map object that is generated after countByKey(). valx = sc.textFile(baseFile)).map { line = val fields = line.split(\t) (fields(11), fields(6)) // extract (month, user_id) }.distinct().countByKey() x.saveAsTextFile(...) // does not work. generates an error that saveAstextFile is not defined for Map object Is there a way to convert the Map object to an object that I can output to console and to a file? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/OutofMemoryError-when-generating-output-tp12847p13056.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
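Putting the suggestion together (field indexes and baseFile as in the original post; the output path is hypothetical): countByKey returns a local Map on the driver, so it can be printed directly or turned back into an RDD for saveAsTextFile:

val x = sc.textFile(baseFile).map { line =>
  val fields = line.split("\t")
  (fields(11), fields(6))            // (month, user_id)
}.distinct().countByKey()            // a plain scala.collection.Map on the driver

x.foreach { case (month, count) => println(month + "\t" + count) }   // print it
sc.parallelize(x.toSeq).saveAsTextFile("month_counts")               // or write it out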
Re: Memory statistics in the Application detail UI
Each executor reserves some memory for storing RDDs in memory, and some for executor operations like shuffling. The number you see is memory reserved for storing RDDs, and defaults to about 0.6 of the total (spark.storage.memoryFraction). On Fri, Aug 29, 2014 at 2:32 AM, SK skrishna...@gmail.com wrote: Hi, I am using a cluster where each node has 16GB (this is the executor memory). After I complete an MLlib job, the executor tab shows the following: Memory: 142.6 KB Used (95.5 GB Total) and individual worker nodes have the Memory Used values as 17.3 KB / 8.6 GB (this is different for different nodes). What does the second number signify (i.e. 8.6 GB and 95.5 GB)? If 17.3 KB was used out of the total memory of the node, should it not be 17.3 KB/16 GB? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Memory-statistics-in-the-Application-detail-UI-tp13082.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Memory statistics in the Application detail UI
Hi, Spark uses by default approximately 60% of the executor heap memory to store RDDs. That's why you have 8.6GB instead of 16GB. 95.5 is therefore the sum of all the 8.6 GB of executor memory + the driver memory. Best, Burak - Original Message - From: SK skrishna...@gmail.com To: u...@spark.incubator.apache.org Sent: Thursday, August 28, 2014 6:32:32 PM Subject: Memory statistics in the Application detail UI Hi, I am using a cluster where each node has 16GB (this is the executor memory). After I complete an MLlib job, the executor tab shows the following: Memory: 142.6 KB Used (95.5 GB Total) and individual worker nodes have the Memory Used values as 17.3 KB / 8.6 GB (this is different for different nodes). What does the second number signify (i.e. 8.6 GB and 95.5 GB)? If 17.3 KB was used out of the total memory of the node, should it not be 17.3 KB/16 GB? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Memory-statistics-in-the-Application-detail-UI-tp13082.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
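If a different split between cache space and working memory is needed, the fraction is configurable; a minimal sketch (0.6 is the approximate default mentioned above, and the exact default can vary slightly between versions):

import org.apache.spark.{SparkConf, SparkContext}

// lower the fraction if shuffles need more working memory, raise it for cache-heavy jobs
val conf = new SparkConf()
  .setAppName("memory-fraction-example")
  .set("spark.storage.memoryFraction", "0.5")
val sc = new SparkContext(conf)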
Spark / Thrift / ODBC connectivity
I’m currently using the Spark 1.1 branch and have been able to get the Thrift service up and running. The quick questions are: should I be able to use the Thrift service to connect to Spark SQL generated tables and/or Hive tables? As well, by any chance do we have any documents that show how to connect something like Tableau to the Spark SQL Thrift server - similar to the SAP ODBC connectivity http://www.saphana.com/docs/DOC-472? Thanks! Denny
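As a quick smoke test that the Thrift server is reachable over JDBC before wiring up an ODBC client such as Tableau, something along these lines should work with the 1.1 branch (the port 10000 and the master URL are assumptions; adjust to your deployment):

# start the HiveServer2-compatible Thrift server, then connect with beeline
./sbin/start-thriftserver.sh --master spark://master-host:7077
./bin/beeline -u jdbc:hive2://localhost:10000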
How to debug this error?
Hello I'm new to Spark and playing around, but saw the following error. Could anyone to help on it? Thanks Gary scala c res15: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[7] at flatMap at console:23 scala group res16: org.apache.spark.rdd.RDD[(String, Iterable[String])] = MappedValuesRDD[5] at groupByKey at console:19 val d = c.map(i=group.filter(_._1 ==i )) d.first 14/08/29 04:39:33 INFO TaskSchedulerImpl: Cancelling stage 28 14/08/29 04:39:33 INFO DAGScheduler: Failed to run first at console:28 org.apache.spark.SparkException: Job aborted due to stage failure: Task 28.0:180 failed 4 times, most recent failure: Exception failure in TID 3605 on host mcs-spark-slave1-staging.snc1: java.lang.NullPointerException org.apache.spark.rdd.RDD.filter(RDD.scala:282) $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:25) $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:25) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$10.next(Iterator.scala:312) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) org.apache.spark.rdd.RDD$$anonfun$28.apply(RDD.scala:1003) org.apache.spark.rdd.RDD$$anonfun$28.apply(RDD.scala:1003) org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083) org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:744) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at 
akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
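The NullPointerException above is the classic "RDD referenced inside another RDD's closure" problem: group is not usable on the executors, so group.filter(...) inside c.map fails at runtime. Two standard rewrites, sketched with the same names c and group as in the post:

import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

// 1) Express the lookup as a join; both sides stay distributed.
val keyed = c.map(i => (i, ()))
val matched = keyed.join(group).map { case (k, (_, vs)) => (k, vs) }

// 2) If group is small, collect it on the driver and broadcast it.
val lookup = sc.broadcast(group.collectAsMap())
val d = c.map(i => (i, lookup.value.get(i)))   // Option[Iterable[String]] per element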
Spark Hive max key length is 767 bytes
(Please ignore if duplicated) Hi, I use Spark 1.0.2 with Hive 0.13.1 I have already set the hive mysql database to latine1; mysql: alter database hive character set latin1; Spark: scala val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) scala hiveContext.hql(create table test_datatype1 (testbigint bigint )) scala hiveContext.hql(drop table test_datatype1) 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. 14/08/29 12:31:59 ERROR DataNucleus.Datastore: An exception was thrown while adding/validating class(es) : Specified key was too long; max key length is 767 bytes com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at com.mysql.jdbc.Util.handleNewInstance(Util.java:408) at com.mysql.jdbc.Util.getInstance(Util.java:383) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1062) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4226) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4158) Can you please advise what would be wrong? Regards Arthur
Re: Memory statistics in the Application detail UI
Hi, Thanks for the responses. I understand that the second values in the Memory Used column for the executors add up to 95.5 GB and the first values add up to 17.3 KB. If 95.5 GB is the memory used to store the RDDs, then what is 17.3 KB ? is that the memory used for shuffling operations? For non MLlib applications I get 0.0 for the first number - i.e memory used is 0.0 (95.5 GB Total). Is the total memory used the sum of the two numbers or is the first number included in the second number (i.e is 17.3 KB included in the 95.5 GB)? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Memory-statistics-in-the-Application-detail-UI-tp13082p13095.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Failed to run runJob at ReceiverTracker.scala
It did. It got failed and respawned 4 times. In this case, the too many open files is a sign that you need increase the system-wide limit of open files. Try adding ulimit -n 16000 to your conf/spark-env.sh. TD On Thu, Aug 28, 2014 at 5:29 PM, Tim Smith secs...@gmail.com wrote: Appeared after running for a while. I re-ran the job and this time, it crashed with: 14/08/29 00:18:50 WARN ReceiverTracker: Error reported by receiver for stream 0: Error in block pushing thread - java.net.SocketException: Too many open files Shouldn't the failed receiver get re-spawned on a different worker? On Thu, Aug 28, 2014 at 4:12 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Do you see this error right in the beginning or after running for sometime? The root cause seems to be that somehow your Spark executors got killed, which killed receivers and caused further errors. Please try to take a look at the executor logs of the lost executor to find what is the root cause that caused the executor to fail. TD On Thu, Aug 28, 2014 at 3:54 PM, Tim Smith secs...@gmail.com wrote: Hi, Have a Spark-1.0.0 (CDH5) streaming job reading from kafka that died with: 14/08/28 22:28:15 INFO DAGScheduler: Failed to run runJob at ReceiverTracker.scala:275 Exception in thread Thread-59 14/08/28 22:28:15 INFO YarnClientClusterScheduler: Cancelling stage 2 14/08/28 22:28:15 INFO DAGScheduler: Executor lost: 5 (epoch 4) 14/08/28 22:28:15 INFO BlockManagerMasterActor: Trying to remove executor 5 from BlockManagerMaster. 14/08/28 22:28:15 INFO BlockManagerMaster: Removed 5 successfully in removeExecutor org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0:0 failed 4 times, most recent failure: TID 6481 on host node-dn1-1.ops.sfdc.net failed for unknown reason Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Any insights into this error? Thanks, Tim
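For reference, the suggested change is just a line in conf/spark-env.sh on each node running executors (16000 is the value suggested above; the hard limit in /etc/security/limits.conf may also need to allow it):

# conf/spark-env.sh
ulimit -n 16000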
Re: Memory statistics in the Application detail UI
Click the Storage tab. You have some (tiny) RDD persisted in memory. On Fri, Aug 29, 2014 at 5:58 AM, SK skrishna...@gmail.com wrote: Hi, Thanks for the responses. I understand that the second values in the Memory Used column for the executors add up to 95.5 GB and the first values add up to 17.3 KB. If 95.5 GB is the memory used to store the RDDs, then what is 17.3 KB ? is that the memory used for shuffling operations? For non MLlib applications I get 0.0 for the first number - i.e memory used is 0.0 (95.5 GB Total). Is the total memory used the sum of the two numbers or is the first number included in the second number (i.e is 17.3 KB included in the 95.5 GB)? thanks - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: org.apache.hadoop.io.compress.SnappyCodec not found
Hi, You can set the settings in conf/spark-env.sh like this:export SPARK_LIBRARY_PATH=/usr/lib/hadoop/lib/native/ SPARK_JAVA_OPTS+=-Djava.library.path=$SPARK_LIBRARY_PATH SPARK_JAVA_OPTS+=-Dspark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec SPARK_JAVA_OPTS+=-Dio.compression.codecs=org.apache.hadoop.io.compress.SnappyCodec export SPARK_JAVA_OPTS And make sure the snappy native library is located in /usr/lib/hadoop/lib/native/ directory. Also, if you are using 64 bit system, remember to re-compile the snappy native library to 64-bit, cause the official snappy lib is 32-bit. $ file /usr/lib/hadoop/lib/native/*/usr/lib/hadoop/lib/native/libhadoop.a: current ar archive/usr/lib/hadoop/lib/native/libhadooppipes.a: current ar archive/usr/lib/hadoop/lib/native/libhadoop.so: symbolic link to `libhadoop.so.1.0.0'/usr/lib/hadoop/lib/native/libhadoop.so.1.0.0: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, stripped/usr/lib/hadoop/lib/native/libhadooputils.a: current ar archive/usr/lib/hadoop/lib/native/libhdfs.a: current ar archive/usr/lib/hadoop/lib/native/libsnappy.so: symbolic link to `libsnappy.so.1.1.3'/usr/lib/hadoop/lib/native/libsnappy.so.1: symbolic link to `libsnappy.so.1.1.3'/usr/lib/hadoop/lib/native/libsnappy.so.1.1.3: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, stripped BRs,Patrick Liu Date: Thu, 28 Aug 2014 18:36:25 -0700 From: ml-node+s1001560n13084...@n3.nabble.com To: linkpatrick...@live.com Subject: Re: org.apache.hadoop.io.compress.SnappyCodec not found Hi, I fixed the issue by copying libsnappy.so to Java ire. RegardsArthur On 29 Aug, 2014, at 8:12 am, [hidden email] [hidden email] wrote:Hi, If change my etc/hadoop/core-site.xml frompropertynameio.compression.codecs/namevalue org.apache.hadoop.io.compress.SnappyCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec/value /property topropertynameio.compression.codecs/namevalue org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.SnappyCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec/value /property and run the test again, I found this time it cannot find org.apache.hadoop.io.compress.GzipCodec scala inFILE.first()java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171)at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.RDD.take(RDD.scala:983) at org.apache.spark.rdd.RDD.first(RDD.scala:1015) at $iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24)at init(console:26) at 
.init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console)at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608) at