Re: spark-submit question
I figured it out. I had to use pyspark.files.SparkFiles to get the locations of files loaded into Spark.

On Mon, Nov 17, 2014 at 1:26 PM, Sean Owen so...@cloudera.com wrote:

You are changing these paths and filenames to match your own actual scripts and file locations, right?

On Nov 17, 2014 4:59 AM, Samarth Mailinglist mailinglistsama...@gmail.com wrote:

I am trying to run a job written in Python with the following command:

bin/spark-submit --master spark://localhost:7077 /path/spark_solution_basic.py --py-files /path/*.py --files /path/config.properties

I always get an exception that config.properties is not found:

INFO - IOError: [Errno 2] No such file or directory: 'config.properties'

Why isn't this working?
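For reference, a minimal sketch of resolving a file shipped with --files. This uses the Scala API (org.apache.spark.SparkFiles); the PySpark counterpart is the SparkFiles class mentioned above. The file name matches the thread; everything else is illustrative:

import java.io.FileInputStream
import java.util.Properties
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val sc = new SparkContext(new SparkConf().setAppName("config-example"))
// Files passed to spark-submit via --files are shipped to every node;
// SparkFiles.get resolves their local path on whichever machine runs this.
val configPath = SparkFiles.get("config.properties")
val props = new Properties()
props.load(new FileInputStream(configPath))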
RandomGenerator class not found exception
My sbt file for the project includes this:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "org.apache.spark" %% "spark-mllib" % "1.1.0",
  "org.apache.commons" % "commons-math3" % "3.3"
)

Still I am getting this error:

java.lang.NoClassDefFoundError: org/apache/commons/math3/random/RandomGenerator

The jar at ~/.m2/repository/org/apache/commons/commons-math3/3.3 contains the RandomGenerator class:

$ jar tvf commons-math3-3.3.jar | grep RandomGenerator
org/apache/commons/math3/random/RandomGenerator.class
org/apache/commons/math3/random/UniformRandomGenerator.class
org/apache/commons/math3/random/SynchronizedRandomGenerator.class
org/apache/commons/math3/random/AbstractRandomGenerator.class
org/apache/commons/math3/random/RandomGeneratorFactory$1.class
org/apache/commons/math3/random/RandomGeneratorFactory.class
org/apache/commons/math3/random/StableRandomGenerator.class
org/apache/commons/math3/random/NormalizedRandomGenerator.class
org/apache/commons/math3/random/JDKRandomGenerator.class
org/apache/commons/math3/random/GaussianRandomGenerator.class

Please help.
Re: RandomGenerator class not found exception
Add this jar (http://mvnrepository.com/artifact/org.apache.commons/commons-math3/3.3) while creating the SparkContext:

sc.addJar("/path/to/commons-math3-3.3.jar")

And make sure it is shipped and shows up in the Environment tab of the web UI (port 4040).

Thanks
Best Regards

On Mon, Nov 17, 2014 at 1:54 PM, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote:
> My sbt file for the project includes this: ... Please help
Re: RandomGenerator class not found exception
Include commons-math3 3.3 on the classpath while submitting the jar to the Spark cluster, like:

spark-submit --driver-class-path <commons-math3-3.3.jar> --class MainClass --master <spark cluster url> <app jar>

On Mon, Nov 17, 2014 at 1:55 PM, Ritesh Kumar Singh [via Apache Spark User List] ml-node+s1001560n19055...@n3.nabble.com wrote:
> My sbt file for the project includes this: ... Please help
Re: Functions in Spark
One rule of thumb is to use rdd.toDebugString and check the lineage for ShuffledRDD. As long as there's no need to restructure the RDD, operations can be pipelined on each partition. rdd.toDebugString is your friend :-)

-kr, Gerard.

On Mon, Nov 17, 2014 at 7:37 AM, Mukesh Jha me.mukesh@gmail.com wrote:

Thanks, I did go through the video and it was very informative, but I think I was looking for the Transformations section of the page https://spark.apache.org/docs/0.9.1/scala-programming-guide.html.

On Mon, Nov 17, 2014 at 10:31 AM, Samarth Mailinglist mailinglistsama...@gmail.com wrote:

Check this video out: https://www.youtube.com/watch?v=dmL0N3qfSc8&list=UURzsq7k4-kT-h3TDUBQ82-w

On Mon, Nov 17, 2014 at 9:43 AM, Deep Pradhan pradhandeep1...@gmail.com wrote:

Hi, is there any way to know which of my functions performs better in Spark? In other words, say I have achieved the same thing using two different implementations. How do I judge which implementation is better than the other? Is processing time the only metric we can use to claim the goodness of one implementation over the other? Can anyone please share some thoughts on this? Thank you.

--
Thanks & Regards,
Mukesh Jha
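As a hedged illustration of the suggestion (the data and operations are made up):

val counts = sc.textFile("data.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
// Prints the lineage; a ShuffledRDD entry marks a stage boundary, while
// the map/flatMap steps stay pipelined within a single stage.
println(counts.toDebugString)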
Landmarks in GraphX section of Spark API
Hi, I was going through the GraphX section of the Spark API at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ShortestPaths$

Here I find the word "landmark". Can anyone explain to me what landmark means? Is it being used as a plain English word, or does it mean something specific in GraphX? Thank You
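In the ShortestPaths API, the landmarks argument is the sequence of vertex IDs to which shortest-path distances are computed for every vertex. A minimal sketch (the graph and the vertex IDs are illustrative):

import org.apache.spark.graphx._
import org.apache.spark.graphx.lib.ShortestPaths

// Assume graph: Graph[Int, Int] was built elsewhere; 1L and 4L are the
// "landmark" vertices we want hop distances to.
val result = ShortestPaths.run(graph, landmarks = Seq(1L, 4L))
// Each vertex now carries a map from landmark vertex ID to hop distance.
result.vertices.collect().foreach(println)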
Re: How to incrementally compile spark examples using mvn
The downloads only happen once, so this is not a problem. If you are building just one module in a project, it needs a compiled copy of the other modules. It will either use your locally-built and locally-installed artifacts, or download them from the repo if possible. This isn't needed if you are compiling all modules at once. If you want to compile everything and reuse the local artifacts later, you need 'install', not 'package'.

On Mon, Nov 17, 2014 at 12:27 AM, Yiming (John) Zhang sdi...@gmail.com wrote:

Thank you Marcelo. I tried your suggestion (# mvn -pl :spark-examples_2.10 compile), but it required downloading many Spark components (as listed below), which I have already compiled on my server.

Downloading: https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.10/1.1.0/spark-core_2.10-1.1.0.pom
...
Downloading: https://repo1.maven.org/maven2/org/apache/spark/spark-streaming_2.10/1.1.0/spark-streaming_2.10-1.1.0.pom
...
Downloading: https://repository.jboss.org/nexus/content/repositories/releases/org/apache/spark/spark-hive_2.10/1.1.0/spark-hive_2.10-1.1.0.pom
...

This problem didn't happen when I compiled the whole project using "mvn -DskipTests package". I guess some configuration has to be made to tell mvn the dependencies are local. Any idea? Thank you for your help!

Cheers, Yiming

-----Original Message-----
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: November 16, 2014 10:26
To: sdi...@gmail.com
Cc: user@spark.apache.org
Subject: Re: How to incrementally compile spark examples using mvn

I haven't tried scala:cc, but you can ask Maven to just build a particular sub-project. For example:

mvn -pl :spark-examples_2.10 compile

On Sat, Nov 15, 2014 at 5:31 PM, Yiming (John) Zhang sdi...@gmail.com wrote:

Hi, I have already successfully compiled and run the Spark examples. My problem is that if I make some modifications (e.g., on SparkPi.scala or LogQuery.scala) I have to use "mvn -DskipTests package" to rebuild the whole Spark project and wait a relatively long time. I also tried "mvn scala:cc" as described in http://spark.apache.org/docs/latest/building-with-maven.html, but I could only get an infinite wait like:

[INFO] --- scala-maven-plugin:3.2.0:cc (default-cli) @ spark-parent ---
[INFO] wait for files to compile...

Is there any method to incrementally compile the examples using mvn? Thank you!

Cheers, Yiming

--
Marcelo
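Putting Sean's advice together, a sketch of the workflow (the module name is taken from the thread):

# Build everything once and install the artifacts into the local
# repository, so later sub-module builds resolve them locally:
mvn -DskipTests clean install
# From then on, recompile just the examples module after editing it:
mvn -pl :spark-examples_2.10 compile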
Spark streaming batch overrun
Hi all, In this presentation (https://prezi.com/1jzqym68hwjp/spark-gotchas-and-anti-patterns/) it mentions that Spark Streaming's behaviour is undefined if a batch overruns the polling interval. Is this something that might be addressed in future, or is it fundamental to the design?
HDFS read text file
Hi,

JavaRDD<Student> studentsData = sc.parallelize(list); // list is student info, a List<Student>
studentsData.saveAsTextFile("hdfs://master/data/spark/instruments.txt");

The above statements save the student information in HDFS as a text file, with each object written as one line, as below. [Inline screenshot of the saved file contents omitted.]

How can I read that file back, so that each line becomes a Student object?

-Naveen
Building Spark for Hive The requested profile hadoop-1.2 could not be activated because it does not exist.
I am using Apache Hadoop 1.2.1 and I wanted to use Spark SQL with Hive, so I tried to build Spark like so:

mvn -Phive,hadoop-1.2 -Dhadoop.version=1.2.1 clean -DskipTests package

But I get the following error:

The requested profile "hadoop-1.2" could not be activated because it does not exist.

Is there some way to handle this, or do I have to downgrade Hadoop? Any help is appreciated.
How to measure communication between nodes in Spark Standalone Cluster?
Hello, I use a Spark standalone cluster and I want to measure inter-node communication somehow. As I understand it, GraphX transfers only vertex values; am I right? I do not just want the number of bytes transferred between any two nodes; rather, is there a way to measure how many vertex values were transferred among nodes? Thanks!

--
Regards,
Hlib Mykhailenko
PhD student at INRIA Sophia-Antipolis Méditerranée
2004 Route des Lucioles BP93, 06902 SOPHIA ANTIPOLIS cedex
Re: HDFS read text file
You can use sc.objectFile (https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext) to read it. It will be of type RDD[Student].

Thanks
Best Regards

On Mon, Nov 17, 2014 at 4:03 PM, Naveen Kumar Pokala npok...@spcapitaliq.com wrote:
> Hi, ... How to read that file, I mean each line as Object of student. ...
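Note that sc.objectFile reads data written with saveAsObjectFile, not saveAsTextFile. A minimal sketch of the round trip (the class and paths are illustrative):

case class Student(id: Int, name: String)

val students = sc.parallelize(Seq(Student(1, "Alice"), Student(2, "Bob")))
// Write serialized objects rather than toString lines:
students.saveAsObjectFile("hdfs://master/data/spark/students")
// Read back; the element type is supplied explicitly:
val loaded = sc.objectFile[Student]("hdfs://master/data/spark/students")
loaded.collect().foreach(println)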
Re: How to measure communication between nodes in Spark Standalone Cluster?
You can use Ganglia to see the overall data transfer across the cluster/nodes. I don't think there's a direct way to get the vertices being transferred.

Thanks
Best Regards

On Mon, Nov 17, 2014 at 4:29 PM, Hlib Mykhailenko hlib.mykhaile...@inria.fr wrote:
> Hello, I use Spark Standalone Cluster and I want to measure somehow internode communication. ...
Re: HDFS read text file
Hello Naveen, I think you should first override the toString method of your sample.spark.test.Student class.

--
Regards,
Hlib Mykhailenko
PhD student at INRIA Sophia-Antipolis Méditerranée

----- Original Message -----
From: Naveen Kumar Pokala npok...@spcapitaliq.com
To: user@spark.apache.org
Sent: Monday, November 17, 2014 11:33:44 AM
Subject: HDFS read text file
> Hi, ... How to read that file, I mean each line as Object of student. ...
Re: Building Spark for Hive The requested profile hadoop-1.2 could not be activated because it does not exist.
Oops, I guess this is the right way to do it:

mvn -Phive -Dhadoop.version=1.2.1 clean -DskipTests package
Re: Returning breeze.linalg.DenseMatrix from method
This should fix it:

def func(str: String): DenseMatrix[Double] = {
  ...
}

So why is this required? Think of it like this: if you hadn't explicitly mentioned Double, the calling function might have expected a DenseMatrix[SomeOtherType] and performed a SomeOtherType-specific operation that is not supported by the returned DenseMatrix[Double]. (I'm also assuming that SomeOtherType has no subtype relation with Double.)

On 17 November 2014 00:14, Ritesh Kumar Singh riteshoneinamill...@gmail.com wrote:

Hi, I have a method that returns a DenseMatrix:

def func(str: String): DenseMatrix = {
  ...
}

But I keep getting this error:

class DenseMatrix takes type parameters

I tried this too:

def func(str: String): DenseMatrix(Int, Int, Array[Double]) = {
  ...
}

But this gives me this error:

'=' expected but '(' found

Any possible fixes?

--
Tribhuvanesh Orekondy
Building Spark with hive does not work
Hi, I am building Spark on the most recent master branch. I checked this page: https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md

The command

./sbt/sbt -Phive -Phive-thirftserver clean assembly/assembly

works fine and a fat jar is created. However, when I start the SQL CLI, I encounter an exception:

Spark assembly has been built with Hive, including Datanucleus jars on classpath
java.lang.ClassNotFoundException: org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:337)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Failed to load main class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
You need to build Spark with -Phive and -Phive-thriftserver.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

It suggests building with -Phive and -Phive-thriftserver, which is exactly what I have done. Any idea?

Hao
Re: Returning breeze.linalg.DenseMatrix from method
Yeah, it works. Although when I try to declare a var of type DenseMatrix like this:

var mat1: DenseMatrix[Double]

it gives an error saying the matrix must be initialised at the point of declaration. I had to initialise it as:

var mat1: DenseMatrix[Double] = DenseMatrix.zeros[Double](1,1)

Anyway, it works now. Thanks for helping :)

On Mon, Nov 17, 2014 at 4:56 PM, tribhuvan...@gmail.com wrote:
> This should fix it: def func(str: String): DenseMatrix[Double] = { ... } ...
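For what it's worth, a sketch of two ways around that: local vars in Scala cannot be left uninitialised (unlike class fields), so either supply a placeholder value or defer with Option:

import breeze.linalg.DenseMatrix

// Placeholder value, as in the thread:
var mat1: DenseMatrix[Double] = DenseMatrix.zeros[Double](1, 1)
// Or defer with Option and avoid a dummy matrix:
var mat2: Option[DenseMatrix[Double]] = None
mat2 = Some(DenseMatrix.zeros[Double](2, 2))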
Re: How do you force a Spark Application to run in multiple tasks
I've never used Mesos, sorry.

On Fri, Nov 14, 2014 at 5:30 PM, Steve Lewis lordjoe2...@gmail.com wrote:

The cluster runs Mesos and I can see the tasks in the Mesos UI, but most are not doing much; any hints about that UI?

On Fri, Nov 14, 2014 at 11:39 AM, Daniel Siegmann daniel.siegm...@velos.io wrote:

Most of the information you're asking for can be found on the Spark web UI (see http://spark.apache.org/docs/1.1.0/monitoring.html). You can see which tasks are being processed by which nodes.

If you're using HDFS and your file size is smaller than the HDFS block size you will only have one partition (remember, there is exactly one task for each partition in a stage). If you want to force it to have more partitions, you can call RDD.repartition(numPartitions). Note that this will introduce a shuffle you wouldn't otherwise have. Also make sure your job is allocated more than one core in your cluster (you can see this on the web UI).

On Fri, Nov 14, 2014 at 2:18 PM, Steve Lewis lordjoe2...@gmail.com wrote:

I have instrumented word count to track how many machines the code runs on. I use an accumulator to maintain a set of MAC addresses. I find that everything is done on a single machine. This is probably optimal for word count, but not for the larger problems I am working on. How do I force processing to be split into multiple tasks? How do I access the task and attempt numbers, to track which processing happens in which attempt? Also, is using the MAC address a reasonable way to determine which machine is running the code? As far as I can tell, a simple word count runs in one thread on one machine while the remainder of the cluster does nothing. This is consistent with tests where I write to stdout from functions and see little output on most machines in the cluster.

--
Steven M. Lewis PhD
4221 105th Ave NE, Kirkland, WA 98033
206-384-1340 (cell) Skype lordjoe_com

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io
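As a hedged sketch of the repartition suggestion (the path and partition count are illustrative):

val lines = sc.textFile("hdfs:///path/to/input.txt")
// A file below the HDFS block size loads as a single partition;
// repartition forces parallelism at the cost of a shuffle.
val repartitioned = lines.repartition(16)
println(repartitioned.partitions.length) // 16 tasks per stage over this RDD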
Re: How to measure communication between nodes in Spark Standalone Cluster?
I am not sure there is a direct way (an API in GraphX, etc.) to measure the number of vertex values transferred among nodes during computation. It might depend on:
- the operations in your application, e.g. whether each vertex communicates only with its immediate neighbours
- the partition strategy you chose, w.r.t. the vertex replication factor
- the distribution of partitions on the cluster
...

Best,
Yifan LI
LIP6, UPMC, Paris

On 17 Nov 2014, at 11:59, Hlib Mykhailenko hlib.mykhaile...@inria.fr wrote:
> Hello, I use Spark Standalone Cluster and I want to measure somehow internode communication. ...
Re: same error of SPARK-1977 while using trainImplicit in mllib 1.0.2
Thanks. It works for me.

--
Aaron Lin

On Saturday, November 15, 2014 at 1:19 AM, Xiangrui Meng wrote:

If you use the Kryo serializer, you need to register mutable.BitSet and Rating: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala#L102

The JIRA was marked resolved because chill resolved the problem in v0.4.0 and we have this workaround.

-Xiangrui

On Fri, Nov 14, 2014 at 12:41 AM, aaronlin aaron...@kkbox.com wrote:

Hi folks, although SPARK-1977 says that this problem is resolved in 1.0.2, I still have this problem while running the script on AWS EC2 via spark-ec2.py. I checked SPARK-1977 and found that twitter.chill resolved the problem in v0.4.0, not v0.3.6, but Spark depends on twitter.chill v0.3.6 based on the Maven page. For more information, you can check the following pages:

- https://github.com/twitter/chill
- http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10/1.0.2

Can anyone give me advice? Thanks

--
Aaron Lin
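For reference, a sketch of the registration workaround Xiangrui points to, following the linked MovieLensALS example (Spark 1.x KryoRegistrator API; treat the details as an approximation of that example):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.serializer.{KryoRegistrator, KryoSerializer}
import scala.collection.mutable

class ALSRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Rating])
    kryo.register(classOf[mutable.BitSet])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", classOf[KryoSerializer].getName)
  .set("spark.kryo.registrator", classOf[ALSRegistrator].getName)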
Re: RDD.aggregate versus accumulables...
You should *never* use accumulators for this purpose because you may get incorrect answers. Accumulators can count the same thing multiple times; you cannot rely upon the correctness of the values they compute. See SPARK-732 (https://issues.apache.org/jira/browse/SPARK-732) for more info.

On Sun, Nov 16, 2014 at 10:06 PM, Segerlind, Nathan L nathan.l.segerl...@intel.com wrote:

Hi all. I am trying to get my head around why using accumulators and accumulables seems to be the most recommended method for accumulating running sums, averages, variances and the like, whereas the aggregate method seems to me to be the right one. I have no performance measurements as of yet, but it seems that aggregate is simpler and more intuitive (and it does what one might expect an accumulator to do), whereas the accumulators and accumulables seem to have some extra complications and overhead.

So... what's the real difference between an accumulator/accumulable and aggregating an RDD? When is one method of aggregation preferred over the other?

Thanks, Nate

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io
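To make the alternative concrete, a minimal sketch of a running sum and count via RDD.aggregate (the data is illustrative):

val data = sc.parallelize(1 to 100)
val (sum, count) = data.aggregate((0L, 0L))(
  (acc, x) => (acc._1 + x, acc._2 + 1), // fold values within a partition
  (a, b) => (a._1 + b._1, a._2 + b._2)) // merge per-partition results
val mean = sum.toDouble / count
// Unlike an accumulator, the result is produced by the action itself,
// so re-executed tasks cannot double-count.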
Re: Building Spark with hive does not work
Hey Hao,

Which commit are you using? Just tried 64c6b9b with exactly the same command line flags, couldn't reproduce this issue.

Cheng

On 11/17/14 10:02 PM, Hao Ren wrote:
> Hi, I am building Spark on the most recent master branch. ... Any idea? Hao
Re: RDD.aggregate versus accumulables...
We use Algebird for calculating things like min/max, stddev, variance, etc. https://github.com/twitter/algebird/wiki

-Suren

On Mon, Nov 17, 2014 at 11:32 AM, Daniel Siegmann daniel.siegm...@velos.io wrote:
> You should *never* use accumulators for this purpose because you may get incorrect answers. ...

--
SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning
440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hira...@velos.io
W: www.velos.io
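As a hedged sketch of that approach: Algebird's Moments monoid folds count/mean/variance in one pass over an RDD (names per the Algebird wiki; treat this as an approximation, not a definitive recipe):

import com.twitter.algebird.{Moments, MomentsGroup}

val nums = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
// Lift each value into a Moments and combine with the group's plus:
val moments = nums.map(Moments(_)).reduce(MomentsGroup.plus)
println((moments.count, moments.mean, moments.variance))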
How to broadcast a textFile?
I have a 1-million-row file that I'd like to read from my edge node, and then send a copy of it to each Hadoop machine's memory in order to run JOINs in my Spark Streaming code. I see examples in the docs of how to use broadcast() for a simple array, but how about when the data is in a textFile?
Re: RandomGenerator class not found exception
As you are using sbt, you need not put it in ~/.m2/repository for Maven. Include the jar explicitly using the --driver-class-path option while submitting the jar to the Spark cluster.

On Mon, Nov 17, 2014 at 7:41 PM, Ritesh Kumar Singh [via Apache Spark User List] wrote:

It's still not working; I keep getting the same error. I even deleted the commons-math3/* folder containing the jar, then made a folder called 'math3' under the directory org/apache/commons/ and put commons-math3-3.3.jar in it. Still it's not working. I also tried sc.addJar("/path/to/jar") within the spark-shell and in my project source file; it still didn't import the jar in either location. Any fixes? Please help.

On Mon, Nov 17, 2014 at 2:14 PM, Chitturi Padma wrote:
> Include the commons-math3/3.3 in class path while submitting jar to spark cluster. ...
Re: Building Spark with hive does not work
Sorry for spamming. Just after my previous post, I noticed that the command I used was:

./sbt/sbt -Phive -Phive-thirftserver clean assembly/assembly

It should be -Phive-thriftserver; the typo is the culprit. Silly me. I just copy-pasted it from somewhere else without checking, and since no error message such as "no such option" was displayed, I assumed the flags were correct. Sorry for the carelessness.

Hao
Re: Building Spark with hive does not work
Looks like this was where you got that command line: http://search-hadoop.com/m/JW1q5RlPrl

Cheers

On Mon, Nov 17, 2014 at 9:44 AM, Hao Ren inv...@gmail.com wrote:
> Just after my previous post, I noticed that the command used was ./sbt/sbt -Phive -Phive-thirftserver clean assembly/assembly ... the typo is the culprit. ...
How can I apply such an inner join in Spark Scala/Python
So let us say I have RDDs A and B with the following values:

A = [ (1, 2), (2, 4), (3, 6) ]
B = [ (1, 3), (2, 5), (3, 6), (4, 5), (5, 6) ]

I want to apply an inner join, such that I get the following as a result:

C = [ (1, (2, 3)), (2, (4, 5)), (3, (6, 6)) ]

That is, those keys which are not present in A should disappear after the inner join. How can I achieve that? I can see outer-join functions but no inner-join function in the Spark RDD class.
Exception in spark sql when running a group by query
While testing Spark SQL, we ran this GROUP BY with expression query and got an exception. The same query worked fine in Hive.

SELECT from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200), 'yyyy/MM/dd') as pst_date,
       count(*) as num_xyzs
FROM all_matched_abc
GROUP BY from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200), 'yyyy/MM/dd')

14/11/17 17:41:46 ERROR thriftserver.SparkSQLDriver: Failed in [the query above]
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor(((CAST(xyz#183.whenrequestreceived AS whenrequestreceived#187L, DoubleType) / 1000.0) - CAST(25200, DoubleType))),yyyy/MM/dd) AS pst_date#179, tree:
Aggregate [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor(((CAST(xyz#183.whenrequestreceived, DoubleType) / 1000.0) - CAST(25200, DoubleType))),yyyy/MM/dd)], [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor(((CAST(xyz#183.whenrequestreceived AS whenrequestreceived#187L, DoubleType) / 1000.0) - CAST(25200, DoubleType))),yyyy/MM/dd) AS pst_date#179,COUNT(1) AS num_xyzs#180L]
 MetastoreRelation default, all_matched_abc, None

        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:127)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:125)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
        at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
        at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
        at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
        at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
        at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
        at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
        at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
        at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
        at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
        at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
        at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
        at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
        at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
        at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:425)
        at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:59)
        at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:276)
        at ...
How do I get the executor ID from running Java code
The Spark UI lists a number of executor IDs on the cluster. I would like to access both the executor ID and the task/attempt IDs from code inside a function running on a slave machine. My current motivation is to examine parallelism and locality, but in Hadoop this kind of information also lets code write non-overlapping temporary files.
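For what it's worth, a sketch of one way to surface this from inside a task. Note these APIs are assumptions for this thread's era: TaskContext.get() and attemptNumber arrived in later releases, and SparkEnv is a developer API, so verify against your Spark version:

val annotated = rdd.mapPartitions { iter =>
  // Executor ID of the JVM running this partition:
  val execId = org.apache.spark.SparkEnv.get.executorId
  val ctx = org.apache.spark.TaskContext.get()
  val tag = (execId, ctx.partitionId(), ctx.attemptNumber())
  iter.map(x => (tag, x))
}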
Spark streaming on Yarn
Hi, I have been using Spark Streaming in standalone mode and now I want to migrate to Spark running on YARN, but I am not sure how you would go about designating a specific node in the cluster to act as the Avro listener, since I am using the Flume-based push approach with Spark.
Re: How to broadcast a textFile?
OK, then I'd still need to write the code (within my Spark Streaming code, I'm guessing) to convert my text file into an object like a HashMap before broadcasting. How can I make sure only the HashMap is being broadcast, while all the pre-processing to create the HashMap is performed only once?
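A minimal sketch of the pattern (the file path and record format are illustrative): the file is read and the map built once on the driver, and only the resulting map is broadcast; executors then access it via .value.

val lookup: Map[String, String] =
  scala.io.Source.fromFile("/path/on/edge/node/data.txt").getLines().map { line =>
    val Array(k, v) = line.split("\t", 2)
    k -> v
  }.toMap                            // built once, on the driver
val lookupBc = sc.broadcast(lookup)  // only the map is shipped

val events = sc.textFile("hdfs:///events")
val joined = events.map(key => (key, lookupBc.value.get(key)))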
RE: RDD.aggregate versus accumulables...
Thanks for the link to the bug. Unfortunately, using accumulators like this is getting spread around as a recommended practice despite the bug.

From: Daniel Siegmann [mailto:daniel.siegm...@velos.io]
Sent: Monday, November 17, 2014 8:32 AM
To: Segerlind, Nathan L
Cc: user
Subject: Re: RDD.aggregate versus accumulables...

> You should never use accumulators for this purpose because you may get incorrect answers. ...
Re: How can I apply such an inner join in Spark Scala/Python
Just RDD.join() should be an inner join.

On Mon, Nov 17, 2014 at 5:51 PM, Blind Faith person.of.b...@gmail.com wrote:
> So let us say I have RDDs A and B with the following values. ... How can I achieve that?
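A minimal sketch with the question's data:

val a = sc.parallelize(Seq(1 -> 2, 2 -> 4, 3 -> 6))
val b = sc.parallelize(Seq(1 -> 3, 2 -> 5, 3 -> 6, 4 -> 5, 5 -> 6))
// join on pair RDDs is an inner join; keys present in only one side drop out
val c = a.join(b)
c.collect().foreach(println) // (1,(2,3)), (2,(4,5)), (3,(6,6))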
Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
"only option is to split your problem further by increasing parallelism"

My understanding is that this means increasing the number of partitions; is that right? That didn't seem to help, because the partitions are not uniformly sized. My observation is that when I increase the number of partitions, many empty partitions are created, and the larger partition is not broken down into smaller ones. Any hints on how I can get uniform partitions? I noticed many threads on this, but was not able to do anything effective from the Java API. I would appreciate any help/insight you can provide.
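One common approach, sketched here under the assumption that the skew comes from unevenly distributed keys: prepend a random salt and hash-partition on it, which spreads records across partitions roughly uniformly by count (Scala shown; the Java API has the same partitionBy/HashPartitioner calls):

import org.apache.spark.HashPartitioner
import scala.util.Random

val records = sc.textFile("hdfs:///input")  // illustrative source
val balanced = records
  .map(r => (Random.nextInt(64), r))        // random salt as the key
  .partitionBy(new HashPartitioner(64))     // roughly uniform partition sizes
  .values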
Re: Missing SparkSQLCLIDriver and Beeline drivers in Spark
Minor correction: there was a typo in the command line. hive-thirftserver should be hive-thriftserver.

Cheers

On Thu, Aug 7, 2014 at 6:49 PM, Cheng Lian lian.cs@gmail.com wrote:

Things have changed a bit in the master branch, and the SQL programming guide in the master branch actually doesn't apply to branch-1.0-jdbc.

In branch-1.0-jdbc, the Hive Thrift server and Spark SQL CLI are included in the hive profile and are thus not enabled by default. You need to either:
- pass -Phive to Maven to enable it, or
- use SPARK_HIVE=true ./sbt/sbt assembly

In the most recent master branch, however, the Hive Thrift server and Spark SQL CLI are moved into a separate hive-thriftserver profile, and our SBT build file now delegates to Maven. So, to build the master branch, you can either:
- ./sbt/sbt -Phive-thirftserver clean assembly/assembly, or
- mvn -Phive-thriftserver clean package -DskipTests

On Fri, Aug 8, 2014 at 6:12 AM, ajatix a...@sigmoidanalytics.com wrote:

Hi, I wish to migrate from Shark to the spark-sql shell, where I am facing some difficulties in setting up. I cloned branch-1.0-jdbc to test out the spark-sql shell, but I am unable to run it after building the source. I've tried two methods for building (with Hadoop 1.0.4): sbt/sbt assembly, and mvn -DskipTests clean package -X. Both build successfully, but when I run bin/spark-sql, I get the following error:

Exception in thread "main" java.lang.ClassNotFoundException: org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:270)
        at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:311)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

and when I run bin/beeline, I get this:

Error: Could not find or load main class org.apache.hive.beeline.BeeLine

bin/spark-shell works fine. Is there something else I have to add to the build parameters? According to https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md, I tried rebuilding with -Phive-thriftserver, but it failed to detect the library while building.

Thanks and Regards
Ajay
IOException: exception in uploadSinglePart
Spark 1.1.0, running on an AWS EMR cluster using yarn-client as master. I'm getting the following error when I attempt to save an RDD to S3. I've narrowed it down to a single partition that is ~150 MB in size (versus the other partitions, which are closer to 1 MB). I am able to work around this by saving to an HDFS file first, then using the hadoop distcp command to get the data up to S3, but it seems that Spark should be able to handle this. I've even tried writing to HDFS, then creating something like the following to get it up on S3:

sc.textFile(hdfsPath).saveAsTextFile(s3Path)

Here's the exception:

java.io.IOException: exception in uploadSinglePart
        com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.uploadSinglePart(MultipartUploadOutputStream.java:164)
        com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.close(MultipartUploadOutputStream.java:220)
        org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
        org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)
        org.apache.hadoop.io.compress.CompressorStream.close(CompressorStream.java:106)
        java.io.FilterOutputStream.close(FilterOutputStream.java:160)
        org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:109)
        org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:101)
        org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:994)
        org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:979)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Any ideas of what the underlying issue could be here? It feels like a timeout issue, but that's just a guess.

Thanks, Justin
Re: Spark streaming cannot receive any message from Kafka
Hi all, I found the reason for this issue. It seems that in the new version, if I do not specify spark.default.parallelism when using KafkaUtils.createStream, there will be an exception at the Kafka stream creation stage. In previous versions, Spark would use a default value.

Thanks! Bill

On Thu, Nov 13, 2014 at 5:00 AM, Helena Edelson helena.edel...@datastax.com wrote:

I encounter no issues with streaming from Kafka to Spark in 1.1.0. Do you perhaps have a version conflict?

Helena

On Nov 13, 2014 12:55 AM, Jay Vyas jayunit100.apa...@gmail.com wrote:

Yup, it's very important that n > 1 for Spark Streaming jobs; if local, use local[2]. The thing to remember is that your Spark receiver will take a thread to itself to produce data, so you need another thread to consume it. In a cluster manager like YARN or Mesos, the word "thread" is not used anymore; I guess it has a different meaning there: you need 2 or more free compute slots, and that should be guaranteed by looking to see how many free node managers are running, etc.

On Nov 12, 2014, at 7:53 PM, Shao, Saisai saisai.s...@intel.com wrote:

Did you configure the Spark master as local? It should be local[n], n > 1, for local mode. Besides, there's a Kafka word count example in the Spark Streaming examples; you can try that. I've tested with the latest master, it's OK.

Thanks
Jerry

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Thursday, November 13, 2014 8:45 AM
To: Bill Jay
Cc: u...@spark.incubator.apache.org
Subject: Re: Spark streaming cannot receive any message from Kafka

Bill,

> However, when I am currently using Spark 1.1.0, the Spark Streaming job cannot receive any messages from Kafka. I have not made any change to the code.

Do you see any suspicious messages in the log output?

Tobias
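For reference, a minimal sketch of the receiver-based API under discussion (connection details and topic names are illustrative); note the master is local[2] or more so the receiver does not starve the processing thread:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setMaster("local[2]").setAppName("kafka-example")
val ssc = new StreamingContext(conf, Seconds(10))
// (ZooKeeper quorum, consumer group, map of topic -> receiver threads)
val messages = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
messages.map(_._2).count().print()
ssc.start()
ssc.awaitTermination()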
independent user sessions with a multi-user spark sql thriftserver (Spark 1.1)
Hello,

We're running a Spark SQL thriftserver that several users connect to with beeline. One limitation we've run into is that the current working database (set with "use db") is shared across all connections, so changing the database on one connection changes the database for all connections. This is also the case for Spark SQL settings, but that's less of an issue. Is there a way (or a hack) to make the current database selection independent for each beeline connection? I'm not afraid to hack into the source code if there's a straightforward fix/workaround, but I could use some guidance on what to hack on if that's required.

Thank you!

Michael
RE: RDD.aggregate versus accumulables...
I have been playing with using accumulators (despite the possible error with multiple attempts). These provide a convenient way to get some numbers while still performing business logic. I posted some sample code at http://lordjoesoftware.blogspot.com/.

Even if accumulators are not perfect today, future versions may improve them, and they are a great way to monitor execution and get a sense of performance on lazily-executed systems.
Re: Assigning input files to spark partitions
Hi Daniel,

Yes, that should work also. However, is it possible to set things up so that each RDD has exactly one partition, without repartitioning (and thus incurring extra cost)? Is there a mechanism similar to MR where we can ensure each partition is assigned some amount of data by size, by setting some block-size parameter?

On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann daniel.siegm...@velos.io wrote:

On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote:

> No, I don't want separate RDDs, because each of these partitions is processed the same way (in my case, each partition corresponds to HBase keys belonging to one region server, and I will do HBase lookups). After that I have aggregations too, hence all these partitions should be in the same RDD. The reason to follow the partition structure is to limit concurrent HBase lookups targeting a single region server.

Neither of these is necessarily a barrier to using separate RDDs. You can define the function you want to use and then pass it to multiple map methods. Then you could union all the RDDs to do your aggregations. For example, it might look something like this:

val paths: Seq[String] = ... // the paths to the files you want to load
def myFunc(t: T) = ...       // the function to apply to every RDD
val rdds = paths.map { path => sc.textFile(path).map(myFunc) }
val completeRdd = sc.union(rdds)

Does that make any sense?

--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning
54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io
Re: Assigning input files to spark partitions
I'm not aware of any such mechanism. On Mon, Nov 17, 2014 at 2:55 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: Hi Daniel, Yes that should work also. However, is it possible to set up so that each RDD has exactly one partition, without repartitioning (and thus incurring extra cost)? Is there a mechanism similar to MR where we can ensure each partition is assigned some amount of data by size, by setting some block size parameter? On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: No i don't want separate RDD because each of these partitions are being processed the same way (in my case, each partition corresponds to HBase keys belonging to one region server, and i will do HBase lookups). After that i have aggregations too, hence all these partitions should be in the same RDD. The reason to follow the partition structure is to limit concurrent HBase lookups targeting a single region server. Neither of these is necessarily a barrier to using separate RDDs. You can define the function you want to use and then pass it to multiple map methods. Then you could union all the RDDs to do your aggregations. For example, it might look something like this: val paths: Seq[String] = ... // the paths to the files you want to load def myFunc(t: T) = ... // the function to apply to every RDD val rdds = paths.map { path => sc.textFile(path).map(myFunc) } val completeRdd = sc.union(rdds) Does that make any sense? -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 54 W 40th St, New York, NY 10018 E: daniel.siegm...@velos.io W: www.velos.io -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 54 W 40th St, New York, NY 10018 E: daniel.siegm...@velos.io W: www.velos.io
Re: independent user sessions with a multi-user spark sql thriftserver (Spark 1.1)
This is an unfortunate/known issue that we are hoping to address in the next release: https://issues.apache.org/jira/browse/SPARK-2087 I'm not sure how straightforward a fix would be, but it would involve keeping / setting the SessionState for each connection to the server. It would be great if you could share any findings on that JIRA. Thanks! On Mon, Nov 17, 2014 at 11:01 AM, Michael Allman mich...@videoamp.com wrote: Hello, We're running a spark sql thriftserver that several users connect to with beeline. One limitation we've run into is that the current working database (set with use db) is shared across all connections. So changing the database on one connection changes the database for all connections. This is also the case for spark sql settings, but that's less of an issue. Is there a way (or a hack) to make the current database selection independent for each beeline connection? I'm not afraid to hack into the source code if there's a straightforward fix/workaround, but I could use some guidance on what to hack on if that's required. Thank you! Michael
Re: Exception in spark sql when running a group by query
You are perhaps hitting an issue that was fixed by #3248 https://github.com/apache/spark/pull/3248? On Mon, Nov 17, 2014 at 9:58 AM, Sadhan Sood sadhan.s...@gmail.com wrote: While testing Spark SQL, we ran this group-by-with-expression query and got an exception; the same query works fine in Hive. SELECT from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200), '/MM/dd') as pst_date, count(*) as num_xyzs FROM all_matched_abc GROUP BY from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200), '/MM/dd') 14/11/17 17:41:46 ERROR thriftserver.SparkSQLDriver: Failed in [SELECT from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200), '/MM/dd') as pst_date, count(*) as num_xyzs FROM all_matched_abc GROUP BY from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200), '/MM/dd') ] org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor(((CAST(xyz#183.whenrequestreceived AS whenrequestreceived#187L, DoubleType) / 1000.0) - CAST(25200, DoubleType))),/MM/dd) AS pst_date#179, tree: Aggregate [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor(((CAST(xyz#183.whenrequestreceived, DoubleType) / 1000.0) - CAST(25200, DoubleType))),/MM/dd)], [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor(((CAST(xyz#183.whenrequestreceived AS whenrequestreceived#187L, DoubleType) / 1000.0) - CAST(25200, DoubleType))),/MM/dd) AS pst_date#179,COUNT(1) AS num_xyzs#180L] MetastoreRelation default, all_matched_abc, None at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:127) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:125) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422) at
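If picking up that patch is not an option, one possible workaround (a sketch, not a confirmed fix) is to compute the expression once in a subquery so the outer GROUP BY references a plain column. This assumes a HiveContext named hiveContext, and the 'yyyy/MM/dd' format string is an assumption, since it appears truncated in the archived message:

val result = hiveContext.sql("""
  SELECT pst_date, COUNT(*) AS num_xyzs
  FROM (
    SELECT from_unixtime(floor(xyz.whenrequestreceived / 1000.0 - 25200),
                         'yyyy/MM/dd') AS pst_date
    FROM all_matched_abc
  ) t
  GROUP BY pst_date
""")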
Re: SparkSQL exception on spark.sql.codegen
What version of Spark SQL? On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen zhpeng...@gmail.com wrote: Hi all, We ran SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true and got the exceptions below; has anyone else seen these before? java.lang.ExceptionInInitializerError at org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92) at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:51) at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:48) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.NullPointerException at scala.reflect.internal.Types$TypeRef.computeHashCode(Types.scala:2358) at scala.reflect.internal.Types$UniqueType.init(Types.scala:1304) at scala.reflect.internal.Types$TypeRef.init(Types.scala:2341) at scala.reflect.internal.Types$NoArgsTypeRef.init(Types.scala:2137) at scala.reflect.internal.Types$TypeRef$$anon$6.init(Types.scala:2544) at scala.reflect.internal.Types$TypeRef$.apply(Types.scala:2544) at scala.reflect.internal.Types$class.typeRef(Types.scala:3615) at scala.reflect.internal.SymbolTable.typeRef(SymbolTable.scala:13) at scala.reflect.internal.Symbols$TypeSymbol.newTypeRef(Symbols.scala:2752) at scala.reflect.internal.Symbols$TypeSymbol.typeConstructor(Symbols.scala:2806) at scala.reflect.internal.Symbols$SymbolContextApiImpl.toTypeConstructor(Symbols.scala:103) at scala.reflect.internal.Symbols$TypeSymbol.toTypeConstructor(Symbols.scala:2698) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$typecreator1$1.apply(CodeGenerator.scala:46) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) at scala.reflect.api.Universe.typeOf(Universe.scala:59) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.init(CodeGenerator.scala:46) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.init(GenerateProjection.scala:29) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.clinit(GenerateProjection.scala) ... 15 more -- Best Regards
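For anyone trying to reproduce this: codegen is off by default in 1.1 and is toggled per context; a minimal sketch, assuming an existing SQLContext (or HiveContext) named sqlContext:

sqlContext.setConf("spark.sql.codegen", "true")              // enable runtime code generation
println(sqlContext.getConf("spark.sql.codegen", "false"))    // verify the setting took effect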
RDD Blocks skewing to just few executors
Hi I'm running a standalone cluster with 8 worker servers. I'm developing a streaming app that is adding new lines of text to several different RDDs each batch interval. Each line has a well randomized unique identifier that I'm trying to use for partitioning, since the data stream does contain duplicate lines. I'm doing the partitioning with this: val eventsByKey = streamRDD.map { event => (getUID(event), event) } val partionedEventsRdd = sparkContext.parallelize(eventsByKey.toSeq).partitionBy(new HashPartitioner(numPartions)).map(e => e._2) I'm adding to the existing RDD with this: val mergedRDD = currentRDD.zipPartitions(partionedEventsRdd, true) { (currentIter, batchIter) => val uniqEvents = ListBuffer[String]() val uids = scala.collection.mutable.Map[String, Boolean]() Array(currentIter, batchIter).foreach { iter => iter.foreach { event => val uid = getUID(event) if (!uids.contains(uid)) { uids(uid) = true; uniqEvents += event } } } uniqEvents.iterator } val count = mergedRDD.count The reason I'm doing it this way is that when I was doing: val mergedRDD = currentRDD.union(batchRDD).coalesce(numPartions).distinct val count = mergedRDD.count it started taking a long time and doing a lot of shuffles. The zipPartitions approach does perform better, though after running an hour or so I start seeing this in the webUI. http://apache-spark-user-list.1001560.n3.nabble.com/file/n19112/Executors.png As you can see, most of the data is skewing to just 2 executors, with 1 getting more than half the blocks. These become a hotspot and eventually I start seeing OOM errors. I've tried this half a dozen times and the 'hot' executors change, but not the skewing behavior. Any idea what is going on here? Thanks, Mike
RE: Spark streaming cannot receive any message from Kafka
Hi Bill, Would you mind describing what you found a little more specifically? I'm not sure there's a parameter in KafkaUtils.createStream where you can specify the Spark parallelism; also, what are the exception stacks? Thanks Jerry From: Bill Jay [mailto:bill.jaypeter...@gmail.com] Sent: Tuesday, November 18, 2014 2:47 AM To: Helena Edelson Cc: Jay Vyas; u...@spark.incubator.apache.org; Tobias Pfeiffer; Shao, Saisai Subject: Re: Spark streaming cannot receive any message from Kafka Hi all, I find the reason of this issue. It seems in the new version, if I do not specify spark.default.parallelism in KafkaUtils.createstream, there will be an exception since the kakfa stream creation stage. In the previous versions, it seems Spark will use the default value. Thanks! Bill On Thu, Nov 13, 2014 at 5:00 AM, Helena Edelson helena.edel...@datastax.com wrote: I encounter no issues with streaming from kafka to spark in 1.1.0. Do you perhaps have a version conflict? Helena On Nov 13, 2014 12:55 AM, Jay Vyas jayunit100.apa...@gmail.com wrote: Yup, very important that n > 1 for spark streaming jobs. If local, use local[2]. The thing to remember is that your spark receiver will take a thread to itself and produce data, so you need another thread to consume it. In a cluster manager like yarn or mesos, the word thread is not used anymore, I guess has different meaning: you need 2 or more free compute slots, and that should be guaranteed by looking to see how many free node managers are running etc. On Nov 12, 2014, at 7:53 PM, Shao, Saisai saisai.s...@intel.com wrote: Did you configure Spark master as local? It should be local[n], n > 1, for local mode. Beside there’s a Kafka wordcount example in Spark Streaming example, you can try that. I’ve tested with latest master, it’s OK. Thanks Jerry From: Tobias Pfeiffer [mailto:t...@preferred.jp] Sent: Thursday, November 13, 2014 8:45 AM To: Bill Jay Cc: u...@spark.incubator.apache.org Subject: Re: Spark streaming cannot receive any message from Kafka Bill, However, when I am currently using Spark 1.1.0, the Spark streaming job cannot receive any messages from Kafka. I have not made any change to the code. Do you see any suspicious messages in the log output? Tobias
Re: Using data in RDD to specify HDFS directory to write to
Yes, thank you for the suggestion. The error I found below was in the worker logs. AssociationError [akka.tcp://sparkwor...@cloudera01.local.company.com:7078] - [akka.tcp://sparkexecu...@cloudera01.local.company.com:33329]: Error [Association failed with [akka.tcp://sparkexecu...@cloudera01.local.company.com:33329]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkexecu...@cloudera01.local.company.com:33329] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: cloudera01.local.company.com/10.40.19.67:33329 ] I looked into suggestions for this type of error; before I found out the real cause, I upgraded my CDH to 5.2 so I could try setting the driver and executor ports rather than have Spark choose them at random. My boss later turned off iptables and I no longer get that error. I do get a different one, however. I have gone back into my project and changed my Hadoop version to 2.5.0-cdh5.2.0, so that should not be a problem. From the master logs: 2014-11-17 18:09:49,707 ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkmas...@cloudera01.local.local.com:7077] - [akka.tcp://spark@localhost:38181]: Error [Association failed with [akka.tcp://spark@localhost:38181]] [ akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@localhost:38181] Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: localhost/127.0.0.1:38181 ] 2014-11-17 18:19:08,271 INFO akka.actor.LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.40.19.67%3A37795-29#-1248895472] was not delivered. [30] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
2014-11-17 18:19:28,251 ERROR Remoting: org.apache.spark.deploy.ApplicationDescription; local class incompatible: stream classdesc serialVersionUID = 583745679236071411, local class serialVersionUID = 7674242335164700840 java.io.InvalidClassException: org.apache.spark.deploy.ApplicationDescription; local class incompatible: stream classdesc serialVersionUID = 583745679236071411, local class serialVersionUID = 7674242335164700840 at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136) at akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104) at scala.util.Try$.apply(Try.scala:161) at akka.serialization.Serialization.deserialize(Serialization.scala:98) at akka.remote.serialization.MessageContainerSerializer.fromBinary(MessageContainerSerializer.scala:58) at akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104) at scala.util.Try$.apply(Try.scala:161) at akka.serialization.Serialization.deserialize(Serialization.scala:98) at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23) at akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:55) at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:55) at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:73) at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:764) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at
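The serialVersionUID mismatch on ApplicationDescription above is the classic symptom of the application being compiled against a different Spark version than the one the cluster is running. As a hedged sketch (not a confirmed diagnosis of this particular cluster), the sbt dependencies could be pinned to the cluster's versions and marked provided so the assembly does not ship a second, conflicting copy; the version strings here are placeholders to be matched against your install:

// build.sbt sketch -- match these versions to what the cluster actually runs
resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0" % "provided",
  "org.apache.hadoop" % "hadoop-client" % "2.5.0-cdh5.2.0" % "provided"
)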
Re: Communication between Driver and Executors
Hi, so I didn't manage to get the Broadcast variable with a new value distributed to my executors in YARN mode. In local mode it worked fine, but when running on YARN either nothing happened (when unpersist() was called on the driver) or I got a TimeoutException (when called on the executor). I finally dropped the use of broadcast variables and added a HTTP polling mechanism from the executors to the driver. I find that a bit suboptimal, in particular since there is this whole Akka infrastructure already running and I should be able to just send messages around. However, Spark does not seem to encourage this. (In general I find that private is a bit overused in the Spark codebase...) Thanks Tobias
Re: Is there setup and cleanup function in spark?
Hi, On Fri, Nov 14, 2014 at 2:49 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Ok, then we need another trick. Let's have an implicit lazy var connection/context around our code. And setup() will trigger the eval and initialization. Due to lazy evaluation, I think having setup/teardown is a bit tricky. In particular teardown, because it is not easy to execute code after all computation is done. You can check http://apache-spark-user-list.1001560.n3.nabble.com/Keep-state-inside-map-function-tp10968p11009.html for an example of what worked for me. Tobias
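For reference, the pattern discussed in that linked thread amounts to doing setup and teardown inside mapPartitions; a minimal sketch, where rdd, createConnection and process are hypothetical placeholders, with the caveat that an empty partition never reaches the teardown branch:

rdd.mapPartitions { iter =>
  val conn = createConnection()        // setup, once per partition
  iter.map { x =>
    val out = process(conn, x)         // per-record business logic
    if (!iter.hasNext) conn.close()    // teardown after the last element
    out
  }
}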
Re: SparkSQL exception on spark.sql.codegen
Hi Michael, We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4. On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust mich...@databricks.com wrote: What version of Spark SQL? On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen zhpeng...@gmail.com wrote: Hi all, We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, we got exceptions as below, has anyone else saw these before? java.lang.ExceptionInInitializerError at org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92) at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:51) at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:48) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.NullPointerException at scala.reflect.internal.Types$TypeRef.computeHashCode(Types.scala:2358) at scala.reflect.internal.Types$UniqueType.init(Types.scala:1304) at scala.reflect.internal.Types$TypeRef.init(Types.scala:2341) at scala.reflect.internal.Types$NoArgsTypeRef.init(Types.scala:2137) at scala.reflect.internal.Types$TypeRef$$anon$6.init(Types.scala:2544) at scala.reflect.internal.Types$TypeRef$.apply(Types.scala:2544) at scala.reflect.internal.Types$class.typeRef(Types.scala:3615) at scala.reflect.internal.SymbolTable.typeRef(SymbolTable.scala:13) at scala.reflect.internal.Symbols$TypeSymbol.newTypeRef(Symbols.scala:2752) at scala.reflect.internal.Symbols$TypeSymbol.typeConstructor(Symbols.scala:2806) at scala.reflect.internal.Symbols$SymbolContextApiImpl.toTypeConstructor(Symbols.scala:103) at scala.reflect.internal.Symbols$TypeSymbol.toTypeConstructor(Symbols.scala:2698) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$typecreator1$1.apply(CodeGenerator.scala:46) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) at scala.reflect.api.Universe.typeOf(Universe.scala:59) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.init(CodeGenerator.scala:46) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.init(GenerateProjection.scala:29) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.clinit(GenerateProjection.scala) ... 15 more -- Best Regards -- Best Regards
Re: Status of MLLib exporting models to PMML
Hi Charles, I am not aware of other storage formats. Perhaps Sean or Sandy can elaborate more given their experience with Oryx. There is work by Smola et al at Google that talks about large scale model update and deployment. https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu -Manish On Sunday, November 16, 2014, Charles Earl charles.ce...@gmail.com wrote: Manish and others, A follow up question on my mind is whether there are protobuf (or other binary format) frameworks in the vein of PMML. Perhaps scientific data storage frameworks like netcdf, root are possible also. I like the comprehensiveness of PMML but as you mention the complexity of management for large models is a concern. Cheers On Fri, Nov 14, 2014 at 1:35 AM, Manish Amde manish...@gmail.com wrote: @Aris, we are closely following the PMML work that is going on and as Xiangrui mentioned, it might be easier to migrate models such as logistic regression and then migrate trees. Some of the models get fairly large (as pointed out by Sung Chung) with deep trees as building blocks and we might have to consider a distributed storage and prediction strategy. On Tuesday, November 11, 2014, Xiangrui Meng men...@gmail.com wrote: Vincenzo sent a PR and included k-means as an example. Sean is helping review it. PMML standard is quite large. So we may start with simple model export, like linear methods, then move forward to tree-based. -Xiangrui On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com wrote: Hello Spark and MLLib folks, So a common problem in the real world of using machine learning is that some data analysis use tools like R, but the more data engineers out there will use more advanced systems like Spark MLLib or even Python Scikit Learn. In the real world, I want to have a system where multiple different modeling environments can learn from data / build models, represent the models in a common language, and then have a layer which just takes the model and run model.predict() all day long -- scores the models in other words. It looks like the project openscoring.io and jpmml-evaluator are some amazing systems for this, but they fundamentally use PMML as the model representation here. I have read some JIRA tickets that Xiangrui Meng is interested in getting PMML implemented to export MLLib models, is that happening? Further, would something like Manish Amde's boosted ensemble tree methods be representable in PMML? Thank you!! Aris -- - Charles
Re: SparkSQL exception on spark.sql.codegen
Yes, it always appears on a subset of the tasks in a stage (i.e. 100/100 (65 failed)), and sometimes causes the stage to fail. And there is another error; I'm not sure whether there is a correlation. java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$ at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:114) at org.apache.spark.sql.execution.Filter.conditionEvaluator$lzycompute(basicOperators.scala:55) at org.apache.spark.sql.execution.Filter.conditionEvaluator(basicOperators.scala:55) at org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:58) at org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:57) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) On Tue, Nov 18, 2014 at 11:41 AM, Michael Armbrust mich...@databricks.com wrote: Interesting, I believe we have run that query with version 1.1.0 with codegen turned on and not much has changed there. Is the error deterministic? On Mon, Nov 17, 2014 at 7:04 PM, Eric Zhen zhpeng...@gmail.com wrote: Hi Michael, We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4. On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust mich...@databricks.com wrote: What version of Spark SQL? On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen zhpeng...@gmail.com wrote: Hi all, We ran SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true and got the exceptions below; has anyone else seen these before?
java.lang.ExceptionInInitializerError at org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92) at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:51) at org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:48) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.NullPointerException at scala.reflect.internal.Types$TypeRef.computeHashCode(Types.scala:2358) at scala.reflect.internal.Types$UniqueType.init(Types.scala:1304) at scala.reflect.internal.Types$TypeRef.init(Types.scala:2341) at scala.reflect.internal.Types$NoArgsTypeRef.init(Types.scala:2137) at scala.reflect.internal.Types$TypeRef$$anon$6.init(Types.scala:2544) at scala.reflect.internal.Types$TypeRef$.apply(Types.scala:2544) at scala.reflect.internal.Types$class.typeRef(Types.scala:3615) at scala.reflect.internal.SymbolTable.typeRef(SymbolTable.scala:13) at scala.reflect.internal.Symbols$TypeSymbol.newTypeRef(Symbols.scala:2752) at scala.reflect.internal.Symbols$TypeSymbol.typeConstructor(Symbols.scala:2806) at
Re: Status of MLLib exporting models to PMML
I'm just using PMML. I haven't hit any limitation of its expressiveness, for the model types it supports. I don't think there is a point in defining a new format for models, excepting that PMML can get very big. Still, just compressing the XML gets it down to a manageable size for just about any realistic model.* I can imagine some kind of translation from PMML-in-XML to PMML-in-something-else that is more compact. I've not seen anyone do this. * there still aren't formats for factored matrices and probably won't ever quite be, since they're just too large for a file format. On Tue, Nov 18, 2014 at 5:34 AM, Manish Amde manish...@gmail.com wrote: Hi Charles, I am not aware of other storage formats. Perhaps Sean or Sandy can elaborate more given their experience with Oryx. There is work by Smola et al at Google that talks about large scale model update and deployment. https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu -Manish
Re: Is there setup and cleanup function in spark?
I see. Agree that lazy eval is not suitable for proper setup and teardown. We also abandoned it due to the inherent incompatibility between implicit and lazy. It was fun to come up with this trick though. Jianshi On Tue, Nov 18, 2014 at 10:28 AM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Fri, Nov 14, 2014 at 2:49 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Ok, then we need another trick. Let's have an implicit lazy var connection/context around our code. And setup() will trigger the eval and initialization. Due to lazy evaluation, I think having setup/teardown is a bit tricky. In particular teardown, because it is not easy to execute code after all computation is done. You can check http://apache-spark-user-list.1001560.n3.nabble.com/Keep-state-inside-map-function-tp10968p11009.html for an example of what worked for me. Tobias -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
Running PageRank in GraphX
Hi, I just ran the PageRank code in GraphX with some sample data. What I am seeing is that the total rank changes drastically if I change the number of iterations from 10 to 100. Why is that so? Thank You
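For reference, GraphX (as of 1.1) exposes both a fixed-iteration and a tolerance-based PageRank; a minimal sketch, with the edge-list path as a placeholder. One plausible explanation for the drastic change is that 10 iterations simply has not converged yet, which comparing against the tolerance-based variant can confirm:

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edges.txt")
val fixed = graph.staticPageRank(10).vertices    // stops after exactly 10 iterations
val conv  = graph.pageRank(0.0001).vertices      // runs until ranks change by less than the tolerance
println(fixed.map(_._2).sum() + " vs " + conv.map(_._2).sum())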
Null pointer exception with larger datasets
Hi, I have a list of Students whose size is one lakh (100,000) and I am trying to save it to a file. It is throwing a null pointer exception. JavaRDD<Student> distData = sc.parallelize(list); distData.saveAsTextFile("hdfs://master/data/spark/instruments.txt"); 14/11/18 01:33:21 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, master): java.lang.NullPointerException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158) org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) java.lang.Thread.run(Thread.java:745) How do I handle this? -Naveen
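For what it's worth, the trace points at the element-to-text step inside saveAsTextFile, which is consistent with null entries inside the list (rather than the list reference itself being null). A hedged pre-check sketch in Scala, with students as a placeholder for the local collection:

val nullCount = students.count(_ == null)            // students: a local List[Student]
require(nullCount == 0, nullCount + " null entries in input list")
val distData = sc.parallelize(students)
distData.saveAsTextFile("hdfs://master/data/spark/instruments.txt")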
Probability in Naive Bayes
I am trying to use Naive Bayes for a project of mine in Python and I want to obtain the probability value after having built the model. Suppose I have two classes - A and B. Currently there is an API to find which class a sample belongs to (predict). Now, I want to find the probability of it belonging to Class A or Class B. Can you please provide any details about how I can obtain the probability values for it belonging to either class A or class B?
Is it safe to use Scala 2.11 for Spark build?
Any notable issues for using Scala 2.11? Is it stable now? Or can I use Scala 2.11 in my spark application and use a Spark dist built with 2.10? I'm looking forward to migrating to 2.11 for some quasiquote features. Couldn't make it run in 2.10... Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
Re: Probability in Naive Bayes
This was recently discussed on this mailing list. You can't get the probabilities out directly now, but you can hack a bit to get the internal data structures of NaiveBayesModel and compute it from there. If you really mean the probability of either A or B, then if your classes are exclusive it is just the sum of the class probabilities. You won't be able to compute this otherwise from what Naive Bayes computes. On Nov 18, 2014 7:42 AM, Samarth Mailinglist mailinglistsama...@gmail.com wrote: I am trying to use Naive Bayes for a project of mine in Python and I want to obtain the probability value after having built the model. Suppose I have two classes - A and B. Currently there is an API to to find which class a sample belongs to (predict). Now, I want to find the probability of it belonging to Class A or Class B. Can you please provide any details about how I can obtain the probability values for it belonging to the either class A or class B?
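To make that hack concrete, here is a hedged Scala sketch of recomputing normalized class posteriors from the model internals; it assumes the 1.1-era NaiveBayesModel, whose pi (log class priors) and theta (log feature likelihoods) are reachable as plain arrays, which may vary by version:

import org.apache.spark.mllib.classification.NaiveBayesModel
import org.apache.spark.mllib.linalg.Vector

def classPosteriors(model: NaiveBayesModel, x: Vector): Array[Double] = {
  val features = x.toArray
  val logProbs = model.pi.zip(model.theta).map { case (logPrior, logTheta) =>
    logPrior + features.zip(logTheta).map { case (xi, ti) => xi * ti }.sum
  }
  val m = logProbs.max                              // subtract the max for numerical stability
  val unnormalized = logProbs.map(lp => math.exp(lp - m))
  val total = unnormalized.sum
  unnormalized.map(_ / total)
}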
Re: Is it safe to use Scala 2.11 for Spark build?
It is safe in the sense we would help you with the fix if you run into issues. I have used it, but since I worked on the patch the opinion can be biased. I am using scala 2.11 for day to day development. You should check out the build instructions here: https://github.com/ScrapCodes/spark-1/blob/patch-3/docs/building-spark.md Prashant Sharma On Tue, Nov 18, 2014 at 12:19 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Any notable issues for using Scala 2.11? Is it stable now? Or can I use Scala 2.11 in my spark application and use Spark dist build with 2.10 ? I'm looking forward to migrate to 2.11 for some quasiquote features. Couldn't make it run in 2.10... Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
Re: Check your cluster UI to ensure that workers are registered and have sufficient memory
I ran into this issue with Spark on YARN version 1.0.2. Are there any hints?
Re: How can I apply such an inner join in Spark Scala/Python
A simple join would do it. val a: List[(Int, Int)] = List((1,2),(2,4),(3,6)) val b: List[(Int, Int)] = List((1,3),(2,5),(3,6), (4,5),(5,6)) val A = sparkContext.parallelize(a) val B = sparkContext.parallelize(b) val ac = new PairRDDFunctions[Int, Int](A) val C = ac.join(B) C.foreach(println) Thanks Best Regards On Mon, Nov 17, 2014 at 11:54 PM, Sean Owen so...@cloudera.com wrote: Just RDD.join() should be an inner join. On Mon, Nov 17, 2014 at 5:51 PM, Blind Faith person.of.b...@gmail.com wrote: So let us say I have RDDs A and B with the following values. A = [ (1, 2), (2, 4), (3, 6) ] B = [ (1, 3), (2, 5), (3, 6), (4, 5), (5, 6) ] I want to apply an inner join, such that I get the following as a result. C = [ (1, (2, 3)), (2, (4, 5)), (3, (6,6)) ] That is, those keys which are not present in A should disappear after the left inner join. How can I achieve that? I can see outerJoin functions but no innerJoin functions in the Spark RDD class.
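As a quick check on the example above, C.collect() should return the inner-joined pairs Array((1,(2,3)), (2,(4,5)), (3,(6,6))) (element order may vary across partitions), which matches the C the original question asked for.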
Re: Is it safe to use Scala 2.11 for Spark build?
Looks like sbt/sbt -Pscala-2.11 is broken by a recent patch for improving the Maven build. Prashant Sharma On Tue, Nov 18, 2014 at 12:57 PM, Prashant Sharma scrapco...@gmail.com wrote: It is safe in the sense we would help you with the fix if you run into issues. I have used it, but since I worked on the patch the opinion can be biased. I am using scala 2.11 for day to day development. You should checkout the build instructions here : https://github.com/ScrapCodes/spark-1/blob/patch-3/docs/building-spark.md Prashant Sharma On Tue, Nov 18, 2014 at 12:19 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Any notable issues for using Scala 2.11? Is it stable now? Or can I use Scala 2.11 in my spark application and use Spark dist build with 2.10 ? I'm looking forward to migrate to 2.11 for some quasiquote features. Couldn't make it run in 2.10... Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github Blog: http://huangjs.github.com/
Re: Null pointer exception with larger datasets
Make sure your list is not null; if it is null then it's more like doing: JavaRDD<Student> distData = sc.parallelize(null) distData.foreach(println) Thanks Best Regards On Tue, Nov 18, 2014 at 12:07 PM, Naveen Kumar Pokala npok...@spcapitaliq.com wrote: Hi, I have a list of Students whose size is one lakh and I am trying to save it to a file. It is throwing a null pointer exception. JavaRDD<Student> distData = sc.parallelize(list); distData.saveAsTextFile("hdfs://master/data/spark/instruments.txt"); 14/11/18 01:33:21 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, master): java.lang.NullPointerException: org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158) org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) java.lang.Thread.run(Thread.java:745) How do I handle this? -Naveen
Re: Is it safe to use Scala 2.11 for Spark build?
Hi Prashant Sharma, It's not even OK to build with the scala-2.11 profile on my machine. Just check out master (c6e0c2ab1c29c184a9302d23ad75e4ccd8060242) and run sbt/sbt -Pscala-2.11 clean assembly: ... (skipping the normal part) [info] Resolving org.scalamacros#quasiquotes_2.11;2.0.1 ... [warn] module not found: org.scalamacros#quasiquotes_2.11;2.0.1 [warn] local: tried [warn] /Users/yexianjin/.ivy2/local/org.scalamacros/quasiquotes_2.11/2.0.1/ivys/ivy.xml [warn] public: tried [warn] https://repo1.maven.org/maven2/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom [warn] central: tried [warn] https://repo1.maven.org/maven2/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom [warn] apache-repo: tried [warn] https://repository.apache.org/content/repositories/releases/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom [warn] jboss-repo: tried [warn] https://repository.jboss.org/nexus/content/repositories/releases/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom [warn] mqtt-repo: tried [warn] https://repo.eclipse.org/content/repositories/paho-releases/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom [warn] cloudera-repo: tried [warn] https://repository.cloudera.com/artifactory/cloudera-repos/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom [warn] mapr-repo: tried [warn] http://repository.mapr.com/maven/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom [warn] spring-releases: tried [warn] https://repo.spring.io/libs-release/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom [warn] spark-staging: tried [warn] https://oss.sonatype.org/content/repositories/orgspark-project-1085/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom [warn] spark-staging-hive13: tried [warn] https://oss.sonatype.org/content/repositories/orgspark-project-1089/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom [warn] apache.snapshots: tried [warn] http://repository.apache.org/snapshots/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom [warn] Maven2 Local: tried [warn] file:/Users/yexianjin/.m2/repository/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom [info] Resolving jline#jline;2.12 ... [warn] :: [warn] :: UNRESOLVED DEPENDENCIES :: [warn] :: [warn] :: org.scalamacros#quasiquotes_2.11;2.0.1: not found [warn] :: [info] Resolving org.scala-lang#scala-library;2.11.2 ... [warn] [warn] Note: Unresolved dependencies path: [warn] org.scalamacros:quasiquotes_2.11:2.0.1 ((com.typesafe.sbt.pom.MavenHelper) MavenHelper.scala#L76) [warn] +- org.apache.spark:spark-catalyst_2.11:1.2.0-SNAPSHOT [info] Resolving jline#jline;2.12 ... [info] Done updating. [info] Updating {file:/Users/yexianjin/spark/}streaming-twitter... [info] Updating {file:/Users/yexianjin/spark/}streaming-zeromq... [info] Updating {file:/Users/yexianjin/spark/}streaming-flume... [info] Updating {file:/Users/yexianjin/spark/}streaming-mqtt... [info] Resolving jline#jline;2.12 ... [info] Done updating. [info] Resolving com.esotericsoftware.minlog#minlog;1.2 ... [info] Updating {file:/Users/yexianjin/spark/}streaming-kafka... [info] Resolving jline#jline;2.12 ... [info] Done updating. [info] Resolving jline#jline;2.12 ... [info] Done updating. [info] Resolving jline#jline;2.12 ... [info] Done updating. [info] Resolving org.apache.kafka#kafka_2.11;0.8.0 ...
[warn] module not found: org.apache.kafka#kafka_2.11;0.8.0 [warn] local: tried [warn] /Users/yexianjin/.ivy2/local/org.apache.kafka/kafka_2.11/0.8.0/ivys/ivy.xml [warn] public: tried [warn] https://repo1.maven.org/maven2/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom [warn] central: tried [warn] https://repo1.maven.org/maven2/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom [warn] apache-repo: tried [warn] https://repository.apache.org/content/repositories/releases/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom [warn] jboss-repo: tried [warn] https://repository.jboss.org/nexus/content/repositories/releases/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom [warn] mqtt-repo: tried [warn] https://repo.eclipse.org/content/repositories/paho-releases/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom [warn] cloudera-repo: tried [warn] https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom [warn] mapr-repo: tried [warn] http://repository.mapr.com/maven/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom [warn] spring-releases: tried [warn] https://repo.spring.io/libs-release/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom [warn]
Spark On Yarn Issue: Initial job has not accepted any resources
Hi All: I was submitting a spark_program.jar to the `spark on yarn cluster` on a driver machine in yarn-client mode. Here is the spark-submit command I used: ./spark-submit --master yarn-client --class com.charlie.spark.grax.OldFollowersExample --queue dt_spark ~/script/spark-flume-test-0.1-SNAPSHOT-hadoop2.0.0-mr1-cdh4.2.1.jar The queue `dt_spark` was free, and the program was submitted successfully and ran on the cluster. But on the console, it repeatedly showed: 14/11/18 15:11:48 WARN YarnClientClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory Checking the cluster UI logs, I found no errors: SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/home/disk5/yarn/usercache/linqili/filecache/6957209742046754908/spark-assembly-1.0.2-hadoop2.0.0-cdh4.2.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-2.0.0-cdh4.2.1/share/hadoop/common/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 14/11/18 14:28:16 INFO SecurityManager: Changing view acls to: hadoop,linqili 14/11/18 14:28:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop, linqili) 14/11/18 14:28:17 INFO Slf4jLogger: Slf4jLogger started 14/11/18 14:28:17 INFO Remoting: Starting remoting 14/11/18 14:28:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkyar...@longzhou-hdp3.lz.dscc:37187] 14/11/18 14:28:17 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkyar...@longzhou-hdp3.lz.dscc:37187] 14/11/18 14:28:17 INFO ExecutorLauncher: ApplicationAttemptId: appattempt_1415961020140_0325_01 14/11/18 14:28:17 INFO ExecutorLauncher: Connecting to ResourceManager at longzhou-hdpnn.lz.dscc/192.168.19.107:12032 14/11/18 14:28:17 INFO ExecutorLauncher: Registering the ApplicationMaster 14/11/18 14:28:18 INFO ExecutorLauncher: Waiting for spark driver to be reachable. 14/11/18 14:28:18 INFO ExecutorLauncher: Master now available: 192.168.59.90:36691 14/11/18 14:28:18 INFO ExecutorLauncher: Listen to driver: akka.tcp://spark@192.168.59.90:36691/user/CoarseGrainedScheduler 14/11/18 14:28:18 INFO ExecutorLauncher: Allocating 1 executors. 14/11/18 14:28:18 INFO YarnAllocationHandler: Allocating 1 executor containers with 1408 of memory each. 14/11/18 14:28:18 INFO YarnAllocationHandler: ResourceRequest (host : *, num containers: 1, priority = 1 , capability : memory: 1408) 14/11/18 14:28:18 INFO YarnAllocationHandler: Allocating 1 executor containers with 1408 of memory each. 14/11/18 14:28:18 INFO YarnAllocationHandler: ResourceRequest (host : *, num containers: 1, priority = 1 , capability : memory: 1408) 14/11/18 14:28:18 INFO RackResolver: Resolved longzhou-hdp3.lz.dscc to /rack1 14/11/18 14:28:18 INFO YarnAllocationHandler: launching container on container_1415961020140_0325_01_02 host longzhou-hdp3.lz.dscc 14/11/18 14:28:18 INFO ExecutorRunnable: Starting Executor Container 14/11/18 14:28:18 INFO ExecutorRunnable: Connecting to ContainerManager at longzhou-hdp3.lz.dscc:12040 14/11/18 14:28:18 INFO ExecutorRunnable: Setting up ContainerLaunchContext 14/11/18 14:28:18 INFO ExecutorRunnable: Preparing Local resources 14/11/18 14:28:18 INFO ExecutorLauncher: All executors have launched.
14/11/18 14:28:18 INFO ExecutorLauncher: Started progress reporter thread - sleep time : 5000 14/11/18 14:28:18 INFO YarnAllocationHandler: ResourceRequest (host : *, num containers: 0, priority = 1 , capability : memory: 1408) 14/11/18 14:28:18 INFO ExecutorRunnable: Prepared Local resources Map(__spark__.jar - resource {, scheme: hdfs, host: longzhou-hdpnn.lz.dscc, port: 11000, file: /user/linqili/.sparkStaging/application_1415961020140_0325/spark-assembly-1.0.2-hadoop2.0.0-cdh4.2.1.jar, }, size: 134859131, timestamp: 1416292093988, type: FILE, visibility: PRIVATE, ) 14/11/18 14:28:18 INFO ExecutorRunnable: Setting up executor with commands: List($JAVA_HOME/bin/java, -server, -XX:OnOutOfMemoryError='kill %p', -Xms1024m -Xmx1024m , -Djava.security.krb5.conf=/home/linqili/proc/spark_client/hadoop/kerberos5-client/etc/krb5.conf -Djava.library.path=/home/linqili/proc/spark_client/hadoop/lib/native/Linux-amd64-64, -Djava.io.tmpdir=$PWD/tmp, -Dlog4j.configuration=log4j-spark-container.properties, org.apache.spark.executor.CoarseGrainedExecutorBackend, akka.tcp://spark@192.168.59.90:36691/user/CoarseGrainedScheduler, 1, longzhou-hdp3.lz.dscc, 3, 1, LOG_DIR/stdout, 2, LOG_DIR/stderr) 14/11/18 14:28:23 INFO YarnAllocationHandler: ResourceRequest (host : *, num containers: 0, priority = 1 , capability : memory: 1408) 14/11/18 14:28:23 INFO YarnAllocationHandler: Completed container
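If it helps anyone hitting the same warning: it usually means YARN cannot actually grant the resources the application is asking for, or the executors are dying on startup. A hedged sketch of making the resource requests explicit and small enough to fit the queue (the flag values here are placeholder assumptions):

./spark-submit --master yarn-client \
  --num-executors 1 \
  --executor-memory 1g \
  --executor-cores 1 \
  --class com.charlie.spark.grax.OldFollowersExample \
  --queue dt_spark \
  ~/script/spark-flume-test-0.1-SNAPSHOT-hadoop2.0.0-mr1-cdh4.2.1.jar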