Re: spark-submit question

2014-11-17 Thread Samarth Mailinglist
I figured it out. I had to use pyspark.files.SparkFiles to get the locations of files loaded into Spark. On Mon, Nov 17, 2014 at 1:26 PM, Sean Owen so...@cloudera.com wrote: You are changing these paths and filenames to match your own actual scripts and file locations right? On Nov 17, 2014
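A minimal Scala sketch of the same mechanism, for the spark-shell (the file name and path are placeholders):

```scala
import org.apache.spark.SparkFiles

// Ship a local file to every node; the path is a placeholder.
sc.addFile("/local/path/lookup.txt")

// On the executors, resolve the shipped copy via SparkFiles.get.
val resolved = sc.parallelize(1 to 2).map(_ => SparkFiles.get("lookup.txt")).collect()
resolved.foreach(println)
```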

RandomGenerator class not found exception

2014-11-17 Thread Ritesh Kumar Singh
My sbt file for the project includes this: libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.1.0", "org.apache.spark" %% "spark-mllib" % "1.1.0", "org.apache.commons" % "commons-math3" % "3.3" ). Still I am

Re: RandomGenerator class not found exception

2014-11-17 Thread Akhil Das
Add this jar http://mvnrepository.com/artifact/org.apache.commons/commons-math3/3.3 while creating the SparkContext: sc.addJar("/path/to/commons-math3-3.3.jar"). And make sure it is shipped and available in the Environment tab (port 4040). Thanks Best Regards On Mon, Nov 17, 2014 at 1:54 PM, Ritesh

Re: RandomGenerator class not found exception

2014-11-17 Thread Chitturi Padma
Include the commons-math3-3.3 jar in the classpath while submitting the jar to the Spark cluster, like: spark-submit --driver-class-path maths3.3jar --class MainClass --master <spark cluster url> appjar On Mon, Nov 17, 2014 at 1:55 PM, Ritesh Kumar Singh [via Apache Spark User List]
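If you prefer to attach the jar from code rather than on the spark-submit command line, a hedged sketch (the path is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("random-generator-example")
  // Ship the commons-math3 dependency to the executors; the path is a placeholder.
  .setJars(Seq("/path/to/commons-math3-3.3.jar"))
val sc = new SparkContext(conf)

// Equivalent after the context exists:
// sc.addJar("/path/to/commons-math3-3.3.jar")
```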

Re: Functions in Spark

2014-11-17 Thread Gerard Maas
One 'rule of thumb' is to use rdd.toDebugString and check the lineage for ShuffleRDD. As long as there's no need for restructuring the RDD, operations can be pipelined on each partition. rdd.toDebugString is your friend :-) -kr, Gerard. On Mon, Nov 17, 2014 at 7:37 AM, Mukesh Jha
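For example, in the spark-shell (illustrative data):

```scala
val rdd = sc.parallelize(1 to 100)
  .map(x => (x % 10, x))
  .reduceByKey(_ + _)   // introduces a shuffle boundary
  .mapValues(_ * 2)     // pipelined on the already-shuffled partitions

// The lineage shows a ShuffledRDD only where the data had to be restructured.
println(rdd.toDebugString)
```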

Landmarks in GraphX section of Spark API

2014-11-17 Thread Deep Pradhan
Hi, I was going through the graphx section in the Spark API in https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ShortestPaths$ Here, I find the word landmark. Can anyone explain to me what landmark means? Is it a simple English word or does it mean
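In this API a "landmark" is simply a vertex ID you designate as a target: ShortestPaths computes, for every vertex, the shortest-path distance to each landmark. A hedged spark-shell sketch with an illustrative graph:

```scala
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib.ShortestPaths

// Tiny illustrative graph: 1 -> 2 -> 3 and 1 -> 3.
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(1L, 3L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// Vertices 1 and 3 are the "landmarks" we want distances to.
val result = ShortestPaths.run(graph, landmarks = Seq(1L, 3L))

// Each vertex now carries a Map(landmarkId -> hop distance).
result.vertices.collect().foreach(println)
```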

Re: How to incrementally compile spark examples using mvn

2014-11-17 Thread Sean Owen
The downloads just happen once so this is not a problem. If you are just building one module in a project, it needs a compiled copy of other modules. It will either use your locally-built and locally-installed artifact, or, download one from the repo if possible. This isn't needed if you are

Spark streaming batch overrun

2014-11-17 Thread facboy
Hi all, In this presentation (https://prezi.com/1jzqym68hwjp/spark-gotchas-and-anti-patterns/) it mentions that Spark Streaming's behaviour is undefined if a batch overruns the polling interval. Is this something that might be addressed in future or is it fundamental to the design? -- View

HDFS read text file

2014-11-17 Thread Naveen Kumar Pokala
Hi, JavaRDD<Instrument> studentsData = sc.parallelize(list); -- list is student info, List<Student>. studentsData.saveAsTextFile("hdfs://master/data/spark/instruments.txt"); The above statements save the students' information to HDFS as a text file. Each object is a line in the text file, as below.

Building Spark for Hive The requested profile hadoop-1.2 could not be activated because it does not exist.

2014-11-17 Thread akshayhazari
I am using Apache Hadoop 1.2.1. I wanted to use Spark SQL with Hive, so I tried to build Spark like so: mvn -Phive,hadoop-1.2 -Dhadoop.version=1.2.1 clean -DskipTests package But I get the following error: The requested profile hadoop-1.2 could not be activated because it does not

How to measure communication between nodes in Spark Standalone Cluster?

2014-11-17 Thread Hlib Mykhailenko
Hello, I use a Spark Standalone Cluster and I want to measure internode communication somehow. As I understand it, GraphX transfers only vertex values. Am I right? I do not just want the number of bytes transferred between any two nodes. So is there a way to measure how many values

Re: HDFS read text file

2014-11-17 Thread Akhil Das
You can use the sc.objectFile https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext to read it. It will be RDD[Student] type. Thanks Best Regards On Mon, Nov 17, 2014 at 4:03 PM, Naveen Kumar Pokala npok...@spcapitaliq.com wrote: Hi, JavaRDDInstrument
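A small spark-shell sketch of the round trip (the class, fields, and paths are placeholders); saveAsObjectFile/objectFile preserve the objects themselves, whereas saveAsTextFile only keeps their toString output:

```scala
case class Student(id: Int, name: String)

val students = sc.parallelize(Seq(Student(1, "A"), Student(2, "B")))

// Store the objects themselves (serialized sequence files).
students.saveAsObjectFile("hdfs://master/data/spark/students.obj")

// Read them back as RDD[Student] rather than RDD[String].
val restored = sc.objectFile[Student]("hdfs://master/data/spark/students.obj")
restored.collect().foreach(println)
```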

Re: How to measure communication between nodes in Spark Standalone Cluster?

2014-11-17 Thread Akhil Das
You can use Ganglia to see the overall data transfer across the cluster/nodes. I don't think there's a direct way to get the vertices being transferred. Thanks Best Regards On Mon, Nov 17, 2014 at 4:29 PM, Hlib Mykhailenko hlib.mykhaile...@inria.fr wrote: Hello, I use Spark Standalone

Re: HDFS read text file

2014-11-17 Thread Hlib Mykhailenko
Hello Naveen, I think you should first override the toString method of your sample.spark.test.Student class. -- Regards, Hlib Mykhailenko PhD student at INRIA Sophia-Antipolis Méditerranée 2004 Route des Lucioles BP93 06902 SOPHIA ANTIPOLIS cedex - Original Message - From:

Re: Building Spark for Hive The requested profile hadoop-1.2 could not be activated because it does not exist.

2014-11-17 Thread akshayhazari
Oops, I guess this is the right way to do it: mvn -Phive -Dhadoop.version=1.2.1 clean -DskipTests package -- View this message in context:

Re: Returning breeze.linalg.DenseMatrix from method

2014-11-17 Thread tribhuvan...@gmail.com
This should fix it -- def func(str: String): DenseMatrix[Double] = { ... ... } So, why is this required? Think of it like this -- If you hadn't explicitly mentioned Double, it might have been that the calling function expected a DenseMatrix[SomeOtherType], and performed a

Building Spark with hive does not work

2014-11-17 Thread Hao Ren
Hi, I am building Spark on the most recent master branch. I checked this page: https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md The cmd ./sbt/sbt -Phive -Phive-thirftserver clean assembly/assembly works fine. A fat jar is created. However, when I started the SQL-CLI,

Re: Returning breeze.linalg.DenseMatrix from method

2014-11-17 Thread Ritesh Kumar Singh
Yeah, it works. Although when I try to define a var of type DenseMatrix, like this: var mat1: DenseMatrix[Double] it gives an error saying we need to initialise the matrix mat1 at the time of declaration. I had to initialise it as: var mat1: DenseMatrix[Double] = DenseMatrix.zeros[Double](1,1)
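For reference, a compilable sketch of both points (a local var must be given an initial value, so the zeros matrix works as a placeholder):

```scala
import breeze.linalg.DenseMatrix

// Explicit element type in the return annotation avoids the inference problem.
def func(str: String): DenseMatrix[Double] = {
  DenseMatrix.zeros[Double](2, 2)
}

// A local var cannot be left uninitialized; give it a placeholder value.
var mat1: DenseMatrix[Double] = DenseMatrix.zeros[Double](1, 1)
mat1 = func("example")
```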

Re: How do you force a Spark Application to run in multiple tasks

2014-11-17 Thread Daniel Siegmann
I've never used Mesos, sorry. On Fri, Nov 14, 2014 at 5:30 PM, Steve Lewis lordjoe2...@gmail.com wrote: The cluster runs Mesos and I can see the tasks in the Mesos UI but most are not doing much - any hints about that UI On Fri, Nov 14, 2014 at 11:39 AM, Daniel Siegmann

Re: How to measure communication between nodes in Spark Standalone Cluster?

2014-11-17 Thread Yifan LI
I am not sure there is a direct way(an api in graphx, etc) to measure the number of transferred vertex values among nodes during computation. It might depend on: - the operations in your application, e.g. only communicate with its immediate neighbours for each vertex. - the partition strategy

Re: same error of SPARK-1977 while using trainImplicit in mllib 1.0.2

2014-11-17 Thread aaronlin
Thanks. It works for me. -- Aaron Lin On Saturday, November 15, 2014 at 1:19 AM, Xiangrui Meng wrote: If you use the Kryo serializer, you need to register mutable.BitSet and Rating:
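The registration Xiangrui refers to looks roughly like this (a sketch; the registrator class name and the exact list of registered classes are assumptions, and the property must name the fully qualified class):

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.serializer.KryoRegistrator
import scala.collection.mutable

class ALSRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Rating])
    kryo.register(classOf[mutable.BitSet])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "ALSRegistrator") // use the fully qualified name
```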

Re: RDD.aggregate versus accumulables...

2014-11-17 Thread Daniel Siegmann
You should *never* use accumulators for this purpose because you may get incorrect answers. Accumulators can count the same thing multiple times - you cannot rely upon the correctness of the values they compute. See SPARK-732 https://issues.apache.org/jira/browse/SPARK-732 for more info. On Sun,
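A hedged sketch of the aggregate-based alternative (computing a count and a sum in one pass); the result flows through the RDD itself, so a retried task cannot double-count:

```scala
val data = sc.parallelize(1 to 1000)

// seqOp folds elements into a per-partition (count, sum);
// combOp merges the per-partition results.
val (count, sum) = data.aggregate((0L, 0L))(
  (acc, x) => (acc._1 + 1, acc._2 + x),
  (a, b)   => (a._1 + b._1, a._2 + b._2)
)

println(s"count=$count sum=$sum mean=${sum.toDouble / count}")
```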

Re: Building Spark with hive does not work

2014-11-17 Thread Cheng Lian
Hey Hao, Which commit are you using? Just tried 64c6b9b with exactly the same command line flags, couldn't reproduce this issue. Cheng On 11/17/14 10:02 PM, Hao Ren wrote: Hi, I am building spark on the most recent master branch. I checked this page:

Re: RDD.aggregate versus accumulables...

2014-11-17 Thread Surendranauth Hiraman
We use Algebird for calculating things like min/max, stddev, variance, etc. https://github.com/twitter/algebird/wiki -Suren On Mon, Nov 17, 2014 at 11:32 AM, Daniel Siegmann daniel.siegm...@velos.io wrote: You should *never* use accumulators for this purpose because you may get incorrect

How to broadcast a textFile?

2014-11-17 Thread YaoPau
I have a 1 million row file that I'd like to read from my edge node, and then send a copy of it to each Hadoop machine's memory in order to run JOINs in my spark streaming code. I see examples in the docs of how to use broadcast() for a simple array, but how about when the data is in a textFile?

Re: RandomGenerator class not found exception

2014-11-17 Thread Chitturi Padma
As you are using sbt, you need not put it in ~/.m2/repositories for Maven. Include the jar explicitly using the option --driver-class-path while submitting the jar to the Spark cluster. On Mon, Nov 17, 2014 at 7:41 PM, Ritesh Kumar Singh [via Apache Spark User List] ml-node+s1001560n1907...@n3.nabble.com

Re: Building Spark with hive does not work

2014-11-17 Thread Hao Ren
Sorry for spamming. Just after my previous post, I noticed that the command used was: ./sbt/sbt -Phive -Phive-thirftserver clean assembly/assembly -- it should be hive-thriftserver; the typo is the evil. Stupid me. I believe I just copy-pasted it from somewhere else but never checked it; meanwhile no error

Re: Building Spark with hive does not work

2014-11-17 Thread Ted Yu
Looks like this was where you got that commandline: http://search-hadoop.com/m/JW1q5RlPrl Cheers On Mon, Nov 17, 2014 at 9:44 AM, Hao Ren inv...@gmail.com wrote: Sry for spamming, Just after my previous post, I noticed that the command used is: ./sbt/sbt -Phive -Phive-thirftserver clean

How can I apply such an inner join in Spark Scala/Python

2014-11-17 Thread Blind Faith
So let us say I have RDDs A and B with the following values. A = [ (1, 2), (2, 4), (3, 6) ] B = [ (1, 3), (2, 5), (3, 6), (4, 5), (5, 6) ] I want to apply an inner join, such that I get the following as a result. C = [ (1, (2, 3)), (2, (4, 5)), (3, (6,6)) ] That is, those keys which are not

Exception in spark sql when running a group by query

2014-11-17 Thread Sadhan Sood
While testing sparkSQL, we were running this group by with expression query and got an exception. The same query worked fine on hive. SELECT from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200), '/MM/dd') as pst_date, count(*) as num_xyzs FROM all_matched_abc GROUP BY

How do I get the executor ID from running Java code

2014-11-17 Thread Steve Lewis
The spark UI lists a number of Executor IDS on the cluster. I would like to access both executor ID and Task/Attempt IDs from the code inside a function running on a slave machine. Currently my motivation is to examine parallelism and locality but in Hadoop this aids in allowing code to write
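One hedged sketch of that kind of probing: SparkEnv is an internal developer API whose shape varies between releases, so treat this purely as an illustration, not a supported interface:

```scala
import org.apache.spark.SparkEnv

// Record which executor handled which partition (illustrative only).
val placement = sc.parallelize(1 to 100, 8).mapPartitionsWithIndex {
  (partitionIndex, iter) =>
    val executorId = SparkEnv.get.executorId // internal API; may change
    Iterator((executorId, partitionIndex, iter.size))
}

placement.collect().foreach { case (exec, part, n) =>
  println(s"executor=$exec partition=$part records=$n")
}
```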

Spark streaming on Yarn

2014-11-17 Thread kpeng1
Hi, I have been using spark streaming in standalone mode and now I want to migrate to spark running on yarn, but I am not sure how you would go about designating a specific node in the cluster to act as an avro listener since I am using flume based push approach with spark. -- View

Re: How to broadcast a textFile?

2014-11-17 Thread YaoPau
OK then I'd still need to write the code (within my spark streaming code I'm guessing) to convert my text file into an object like a HashMap before broadcasting. How can I make sure only the HashMap is being broadcast while all the pre-processing to create the HashMap is only performed once?
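A sketch of that pattern (the path and the comma-separated format are placeholders): the file is read and reduced to a Map once on the driver, and only the finished map is broadcast, so executors never repeat the parsing:

```scala
import scala.io.Source

// Runs once on the driver.
val lookup: Map[String, String] = Source.fromFile("/local/path/lookup.txt")
  .getLines()
  .map { line =>
    val Array(k, v) = line.split(",", 2)
    k -> v
  }
  .toMap

// Only the finished map is shipped to the executors.
val lookupBc = sc.broadcast(lookup)

// Inside any transformation (batch or streaming), read the broadcast value.
val enriched = sc.parallelize(Seq("key1", "key2")).map { k =>
  (k, lookupBc.value.getOrElse(k, "unknown"))
}
```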

RE: RDD.aggregate versus accumulables...

2014-11-17 Thread Segerlind, Nathan L
Thanks for the link to the bug. Unfortunately, using accumulators like this is getting spread around as a recommended practice despite the bug. From: Daniel Siegmann [mailto:daniel.siegm...@velos.io] Sent: Monday, November 17, 2014 8:32 AM To: Segerlind, Nathan L Cc: user Subject: Re:

Re: How can I apply such an inner join in Spark Scala/Python

2014-11-17 Thread Sean Owen
Just RDD.join() should be an inner join. On Mon, Nov 17, 2014 at 5:51 PM, Blind Faith person.of.b...@gmail.com wrote: So let us say I have RDDs A and B with the following values. A = [ (1, 2), (2, 4), (3, 6) ] B = [ (1, 3), (2, 5), (3, 6), (4, 5), (5, 6) ] I want to apply an inner join,
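For example, in the spark-shell, mirroring the RDDs from the question:

```scala
val A = sc.parallelize(Seq((1, 2), (2, 4), (3, 6)))
val B = sc.parallelize(Seq((1, 3), (2, 5), (3, 6), (4, 5), (5, 6)))

// join is an inner join on the key: keys missing from either side are dropped.
val C = A.join(B)
C.collect().foreach(println)
// (1,(2,3)), (2,(4,5)), (3,(6,6))
```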

Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-11-17 Thread akhandeshi
"The only option is to split your problem further by increasing parallelism." My understanding is that this means increasing the number of partitions -- is that right? That didn't seem to help, because it seems the partitions are not uniformly sized. My observation is that when I increase the number of partitions, it

Re: Missing SparkSQLCLIDriver and Beeline drivers in Spark

2014-11-17 Thread Ted Yu
Minor correction: there was a typo in commandline hive-thirftserver should be hive-thriftserver Cheers On Thu, Aug 7, 2014 at 6:49 PM, Cheng Lian lian.cs@gmail.com wrote: Things have changed a bit in the master branch, and the SQL programming guide in master branch actually doesn’t apply

IOException: exception in uploadSinglePart

2014-11-17 Thread Justin Mills
Spark 1.1.0, running on AWS EMR cluster using yarn-client as master. I'm getting the following error when I attempt to save a RDD to S3. I've narrowed it down to a single partition that is ~150Mb in size (versus the other partitions that are closer to 1 Mb). I am able to work around this by

Re: Spark streaming cannot receive any message from Kafka

2014-11-17 Thread Bill Jay
Hi all, I found the reason for this issue. It seems in the new version, if I do not specify spark.default.parallelism in KafkaUtils.createStream, there will be an exception at the Kafka stream creation stage. In the previous versions, it seems Spark will use the default value. Thanks! Bill On
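For reference, a sketch of the receiver-based creation call under discussion (the ZooKeeper address, consumer group, and topic name are placeholders); the Int in the topic map is the number of receiver threads per topic, which is distinct from spark.default.parallelism:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))

// topic -> number of receiver threads for that topic (placeholder values).
val topics = Map("events" -> 2)
val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "my-consumer-group", topics)

stream.map(_._2).count().print()
ssc.start()
ssc.awaitTermination()
```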

independent user sessions with a multi-user spark sql thriftserver (Spark 1.1)

2014-11-17 Thread Michael Allman
Hello, We're running a spark sql thriftserver that several users connect to with beeline. One limitation we've run into is that the current working database (set with use db) is shared across all connections. So changing the database on one connection changes the database for all connections.

RE: RDD.aggregate versus accumulables...

2014-11-17 Thread lordjoe
I have been playing with using accumulators (despite the possible error with multiple attempts). These provide a convenient way to get some numbers while still performing business logic. I posted some sample code at http://lordjoesoftware.blogspot.com/. Even if accumulators are not perfect today -

Re: Assigning input files to spark partitions

2014-11-17 Thread Pala M Muthaia
Hi Daniel, Yes that should work also. However, is it possible to setup so that each RDD has exactly one partition, without repartitioning (and thus incurring extra cost)? Is there a mechanism similar to MR where we can ensure each partition is assigned some amount of data by size, by setting some

Re: Assigning input files to spark partitions

2014-11-17 Thread Daniel Siegmann
I'm not aware of any such mechanism. On Mon, Nov 17, 2014 at 2:55 PM, Pala M Muthaia mchett...@rocketfuelinc.com wrote: Hi Daniel, Yes that should work also. However, is it possible to setup so that each RDD has exactly one partition, without repartitioning (and thus incurring extra cost)?

Re: independent user sessions with a multi-user spark sql thriftserver (Spark 1.1)

2014-11-17 Thread Michael Armbrust
This is an unfortunate/known issue that we are hoping to address in the next release: https://issues.apache.org/jira/browse/SPARK-2087 I'm not sure how straightforward a fix would be, but it would involve keeping / setting the SessionState for each connection to the server. It would be great if

Re: Exception in spark sql when running a group by query

2014-11-17 Thread Michael Armbrust
You are perhaps hitting an issue that was fixed by #3248 https://github.com/apache/spark/pull/3248? On Mon, Nov 17, 2014 at 9:58 AM, Sadhan Sood sadhan.s...@gmail.com wrote: While testing sparkSQL, we were running this group by with expression query and got an exception. The same query worked

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Michael Armbrust
What version of Spark SQL? On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen zhpeng...@gmail.com wrote: Hi all, We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, we got exceptions as below, has anyone else saw these before? java.lang.ExceptionInInitializerError at

RDD Blocks skewing to just few executors

2014-11-17 Thread mtimper
Hi I'm running a standalone cluster with 8 worker servers. I'm developing a streaming app that is adding new lines of text to several different RDDs each batch interval. Each line has a well randomized unique identifier that I'm trying to use for partitioning, since the data stream does contain
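A hedged sketch of explicit key-based partitioning (the key extraction and partition count are placeholders); with a well-randomized key, HashPartitioner should spread blocks fairly evenly across executors:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed outside the shell in pre-1.3)

// The id-before-'|' key extraction below is purely illustrative.
val lines = sc.parallelize(Seq("id1|payload", "id2|payload"))
val keyed = lines.map { line =>
  val id = line.split("\\|")(0) // well-randomized unique identifier
  (id, line)
}

// Hash the key across a fixed number of partitions, then cache.
val partitioned = keyed.partitionBy(new HashPartitioner(16)).cache()
println(partitioned.partitions.length)
```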

RE: Spark streaming cannot receive any message from Kafka

2014-11-17 Thread Shao, Saisai
Hi Bill, Would you mind describing what you found a little more specifically? I'm not sure there is a parameter in KafkaUtils.createStream where you can specify the Spark parallelism; also, what is the exception stack? Thanks Jerry From: Bill Jay [mailto:bill.jaypeter...@gmail.com] Sent:

Re: Using data in RDD to specify HDFS directory to write to

2014-11-17 Thread jschindler
Yes, thank you for suggestion. The error I found below was in the worker logs. AssociationError [akka.tcp://sparkwor...@cloudera01.local.company.com:7078] - [akka.tcp://sparkexecu...@cloudera01.local.company.com:33329]: Error [Association failed with

Re: Communication between Driver and Executors

2014-11-17 Thread Tobias Pfeiffer
Hi, so I didn't manage to get the Broadcast variable with a new value distributed to my executors in YARN mode. In local mode it worked fine, but when running on YARN either nothing happened (when unpersist() was called on the driver) or I got a TimeoutException (when called on the executor). I

Re: Is there setup and cleanup function in spark?

2014-11-17 Thread Tobias Pfeiffer
Hi, On Fri, Nov 14, 2014 at 2:49 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Ok, then we need another trick. let's have an *implicit lazy var connection/context* around our code. And setup() will trigger the eval and initialization. Due to lazy evaluation, I think having

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Eric Zhen
Hi Michael, We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4. On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust mich...@databricks.com wrote: What version of Spark SQL? On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen zhpeng...@gmail.com wrote: Hi all, We run SparkSQL on TPCDS

Re: Status of MLLib exporting models to PMML

2014-11-17 Thread Manish Amde
Hi Charles, I am not aware of other storage formats. Perhaps Sean or Sandy can elaborate more given their experience with Oryx. There is work by Smola et al at Google that talks about large scale model update and deployment.

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Eric Zhen
Yes, it always appears on a part of the whole tasks in a stage (i.e. 100/100 (65 failed)), and sometimes causes the stage to fail. And there is another error that I'm not sure is correlated: java.lang.NoClassDefFoundError: Could not initialize class

Re: Status of MLLib exporting models to PMML

2014-11-17 Thread Sean Owen
I'm just using PMML. I haven't hit any limitation of its expressiveness for the model types it supports. I don't think there is a point in defining a new format for models, except that PMML can get very big. Still, just compressing the XML gets it down to a manageable size for just about any

Re: Is there setup and cleanup function in spark?

2014-11-17 Thread Jianshi Huang
I see. Agree that lazy eval is not suitable for proper setup and teardown. We also abandoned it due to inherent incompatibility between implicit and lazy. It was fun to come up this trick though. Jianshi On Tue, Nov 18, 2014 at 10:28 AM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Fri,

Running PageRank in GraphX

2014-11-17 Thread Deep Pradhan
Hi, I just ran the PageRank code in GraphX with some sample data. What I am seeing is that the total rank changes drastically if I change the number of iterations from 10 to 100. Why is that so? Thank You
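For reference, the two GraphX entry points being compared (the graph here is illustrative): staticPageRank runs a fixed number of iterations, while the tolerance-based pageRank iterates until the ranks converge, which is usually the fairer way to compare totals:

```scala
import org.apache.spark.graphx._

val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 1)

// Fixed number of iterations: ranks may not have converged yet.
val ranks10 = graph.staticPageRank(numIter = 10).vertices

// Iterate until the per-vertex change drops below the tolerance.
val ranksConverged = graph.pageRank(tol = 0.0001).vertices

println(ranks10.map(_._2).sum())
println(ranksConverged.map(_._2).sum())
```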

Null pointer exception with larger datasets

2014-11-17 Thread Naveen Kumar Pokala
Hi, I have a list of Students whose size is one lakh (100,000), and I am trying to save the file. It is throwing a null pointer exception. JavaRDD<Student> distData = sc.parallelize(list); distData.saveAsTextFile("hdfs://master/data/spark/instruments.txt"); 14/11/18 01:33:21 WARN scheduler.TaskSetManager: Lost

Probability in Naive Bayes

2014-11-17 Thread Samarth Mailinglist
I am trying to use Naive Bayes for a project of mine in Python and I want to obtain the probability value after having built the model. Suppose I have two classes - A and B. Currently there is an API to find which class a sample belongs to (predict). Now, I want to find the probability of it

Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Jianshi Huang
Any notable issues for using Scala 2.11? Is it stable now? Or can I use Scala 2.11 in my spark application and use Spark dist build with 2.10 ? I'm looking forward to migrate to 2.11 for some quasiquote features. Couldn't make it run in 2.10... Cheers, -- Jianshi Huang LinkedIn: jianshi

Re: Probability in Naive Bayes

2014-11-17 Thread Sean Owen
This was recently discussed on this mailing list. You can't get the probabilities out directly now, but you can hack a bit to get the internal data structures of NaiveBayesModel and compute it from there. If you really mean the probability of either A or B, then if your classes are exclusive it
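A Scala sketch of the kind of hack Sean mentions, assuming the model's pi (log class priors) and theta (log conditional probabilities) fields are accessible in your MLlib version; this is illustrative, not an official API:

```scala
import org.apache.spark.mllib.classification.NaiveBayesModel
import org.apache.spark.mllib.linalg.Vector

// Log-posterior per class, normalized with a softmax to yield probabilities.
def classProbabilities(model: NaiveBayesModel, features: Vector): Array[Double] = {
  val x = features.toArray
  val logPosterior = model.pi.indices.map { i =>
    model.pi(i) + model.theta(i).zip(x).map { case (t, xi) => t * xi }.sum
  }
  val maxLog = logPosterior.max
  val expScores = logPosterior.map(lp => math.exp(lp - maxLog))
  val total = expScores.sum
  expScores.map(_ / total).toArray
}
```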

Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Prashant Sharma
It is safe in the sense we would help you with the fix if you run into issues. I have used it, but since I worked on the patch the opinion can be biased. I am using scala 2.11 for day to day development. You should checkout the build instructions here :

Re: Check your cluster UI to ensure that workers are registered and have sufficient memory

2014-11-17 Thread lin_qili
I ran into this issue with Spark on YARN, version 1.0.2. Are there any hints? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Check-your-cluster-UI-to-ensure-that-workers-are-registered-and-have-sufficient-memory-tp5358p19133.html Sent from the Apache

Re: How can I apply such an inner join in Spark Scala/Python

2014-11-17 Thread Akhil Das
Simple join would do it. val a: List[(Int, Int)] = List((1,2),(2,4),(3,6)) val b: List[(Int, Int)] = List((1,3),(2,5),(3,6), (4,5),(5,6)) val A = sparkContext.parallelize(a) val B = sparkContext.parallelize(b) val ac = new PairRDDFunctions[Int, Int](A) val C =

Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Prashant Sharma
Looks like sbt/sbt -Pscala-2.11 is broken by a recent patch for improving maven build. Prashant Sharma On Tue, Nov 18, 2014 at 12:57 PM, Prashant Sharma scrapco...@gmail.com wrote: It is safe in the sense we would help you with the fix if you run into issues. I have used it, but since I

Re: Null pointer exception with larger datasets

2014-11-17 Thread Akhil Das
Make sure your list is not null; if it is null then it's more like doing: JavaRDD<Student> distData = sc.parallelize(null); distData.foreach(println); Thanks Best Regards On Tue, Nov 18, 2014 at 12:07 PM, Naveen Kumar Pokala npok...@spcapitaliq.com wrote: Hi, I am having list Students

Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Ye Xianjin
Hi Prashant Sharma, It's not even ok to build with the scala-2.11 profile on my machine. Just check out the master (c6e0c2ab1c29c184a9302d23ad75e4ccd8060242) and run sbt/sbt -Pscala-2.11 clean assembly: .. skip the normal part .. [info] Resolving org.scalamacros#quasiquotes_2.11;2.0.1 ... [warn] module

Spark On Yarn Issue: Initial job has not accepted any resources

2014-11-17 Thread LinCharlie
Hi All: I was submitting a spark_program.jar to a `spark on yarn cluster` on a driver machine with yarn-client mode. Here is the spark-submit command I used: ./spark-submit --master yarn-client --class com.charlie.spark.grax.OldFollowersExample --queue dt_spark