Re: spark-submit question

2014-11-17 Thread Samarth Mailinglist
I figured it out. I had to use pyspark.files.SparkFiles to get the
locations of files loaded into Spark.


On Mon, Nov 17, 2014 at 1:26 PM, Sean Owen so...@cloudera.com wrote:

 You are changing these paths and filenames to match your own actual
 scripts and file locations right?
 On Nov 17, 2014 4:59 AM, Samarth Mailinglist 
 mailinglistsama...@gmail.com wrote:

 I am trying to run a job written in python with the following command:

 bin/spark-submit --master spark://localhost:7077 
 /path/spark_solution_basic.py --py-files /path/*.py --files 
 /path/config.properties

 I always get an exception that config.properties is not found:

 INFO - IOError: [Errno 2] No such file or directory: 'config.properties'

 Why isn't this working?




RandomGenerator class not found exception

2014-11-17 Thread Ritesh Kumar Singh
My sbt file for the project includes this:

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"  % "1.1.0",
  "org.apache.spark"  %% "spark-mllib" % "1.1.0",
  "org.apache.commons" % "commons-math3" % "3.3"
)
=

Still I am getting this error:

java.lang.NoClassDefFoundError:
org/apache/commons/math3/random/RandomGenerator

=

The jar at location: ~/.m2/repository/org/apache/commons/commons-math3/3.3
contains the random generator class:

 $ jar tvf commons-math3-3.3.jar | grep RandomGenerator
org/apache/commons/math3/random/RandomGenerator.class
org/apache/commons/math3/random/UniformRandomGenerator.class
org/apache/commons/math3/random/SynchronizedRandomGenerator.class
org/apache/commons/math3/random/AbstractRandomGenerator.class
org/apache/commons/math3/random/RandomGeneratorFactory$1.class
org/apache/commons/math3/random/RandomGeneratorFactory.class
org/apache/commons/math3/random/StableRandomGenerator.class
org/apache/commons/math3/random/NormalizedRandomGenerator.class
org/apache/commons/math3/random/JDKRandomGenerator.class
org/apache/commons/math3/random/GaussianRandomGenerator.class


Please help


Re: RandomGenerator class not found exception

2014-11-17 Thread Akhil Das
Add this jar
http://mvnrepository.com/artifact/org.apache.commons/commons-math3/3.3
while creating the sparkContext.

sc.addJar("/path/to/commons-math3-3.3.jar")

And make sure it is shipped and available in the Environment tab of the web UI (port 4040).


Thanks
Best Regards

On Mon, Nov 17, 2014 at 1:54 PM, Ritesh Kumar Singh 
riteshoneinamill...@gmail.com wrote:

 My sbt file for the project includes this:

 libraryDependencies ++= Seq(
   "org.apache.spark"  %% "spark-core"  % "1.1.0",
   "org.apache.spark"  %% "spark-mllib" % "1.1.0",
   "org.apache.commons" % "commons-math3" % "3.3"
 )
 =

 Still I am getting this error:

 java.lang.NoClassDefFoundError:
 org/apache/commons/math3/random/RandomGenerator

 =

 The jar at location: ~/.m2/repository/org/apache/commons/commons-math3/3.3
 contains the random generator class:

  $ jar tvf commons-math3-3.3.jar | grep RandomGenerator
 org/apache/commons/math3/random/RandomGenerator.class
 org/apache/commons/math3/random/UniformRandomGenerator.class
 org/apache/commons/math3/random/SynchronizedRandomGenerator.class
 org/apache/commons/math3/random/AbstractRandomGenerator.class
 org/apache/commons/math3/random/RandomGeneratorFactory$1.class
 org/apache/commons/math3/random/RandomGeneratorFactory.class
 org/apache/commons/math3/random/StableRandomGenerator.class
 org/apache/commons/math3/random/NormalizedRandomGenerator.class
 org/apache/commons/math3/random/JDKRandomGenerator.class
 org/apache/commons/math3/random/GaussianRandomGenerator.class


 Please help



Re: RandomGenerator class not found exception

2014-11-17 Thread Chitturi Padma
Include the commons-math3-3.3 jar in the classpath while submitting the jar to the spark
cluster, like:
spark-submit --driver-class-path <commons-math3-3.3 jar> --class <main class> --master <spark cluster url> <app jar>

On Mon, Nov 17, 2014 at 1:55 PM, Ritesh Kumar Singh [via Apache Spark User
List] ml-node+s1001560n19055...@n3.nabble.com wrote:

 My sbt file for the project includes this:

 libraryDependencies ++= Seq(
   "org.apache.spark"  %% "spark-core"  % "1.1.0",
   "org.apache.spark"  %% "spark-mllib" % "1.1.0",
   "org.apache.commons" % "commons-math3" % "3.3"
 )
 =

 Still I am getting this error:

 java.lang.NoClassDefFoundError:
 org/apache/commons/math3/random/RandomGenerator

 =

 The jar at location: ~/.m2/repository/org/apache/commons/commons-math3/3.3
 contains the random generator class:

  $ jar tvf commons-math3-3.3.jar | grep RandomGenerator
 org/apache/commons/math3/random/RandomGenerator.class
 org/apache/commons/math3/random/UniformRandomGenerator.class
 org/apache/commons/math3/random/SynchronizedRandomGenerator.class
 org/apache/commons/math3/random/AbstractRandomGenerator.class
 org/apache/commons/math3/random/RandomGeneratorFactory$1.class
 org/apache/commons/math3/random/RandomGeneratorFactory.class
 org/apache/commons/math3/random/StableRandomGenerator.class
 org/apache/commons/math3/random/NormalizedRandomGenerator.class
 org/apache/commons/math3/random/JDKRandomGenerator.class
 org/apache/commons/math3/random/GaussianRandomGenerator.class


 Please help







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RandomGenerator-class-not-found-exception-tp19055p19057.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Functions in Spark

2014-11-17 Thread Gerard Maas
One rule of thumb is to use rdd.toDebugString and check the lineage for
ShuffledRDD. As long as there's no need to restructure the RDD,
operations can be pipelined on each partition.

rdd.toDebugString is your friend :-)
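
For instance, a minimal sketch (hypothetical input path) of what to look for:

import org.apache.spark.SparkContext._   // for reduceByKey outside the shell

val words = sc.textFile("hdfs:///data/input.txt").flatMap(_.split(" "))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // reduceByKey introduces a shuffle
println(words.toDebugString)    // no ShuffledRDD: the flatMap/map steps are pipelined per partition
println(counts.toDebugString)   // the lineage now shows a ShuffledRDD, i.e. a stage boundary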

-kr, Gerard.


On Mon, Nov 17, 2014 at 7:37 AM, Mukesh Jha me.mukesh@gmail.com wrote:

 Thanks, I did go through the video and it was very informative, but I think I was
  looking for the Transformations section at this page:
  https://spark.apache.org/docs/0.9.1/scala-programming-guide.html.


 On Mon, Nov 17, 2014 at 10:31 AM, Samarth Mailinglist 
 mailinglistsama...@gmail.com wrote:

 Check this video out:
 https://www.youtube.com/watch?v=dmL0N3qfSc8&list=UURzsq7k4-kT-h3TDUBQ82-w

 On Mon, Nov 17, 2014 at 9:43 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 Hi,
  Is there any way to know which of my functions performs better in Spark?
  In other words, say I have achieved the same thing using two different
  implementations. How do I judge which implementation is better than
  the other? Is processing time the only metric that we can use to claim that
  one implementation is better than the other?
 Can anyone please share some thoughts on this?

 Thank You





 --


 Thanks & Regards,

 *Mukesh Jha me.mukesh@gmail.com*



Landmarks in GraphX section of Spark API

2014-11-17 Thread Deep Pradhan
Hi,
I was going through the graphx section in the Spark API in
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ShortestPaths$

Here, I find the word landmark. Can anyone explain to me what landmark
means? Is it just the ordinary English word, or does it mean something specific in GraphX?

Thank You


Re: How to incrementally compile spark examples using mvn

2014-11-17 Thread Sean Owen
The downloads just happen once so this is not a problem.

If you are just building one module in a project, it needs a compiled
copy of other modules. It will either use your locally-built and
locally-installed artifact, or, download one from the repo if
possible.

This isn't needed if you are compiling all modules at once. If you
want to compile everything and reuse the local artifacts later, you
need 'install' not 'package'.
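
For example, a sketch of that workflow, run from the top-level Spark directory:

  mvn -DskipTests install                            # build everything once and install the artifacts locally
  mvn -pl :spark-examples_2.10 -DskipTests compile   # afterwards, rebuild only the examples module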

On Mon, Nov 17, 2014 at 12:27 AM, Yiming (John) Zhang sdi...@gmail.com wrote:
 Thank you Marcelo. I tried your suggestion (# mvn -pl :spark-examples_2.10
 compile), but it required downloading many spark components (as listed
 below), which I have already compiled on my server.

 Downloading: 
 https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.10/1.1.0/spark-core_2.10-1.1.0.pom
 ...
 Downloading: 
 https://repo1.maven.org/maven2/org/apache/spark/spark-streaming_2.10/1.1.0/spark-streaming_2.10-1.1.0.pom
 ...
 Downloading: 
 https://repository.jboss.org/nexus/content/repositories/releases/org/apache/spark/spark-hive_2.10/1.1.0/spark-hive_2.10-1.1.0.pom
 ...

 This problem didn't happen when I compiled the whole project using ``mvn 
 -DskipTests package''. I guess some configurations have to be made to tell 
 mvn the dependencies are local. Any idea for that?

 Thank you for your help!

 Cheers,
 Yiming

 -----Original Message-----
 From: Marcelo Vanzin [mailto:van...@cloudera.com]
 Sent: 16 November 2014 10:26
 To: sdi...@gmail.com
 Cc: user@spark.apache.org
 Subject: Re: How to incrementally compile spark examples using mvn

 I haven't tried scala:cc, but you can ask maven to just build a particular 
 sub-project. For example:

   mvn -pl :spark-examples_2.10 compile

 On Sat, Nov 15, 2014 at 5:31 PM, Yiming (John) Zhang sdi...@gmail.com wrote:
 Hi,



  I have already successfully compiled and run the spark examples. My problem
  is that if I make some modifications (e.g., on SparkPi.scala or
  LogQuery.scala) I have to use “mvn -DskipTests package” to rebuild the
  whole spark project and wait a relatively long time.



  I also tried “mvn scala:cc” as described in
  http://spark.apache.org/docs/latest/building-with-maven.html, but it
  just waits indefinitely, like:

 [INFO] --- scala-maven-plugin:3.2.0:cc (default-cli) @ spark-parent
 ---

 [INFO] wait for files to compile...



 Is there any method to incrementally compile the examples using mvn?
 Thank you!



 Cheers,

 Yiming



 --
 Marcelo





Spark streaming batch overrun

2014-11-17 Thread facboy
Hi all,

In this presentation
(https://prezi.com/1jzqym68hwjp/spark-gotchas-and-anti-patterns/) it
mentions that Spark Streaming's behaviour is undefined if a batch overruns
the polling interval.  Is this something that might be addressed in future
or is it fundamental to the design?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-batch-overrun-tp19061.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




HDFS read text file

2014-11-17 Thread Naveen Kumar Pokala
Hi,


JavaRDD<Instrument> studentsData = sc.parallelize(list); // list is student info, List<Student>

studentsData.saveAsTextFile("hdfs://master/data/spark/instruments.txt");

above statements saved the students information in the HDFS as a text file. 
Each object is a line in text file as below.


How to read that file, I mean each line as Object of student.

-Naveen


Building Spark for Hive The requested profile hadoop-1.2 could not be activated because it does not exist.

2014-11-17 Thread akshayhazari
I am using Apache Hadoop 1.2.1. I wanted to use Spark SQL with Hive. So I
tried to build Spark like so:

 mvn -Phive,hadoop-1.2 -Dhadoop.version=1.2.1 clean -DskipTests package
  
   But I get the following error.
  The requested profile hadoop-1.2 could not be activated because it does
not exist. 
  Is there some way to handle this or do I have to downgrade Hadoop? Any
help is appreciated.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Building-Spark-for-Hive-The-requested-profile-hadoop-1-2-could-not-be-activated-because-it-does-not--tp19063.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




How to measure communication between nodes in Spark Standalone Cluster?

2014-11-17 Thread Hlib Mykhailenko
Hello, 

I use a Spark Standalone Cluster and I want to somehow measure internode
communication.
As I understood, GraphX transfers only vertex values. Am I right?

But I do not want to get the number of bytes which were transferred between any two
nodes.
So is there a way to measure how many vertex values were transferred among
nodes?

Thanks! 

-- 
Cordialement, 
Hlib Mykhailenko 
Doctorant à INRIA Sophia-Antipolis Méditerranée 
2004 Route des Lucioles BP93 
06902 SOPHIA ANTIPOLIS cedex 



Re: HDFS read text file

2014-11-17 Thread Akhil Das
You can use the sc.objectFile
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
to read it. It will be RDD[Student] type.
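
A minimal sketch of that round trip (hypothetical path; it assumes a serializable
Student class). Note that sc.objectFile reads data written with saveAsObjectFile,
so save the RDD as objects rather than text:

import org.apache.spark.rdd.RDD

val students: RDD[Student] = sc.parallelize(list)   // list: a Seq[Student] built on the driver
students.saveAsObjectFile("hdfs://master/data/spark/students.obj")
val loaded: RDD[Student] = sc.objectFile[Student]("hdfs://master/data/spark/students.obj")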

Thanks
Best Regards

On Mon, Nov 17, 2014 at 4:03 PM, Naveen Kumar Pokala 
npok...@spcapitaliq.com wrote:

 Hi,





 JavaRDD<Instrument> studentsData = sc.parallelize(list); // list is student info, List<Student>

  studentsData.saveAsTextFile("hdfs://master/data/spark/instruments.txt");



 above statements saved the students information in the HDFS as a text
 file. Each object is a line in text file as below.





 How to read that file, I mean each line as Object of student.



 -Naveen



Re: How to measure communication between nodes in Spark Standalone Cluster?

2014-11-17 Thread Akhil Das
You can use Ganglia to see the overall data transfer across the
cluster/nodes. I don't think there's a direct way to get the vertices being
transferred.

Thanks
Best Regards

On Mon, Nov 17, 2014 at 4:29 PM, Hlib Mykhailenko hlib.mykhaile...@inria.fr
 wrote:

 Hello,

 I use Spark Standalone Cluster and I want to measure somehow internode
 communication.
 As I understood, Graphx transfers only vertices values. Am I right?

 But I do not want to get number of bytes which were transferred between
 any two nodes.
 So is  there way to measure how many values of vertices were transferred
 among nodes?

 Thanks!

 --
 Cordialement,
 *Hlib Mykhailenko*
 Doctorant à INRIA Sophia-Antipolis Méditerranée
 http://www.inria.fr/centre/sophia/
 2004 Route des Lucioles BP93
 06902 SOPHIA ANTIPOLIS cedex




Re: HDFS read text file

2014-11-17 Thread Hlib Mykhailenko
Hello Naveen, 

I think you should first override the toString method of your
sample.spark.test.Student class.
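
Something along these lines, as a Scala sketch of the idea (your actual class is
presumably Java, but the point is the same):

class Student(val id: Int, val name: String) extends Serializable {
  override def toString: String = id + "," + name   // one comma-separated line per object in the text file
}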

-- 
Cordialement, 
Hlib Mykhailenko 
Doctorant à INRIA Sophia-Antipolis Méditerranée 
2004 Route des Lucioles BP93 
06902 SOPHIA ANTIPOLIS cedex 

- Original Message -

 From: Naveen Kumar Pokala npok...@spcapitaliq.com
 To: user@spark.apache.org
 Sent: Monday, November 17, 2014 11:33:44 AM
 Subject: HDFS read text file

 Hi,

 JavaRDD<Instrument> studentsData = sc.parallelize(list); // list is student info, List<Student>

 studentsData.saveAsTextFile("hdfs://master/data/spark/instruments.txt");

 above statements saved the students information in the HDFS as a text file.
 Each object is a line in text file as below.

 How to read that file, I mean each line as Object of student.

 -Naveen


Re: Building Spark for Hive The requested profile hadoop-1.2 could not be activated because it does not exist.

2014-11-17 Thread akshayhazari
Oops, I guess this is the right way to do it:
mvn -Phive -Dhadoop.version=1.2.1 clean -DskipTests package



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Building-Spark-for-Hive-The-requested-profile-hadoop-1-2-could-not-be-activated-because-it-does-not--tp19063p19068.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Returning breeze.linalg.DenseMatrix from method

2014-11-17 Thread tribhuvan...@gmail.com
This should fix it --

def func(str: String): DenseMatrix[Double] = {
...
...
}

So, why is this required?
Think of it like this -- If you hadn't explicitly mentioned Double, it
might have been that the calling function expected a
DenseMatrix[SomeOtherType], and performed a SomeOtherType-specific
operation which may have not been supported by the returned
DenseMatrix[Double]. (I'm also assuming that SomeOtherType has no subtype
relations with Double).

On 17 November 2014 00:14, Ritesh Kumar Singh riteshoneinamill...@gmail.com
 wrote:

 Hi,

 I have a method that returns DenseMatrix:
 def func(str: String): DenseMatrix = {
 ...
 ...
 }

 But I keep getting this error:
 *class DenseMatrix takes type parameters*

 I tried this too:
 def func(str: String): DenseMatrix(Int, Int, Array[Double]) = {
 ...
 ...
 }
 But this gives me this error:
 *'=' expected but '(' found*

 Any possible fixes?




-- 
*Tribhuvanesh Orekondy*


Building Spark with hive does not work

2014-11-17 Thread Hao Ren
Hi,

I am building spark on the most recent master branch.

I checked this page:
https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md

The cmd *./sbt/sbt -Phive -Phive-thirftserver clean assembly/assembly* works
fine. A fat jar is created.

However, when I started the SQL-CLI, I encountered an exception:

Spark assembly has been built with Hive, including Datanucleus jars on
classpath
java.lang.ClassNotFoundException:
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:337)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Failed to load main class
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
*You need to build Spark with -Phive and -Phive-thriftserver.*
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties

It suggests building with -Phive and -Phive-thriftserver, which is actually
what I have done.

Any idea ?

Hao




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Building-Spark-with-hive-does-not-work-tp19072.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Returning breeze.linalg.DenseMatrix from method

2014-11-17 Thread Ritesh Kumar Singh
Yeah, it works.

Although when I try to define a var of type DenseMatrix, like this:

var mat1: DenseMatrix[Double]

It gives an error saying we need to initialise the matrix mat1 at the time
of declaration.
Had to initialise it as :
var mat1: DenseMatrix[Double] = DenseMatrix.zeros[Double](1,1)

Anyways, it works now
Thanks for helping :)

On Mon, Nov 17, 2014 at 4:56 PM, tribhuvan...@gmail.com 
tribhuvan...@gmail.com wrote:

 This should fix it --

  def func(str: String): DenseMatrix[Double] = {
 ...
 ...
 }

 So, why is this required?
 Think of it like this -- If you hadn't explicitly mentioned Double, it
 might have been that the calling function expected a
 DenseMatrix[SomeOtherType], and performed a SomeOtherType-specific
 operation which may have not been supported by the returned
 DenseMatrix[Double]. (I'm also assuming that SomeOtherType has no subtype
 relations with Double).

 On 17 November 2014 00:14, Ritesh Kumar Singh 
 riteshoneinamill...@gmail.com wrote:

 Hi,

 I have a method that returns DenseMatrix:
 def func(str: String): DenseMatrix = {
 ...
 ...
 }

 But I keep getting this error:
 *class DenseMatrix takes type parameters*

 I tried this too:
 def func(str: String): DenseMatrix(Int, Int, Array[Double]) = {
 ...
 ...
 }
 But this gives me this error:
 *'=' expected but '(' found*

 Any possible fixes?




 --
 *Tribhuvanesh Orekondy*



Re: How do you force a Spark Application to run in multiple tasks

2014-11-17 Thread Daniel Siegmann
I've never used Mesos, sorry.

On Fri, Nov 14, 2014 at 5:30 PM, Steve Lewis lordjoe2...@gmail.com wrote:

 The cluster runs Mesos and I can see the tasks in the Mesos UI but most
 are not doing much - any hints about that UI

 On Fri, Nov 14, 2014 at 11:39 AM, Daniel Siegmann 
 daniel.siegm...@velos.io wrote:

 Most of the information you're asking for can be found on the Spark web
 UI (see here http://spark.apache.org/docs/1.1.0/monitoring.html). You
 can see which tasks are being processed by which nodes.

 If you're using HDFS and your file size is smaller than the HDFS block
 size you will only have one partition (remember, there is exactly one task
 for each partition in a stage). If you want to force it to have more
 partitions, you can call RDD.repartition(numPartitions). Note that this
 will introduce a shuffle you wouldn't otherwise have.
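
  For example, a minimal sketch (hypothetical input path and partition count):

  val records = sc.textFile("hdfs:///data/small-input.txt")  // few partitions when the input file is small
  val spread = records.repartition(8)                        // adds a shuffle, but later stages get 8 tasks
  println(spread.partitions.length)                          // 8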

 Also make sure your job is allocated more than one core in your cluster
 (you can see this on the web UI).

 On Fri, Nov 14, 2014 at 2:18 PM, Steve Lewis lordjoe2...@gmail.com
 wrote:

   I have instrumented word count to track how many machines the code runs
  on. I use an accumulator to maintain a Set of MacAddresses. I find that
  everything is done on a single machine. This is probably optimal for word
  count but not the larger problems I am working on.
  How do I force processing to be split into multiple tasks? How do I
  access the task and attempt numbers to track which processing happens in
  which attempt? Also, I am using MacAddress to determine which machine is
  running the code.
  As far as I can tell a simple word count is running in one thread on
  one machine and the remainder of the cluster does nothing.
  This is consistent with tests where I write to stdout from functions and
  see little output on most machines in the cluster.





 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 54 W 40th St, New York, NY 10018
 E: daniel.siegm...@velos.io W: www.velos.io




 --
 Steven M. Lewis PhD
 4221 105th Ave NE
 Kirkland, WA 98033
 206-384-1340 (cell)
 Skype lordjoe_com




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io


Re: How to measure communication between nodes in Spark Standalone Cluster?

2014-11-17 Thread Yifan LI
I am not sure there is a direct way (an API in GraphX, etc.) to measure the
number of transferred vertex values among nodes during computation.

It might depend on:
- the operations in your application, e.g. whether each vertex communicates only with its
immediate neighbours.
- the partition strategy you chose, w.r.t. the vertex replication factor
- the distribution of partitions across the cluster
...


Best,
Yifan LI
LIP6, UPMC, Paris





 On 17 Nov 2014, at 11:59, Hlib Mykhailenko hlib.mykhaile...@inria.fr wrote:
 
 Hello,
 
 I use Spark Standalone Cluster and I want to measure somehow internode 
 communication. 
 As I understood, Graphx transfers only vertices values. Am I right?  
 
 But I do not want to get number of bytes which were transferred between any 
 two nodes.
 So is  there way to measure how many values of vertices were transferred 
 among nodes?
 
 Thanks!
 
 --
 Cordialement,
 Hlib Mykhailenko
 Doctorant à INRIA Sophia-Antipolis Méditerranée 
 http://www.inria.fr/centre/sophia/
 2004 Route des Lucioles BP93
 06902 SOPHIA ANTIPOLIS cedex
 



Re: same error of SPARK-1977 while using trainImplicit in mllib 1.0.2

2014-11-17 Thread aaronlin
Thanks. It works for me.  

--  
Aaron Lin


On Saturday, 15 November 2014 at 1:19 AM, Xiangrui Meng wrote:

 If you use the Kryo serializer, you need to register mutable.BitSet and Rating:
  
 https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala#L102
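
 A minimal sketch of such a registrator, patterned on that example (the
 registrator class name is arbitrary):

 import com.esotericsoftware.kryo.Kryo
 import org.apache.spark.SparkConf
 import org.apache.spark.mllib.recommendation.Rating
 import org.apache.spark.serializer.{KryoRegistrator, KryoSerializer}
 import scala.collection.mutable

 class ALSRegistrator extends KryoRegistrator {
   override def registerClasses(kryo: Kryo) {
     kryo.register(classOf[Rating])
     kryo.register(classOf[mutable.BitSet])
   }
 }

 val conf = new SparkConf()
   .set("spark.serializer", classOf[KryoSerializer].getName)
   .set("spark.kryo.registrator", classOf[ALSRegistrator].getName)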
  
 The JIRA was marked resolved because chill resolved the problem in
 v0.4.0 and we have this workaround.
  
 -Xiangrui
  
 On Fri, Nov 14, 2014 at 12:41 AM, aaronlin aaron...@kkbox.com 
 (mailto:aaron...@kkbox.com) wrote:
  Hi folks,
   
  Although SPARK-1977 says this problem is resolved in 1.0.2, I still
  have this problem while running the script on AWS EC2 via spark-ec2.py.

  I checked SPARK-1977 and found that twitter.chill resolved the problem in
  v0.4.0, not v0.3.6, but Spark depends on twitter.chill v0.3.6 according to
  the Maven page. For more information, you can check the following pages:
  - https://github.com/twitter/chill
  - http://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10/1.0.2
   
  Can anyone give me advises?
  Thanks
   
  --
  Aaron Lin
   
  
  
  




Re: RDD.aggregate versus accumulables...

2014-11-17 Thread Daniel Siegmann
You should *never* use accumulators for this purpose because you may get
incorrect answers. Accumulators can count the same thing multiple times -
you cannot rely upon the correctness of the values they compute. See
SPARK-732 https://issues.apache.org/jira/browse/SPARK-732 for more info.
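
For a running sum and count, for example, a minimal sketch with aggregate
(hypothetical data) looks like:

val data = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
val (sum, count) = data.aggregate((0.0, 0L))(
  (acc, x) => (acc._1 + x, acc._2 + 1),       // fold a value into one partition's accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2))       // merge the per-partition results
val mean = sum / count                        // 2.5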

On Sun, Nov 16, 2014 at 10:06 PM, Segerlind, Nathan L 
nathan.l.segerl...@intel.com wrote:

  Hi All.



 I am trying to get my head around why using accumulators and accumulables
 seems to be the most recommended method for accumulating running sums,
 averages, variances and the like, whereas the aggregate method seems to me
 to be the right one. I have no performance measurements as of yet, but it
 seems that aggregate is simpler and more intuitive (And it does what one
 might expect an accumulator to do) whereas the accumulators and
 accumulables seem to have some extra complications and overhead.



 So…



 What’s the real difference between an accumulator/accumulable and
 aggregating an RDD? When is one method of aggregation preferred over the
 other?



 Thanks,

 Nate




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io


Re: Building Spark with hive does not work

2014-11-17 Thread Cheng Lian

Hey Hao,

Which commit are you using? Just tried 64c6b9b with exactly the same 
command line flags, couldn't reproduce this issue.


Cheng

On 11/17/14 10:02 PM, Hao Ren wrote:

Hi,

I am building spark on the most recent master branch.

I checked this page:
https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md

The cmd *./sbt/sbt -Phive -Phive-thirftserver clean assembly/assembly* works
fine. A fat jar is created.

However, when I started the SQL-CLI, I encountered an exception:

Spark assembly has been built with Hive, including Datanucleus jars on
classpath
java.lang.ClassNotFoundException:
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:337)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Failed to load main class
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
*You need to build Spark with -Phive and -Phive-thriftserver.*
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties

It's suggested to do with -Phive and -Phive-thriftserver, which is actually
what I have done.

Any idea ?

Hao




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Building-Spark-with-hive-does-not-work-tp19072.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: RDD.aggregate versus accumulables...

2014-11-17 Thread Surendranauth Hiraman
We use Algebird for calculating things like min/max, stddev, variance, etc.

https://github.com/twitter/algebird/wiki

-Suren


On Mon, Nov 17, 2014 at 11:32 AM, Daniel Siegmann daniel.siegm...@velos.io
wrote:

 You should *never* use accumulators for this purpose because you may get
 incorrect answers. Accumulators can count the same thing multiple times -
 you cannot rely upon the correctness of the values they compute. See
 SPARK-732 https://issues.apache.org/jira/browse/SPARK-732 for more info.

 On Sun, Nov 16, 2014 at 10:06 PM, Segerlind, Nathan L 
 nathan.l.segerl...@intel.com wrote:

  Hi All.



 I am trying to get my head around why using accumulators and accumulables
 seems to be the most recommended method for accumulating running sums,
 averages, variances and the like, whereas the aggregate method seems to me
 to be the right one. I have no performance measurements as of yet, but it
 seems that aggregate is simpler and more intuitive (And it does what one
 might expect an accumulator to do) whereas the accumulators and
 accumulables seem to have some extra complications and overhead.



 So…



 What’s the real difference between an accumulator/accumulable and
 aggregating an RDD? When is one method of aggregation preferred over the
 other?



 Thanks,

 Nate




 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 54 W 40th St, New York, NY 10018
 E: daniel.siegm...@velos.io W: www.velos.io




-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@v suren.hira...@sociocast.comelos.io
W: www.velos.io


How to broadcast a textFile?

2014-11-17 Thread YaoPau
I have a 1 million row file that I'd like to read from my edge node, and then
send a copy of it to each Hadoop machine's memory in order to run JOINs in
my spark streaming code.

I see examples in the docs of how to use broadcast() for a simple array,
but how about when the data is in a textFile?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-broadcast-a-textFile-tp19083.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: RandomGenerator class not found exception

2014-11-17 Thread Chitturi Padma
As you are using sbt, you need not put it in ~/.m2/repository for Maven.
Include the jar explicitly using the --driver-class-path option
while submitting the jar to the Spark cluster.

On Mon, Nov 17, 2014 at 7:41 PM, Ritesh Kumar Singh [via Apache Spark User
List] ml-node+s1001560n1907...@n3.nabble.com wrote:

 It's still not working. Keep getting the same error.

 I even deleted the commons-math3/* folder containing the jar, and then
 under the directory org/apache/commons/ made a folder called 'math3'
 and put the commons-math3-3.3.jar in it.
 Still it's not working.

 I also tried sc.addJar("/path/to/jar") within spark-shell and in my
 project source file.
 It still didn't import the jar in either location.

 Any fixes? Please help

 On Mon, Nov 17, 2014 at 2:14 PM, Chitturi Padma [hidden email]
 http://user/SendEmail.jtp?type=nodenode=19073i=0 wrote:

 Include the commons-math3-3.3 jar in the classpath while submitting the jar to the spark
  cluster, like:
  spark-submit --driver-class-path <commons-math3-3.3 jar> --class <main class> --master
  <spark cluster url> <app jar>

 On Mon, Nov 17, 2014 at 1:55 PM, Ritesh Kumar Singh [via Apache Spark
 User List] [hidden email]
 http://user/SendEmail.jtp?type=nodenode=19057i=0 wrote:

 My sbt file for the project includes this:

  libraryDependencies ++= Seq(
    "org.apache.spark"  %% "spark-core"  % "1.1.0",
    "org.apache.spark"  %% "spark-mllib" % "1.1.0",
    "org.apache.commons" % "commons-math3" % "3.3"
  )
 =

 Still I am getting this error:

 java.lang.NoClassDefFoundError:
 org/apache/commons/math3/random/RandomGenerator

 =

 The jar at location:
 ~/.m2/repository/org/apache/commons/commons-math3/3.3 contains the random
 generator class:

  $ jar tvf commons-math3-3.3.jar | grep RandomGenerator
 org/apache/commons/math3/random/RandomGenerator.class
 org/apache/commons/math3/random/UniformRandomGenerator.class
 org/apache/commons/math3/random/SynchronizedRandomGenerator.class
 org/apache/commons/math3/random/AbstractRandomGenerator.class
 org/apache/commons/math3/random/RandomGeneratorFactory$1.class
 org/apache/commons/math3/random/RandomGeneratorFactory.class
 org/apache/commons/math3/random/StableRandomGenerator.class
 org/apache/commons/math3/random/NormalizedRandomGenerator.class
 org/apache/commons/math3/random/JDKRandomGenerator.class
 org/apache/commons/math3/random/GaussianRandomGenerator.class


 Please help





 --
 View this message in context: Re: RandomGenerator class not found
 exception
 http://apache-spark-user-list.1001560.n3.nabble.com/RandomGenerator-class-not-found-exception-tp19055p19057.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.









--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RandomGenerator-class-not-found-exception-tp19055p19086.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Building Spark with hive does not work

2014-11-17 Thread Hao Ren
Sry for spamming,

Just after my previous post, I noticed that the command used is:

./sbt/sbt -Phive -Phive-thirftserver clean assembly/assembly

thriftserver* 

the typo error is the evil. Stupid, me.

I believe I just copy-pasted it from somewhere else but never even checked it;
meanwhile no error msg, such as no such option, is displayed, which made
me assume the flags were correct.

Sry for the carelessness.

Hao




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Building-Spark-with-hive-does-not-work-tp19072p19087.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Building Spark with hive does not work

2014-11-17 Thread Ted Yu
Looks like this was where you got that commandline:

http://search-hadoop.com/m/JW1q5RlPrl

Cheers

On Mon, Nov 17, 2014 at 9:44 AM, Hao Ren inv...@gmail.com wrote:

 Sry for spamming,

 Just after my previous post, I noticed that the command used is:

 ./sbt/sbt -Phive -Phive-thirftserver clean assembly/assembly

 thriftserver*

 the typo error is the evil. Stupid, me.

 I believe I just copy-pasted from somewhere else, but no even checked it,
 meanwhile no error msg, such as no such option, is displayed, which makes
 me consider the flags are correct.

 Sry for the carelessness.

 Hao




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Building-Spark-with-hive-does-not-work-tp19072p19087.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





How can I apply such an inner join in Spark Scala/Python

2014-11-17 Thread Blind Faith
So let us say I have RDDs A and B with the following values.

A = [ (1, 2), (2, 4), (3, 6) ]

B = [ (1, 3), (2, 5), (3, 6), (4, 5), (5, 6) ]

I want to apply an inner join, such that I get the following as a result.

C = [ (1, (2, 3)), (2, (4, 5)), (3, (6,6)) ]

That is, those keys which are not present in A should disappear after the
left inner join.

How can I achieve that? I can see outerJoin functions but no innerJoin
functions in the Spark RDD class.


Exception in spark sql when running a group by query

2014-11-17 Thread Sadhan Sood
While testing Spark SQL, we ran this query (a GROUP BY on an expression)
and got an exception. The same query worked fine on Hive.

SELECT from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200),
  '/MM/dd') as pst_date,
count(*) as num_xyzs
  FROM
all_matched_abc
  GROUP BY
from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200),
  '/MM/dd')

14/11/17 17:41:46 ERROR thriftserver.SparkSQLDriver: Failed in [SELECT
from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200),
  '/MM/dd') as pst_date,
count(*) as num_xyzs
  FROM
all_matched_abc
  GROUP BY
from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200),
  '/MM/dd')
]
org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
Expression not in GROUP BY:
HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor(((CAST(xyz#183.whenrequestreceived
AS whenrequestreceived#187L, DoubleType) / 1000.0) - CAST(25200,
DoubleType))),/MM/dd) AS pst_date#179, tree:

Aggregate 
[HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor(((CAST(xyz#183.whenrequestreceived,
DoubleType) / 1000.0) - CAST(25200, DoubleType))),/MM/dd)],
[HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor(((CAST(xyz#183.whenrequestreceived
AS whenrequestreceived#187L, DoubleType) / 1000.0) - CAST(25200,
DoubleType))),/MM/dd) AS pst_date#179,COUNT(1) AS num_xyzs#180L]

 MetastoreRelation default, all_matched_abc, None
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:127)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:125)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at 
scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
at 
org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
at 
org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:425)
at 
org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:59)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:276)
at 

How do I get the executor ID from running Java code

2014-11-17 Thread Steve Lewis
 The spark UI lists a number of Executor IDS on the cluster. I would like
to access both executor ID and Task/Attempt IDs from the code inside a
function running on a slave machine.
Currently my motivation is to  examine parallelism and locality but in
Hadoop this aids in allowing code to write non-overlapping temporary files


Spark streaming on Yarn

2014-11-17 Thread kpeng1
Hi,
 
I have been using spark streaming in standalone mode and now I want to
migrate to spark running on yarn, but I am not sure how you would
go about designating a specific node in the cluster to act as an avro
listener, since I am using the flume-based push approach with spark.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-on-Yarn-tp19093.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to broadcast a textFile?

2014-11-17 Thread YaoPau
OK then I'd still need to write the code (within my spark streaming code I'm
guessing) to convert my text file into an object like a HashMap before
broadcasting.  

How can I make sure only the HashMap is being broadcast while all the
pre-processing to create the HashMap is only performed once?
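
A minimal sketch of that pattern (hypothetical file layout of key<TAB>value per
line, and a hypothetical DStream named stream). The file is read and the map is
built once on the driver; only the finished map is broadcast:

val lookup: Map[String, String] =
  scala.io.Source.fromFile("/path/on/edge/node/lookup.txt").getLines().map { line =>
    val Array(k, v) = line.split("\t", 2)
    k -> v
  }.toMap
val lookupBc = sc.broadcast(lookup)                    // shipped to each executor once
stream.map(rec => (rec, lookupBc.value.get(rec)))      // used inside the streaming transformations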



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-broadcast-a-textFile-tp19083p19094.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




RE: RDD.aggregate versus accumulables...

2014-11-17 Thread Segerlind, Nathan L
Thanks for the link to the bug.

Unfortunately, using accumulators like this is getting spread around as a 
recommended practice despite the bug.


From: Daniel Siegmann [mailto:daniel.siegm...@velos.io]
Sent: Monday, November 17, 2014 8:32 AM
To: Segerlind, Nathan L
Cc: user
Subject: Re: RDD.aggregate versus accumulables...

You should never use accumulators for this purpose because you may get 
incorrect answers. Accumulators can count the same thing multiple times - you 
cannot rely upon the correctness of the values they compute. See 
SPARK-732https://issues.apache.org/jira/browse/SPARK-732 for more info.

On Sun, Nov 16, 2014 at 10:06 PM, Segerlind, Nathan L 
nathan.l.segerl...@intel.commailto:nathan.l.segerl...@intel.com wrote:
Hi All.

I am trying to get my head around why using accumulators and accumulables seems 
to be the most recommended method for accumulating running sums, averages, 
variances and the like, whereas the aggregate method seems to me to be the 
right one. I have no performance measurements as of yet, but it seems that 
aggregate is simpler and more intuitive (And it does what one might expect an 
accumulator to do) whereas the accumulators and accumulables seem to have some 
extra complications and overhead.

So…

What’s the real difference between an accumulator/accumulable and aggregating 
an RDD? When is one method of aggregation preferred over the other?

Thanks,
Nate



--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.iomailto:daniel.siegm...@velos.io W: 
www.velos.iohttp://www.velos.io


Re: How can I apply such an inner join in Spark Scala/Python

2014-11-17 Thread Sean Owen
Just RDD.join() should be an inner join.
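
A minimal sketch matching the example above:

import org.apache.spark.SparkContext._   // pair-RDD functions such as join

val a = sc.parallelize(Seq((1, 2), (2, 4), (3, 6)))
val b = sc.parallelize(Seq((1, 3), (2, 5), (3, 6), (4, 5), (5, 6)))
val c = a.join(b)        // inner join on the keys
// c.collect() contains (1,(2,3)), (2,(4,5)) and (3,(6,6)); keys missing from A are dropped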

On Mon, Nov 17, 2014 at 5:51 PM, Blind Faith person.of.b...@gmail.com wrote:
 So let us say I have RDDs A and B with the following values.

 A = [ (1, 2), (2, 4), (3, 6) ]

 B = [ (1, 3), (2, 5), (3, 6), (4, 5), (5, 6) ]

 I want to apply an inner join, such that I get the following as a result.

 C = [ (1, (2, 3)), (2, (4, 5)), (3, (6,6)) ]

 That is, those keys which are not present in A should disappear after the
 left inner join.

 How can I achieve that? I can see outerJoin functions but no innerJoin
 functions in the Spark RDD class.




Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-11-17 Thread akhandeshi
"only option is to split your problem further by increasing parallelism": my
understanding is that this means increasing the number of partitions, is that right?
That didn't seem to help, because it seems the partitions are not uniformly
sized. My observation is that when I increase the number of partitions, it
creates many empty block partitions and my larger partition is not broken
down into smaller pieces. Any hints on how I can get uniform partitions? I
noticed many threads, but was not able to do anything effective from the Java
API. I will appreciate any help/insight you can provide.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-OutOfMemoryError-Requested-array-size-exceeds-VM-limit-tp16809p19097.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Missing SparkSQLCLIDriver and Beeline drivers in Spark

2014-11-17 Thread Ted Yu
Minor correction: there was a typo in commandline

hive-thirftserver should be hive-thriftserver

Cheers

On Thu, Aug 7, 2014 at 6:49 PM, Cheng Lian lian.cs@gmail.com wrote:

 Things have changed a bit in the master branch, and the SQL programming
 guide in master branch actually doesn’t apply to branch-1.0-jdbc.

 In branch-1.0-jdbc, Hive Thrift server and Spark SQL CLI are included in
 the hive profile and are thus not enabled by default. You need to either

- pass -Phive to Maven to enable it, or
- use SPARK_HIVE=true ./sbt/sbt assembly

 In the most recent master branch, however, Hive Thrift server and Spark
 SQL CLI are moved into a separate hive-thriftserver profile. And our SBT
 build file now delegates to Maven. So, to build the master branch, you can
 either

- ./sbt/sbt -Phive-thirftserver clean assembly/assembly, or
- mvn -Phive-thriftserver clean package -DskipTests

 On Fri, Aug 8, 2014 at 6:12 AM, ajatix a...@sigmoidanalytics.com wrote:

 Hi

 I wish to migrate from shark to the spark-sql shell, where I am facing
 some
 difficulties in setting up.

 I cloned the branch-1.0-jdbc to test out the spark-sql shell, but I am
 unable to run it after building the source.

 I've tried two methods for building (with Hadoop 1.0.4) - sbt/sbt
 assembly;
 and mvn -DskipTests clean package -X. Both build successfully, but when I
 run bin/spark-sql, I get the following error:

 Exception in thread main java.lang.ClassNotFoundException:
 org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:270)
 at
 org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:311)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

 and when I run bin/beeline, I get this:
 Error: Could not find or load main class org.apache.hive.beeline.BeeLine

 bin/spark-shell works fine. It there something else I have to add to the
 build parameters?
 According to this -
 https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md,
 I
 tried rebuilding with -Phive-thriftserver, but it failed to detect the
 library while building.

 Thanks and Regards
 Ajay



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Missing-SparkSQLCLIDriver-and-Beeline-drivers-in-Spark-tp11724.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.


  ​



IOException: exception in uploadSinglePart

2014-11-17 Thread Justin Mills
Spark 1.1.0, running on AWS EMR cluster using yarn-client as master.

I'm getting the following error when I attempt to save an RDD to S3. I've
narrowed it down to a single partition that is ~150Mb in size (versus the
other partitions that are closer to 1 Mb). I am able to work around this by
saving to an HDFS file first, then using the hadoop distcp command to get
the data up on S3, but it seems that Spark should be able to handle that.
I've even tried writing to HDFS, then creating something like the following
to get that up on S3:

sc.textFile(hdfsPath).saveAsTextFile(s3Path)

Here's the exception:

java.io.IOException: exception in uploadSinglePart

com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.uploadSinglePart(MultipartUploadOutputStream.java:164)

com.amazon.ws.emr.hadoop.fs.s3n.MultipartUploadOutputStream.close(MultipartUploadOutputStream.java:220)

org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)

org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:105)

org.apache.hadoop.io.compress.CompressorStream.close(CompressorStream.java:106)
java.io.FilterOutputStream.close(FilterOutputStream.java:160)

org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:109)

org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:101)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:994)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:979)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Any ideas of what the underlying issue could be here? It feels like a
timeout issue, but that's just a guess.

Thanks, Justin


Re: Spark streaming cannot receive any message from Kafka

2014-11-17 Thread Bill Jay
Hi all,

I found the reason for this issue. It seems that in the new version, if I do not
specify spark.default.parallelism in KafkaUtils.createStream, there will be
an exception at the Kafka stream creation stage. In the previous
versions, it seems Spark would use the default value.

Thanks!

Bill

On Thu, Nov 13, 2014 at 5:00 AM, Helena Edelson helena.edel...@datastax.com
 wrote:

 I encounter no issues with streaming from kafka to spark in 1.1.0. Do you
 perhaps have a version conflict?

 Helena
 On Nov 13, 2014 12:55 AM, Jay Vyas jayunit100.apa...@gmail.com wrote:

 Yup, very important that n > 1 for spark streaming jobs. If local, use
 local[2].

 The thing to remember is that your spark receiver will take a thread to
 itself and produce data , so u need another thread to consume it .

 In a cluster manager like yarn or mesos, the word thread Is not used
 anymore, I guess has different meaning- you need 2 or more free compute
 slots, and that should be guaranteed by looking to see how many free node
 managers are running etc.

 On Nov 12, 2014, at 7:53 PM, Shao, Saisai saisai.s...@intel.com
 wrote:

  Did you configure Spark master as local? It should be local[n], n > 1,
 for local mode. Besides, there’s a Kafka wordcount example in the Spark Streaming
 example, you can try that. I’ve tested with latest master, it’s OK.



 Thanks

 Jerry



 *From:* Tobias Pfeiffer [mailto:t...@preferred.jp t...@preferred.jp]
 *Sent:* Thursday, November 13, 2014 8:45 AM
 *To:* Bill Jay
 *Cc:* u...@spark.incubator.apache.org
 *Subject:* Re: Spark streaming cannot receive any message from Kafka



 Bill,



   However, when I am currently using Spark 1.1.0. the Spark streaming
 job cannot receive any messages from Kafka. I have not made any change to
 the code.



 Do you see any suspicious messages in the log output?



 Tobias






independent user sessions with a multi-user spark sql thriftserver (Spark 1.1)

2014-11-17 Thread Michael Allman
Hello,

We're running a spark sql thriftserver that several users connect to with 
beeline. One limitation we've run into is that the current working database 
(set with use db) is shared across all connections. So changing the 
database on one connection changes the database for all connections. This is 
also the case for spark sql settings, but that's less of an issue. Is there a 
way (or a hack) to make the current database selection independent for each 
beeline connection? I'm not afraid to hack into the source code if there's a 
straightforward fix/workaround, but I could use some guidance on what to hack 
on if that's required.

Thank you!

Michael
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: RDD.aggregate versus accumulables...

2014-11-17 Thread lordjoe
I have been playing with using accumulators (despite the possible error with
multiple attempts). These provide a convenient way to get some numbers while
still performing business logic.
I posted some sample code at http://lordjoesoftware.blogspot.com/.
Even if accumulators are not perfect today, future versions may improve
them, and they are a great way to monitor execution and get a sense of
performance on lazily executed systems.
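
As a minimal illustration of the pattern (lines is an assumed RDD[String];
the field-count check is made up for the example):

val processed = sc.accumulator(0L)
val badLines = sc.accumulator(0L)

val parsed = lines.map { line =>
  processed += 1L                         // count every record seen
  val fields = line.split(",")
  if (fields.length < 3) badLines += 1L   // count records that look malformed
  fields
}

parsed.count()                            // accumulator values are only dependable after an action
println(s"processed=${processed.value}, bad=${badLines.value}")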



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-aggregate-versus-accumulables-tp19044p19102.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Assigning input files to spark partitions

2014-11-17 Thread Pala M Muthaia
Hi Daniel,

Yes, that should work also. However, is it possible to set this up so that each
RDD has exactly one partition, without repartitioning (and thus incurring
extra cost)? Is there a mechanism similar to MR where we can ensure each
partition is assigned some amount of data by size, by setting some block
size parameter?



On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann daniel.siegm...@velos.io
wrote:

 On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia 
 mchett...@rocketfuelinc.com wrote


 No i don't want separate RDD because each of these partitions are being
 processed the same way (in my case, each partition corresponds to HBase
 keys belonging to one region server, and i will do HBase lookups). After
 that i have aggregations too, hence all these partitions should be in the
 same RDD. The reason to follow the partition structure is to limit
 concurrent HBase lookups targeting a single region server.


 Neither of these is necessarily a barrier to using separate RDDs. You can
 define the function you want to use and then pass it to multiple map
 methods. Then you could union all the RDDs to do your aggregations. For
 example, it might look something like this:

 val paths: Seq[String] = ... // the paths to the files you want to load
 def myFunc(t: T) = ... // the function to apply to each element
 val rdds = paths.map { path =>
   sc.textFile(path).map(myFunc)
 }
 val completeRdd = sc.union(rdds)

 Does that make any sense?

 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 54 W 40th St, New York, NY 10018
 E: daniel.siegm...@velos.io W: www.velos.io



Re: Assigning input files to spark partitions

2014-11-17 Thread Daniel Siegmann
I'm not aware of any such mechanism.

On Mon, Nov 17, 2014 at 2:55 PM, Pala M Muthaia mchett...@rocketfuelinc.com
 wrote:

 Hi Daniel,

 Yes that should work also. However, is it possible to setup so that each
 RDD has exactly one partition, without repartitioning (and thus incurring
 extra cost)? Is there a mechanism similar to MR where we can ensure each
 partition is assigned some amount of data by size, by setting some block
 size parameter?



 On Thu, Nov 13, 2014 at 1:05 PM, Daniel Siegmann daniel.siegm...@velos.io
  wrote:

 On Thu, Nov 13, 2014 at 3:24 PM, Pala M Muthaia 
 mchett...@rocketfuelinc.com wrote


 No i don't want separate RDD because each of these partitions are being
 processed the same way (in my case, each partition corresponds to HBase
 keys belonging to one region server, and i will do HBase lookups). After
 that i have aggregations too, hence all these partitions should be in the
 same RDD. The reason to follow the partition structure is to limit
 concurrent HBase lookups targeting a single region server.


 Neither of these is necessarily a barrier to using separate RDDs. You can
 define the function you want to use and then pass it to multiple map
 methods. Then you could union all the RDDs to do your aggregations. For
 example, it might look something like this:

 val paths: Seq[String] = ... // the paths to the files you want to load
 def myFunc(t: T) = ... // the function to apply to each element
 val rdds = paths.map { path =>
   sc.textFile(path).map(myFunc)
 }
 val completeRdd = sc.union(rdds)

 Does that make any sense?

 --
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning

 54 W 40th St, New York, NY 10018
 E: daniel.siegm...@velos.io W: www.velos.io





-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

54 W 40th St, New York, NY 10018
E: daniel.siegm...@velos.io W: www.velos.io


Re: independent user sessions with a multi-user spark sql thriftserver (Spark 1.1)

2014-11-17 Thread Michael Armbrust
This is an unfortunate/known issue that we are hoping to address in the
next release: https://issues.apache.org/jira/browse/SPARK-2087

I'm not sure how straightforward a fix would be, but it would involve
keeping / setting the SessionState for each connection to the server.  It
would be great if you could share any findings on that JIRA.  Thanks!

On Mon, Nov 17, 2014 at 11:01 AM, Michael Allman mich...@videoamp.com
wrote:

 Hello,

 We're running a spark sql thriftserver that several users connect to with
 beeline. One limitation we've run into is that the current working database
 (set with use db) is shared across all connections. So changing the
 database on one connection changes the database for all connections. This
 is also the case for spark sql settings, but that's less of an issue. Is
 there a way (or a hack) to make the current database selection independent
 for each beeline connection? I'm not afraid to hack into the source code if
 there's a straightforward fix/workaround, but I could use some guidance on
 what to hack on if that's required.

 Thank you!

 Michael
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Exception in spark sql when running a group by query

2014-11-17 Thread Michael Armbrust
You are perhaps hitting an issue that was fixed by #3248
https://github.com/apache/spark/pull/3248?
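
Until that lands, one possible workaround (untested against this schema; just
the generic trick of computing the expression once in a subquery and grouping
on its alias) might look like this, assuming a HiveContext named sqlContext:

val result = sqlContext.sql("""
  SELECT pst_date, count(*) AS num_xyzs
  FROM (
    SELECT from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200),
      '/MM/dd') AS pst_date
    FROM all_matched_abc
  ) t
  GROUP BY pst_date
""")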

On Mon, Nov 17, 2014 at 9:58 AM, Sadhan Sood sadhan.s...@gmail.com wrote:

 While testing Spark SQL, we ran this group-by-on-an-expression query and got
 an exception. The same query worked fine on Hive.

 SELECT from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200),
   '/MM/dd') as pst_date,
 count(*) as num_xyzs
   FROM
 all_matched_abc
   GROUP BY
 from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200),
   '/MM/dd')

 14/11/17 17:41:46 ERROR thriftserver.SparkSQLDriver: Failed in [SELECT 
 from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200),
   '/MM/dd') as pst_date,
 count(*) as num_xyzs
   FROM
 all_matched_abc
   GROUP BY
 from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200),
   '/MM/dd')
 ]
 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression 
 not in GROUP BY: 
 HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor(((CAST(xyz#183.whenrequestreceived
  AS whenrequestreceived#187L, DoubleType) / 1000.0) - CAST(25200, 
 DoubleType))),/MM/dd) AS pst_date#179, tree:

 Aggregate 
 [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor(((CAST(xyz#183.whenrequestreceived,
  DoubleType) / 1000.0) - CAST(25200, DoubleType))),/MM/dd)], 
 [HiveSimpleUdf#org.apache.hadoop.hive.ql.udf.UDFFromUnixTime(HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor(((CAST(xyz#183.whenrequestreceived
  AS whenrequestreceived#187L, DoubleType) / 1000.0) - CAST(25200, 
 DoubleType))),/MM/dd) AS pst_date#179,COUNT(1) AS num_xyzs#180L]

  MetastoreRelation default, all_matched_abc, None
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:127)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$6.apply(Analyzer.scala:125)
 at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:125)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113)
 at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
 at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
 at 
 scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
 at 
 scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
 at 
 scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
 at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
 at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:411)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:411)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:412)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:412)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:413)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:413)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
 at 
 

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Michael Armbrust
What version of Spark SQL?

On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen zhpeng...@gmail.com wrote:

 Hi all,

 We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, and we
 got the exceptions below; has anyone else seen these before?

 java.lang.ExceptionInInitializerError
 at
 org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92)
 at
 org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:51)
 at
 org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:48)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 Caused by: java.lang.NullPointerException
 at
 scala.reflect.internal.Types$TypeRef.computeHashCode(Types.scala:2358)
 at scala.reflect.internal.Types$UniqueType.init(Types.scala:1304)
 at scala.reflect.internal.Types$TypeRef.init(Types.scala:2341)
 at
 scala.reflect.internal.Types$NoArgsTypeRef.init(Types.scala:2137)
 at
 scala.reflect.internal.Types$TypeRef$$anon$6.init(Types.scala:2544)
 at scala.reflect.internal.Types$TypeRef$.apply(Types.scala:2544)
 at scala.reflect.internal.Types$class.typeRef(Types.scala:3615)
 at scala.reflect.internal.SymbolTable.typeRef(SymbolTable.scala:13)
 at
 scala.reflect.internal.Symbols$TypeSymbol.newTypeRef(Symbols.scala:2752)
 at
 scala.reflect.internal.Symbols$TypeSymbol.typeConstructor(Symbols.scala:2806)
 at
 scala.reflect.internal.Symbols$SymbolContextApiImpl.toTypeConstructor(Symbols.scala:103)
 at
 scala.reflect.internal.Symbols$TypeSymbol.toTypeConstructor(Symbols.scala:2698)
 at
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$typecreator1$1.apply(CodeGenerator.scala:46)
 at
 scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
 at
 scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
 at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
 at scala.reflect.api.Universe.typeOf(Universe.scala:59)
 at
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.init(CodeGenerator.scala:46)
 at
 org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.init(GenerateProjection.scala:29)
 at
 org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.clinit(GenerateProjection.scala)
 ... 15 more
 --
 Best Regards



RDD Blocks skewing to just few executors

2014-11-17 Thread mtimper
Hi I'm running a standalone cluster with 8 worker servers. 
I'm developing a streaming app that is adding new lines of text to several
different RDDs each batch interval. Each line has a well randomized unique
identifier that I'm trying to use for partitioning, since the data stream
does contain duplicate lines. I'm doing the partitioning with this:

val eventsByKey = streamRDD.map { event => (getUID(event), event) }
val partionedEventsRdd = sparkContext.parallelize(eventsByKey.toSeq)
   .partitionBy(new HashPartitioner(numPartions)).map(e => e._2)

I'm adding to the existing RDD like with this:

val mergedRDD = currentRDD.zipPartitions(partionedEventsRdd, true) {
  (currentIter, batchIter) =>
    val uniqEvents = ListBuffer[String]()
    val uids = Map[String, Boolean]()
    Array(currentIter, batchIter).foreach { iter =>
      iter.foreach { event =>
        val uid = getUID(event)
        if (!uids.contains(uid)) {
          uids(uid) = true
          uniqEvents += event
        }
      }
    }
    uniqEvents.iterator
}

val count = mergedRDD.count

The reason I'm doing it this way is that when I was doing:

val mergedRDD = currentRDD.union(batchRDD).coalesce(numPartions).distinct
val count = mergedRDD.count

It would start taking a long time and doing a lot of shuffles.

The zipPartitions approach does perform better, though after running an hour
or so I start seeing this 
in the webUI.

http://apache-spark-user-list.1001560.n3.nabble.com/file/n19112/Executors.png 

As you can see, most of the data is skewing to just 2 executors, with 1
getting more than half the Blocks. These become a hotspot and eventually I
start seeing OOM errors. I've tried this half a dozen times; the 'hot'
executors change, but not the skewing behavior.

Any idea what is going on here?

Thanks,

Mike
 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Blocks-skewing-to-just-few-executors-tp19112.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark streaming cannot receive any message from Kafka

2014-11-17 Thread Shao, Saisai
Hi Bill,

Would you mind describing what you found a little more specifically? I'm not 
sure there's a parameter in KafkaUtils.createStream where you can specify the 
Spark parallelism; also, what does the exception stack look like?

Thanks
Jerry

From: Bill Jay [mailto:bill.jaypeter...@gmail.com]
Sent: Tuesday, November 18, 2014 2:47 AM
To: Helena Edelson
Cc: Jay Vyas; u...@spark.incubator.apache.org; Tobias Pfeiffer; Shao, Saisai
Subject: Re: Spark streaming cannot receive any message from Kafka

Hi all,

I found the reason for this issue. It seems that in the new version, if I do not 
specify spark.default.parallelism when using KafkaUtils.createStream, there will be an 
exception at the Kafka stream creation stage. In previous versions, it 
seems Spark would use a default value.

Thanks!

Bill

On Thu, Nov 13, 2014 at 5:00 AM, Helena Edelson 
helena.edel...@datastax.commailto:helena.edel...@datastax.com wrote:

I encounter no issues with streaming from kafka to spark in 1.1.0. Do you 
perhaps have a version conflict?

Helena
On Nov 13, 2014 12:55 AM, Jay Vyas 
jayunit100.apa...@gmail.commailto:jayunit100.apa...@gmail.com wrote:
Yup, very important that n > 1 for spark streaming jobs. If local, use 
local[2]

The thing to remember is that your spark receiver will take a thread to itself 
and produce data, so you need another thread to consume it.

In a cluster manager like YARN or Mesos, the word thread is not used anymore, or I 
guess has a different meaning: you need 2 or more free compute slots, and that 
should be guaranteed by looking at how many free node managers are running, 
etc.

On Nov 12, 2014, at 7:53 PM, Shao, Saisai 
saisai.s...@intel.commailto:saisai.s...@intel.com wrote:
Did you configure the Spark master as local? It should be local[n], n > 1, for local 
mode. Besides, there's a Kafka wordcount example in the Spark Streaming examples; you 
can try that. I've tested with the latest master, and it's OK.

Thanks
Jerry

From: Tobias Pfeiffer [mailto:t...@preferred.jp]
Sent: Thursday, November 13, 2014 8:45 AM
To: Bill Jay
Cc: u...@spark.incubator.apache.orgmailto:u...@spark.incubator.apache.org
Subject: Re: Spark streaming cannot receive any message from Kafka

Bill,

However, when I am currently using Spark 1.1.0. the Spark streaming job cannot 
receive any messages from Kafka. I have not made any change to the code.

Do you see any suspicious messages in the log output?

Tobias




Re: Using data in RDD to specify HDFS directory to write to

2014-11-17 Thread jschindler
Yes, thank you for the suggestion. The error I found below was in the worker
logs.

AssociationError [akka.tcp://sparkwor...@cloudera01.local.company.com:7078]
- [akka.tcp://sparkexecu...@cloudera01.local.company.com:33329]: Error
[Association failed with
[akka.tcp://sparkexecu...@cloudera01.local.company.com:33329]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkexecu...@cloudera01.local.company.com:33329]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: cloudera01.local.company.com/10.40.19.67:33329
]

I looked into suggestions for this type of error, and before I found out the
real reason for it I upgraded my CDH to 5.2 so I could try setting
the driver and executor ports rather than having Spark choose them at random.
My boss later turned off iptables and I no longer get that error. I do get a
different one, however. I have gone back into my project and changed my
Hadoop version to 2.5.0-cdh5.2.0, so that should not be a problem.

from the master logs

2014-11-17 18:09:49,707 ERROR akka.remote.EndpointWriter: AssociationError
[akka.tcp://sparkmas...@cloudera01.local.local.com:7077] -
[akka.tcp://spark@localhost:38181]: Error [Association failed with
[akka.tcp://spark@localhost:38181]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://spark@localhost:38181]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: localhost/127.0.0.1:38181
]

2014-11-17 18:19:08,271 INFO akka.actor.LocalActorRef: Message
[akka.remote.transport.AssociationHandle$Disassociated] from
Actor[akka://sparkMaster/deadLetters] to
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.40.19.67%3A37795-29#-1248895472]
was not delivered. [30] dead letters encountered. This logging can be turned
off or adjusted with configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
2014-11-17 18:19:28,251 ERROR Remoting:
org.apache.spark.deploy.ApplicationDescription; local class incompatible:
stream classdesc serialVersionUID = 583745679236071411, local class
serialVersionUID = 7674242335164700840
java.io.InvalidClassException:
org.apache.spark.deploy.ApplicationDescription; local class incompatible:
stream classdesc serialVersionUID = 583745679236071411, local class
serialVersionUID = 7674242335164700840
at
java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
at
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
at
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at
akka.serialization.JavaSerializer$$anonfun$1.apply(Serializer.scala:136)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at
akka.serialization.JavaSerializer.fromBinary(Serializer.scala:136)
at
akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
at scala.util.Try$.apply(Try.scala:161)
at
akka.serialization.Serialization.deserialize(Serialization.scala:98)
at
akka.remote.serialization.MessageContainerSerializer.fromBinary(MessageContainerSerializer.scala:58)
at
akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
at scala.util.Try$.apply(Try.scala:161)
at
akka.serialization.Serialization.deserialize(Serialization.scala:98)
at
akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
at
akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:55)
at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:55)
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:73)
at
akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:764)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at

Re: Communication between Driver and Executors

2014-11-17 Thread Tobias Pfeiffer
Hi,

so I didn't manage to get the Broadcast variable with a new value
distributed to my executors in YARN mode. In local mode it worked fine, but
when running on YARN either nothing happened (when unpersist() was called
on the driver) or I got a TimeoutException (when called on the executor).
I finally dropped the use of broadcast variables and added an HTTP polling
mechanism from the executors to the driver. I find that a bit suboptimal,
in particular since there is this whole Akka infrastructure already running
and I should be able to just send messages around. However, Spark does not
seem to encourage this. (In general I find that private is a bit overused
in the Spark codebase...)
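
For the curious, such a polling setup can look roughly like the sketch below
(not the actual code; the /config endpoint, the port, driverHost and
applyConfig are made-up names for illustration):

import java.net.InetSocketAddress
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}

// on the driver: expose the current value over plain HTTP
var currentConfig: String = "initial"   // updated by the driver as needed
val server = HttpServer.create(new InetSocketAddress(8099), 0)
server.createContext("/config", new HttpHandler {
  def handle(ex: HttpExchange): Unit = {
    val body = currentConfig.getBytes("UTF-8")
    ex.sendResponseHeaders(200, body.length)
    ex.getResponseBody.write(body)
    ex.close()
  }
})
server.start()

// on the executors: poll from inside the tasks
rdd.mapPartitions { iter =>
  val config = scala.io.Source.fromURL(s"http://$driverHost:8099/config").mkString
  iter.map(record => applyConfig(config, record))
}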

Thanks
Tobias


Re: Is there setup and cleanup function in spark?

2014-11-17 Thread Tobias Pfeiffer
Hi,

On Fri, Nov 14, 2014 at 2:49 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Ok, then we need another trick.

 let's have an *implicit lazy var connection/context* around our code. And
 setup() will trigger the eval and initialization.


Due to lazy evaluation, I think having setup/teardown is a bit tricky. In
particular teardown, because it is not easy to execute code after all
computation is done. You can check
http://apache-spark-user-list.1001560.n3.nabble.com/Keep-state-inside-map-function-tp10968p11009.html
for an example of what worked for me.
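
A rough sketch of the per-partition variant of that idea (createConnection and
process are placeholder helpers, not a real API):

rdd.mapPartitions { iter =>
  // "setup": runs once per partition, on the executor
  val conn = createConnection()
  // the map over iter is lazy, so force it before tearing down
  val results = iter.map(record => process(conn, record)).toList
  // "teardown": runs after the partition has been fully processed
  conn.close()
  results.iterator
}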

Tobias


Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Eric Zhen
Hi Michael,

We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4.

On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust mich...@databricks.com
wrote:

 What version of Spark SQL?

 On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen zhpeng...@gmail.com wrote:

 Hi all,

 We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, and we
 got the exceptions below; has anyone else seen these before?

 java.lang.ExceptionInInitializerError
 at
 org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92)
 at
 org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:51)
 at
 org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:48)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 Caused by: java.lang.NullPointerException
 at
 scala.reflect.internal.Types$TypeRef.computeHashCode(Types.scala:2358)
 at
 scala.reflect.internal.Types$UniqueType.init(Types.scala:1304)
 at scala.reflect.internal.Types$TypeRef.init(Types.scala:2341)
 at
 scala.reflect.internal.Types$NoArgsTypeRef.init(Types.scala:2137)
 at
 scala.reflect.internal.Types$TypeRef$$anon$6.init(Types.scala:2544)
 at scala.reflect.internal.Types$TypeRef$.apply(Types.scala:2544)
 at scala.reflect.internal.Types$class.typeRef(Types.scala:3615)
 at
 scala.reflect.internal.SymbolTable.typeRef(SymbolTable.scala:13)
 at
 scala.reflect.internal.Symbols$TypeSymbol.newTypeRef(Symbols.scala:2752)
 at
 scala.reflect.internal.Symbols$TypeSymbol.typeConstructor(Symbols.scala:2806)
 at
 scala.reflect.internal.Symbols$SymbolContextApiImpl.toTypeConstructor(Symbols.scala:103)
 at
 scala.reflect.internal.Symbols$TypeSymbol.toTypeConstructor(Symbols.scala:2698)
 at
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$typecreator1$1.apply(CodeGenerator.scala:46)
 at
 scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
 at
 scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
 at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
 at scala.reflect.api.Universe.typeOf(Universe.scala:59)
 at
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.init(CodeGenerator.scala:46)
 at
 org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.init(GenerateProjection.scala:29)
 at
 org.apache.spark.sql.catalyst.expressions.codegen.GenerateProjection$.clinit(GenerateProjection.scala)
 ... 15 more
 --
 Best Regards





-- 
Best Regards


Re: Status of MLLib exporting models to PMML

2014-11-17 Thread Manish Amde
Hi Charles,

I am not aware of other storage formats. Perhaps Sean or Sandy can
elaborate more given their experience with Oryx.

There is work by Smola et al at Google that talks about large scale model
update and deployment.
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu

-Manish

On Sunday, November 16, 2014, Charles Earl charles.ce...@gmail.com wrote:

 Manish and others,
 A follow up question on my mind is whether there are protobuf (or other
 binary format) frameworks in the vein of PMML. Perhaps scientific data
 storage frameworks like netcdf, root are possible also.
 I like the comprehensiveness of PMML but as you mention the complexity of
 management for large models is a concern.
 Cheers

 On Fri, Nov 14, 2014 at 1:35 AM, Manish Amde manish...@gmail.com
 javascript:_e(%7B%7D,'cvml','manish...@gmail.com'); wrote:

 @Aris, we are closely following the PMML work that is going on and as
 Xiangrui mentioned, it might be easier to migrate models such as logistic
 regression and then migrate trees. Some of the models get fairly large (as
 pointed out by Sung Chung) with deep trees as building blocks and we might
 have to consider a distributed storage and prediction strategy.


 On Tuesday, November 11, 2014, Xiangrui Meng men...@gmail.com
 javascript:_e(%7B%7D,'cvml','men...@gmail.com'); wrote:

 Vincenzo sent a PR and included k-means as an example. Sean is helping
 review it. PMML standard is quite large. So we may start with simple
 model export, like linear methods, then move forward to tree-based.
 -Xiangrui

 On Mon, Nov 10, 2014 at 11:27 AM, Aris arisofala...@gmail.com wrote:
  Hello Spark and MLLib folks,
 
  So a common problem in the real world of using machine learning is
 that some
  data analysis use tools like R, but the more data engineers out
 there will
  use more advanced systems like Spark MLLib or even Python Scikit Learn.
 
  In the real world, I want to have a system where multiple different
  modeling environments can learn from data / build models, represent the
  models in a common language, and then have a layer which just takes the
  model and run model.predict() all day long -- scores the models in
 other
  words.
 
  It looks like the project openscoring.io and jpmml-evaluator are some
  amazing systems for this, but they fundamentally use PMML as the model
  representation here.
 
  I have read some JIRA tickets that Xiangrui Meng is interested in
 getting
  PMML implemented to export MLLib models, is that happening? Further,
 would
  something like Manish Amde's boosted ensemble tree methods be
 representable
  in PMML?
 
  Thank you!!
  Aris

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




 --
 - Charles



Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Eric Zhen
Yes, it always appears on a subset of the tasks in a stage (e.g. 100/100
(65 failed)), and it sometimes causes the stage to fail.

And there is another error; I'm not sure if there is a correlation.

java.lang.NoClassDefFoundError: Could not initialize class
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$
at
org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:114)
at
org.apache.spark.sql.execution.Filter.conditionEvaluator$lzycompute(basicOperators.scala:55)
at
org.apache.spark.sql.execution.Filter.conditionEvaluator(basicOperators.scala:55)
at
org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:58)
at
org.apache.spark.sql.execution.Filter$$anonfun$2.apply(basicOperators.scala:57)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)

On Tue, Nov 18, 2014 at 11:41 AM, Michael Armbrust mich...@databricks.com
wrote:

 Interesting, I believe we have run that query with version 1.1.0 with
 codegen turned on and not much has changed there.  Is the error
 deterministic?

 On Mon, Nov 17, 2014 at 7:04 PM, Eric Zhen zhpeng...@gmail.com wrote:

 Hi Michael,

 We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4.

 On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust mich...@databricks.com
  wrote:

 What version of Spark SQL?

 On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen zhpeng...@gmail.com wrote:

 Hi all,

 We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, and we
 got the exceptions below; has anyone else seen these before?

 java.lang.ExceptionInInitializerError
 at
 org.apache.spark.sql.execution.SparkPlan.newProjection(SparkPlan.scala:92)
 at
 org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:51)
 at
 org.apache.spark.sql.execution.Exchange$$anonfun$execute$1$$anonfun$1.apply(Exchange.scala:48)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 at
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 Caused by: java.lang.NullPointerException
 at
 scala.reflect.internal.Types$TypeRef.computeHashCode(Types.scala:2358)
 at
 scala.reflect.internal.Types$UniqueType.init(Types.scala:1304)
 at scala.reflect.internal.Types$TypeRef.init(Types.scala:2341)
 at
 scala.reflect.internal.Types$NoArgsTypeRef.init(Types.scala:2137)
 at
 scala.reflect.internal.Types$TypeRef$$anon$6.init(Types.scala:2544)
 at scala.reflect.internal.Types$TypeRef$.apply(Types.scala:2544)
 at scala.reflect.internal.Types$class.typeRef(Types.scala:3615)
 at
 scala.reflect.internal.SymbolTable.typeRef(SymbolTable.scala:13)
 at
 scala.reflect.internal.Symbols$TypeSymbol.newTypeRef(Symbols.scala:2752)
 at
 scala.reflect.internal.Symbols$TypeSymbol.typeConstructor(Symbols.scala:2806)
 at
 

Re: Status of MLLib exporting models to PMML

2014-11-17 Thread Sean Owen
I'm just using PMML. I haven't hit any limitation of its
expressiveness for the model types it supports. I don't think there
is a point in defining a new format for models, except that PMML
can get very big. Still, just compressing the XML gets it down to a
manageable size for just about any realistic model.*

I can imagine some kind of translation from PMML-in-XML to
PMML-in-something-else that is more compact. I've not seen anyone do
this.

* there still aren't formats for factored matrices and probably won't
ever quite be, since they're just too large for a file format.
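
To illustrate, a gzip round-trip over the exported XML is only a few lines
(pmmlXml stands in for whatever your exporter produced):

import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

def gzipBytes(pmmlXml: String): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val gz = new GZIPOutputStream(bos)
  gz.write(pmmlXml.getBytes("UTF-8"))
  gz.close()                 // closing flushes and writes the gzip trailer
  bos.toByteArray
}

// println(s"${pmmlXml.length} chars -> ${gzipBytes(pmmlXml).length} bytes")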

On Tue, Nov 18, 2014 at 5:34 AM, Manish Amde manish...@gmail.com wrote:
 Hi Charles,

 I am not aware of other storage formats. Perhaps Sean or Sandy can elaborate
 more given their experience with Oryx.

 There is work by Smola et al at Google that talks about large scale model
 update and deployment.
 https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu

 -Manish


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is there setup and cleanup function in spark?

2014-11-17 Thread Jianshi Huang
I see. Agree that lazy eval is not suitable for proper setup and teardown.

We also abandoned it due to an inherent incompatibility between implicit and
lazy. It was fun to come up with this trick though.

Jianshi

On Tue, Nov 18, 2014 at 10:28 AM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Fri, Nov 14, 2014 at 2:49 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Ok, then we need another trick.

 let's have an *implicit lazy var connection/context* around our code.
 And setup() will trigger the eval and initialization.


 Due to lazy evaluation, I think having setup/teardown is a bit tricky. In
 particular teardown, because it is not easy to execute code after all
 computation is done. You can check
 http://apache-spark-user-list.1001560.n3.nabble.com/Keep-state-inside-map-function-tp10968p11009.html
 for an example of what worked for me.

 Tobias




-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github  Blog: http://huangjs.github.com/


Running PageRank in GraphX

2014-11-17 Thread Deep Pradhan
Hi,
I just ran the PageRank code in GraphX with some sample data. What I am
seeing is that the total rank changes drastically if I change the number of
iterations from 10 to 100. Why is that so?

Thank You


Null pointer exception with larger datasets

2014-11-17 Thread Naveen Kumar Pokala
Hi,

I have a list of Students whose size is one lakh (100,000), and I am trying to save it
to a file. It is throwing a null pointer exception.

JavaRDD<Student> distData = sc.parallelize(list);

distData.saveAsTextFile("hdfs://master/data/spark/instruments.txt");


14/11/18 01:33:21 WARN scheduler.TaskSetManager: Lost task 5.0 in stage 0.0 
(TID 5, master): java.lang.NullPointerException:
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158)
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)


How to handle this?

-Naveen


Probability in Naive Bayes

2014-11-17 Thread Samarth Mailinglist
I am trying to use Naive Bayes for a project of mine in Python and I want
to obtain the probability value after having built the model.

Suppose I have two classes - A and B. Currently there is an API to find
which class a sample belongs to (predict). Now, I want to find the
probability of it belonging to Class A or Class B.

Can you please provide any details about how I can obtain the probability
values for it belonging to either class A or class B?


Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Jianshi Huang
Any notable issues for using Scala 2.11? Is it stable now?

Or can I use Scala 2.11 in my spark application and use Spark dist build
with 2.10 ?

I'm looking forward to migrate to 2.11 for some quasiquote features.
Couldn't make it run in 2.10...

Cheers,
-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github  Blog: http://huangjs.github.com/


Re: Probability in Naive Bayes

2014-11-17 Thread Sean Owen
This was recently discussed on this mailing list. You can't get the
probabilities out directly now, but you can hack a bit to get the internal
data structures of NaiveBayesModel and compute it from there.

If you really mean the probability of either A or B, then if your classes
are exclusive it is just the sum of the class probabilities. You won't be
able to compute this otherwise from what Naive Bayes computes.
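
A small sketch of that computation, assuming you have somehow gotten hold of
the model's log class priors (pi) and log conditional probabilities (theta);
how you reach them (copying the class, reflection, etc.) is the hack part:

def classProbabilities(pi: Array[Double],
                       theta: Array[Array[Double]],
                       x: Array[Double]): Array[Double] = {
  // unnormalized log posterior for each class: log prior + theta(i) . x
  val logPosterior = pi.indices.map { i =>
    pi(i) + theta(i).zip(x).map { case (t, xj) => t * xj }.sum
  }
  // normalize with log-sum-exp to turn log scores into probabilities
  val maxLog = logPosterior.max
  val exps = logPosterior.map(lp => math.exp(lp - maxLog))
  val total = exps.sum
  exps.map(_ / total).toArray
}
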
On Nov 18, 2014 7:42 AM, Samarth Mailinglist mailinglistsama...@gmail.com
wrote:

 I am trying to use Naive Bayes for a project of mine in Python and I want
 to obtain the probability value after having built the model.

 Suppose I have two classes - A and B. Currently there is an API to find
 which class a sample belongs to (predict). Now, I want to find the
 probability of it belonging to Class A or Class B.

 Can you please provide any details about how I can obtain the probability
 values for it belonging to either class A or class B?




Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Prashant Sharma
It is safe in the sense that we would help you with a fix if you run into
issues. I have used it, but since I worked on the patch my opinion may be
biased. I am using Scala 2.11 for day-to-day development. You should
check out the build instructions here:
https://github.com/ScrapCodes/spark-1/blob/patch-3/docs/building-spark.md

Prashant Sharma



On Tue, Nov 18, 2014 at 12:19 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Any notable issues for using Scala 2.11? Is it stable now?

 Or can I use Scala 2.11 in my spark application and use Spark dist build
 with 2.10 ?

 I'm looking forward to migrate to 2.11 for some quasiquote features.
 Couldn't make it run in 2.10...

 Cheers,
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/



Re: Check your cluster UI to ensure that workers are registered and have sufficient memory

2014-11-17 Thread lin_qili
I ran into this issue with Spark on YARN, version 1.0.2. Are there any
hints?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Check-your-cluster-UI-to-ensure-that-workers-are-registered-and-have-sufficient-memory-tp5358p19133.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How can I apply such an inner join in Spark Scala/Python

2014-11-17 Thread Akhil Das
Simple join would do it.

val a: List[(Int, Int)] = List((1,2),(2,4),(3,6))
val b: List[(Int, Int)] = List((1,3),(2,5),(3,6), (4,5),(5,6))

val A = sparkContext.parallelize(a)
val B = sparkContext.parallelize(b)

val ac = new PairRDDFunctions[Int, Int](A)

val C = ac.join(B)

C.foreach(println)


Thanks
Best Regards

On Mon, Nov 17, 2014 at 11:54 PM, Sean Owen so...@cloudera.com wrote:

 Just RDD.join() should be an inner join.

 On Mon, Nov 17, 2014 at 5:51 PM, Blind Faith person.of.b...@gmail.com
 wrote:
  So let us say I have RDDs A and B with the following values.
 
  A = [ (1, 2), (2, 4), (3, 6) ]
 
  B = [ (1, 3), (2, 5), (3, 6), (4, 5), (5, 6) ]
 
  I want to apply an inner join, such that I get the following as a result.
 
  C = [ (1, (2, 3)), (2, (4, 5)), (3, (6,6)) ]
 
  That is, those keys which are not present in A should disappear after the
  left inner join.
 
  How can I achieve that? I can see outerJoin functions but no innerJoin
  functions in the Spark RDD class.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Prashant Sharma
Looks like sbt/sbt -Pscala-2.11 is broken by a recent patch for improving the
Maven build.

Prashant Sharma



On Tue, Nov 18, 2014 at 12:57 PM, Prashant Sharma scrapco...@gmail.com
wrote:

 It is safe in the sense we would help you with the fix if you run into
 issues. I have used it, but since I worked on the patch the opinion can be
 biased. I am using scala 2.11 for day to day development. You should
 checkout the build instructions here :
 https://github.com/ScrapCodes/spark-1/blob/patch-3/docs/building-spark.md

 Prashant Sharma



 On Tue, Nov 18, 2014 at 12:19 PM, Jianshi Huang jianshi.hu...@gmail.com
 wrote:

 Any notable issues for using Scala 2.11? Is it stable now?

 Or can I use Scala 2.11 in my spark application and use Spark dist build
 with 2.10 ?

 I'm looking forward to migrate to 2.11 for some quasiquote features.
 Couldn't make it run in 2.10...

 Cheers,
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github  Blog: http://huangjs.github.com/





Re: Null pointer exception with larger datasets

2014-11-17 Thread Akhil Das
Make sure your list is not null; if it is null, then it's more like doing:

JavaRDD<Student> distData = sc.parallelize(null)

distData.foreach(println)



Thanks
Best Regards

On Tue, Nov 18, 2014 at 12:07 PM, Naveen Kumar Pokala 
npok...@spcapitaliq.com wrote:

 Hi,



 I am having list Students and size is one Lakh and I am trying to save the
 file. It is throwing null pointer exception.



 JavaRDD<Student> distData = sc.parallelize(list);



 distData.saveAsTextFile("hdfs://master/data/spark/instruments.txt");





 14/11/18 01:33:21 WARN scheduler.TaskSetManager: Lost task 5.0 in stage
 0.0 (TID 5, master): java.lang.NullPointerException:


 
 org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158)


 org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1158)

 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)


 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:984)


 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)

 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)

 org.apache.spark.scheduler.Task.run(Task.scala:54)


 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)


 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)


 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

 java.lang.Thread.run(Thread.java:745)





 How to handle this?



 -Naveen



Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Ye Xianjin
Hi Prashant Sharma, 

It's not even OK to build with the scala-2.11 profile on my machine.

I just checked out master (c6e0c2ab1c29c184a9302d23ad75e4ccd8060242) and
ran sbt/sbt -Pscala-2.11 clean assembly:

.. skip the normal part
info] Resolving org.scalamacros#quasiquotes_2.11;2.0.1 ...
[warn] module not found: org.scalamacros#quasiquotes_2.11;2.0.1
[warn]  local: tried
[warn]   
/Users/yexianjin/.ivy2/local/org.scalamacros/quasiquotes_2.11/2.0.1/ivys/ivy.xml
[warn]  public: tried
[warn]   
https://repo1.maven.org/maven2/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
[warn]  central: tried
[warn]   
https://repo1.maven.org/maven2/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
[warn]  apache-repo: tried
[warn]   
https://repository.apache.org/content/repositories/releases/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
[warn]  jboss-repo: tried
[warn]   
https://repository.jboss.org/nexus/content/repositories/releases/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
[warn]  mqtt-repo: tried
[warn]   
https://repo.eclipse.org/content/repositories/paho-releases/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
[warn]  cloudera-repo: tried
[warn]   
https://repository.cloudera.com/artifactory/cloudera-repos/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
[warn]  mapr-repo: tried
[warn]   
http://repository.mapr.com/maven/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
[warn]  spring-releases: tried
[warn]   
https://repo.spring.io/libs-release/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
[warn]  spark-staging: tried
[warn]   
https://oss.sonatype.org/content/repositories/orgspark-project-1085/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
[warn]  spark-staging-hive13: tried
[warn]   
https://oss.sonatype.org/content/repositories/orgspark-project-1089/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
[warn]  apache.snapshots: tried
[warn]   
http://repository.apache.org/snapshots/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
[warn]  Maven2 Local: tried
[warn]   
file:/Users/yexianjin/.m2/repository/org/scalamacros/quasiquotes_2.11/2.0.1/quasiquotes_2.11-2.0.1.pom
[info] Resolving jline#jline;2.12 ...
[warn] ::
[warn] ::  UNRESOLVED DEPENDENCIES ::
[warn] ::
[warn] :: org.scalamacros#quasiquotes_2.11;2.0.1: not found
[warn] ::
[info] Resolving org.scala-lang#scala-library;2.11.2 ...
[warn]
[warn] Note: Unresolved dependencies path:
[warn] org.scalamacros:quasiquotes_2.11:2.0.1 
((com.typesafe.sbt.pom.MavenHelper) MavenHelper.scala#L76)
[warn]  +- org.apache.spark:spark-catalyst_2.11:1.2.0-SNAPSHOT
[info] Resolving jline#jline;2.12 ...
[info] Done updating.
[info] Updating {file:/Users/yexianjin/spark/}streaming-twitter...
[info] Updating {file:/Users/yexianjin/spark/}streaming-zeromq...
[info] Updating {file:/Users/yexianjin/spark/}streaming-flume...
[info] Updating {file:/Users/yexianjin/spark/}streaming-mqtt...
[info] Resolving jline#jline;2.12 ...
[info] Done updating.
[info] Resolving com.esotericsoftware.minlog#minlog;1.2 ...
[info] Updating {file:/Users/yexianjin/spark/}streaming-kafka...
[info] Resolving jline#jline;2.12 ...
[info] Done updating.
[info] Resolving jline#jline;2.12 ...
[info] Done updating.
[info] Resolving jline#jline;2.12 ...
[info] Done updating.
[info] Resolving org.apache.kafka#kafka_2.11;0.8.0 ...
[warn] module not found: org.apache.kafka#kafka_2.11;0.8.0
[warn]  local: tried
[warn]   
/Users/yexianjin/.ivy2/local/org.apache.kafka/kafka_2.11/0.8.0/ivys/ivy.xml
[warn]  public: tried
[warn]   
https://repo1.maven.org/maven2/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom
[warn]  central: tried
[warn]   
https://repo1.maven.org/maven2/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom
[warn]  apache-repo: tried
[warn]   
https://repository.apache.org/content/repositories/releases/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom
[warn]  jboss-repo: tried
[warn]   
https://repository.jboss.org/nexus/content/repositories/releases/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom
[warn]  mqtt-repo: tried
[warn]   
https://repo.eclipse.org/content/repositories/paho-releases/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom
[warn]  cloudera-repo: tried
[warn]   
https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom
[warn]  mapr-repo: tried
[warn]   
http://repository.mapr.com/maven/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom
[warn]  spring-releases: tried
[warn]   
https://repo.spring.io/libs-release/org/apache/kafka/kafka_2.11/0.8.0/kafka_2.11-0.8.0.pom
[warn]  

Spark On Yarn Issue: Initial job has not accepted any resources

2014-11-17 Thread LinCharlie
Hi All,

I was submitting a spark_program.jar to a `spark on yarn cluster` from a
driver machine in yarn-client mode. Here is the spark-submit command I used:

./spark-submit --master yarn-client --class
com.charlie.spark.grax.OldFollowersExample --queue dt_spark
~/script/spark-flume-test-0.1-SNAPSHOT-hadoop2.0.0-mr1-cdh4.2.1.jar

The queue `dt_spark` was free, and the program was submitted successfully and
ran on the cluster. But the console showed repeatedly:

14/11/18 15:11:48 WARN YarnClientClusterScheduler: Initial job has not accepted
any resources; check your cluster UI to ensure that workers are registered and
have sufficient memory

I checked the cluster UI logs and found no errors:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/home/disk5/yarn/usercache/linqili/filecache/6957209742046754908/spark-assembly-1.0.2-hadoop2.0.0-cdh4.2.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/home/hadoop/hadoop-2.0.0-cdh4.2.1/share/hadoop/common/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/11/18 14:28:16 INFO SecurityManager: Changing view acls to: hadoop,linqili
14/11/18 14:28:16 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(hadoop, linqili)
14/11/18 14:28:17 INFO Slf4jLogger: Slf4jLogger started
14/11/18 14:28:17 INFO Remoting: Starting remoting
14/11/18 14:28:17 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkyar...@longzhou-hdp3.lz.dscc:37187]
14/11/18 14:28:17 INFO Remoting: Remoting now listens on addresses: 
[akka.tcp://sparkyar...@longzhou-hdp3.lz.dscc:37187]
14/11/18 14:28:17 INFO ExecutorLauncher: ApplicationAttemptId: 
appattempt_1415961020140_0325_01
14/11/18 14:28:17 INFO ExecutorLauncher: Connecting to ResourceManager at 
longzhou-hdpnn.lz.dscc/192.168.19.107:12032
14/11/18 14:28:17 INFO ExecutorLauncher: Registering the ApplicationMaster
14/11/18 14:28:18 INFO ExecutorLauncher: Waiting for spark driver to be 
reachable.
14/11/18 14:28:18 INFO ExecutorLauncher: Master now available: 
192.168.59.90:36691
14/11/18 14:28:18 INFO ExecutorLauncher: Listen to driver: 
akka.tcp://spark@192.168.59.90:36691/user/CoarseGrainedScheduler
14/11/18 14:28:18 INFO ExecutorLauncher: Allocating 1 executors.
14/11/18 14:28:18 INFO YarnAllocationHandler: Allocating 1 executor containers 
with 1408 of memory each.
14/11/18 14:28:18 INFO YarnAllocationHandler: ResourceRequest (host : *, num 
containers: 1, priority = 1 , capability : memory: 1408)
14/11/18 14:28:18 INFO YarnAllocationHandler: Allocating 1 executor containers 
with 1408 of memory each.
14/11/18 14:28:18 INFO YarnAllocationHandler: ResourceRequest (host : *, num 
containers: 1, priority = 1 , capability : memory: 1408)
14/11/18 14:28:18 INFO RackResolver: Resolved longzhou-hdp3.lz.dscc to /rack1
14/11/18 14:28:18 INFO YarnAllocationHandler: launching container on 
container_1415961020140_0325_01_02 host longzhou-hdp3.lz.dscc
14/11/18 14:28:18 INFO ExecutorRunnable: Starting Executor Container
14/11/18 14:28:18 INFO ExecutorRunnable: Connecting to ContainerManager at 
longzhou-hdp3.lz.dscc:12040
14/11/18 14:28:18 INFO ExecutorRunnable: Setting up ContainerLaunchContext
14/11/18 14:28:18 INFO ExecutorRunnable: Preparing Local resources
14/11/18 14:28:18 INFO ExecutorLauncher: All executors have launched.
14/11/18 14:28:18 INFO ExecutorLauncher: Started progress reporter thread - 
sleep time : 5000
14/11/18 14:28:18 INFO YarnAllocationHandler: ResourceRequest (host : *, num 
containers: 0, priority = 1 , capability : memory: 1408)
14/11/18 14:28:18 INFO ExecutorRunnable: Prepared Local resources 
Map(__spark__.jar - resource {, scheme: hdfs, host: 
longzhou-hdpnn.lz.dscc, port: 11000, file: 
/user/linqili/.sparkStaging/application_1415961020140_0325/spark-assembly-1.0.2-hadoop2.0.0-cdh4.2.1.jar,
 }, size: 134859131, timestamp: 1416292093988, type: FILE, visibility: PRIVATE, 
)
14/11/18 14:28:18 INFO ExecutorRunnable: Setting up executor with commands: 
List($JAVA_HOME/bin/java, -server, -XX:OnOutOfMemoryError='kill %p', -Xms1024m 
-Xmx1024m , 
-Djava.security.krb5.conf=/home/linqili/proc/spark_client/hadoop/kerberos5-client/etc/krb5.conf
 
-Djava.library.path=/home/linqili/proc/spark_client/hadoop/lib/native/Linux-amd64-64,
 -Djava.io.tmpdir=$PWD/tmp,  
-Dlog4j.configuration=log4j-spark-container.properties, 
org.apache.spark.executor.CoarseGrainedExecutorBackend, 
akka.tcp://spark@192.168.59.90:36691/user/CoarseGrainedScheduler, 1, 
longzhou-hdp3.lz.dscc, 3, 1, LOG_DIR/stdout, 2, LOG_DIR/stderr)
14/11/18 14:28:23 INFO YarnAllocationHandler: ResourceRequest (host : *, num 
containers: 0, priority = 1 , capability : memory: 1408)
14/11/18 14:28:23 INFO YarnAllocationHandler: Completed container