Re: Update on Pig on Spark initiative
This is really exciting! Thanks so much for this work, I think you've guaranteed Pig's continued vitality. On Wednesday, August 27, 2014, Matei Zaharia matei.zaha...@gmail.com wrote: Awesome to hear this, Mayur! Thanks for putting this together. Matei On August 27, 2014 at 10:04:12 PM, Mayur Rustagi (mayur.rust...@gmail.com) wrote: Hi, We have migrated Pig functionality on top of Spark, passing 100% of the e2e success cases in the Pig test suite. That means UDFs, joins, and other functionality are working quite nicely. We are in the process of merging with Apache Pig trunk (something that should happen over the next 2 weeks). Meanwhile, if you are interested in giving it a go, you can try it at https://github.com/sigmoidanalytics/spork This contains all the major changes but may not have all the patches required for 100% e2e; if you are trying it out, let me know about any issues you face. A whole bunch of folks contributed to this: Julien Le Dem (Twitter), Praveen R (Sigmoid Analytics), Akhil Das (Sigmoid Analytics), Bill Graham (Twitter), Dmitriy Ryaboy (Twitter), Kamal Banga (Sigmoid Analytics), Anish Haldiya (Sigmoid Analytics), Aniket Mokashi (Google), Greg Owen (DataBricks), Amit Kumar Behera (Sigmoid Analytics), Mahesh Kalakoti (Sigmoid Analytics), not to mention the Spark and Pig communities. Regards Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi -- Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
Re: Submitting multiple files pyspark
Hi Cheng, You specify extra Python files through --py-files. For example: bin/spark-submit [your other options] --py-files helper.py main_app.py -Andrew 2014-08-27 22:58 GMT-07:00 Chengi Liu chengi.liu...@gmail.com: Hi, I have two files: main_app.py and helper.py. main_app.py calls some functions in helper.py. I want to use spark-submit to submit a job, but how do I specify helper.py? Basically, how do I specify multiple files in Spark? Thanks
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
Hi, I'm using a cluster with 5 nodes that each use 8 cores and 10GB of RAM. Basically I'm creating a dictionary from text, i.e. giving each word that occurs more than n times in all texts a unique identifier. The essential part of the code looks like this:

var texts = ctx.sql("SELECT text FROM table LIMIT 1500").map(_.head.toString).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
var dict2 = texts.flatMap(_.split(" ").map(_.toLowerCase())).repartition(80)
dict2 = dict2.filter(s => s.startsWith("http") == false)
dict2 = dict2.filter(s => s.startsWith("@") == false)
dict2 = dict2.map(removePunctuation(_)) // removes .,?!:; in strings (single words)
dict2 = dict2.groupBy(identity).filter(_._2.size > 10).keys // only keep entries that occur more than n times
var dict3 = dict2.zipWithIndex
var dictM = dict3.collect.toMap
var count = dictM.size

If I use only 10M texts, it works. With 15M texts as above I get the following error. It occurs after the dictM.size operation, but due to laziness there isn't any computing happening before that.

14/08/27 22:36:29 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
14/08/27 22:36:29 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 2028, idp11.foo.bar, PROCESS_LOCAL, 921 bytes)
14/08/27 22:36:29 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on idp11.foo.bar:36295 (size: 9.4 KB, free: 10.4 GB)
14/08/27 22:36:30 INFO spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 2 to sp...@idp11.foo.bar:33925
14/08/27 22:36:30 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 1263 bytes
14/08/27 22:37:06 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 2028, idp11.foo.bar): java.lang.OutOfMemoryError: Requested array size exceeds VM limit
java.util.Arrays.copyOf(Arrays.java:3230)
java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
...

I'm fine with spilling to disk if my program runs out of memory, but is there anything to prevent this error without changing the Java memory settings? (Assume those are at the physical maximum.) Kind regards, Simon
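A side note on the last two lines above, offered only as a sketch and under an assumption the thread does not confirm (namely that the collected map itself is not needed downstream): if only the size of the dictionary is wanted on the driver, it can be computed on the cluster without collecting the whole map.

    var dict3 = dict2.zipWithIndex   // RDD[(String, Long)], stays distributed on the executors
    val count = dict3.count()        // only a single Long is returned to the driver
    // dict3 can later be joined against other RDDs instead of being collected into a driver-side Map.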
Using unshaded akka in Spark driver
I am building (yet another) job server for Spark using the Play! framework, and it seems like Play's akka dependency conflicts with Spark's shaded akka dependency. Using SBT, I can force Play to use akka 2.2.3 (unshaded), but I haven't been able to figure out how to exclude the com.typesafe.akka dependencies altogether and introduce the shaded akka dependencies instead. I was wondering what can potentially go wrong if I use the unshaded version in the job server? The job server is responsible for creating the SparkContext/StreamingContext that connects to the Spark cluster (using the spark.master config), and these contexts are then provided to jobs that create RDDs/DStreams and perform computations on them. I am guessing that in this case the job server would act as a Spark driver.
Re: Visualizing stage task dependency graph
I'm working on this patch to visualize stages: https://github.com/apache/spark/pull/2077 Phuoc Do On Mon, Aug 4, 2014 at 10:12 PM, Zongheng Yang zonghen...@gmail.com wrote: I agree that this is definitely useful. One related project I know of is Sparkling [1] (also see the talk at Spark Summit 2014 [2]), but it'd be great (and I imagine somewhat challenging) to visualize the *physical execution* graph of a Spark job. [1] http://pr01.uml.edu/ [2] http://spark-summit.org/2014/talk/sparkling-identification-of-task-skew-and-speculative-partition-of-data-for-spark-applications On Mon, Aug 4, 2014 at 8:55 PM, rpandya r...@iecommerce.com wrote: Is there a way to visualize the task dependency graph of an application, during or after its execution? The list of stages on port 4040 is useful, but still quite limited. For example, I've found that if I don't cache() the result of one expensive computation, it will get repeated 4 times, but it is not easy to trace through exactly why. Ideally, what I would like for each stage is: - the individual tasks and their dependencies - the various RDD operators that have been applied - the full stack trace for the stage barrier, the task, and the lambdas used (often the RDDs are manipulated inside layers of code, so the immediate file/line# is not enough) Any suggestions? Thanks, Ravi -- Phuoc Do https://vida.io/dnprock
Key-Value Operations
Hi, I have an RDD of key-value pairs. Now I want to find the key whose value has the largest number of elements. How should I do that? Basically I want to select the key for which the number of items in its value is the largest. Thank You
Re: Spark Streaming: DStream - zipWithIndex
If you just want an arbitrary unique id attached to each record in a DStream (no ordering etc.), then why not generate and attach a UUID to each record? On Wed, Aug 27, 2014 at 4:18 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: I see an issue here. If rdd.id is 1000 then rdd.id * 1e9.toLong would be BIG. I wish there was a DStream mapPartitionsWithIndex. On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote: You can use the RDD id as the seed, which is unique in the same Spark context. Suppose none of the RDDs would contain more than 1 billion records. Then you can use rdd.zipWithUniqueId().mapValues(uid => rdd.id * 1e9.toLong + uid) Just a hack .. On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: So, I guess zipWithUniqueId will be similar. Is there a way to get a unique index? On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote: No. The indices start at 0 for every RDD. -Xiangrui On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Hello, If I do: DStream transform { rdd.zipWithIndex.map { ... } } Is the index guaranteed to be unique across all RDDs here? Thanks, -Soumitra.
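For reference, a minimal sketch of the UUID idea mentioned above (assuming a DStream[String] called lines; the names are illustrative only):

    import java.util.UUID

    // Attach a random, globally unique id to every record; no ordering guarantees.
    val withIds = lines.map(record => (UUID.randomUUID().toString, record))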
sbt package assembly run spark examples
hi guys, can someone explain or give a stupid user like me a link where I can get the right usage of sbt and Spark in order to run the examples as a stand-alone app? I got to the point of running the app by

sbt run path-to-the-data

but still get some error, because I probably didn't tell the app that the master is local (--master local). In the SparkContext method in the BinaryClassification.scala program it is set by

val conf = new SparkConf().setAppName(s"BinaryClassification with $params")

so... how to adapt the code? In the docs it is written

val sc = new SparkContext("local", "Simple App", YOUR_SPARK_HOME, List("target/scala-2.10/simple-project_2.10-1.0.jar"))

I got the following error:

sbt run ~/git/spark/data/mllib/sample_binary_classification_data.txt
[info] Set current project to Simple Project (in build file:/home/filip/spark-sample/)
[info] Running BinaryClassification ~/git/spark/data/mllib/sample_binary_classification_data.txt
[error] (run-main-0) org.apache.spark.SparkException: A master URL must be set in your configuration
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:166)
at BinaryClassification$.run(BinaryClassification.scala:107)
at BinaryClassification$$anonfun$main$1.apply(BinaryClassification.scala:99)
at BinaryClassification$$anonfun$main$1.apply(BinaryClassification.scala:98)
at scala.Option.map(Option.scala:145)
at BinaryClassification$.main(BinaryClassification.scala:98)
at BinaryClassification.main(BinaryClassification.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
[trace] Stack trace suppressed: run last compile:run for the full output.
java.lang.RuntimeException: Nonzero exit code: 1
at scala.sys.package$.error(package.scala:27)
[trace] Stack trace suppressed: run last compile:run for the full output.
[error] (compile:run) Nonzero exit code: 1
[error] Total time: 7 s, completed Aug 28, 2014 11:04:44 AM
Re: sbt package assembly run spark examples
got it when I read the class reference https://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.SparkConf

conf.setMaster("local[2]")

sets the master to local with 2 threads. but still get some warnings, and the result (see below) is also not right, i think

ps: by the way ... first i ran it like this

sbt run ~/git/spark/data/mllib/sample_binary_classification_data.txt

but the app didn't find the file, because it started at the local directory and pointed to /home/filip/spark-sample/~/git/spark/data/mllib/sample_binary_classification_data.txt any explanation???

ps2: i guess many of us from the user side have problems with scala, sbt and the class lib. has any of you a suggestion how i can overcome this? it is pretty time consuming trial and error :-(

14/08/28 11:47:37 INFO ui.SparkUI: Started SparkUI at http://filip-VirtualBox.localdomain:4040
14/08/28 11:47:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/08/28 11:47:41 WARN snappy.LoadSnappy: Snappy native library not loaded
Training: 84, test: 16.
14/08/28 11:47:47 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
14/08/28 11:47:47 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
Test areaUnderPR = 1.0.
Test areaUnderROC = 1.0.
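A minimal sketch of setting the master programmatically when launching via plain sbt run (the app name and thread count here are illustrative, not the exact code of the example):

    import org.apache.spark.{SparkConf, SparkContext}

    // Run locally with 2 worker threads instead of passing --master on the command line.
    val conf = new SparkConf()
      .setAppName("BinaryClassification")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)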
Spark SQL : how to find element where a field is in a given set
Hi all, What is the expression that I should use with the Spark SQL DSL if I need to retrieve data with a field in a given set? For example: I have the following schema case class Person(name: String, age: Int) and I need to do something like: personTable.where('name in Seq("foo", "bar")) ? Cheers. Jaonary
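Not an answer for the DSL form itself, but a minimal sketch of one workaround, under the assumption that the data is also available as a plain RDD of the case class (people here is an assumed RDD[Person], not something from the question):

    val wanted = Set("foo", "bar")
    // Plain RDD filter: keep only the people whose name is in the given set.
    val matches = people.filter(p => wanted.contains(p.name))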
Re: Trying to run SparkSQL over Spark Streaming
Thanks for the reply. Sorry I could not ask more earlier. Trying to use a parquet file is not working at all.

case class Rec(name: String, pv: Int)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
val d1 = sc.parallelize(Array(("a", 10), ("b", 3))).map(e => Rec(e._1, e._2))
d1.saveAsParquetFile("p1.parquet")
val d2 = sc.parallelize(Array(("a", 10), ("b", 3), ("c", 5))).map(e => Rec(e._1, e._2))
d2.saveAsParquetFile("p2.parquet")
val f1 = sqlContext.parquetFile("p1.parquet")
val f2 = sqlContext.parquetFile("p2.parquet")
f1.registerAsTable("logs")
f2.insertInto("logs")

is giving the error: java.lang.AssertionError: assertion failed: No plan for InsertIntoTable Map(), false same as the one that occurred while trying to insert from an RDD into a table. So I guess inserting into parquet tables is also not supported?
Re: How to join two PairRDD together?
Maybe you can refer to the sliding method of RDD, but right now it's an mllib-private method. Look at org.apache.spark.mllib.rdd.RDDFunctions. 2014-08-26 12:59 GMT+08:00 Vida Ha v...@databricks.com: Can you paste the code? It's unclear to me how/when the out of memory is occurring without seeing the code. On Sun, Aug 24, 2014 at 11:37 PM, Gefei Li gefeili.2...@gmail.com wrote: Hello everyone, I am transplanting a clustering algorithm to the Spark platform, and I have met a problem that has been confusing me for a long time, can someone help me? I have a PairRDD[Integer, Integer] named patternRDD, in which the key represents a number and the value stores information about the key. I want to use two of the VALUEs to calculate a Kendall number, and if the number is greater than 0.6, then output the two KEYs. I have tried to transform the PairRDD to an RDD[Tuple2[Integer, Integer]], add a common key zero to every element, join the two together to get a PairRDD[0, Iterable[Tuple2[Tuple2[key1, value1], Tuple2[key2, value2]]]], and then use the values() method and map the keys out, but it gives me an out of memory error. I think the out of memory error is caused by the few entries of my RDD, but I have no idea how to solve it. Can you help me? Regards, Gefei Li
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
Hi,triedmvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree dep.txtAttached the dep. txt for your information.[WARNING] [WARNING] Some problems were encountered while building the effective settings [WARNING] Unrecognised tag: 'mirrors' (position: START_TAG seen .../mirror\n --\n\n\n mirrors... @161:12) @ /opt/apache-maven-3.1.1/conf/settings.xml, line 161, column 12 [WARNING] [INFO] Scanning for projects... [INFO] [INFO] [INFO] Building Spark Project Examples 1.0.2 [INFO] [INFO] [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ spark-examples_2.10 --- [INFO] org.apache.spark:spark-examples_2.10:jar:1.0.2 [INFO] +- org.apache.spark:spark-core_2.10:jar:1.0.2:provided [INFO] | +- org.apache.hadoop:hadoop-client:jar:2.4.1:provided [INFO] | | +- org.apache.hadoop:hadoop-common:jar:2.4.1:provided [INFO] | | | \- org.apache.hadoop:hadoop-auth:jar:2.4.1:provided [INFO] | | +- org.apache.hadoop:hadoop-hdfs:jar:2.4.1:provided [INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.4.1:provided [INFO] | | | +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.4.1:provided [INFO] | | | | +- org.apache.hadoop:hadoop-yarn-client:jar:2.4.1:provided [INFO] | | | | | \- com.sun.jersey:jersey-client:jar:1.9:provided [INFO] | | | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.4.1:provided [INFO] | | | \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.4.1:provided [INFO] | | +- org.apache.hadoop:hadoop-yarn-api:jar:2.4.1:provided [INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.4.1:provided [INFO] | | | \- org.apache.hadoop:hadoop-yarn-common:jar:2.4.1:provided [INFO] | | +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.4.1:provided [INFO] | | \- org.apache.hadoop:hadoop-annotations:jar:2.4.1:provided [INFO] | +- net.java.dev.jets3t:jets3t:jar:0.9.0:runtime [INFO] | | +- org.apache.httpcomponents:httpclient:jar:4.1.2:compile [INFO] | | +- org.apache.httpcomponents:httpcore:jar:4.1.2:compile [INFO] | | \- com.jamesmurty.utils:java-xmlbuilder:jar:0.4:runtime [INFO] | +- org.apache.curator:curator-recipes:jar:2.4.0:provided [INFO] | | \- org.apache.curator:curator-framework:jar:2.4.0:provided [INFO] | | \- org.apache.curator:curator-client:jar:2.4.0:provided [INFO] | +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:provided [INFO] | | +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:provided [INFO] | | +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:provided [INFO] | | | +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:provided [INFO] | | | \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:provided [INFO] | | \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:provided [INFO] | | \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:provided [INFO] | |\- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:provided [INFO] | +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:provided [INFO] | +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile [INFO] | +- com.google.guava:guava:jar:14.0.1:compile [INFO] | +- org.apache.commons:commons-lang3:jar:3.3.2:provided [INFO] | +- com.google.code.findbugs:jsr305:jar:1.3.9:provided [INFO] | +- org.slf4j:slf4j-api:jar:1.7.5:compile [INFO] | +- org.slf4j:jul-to-slf4j:jar:1.7.5:provided [INFO] | +- org.slf4j:jcl-over-slf4j:jar:1.7.5:provided [INFO] | +- log4j:log4j:jar:1.2.17:compile [INFO] | +- org.slf4j:slf4j-log4j12:jar:1.7.5:compile [INFO] | +- 
com.ning:compress-lzf:jar:1.0.0:provided [INFO] | +- org.xerial.snappy:snappy-java:jar:1.0.5:compile [INFO] | +- com.twitter:chill_2.10:jar:0.3.6:provided [INFO] | | \- com.esotericsoftware.kryo:kryo:jar:2.21:provided [INFO] | | +- com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:provided [INFO] | | +- com.esotericsoftware.minlog:minlog:jar:1.2:provided [INFO] | | \- org.objenesis:objenesis:jar:1.2:provided [INFO] | +- com.twitter:chill-java:jar:0.3.6:provided [INFO] | +- commons-net:commons-net:jar:2.2:compile [INFO] | +- org.spark-project.akka:akka-remote_2.10:jar:2.2.3-shaded-protobuf:provided [INFO] | | +- org.spark-project.akka:akka-actor_2.10:jar:2.2.3-shaded-protobuf:compile [INFO] | | | \- com.typesafe:config:jar:1.0.2:compile [INFO] | | +- io.netty:netty:jar:3.6.6.Final:provided [INFO] | | +- org.spark-project.protobuf:protobuf-java:jar:2.4.1-shaded:compile [INFO] | | \- org.uncommons.maths:uncommons-maths:jar:1.2.2a:provided [INFO] | +- org.spark-project.akka:akka-slf4j_2.10:jar:2.2.3-shaded-protobuf:provided [INFO] | +- org.scala-lang:scala-library:jar:2.10.4:compile
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
I see 0.98.5 in dep.txt You should be good to go. On Thu, Aug 28, 2014 at 3:16 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, tried mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree dep.txt Attached the dep. txt for your information. Regards Arthur On 28 Aug, 2014, at 12:22 pm, Ted Yu yuzhih...@gmail.com wrote: I forgot to include '-Dhadoop.version=2.4.1' in the command below. The modified command passed. You can verify the dependence on hbase 0.98 through this command: mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree dep.txt Cheers On Wed, Aug 27, 2014 at 8:58 PM, Ted Yu yuzhih...@gmail.com wrote: Looks like the patch given by that URL only had the last commit. I have attached pom.xml for spark-1.0.2 to SPARK-1297 You can download it and replace examples/pom.xml with the downloaded pom I am running this command locally: mvn -Phbase-hadoop2,hadoop-2.4,yarn -DskipTests clean package Cheers On Wed, Aug 27, 2014 at 7:57 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, Thanks. Tried [patch -p1 -i 1893.patch](Hunk #1 FAILED at 45.) Is this normal? Regards Arthur patch -p1 -i 1893.patch patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej patching file examples/pom.xml Hunk #1 FAILED at 54. Hunk #2 FAILED at 72. Hunk #3 succeeded at 122 (offset -49 lines). 2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej patching file docs/building-with-maven.md patching file examples/pom.xml Hunk #1 succeeded at 122 (offset -40 lines). Hunk #2 succeeded at 195 (offset -40 lines). On 28 Aug, 2014, at 10:53 am, Ted Yu yuzhih...@gmail.com wrote: Can you use this command ? patch -p1 -i 1893.patch Cheers On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, I tried the following steps to apply the patch 1893 but got Hunk FAILED, can you please advise how to get thru this error? or is my spark-1.0.2 source not the correct one? Regards Arthur wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgz cd spark-1.0.2 wget https://github.com/apache/spark/pull/1893.patch patch 1893.patch patching file pom.xml Hunk #1 FAILED at 45. Hunk #2 FAILED at 110. 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej patching file pom.xml Hunk #1 FAILED at 54. Hunk #2 FAILED at 72. Hunk #3 FAILED at 171. 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej can't find file to patch at input line 267 Perhaps you should have used the -p or --strip option? 
The text leading up to this was: -- | |From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001 |From: tedyu yuzhih...@gmail.com |Date: Mon, 11 Aug 2014 15:57:46 -0700 |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add | description to building-with-maven.md | |--- | docs/building-with-maven.md | 3 +++ | 1 file changed, 3 insertions(+) | |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md |index 672d0ef..f8bcd2b 100644 |--- a/docs/building-with-maven.md |+++ b/docs/building-with-maven.md -- File to patch: On 28 Aug, 2014, at 10:24 am, Ted Yu yuzhih...@gmail.com wrote: You can get the patch from this URL: https://github.com/apache/spark/pull/1893.patch BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml Cheers On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, Thank you so much!! As I am new to Spark, can you please advise the steps about how to apply this patch to my spark-1.0.2 source folder? Regards Arthur On 28 Aug, 2014, at 10:13 am, Ted Yu yuzhih...@gmail.com wrote: See SPARK-1297 The pull request is here: https://github.com/apache/spark/pull/1893 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: (correction: Compilation Error: Spark 1.0.2 with HBase 0.98” , please ignore if duplicated) Hi, I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with HBase 0.98, My steps: wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgz cd spark-1.0.2 edit project/SparkBuild.scala, set HBASE_VERSION // HBase version; set as appropriate. val HBASE_VERSION = 0.98.2 edit pom.xml with following values hadoop.version2.4.1/hadoop.version protobuf.version2.5.0/protobuf.version yarn.version${hadoop.version}/yarn.version hbase.version0.98.5/hbase.version zookeeper.version3.4.6/zookeeper.version hive.version0.13.1/hive.version SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly but it
Spark-submit not running
The file is compiling properly, but when I try to run the jar file using spark-submit, it is giving some errors. I am running Spark locally and have downloaded a pre-built version of Spark named "For Hadoop 2 (HDP2, CDH5)". I don't know if it is a dependency problem, but I don't want to have Hadoop on my system. The error says: 14/08/28 12:34:36 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. Vineet
Re: Spark Streaming checkpoint recovery causes IO re-execution
Not sure if this makes sense, but maybe it would be nice to have a kind of flag available within the code that tells me if I'm running in a normal situation or during a recovery. To better explain this, let's consider the following scenario: I am processing data, let's say from a Kafka stream, and I am updating a database based on the computations. During the recovery I don't want to update the database again (for many reasons, let's just assume that), but I want my system to be in the same status as before; thus I would like to know if my code is running for the first time or during a recovery, so I can avoid updating the database again. More generally, I want to know this in case I'm interacting with external entities.
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
Hi, I tried to start Spark but failed: $ ./sbin/start-all.sh starting org.apache.spark.deploy.master.Master, logging to /mnt/hadoop/spark-1.0.2/sbin/../logs/spark-edhuser-org.apache.spark.deploy.master.Master-1-m133.out failed to launch org.apache.spark.deploy.master.Master: Failed to find Spark assembly in /mnt/hadoop/spark-1.0.2/assembly/target/scala-2.10/ $ ll assembly/ total 20 -rw-rw-r--. 1 hduser hadoop 11795 Jul 26 05:50 pom.xml -rw-rw-r--. 1 hduser hadoop 507 Jul 26 05:50 README drwxrwxr-x. 4 hduser hadoop 4096 Jul 26 05:50 src Regards Arthur On 28 Aug, 2014, at 6:19 pm, Ted Yu yuzhih...@gmail.com wrote: I see 0.98.5 in dep.txt You should be good to go. On Thu, Aug 28, 2014 at 3:16 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, tried mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree dep.txt Attached the dep. txt for your information. Regards Arthur On 28 Aug, 2014, at 12:22 pm, Ted Yu yuzhih...@gmail.com wrote: I forgot to include '-Dhadoop.version=2.4.1' in the command below. The modified command passed. You can verify the dependence on hbase 0.98 through this command: mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree dep.txt Cheers On Wed, Aug 27, 2014 at 8:58 PM, Ted Yu yuzhih...@gmail.com wrote: Looks like the patch given by that URL only had the last commit. I have attached pom.xml for spark-1.0.2 to SPARK-1297 You can download it and replace examples/pom.xml with the downloaded pom I am running this command locally: mvn -Phbase-hadoop2,hadoop-2.4,yarn -DskipTests clean package Cheers On Wed, Aug 27, 2014 at 7:57 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, Thanks. Tried [patch -p1 -i 1893.patch](Hunk #1 FAILED at 45.) Is this normal? Regards Arthur patch -p1 -i 1893.patch patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej patching file examples/pom.xml Hunk #1 FAILED at 54. Hunk #2 FAILED at 72. Hunk #3 succeeded at 122 (offset -49 lines). 2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej patching file docs/building-with-maven.md patching file examples/pom.xml Hunk #1 succeeded at 122 (offset -40 lines). Hunk #2 succeeded at 195 (offset -40 lines). On 28 Aug, 2014, at 10:53 am, Ted Yu yuzhih...@gmail.com wrote: Can you use this command ? patch -p1 -i 1893.patch Cheers On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, I tried the following steps to apply the patch 1893 but got Hunk FAILED, can you please advise how to get thru this error? or is my spark-1.0.2 source not the correct one? Regards Arthur wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgz cd spark-1.0.2 wget https://github.com/apache/spark/pull/1893.patch patch 1893.patch patching file pom.xml Hunk #1 FAILED at 45. Hunk #2 FAILED at 110. 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej patching file pom.xml Hunk #1 FAILED at 54. Hunk #2 FAILED at 72. Hunk #3 FAILED at 171. 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej can't find file to patch at input line 267 Perhaps you should have used the -p or --strip option? 
The text leading up to this was: -- | |From cd58437897bf02b644c2171404ccffae5d12a2be Mon Sep 17 00:00:00 2001 |From: tedyu yuzhih...@gmail.com |Date: Mon, 11 Aug 2014 15:57:46 -0700 |Subject: [PATCH 3/4] SPARK-1297 Upgrade HBase dependency to 0.98 - add | description to building-with-maven.md | |--- | docs/building-with-maven.md | 3 +++ | 1 file changed, 3 insertions(+) | |diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md |index 672d0ef..f8bcd2b 100644 |--- a/docs/building-with-maven.md |+++ b/docs/building-with-maven.md -- File to patch: On 28 Aug, 2014, at 10:24 am, Ted Yu yuzhih...@gmail.com wrote: You can get the patch from this URL: https://github.com/apache/spark/pull/1893.patch BTW 0.98.5 has been released - you can specify 0.98.5-hadoop2 in the pom.xml Cheers On Wed, Aug 27, 2014 at 7:18 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, Thank you so much!! As I am new to Spark, can you please advise the steps about how to apply this patch to my spark-1.0.2 source folder? Regards Arthur On 28 Aug, 2014, at 10:13 am, Ted Yu yuzhih...@gmail.com wrote: See SPARK-1297 The pull request is here: https://github.com/apache/spark/pull/1893 On Wed, Aug 27, 2014 at 6:57 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: (correction: Compilation Error: Spark 1.0.2 with HBase 0.98” , please ignore if duplicated)
how to filter value in spark
fileA = 1 2 3 4, one number per line, saved in /sparktest/1/
fileB = 3 4 5 6, one number per line, saved in /sparktest/2/
I want to get 3 and 4.

var a = sc.textFile("/sparktest/1/").map((_, 1))
var b = sc.textFile("/sparktest/2/").map((_, 1))
a.filter(param => b.lookup(param._1).length > 0).map(_._1).foreach(println)

Error thrown: scala.MatchError: null, PairRDDFunctions.lookup...
Re: Spark-submit not running
You need to set HADOOP_HOME. Is Spark officially supposed to work on Windows or not at this stage? I know the build doesn't quite yet. On Thu, Aug 28, 2014 at 11:37 AM, Hingorani, Vineet vineet.hingor...@sap.com wrote: The file is compiling properly but when I try to run the jar file using spark-submit, it is giving some errors. I am running spark locally and have downloaded a pre-built version of Spark named “For Hadoop 2 (HDP2, CDH5)”. AI don’t know if it is a dependency problem but I don’t want to have Hadoop in my system. The error says: 14/08/28 12:34:36 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. Vineet
Re: how to filter value in spark
On 08/28/2014 07:20 AM, marylucy wrote: fileA = 1 2 3 4, one number per line, saved in /sparktest/1/ fileB = 3 4 5 6, one number per line, saved in /sparktest/2/ I want to get 3 and 4 var a = sc.textFile("/sparktest/1/").map((_, 1)) var b = sc.textFile("/sparktest/2/").map((_, 1)) a.filter(param => b.lookup(param._1).length > 0).map(_._1).foreach(println) Error thrown: scala.MatchError: null, PairRDDFunctions.lookup... The issue is nesting of the b RDD inside a transformation of the a RDD. Consider using intersection, it's more idiomatic: a.intersection(b).foreach(println) but note that intersection will remove duplicates. best, matt
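A minimal sketch of that suggestion, assuming the goal is just the values common to both text files (paths as in the question):

    val a = sc.textFile("/sparktest/1/")
    val b = sc.textFile("/sparktest/2/")
    // Elements appearing in both files; intersection also removes duplicates.
    a.intersection(b).foreach(println)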
how to specify columns in groupby
Hi all, I have an RDD which has values in the format id,date,cost. I want to group the elements based on the id and date columns and get the sum of the cost for each group. Can somebody tell me how to do this? Thanks Regards, Meethu M
Re: How to join two PairRDD together?
It sounds like you are adding the same key to every element, and joining, in order to accomplish a full cartesian join? I can imagine doing it that way would blow up somewhere. There is a cartesian() method to do this maybe more efficiently. However if your data set is large, this sort of algorithm for computing Kendall's tau is going to be very slow since it's N^2 and would create an unspeakably large shuffle. There are faster algorithms for this statistic. Also consider sampling your data and computing the join over a small sample to estimate the statistic. On Thu, Aug 28, 2014 at 11:15 AM, Yanbo Liang yanboha...@gmail.com wrote: Maybe you can refer sliding method of RDD, but it's right now mllib private method. Look at org.apache.spark.mllib.rdd.RDDFunctions. 2014-08-26 12:59 GMT+08:00 Vida Ha v...@databricks.com: Can you paste the code? It's unclear to me how/when the out of memory is occurring without seeing the code. On Sun, Aug 24, 2014 at 11:37 PM, Gefei Li gefeili.2...@gmail.com wrote: Hello everyone, I am transplanting a clustering algorithm to spark platform, and I meet a problem confusing me for a long time, can someone help me? I have a PairRDDInteger, Integer named patternRDD, which the key represents a number and the value stores an information of the key. And I want to use two of the VALUEs to calculate a kendall number, and if the number is greater than 0.6, then output the two KEYs. I have tried to transform the PairRDD to a RDDTuple2Integer, Integer, and add a common key zero to them, and join two together then get a PairRDD0, IterableTuple2Tuple2key1, value1, Tuple2key2, value2, and tried to use values() method and map the keys out, but it gives me an out of memory error. I think the out of memory error is caused by the few entries of my RDD, but I have no idea how to solve it. Can you help me? Regards, Gefei Li
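A minimal sketch of the cartesian() suggestion above; kendallTau is a placeholder for whatever correlation function is actually used, and the 0.6 threshold is the one from the original question:

    // patternRDD: RDD[(Int, V)] where V is whatever type holds the per-key information.
    val keyPairs = patternRDD.cartesian(patternRDD)
      .filter { case ((k1, _), (k2, _)) => k1 < k2 }                  // skip self-pairs and mirrored duplicates
      .filter { case ((_, v1), (_, v2)) => kendallTau(v1, v2) > 0.6 } // keep strongly correlated pairs
      .map { case ((k1, _), (k2, _)) => (k1, k2) }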
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
0.98.2 is not an HBase version, but 0.98.2-hadoop2 is: http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.hbase%22%20AND%20a%3A%22hbase%22 On Thu, Aug 28, 2014 at 2:54 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I need to use Spark with HBase 0.98 and tried to compile Spark 1.0.2 with HBase 0.98, My steps: wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgz cd spark-1.0.2 edit project/SparkBuild.scala, set HBASE_VERSION // HBase version; set as appropriate. val HBASE_VERSION = 0.98.2 edit pom.xml with following values hadoop.version2.4.1/hadoop.version protobuf.version2.5.0/protobuf.version yarn.version${hadoop.version}/yarn.version hbase.version0.98.5/hbase.version zookeeper.version3.4.6/zookeeper.version hive.version0.13.1/hive.version SPARK_HADOOP_VERSION=2.4.1 SPARK_YARN=true sbt/sbt clean assembly but it fails because of UNRESOLVED DEPENDENCIES hbase;0.98.2 Can you please advise how to compile Spark 1.0.2 with HBase 0.98? or should I set HBASE_VERSION back to “0.94.6? Regards Arthur [warn] :: [warn] :: UNRESOLVED DEPENDENCIES :: [warn] :: [warn] :: org.apache.hbase#hbase;0.98.2: not found [warn] :: sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: not found at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217) at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126) at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125) at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116) at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:116) at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:104) at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:51) at sbt.IvySbt$$anon$3.call(Ivy.scala:60) at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98) at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81) at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102) at xsbt.boot.Using$.withResource(Using.scala:11) at xsbt.boot.Using$.apply(Using.scala:10) at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62) at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52) at xsbt.boot.Locks$.apply0(Locks.scala:31) at xsbt.boot.Locks$.apply(Locks.scala:28) at sbt.IvySbt.withDefaultLogger(Ivy.scala:60) at sbt.IvySbt.withIvy(Ivy.scala:101) at sbt.IvySbt.withIvy(Ivy.scala:97) at sbt.IvySbt$Module.withModule(Ivy.scala:116) at sbt.IvyActions$.update(IvyActions.scala:125) at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1170) at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1168) at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1191) at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$73.apply(Defaults.scala:1189) at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35) at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1193) at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1188) at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45) at sbt.Classpaths$.cachedUpdate(Defaults.scala:1196) at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1161) at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1139) at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47) at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42) at sbt.std.Transform$$anon$4.work(System.scala:64) at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237) at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237) at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18) at sbt.Execute.work(Execute.scala:244) at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237) at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237) at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:160) at sbt.CompletionService$$anon$2.call(CompletionService.scala:30) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) [error] (examples/*:update) sbt.ResolveException: unresolved dependency: org.apache.hbase#hbase;0.98.2: not found [error] Total time: 270 s, completed Aug 28, 2014 9:42:05 AM
Re: how to specify columns in groupby
For your reference:

val d1 = textFile.map(line => {
  val fileds = line.split(",")
  ((fileds(0), fileds(1)), fileds(2).toDouble)
})
val d2 = d1.reduceByKey(_ + _)
d2.foreach(println)

2014-08-28 20:04 GMT+08:00 MEETHU MATHEW meethu2...@yahoo.co.in: Hi all, I have an RDD which has values in the format id,date,cost. I want to group the elements based on the id and date columns and get the sum of the cost for each group. Can somebody tell me how to do this? Thanks Regards, Meethu M
Re: Key-Value Operations
If you mean your values are all a Seq or similar already, then you just take the top 1 ordered by the size of the value: rdd.top(1)(Ordering.by(_._2.size)) On Thu, Aug 28, 2014 at 9:34 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I have a RDD of key-value pairs. Now I want to find the key for which the values has the largest number of elements. How should I do that? Basically I want to select the key for which the number of items in values is the largest. Thank You
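A small usage sketch of that approach (the sample data here is made up purely for illustration):

    val rdd = sc.parallelize(Seq(("a", Seq(1, 2)), ("b", Seq(1, 2, 3)), ("c", Seq(1))))
    // Take the single pair whose value has the most elements, then keep just its key.
    val biggestKey = rdd.top(1)(Ordering.by(_._2.size)).head._1   // "b"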
RE: Spark-submit not running
How can I set HADOOP_HOME if I am running the Spark on my local machine without anything else? Do I have to install some other pre-built file? I am on Windows 7 and Spark’s official site says that it is available on Windows, I added Java path in the PATH variable. Vineet From: Sean Owen [mailto:so...@cloudera.com] Sent: Donnerstag, 28. August 2014 13:49 To: Hingorani, Vineet Cc: user@spark.apache.org Subject: Re: Spark-submit not running You need to set HADOOP_HOME. Is Spark officially supposed to work on Windows or not at this stage? I know the build doesn't quite yet. On Thu, Aug 28, 2014 at 11:37 AM, Hingorani, Vineet vineet.hingor...@sap.commailto:vineet.hingor...@sap.com wrote: The file is compiling properly but when I try to run the jar file using spark-submit, it is giving some errors. I am running spark locally and have downloaded a pre-built version of Spark named “For Hadoop 2 (HDP2, CDH5)”. AI don’t know if it is a dependency problem but I don’t want to have Hadoop in my system. The error says: 14/08/28 12:34:36 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. Vineet
Re: Spark-submit not running
Can you copy the exact spark-submit command that you are running? You should be able to run it locally without installing hadoop. Here is an example on how to run the job locally. # Run application locally on 8 cores ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master local[8] \ /path/to/examples.jar \ 100 On Aug 28, 2014, at 7:54 AM, Hingorani, Vineet vineet.hingor...@sap.com wrote: How can I set HADOOP_HOME if I am running the Spark on my local machine without anything else? Do I have to install some other pre-built file? I am on Windows 7 and Spark’s official site says that it is available on Windows, I added Java path in the PATH variable. Vineet From: Sean Owen [mailto:so...@cloudera.com] Sent: Donnerstag, 28. August 2014 13:49 To: Hingorani, Vineet Cc: user@spark.apache.org Subject: Re: Spark-submit not running You need to set HADOOP_HOME. Is Spark officially supposed to work on Windows or not at this stage? I know the build doesn't quite yet. On Thu, Aug 28, 2014 at 11:37 AM, Hingorani, Vineet vineet.hingor...@sap.com wrote: The file is compiling properly but when I try to run the jar file using spark-submit, it is giving some errors. I am running spark locally and have downloaded a pre-built version of Spark named “For Hadoop 2 (HDP2, CDH5)”.AI don’t know if it is a dependency problem but I don’t want to have Hadoop in my system. The error says: 14/08/28 12:34:36 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. Vineet
Re: Spark-submit not running
Yes, but I think at the moment there is still a dependency on Hadoop even when not using it. See https://issues.apache.org/jira/browse/SPARK-2356 On Thu, Aug 28, 2014 at 2:14 PM, Guru Medasani gdm...@outlook.com wrote: Can you copy the exact spark-submit command that you are running? You should be able to run it locally without installing hadoop. Here is an example on how to run the job locally. # Run application locally on 8 cores ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master local[8] \ /path/to/examples.jar \ 100
SPARK on YARN, containers fails
Hi there, I'm trying to run the JavaSparkPi example on YARN with master = yarn-client, but I have a problem. It runs smoothly when submitting the application, and the first container for the Application Master works too. When the job is starting and there are some tasks to do, I'm getting this warning on the console (I'm using the Windows cmd if this makes any difference): WARN cluster.YarnClientClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory When I check the logs for the container with the Application Master, it is launching containers for executors properly, then goes with: INFO YarnAllocationHandler: Completed container container_1409217202587_0003_01_02 (state: COMPLETE, exit status: 1) INFO YarnAllocationHandler: Container marked as failed: container_1409217202587_0003_01_02 And tries to re-launch them. In the failed container log there is only this: Error: Could not find or load main class pwd..sp...@gbv06758291.my.secret.address.net:63680.user.CoarseGrainedScheduler Any hints? I run YARN on 1 machine with 16GB RAM and 2 cores, so it's definitely not a case of propagation of the spark assembly jar, because without it the first container with the Application Master would fail too.
Re: Spark Streaming checkpoint recovery causes IO re-execution
Can you clarify the scenario:

val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint(checkpointDirectory)
val stream = KafkaUtils.createStream(...)
val wordCounts = lines.flatMap(_.split(" ")).map(x => (x, 1L))
val wordDstream = wordCounts.updateStateByKey[Int](updateFunc)
wordDstream.foreachRDD( rdd => rdd.foreachPartition( /* sync state to external store */ ) )

My impression is that during recovery from checkpoint, your wordDstream would be in the state that it was before the crash + 1 batch interval forward when you get to the foreachRDD part -- even if re-creating the pre-crash RDD is really slow. So if your driver goes down at 10:20 and you restart at 10:30, I thought at the time of the DB write wordDstream would have exactly the state of 10:20 + 10 seconds' worth of aggregated stream data? I don't really understand what you mean by "Upon metadata checkpoint recovery (before the data checkpoint occurs)", but it sounds like you're observing the same DB write happening twice? I don't have any advice for you, but I am interested in understanding better what happens in the recovery scenario, so just trying to clarify what you observe. On Thu, Aug 28, 2014 at 6:42 AM, GADV giulio_devec...@yahoo.com wrote: Not sure if this makes sense, but maybe it would be nice to have a kind of flag available within the code that tells me if I'm running in a normal situation or during a recovery. To better explain this, let's consider the following scenario: I am processing data, let's say from a Kafka stream, and I am updating a database based on the computations. During the recovery I don't want to update the database again (for many reasons, let's just assume that), but I want my system to be in the same status as before, thus I would like to know if my code is running for the first time or during a recovery so I can avoid updating the database again. More generally I want to know this in case I'm interacting with external entities.
Re: Compilation Error: Spark 1.0.2 with HBase 0.98
I didn't see that problem. Did you run this command ? mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests clean package Here is what I got: TYus-MacBook-Pro:spark-1.0.2 tyu$ sbin/start-all.sh starting org.apache.spark.deploy.master.Master, logging to /Users/tyu/spark-1.0.2/sbin/../logs/spark-tyu-org.apache.spark.deploy.master.Master-1-TYus-MacBook-Pro.local.out localhost: ssh: connect to host localhost port 22: Connection refused TYus-MacBook-Pro:spark-1.0.2 tyu$ vi logs/spark-tyu-org.apache.spark.deploy.master.Master-1-TYus-MacBook-Pro.local.out TYus-MacBook-Pro:spark-1.0.2 tyu$ jps 11563 Master 11635 Jps TYus-MacBook-Pro:spark-1.0.2 tyu$ ps aux | grep 11563 tyu 11563 0.7 0.8 196 142444 s003 S 6:52AM 0:02.72 /Library/Java/JavaVirtualMachines/jdk1.7.0_60.jdk/Contents/Home/bin/java -cp ::/Users/tyu/spark-1.0.2/conf:/Users/tyu/spark-1.0.2/assembly/target/scala-2.10/spark-assembly-1.0.2-hadoop2.4.1.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip TYus-MacBook-Pro.local --port 7077 --webui-port 8080 TYus-MacBook-Pro:spark-1.0.2 tyu$ ls -l assembly/target/scala-2.10/spark-assembly-1.0.2-hadoop2.4.1.jar -rw-r--r-- 1 tyu staff 121182305 Aug 27 21:13 assembly/target/scala-2.10/spark-assembly-1.0.2-hadoop2.4.1.jar Cheers On Thu, Aug 28, 2014 at 3:42 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I tried to start Spark but failed: $ ./sbin/start-all.sh starting org.apache.spark.deploy.master.Master, logging to /mnt/hadoop/spark-1.0.2/sbin/../logs/spark-edhuser-org.apache.spark.deploy.master.Master-1-m133.out failed to launch org.apache.spark.deploy.master.Master: Failed to find Spark assembly in /mnt/hadoop/spark-1.0.2/assembly/target/scala-2.10/ $ ll assembly/ total 20 -rw-rw-r--. 1 hduser hadoop 11795 Jul 26 05:50 pom.xml -rw-rw-r--. 1 hduser hadoop 507 Jul 26 05:50 README drwxrwxr-x. 4 hduser hadoop 4096 Jul 26 05:50 *src* Regards Arthur On 28 Aug, 2014, at 6:19 pm, Ted Yu yuzhih...@gmail.com wrote: I see 0.98.5 in dep.txt You should be good to go. On Thu, Aug 28, 2014 at 3:16 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, tried mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree dep.txt Attached the dep. txt for your information. Regards Arthur On 28 Aug, 2014, at 12:22 pm, Ted Yu yuzhih...@gmail.com wrote: I forgot to include '-Dhadoop.version=2.4.1' in the command below. The modified command passed. You can verify the dependence on hbase 0.98 through this command: mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree dep.txt Cheers On Wed, Aug 27, 2014 at 8:58 PM, Ted Yu yuzhih...@gmail.com wrote: Looks like the patch given by that URL only had the last commit. I have attached pom.xml for spark-1.0.2 to SPARK-1297 You can download it and replace examples/pom.xml with the downloaded pom I am running this command locally: mvn -Phbase-hadoop2,hadoop-2.4,yarn -DskipTests clean package Cheers On Wed, Aug 27, 2014 at 7:57 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, Thanks. Tried [patch -p1 -i 1893.patch](Hunk #1 FAILED at 45.) Is this normal? Regards Arthur patch -p1 -i 1893.patch patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej patching file examples/pom.xml Hunk #1 FAILED at 54. Hunk #2 FAILED at 72. Hunk #3 succeeded at 122 (offset -49 lines). 
2 out of 3 hunks FAILED -- saving rejects to file examples/pom.xml.rej patching file docs/building-with-maven.md patching file examples/pom.xml Hunk #1 succeeded at 122 (offset -40 lines). Hunk #2 succeeded at 195 (offset -40 lines). On 28 Aug, 2014, at 10:53 am, Ted Yu yuzhih...@gmail.com wrote: Can you use this command ? patch -p1 -i 1893.patch Cheers On Wed, Aug 27, 2014 at 7:41 PM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi Ted, I tried the following steps to apply the patch 1893 but got Hunk FAILED, can you please advise how to get thru this error? or is my spark-1.0.2 source not the correct one? Regards Arthur wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2.tgz tar -vxf spark-1.0.2.tgz cd spark-1.0.2 wget https://github.com/apache/spark/pull/1893.patch patch 1893.patch patching file pom.xml Hunk #1 FAILED at 45. Hunk #2 FAILED at 110. 2 out of 2 hunks FAILED -- saving rejects to file pom.xml.rej patching file pom.xml Hunk #1 FAILED at 54. Hunk #2 FAILED at 72. Hunk #3 FAILED at 171. 3 out of 3 hunks FAILED -- saving rejects to file pom.xml.rej can't find file to patch at input line 267 Perhaps you should have used the -p or --strip option? The text leading up to this was: -- | |From
repartitioning an RDD yielding imbalance
I've got an RDD where each element is a long string (a whole document). I'm using pyspark, so some of the handy partition-handling functions aren't available, and I count the number of elements in each partition with:

def count_partitions(id, iterator):
    c = sum(1 for _ in iterator)
    yield (id, c)

rdd.mapPartitionsWithSplit(count_partitions).collectAsMap()

This returns the following: {0: 866, 1: 1158, 2: 828, 3: 876} But if I do: rdd.repartition(8).mapPartitionsWithSplit(count_partitions).collectAsMap() I get {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 3594, 7: 134} Why this strange redistribution of elements? I'm obviously misunderstanding how Spark does the partitioning -- is it a problem with having a list of strings as an RDD? Help very much appreciated! Thanks, Rok
Re: Graphx: undirected graph support
A bit like the analogy of a linked list versus a doubly linked list: it might introduce overhead in terms of memory usage, but you could use two directed edges to stand in for an undirected edge.
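A minimal sketch of that idea in GraphX, assuming the undirected edges are available as an RDD of vertex-id pairs (the sample edges and the default vertex attribute are illustrative only):

    import org.apache.spark.graphx.{Graph, VertexId}
    import org.apache.spark.rdd.RDD

    val undirectedEdges: RDD[(VertexId, VertexId)] = sc.parallelize(Seq((1L, 2L), (2L, 3L)))
    // Materialize each undirected edge as two directed edges, one per direction.
    val bothDirections = undirectedEdges.union(undirectedEdges.map { case (a, b) => (b, a) })
    val graph = Graph.fromEdgeTuples(bothDirections, 1)   // 1 is the default vertex attribute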
Re: Spark-submit not running
Thanks Sean. Looks like there is a workaround as per the JIRA https://issues.apache.org/jira/browse/SPARK-2356 . http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7. Maybe that's worth a shot? On Aug 28, 2014, at 8:15 AM, Sean Owen so...@cloudera.com wrote: Yes, but I think at the moment there is still a dependency on Hadoop even when not using it. See https://issues.apache.org/jira/browse/SPARK-2356 On Thu, Aug 28, 2014 at 2:14 PM, Guru Medasani gdm...@outlook.com wrote: Can you copy the exact spark-submit command that you are running? You should be able to run it locally without installing hadoop. Here is an example on how to run the job locally. # Run application locally on 8 cores ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master local[8] \ /path/to/examples.jar \ 100
RE: Spark-submit not running
Thank you Sean and Guru for giving the information. Btw I have to put this line, but where should I add it? In my Scala file, below where the other ‘import …’ lines are written? System.setProperty("hadoop.home.dir", "d:\\winutil\\") Thank you Vineet From: Sean Owen [mailto:so...@cloudera.com] Sent: Donnerstag, 28. August 2014 13:49 To: Hingorani, Vineet Cc: user@spark.apache.org Subject: Re: Spark-submit not running You need to set HADOOP_HOME. Is Spark officially supposed to work on Windows or not at this stage? I know the build doesn't quite yet. On Thu, Aug 28, 2014 at 11:37 AM, Hingorani, Vineet vineet.hingor...@sap.com wrote: The file is compiling properly but when I try to run the jar file using spark-submit, it is giving some errors. I am running Spark locally and have downloaded a pre-built version of Spark named “For Hadoop 2 (HDP2, CDH5)”. I don’t know if it is a dependency problem but I don’t want to have Hadoop on my system. The error says: 14/08/28 12:34:36 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries. Vineet
Re: Spark-submit not running
You should set this as early as possible in your program, before other code runs. On Thu, Aug 28, 2014 at 3:27 PM, Hingorani, Vineet vineet.hingor...@sap.com wrote: Thank you Sean and Guru for giving the information. Btw I have to put this line, but where should I add it? In my scala file below other ‘import …’ lines are written? System.setProperty(hadoop.home.dir, d:\\winutil\\) - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: Spark-submit not running
The following error is given when I try to add this line: [info] Set current project to Simple Project (in build file:/C:/Users/D062844/Desktop/Hand sOnSpark/Install/spark-1.0.2-bin-hadoop2/) [info] Compiling 1 Scala source to C:\Users\D062844\Desktop\HandsOnSpark\Install\spark-1.0 .2-bin-hadoop2\target\scala-2.10\classes... [error] C:\Users\D062844\Desktop\HandsOnSpark\Install\spark-1.0.2-bin-hadoop2\src\main\sca la\SimpleApp.scala:1: expected class or object definition [error] System.setProperty(hadoop.home.dir, c:\\winutil\\) [error] ^ [error] one error found [error] (compile:compile) Compilation failed Vineet -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Donnerstag, 28. August 2014 16:30 To: Hingorani, Vineet Cc: user@spark.apache.org Subject: Re: Spark-submit not running You should set this as early as possible in your program, before other code runs. On Thu, Aug 28, 2014 at 3:27 PM, Hingorani, Vineet vineet.hingor...@sap.com wrote: Thank you Sean and Guru for giving the information. Btw I have to put this line, but where should I add it? In my scala file below other ‘import …’ lines are written? System.setProperty(hadoop.home.dir, d:\\winutil\\)
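For reference: the compile error above comes from placing the statement at the top level of the source file, and in Scala, statements have to live inside a class or object. A minimal sketch of one possible placement, assuming a simple standalone app; the winutils path is a placeholder, not something verified in this thread:

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // Runs before any Spark/Hadoop code touches the Hadoop libraries.
    // The path is a placeholder for wherever winutils.exe lives locally.
    System.setProperty("hadoop.home.dir", "d:\\winutil\\")

    val sc = new SparkContext(new SparkConf().setAppName("Simple Project"))
    // ... application logic ...
    sc.stop()
  }
}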
Change delimiter when collecting SchemaRDD
Hi All, Is there any way to change the delimiter from being a comma? Some of the strings in my data contain commas as well, making it very difficult to parse the results. Yadid
Print to spark log
Hi all, Just wondering if there is a way to use logging to print some additional info to the Spark logs (similar to debug in Scalding). Thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Print-to-spark-log-tp13035.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: CUDA in spark, especially in MLlib?
Breeze author David also has a github project on cuda binding in scalado you prefer using java or scala ? On Aug 27, 2014 2:05 PM, Frank van Lankvelt f.vanlankv...@onehippo.com wrote: you could try looking at ScalaCL[1], it's targeting OpenCL rather than CUDA, but that might be close enough? cheers, Frank 1. https://github.com/ochafik/ScalaCL On Wed, Aug 27, 2014 at 7:33 PM, Wei Tan w...@us.ibm.com wrote: Thank you all. Actually I was looking at JCUDA. Function wise this may be a perfect solution to offload computation to GPU. Will see how performance it will be, especially with the Java binding. Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J. Watson Research Center *http://researcher.ibm.com/person/us-wtan* http://researcher.ibm.com/person/us-wtan From:Chen He airb...@gmail.com To:Antonio Jesus Navarro ajnava...@stratio.com, Cc:Matei Zaharia matei.zaha...@gmail.com, user@spark.apache.org, Wei Tan/Watson/IBM@IBMUS Date:08/27/2014 11:03 AM Subject:Re: CUDA in spark, especially in MLlib? -- JCUDA can let you do that in Java *http://www.jcuda.org* http://www.jcuda.org/ On Wed, Aug 27, 2014 at 1:48 AM, Antonio Jesus Navarro *ajnava...@stratio.com* ajnava...@stratio.com wrote: Maybe this would interest you: CPU and GPU-accelerated Machine Learning Library: *https://github.com/BIDData/BIDMach* https://github.com/BIDData/BIDMach 2014-08-27 4:08 GMT+02:00 Matei Zaharia *matei.zaha...@gmail.com* matei.zaha...@gmail.com: You should try to find a Java-based library, then you can call it from Scala. Matei On August 26, 2014 at 6:58:11 PM, Wei Tan (*w...@us.ibm.com* w...@us.ibm.com) wrote: Hi I am trying to find a CUDA library in Scala, to see if some matrix manipulation in MLlib can be sped up. I googled a few but found no active projects on Scala+CUDA. Python is supported by CUDA though. Any suggestion on whether this idea makes any sense? Best regards, Wei -- Amsterdam - Oosteinde 11, 1017 WT Amsterdam Boston - 1 Broadway, Cambridge, MA 02142 US +1 877 414 4776 (toll free) Europe +31(0)20 522 4466 www.onehippo.com
SPARK-1297 patch error (spark-1297-v4.txt )
Hi, I have just tried to apply the patch of SPARK-1297: https://issues.apache.org/jira/browse/SPARK-1297 There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt respectively. When applying the 2nd one, I got Hunk #1 FAILED at 45 Can you please advise how to fix it in order to make the compilation of Spark Project Examples success? (Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2) Regards Arthur patch -p1 -i spark-1297-v4.txt patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej below is the content of examples/pom.xml.rej: +++ examples/pom.xml @@ -45,6 +45,39 @@ /dependency /dependencies /profile +profile + idhbase-hadoop2/id + activation +property + namehbase.profile/name + valuehadoop2/value +/property + /activation + properties +protobuf.version2.5.0/protobuf.version +hbase.version0.98.4-hadoop2/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile +profile + idhbase-hadoop1/id + activation +property + name!hbase.profile/name +/property + /activation + properties +hbase.version0.98.4-hadoop1/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile + /profiles dependencies This caused the related compilation failed: [INFO] Spark Project Examples FAILURE [0.102s]
Re: SPARK-1297 patch error (spark-1297-v4.txt )
I attached patch v5 which corresponds to the pull request. Please try again. On Thu, Aug 28, 2014 at 9:50 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I have just tried to apply the patch of SPARK-1297: https://issues.apache.org/jira/browse/SPARK-1297 There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt respectively. When applying the 2nd one, I got Hunk #1 FAILED at 45 Can you please advise how to fix it in order to make the compilation of Spark Project Examples success? (Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2) Regards Arthur patch -p1 -i spark-1297-v4.txt patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej below is the content of examples/pom.xml.rej: +++ examples/pom.xml @@ -45,6 +45,39 @@ /dependency /dependencies /profile +profile + idhbase-hadoop2/id + activation +property + namehbase.profile/name + valuehadoop2/value +/property + /activation + properties +protobuf.version2.5.0/protobuf.version +hbase.version0.98.4-hadoop2/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile +profile + idhbase-hadoop1/id + activation +property + name!hbase.profile/name +/property + /activation + properties +hbase.version0.98.4-hadoop1/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile + /profiles dependencies This caused the related compilation failed: [INFO] Spark Project Examples FAILURE [0.102s]
Re: SPARK-1297 patch error (spark-1297-v4.txt )
Hi, patch -p1 -i spark-1297-v5.txt can't find file to patch at input line 5 Perhaps you used the wrong -p or --strip option? The text leading up to this was: -- |diff --git docs/building-with-maven.md docs/building-with-maven.md |index 672d0ef..f8bcd2b 100644 |--- docs/building-with-maven.md |+++ docs/building-with-maven.md -- File to patch: Please advise Regards Arthur On 29 Aug, 2014, at 12:50 am, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I have just tried to apply the patch of SPARK-1297: https://issues.apache.org/jira/browse/SPARK-1297 There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt respectively. When applying the 2nd one, I got Hunk #1 FAILED at 45 Can you please advise how to fix it in order to make the compilation of Spark Project Examples success? (Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2) Regards Arthur patch -p1 -i spark-1297-v4.txt patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej below is the content of examples/pom.xml.rej: +++ examples/pom.xml @@ -45,6 +45,39 @@ /dependency /dependencies /profile +profile + idhbase-hadoop2/id + activation +property + namehbase.profile/name + valuehadoop2/value +/property + /activation + properties +protobuf.version2.5.0/protobuf.version +hbase.version0.98.4-hadoop2/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile +profile + idhbase-hadoop1/id + activation +property + name!hbase.profile/name +/property + /activation + properties +hbase.version0.98.4-hadoop1/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile + /profiles dependencies This caused the related compilation failed: [INFO] Spark Project Examples FAILURE [0.102s]
New SparkR mailing list, JIRA
Hi I'd like to announce a couple of updates to the SparkR project. In order to facilitate better collaboration for new features and development we have a new mailing list and issue tracker for SparkR. - The new JIRA is hosted at https://sparkr.atlassian.net/browse/SPARKR/ and we have migrated all existing Github issues to the JIRA. Please submit any bugs / improvements to this JIRA going forward. - There is a new mailing list sparkr-...@googlegroups.com that will be used for design discussions for new features and development related issues. We will still be answering user issues on the Apache Spark mailing lists. Please let me know if you have any questions. Thanks Shivaram
Converting a DStream's RDDs to SchemaRDDs
Hi Folks, I'd like to find out tips on how to convert the RDDs inside a Spark Streaming DStream to a set of SchemaRDDs. My DStream contains JSON data pushed over from Kafka, and I'd like to use SparkSQL's JSON import function (i.e. jsonRDD) to register the JSON dataset as a table, and perform queries on it. Here's a code snippet of my latest attempt (in Scala): … val sc = new SparkContext(conf) val ssc = new StreamingContext("local", this.getClass.getName, Seconds(1)) ssc.checkpoint("checkpoint") val stream = KafkaUtils.createStream(ssc, "localhost:2181", "group", Map("topic" -> 10)).map(_._2) val sql = new SQLContext(sc) stream.foreachRDD(rdd => { if (rdd.count > 0) { // message received val sqlRDD = sql.jsonRDD(rdd) sqlRDD.printSchema() } else { println("No message received") } }) … This compiles and runs when I submit it to Spark (local-mode); however, I never seem to be able to successfully see a schema printed on my console, via the "sqlRDD.printSchema()" method when Kafka is streaming my JSON messages to the "topic" topic name. I know my JSON is valid and my Kafka connection works fine, I've been able to print the stream messages in their raw format, just not as SchemaRDDs. Any tips? Suggestions? Thanks much, --- Rishi Verma NASA Jet Propulsion Laboratory California Institute of Technology - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark webUI - application details page
Hi All, @Andrew Thanks for the tips. I just built the master branch of Spark last night, but am still having problems viewing history through the standalone UI. I dug into the Spark job events directories as you suggested, and I see at a minimum 'SPARK_VERSION_1.0.0' and 'EVENT_LOG_1'; for applications that call 'sc.stop()' I also see 'APPLICATION_COMPLETE'. The version and application complete files are empty; the event log file contains the information one would need to repopulate the web UI. The follow may be helpful in debugging this: -Each job directory (e.g. '/tmp/spark-events/testhistoryjob-1409246088110') and the files within are owned by the user who ran the job with permissions 770. This prevents the 'spark' user from accessing the contents. -When I make a directory and contents accessible to the spark user, the history server (invoked as 'sbin/start-history-server.sh /tmp/spark-events') is able to display the history, but the standalone web UI still produces the following error: 'No event logs found for application HappyFunTimes in file:///tmp/spark-events/testhistoryjob-1409246088110. Did you specify the correct logging directory?' -Incase it matters, I'm running pyspark. Do you know what may be causing this? When you attempt to reproduce locally, who do you observe owns the files in /tmp/spark-events? best, -Brad On Tue, Aug 26, 2014 at 8:51 AM, SK skrishna...@gmail.com wrote: I have already tried setting the history server and accessing it on master-url:18080 as per the link. But the page does not list any completed applications. As I mentioned in my previous mail, I am running Spark in standalone mode on the cluster (as well as on my local machine). According to the link, it appears that the history server is required only in mesos or yarn mode, not in standalone mode. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p12834.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark Streaming checkpoint recovery causes IO re-execution
Hi Yana, The fact is that the DB writing happens at the node level and not at the Spark level. One of the benefits of Spark's distributed computing nature is enabling IO distribution as well. For example, it is much faster to have the nodes write to Cassandra instead of having everything collected at the driver and sending the writes from there. The problem is with the node computations, which get redone upon recovery. If these lambda functions send events to other systems, those events would get resent upon re-computation, causing overall system instability. Hope this helps you understand the problem. tnks, Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-checkpoint-recovery-causes-IO-re-execution-tp12568p13043.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SPARK-1297 patch error (spark-1297-v4.txt )
bq. Spark 1.0.2 For the above release, you can download pom.xml attached to the JIRA and place it in examples directory I verified that the build against 0.98.4 worked using this command: mvn -Dhbase.profile=hadoop2 -Phadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests clean package Patch v5 is @ level 0 - you don't need to use -p1 in the patch command. Cheers On Thu, Aug 28, 2014 at 9:50 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I have just tried to apply the patch of SPARK-1297: https://issues.apache.org/jira/browse/SPARK-1297 There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt respectively. When applying the 2nd one, I got Hunk #1 FAILED at 45 Can you please advise how to fix it in order to make the compilation of Spark Project Examples success? (Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2) Regards Arthur patch -p1 -i spark-1297-v4.txt patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej below is the content of examples/pom.xml.rej: +++ examples/pom.xml @@ -45,6 +45,39 @@ /dependency /dependencies /profile +profile + idhbase-hadoop2/id + activation +property + namehbase.profile/name + valuehadoop2/value +/property + /activation + properties +protobuf.version2.5.0/protobuf.version +hbase.version0.98.4-hadoop2/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile +profile + idhbase-hadoop1/id + activation +property + name!hbase.profile/name +/property + /activation + properties +hbase.version0.98.4-hadoop1/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile + /profiles dependencies This caused the related compilation failed: [INFO] Spark Project Examples FAILURE [0.102s]
Re: spark.files.userClassPathFirst=true Not Working Correctly
Sorry for the extremely late reply. It turns out that the same error occurred when running on yarn. However, I recently updated my project to depend on cdh5 and the issue I was having disappeared and I am no longer setting the userClassPathFirst to true. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-files-userClassPathFirst-true-Not-Working-Correctly-tp11917p13045.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Low Level Kafka Consumer for Spark
@bharat- overall, i've noticed a lot of confusion about how Spark Streaming scales - as well as how it handles failover and checkpointing, but we can discuss that separately. there's actually 2 dimensions to scaling here: receiving and processing. *Receiving* receiving can be scaled out by submitting new DStreams/Receivers to the cluster as i've done in the Kinesis example. in fact, i purposely chose to submit multiple receivers in my Kinesis example because i feel it should be the norm and not the exception - particularly for partitioned and checkpoint-capable streaming systems like Kafka and Kinesis. it's the only way to scale. a side note here is that each receiver running in the cluster will immediately replicates to 1 other node for fault-tolerance of that specific receiver. this is where the confusion lies. this 2-node replication is mainly for failover in case the receiver dies while data is in flight. there's still chance for data loss as there's no write ahead log on the hot path, but this is being addressed. this in mentioned in the docs here: https://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving *Processing* once data is received, tasks are scheduled across the Spark cluster just like any other non-streaming task where you can specify the number of partitions for reduces, etc. this is the part of scaling that is sometimes overlooked - probably because it works just like regular Spark, but it is worth highlighting. Here's a blurb in the docs: https://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-processing the other thing that's confusing with Spark Streaming is that in Scala, you need to explicitly import org.apache.spark.streaming.StreamingContext.toPairDStreamFunctions in order to pick up the implicits that allow DStream.reduceByKey and such (versus DStream.transform(rddBatch = rddBatch.reduceByKey()) in other words, DStreams appear to be relatively featureless until you discover this implicit. otherwise, you need to operate on the underlying RDD's explicitly which is not ideal. the Kinesis example referenced earlier in the thread uses the DStream implicits. side note to all of this - i've recently convinced my publisher for my upcoming book, Spark In Action, to let me jump ahead and write the Spark Streaming chapter ahead of other more well-understood libraries. early release is in a month or so. sign up @ http://sparkinaction.com if you wanna get notified. shameless plug that i wouldn't otherwise do, but i really think it will help clear a lot of confusion in this area as i hear these questions asked a lot in my talks and such. and i think a clear, crisp story on scaling and fault-tolerance will help Spark Streaming's adoption. hope that helps! -chris On Wed, Aug 27, 2014 at 6:32 PM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: I agree. This issue should be fixed in Spark rather rely on replay of Kafka messages. Dib On Aug 28, 2014 6:45 AM, RodrigoB rodrigo.boav...@aspect.com wrote: Dibyendu, Tnks for getting back. I believe you are absolutely right. We were under the assumption that the raw data was being computed again and that's not happening after further tests. This applies to Kafka as well. The issue is of major priority fortunately. 
Regarding your suggestion, I would maybe prefer to have the problem resolved within Spark's internals since once the data is replicated we should be able to access it once more and not having to pool it back again from Kafka or any other stream that is being affected by this issue. If for example there is a big amount of batches to be recomputed I would rather have them done distributed than overloading the batch interval with huge amount of Kafka messages. I do not have yet enough know how on where is the issue and about the internal Spark code so I can't really how much difficult will be the implementation. tnks, Rod -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p12966.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
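For reference, a minimal sketch of the implicit-import point Chris mentions above, assuming a Spark 1.x Scala application; the socket source and five-second batch are placeholders, not taken from the thread:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._                // pair-RDD implicits (pre-1.3 style)
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // toPairDStreamFunctions and friends

object ImplicitsSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("ImplicitsSketch"), Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999) // placeholder source

    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1L))

    // With the StreamingContext implicits in scope, DStream[(K, V)] gets reduceByKey directly.
    val counts = pairs.reduceByKey(_ + _)

    // The equivalent without relying on the DStream implicits: drop to the batch RDDs.
    val countsViaTransform = pairs.transform(rdd => rdd.reduceByKey(_ + _))

    counts.print()
    countsViaTransform.print()
    ssc.start()
    ssc.awaitTermination()
  }
}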
Re: SPARK-1297 patch error (spark-1297-v4.txt )
Hi Ted, I downloaded pom.xml to examples directory. It works, thanks!! Regards Arthur [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [2.119s] [INFO] Spark Project Core SUCCESS [1:27.100s] [INFO] Spark Project Bagel ... SUCCESS [10.261s] [INFO] Spark Project GraphX .. SUCCESS [31.332s] [INFO] Spark Project ML Library .. SUCCESS [35.226s] [INFO] Spark Project Streaming ... SUCCESS [39.135s] [INFO] Spark Project Tools ... SUCCESS [6.469s] [INFO] Spark Project Catalyst SUCCESS [36.521s] [INFO] Spark Project SQL . SUCCESS [35.488s] [INFO] Spark Project Hive SUCCESS [35.296s] [INFO] Spark Project REPL SUCCESS [18.668s] [INFO] Spark Project YARN Parent POM . SUCCESS [0.583s] [INFO] Spark Project YARN Stable API . SUCCESS [15.989s] [INFO] Spark Project Assembly SUCCESS [11.497s] [INFO] Spark Project External Twitter SUCCESS [8.777s] [INFO] Spark Project External Kafka .. SUCCESS [9.688s] [INFO] Spark Project External Flume .. SUCCESS [10.411s] [INFO] Spark Project External ZeroMQ . SUCCESS [9.511s] [INFO] Spark Project External MQTT ... SUCCESS [8.451s] [INFO] Spark Project Examples SUCCESS [1:40.240s] [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 8:33.350s [INFO] Finished at: Fri Aug 29 01:58:00 HKT 2014 [INFO] Final Memory: 82M/1086M [INFO] On 29 Aug, 2014, at 1:36 am, Ted Yu yuzhih...@gmail.com wrote: bq. Spark 1.0.2 For the above release, you can download pom.xml attached to the JIRA and place it in examples directory I verified that the build against 0.98.4 worked using this command: mvn -Dhbase.profile=hadoop2 -Phadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests clean package Patch v5 is @ level 0 - you don't need to use -p1 in the patch command. Cheers On Thu, Aug 28, 2014 at 9:50 AM, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I have just tried to apply the patch of SPARK-1297: https://issues.apache.org/jira/browse/SPARK-1297 There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt respectively. When applying the 2nd one, I got Hunk #1 FAILED at 45 Can you please advise how to fix it in order to make the compilation of Spark Project Examples success? (Here: Hadoop 2.4.1, HBase 0.98.5, Spark 1.0.2) Regards Arthur patch -p1 -i spark-1297-v4.txt patching file examples/pom.xml Hunk #1 FAILED at 45. Hunk #2 succeeded at 94 (offset -16 lines). 1 out of 2 hunks FAILED -- saving rejects to file examples/pom.xml.rej below is the content of examples/pom.xml.rej: +++ examples/pom.xml @@ -45,6 +45,39 @@ /dependency /dependencies /profile +profile + idhbase-hadoop2/id + activation +property + namehbase.profile/name + valuehadoop2/value +/property + /activation + properties +protobuf.version2.5.0/protobuf.version +hbase.version0.98.4-hadoop2/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile +profile + idhbase-hadoop1/id + activation +property + name!hbase.profile/name +/property + /activation + properties +hbase.version0.98.4-hadoop1/hbase.version + /properties + dependencyManagement +dependencies +/dependencies + /dependencyManagement +/profile + /profiles dependencies This caused the related compilation failed: [INFO] Spark Project Examples FAILURE [0.102s]
org.apache.hadoop.io.compress.SnappyCodec not found
Hi, I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and HBase. With default setting in Spark 1.0.2, when trying to load a file I got Class org.apache.hadoop.io.compress.SnappyCodec not found Can you please advise how to enable snappy in Spark? Regards Arthur scala inFILE.first() java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.RDD.take(RDD.scala:983) at org.apache.spark.rdd.RDD.first(RDD.scala:1015) at $iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24) at init(console:26) at .init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at 
java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:106) ... 55 more Caused by: java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.SnappyCodec not found. at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135) at
Re: CUDA in spark, especially in MLlib?
Thank you Debasish. I am fine with either Scala or Java. I would like to get a quick evaluation on the performance gain, e.g., ALS on GPU. I would like to try whichever library does the business :) Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J. Watson Research Center http://researcher.ibm.com/person/us-wtan From: Debasish Das debasish.da...@gmail.com To: Frank van Lankvelt f.vanlankv...@onehippo.com, Cc: Matei Zaharia matei.zaha...@gmail.com, user user@spark.apache.org, Antonio Jesus Navarro ajnava...@stratio.com, Chen He airb...@gmail.com, Wei Tan/Watson/IBM@IBMUS Date: 08/28/2014 12:20 PM Subject:Re: CUDA in spark, especially in MLlib? Breeze author David also has a github project on cuda binding in scalado you prefer using java or scala ? On Aug 27, 2014 2:05 PM, Frank van Lankvelt f.vanlankv...@onehippo.com wrote: you could try looking at ScalaCL[1], it's targeting OpenCL rather than CUDA, but that might be close enough? cheers, Frank 1. https://github.com/ochafik/ScalaCL On Wed, Aug 27, 2014 at 7:33 PM, Wei Tan w...@us.ibm.com wrote: Thank you all. Actually I was looking at JCUDA. Function wise this may be a perfect solution to offload computation to GPU. Will see how performance it will be, especially with the Java binding. Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J. Watson Research Center http://researcher.ibm.com/person/us-wtan From:Chen He airb...@gmail.com To:Antonio Jesus Navarro ajnava...@stratio.com, Cc:Matei Zaharia matei.zaha...@gmail.com, user@spark.apache.org, Wei Tan/Watson/IBM@IBMUS Date:08/27/2014 11:03 AM Subject:Re: CUDA in spark, especially in MLlib? JCUDA can let you do that in Java http://www.jcuda.org On Wed, Aug 27, 2014 at 1:48 AM, Antonio Jesus Navarro ajnava...@stratio.com wrote: Maybe this would interest you: CPU and GPU-accelerated Machine Learning Library: https://github.com/BIDData/BIDMach 2014-08-27 4:08 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com: You should try to find a Java-based library, then you can call it from Scala. Matei On August 26, 2014 at 6:58:11 PM, Wei Tan (w...@us.ibm.com) wrote: Hi I am trying to find a CUDA library in Scala, to see if some matrix manipulation in MLlib can be sped up. I googled a few but found no active projects on Scala+CUDA. Python is supported by CUDA though. Any suggestion on whether this idea makes any sense? Best regards, Wei -- Amsterdam - Oosteinde 11, 1017 WT Amsterdam Boston - 1 Broadway, Cambridge, MA 02142 US +1 877 414 4776 (toll free) Europe +31(0)20 522 4466 www.onehippo.com
Re: org.apache.hadoop.io.compress.SnappyCodec not found
Hi, my check native result: hadoop checknative 14/08/29 02:54:51 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version 14/08/29 02:54:51 INFO zlib.ZlibFactory: Successfully loaded initialized native-zlib library Native library checking: hadoop: true /mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64/libhadoop.so zlib: true /lib64/libz.so.1 snappy: true /mnt/hadoop/hadoop-2.4.1_snappy/lib/native/Linux-amd64-64/libsnappy.so.1 lz4:true revision:99 bzip2: false Any idea how to enable or disable snappy in Spark? Regards Arthur On 29 Aug, 2014, at 2:39 am, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and HBase. With default setting in Spark 1.0.2, when trying to load a file I got Class org.apache.hadoop.io.compress.SnappyCodec not found Can you please advise how to enable snappy in Spark? Regards Arthur scala inFILE.first() java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.RDD.take(RDD.scala:983) at org.apache.spark.rdd.RDD.first(RDD.scala:1015) at $iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24) at init(console:26) at .init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936) at 
org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.reflect.InvocationTargetException at
org.apache.spark.examples.xxx
hey guys i still try to get used to compile and run the example code why does the run_example code submit the class with an org.apache.spark.examples in front of the class itself? probably a stupid question but i would be glad some one of you explains by the way.. how was the spark...example...jar file build? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-spark-examples-xxx-tp13052.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Print to spark log
I'm not sure if this is the case, but basic monitoring is described here: https://spark.apache.org/docs/latest/monitoring.html If it comes to something more sophisticated I was for example able to save some messages into local logs and view them in YARN UI via http by editing spark source code (use logInfo() method inside where needed) and build assembly again with maven. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Print-to-spark-log-tp13035p13053.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
What happens if I have a function like a PairFunction but which might return multiple values
In many cases when I work with MapReduce, my mapper might take a single value and map it to multiple keys, and the reducer might also take a single key and emit multiple values. I don't think functions like flatMap and reduceByKey will work here - or are there tricks I am not aware of?
Re: Spark webUI - application details page
I was able to recently solve this problem for standalone mode. For this mode, I did not use a history server. Instead, I set spark.eventLog.dir (in conf/spark-defaults.conf) to a directory in hdfs (basically this directory should be in a place that is writable by the master and accessible globally to all the nodes). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-webUI-application-details-page-tp3490p13055.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
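For reference, a small sketch of setting the same event-log properties programmatically instead of in conf/spark-defaults.conf; the HDFS directory is a placeholder and, as noted above, needs to be writable by the application and readable by the master (or history server):

import org.apache.spark.{SparkConf, SparkContext}

object EventLogSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("EventLogSketch")
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///shared/spark-events") // placeholder path
    val sc = new SparkContext(conf)
    // ... job ...
    sc.stop() // a clean stop is what writes the APPLICATION_COMPLETE marker
  }
}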
Re: OutofMemoryError when generating output
Hi, Thanks for the response. I tried to use countByKey. But I am not able to write the output to console or to a file. Neither collect() nor saveAsTextFile() work for the Map object that is generated after countByKey(). val x = sc.textFile(baseFile).map { line => val fields = line.split("\t") (fields(11), fields(6)) // extract (month, user_id) }.distinct().countByKey() x.saveAsTextFile(...) // does not work; generates an error that saveAsTextFile is not defined for a Map object Is there a way to convert the Map object to an object that I can output to console and to a file? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/OutofMemoryError-when-generating-output-tp12847p13056.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
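No reply appears in this thread, but for reference, one possible sketch: countByKey() returns a plain Scala Map on the driver rather than an RDD, which is why saveAsTextFile is not defined on it. This assumes the map fits comfortably in driver memory and reuses x and sc from the snippet above; file names and paths are placeholders:

import java.io.PrintWriter

// Print to the console from the driver.
x.foreach { case (month, count) => println(s"$month\t$count") }

// Write a local file from the driver.
val out = new PrintWriter("month_counts.tsv")
try x.foreach { case (month, count) => out.println(s"$month\t$count") }
finally out.close()

// Or turn the map back into an RDD if output on HDFS is needed.
sc.parallelize(x.toSeq).saveAsTextFile("month_counts_out")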
Re: Print to spark log
Thanks for the reply. I was looking for something for the case when the code runs outside of the Spark framework itself - if I declare a SparkContext or an RDD, can that print some messages in the log? The problem I have is that if I print something from the Scala object that runs the Spark app, it does not print to the console for some reason. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Print-to-spark-log-tp13035p13057.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
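For reference, a small sketch of one way to emit extra messages from application code using log4j directly (which Spark's own logging goes through), without modifying Spark's source; the logger name and messages are placeholders, and where the output lands depends on the log4j.properties in use:

import org.apache.log4j.Logger
import org.apache.spark.{SparkConf, SparkContext}

object LoggingSketch {
  val log = Logger.getLogger("myapp.LoggingSketch") // placeholder logger name

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LoggingSketch"))
    log.info("driver-side message: starting job") // lands in the driver's log output

    val total = sc.parallelize(1 to 100).map { x =>
      // Messages logged inside closures run on executors, so they show up in the
      // executor logs (web UI / YARN), not on the driver's console.
      if (x % 50 == 0) Logger.getLogger("myapp.LoggingSketch").info(s"processing $x")
      x
    }.reduce(_ + _)

    log.info(s"driver-side message: total = $total")
    sc.stop()
  }
}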
Re: Spark Streaming: DStream - zipWithIndex
Yes, that is an option. I started with a function of batch time, and index to generate id as long. This may be faster than generating UUID, with added benefit of sorting based on time. - Original Message - From: Tathagata Das tathagata.das1...@gmail.com To: Soumitra Kumar kumar.soumi...@gmail.com Cc: Xiangrui Meng men...@gmail.com, user@spark.apache.org Sent: Thursday, August 28, 2014 2:19:38 AM Subject: Re: Spark Streaming: DStream - zipWithIndex If just want arbitrary unique id attached to each record in a dstream (no ordering etc), then why not create generate and attach an UUID to each record? On Wed, Aug 27, 2014 at 4:18 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: I see a issue here. If rdd.id is 1000 then rdd.id * 1e9.toLong would be BIG. I wish there was DStream mapPartitionsWithIndex. On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote: You can use RDD id as the seed, which is unique in the same spark context. Suppose none of the RDDs would contain more than 1 billion records. Then you can use rdd.zipWithUniqueId().mapValues(uid = rdd.id * 1e9.toLong + uid) Just a hack .. On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: So, I guess zipWithUniqueId will be similar. Is there a way to get unique index? On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote: No. The indices start at 0 for every RDD. -Xiangrui On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Hello, If I do: DStream transform { rdd.zipWithIndex.map { Is the index guaranteed to be unique across all RDDs here? } } Thanks, -Soumitra. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark Streaming: DStream - zipWithIndex
But then if you want to generate ids that are unique across ALL the records that you are going to see in a stream (which can be potentially infinite), then you definitely need a number space larger than long :) TD On Thu, Aug 28, 2014 at 12:48 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Yes, that is an option. I started with a function of batch time, and index to generate id as long. This may be faster than generating UUID, with added benefit of sorting based on time. - Original Message - From: Tathagata Das tathagata.das1...@gmail.com To: Soumitra Kumar kumar.soumi...@gmail.com Cc: Xiangrui Meng men...@gmail.com, user@spark.apache.org Sent: Thursday, August 28, 2014 2:19:38 AM Subject: Re: Spark Streaming: DStream - zipWithIndex If just want arbitrary unique id attached to each record in a dstream (no ordering etc), then why not create generate and attach an UUID to each record? On Wed, Aug 27, 2014 at 4:18 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: I see a issue here. If rdd.id is 1000 then rdd.id * 1e9.toLong would be BIG. I wish there was DStream mapPartitionsWithIndex. On Wed, Aug 27, 2014 at 3:04 PM, Xiangrui Meng men...@gmail.com wrote: You can use RDD id as the seed, which is unique in the same spark context. Suppose none of the RDDs would contain more than 1 billion records. Then you can use rdd.zipWithUniqueId().mapValues(uid = rdd.id * 1e9.toLong + uid) Just a hack .. On Wed, Aug 27, 2014 at 2:59 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: So, I guess zipWithUniqueId will be similar. Is there a way to get unique index? On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng men...@gmail.com wrote: No. The indices start at 0 for every RDD. -Xiangrui On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: Hello, If I do: DStream transform { rdd.zipWithIndex.map { Is the index guaranteed to be unique across all RDDs here? } } Thanks, -Soumitra.
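For reference, a rough sketch of the batch-time-plus-index idea discussed above. It packs the batch time into the high bits of a Long and a per-batch index into the low 20 bits, so ids sort by time; the 20-bit field is an assumption that the ids zipWithUniqueId assigns within a single batch stay below roughly one million, and as noted above a Long cannot cover a truly unbounded stream without some such bound:

import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Ids are unique as long as the per-batch ids stay below 2^20 (~1M).
def withTimeBasedIds[T: ClassTag](stream: DStream[T]): DStream[(Long, T)] =
  stream.transform { (rdd: RDD[T], time: Time) =>
    rdd.zipWithUniqueId().map { case (record, idx) =>
      ((time.milliseconds << 20) | (idx & 0xFFFFFL), record)
    }
  }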
Q on downloading spark for standalone cluster
Hello there, I've a basic question on the downloadthat which option I need to downloadfor standalone cluster. I've a private cluster of three machineson Centos. When I click on download it shows me following: Download Spark The latest release is Spark 1.0.2, released August 5, 2014 (release notes) http://spark.apache.org/releases/spark-release-1-0-2.html (git tag) https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f Pre-built packages: * For Hadoop 1 (HDP1, CDH3): find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/spark-1.0.2-bin-hadoop1.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop1.tgz * For CDH4: find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/spark-1.0.2-bin-cdh4.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-cdh4.tgz * For Hadoop 2 (HDP2, CDH5): find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz Pre-built packages, third-party (NOTE: may include non ASF-compatible licenses): * For MapRv3: direct file download (external) http://package.mapr.com/tools/apache-spark/1.0.2/spark-1.0.2-bin-mapr3.tgz * For MapRv4: direct file download (external) http://package.mapr.com/tools/apache-spark/1.0.2/spark-1.0.2-bin-mapr4.tgz From the above it looks like that I've to donwload Hadoop or CDH4 first in order to use Spark ? I've a standalone cluster and my data size is also like hundreds of Gig or close to Terabyte. I don't get it that which one I need to download from the above list. Could some one assist me that which one I need to download for standalone cluster and for big data foot print ? or Hadoop is needed or mandatory for using Spark? that's not the understanding I've. My understanding is that you can use spark with Hadoop if you like from yarn2 but you could use spark standalone also without hadoop. Please assist. I'm confused ! -Sanjeev - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Converting a DStream's RDDs to SchemaRDDs
Try using local[n] with n 1, instead of local. Since receivers take up 1 slot, and local is basically 1 slot, there is no slot left to process the data. That's why nothing gets printed. TD On Thu, Aug 28, 2014 at 10:28 AM, Verma, Rishi (398J) rishi.ve...@jpl.nasa.gov wrote: Hi Folks, I’d like to find out tips on how to convert the RDDs inside a Spark Streaming DStream to a set of SchemaRDDs. My DStream contains JSON data pushed over from Kafka, and I’d like to use SparkSQL’s JSON import function (i.e. jsonRDD) to register the JSON dataset as a table, and perform queries on it. Here’s a code snippet of my latest attempt (in Scala): … val sc = new SparkContext(conf) val ssc = new StreamingContext(local, this.getClass.getName, Seconds(1)) ssc.checkpoint(checkpoint) val stream = KafkaUtils.createStream(ssc, localhost:2181, “group, Map(“topic - 10)).map(_._2) val sql = new SQLContext(sc) stream.foreachRDD(rdd = { if (rdd.count 0) { // message received val sqlRDD = sql.jsonRDD(rdd) sqlRDD.printSchema() } else { println(“No message received) } }) … This compiles and runs when I submit it to Spark (local-mode); however, I never seem to be able to successfully see a schema printed on my console, via the “sqlRDD.printSchema()” method when Kafka is streaming my JSON messages to the “topic” topic name. I know my JSON is valid and my Kafka connection works fine, I’ve been able to print the stream messages in their raw format, just not as SchemaRDDs. Any tips? Suggestions? Thanks much, --- Rishi Verma NASA Jet Propulsion Laboratory California Institute of Technology - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
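For reference, the one-line change to the snippet above that this suggestion implies, reusing the surrounding code from Rishi's example; "local[2]" is just the minimum, one slot for the Kafka receiver and at least one for processing the batches:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// At least 2 local slots: the receiver occupies one, batch processing needs the rest.
val ssc = new StreamingContext("local[2]", this.getClass.getName, Seconds(1))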
Re: repartitioning an RDD yielding imbalance
On Thu, Aug 28, 2014 at 7:00 AM, Rok Roskar rokros...@gmail.com wrote: I've got an RDD where each element is a long string (a whole document). I'm using pyspark so some of the handy partition-handling functions aren't available, and I count the number of elements in each partition with: def count_partitions(id, iterator): c = sum(1 for _ in iterator) yield (id,c) rdd.mapPartitionsWithSplit(count_partitions).collectAsMap() This returns the following: {0: 866, 1: 1158, 2: 828, 3: 876} But if I do: rdd.repartition(8).mapPartitionsWithSplit(count_partitions).collectAsMap() I get {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 3594, 7: 134} Why this strange redistribution of elements? I'm obviously misunderstanding how spark does the partitioning -- is it a problem with having a list of strings as an RDD? This imbalance was introduce by BatchedDeserializer. By default, Python elements in RDD are serialized by pickle in batch (1024 elements in one batch), so in the view of Scala, it only see one or two element of Array[Byte] in the RDD, then imbalance happened. To fix this, you could change the default batchSize to 10 (or less) or reserialize your RDD as in unbatched mode, for example: sc = SparkContext(batchSize=10) rdd = sc.textFile().repartition(8) OR rdd._reserialize(PickleSerializer()).repartition(8) PS: _reserialize() is not an public API, so it may be changed in the future. Davies Help vey much appreciated! Thanks, Rok - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Q on downloading spark for standalone cluster
If you aren't using Hadoop, I don't think it matters which you download. I'd probably just grab the Hadoop 2 package. Out of curiosity, what are you using as your data store? I get the impression most Spark users are using HDFS or something built on top. On Thu, Aug 28, 2014 at 4:07 PM, Sanjeev Sagar sanjeev.sa...@mypointscorp.com wrote: Hello there, I've a basic question on the downloadthat which option I need to downloadfor standalone cluster. I've a private cluster of three machineson Centos. When I click on download it shows me following: Download Spark The latest release is Spark 1.0.2, released August 5, 2014 (release notes) http://spark.apache.org/releases/spark-release-1-0-2.html (git tag) https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h= 8fb6f00e195fb258f3f70f04756e07c259a2351f Pre-built packages: * For Hadoop 1 (HDP1, CDH3): find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/ spark-1.0.2-bin-hadoop1.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop1.tgz * For CDH4: find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/ spark-1.0.2-bin-cdh4.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-cdh4.tgz * For Hadoop 2 (HDP2, CDH5): find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/ spark-1.0.2-bin-hadoop2.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz Pre-built packages, third-party (NOTE: may include non ASF-compatible licenses): * For MapRv3: direct file download (external) http://package.mapr.com/tools/apache-spark/1.0.2/ spark-1.0.2-bin-mapr3.tgz * For MapRv4: direct file download (external) http://package.mapr.com/tools/apache-spark/1.0.2/ spark-1.0.2-bin-mapr4.tgz From the above it looks like that I've to donwload Hadoop or CDH4 first in order to use Spark ? I've a standalone cluster and my data size is also like hundreds of Gig or close to Terabyte. I don't get it that which one I need to download from the above list. Could some one assist me that which one I need to download for standalone cluster and for big data foot print ? or Hadoop is needed or mandatory for using Spark? that's not the understanding I've. My understanding is that you can use spark with Hadoop if you like from yarn2 but you could use spark standalone also without hadoop. Please assist. I'm confused ! -Sanjeev - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io
Re: Where to save intermediate results?
I assume your on-demand calculations are a streaming flow? If your data aggregated from batch isn't too large, maybe you should just save it to disk; when your streaming flow starts you can read the aggregations back from disk and perhaps just broadcast them. Though I guess you'd have to restart your streaming flow when these aggregations are updated. For something more sophisticated, maybe look at Redis http://redis.io/ or some distributed database? Your ETL can update that store, and your on-demand job can query it. On Thu, Aug 28, 2014 at 4:30 PM, huylv huy.le...@insight-centre.org wrote: Hi, I'm building a system for near real-time data analytics. My plan is to have an ETL batch job which calculates aggregations running periodically. User queries are then parsed for on-demand calculations, also in Spark. Where are the pre-calculated results supposed to be saved? I mean, after finishing aggregations, the ETL job will terminate, so caches are wiped out of memory. How can I use these results to calculate on-demand queries? Or more generally, could you please give me a good way to organize the data flow and jobs in order to achieve this? I'm new to Spark so sorry if this might sound like a dumb question. Thank you. Huy -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Where-to-save-intermediate-results-tp13062.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io
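For reference, a rough sketch of the "save to disk, read back and broadcast" option, assuming the aggregations fit comfortably in memory on the driver; the paths and the tab-separated key/value layout are placeholders, not something from the thread:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD implicits (pre-1.3 style)
import org.apache.spark.rdd.RDD

// Batch ETL side: persist aggregations as simple tab-separated key/value lines.
def saveAggregates(aggregates: RDD[(String, Double)]): Unit =
  aggregates.map { case (k, v) => s"$k\t$v" }.saveAsTextFile("hdfs:///aggregates/latest")

// Query/streaming side: load the latest aggregates at startup and broadcast them.
def loadAndBroadcast(sc: SparkContext) = {
  val aggMap = sc.textFile("hdfs:///aggregates/latest")
    .map { line => val Array(k, v) = line.split("\t"); (k, v.toDouble) }
    .collectAsMap()
  sc.broadcast(aggMap) // tasks read the broadcast's .value instead of hitting storage
}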
Re: What happens if I have a function like a PairFunction but which might return multiple values
To emulate a Mapper, flatMap() is exactly what you want. Since it flattens, it means you return an Iterable of values instead of 1 value. That can be a Collection containing many values, or 1, or 0. For a reducer, to really reproduce what a Reducer does in Java, I think you will need groupByKey() followed by flatMapValues(). In many cases when I work with Map Reduce my mapper or my reducer might take a single value and map it to multiple keys - The reducer might also take a single key and emit multiple values I don't think that functions like flatMap and reduceByKey will work or are there tricks I am not aware of - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
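For reference, a small sketch of these two points in Scala (the question is about the Java API, but the shape is the same); the word-count-style data is just an illustration:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD implicits (pre-1.3 style)

def mapperReducerStyle(sc: SparkContext): Unit = {
  val lines = sc.parallelize(Seq("a b a", "c", ""))

  // "Mapper": each input record may produce zero, one, or many key/value pairs.
  val pairs = lines.flatMap { line =>
    if (line.isEmpty) Nil
    else line.split(" ").map(word => (word, 1)).toSeq
  }

  // "Reducer": all values for a key at once, emitting zero or more outputs per key.
  val perKey = pairs.groupByKey().flatMapValues { values =>
    val total = values.sum
    Seq(total, values.size) // multiple output values for a single key
  }

  perKey.collect().foreach(println)
}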
RE: Q on downloading spark for standalone cluster
Hello Daniel, If you’re not using Hadoop then why you want to grab the Hadoop package? CDH5 will download all the Hadoop packages and cloudera manager too. Just curious what happen if you start spark on EC2 cluster, what it choose for the data store as default? -Sanjeev From: Daniel Siegmann [mailto:daniel.siegm...@velos.io] Sent: Thursday, August 28, 2014 2:04 PM To: Sagar, Sanjeev Cc: user@spark.apache.org Subject: Re: Q on downloading spark for standalone cluster If you aren't using Hadoop, I don't think it matters which you download. I'd probably just grab the Hadoop 2 package. Out of curiosity, what are you using as your data store? I get the impression most Spark users are using HDFS or something built on top. On Thu, Aug 28, 2014 at 4:07 PM, Sanjeev Sagar sanjeev.sa...@mypointscorp.commailto:sanjeev.sa...@mypointscorp.com wrote: Hello there, I've a basic question on the downloadthat which option I need to downloadfor standalone cluster. I've a private cluster of three machineson Centos. When I click on download it shows me following: Download Spark The latest release is Spark 1.0.2, released August 5, 2014 (release notes) http://spark.apache.org/releases/spark-release-1-0-2.html (git tag) https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f Pre-built packages: * For Hadoop 1 (HDP1, CDH3): find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/spark-1.0.2-bin-hadoop1.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop1.tgz * For CDH4: find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/spark-1.0.2-bin-cdh4.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-cdh4.tgz * For Hadoop 2 (HDP2, CDH5): find an Apache mirror http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.2/spark-1.0.2-bin-hadoop2.tgz or direct file download http://d3kbcqa49mib13.cloudfront.net/spark-1.0.2-bin-hadoop2.tgz Pre-built packages, third-party (NOTE: may include non ASF-compatible licenses): * For MapRv3: direct file download (external) http://package.mapr.com/tools/apache-spark/1.0.2/spark-1.0.2-bin-mapr3.tgz * For MapRv4: direct file download (external) http://package.mapr.com/tools/apache-spark/1.0.2/spark-1.0.2-bin-mapr4.tgz From the above it looks like that I've to donwload Hadoop or CDH4 first in order to use Spark ? I've a standalone cluster and my data size is also like hundreds of Gig or close to Terabyte. I don't get it that which one I need to download from the above list. Could some one assist me that which one I need to download for standalone cluster and for big data foot print ? or Hadoop is needed or mandatory for using Spark? that's not the understanding I've. My understanding is that you can use spark with Hadoop if you like from yarn2 but you could use spark standalone also without hadoop. Please assist. I'm confused ! -Sanjeev - To unsubscribe, e-mail: user-unsubscr...@spark.apache.orgmailto:user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.orgmailto:user-h...@spark.apache.org -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.iomailto:daniel.siegm...@velos.io W: www.velos.iohttp://www.velos.io
Re: Failed to run runJob at ReceiverTracker.scala
Do you see this error right in the beginning or after running for sometime? The root cause seems to be that somehow your Spark executors got killed, which killed receivers and caused further errors. Please try to take a look at the executor logs of the lost executor to find what is the root cause that caused the executor to fail. TD On Thu, Aug 28, 2014 at 3:54 PM, Tim Smith secs...@gmail.com wrote: Hi, Have a Spark-1.0.0 (CDH5) streaming job reading from kafka that died with: 14/08/28 22:28:15 INFO DAGScheduler: Failed to run runJob at ReceiverTracker.scala:275 Exception in thread Thread-59 14/08/28 22:28:15 INFO YarnClientClusterScheduler: Cancelling stage 2 14/08/28 22:28:15 INFO DAGScheduler: Executor lost: 5 (epoch 4) 14/08/28 22:28:15 INFO BlockManagerMasterActor: Trying to remove executor 5 from BlockManagerMaster. 14/08/28 22:28:15 INFO BlockManagerMaster: Removed 5 successfully in removeExecutor org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0:0 failed 4 times, most recent failure: TID 6481 on host node-dn1-1.ops.sfdc.net failed for unknown reason Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Any insights into this error? Thanks, Tim
Re: DStream repartitioning, performance tuning processing
If you are repartitioning to 8 partitions, and your node happen to have at least 4 cores each, its possible that all 8 partitions are assigned to only 2 nodes. Try increasing the number of partitions. Also make sure you have executors (allocated by YARN) running on more than two nodes if you want to use all 11 nodes in your yarn cluster. If you are using Spark 1.x, then you dont need to set the ttl for running Spark Streaming. In case you are using older version, why do you want to reduce it? You could reduce it, but it does increase the risk of the premature cleaning, if once in a while things get delayed by 20 seconds. I dont see much harm in keeping the ttl at 60 seconds (a bit of extra garbage shouldnt hurt performance). TD On Thu, Aug 28, 2014 at 3:16 PM, Tim Smith secs...@gmail.com wrote: Hi, In my streaming app, I receive from kafka where I have tried setting the partitions when calling createStream or later, by calling repartition - in both cases, the number of nodes running the tasks seems to be stubbornly stuck at 2. Since I have 11 nodes in my cluster, I was hoping to use more nodes. I am starting the job as: nohup spark-submit --class logStreamNormalizer --master yarn log-stream-normalizer_2.10-1.0.jar --jars spark-streaming-kafka_2.10-1.0.0.jar,kafka_2.10-0.8.1.1.jar,zkclient-0.3.jar,metrics-core-2.2.0.jar,json4s-jackson_2.10-3.2.10.jar --executor-memory 30G --spark.cleaner.ttl 60 --executor-cores 8 --num-executors 8 normRunLog-6.log 2normRunLogError-6.log echo $! run-6.pid My main code is: val sparkConf = new SparkConf().setAppName(SparkKafkaTest) val ssc = new StreamingContext(sparkConf,Seconds(5)) val kInMsg = KafkaUtils.createStream(ssc,node-nn1-1:2181/zk_kafka,normApp,Map(rawunstruct - 16)) val propsMap = Map(metadata.broker.list - node-dn1-6:9092,node-dn1-7:9092,node-dn1-8:9092, serializer.class - kafka.serializer.StringEncoder, producer.type - async, request.required.acks - 1) val to_topic = normStruct val writer = new KafkaOutputService(to_topic, propsMap) if (!configMap.keySet.isEmpty) { //kInMsg.repartition(8) val outdata = kInMsg.map(x=normalizeLog(x._2,configMap)) outdata.foreachRDD((rdd,time) = { rdd.foreach(rec = { writer.output(rec) }) } ) } ssc.start() ssc.awaitTermination() In terms of total delay, with a 5 second batch, the delays usually stay under 5 seconds, but sometimes jump to ~10 seconds. As a performance tuning question, does this mean, I can reduce my cleaner ttl from 60 to say 25 (still more than double of the peak delay)? Thanks Tim
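To make the repartitioning suggestion concrete, a minimal sketch against the code above (128 is an assumption, roughly 2x of 8 executors x 8 cores; note that DStream.repartition returns a new DStream, so the commented-out kInMsg.repartition(8) by itself has no effect):

// repartition the received Kafka stream before the map, and use the result
val repartitioned = kInMsg.repartition(128)
val outdata = repartitioned.map(x => normalizeLog(x._2, configMap))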
Re: Kinesis receiver spark streaming partition
great question, wei. this is very important to understand from a performance perspective. and this extends is beyond kinesis - it's for any streaming source that supports shards/partitions. i need to do a little research into the internals to confirm my theory. lemme get back to you! -chris On Tue, Aug 26, 2014 at 11:37 AM, Wei Liu wei@stellarloyalty.com wrote: We are exploring using Kinesis and spark streaming together. I took at a look at the kinesis receiver code in 1.1.0. I have a question regarding kinesis partition spark streaming partition. It seems to be pretty difficult to align these partitions. Kinesis partitions a stream of data into shards, if we follow the example, we will have multiple kinesis receivers reading from the same stream in spark streaming. It seems like kinesis workers will coordinate among themselves and assign shards to themselves dynamically. For a particular shard, it can be consumed by different kinesis workers (thus different spark workers) dynamically (not at the same time). Blocks are generated based on time intervals, RDD are created based on blocks. RDDs are partitioned based on blocks. At the end, the data for a given shard will be spread into multiple blocks (possible located on different spark worker nodes). We will probably need to group these data again for a given shard and shuffle data around to achieve the same partition we had in Kinesis. Is there a better way to achieve this to avoid data reshuffling? Thanks, Wei
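As a rough illustration of the regrouping the post describes (this is not what the stock Kinesis receiver does; it assumes records have already been tagged with their shard id, e.g. by a custom receiver or by embedding the id in the payload):

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._            // pair-RDD functions in Spark 1.x
import org.apache.spark.streaming.dstream.DStream

// One shuffle per batch to co-locate all records of a shard in one partition.
def regroupByShard(tagged: DStream[(String, Array[Byte])],
                   numPartitions: Int): DStream[(String, Array[Byte])] =
  tagged.transform(rdd => rdd.partitionBy(new HashPartitioner(numPartitions)))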
Re: org.apache.hadoop.io.compress.SnappyCodec not found
Hi, If change my etc/hadoop/core-site.xml from property nameio.compression.codecs/name value org.apache.hadoop.io.compress.SnappyCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec /value /property to property nameio.compression.codecs/name value org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.SnappyCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec /value /property and run the test again, I found this time it cannot find org.apache.hadoop.io.compress.GzipCodec scala inFILE.first() java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.RDD.take(RDD.scala:983) at org.apache.spark.rdd.RDD.first(RDD.scala:1015) at $iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24) at init(console:26) at .init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at 
org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at
Re: Failed to run runJob at ReceiverTracker.scala
Appeared after running for a while. I re-ran the job and this time, it crashed with: 14/08/29 00:18:50 WARN ReceiverTracker: Error reported by receiver for stream 0: Error in block pushing thread - java.net.SocketException: Too many open files Shouldn't the failed receiver get re-spawned on a different worker? On Thu, Aug 28, 2014 at 4:12 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Do you see this error right in the beginning or after running for sometime? The root cause seems to be that somehow your Spark executors got killed, which killed receivers and caused further errors. Please try to take a look at the executor logs of the lost executor to find what is the root cause that caused the executor to fail. TD On Thu, Aug 28, 2014 at 3:54 PM, Tim Smith secs...@gmail.com wrote: Hi, Have a Spark-1.0.0 (CDH5) streaming job reading from kafka that died with: 14/08/28 22:28:15 INFO DAGScheduler: Failed to run runJob at ReceiverTracker.scala:275 Exception in thread Thread-59 14/08/28 22:28:15 INFO YarnClientClusterScheduler: Cancelling stage 2 14/08/28 22:28:15 INFO DAGScheduler: Executor lost: 5 (epoch 4) 14/08/28 22:28:15 INFO BlockManagerMasterActor: Trying to remove executor 5 from BlockManagerMaster. 14/08/28 22:28:15 INFO BlockManagerMaster: Removed 5 successfully in removeExecutor org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0:0 failed 4 times, most recent failure: TID 6481 on host node-dn1-1.ops.sfdc.net failed for unknown reason Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Any insights into this error? Thanks, Tim
Re: DStream repartitioning, performance tuning processing
TD - Apologies, didn't realize I was replying to you instead of the list. What does numPartitions refer to when calling createStream? I read an earlier thread that seemed to suggest that numPartitions translates to partitions created on the Spark side? http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201407.mbox/%3ccaph-c_o04j3njqjhng5ho281mqifnf3k_r6coqxpqh5bh6a...@mail.gmail.com%3E Actually, I re-tried with 64 numPartitions in createStream and that didn't work. I will manually set repartition to 64/128 and see how that goes. Thanks. On Thu, Aug 28, 2014 at 5:42 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Having 16 partitions in KafkaUtils.createStream does not translate to the RDDs in Spark / Spark Streaming having 16 partitions. Repartition is the best way to distribute the received data between all the nodes, as long as there are sufficient number of partitions (try setting it to 2x the number cores given to the application). Yeah, in 1.0.0, ttl should be unnecessary. On Thu, Aug 28, 2014 at 5:17 PM, Tim Smith secs...@gmail.com wrote: On Thu, Aug 28, 2014 at 4:19 PM, Tathagata Das tathagata.das1...@gmail.com wrote: If you are repartitioning to 8 partitions, and your node happen to have at least 4 cores each, its possible that all 8 partitions are assigned to only 2 nodes. Try increasing the number of partitions. Also make sure you have executors (allocated by YARN) running on more than two nodes if you want to use all 11 nodes in your yarn cluster. If you look at the code, I commented out the manual re-partitioning to 8. Instead, I am created 16 partitions when I call createStream. But I will increase the partitions to, say, 64 and see if I get better parallelism. If you are using Spark 1.x, then you dont need to set the ttl for running Spark Streaming. In case you are using older version, why do you want to reduce it? You could reduce it, but it does increase the risk of the premature cleaning, if once in a while things get delayed by 20 seconds. I dont see much harm in keeping the ttl at 60 seconds (a bit of extra garbage shouldnt hurt performance). I am running 1.0.0 (CDH5) so ttl setting is redundant? But you are right, unless I have memory issues, more aggressive pruning won't help. Thanks, Tim TD On Thu, Aug 28, 2014 at 3:16 PM, Tim Smith secs...@gmail.com wrote: Hi, In my streaming app, I receive from kafka where I have tried setting the partitions when calling createStream or later, by calling repartition - in both cases, the number of nodes running the tasks seems to be stubbornly stuck at 2. Since I have 11 nodes in my cluster, I was hoping to use more nodes. I am starting the job as: nohup spark-submit --class logStreamNormalizer --master yarn log-stream-normalizer_2.10-1.0.jar --jars spark-streaming-kafka_2.10-1.0.0.jar,kafka_2.10-0.8.1.1.jar,zkclient-0.3.jar,metrics-core-2.2.0.jar,json4s-jackson_2.10-3.2.10.jar --executor-memory 30G --spark.cleaner.ttl 60 --executor-cores 8 --num-executors 8 normRunLog-6.log 2normRunLogError-6.log echo $! 
run-6.pid My main code is: val sparkConf = new SparkConf().setAppName(SparkKafkaTest) val ssc = new StreamingContext(sparkConf,Seconds(5)) val kInMsg = KafkaUtils.createStream(ssc,node-nn1-1:2181/zk_kafka,normApp,Map(rawunstruct - 16)) val propsMap = Map(metadata.broker.list - node-dn1-6:9092,node-dn1-7:9092,node-dn1-8:9092, serializer.class - kafka.serializer.StringEncoder, producer.type - async, request.required.acks - 1) val to_topic = normStruct val writer = new KafkaOutputService(to_topic, propsMap) if (!configMap.keySet.isEmpty) { //kInMsg.repartition(8) val outdata = kInMsg.map(x=normalizeLog(x._2,configMap)) outdata.foreachRDD((rdd,time) = { rdd.foreach(rec = { writer.output(rec) }) } ) } ssc.start() ssc.awaitTermination() In terms of total delay, with a 5 second batch, the delays usually stay under 5 seconds, but sometimes jump to ~10 seconds. As a performance tuning question, does this mean, I can reduce my cleaner ttl from 60 to say 25 (still more than double of the peak delay)? Thanks Tim
Re: Change delimiter when collecting SchemaRDD
The comma is just the way the default toString works for Row objects. Since SchemaRDDs are also RDDs, you can do arbitrary transformations on the Row objects that are returned. For example, if you'd rather the delimiter was '|': sql(SELECT * FROM src).map(_.mkString(|)).collect() On Thu, Aug 28, 2014 at 7:58 AM, yadid ayzenberg ya...@media.mit.edu wrote: Hi All, Is there any way to change the delimiter from being a comma ? Some of the strings in my data contain commas as well, making it very difficult to parse the results. Yadid
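A slightly fuller sketch of the same idea (the table name src and the output path are hypothetical; Row behaves like a Seq here, so any delimiter works):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val rows = hiveContext.sql("SELECT * FROM src")

// pick whatever delimiter the downstream parser expects, e.g. a tab
val tabSeparated = rows.map(_.mkString("\t"))

tabSeparated.collect().foreach(println)     // inspect locally
tabSeparated.saveAsTextFile("out_tsv")      // or write out to a (hypothetical) path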
Memory statistics in the Application detail UI
Hi, I am using a cluster where each node has 16GB (this is the executor memory). After I complete an MLlib job, the executor tab shows the following: Memory: 142.6 KB Used (95.5 GB Total) and individual worker nodes have the Memory Used values as 17.3 KB / 8.6 GB (this is different for different nodes). What does the second number signify (i.e. 8.6 GB and 95.5 GB)? If 17.3 KB was used out of the total memory of the node, should it not be 17.3 KB/16 GB? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Memory-statistics-in-the-Application-detail-UI-tp13082.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: org.apache.hadoop.io.compress.SnappyCodec not found
Hi, I fixed the issue by copying libsnappy.so to Java ire. Regards Arthur On 29 Aug, 2014, at 8:12 am, arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com wrote: Hi, If change my etc/hadoop/core-site.xml from property nameio.compression.codecs/name value org.apache.hadoop.io.compress.SnappyCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec /value /property to property nameio.compression.codecs/name value org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.SnappyCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec /value /property and run the test again, I found this time it cannot find org.apache.hadoop.io.compress.GzipCodec scala inFILE.first() java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.RDD.take(RDD.scala:983) at org.apache.spark.rdd.RDD.first(RDD.scala:1015) at $iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24) at init(console:26) at .init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at 
org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
The concurrent model of spark job/stage/task
hi, guys I am trying to understand how Spark works in the concurrent model. I read the following from https://spark.apache.org/docs/1.0.2/job-scheduling.html quote Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users). I searched everywhere but could not find: 1. How do I start 2 or more jobs in one Spark driver, in Java code? I wrote 2 actions in the code, but the jobs are still staged at index 0, 1, 2, 3... It looks like they run sequentially. 2. Do the stages run concurrently? They are always numbered in order 0, 1, 2, 3... as I observed on the Spark stage UI. 3. Can I retrieve the data out of an RDD, e.g. populate a POJO myself and compute on it? Thanks in advance, guys. 35597...@qq.com
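To answer question 1 concretely, here is a minimal Scala sketch (the same pattern applies from Java threads); the FAIR scheduler setting is optional and only affects how concurrent jobs share resources:

import org.apache.spark.{SparkConf, SparkContext}

object ConcurrentJobs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("concurrent-jobs")
      .set("spark.scheduler.mode", "FAIR")   // optional; the default is FIFO
    val sc = new SparkContext(conf)

    val a = sc.parallelize(1 to 1000000).map(_ * 2)
    val b = sc.parallelize(1 to 1000000).map(_ + 1)

    // Each action (count) is one "job". Submitting the actions from separate
    // threads lets the scheduler run them concurrently; from a single thread
    // they would be submitted one after the other, which is why the stages
    // appear to run sequentially.
    val t1 = new Thread(new Runnable { def run(): Unit = println("count A: " + a.count()) })
    val t2 = new Thread(new Runnable { def run(): Unit = println("count B: " + b.count()) })
    t1.start(); t2.start()
    t1.join(); t2.join()

    // Question 3: yes - actions like collect() or take(n) return plain local
    // objects on the driver, which you can copy into your own POJOs.
    val firstTen = a.take(10)
    sc.stop()
  }
}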
Problem using accessing HiveContext
Hi, While using HiveContext. If hive table created as test_datatypes(testbigint bigint, ss bigint ) select below works fine. For create table test_datatypes(testbigint bigint, testdec decimal(5,2) ) scala val dataTypes=hiveContext.hql(select * from test_datatypes) 14/08/28 21:18:44 INFO parse.ParseDriver: Parsing command: select * from test_datatypes 14/08/28 21:18:44 INFO parse.ParseDriver: Parse Completed 14/08/28 21:18:44 INFO analysis.Analyzer: Max iterations (2) reached for batch MultiInstanceRelations 14/08/28 21:18:44 INFO analysis.Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences java.lang.IllegalArgumentException: Error: ',', ':', or ';' expected at position 14 from 'bigint:decimal(5,2)' [0:bigint, 6::, 7:decimal, 14:(, 15:5, 16:,, 17:2, 18:)] at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:312) at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeString(TypeInfoUtils.java:716) at org.apache.hadoop.hive.serde2.lazy.LazyUtils.extractColumnInfo(LazyUtils.java:364) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initSerdeParams(LazySimpleSerDe.java:288) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:187) at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:218) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:272) at org.apache.hadoop.hive.ql.metadata.Table.checkValidity(Table.java:175) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:991) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:924) at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:58) at org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:143) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:122) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:122) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:122) at org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:149) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$2.applyOrElse(Analyzer.scala:83) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$2.applyOrElse(Analyzer.scala:81) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at 
scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156) Same exception happens with create table test_datatypes(testbigint bigint, testdate date ) ... Thanks, Igor.
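A possible workaround sketch, not a confirmed fix: the exception comes from the Hive serde bundled with this Spark build failing to parse the parameterized decimal(5,2) type, so declaring the column without precision/scale (or as double) may avoid the parse failure:

hiveContext.hql("create table test_datatypes2 (testbigint bigint, testdec decimal)")
// or store the value as double and handle rounding in the query/application:
hiveContext.hql("create table test_datatypes3 (testbigint bigint, testdec double)")
val rows = hiveContext.hql("select testbigint, testdec from test_datatypes3")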
FW: Reference Accounts Large Node Deployments
Anyone? No customers using streaming at scale? From: Steve Nunez snu...@hortonworks.com Date: Wednesday, August 27, 2014 at 9:08 To: user@spark.apache.org user@spark.apache.org Subject: Reference Accounts Large Node Deployments All, Does anyone have specific references to customers, use cases and large-scale deployments of Spark Streaming? By 'large scale' I mean both throughput and number of nodes. I'm attempting an objective comparison of Streaming and Storm and while this data is known for Storm, there appears to be little for Spark Streaming. If you know of any such deployments, please post them here because I am sure I'm not the only one wondering about this. If customer confidentiality prevents mentioning them by name, consider identifying them by industry, e.g. 'telco doing X with streaming using Y nodes'. Any information at all will be welcome. I'll feed back a summary and/or update a wiki page once I collate the information. Cheers, - Steve -- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
Odd saveAsSequenceFile bug
Hey Sparkies... I have an odd bug. I am running Spark 0.9.2 on Amazon EC2 machines as a job (i.e. not in REPL) After a bunch of processing, I tell spark to save my rdd to S3 using: rdd.saveAsSequenceFile(uri,codec) That line of code hangs. By hang I mean (a) Spark stages UI shows no update on that task succeeded (b) Pushing into that stage shows No task have reported metrics yet (c) Ganglia shows the cpu, network access etc at nil (d) No error logs on master or slave. The system just does nothing. What makes it weirder is if I modify the code to either.. (1) rdd.coalesce(41, true) // i.e. increase the num partitions by 1 OR (2) rdd.coalesce(39, false) // decrease by 1 with no shuffle It runs through like a charm... ?? Any ideas for debugging? shay
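For reference, a sketch of the workaround described above (same rdd, uri and codec as in the post; why the unmodified 40-partition RDD hangs is still the open question):

// forcing a coalesce/repartition immediately before the save unblocked the job
val repartitioned = rdd.coalesce(41, shuffle = true)   // or rdd.coalesce(39, shuffle = false)
repartitioned.saveAsSequenceFile(uri, codec)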
RE: Sorting Reduced/Groupd Values without Explicit Sorting
Hi list, Any change on this one? I think I have seen a lot of work being done on this lately but I am unable to forge a working solution from jira tickets. Any example would be highly appreciated. Tomas -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sorting-Reduced-Groupd-Values-without-Explicit-Sorting-tp8508p13088.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
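In the meantime, a simple (if not shuffle-efficient) way to get sorted values per key, assuming the values for any single key fit in one executor's memory - this is an explicit in-memory sort, not the shuffle-based secondary sort the thread is asking about:

import org.apache.spark.SparkContext._

val pairs = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2)))
val sortedGroups = pairs.groupByKey().mapValues(_.toSeq.sorted)
sortedGroups.collect().foreach(println)   // per key the values come back sorted: a -> 1,2,3 and b -> 2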
Re: OutofMemoryError when generating output
Yeah, saveAsTextFile is an RDD specific method. If you really want to use that method, just turn the map into an RDD: `sc.parallelize(x.toSeq).saveAsTextFile(...)` Reading through the api-docs will present you many more alternate solutions! Best, Burak - Original Message - From: SK skrishna...@gmail.com To: u...@spark.incubator.apache.org Sent: Thursday, August 28, 2014 12:45:22 PM Subject: Re: OutofMemoryError when generating output Hi, Thanks for the response. I tried to use countByKey. But I am not able to write the output to console or to a file. Neither collect() nor saveAsTextFile() work for the Map object that is generated after countByKey(). valx = sc.textFile(baseFile)).map { line = val fields = line.split(\t) (fields(11), fields(6)) // extract (month, user_id) }.distinct().countByKey() x.saveAsTextFile(...) // does not work. generates an error that saveAstextFile is not defined for Map object Is there a way to convert the Map object to an object that I can output to console and to a file? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/OutofMemoryError-when-generating-output-tp12847p13056.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
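Putting the suggestion together (field indexes and baseFile as in the original post; the output path is hypothetical): countByKey returns a local Map on the driver, so it can be printed directly or turned back into an RDD for saveAsTextFile:

val x = sc.textFile(baseFile).map { line =>
  val fields = line.split("\t")
  (fields(11), fields(6))            // (month, user_id)
}.distinct().countByKey()            // a plain scala.collection.Map on the driver

x.foreach { case (month, count) => println(month + "\t" + count) }   // print it
sc.parallelize(x.toSeq).saveAsTextFile("month_counts")               // or write it out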
Re: Memory statistics in the Application detail UI
Each executor reserves some memory for storing RDDs in memory, and some for executor operations like shuffling. The number you see is memory reserved for storing RDDs, and defaults to about 0.6 of the total (spark.storage.memoryFraction). On Fri, Aug 29, 2014 at 2:32 AM, SK skrishna...@gmail.com wrote: Hi, I am using a cluster where each node has 16GB (this is the executor memory). After I complete an MLlib job, the executor tab shows the following: Memory: 142.6 KB Used (95.5 GB Total) and individual worker nodes have the Memory Used values as 17.3 KB / 8.6 GB (this is different for different nodes). What does the second number signify (i.e. 8.6 GB and 95.5 GB)? If 17.3 KB was used out of the total memory of the node, should it not be 17.3 KB/16 GB? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Memory-statistics-in-the-Application-detail-UI-tp13082.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Memory statistics in the Application detail UI
Hi, Spark uses by default approximately 60% of the executor heap memory to store RDDs. That's why you have 8.6GB instead of 16GB. 95.5 is therefore the sum of all the 8.6 GB of executor memory + the driver memory. Best, Burak - Original Message - From: SK skrishna...@gmail.com To: u...@spark.incubator.apache.org Sent: Thursday, August 28, 2014 6:32:32 PM Subject: Memory statistics in the Application detail UI Hi, I am using a cluster where each node has 16GB (this is the executor memory). After I complete an MLlib job, the executor tab shows the following: Memory: 142.6 KB Used (95.5 GB Total) and individual worker nodes have the Memory Used values as 17.3 KB / 8.6 GB (this is different for different nodes). What does the second number signify (i.e. 8.6 GB and 95.5 GB)? If 17.3 KB was used out of the total memory of the node, should it not be 17.3 KB/16 GB? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Memory-statistics-in-the-Application-detail-UI-tp13082.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
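If a different split between cache space and working memory is needed, the fraction is configurable; a minimal sketch (0.6 is the approximate default mentioned above, and the exact default can vary slightly between versions):

import org.apache.spark.{SparkConf, SparkContext}

// lower the fraction if shuffles need more working memory, raise it for cache-heavy jobs
val conf = new SparkConf()
  .setAppName("memory-fraction-example")
  .set("spark.storage.memoryFraction", "0.5")
val sc = new SparkContext(conf)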
Spark / Thrift / ODBC connectivity
I’m currently using the Spark 1.1 branch and have been able to get the Thrift service up and running. The quick questions are: should I be able to use the Thrift service to connect to Spark SQL generated tables and/or Hive tables? As well, by any chance do we have any documents that show how to connect something like Tableau to the Spark SQL Thrift server - similar to the SAP ODBC connectivity http://www.saphana.com/docs/DOC-472? Thanks! Denny
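As a quick smoke test that the Thrift server is reachable over JDBC before wiring up an ODBC client such as Tableau, something along these lines should work with the 1.1 branch (the port 10000 and the master URL are assumptions; adjust to your deployment):

# start the HiveServer2-compatible Thrift server, then connect with beeline
./sbin/start-thriftserver.sh --master spark://master-host:7077
./bin/beeline -u jdbc:hive2://localhost:10000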
How to debug this error?
Hello I'm new to Spark and playing around, but saw the following error. Could anyone to help on it? Thanks Gary scala c res15: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[7] at flatMap at console:23 scala group res16: org.apache.spark.rdd.RDD[(String, Iterable[String])] = MappedValuesRDD[5] at groupByKey at console:19 val d = c.map(i=group.filter(_._1 ==i )) d.first 14/08/29 04:39:33 INFO TaskSchedulerImpl: Cancelling stage 28 14/08/29 04:39:33 INFO DAGScheduler: Failed to run first at console:28 org.apache.spark.SparkException: Job aborted due to stage failure: Task 28.0:180 failed 4 times, most recent failure: Exception failure in TID 3605 on host mcs-spark-slave1-staging.snc1: java.lang.NullPointerException org.apache.spark.rdd.RDD.filter(RDD.scala:282) $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:25) $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:25) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) scala.collection.Iterator$$anon$10.next(Iterator.scala:312) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) scala.collection.AbstractIterator.to(Iterator.scala:1157) scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) scala.collection.AbstractIterator.toArray(Iterator.scala:1157) org.apache.spark.rdd.RDD$$anonfun$28.apply(RDD.scala:1003) org.apache.spark.rdd.RDD$$anonfun$28.apply(RDD.scala:1003) org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083) org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111) org.apache.spark.scheduler.Task.run(Task.scala:51) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:744) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at 
akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
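The NullPointerException above is the classic "RDD referenced inside another RDD's closure" problem: group is not usable on the executors, so group.filter(...) inside c.map fails at runtime. Two standard rewrites, sketched with the same names c and group as in the post:

import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

// 1) Express the lookup as a join; both sides stay distributed.
val keyed = c.map(i => (i, ()))
val matched = keyed.join(group).map { case (k, (_, vs)) => (k, vs) }

// 2) If group is small, collect it on the driver and broadcast it.
val lookup = sc.broadcast(group.collectAsMap())
val d = c.map(i => (i, lookup.value.get(i)))   // Option[Iterable[String]] per element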
Spark Hive max key length is 767 bytes
(Please ignore if duplicated) Hi, I use Spark 1.0.2 with Hive 0.13.1 I have already set the hive mysql database to latine1; mysql: alter database hive character set latin1; Spark: scala val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) scala hiveContext.hql(create table test_datatype1 (testbigint bigint )) scala hiveContext.hql(drop table test_datatype1) 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class org.apache.hadoop.hive.metastore.model.MFieldSchema is tagged as embedded-only so does not have its own datastore table. 14/08/29 12:31:55 INFO DataNucleus.Datastore: The class org.apache.hadoop.hive.metastore.model.MOrder is tagged as embedded-only so does not have its own datastore table. 14/08/29 12:31:59 ERROR DataNucleus.Datastore: An exception was thrown while adding/validating class(es) : Specified key was too long; max key length is 767 bytes com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at com.mysql.jdbc.Util.handleNewInstance(Util.java:408) at com.mysql.jdbc.Util.getInstance(Util.java:383) at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1062) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4226) at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4158) Can you please advise what would be wrong? Regards Arthur
Re: Memory statistics in the Application detail UI
Hi, Thanks for the responses. I understand that the second values in the Memory Used column for the executors add up to 95.5 GB and the first values add up to 17.3 KB. If 95.5 GB is the memory used to store the RDDs, then what is 17.3 KB ? is that the memory used for shuffling operations? For non MLlib applications I get 0.0 for the first number - i.e memory used is 0.0 (95.5 GB Total). Is the total memory used the sum of the two numbers or is the first number included in the second number (i.e is 17.3 KB included in the 95.5 GB)? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Memory-statistics-in-the-Application-detail-UI-tp13082p13095.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Failed to run runJob at ReceiverTracker.scala
It did. It got failed and respawned 4 times. In this case, the too many open files is a sign that you need increase the system-wide limit of open files. Try adding ulimit -n 16000 to your conf/spark-env.sh. TD On Thu, Aug 28, 2014 at 5:29 PM, Tim Smith secs...@gmail.com wrote: Appeared after running for a while. I re-ran the job and this time, it crashed with: 14/08/29 00:18:50 WARN ReceiverTracker: Error reported by receiver for stream 0: Error in block pushing thread - java.net.SocketException: Too many open files Shouldn't the failed receiver get re-spawned on a different worker? On Thu, Aug 28, 2014 at 4:12 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Do you see this error right in the beginning or after running for sometime? The root cause seems to be that somehow your Spark executors got killed, which killed receivers and caused further errors. Please try to take a look at the executor logs of the lost executor to find what is the root cause that caused the executor to fail. TD On Thu, Aug 28, 2014 at 3:54 PM, Tim Smith secs...@gmail.com wrote: Hi, Have a Spark-1.0.0 (CDH5) streaming job reading from kafka that died with: 14/08/28 22:28:15 INFO DAGScheduler: Failed to run runJob at ReceiverTracker.scala:275 Exception in thread Thread-59 14/08/28 22:28:15 INFO YarnClientClusterScheduler: Cancelling stage 2 14/08/28 22:28:15 INFO DAGScheduler: Executor lost: 5 (epoch 4) 14/08/28 22:28:15 INFO BlockManagerMasterActor: Trying to remove executor 5 from BlockManagerMaster. 14/08/28 22:28:15 INFO BlockManagerMaster: Removed 5 successfully in removeExecutor org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0:0 failed 4 times, most recent failure: TID 6481 on host node-dn1-1.ops.sfdc.net failed for unknown reason Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Any insights into this error? Thanks, Tim
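For reference, the suggested change is just a line in conf/spark-env.sh on each node running executors (16000 is the value suggested above; the hard limit in /etc/security/limits.conf may also need to allow it):

# conf/spark-env.sh
ulimit -n 16000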
Re: Memory statistics in the Application detail UI
Click the Storage tab. You have some (tiny) RDD persisted in memory. On Fri, Aug 29, 2014 at 5:58 AM, SK skrishna...@gmail.com wrote: Hi, Thanks for the responses. I understand that the second values in the Memory Used column for the executors add up to 95.5 GB and the first values add up to 17.3 KB. If 95.5 GB is the memory used to store the RDDs, then what is 17.3 KB ? is that the memory used for shuffling operations? For non MLlib applications I get 0.0 for the first number - i.e memory used is 0.0 (95.5 GB Total). Is the total memory used the sum of the two numbers or is the first number included in the second number (i.e is 17.3 KB included in the 95.5 GB)? thanks - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: org.apache.hadoop.io.compress.SnappyCodec not found
Hi, You can set the settings in conf/spark-env.sh like this:export SPARK_LIBRARY_PATH=/usr/lib/hadoop/lib/native/ SPARK_JAVA_OPTS+=-Djava.library.path=$SPARK_LIBRARY_PATH SPARK_JAVA_OPTS+=-Dspark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec SPARK_JAVA_OPTS+=-Dio.compression.codecs=org.apache.hadoop.io.compress.SnappyCodec export SPARK_JAVA_OPTS And make sure the snappy native library is located in /usr/lib/hadoop/lib/native/ directory. Also, if you are using 64 bit system, remember to re-compile the snappy native library to 64-bit, cause the official snappy lib is 32-bit. $ file /usr/lib/hadoop/lib/native/*/usr/lib/hadoop/lib/native/libhadoop.a: current ar archive/usr/lib/hadoop/lib/native/libhadooppipes.a: current ar archive/usr/lib/hadoop/lib/native/libhadoop.so: symbolic link to `libhadoop.so.1.0.0'/usr/lib/hadoop/lib/native/libhadoop.so.1.0.0: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, stripped/usr/lib/hadoop/lib/native/libhadooputils.a: current ar archive/usr/lib/hadoop/lib/native/libhdfs.a: current ar archive/usr/lib/hadoop/lib/native/libsnappy.so: symbolic link to `libsnappy.so.1.1.3'/usr/lib/hadoop/lib/native/libsnappy.so.1: symbolic link to `libsnappy.so.1.1.3'/usr/lib/hadoop/lib/native/libsnappy.so.1.1.3: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, stripped BRs,Patrick Liu Date: Thu, 28 Aug 2014 18:36:25 -0700 From: ml-node+s1001560n13084...@n3.nabble.com To: linkpatrick...@live.com Subject: Re: org.apache.hadoop.io.compress.SnappyCodec not found Hi, I fixed the issue by copying libsnappy.so to Java ire. RegardsArthur On 29 Aug, 2014, at 8:12 am, [hidden email] [hidden email] wrote:Hi, If change my etc/hadoop/core-site.xml frompropertynameio.compression.codecs/namevalue org.apache.hadoop.io.compress.SnappyCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec/value /property topropertynameio.compression.codecs/namevalue org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.SnappyCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec/value /property and run the test again, I found this time it cannot find org.apache.hadoop.io.compress.GzipCodec scala inFILE.first()java.lang.RuntimeException: Error in configuring object at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:75) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:158) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:171)at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:202) at org.apache.spark.rdd.RDD.take(RDD.scala:983) at org.apache.spark.rdd.RDD.first(RDD.scala:1015) at $iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24)at init(console:26) at 
.init(console:30) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console)at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608) at