Re: Migrate Relational to Distributed

2015-05-23 Thread Dmitry Tolpeko
Hi Brant,

Let me partially answer your concerns: please follow a new open source
project, PL/HQL (www.plhql.org), aimed at letting you reuse existing
logic and leverage existing skills to some extent, so you do not need to
rewrite everything in Scala/Java and can migrate gradually. I hope it can
help.

Thanks,

Dmitry

On Sat, May 23, 2015 at 1:22 AM, Brant Seibert brantseib...@hotmail.com
wrote:

 Hi,  The healthcare industry can do wonderful things with Apache Spark.
 But there is already a very large base of data and applications firmly rooted
 in the relational paradigm, and they are resistant to change - stuck on Oracle.

 **
 QUESTION 1 - Migrate legacy relational data (plus new transactions) to
 distributed storage?

 DISCUSSION 1 - The primary advantage I see is not having to engage in the
 lengthy (1+ years) process of creating a relational data warehouse and
 cubes.  Just store the data in a distributed system and analyze first in
 memory with Spark.

 **
 QUESTION 2 - Will we have to re-write the enormous amount of logic that is
 already built for the old relational system?

 DISCUSSION 2 - If we move the data to distributed storage, can we simply run the
 existing relational logic as SparkSQL queries?  [existing SQL -> Spark
 Context -> Cassandra -> process in SparkSQL -> display in existing UI].
 Can we create an RDD that uses existing SQL, or do we need to rewrite all
 our SQL?
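
 A minimal sketch of the reuse idea, using a hypothetical Claim schema and stand-in
 data (in practice the DataFrame would be loaded from the distributed store, e.g.
 through the Cassandra connector); existing ANSI SQL can then often be submitted
 through SparkSQL unchanged:

 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.SQLContext

 case class Claim(patientId: String, code: String, amount: Double) // hypothetical schema

 val sc = new SparkContext(new SparkConf().setAppName("LegacySqlOnSpark"))
 val sqlContext = new SQLContext(sc)
 import sqlContext.implicits._

 // Stand-in data; in practice this DataFrame comes from the distributed store.
 val claims = sc.parallelize(Seq(
   Claim("p1", "A01", 120.0), Claim("p1", "B02", 80.0), Claim("p2", "A01", 200.0))).toDF()
 claims.registerTempTable("claims")

 // Existing ANSI SQL, reused as-is through SparkSQL.
 sqlContext.sql("SELECT patientId, COUNT(*) AS visits, SUM(amount) AS total " +
   "FROM claims GROUP BY patientId").show()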

 **
 DATA SIZE - We are adding many new data sources to a system that already
 manages health care data for over a million people.  The number of rows may
 not be enormous right now compared to the advertising industry, for example,
 but the number of dimensions runs well into the thousands.  If we add to this
 IoT data for each health care patient, that creates billions of events per
 day, and the number of rows then grows very quickly.  We would like to be
 prepared to handle that huge-data scenario.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Migrate-Relational-to-Distributed-tp22999.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Is anyone using Amazon EC2?

2015-05-23 Thread Vadim Bichutskiy
Yes, we're running Spark on EC2. Will transition to EMR soon. -Vadim

On Sat, May 23, 2015 at 2:22 PM, Johan Beisser j...@caustic.org wrote:

 Yes.

 We're looking at bootstrapping in EMR...

 On Sat, May 23, 2015 at 07:21 Joe Wass jw...@crossref.org wrote:

 I used Spark on EC2 a while ago




Re: Is anyone using Amazon EC2?

2015-05-23 Thread Joe Wass
Sorry guys, my email got sent before I finished writing it. Check my other
message (with the same subject)!

On 23 May 2015 at 20:25, Shafaq s.abdullah...@gmail.com wrote:

 Yes - Spark EC2 cluster. Looking into migrating to Spark EMR.
 Adding more EC2 instances is not possible AFAIK.
 On May 23, 2015 11:22 AM, Johan Beisser j...@caustic.org wrote:

 Yes.

 We're looking at bootstrapping in EMR...
 On Sat, May 23, 2015 at 07:21 Joe Wass jw...@crossref.org wrote:

 I used Spark on EC2 a while ago




Re: spark.executor.extraClassPath - Values not picked up by executors

2015-05-23 Thread wesley.miao
My experience: don't put application-specific settings into
spark-defaults.conf, which is applied to all applications.


Instead, you can either set them programmatically, as you did below, or
through spark-submit.


Also, if you still want to do it via spark-defaults.conf, you will have to
change that file on every worker node when you go distributed one day. That
is not scalable, and not right either, as you would have to push your
app-specific class path to every Spark worker node's spark-defaults.conf.

iPhone

------------------ Original message ------------------
From: Todd Nist tsind...@gmail.com
Date: 2015-05-24 02:14
To: yana.kadiyska yana.kadiy...@gmail.com
Cc: user@spark.apache.org user@spark.apache.org
Subject: Re: spark.executor.extraClassPath - Values not picked up by executors



Hi Yana,

Yes, typo in the email; the file name is correct, spark-defaults.conf - thanks
though.  So it appears to work if in the driver I specify it as part of the
SparkConf:

val conf = new SparkConf().setAppName(getClass.getSimpleName)
  .set("spark.executor.extraClassPath",
"/projects/spark-cassandra-connector/spark-cassandra-connetor/target/scala-2.10/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar")

I thought the spark-defaults would be applied regardless of whether it was a
spark-submit (driver) or a custom driver as in my case, but apparently I am
mistaken.  This will work fine, as I can ensure that all hosts participating in
the cluster have access to a common directory with the dependencies and then
just set spark.executor.extraClassPath to
/some/shared/directory/lib/*.jar.

If there is a better way to address this, let me know.
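
One alternative sometimes used (a sketch; the assembly path below is an assumption to
be adjusted to your environment) is to ship the dependency with the application itself
via SparkConf.setJars, so nothing has to pre-exist on the workers:

import org.apache.spark.{SparkConf, SparkContext}

// Ship the connector assembly with the app instead of relying on a classpath
// entry that must already be present on every worker node.
val conf = new SparkConf()
  .setAppName("CassandraStreamingJob")
  .setJars(Seq("/path/to/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar"))
val sc = new SparkContext(conf)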

As for the spark-cassandra-connector 1.3.0-SNAPSHOT, I am building that from 
master.  Haven't hit any issue with it yet.

-Todd



On Fri, May 22, 2015 at 9:39 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote:
Todd, I don't have any answers for you...other than the file is actually named 
spark-defaults.conf (not sure if you made a typo in the email or misnamed the 
file...). Do any other options from that file get read?

I also wanted to ask if you built the 
spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar from trunk or if they 
published a 1.3 drop somewhere -- I'm just starting out with Cassandra and 
discovered
https://datastax-oss.atlassian.net/browse/SPARKC-98 is still open...



On Fri, May 22, 2015 at 6:15 PM, Todd Nist tsind...@gmail.com wrote:
I'm using the spark-cassandra-connector from DataStax in a Spark streaming job
launched from my own driver.  It is connecting to a standalone cluster on my
local box, which has two workers running.

This is Spark 1.3.1 and spark-cassandra-connector-1.3.0-SNAPSHOT.  I have added 
the following entry to my $SPARK_HOME/conf/spark-default.conf:


spark.executor.extraClassPath 
/projects/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar



When I start the master with, $SPARK_HOME/sbin/start-master.sh, it comes up 
just fine.  As do the two workers with the following command:


Worker 1, port 8081:

radtech:spark $ ./bin/spark-class org.apache.spark.deploy.worker.Worker 
spark://radtech.io:7077 --webui-port 8081 --cores 2
Worker 2, port 8082

radtech:spark $ ./bin/spark-class org.apache.spark.deploy.worker.Worker 
spark://radtech.io:7077 --webui-port 8082 --cores 2

When I execute the driver, connecting to the master:


sbt app/run -Dspark.master=spark://radtech.io:7077

It starts up, but when the executors are launched they do not include the entry 
in the spark.executor.extraClassPath:


15/05/22 17:35:26 INFO Worker: Asked to launch executor 
app-20150522173526-/0 for KillrWeatherApp$ 15/05/22 17:35:26 INFO 
ExecutorRunner: Launch command: java -cp 
/usr/local/spark/conf:/usr/local/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar:/usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark/lib/datanucleus-core-3.2.10.jar:/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/local/spark/conf:/usr/local/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar:/usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark/lib/datanucleus-core-3.2.10.jar:/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar
 -Dspark.driver.port=55932 -Xms512M -Xmx512M 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
akka.tcp://sparkDriver@192.168.1.3:55932/user/CoarseGrainedScheduler 
--executor-id 0 --hostname 192.168.1.3 --cores 2 --app-id 
app-20150522173526- --worker-url 
akka.tcp://sparkWorker@192.168.1.3:55923/user/Worker





which will then cause the executor to fail with a ClassNotFoundException, which 
I would expect:

[WARN] [2015-05-22 17:38:18,035] [org.apache.spark.scheduler.TaskSetManager]: 
Lost task 0.0 in stage 2.0 (TID 23, 192.168.1.3): 
java.lang.ClassNotFoundException: 
com.datastax.spark.connector.rdd.partitioner.CassandraPartition at 

Strange ClassNotFound exception

2015-05-23 Thread boci
Hi guys!

I have a small spark application. It queries some data from Postgres,
enriches it and writes it to Elasticsearch. When I deployed it into the Spark
container I got a very frustrating error:
https://gist.github.com/b0c1/66527e00bada1e4c0dc3

Spark version: 1.3.1
Hadoop version: 2.6.0
Additional info:
  serialization: kryo
  rdd: custom rdd to query

What I don't understand:
1. akka.actor.SelectionPath doesn't exist in 1.3.1
2. I checked all dependencies in my project; I only have
org.spark-project.akka:akka-*_2.10:2.3.4-spark:jar, which doesn't have it
3. I haven't found any reference to this...
4. I created my own RDD and it works, but do I need to register it with Kryo? (mapRow
using ResultSet, I need to create
5. I used this some months ago and it already worked with Spark 1.2... I
recompiled with 1.3.1 but I got this strange error

Any idea?

--
Skype: boci13, Hangout: boci.b...@gmail.com
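
For point 4, a minimal Kryo registration sketch (the row type below is hypothetical;
only classes that actually travel through Kryo-serialized shuffles or broadcasts need
registering):

import org.apache.spark.SparkConf

// Hypothetical row type standing in for whatever the custom JDBC-backed RDD produces.
case class PgRow(id: Long, payload: String)

val conf = new SparkConf()
  .setAppName("KryoRegistrationSketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Unregistered classes still work with Kryo, but carry the full class name per record.
  .registerKryoClasses(Array(classOf[PgRow]))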


Re: DataFrame groupBy vs RDD groupBy

2015-05-23 Thread ayan guha
Hi Michael,

This is great info. I am currently using the repartitionAndSortWithinPartitions
function to achieve the same thing. Is this the recommended way on 1.3, or is
there a better way?
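
For reference, a rough pre-1.4 sketch of that repartition-and-sort approach (the data
shape, key types and partition count are assumptions): partition by the group key only,
sort within partitions by (group, value), then number rows inside each group:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("RowNumberPre14"))

// (group, value) pairs; the composite key gives a secondary sort on the value.
val pairs = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2), ("b", 5)))
  .map { case (k, v) => ((k, v), ()) }

// Partition on the group key only, so each whole group lands in one partition.
val groupPartitioner = new HashPartitioner(4) {
  override def getPartition(key: Any): Int = key match {
    case (k, _) => super.getPartition(k)
    case other  => super.getPartition(other)
  }
}

val numbered = pairs
  .repartitionAndSortWithinPartitions(groupPartitioner)
  .mapPartitions { iter =>
    var current: Option[String] = None
    var n = 0
    iter.map { case ((k, v), _) =>
      if (current != Some(k)) { current = Some(k); n = 0 }
      n += 1
      (k, v, n) // (group, value, row number within the group)
    }
  }

numbered.collect().foreach(println)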
On 23 May 2015 07:38, Michael Armbrust mich...@databricks.com wrote:

 DataFrames have a lot more information about the data, so there is a whole
 class of optimizations that are possible there that we cannot do in RDDs.
 This is why we are focusing a lot of effort on this part of the project.
 In Spark 1.4 you can accomplish what you want using the new window function
 feature.  This can be done with SQL as you described or directly on a
 DataFrame:

 import org.apache.spark.sql.functions._
 import org.apache.spark.sql.expressions._

 val df = Seq(("a", 1), ("b", 1), ("c", 2), ("d", 2)).toDF("x", "y")
 df.select('x, 'y,
 rowNumber.over(Window.partitionBy("y").orderBy("x")).as("number")).show

 +-+-+------+
 |x|y|number|
 +-+-+------+
 |a|1|     1|
 |b|1|     2|
 |c|2|     1|
 |d|2|     2|
 +-+-+------+

 On Fri, May 22, 2015 at 3:35 AM, gtanguy g.tanguy.claravi...@gmail.com
 wrote:

 Hello everybody,

 I have two questions in one. I upgraded from Spark 1.1 to 1.3 and some parts
 of my code using groupBy became really slow.

 *1/* Why is the groupBy on the RDD really slow in comparison to the
 groupBy on the DataFrame?

 // DataFrame : running in a few seconds
 val result = table.groupBy("col1").count

 // RDD : taking hours with a lot of spilling in-memory
 val schemaOriginel = table.schema
 val result = table.rdd.groupBy { r =>
   val rs = RowSchema(r, schemaOriginel)
   val col1 = rs.getValueByName("col1")
   col1
 }.map(l => (l._1, l._2.size)).count()


 *2/* My goal is to group by a key, then to order each group over a column
 and finally to add the row number within each group. I had this code running
 before changing to Spark 1.3 and it worked fine, but since I changed to
 DataFrames it is really slow.

  val schemaOriginel = table.schema
  val result = table.rdd.groupBy { r =>
    val rs = RowSchema(r, schemaOriginel)
    val col1 = rs.getValueByName("col1")
    col1
  }.flatMap { l =>
    l._2.toList
      .sortBy { u =>
        val rs = RowSchema(u, schemaOriginel)
        val col1 = rs.getValueByName("col1")
        val col2 = rs.getValueByName("col2")
        (col1, col2)
      }.zipWithIndex
  }

 I think the SQL equivalent of what I am trying to do is:

 SELECT a,
        ROW_NUMBER() OVER (PARTITION BY a) AS num
 FROM table.

 I don't think I can do this with a GroupedData (the result of df.groupBy).
 Any ideas on how I can speed this up?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-groupBy-vs-RDD-groupBy-tp22995.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





Re: Strange ClassNotFound exception

2015-05-23 Thread Ted Yu
In my local maven repo, I found:

$ jar tvf
/Users/tyu/.m2/repository//org/spark-project/akka/akka-actor_2.10/2.3.4-spark/akka-actor_2.10-2.3.4-spark.jar
| grep SelectionPath
   521 Mon Sep 29 12:05:36 PDT 2014 akka/actor/SelectionPathElement.class

Is the above jar in your classpath ?

On Sat, May 23, 2015 at 5:05 PM, boci boci.b...@gmail.com wrote:

 Hi guys!

 I have a small spark application. It queries some data from Postgres,
 enriches it and writes it to Elasticsearch. When I deployed it into the Spark
 container I got a very frustrating error:
 https://gist.github.com/b0c1/66527e00bada1e4c0dc3

 Spark version: 1.3.1
 Hadoop version: 2.6.0
 Additional info:
   serialization: kryo
   rdd: custom rdd to query

 What I don't understand:
 1. akka.actor.SelectionPath doesn't exist in 1.3.1
 2. I checked all dependencies in my project; I only have
 org.spark-project.akka:akka-*_2.10:2.3.4-spark:jar, which doesn't have it
 3. I haven't found any reference to this...
 4. I created my own RDD and it works, but do I need to register it with Kryo? (mapRow
 using ResultSet, I need to create
 5. I used this some months ago and it already worked with Spark 1.2... I
 recompiled with 1.3.1 but I got this strange error

 Any idea?


 --
 Skype: boci13, Hangout: boci.b...@gmail.com



SparkSQL can't read S3 path for hive external table

2015-05-23 Thread ogoh

Hello,
I am using Spark 1.3 in AWS.
SparkSQL can't recognize a Hive external table on S3.
The following is the error message.
I appreciate any help.
Thanks,
Okehee
--  
15/05/24 01:02:18 ERROR thriftserver.SparkSQLDriver: Failed in [select
count(*) from api_search where pdate='2015-05-08']
java.lang.IllegalArgumentException: Wrong FS:
s3://test-emr/datawarehouse/api_s3_perf/api_search/pdate=2015-05-08/phour=00,
expected: hdfs://10.128.193.211:9000
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:647)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:467)
at
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)
at
org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
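
One workaround sometimes tried when ParquetRelation2's metadata cache trips over a
non-default filesystem like this (assuming the external table is Parquet-backed, which
this thread does not confirm) is to fall back to the Hive read path:

// Read the table through the Hive serde path instead of Spark's native
// ParquetRelation2 conversion; set this on the SQLContext/HiveContext in use.
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")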





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-can-t-read-S3-path-for-hive-external-table-tp23002.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Re: Re: How to use spark to access HBase with Security enabled

2015-05-23 Thread donhoff_h
Hi, 


The exception is the same as before. Just like the following:
2015-05-23 18:01:40,943 ERROR [hconnection-0x14027b82-shared--pool1-t1] 
ipc.AbstractRpcClient: SASL authentication failed. The most likely cause is 
missing or invalid credentials. Consider 'kinit'. 
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)] at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
at 
org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
 at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:604)
 at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:153)
  at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:730)
   at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:727)
   at java.security.AccessController.doPrivileged(Native Method)   at 
javax.security.auth.Subject.doAs(Subject.java:415)   at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)


But after many tests, I found that the root cause is:
1. When I try to get an HBase connection in Spark, or call the sc.newAPIHadoopRDD
API in Spark in yarn-cluster mode, it uses the
UserGroupInformation.getCurrentUser API to get the User object which is used for
authentication.
2. The UserGroupInformation.getCurrentUser API's logic is the following:
AccessControlContext context = AccessController.getContext();
Subject subject = Subject.getSubject(context);
return subject != null && !subject.getPrincipals(User.class).isEmpty()
    ? new UserGroupInformation(subject) : getLoginUser();

3. I printed the subject object to stdout. I found the user info is my
Linux OS user "spark", not the principal sp...@bgdt.dev.hrb


This is the reason why I cannot pass authentication.  The context of the
executor threads spawned by the NodeManager does not contain any principal info,
even though I am sure I have already set it up using the kinit command.  But I
still don't know why, or how to solve it.


Does anybody know how to set configurations so that the context of the threads
spawned by the NodeManager contains the principal I set up with the kinit
command?


Many Thanks!
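
One commonly suggested sketch (the principal and keytab path below are placeholders,
not values from this thread): log in from a keytab inside the executor JVM itself,
e.g. once per partition before touching HBase, so the executor's UGI carries the
Kerberos credentials rather than relying on a driver-side kinit:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

// The keytab must be readable on every node (e.g. shipped with --files).
def loginOnExecutor(): Unit = {
  val conf = new Configuration()
  conf.set("hadoop.security.authentication", "kerberos")
  UserGroupInformation.setConfiguration(conf)
  UserGroupInformation.loginUserFromKeytab("principal@REALM", "/path/to/user.keytab")
}

// rdd.foreachPartition { part => loginOnExecutor(); /* HBase reads go here */ }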
------------------ Original message ------------------
From: yuzhihong yuzhih...@gmail.com
Date: May 22, 2015, 7:25
To: donhoff_h 165612...@qq.com
Cc: Bill Q bill.q@gmail.com; user user@spark.apache.org
Subject: Re: Re: Re: How to use spark to access HBase with Security enabled



Can you share the exception(s) you encountered ?


Thanks




On May 22, 2015, at 12:33 AM, donhoff_h 165612...@qq.com wrote:


Hi,

My modified code is listed below; I just added the SecurityUtil API.  I don't know
which propertyKeys I should use, so I made up 2 of my own propertyKeys to find the
keytab and principal.

object TestHBaseRead2 {
 def main(args: Array[String]) {

   val conf = new SparkConf()
   val sc = new SparkContext(conf)
   val hbConf = HBaseConfiguration.create()
   hbConf.set("dhao.keytab.file", "//etc//spark//keytab//spark.user.keytab")
   hbConf.set("dhao.user.principal", "sp...@bgdt.dev.hrb")
   SecurityUtil.login(hbConf, "dhao.keytab.file", "dhao.user.principal")
   val conn = ConnectionFactory.createConnection(hbConf)
   val tbl = conn.getTable(TableName.valueOf("spark_t01"))
   try {
     val get = new Get(Bytes.toBytes("row01"))
     val res = tbl.get(get)
     println("result:" + res.toString)
   }
   finally {
     tbl.close()
     conn.close()
     es.shutdown()
   }

   val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
   val v = rdd.sum()
   println("Value=" + v)
   sc.stop()

 }
}




------------------ Original message ------------------
From: yuzhihong yuzhih...@gmail.com
Date: May 22, 2015, 3:25
To: donhoff_h 165612...@qq.com
Cc: Bill Q bill.q@gmail.com; user user@spark.apache.org
Subject: Re: Re: How to use spark to access HBase with Security enabled



Can you post the modified code?


Thanks




On May 21, 2015, at 11:11 PM, donhoff_h 165612...@qq.com wrote:


Hi,

Thanks very much for the reply.  I have tried the SecurityUtil. I can see
from the log that this statement executed successfully, but I still cannot pass
the HBase authentication. And with more experiments, I found an
interesting scenario: if I run the program in yarn-client mode, the driver can
pass authentication, but the executors cannot. If I run the program in
yarn-cluster mode, neither the driver nor the executors can pass
authentication.  Can anybody give me a clue based on this info? Many Thanks!




------------------ Original message ------------------
From: yuzhihong yuzhih...@gmail.com
Date: May 22, 2015, 5:29
To: donhoff_h 165612...@qq.com
Cc: Bill Q bill.q@gmail.com

Re: [Streaming] Non-blocking recommendation in custom receiver documentation and KinesisReceiver's worker.run blocking call

2015-05-23 Thread Aniket Bhatnagar
Hi TD

Unfortunately, I am off for a week so I won't be able to test this until
next week. Will keep you posted.

Aniket

On Sat, May 23, 2015, 6:16 AM Tathagata Das t...@databricks.com wrote:

 Hey Aniket, I just checked in the fix in Spark master and branch-1.4.
 Could you download Spark and test it out?



 On Thu, May 21, 2015 at 1:43 AM, Tathagata Das t...@databricks.com
 wrote:

 Thanks for the JIRA. I will look into this issue.

 TD

 On Thu, May 21, 2015 at 1:31 AM, Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

 I ran into one of the issues that are potentially caused because of this
 and have logged a JIRA bug -
 https://issues.apache.org/jira/browse/SPARK-7788

 Thanks,
 Aniket

 On Wed, Sep 24, 2014 at 12:59 PM Aniket Bhatnagar 
 aniket.bhatna...@gmail.com wrote:

 Hi all

  Reading through Spark Streaming's custom receiver documentation, it is
  recommended that the onStart and onStop methods not block indefinitely.
  However, looking at the source code of KinesisReceiver, the onStart method
  calls worker.run, which blocks until the worker is shut down (via a call to
  onStop).

  So, my question is: what are the ramifications of making a blocking call
  in onStart, and is this something that should be addressed
  in the KinesisReceiver implementation?

 Thanks,
 Aniket
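
For reference, a minimal sketch of the non-blocking pattern the custom receiver guide
recommends, with a trivial string source standing in for Kinesis: onStart only spawns a
thread, and the blocking consume loop lives inside that thread.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SketchReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  @volatile private var running = true

  def onStart(): Unit = {
    new Thread("sketch-receiver") {
      override def run(): Unit = {
        while (running && !isStopped()) {
          store("tick")        // stand-in for the real blocking consume loop
          Thread.sleep(1000)
        }
      }
    }.start()
  }

  def onStop(): Unit = { running = false }
}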






split function on spark sql created rdd

2015-05-23 Thread kali.tumm...@gmail.com
Hi All, 

I am trying to do a word count on a number of tweets. My first step is to get
the data from a table using Spark SQL and then run a split function on top of it
to calculate the word count.

Error: value split is not a member of org.apache.spark.sql.SchemaRDD

Spark code that doesn't work for the word count:

val distinct_tweets = hiveCtx.sql("select distinct(text) from tweets_table
where text <> ''")
val distinct_tweets_List = sc.parallelize(List(distinct_tweets))
// tried split on both RDDs; neither worked
distinct_tweets.flatMap(line => line.split(" ")).map(word =>
(word, 1)).reduceByKey(_ + _)
distinct_tweets_List.flatMap(line => line.split(" ")).map(word =>
(word, 1)).reduceByKey(_ + _)

But when I output the data from Spark SQL to a file, load it again and run
split, it works.

Example code that works:

val distinct_tweets = hiveCtx.sql("select distinct(text) from tweets_table
where text <> ''")
val distinct_tweets_op = distinct_tweets.collect()
val rdd = sc.parallelize(distinct_tweets_op)
rdd.saveAsTextFile("/home/cloudera/bdp/op")
val textFile = sc.textFile("/home/cloudera/bdp/op/part-0")
val counts = textFile.flatMap(line => line.split(" ")).map(word =>
(word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("/home/cloudera/bdp/wordcount")

I don't want to write to a file; instead I want to keep the result as an RDD and
apply the split/filter functions on top of the schema RDD. Is there a way?

Thanks 
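
A hedged sketch of one way to do this without collecting (the column position and
context setup are assumptions): pull the text column out of each Row first, then tokenize.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("TweetWordCount"))
val hiveCtx = new HiveContext(sc)

val distinctTweets = hiveCtx.sql("select distinct(text) from tweets_table where text <> ''")

// The SchemaRDD/DataFrame itself has no split method, but its rows carry the text.
val counts = distinctTweets
  .map(row => row.getString(0))          // Row -> String, no collect() needed
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("/home/cloudera/bdp/wordcount")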







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/split-function-on-spark-sql-created-rdd-tp23001.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Is anyone using Amazon EC2?

2015-05-23 Thread Joe Wass
I used Spark on EC2 a while ago


Not able to run SparkPi locally

2015-05-23 Thread Sujit Pal
Hello all,

This is probably me doing something obviously wrong, would really
appreciate some pointers on how to fix this.

I installed spark-1.3.1-bin-hadoop2.6.tgz from the Spark download page [
https://spark.apache.org/downloads.html] and just untarred it on a local
drive. I am on Mac OSX 10.9.5 and the JDK is 1.8.0_40.

I ran the following commands (the first 3 run successfully, I mention it
here to rule out any possibility of it being an obviously bad install).

1) laptop$ bin/spark-shell

scala> sc.parallelize(1 to 100).count()

res0: Long = 100

scala> exit

2) laptop$ bin/pyspark

>>> sc.parallelize(range(100)).count()

100

>>> quit()

3) laptop$ bin/spark-submit examples/src/main/python/pi.py

Pi is roughly 3.142800

4) laptop$ bin/run-example SparkPi

This hangs at this line (full stack trace is provided at the end of this
mail)

15/05/23 07:52:10 INFO Executor: Fetching
http://10.0.0.5:51575/jars/spark-examples-1.3.1-hadoop2.6.0.jar with
timestamp 1432392670140

15/05/23 07:52:10 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)

java.net.SocketTimeoutException: connect timed out

...

and finally dies with this message:

Exception in thread main org.apache.spark.SparkException: Job aborted due
to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 0.0 (TID 0, localhost):
java.net.SocketTimeoutException: connect timed out


I checked with ifconfig -a on my box, 10.0.0.5 is my IP address on my local
network.

en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500

ether 34:36:3b:d2:b0:f4

inet 10.0.0.5 netmask 0xff00 broadcast 10.0.0.255

media: autoselect

 status: active


I think perhaps there may be some configuration I am missing. Being able to
run jobs locally (without HDFS or creating a cluster) is essential for
development, and the examples come from the Spark 1.3.1 Quick Start page [
https://spark.apache.org/docs/latest/quick-start.html], so this is probably
something to do with my environment.


Thanks in advance for any help you can provide.

-sujit

=

Full output of SparkPi run (including stack trace) follows:

Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties

15/05/23 08:08:55 INFO SparkContext: Running Spark version 1.3.1

15/05/23 08:08:57 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable

15/05/23 08:08:57 INFO SecurityManager: Changing view acls to: palsujit

15/05/23 08:08:57 INFO SecurityManager: Changing modify acls to: palsujit

15/05/23 08:08:57 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(palsujit);
users with modify permissions: Set(palsujit)

15/05/23 08:08:57 INFO Slf4jLogger: Slf4jLogger started

15/05/23 08:08:57 INFO Remoting: Starting remoting

15/05/23 08:08:58 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@10.0.0.5:52008]

15/05/23 08:08:58 INFO Utils: Successfully started service 'sparkDriver' on
port 52008.

15/05/23 08:08:58 INFO SparkEnv: Registering MapOutputTracker

15/05/23 08:08:58 INFO SparkEnv: Registering BlockManagerMaster

15/05/23 08:08:58 INFO DiskBlockManager: Created local directory at
/var/folders/z8/s_crq_2j2rqb9mv_4j8djsjnx359l2/T/spark-d97baddf-1b6f-41db-92bb-f82ab5184cb7/blockmgr-4ef3a194-1929-4dd3-a0e5-215175d8e41a

15/05/23 08:08:58 INFO MemoryStore: MemoryStore started with capacity 265.1
MB

15/05/23 08:08:58 INFO HttpFileServer: HTTP File server directory is
/var/folders/z8/s_crq_2j2rqb9mv_4j8djsjnx359l2/T/spark-fdf36480-def0-44b7-9942-098d9ef3e2b4/httpd-e494852a-7d61-4441-8b80-566d9f820afb

15/05/23 08:08:58 INFO HttpServer: Starting HTTP Server

15/05/23 08:08:58 INFO Server: jetty-8.y.z-SNAPSHOT

15/05/23 08:08:58 INFO AbstractConnector: Started
SocketConnector@0.0.0.0:52009

15/05/23 08:08:58 INFO Utils: Successfully started service 'HTTP file
server' on port 52009.

15/05/23 08:08:58 INFO SparkEnv: Registering OutputCommitCoordinator

15/05/23 08:08:58 INFO Server: jetty-8.y.z-SNAPSHOT

15/05/23 08:08:58 INFO AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040

15/05/23 08:08:58 INFO Utils: Successfully started service 'SparkUI' on
port 4040.

15/05/23 08:08:58 INFO SparkUI: Started SparkUI at http://10.0.0.5:4040

15/05/23 08:08:58 INFO SparkContext: Added JAR
file:/Users/palsujit/Software/spark-1.3.1-bin-hadoop2.6/lib/spark-examples-1.3.1-hadoop2.6.0.jar
at http://10.0.0.5:52009/jars/spark-examples-1.3.1-hadoop2.6.0.jar with
timestamp 1432393738514

15/05/23 08:08:58 INFO Executor: Starting executor ID driver on host
localhost

15/05/23 08:08:58 INFO AkkaUtils: Connecting to HeartbeatReceiver:
akka.tcp://sparkDriver@10.0.0.5:52008/user/HeartbeatReceiver

15/05/23 08:08:58 INFO NettyBlockTransferService: Server created on 52010

15/05/23 08:08:58 INFO BlockManagerMaster: Trying to register BlockManager

15/05/23 08:08:58 INFO 

Re: Doubts about SparkSQL

2015-05-23 Thread Ram Sriharsha
Yes it does ... you can try out the following example (the People dataset
that comes with Spark). There is an inner query that filters on age and an
outer query that filters on name.
The physical plan applies a single composite filter on name and age as you
can see below

sqlContext.sql("select * from (select * from people where age = 13) A where
A.name = 'Justin'").explain
== Physical Plan ==
Filter ((age#1 = 13) && (name#0 = Justin))
 PhysicalRDD [name#0,age#1], MapPartitionsRDD[8] at rddToDataFrameHolder at
<console>:26


On Sat, May 23, 2015 at 9:52 AM, Renato Marroquín Mogrovejo 
renatoj.marroq...@gmail.com wrote:

 Hi all,

 I have some doubts about the latest SparkSQL.

 1. In the paper about SparkSQL it is stated that "The physical
 planner also performs rule-based physical optimizations, such as pipelining
 projections or filters into one Spark map operation." ...

 If dealing with a query of the form:

 select *  from (
   select * from tableA where date1  '19-12-2015'
 )A
 where attribute1 = 'valueA' and attribute2 = 'valueB'

 Could I be sure that both filters are applied sequentially in memory,
 i.e. first applying the date filter and then, over that result set, the
 attribute filters? Or will two different map-only operations
 be spawned?

 2. Is the Catalyst query optimizer aware of how the data was partitioned,
 or does it not make any assumptions about this?
 Thanks in advance!


 Renato M.



Re: Not able to run SparkPi locally

2015-05-23 Thread Sujit Pal
Replying to my own email in case someone has the same or similar issue.

On a hunch I ran this against my Linux (Ubuntu 14.04 with JDK 8) box. Not
only did bin/run-example SparkPi run without any problems, it also
provided a very helpful message in the output.

15/05/23 08:35:15 WARN Utils: Your hostname, tsunami resolves to a loopback
address: 127.0.1.1; using 10.0.0.10 instead (on interface wlan0)

15/05/23 08:35:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
another address

So I went back to my Mac, set SPARK_LOCAL_IP=127.0.0.1 and everything runs
fine now. To make this permanent I put this in conf/spark-env.sh.
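
(For reference, that conf/spark-env.sh entry is a single shell line, export SPARK_LOCAL_IP=127.0.0.1, assuming binding to the loopback address is acceptable for local-only development.)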

-sujit




On Sat, May 23, 2015 at 8:14 AM, Sujit Pal sujitatgt...@gmail.com wrote:

 Hello all,

 This is probably me doing something obviously wrong, would really
 appreciate some pointers on how to fix this.

 I installed spark-1.3.1-bin-hadoop2.6.tgz from the Spark download page [
 https://spark.apache.org/downloads.html] and just untarred it on a local
 drive. I am on Mac OSX 10.9.5 and the JDK is 1.8.0_40.

 I ran the following commands (the first 3 run successfully, I mention it
 here to rule out any possibility of it being an obviously bad install).

 1) laptop$ bin/spark-shell

 scala> sc.parallelize(1 to 100).count()

 res0: Long = 100

 scala> exit

 2) laptop$ bin/pyspark

 >>> sc.parallelize(range(100)).count()

 100

 >>> quit()

 3) laptop$ bin/spark-submit examples/src/main/python/pi.py

 Pi is roughly 3.142800

 4) laptop$ bin/run-example SparkPi

 This hangs at this line (full stack trace is provided at the end of this
 mail)

 15/05/23 07:52:10 INFO Executor: Fetching
 http://10.0.0.5:51575/jars/spark-examples-1.3.1-hadoop2.6.0.jar with
 timestamp 1432392670140

 15/05/23 07:52:10 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
 0)

 java.net.SocketTimeoutException: connect timed out

 ...

 and finally dies with this message:

 Exception in thread main org.apache.spark.SparkException: Job aborted
 due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent
 failure: Lost task 0.0 in stage 0.0 (TID 0, localhost):
 java.net.SocketTimeoutException: connect timed out


 I checked with ifconfig -a on my box, 10.0.0.5 is my IP address on my
 local network.

 en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500

 ether 34:36:3b:d2:b0:f4

 inet 10.0.0.5 netmask 0xff00 broadcast 10.0.0.255

 media: autoselect

  status: active


 I think perhaps there may be some configuration I am missing. Being able
 to run jobs locally (without HDFS or creating a cluster) is essential for
 development, and the examples come from the Spark 1.3.1 Quick Start page [
 https://spark.apache.org/docs/latest/quick-start.html], so this is
 probably something to do with my environment.


 Thanks in advance for any help you can provide.

 -sujit

 =

 Full output of SparkPi run (including stack trace) follows:

 Using Spark's default log4j profile:
 org/apache/spark/log4j-defaults.properties

 15/05/23 08:08:55 INFO SparkContext: Running Spark version 1.3.1

 15/05/23 08:08:57 WARN NativeCodeLoader: Unable to load native-hadoop
 library for your platform... using builtin-java classes where applicable

 15/05/23 08:08:57 INFO SecurityManager: Changing view acls to: palsujit

 15/05/23 08:08:57 INFO SecurityManager: Changing modify acls to: palsujit

 15/05/23 08:08:57 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(palsujit);
 users with modify permissions: Set(palsujit)

 15/05/23 08:08:57 INFO Slf4jLogger: Slf4jLogger started

 15/05/23 08:08:57 INFO Remoting: Starting remoting

 15/05/23 08:08:58 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkDriver@10.0.0.5:52008]

 15/05/23 08:08:58 INFO Utils: Successfully started service 'sparkDriver'
 on port 52008.

 15/05/23 08:08:58 INFO SparkEnv: Registering MapOutputTracker

 15/05/23 08:08:58 INFO SparkEnv: Registering BlockManagerMaster

 15/05/23 08:08:58 INFO DiskBlockManager: Created local directory at
 /var/folders/z8/s_crq_2j2rqb9mv_4j8djsjnx359l2/T/spark-d97baddf-1b6f-41db-92bb-f82ab5184cb7/blockmgr-4ef3a194-1929-4dd3-a0e5-215175d8e41a

 15/05/23 08:08:58 INFO MemoryStore: MemoryStore started with capacity
 265.1 MB

 15/05/23 08:08:58 INFO HttpFileServer: HTTP File server directory is
 /var/folders/z8/s_crq_2j2rqb9mv_4j8djsjnx359l2/T/spark-fdf36480-def0-44b7-9942-098d9ef3e2b4/httpd-e494852a-7d61-4441-8b80-566d9f820afb

 15/05/23 08:08:58 INFO HttpServer: Starting HTTP Server

 15/05/23 08:08:58 INFO Server: jetty-8.y.z-SNAPSHOT

 15/05/23 08:08:58 INFO AbstractConnector: Started
 SocketConnector@0.0.0.0:52009

 15/05/23 08:08:58 INFO Utils: Successfully started service 'HTTP file
 server' on port 52009.

 15/05/23 08:08:58 INFO SparkEnv: Registering OutputCommitCoordinator

 15/05/23 08:08:58 INFO Server: jetty-8.y.z-SNAPSHOT

 15/05/23 08:08:58 INFO AbstractConnector: Started
 SelectChannelConnector@0.0.0.0:4040

 

Doubts about SparkSQL

2015-05-23 Thread Renato Marroquín Mogrovejo
Hi all,

I have some doubts about the latest SparkSQL.

1. In the paper about SparkSQL it is stated that "The physical
planner also performs rule-based physical optimizations, such as pipelining
projections or filters into one Spark map operation." ...

If dealing with a query of the form:

select *  from (
  select * from tableA where date1  '19-12-2015'
)A
where attribute1 = 'valueA' and attribute2 = 'valueB'

Could I be sure that both filters are applied sequentially in memory,
i.e. first applying the date filter and then, over that result set, the
attribute filters? Or will two different map-only operations
be spawned?

2. Is the Catalyst query optimizer aware of how the data was partitioned,
or does it not make any assumptions about this?
Thanks in advance!


Renato M.


Re: Help reading Spark UI tea leaves..

2015-05-23 Thread Shay Seng
Thanks!

I was getting a little confused by this partitioner business; I thought
that by default a pairRDD would be partitioned by a HashPartitioner. Was
this possibly the case in 0.9.3 but not in 1.x?

In any case, I tried your suggestion and the shuffle was removed. Cheers.

One small question though, is it important that the same hash partitioner
be used. e.g. could I have written instead:

val d3 = sc.parallelize(1 to 100).map { x => (x % 10) -> x }.partitionBy(new
org.apache.spark.HashPartitioner(10))
(0 until 5).foreach { idx =>
  val otherData = sc.parallelize(1 to (idx * 100)).map { x => (x % 10) ->
x }.partitionBy(new org.apache.spark.HashPartitioner(10))
  println(idx + " --- " + otherData.join(d3).count())
}
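
For what it's worth, HashPartitioner equality is based on the number of partitions, so
two separate instances with the same partition count compare equal and the narrow
dependency should still be recognized; a quick check:

import org.apache.spark.HashPartitioner
// Separate instances, same partition count: they compare equal.
println(new HashPartitioner(10) == new HashPartitioner(10)) // true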






On Fri, May 22, 2015 at 1:10 PM, Imran Rashid iras...@cloudera.com wrote:

 This is a pretty interesting case.  I think the fact that stage 519 says
 it's doing the map at reconstruct.scala:153 is somewhat misleading -- it's
 just the best ID it has for what it's working on.  Really, it's just writing
 the map output data for the shuffle.  You can see that the stage generated
 8.4 GB of shuffle output.  But it's not actually recomputing your RDD; it's
 just spending that time taking the data from memory and writing the
 appropriate shuffle outputs to disk.

 Here's an example which shows that the RDD isn't really getting recomputed
 each time.  I generate some data, with a fake slow function (sleep for 5
 seconds) and then cache it.  Then that data set is joined against a few
 other datasets each time.

 val d1 = sc.parallelize(1 to 100).mapPartitions { itr =>
   Thread.sleep(5000)
   itr.map { x => (x % 10) -> x }
 }.cache()
 d1.count()

 (0 until 5).foreach { idx =>
   val otherData = sc.parallelize(1 to (idx * 100)).map { x => (x % 10) -> x }
   println(idx + " --- " + otherData.join(d1).count())
 }

 If you run this and look at the UI, the set of stages that run is very similar to
 what you see in your example.  It seems that d1 gets run many times, but
 actually it's just generating the shuffle map output many times.  You will
 only have that long 5-second sleep happen once.

 But, you can actually do even better in this case.  You can use the idea
 of narrow dependencies to make spark write the shuffle output for the
 shared dataset only one time.  Consider this example instead:

 val partitioner = new org.apache.spark.HashPartitioner(10)
 val d3 = sc.parallelize(1 to 100).map { x => (x % 10) ->
 x }.partitionBy(partitioner)
 (0 until 5).foreach { idx =>
   val otherData = sc.parallelize(1 to (idx * 100)).map { x => (x % 10) ->
 x }.partitionBy(partitioner)
   println(idx + " --- " + otherData.join(d3).count())
 }

 This time, d3 isn't even cached.  But because d3 and otherData use the
 same partitioner, Spark knows it doesn't need to re-sort d3 each time.  It
 can use the existing shuffle output it already has sitting on disk.  So
 you'll see the stage is skipped in the UI (except for the first job):




 On Fri, May 22, 2015 at 11:59 AM, Shay Seng s...@urbanengines.com wrote:

 Hi.

  I have an RDD that I use repeatedly through many iterations of an
  algorithm. To prevent recomputation, I persist the RDD (and incidentally I
  also persist and checkpoint its parents).


  val consCostConstraintMap = consCost.join(constraintMap).map {
    case (cid, (costs, (mid1, _, mid2, _, _))) =>
      (cid, (costs, mid1, mid2))
  }
  consCostConstraintMap.setName("consCostConstraintMap")
  consCostConstraintMap.persist(MEMORY_AND_DISK_SER)

  ...

  later on in an iterative loop

  val update = updatedTrips.join(consCostConstraintMap).flatMap {
    ...
  }.treeReduce()

 -

  I can see from the UI that consCostConstraintMap is in storage:

  RDD Name: consCostConstraintMap
  Storage Level: Memory Serialized 1x Replicated
  Cached Partitions: 600 (100% cached)
  Size in Memory: 15.2 GB   Size in Tachyon: 0.0 B   Size on Disk: 0.0 B
  -
  In the Jobs list, I see the following pattern, where each treeReduce line
  corresponds to one iteration of the loop:

  Job Id | Description                         | Submitted           | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total
  13     | treeReduce at reconstruct.scala:243 | 2015/05/22 16:27:11 | 2.9 min  | 16/16 (194 skipped)     | 9024/9024 (109225 skipped)
  12     | treeReduce at reconstruct.scala:243 | 2015/05/22 16:24:16 | 2.9 min  | 16/16 (148 skipped)     | 9024/9024 (82725 skipped)
  11     | treeReduce at reconstruct.scala:243 | 2015/05/22 16:21:21 | 2.9 min  | 16/16 (103 skipped)     | 9024/9024 (56280 skipped)
  10     | treeReduce at reconstruct.scala:243
 

Re: Is anyone using Amazon EC2?

2015-05-23 Thread Johan Beisser
Yes.

We're looking at bootstrapping in EMR...
On Sat, May 23, 2015 at 07:21 Joe Wass jw...@crossref.org wrote:

 I used Spark on EC2 a while ago



Re: Is anyone using Amazon EC2?

2015-05-23 Thread Shafaq
Yes - Spark EC2 cluster. Looking into migrating to Spark EMR.
Adding more EC2 instances is not possible AFAIK.
On May 23, 2015 11:22 AM, Johan Beisser j...@caustic.org wrote:

 Yes.

 We're looking at bootstrapping in EMR...
 On Sat, May 23, 2015 at 07:21 Joe Wass jw...@crossref.org wrote:

 I used Spark on EC2 a while ago




Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-23 Thread Mike Trienis
Yup - and since I have only one core per executor, that explains why only one
executor was utilized. I'll need to investigate which EC2 instance
type is going to be the best fit.

Thanks Evo.

On Fri, May 22, 2015 at 3:47 PM, Evo Eftimov evo.efti...@isecc.com wrote:

  A receiver occupies a CPU core; an executor is simply a JVM instance, and
  as such it can be granted any number of cores and any amount of RAM.

  So check how many cores you have per executor.


 Sent from Samsung Mobile


  Original message 
 From: Mike Trienis
 Date:2015/05/22 21:51 (GMT+00:00)
 To: user@spark.apache.org
 Subject: Re: Spark Streaming: all tasks running on one executor (Kinesis +
 Mongodb)

 I guess each receiver occupies an executor. So there was only one executor
 available for processing the job.

 On Fri, May 22, 2015 at 1:24 PM, Mike Trienis mike.trie...@orcsol.com
 wrote:

 Hi All,

  I have a cluster of four nodes (three workers and one master, with one core
  each) which consumes data from Kinesis at 15-second intervals using two
  streams (i.e. receivers). The job simply grabs the latest batch and pushes
  it to MongoDB. I believe that the problem is that all tasks are executed on
  a single worker node and never distributed to the others. This is true even
  after I set the number of concurrentJobs to 3. Overall, I would really like
  to increase throughput (i.e. more than 500 records / second) and understand
  why all executors are not being utilized.

 Here are some parameters I have set:

     - spark.streaming.blockInterval   200
     - spark.locality.wait             500
     - spark.streaming.concurrentJobs  3

  This is the code that's actually doing the writing:

  def write(rdd: RDD[Data], time: Time): Unit = {
    val result = doSomething(rdd, time)
    result.foreachPartition { i =>
      i.foreach(record => connection.insert(record))
    }
  }

  def doSomething(rdd: RDD[Data]): RDD[MyObject] = {
    rdd.flatMap(MyObject)
  }

 Any ideas as to how to improve the throughput?

 Thanks, Mike.
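
  One knob sometimes tried in this kind of situation (a sketch only, not something
  suggested in this thread; the String payload keeps it self-contained) is to
  repartition each batch before writing, so the inserts are not pinned to the
  receiver's node:

  import org.apache.spark.rdd.RDD

  // Hypothetical variant of the write() above; real connection handling is elided.
  def writeSpread(rdd: RDD[String], parallelism: Int): Unit = {
    rdd.repartition(parallelism).foreachPartition { records =>
      // open one connection per partition, insert, then close
      records.foreach(record => println("insert: " + record))
    }
  }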





Re: spark.executor.extraClassPath - Values not picked up by executors

2015-05-23 Thread Todd Nist
Hi Yana,

Yes, typo in the email; the file name is correct, spark-defaults.conf - thanks
though.  So it appears to work if in the driver I specify it as part of
the SparkConf:

val conf = new SparkConf().setAppName(getClass.getSimpleName)
  .set("spark.executor.extraClassPath",
"/projects/spark-cassandra-connector/spark-cassandra-connetor/target/scala-2.10/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar"
)

I thought the spark-defaults would be applied regardless of whether it was
a spark-submit (driver) or a custom driver as in my case, but apparently I
am mistaken.  This will work fine, as I can ensure that all hosts
participating in the cluster have access to a common directory with the
dependencies and then just set spark.executor.extraClassPath to
/some/shared/directory/lib/*.jar.

If there is a better way to address this, let me know.

As for the spark-cassandra-connector 1.3.0-SNAPSHOT, I am building that
from master.  Haven't hit any issue with it yet.

-Todd

On Fri, May 22, 2015 at 9:39 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:

 Todd, I don't have any answers for you...other than the file is actually
 named spark-defaults.conf (not sure if you made a typo in the email or
 misnamed the file...). Do any other options from that file get read?

 I also wanted to ask if you built the spark-cassandra-connector-assembly-
 1.3.0-SNAPSHOT.jar from trunk or if they published a 1.3 drop somewhere
 -- I'm just starting out with Cassandra and discovered
 https://datastax-oss.atlassian.net/browse/SPARKC-98 is still open...

 On Fri, May 22, 2015 at 6:15 PM, Todd Nist tsind...@gmail.com wrote:

 I'm using the spark-cassandra-connector from DataStax in a Spark
 streaming job launched from my own driver.  It is connecting to a standalone
 cluster on my local box, which has two workers running.

 This is Spark 1.3.1 and spark-cassandra-connector-1.3.0-SNAPSHOT.  I have
 added the following entry to my $SPARK_HOME/conf/spark-default.conf:

 spark.executor.extraClassPath 
 /projects/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar


 When I start the master with, $SPARK_HOME/sbin/start-master.sh, it comes
 up just fine.  As do the two workers with the following command:

 Worker 1, port 8081:

 radtech:spark $ ./bin/spark-class org.apache.spark.deploy.worker.Worker 
 spark://radtech.io:7077 --webui-port 8081 --cores 2

 Worker 2, port 8082

 radtech:spark $ ./bin/spark-class org.apache.spark.deploy.worker.Worker 
 spark://radtech.io:7077 --webui-port 8082 --cores 2

 When I execute the driver, connecting to the master:

 sbt app/run -Dspark.master=spark://radtech.io:7077

 It starts up, but when the executors are launched they do not include the
 entry in the spark.executor.extraClassPath:

 15/05/22 17:35:26 INFO Worker: Asked to launch executor 
 app-20150522173526-/0 for KillrWeatherApp$15/05/22 17:35:26 INFO 
 ExecutorRunner: Launch command: java -cp 
 /usr/local/spark/conf:/usr/local/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar:/usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark/lib/datanucleus-core-3.2.10.jar:/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/local/spark/conf:/usr/local/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar:/usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark/lib/datanucleus-core-3.2.10.jar:/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar
  -Dspark.driver.port=55932 -Xms512M -Xmx512M 
 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
 akka.tcp://sparkDriver@192.168.1.3:55932/user/CoarseGrainedScheduler 
 --executor-id 0 --hostname 192.168.1.3 --cores 2 --app-id 
 app-20150522173526- --worker-url 
 akka.tcp://sparkWorker@192.168.1.3:55923/user/Worker



 which will then cause the executor to fail with a ClassNotFoundException,
 which I would expect:

 [WARN] [2015-05-22 17:38:18,035] 
 [org.apache.spark.scheduler.TaskSetManager]: Lost task 0.0 in stage 2.0 (TID 
 23, 192.168.1.3): java.lang.ClassNotFoundException: 
 com.datastax.spark.connector.rdd.partitioner.CassandraPartition
 at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:344)
 at 
 org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:65)
 at 
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
 at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
 at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
 at 

Re: SparkSQL failing while writing into S3 for 'insert into table'

2015-05-23 Thread Cheolsoo Park
 It seems it generates the query results into a tmp dir first, and tries to rename
the output into the right folder at the end. But it failed while renaming it.

This problem exists not only in SparkSQL but also in other Hadoop tools (e.g.
Hive, Pig, etc.) when used with S3. Usually, it is better to write task
outputs to local disk and copy them to the final S3 location in the task
commit phase. In fact, this is how EMR Hive does insert overwrite, and
that's why EMR Hive works well with S3 while Apache Hive doesn't.

If you look at SparkHiveWriterContainer, you will see how it mimics a Hadoop
task. Basically, you can modify that code to make it write to local disk
first and then commit to the final S3 location. Actually, I am doing the
same at work in the 1.4 branch.


On Fri, May 22, 2015 at 5:50 PM, ogoh oke...@gmail.com wrote:


 Hello,
 I am using Spark 1.3 and Hive 0.13.1 in AWS.
 From Spark-SQL, when running a Hive query to export the query result into
 AWS S3, it failed with the following message:
 ==
 org.apache.hadoop.hive.ql.metadata.HiveException: checkPaths:

 s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-1
 has nested directory
 s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-1/_temporary

 at org.apache.hadoop.hive.ql.metadata.Hive.checkPaths(Hive.java:2157)

 at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2298)

 at org.apache.hadoop.hive.ql.metadata.Table.copyFiles(Table.java:686)

 at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1469)

 at

 org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:230)

 at

 org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:124)

 at

 org.apache.spark.sql.hive.execution.InsertIntoHiveTable.execute(InsertIntoHiveTable.scala:249)

 at

 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)

 at
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)
 ==

 The queries tested are:

 spark-sql> create external table s3_dwserver_sql_t1 (q string) location
 's3://test-dev/s3_dwserver_sql_t1';

 spark-sql> insert into table s3_dwserver_sql_t1 select q from api_search
 where pdate='2015-05-12' limit 100;
 ==

 It seems it generates the query results into a tmp dir first, and tries to
 rename the output into the right folder at the end. But it failed while renaming it.

 I appreciate any advice.
 Thanks,
 Okehee





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-failing-while-writing-into-S3-for-insert-into-table-tp23000.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org