Re: Migrate Relational to Distributed
Hi Brant, Let me partially address your concerns: please follow a new open source project, PL/HQL (www.plhql.org), aimed at letting you reuse existing logic and leverage existing skills to some extent, so you do not need to rewrite everything in Scala/Java and can migrate gradually. I hope it helps. Thanks, Dmitry

On Sat, May 23, 2015 at 1:22 AM, Brant Seibert brantseib...@hotmail.com wrote:

Hi, The healthcare industry can do wonderful things with Apache Spark. But there is already a very large base of data and applications firmly rooted in the relational paradigm, and they are resistant to change - stuck on Oracle.

** QUESTION 1 - Should we migrate legacy relational data (plus new transactions) to distributed storage?
DISCUSSION 1 - The primary advantage I see is not having to engage in the lengthy (1+ years) process of creating a relational data warehouse and cubes. Just store the data in a distributed system and analyze it first in memory with Spark.

** QUESTION 2 - Will we have to rewrite the enormous amount of logic that is already built for the old relational system?
DISCUSSION 2 - If we move the data to distributed storage, can we simply run that existing relational logic as SparkSQL queries? [existing SQL -- Spark Context -- Cassandra -- process in SparkSQL -- display in existing UI]. Can we create an RDD that uses existing SQL, or do we need to rewrite all our SQL?

** DATA SIZE - We are adding many new data sources to a system that already manages health care data for over a million people. The number of rows may not be enormous right now compared to, say, the advertising industry, but the number of dimensions runs well into the thousands. If we add IoT data for each health care patient, that creates billions of events per day, and the number of rows then grows exponentially. We would like to be prepared to handle that huge-data scenario.
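To make the "run existing SQL as SparkSQL" idea concrete, here is a minimal sketch (not from the thread; the table and column names are invented, an existing SparkContext sc is assumed, and it presumes the legacy table has already been exposed to Spark, e.g. through a Hive metastore or the Cassandra connector):

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// Register the legacy table under its old name so the existing SQL text can run unmodified.
// "legacy_claims" and "claims" are placeholder names for illustration only.
val claims = sqlContext.table("legacy_claims")
claims.registerTempTable("claims")

// Reuse the original relational query verbatim where its dialect overlaps with HiveQL/Spark SQL.
val visitCounts = sqlContext.sql(
  "SELECT patient_id, COUNT(*) AS visit_count FROM claims GROUP BY patient_id")
visitCounts.show()

Queries that depend on vendor-specific Oracle features (PL/SQL packages, hints, proprietary functions) would still need rewriting or a bridge such as the PL/HQL project mentioned above.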
Re: Is anyone using Amazon EC2?
Yes, we're running Spark on EC2. Will transition to EMR soon. -Vadim

On Sat, May 23, 2015 at 2:22 PM, Johan Beisser j...@caustic.org wrote: Yes. We're looking at bootstrapping in EMR...

On Sat, May 23, 2015 at 07:21 Joe Wass jw...@crossref.org wrote: I used Spark on EC2 a while ago
Re: Is anyone using Amazon EC2?
Sorry guys, my email got sent before I finished writing it. Check my other message (with the same subject)!

On 23 May 2015 at 20:25, Shafaq s.abdullah...@gmail.com wrote: Yes - Spark EC2 cluster. Looking into migrating to Spark EMR. Adding more EC2 instances is not possible AFAIK.

On May 23, 2015 11:22 AM, Johan Beisser j...@caustic.org wrote: Yes. We're looking at bootstrapping in EMR...

On Sat, May 23, 2015 at 07:21 Joe Wass jw...@crossref.org wrote: I used Spark on EC2 a while ago
Re: spark.executor.extraClassPath - Values not picked up by executors
My experience is: don't put application-specific settings into spark-defaults.conf, which is applied to all applications. Instead, you can either set them programmatically, as you did below, or through spark-submit. Also, if you would still like to do it via spark-defaults.conf, you will have to change that file on all of your worker nodes when you go distributed one day. That is not scalable, and not right either, as you would have to put your app-specific classpath into every Spark worker node's spark-defaults.conf.

------------------ Original message ------------------
From: Todd Nist tsind...@gmail.com
Date: 2015-05-24 02:14
To: yana.kadiyska yana.kadiy...@gmail.com
Cc: user@spark.apache.org
Subject: Re: spark.executor.extraClassPath - Values not picked up by executors

Hi Yana, Yes, typo in the email; the file name is correct, spark-defaults.conf. Thanks though. So it appears to work if in the driver I specify it as part of the SparkConf:

val conf = new SparkConf().setAppName(getClass.getSimpleName)
  .set("spark.executor.extraClassPath", "/projects/spark-cassandra-connector/spark-cassandra-connetor/target/scala-2.10/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar")

I thought spark-defaults would be applied regardless of whether it was a spark-submit (driver) or a custom driver as in my case, but apparently I am mistaken. This will work fine, as I can ensure that all hosts participating in the cluster have access to a common directory with the dependencies and then just set spark.executor.extraClassPath to /some/shared/directory/lib/*.jar. If there is a better way to address this, let me know. As for the spark-cassandra-connector 1.3.0-SNAPSHOT, I am building that from master. Haven't hit any issue with it yet. -Todd

On Fri, May 22, 2015 at 9:39 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Todd, I don't have any answers for you... other than the file is actually named spark-defaults.conf (not sure if you made a typo in the email or misnamed the file...). Do any other options from that file get read? I also wanted to ask if you built the spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar from trunk or if they published a 1.3 drop somewhere -- I'm just starting out with Cassandra and discovered https://datastax-oss.atlassian.net/browse/SPARKC-98 is still open...

On Fri, May 22, 2015 at 6:15 PM, Todd Nist tsind...@gmail.com wrote: I'm using the spark-cassandra-connector from DataStax in a Spark Streaming job launched from my own driver. It is connecting to a standalone cluster on my local box which has two workers running. This is Spark 1.3.1 and spark-cassandra-connector-1.3.0-SNAPSHOT. I have added the following entry to my $SPARK_HOME/conf/spark-default.conf:

spark.executor.extraClassPath /projects/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar

When I start the master with $SPARK_HOME/sbin/start-master.sh, it comes up just fine.
As do the two workers with the following commands:

Worker 1, port 8081:
radtech:spark $ ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://radtech.io:7077 --webui-port 8081 --cores 2

Worker 2, port 8082:
radtech:spark $ ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://radtech.io:7077 --webui-port 8082 --cores 2

When I execute the driver connecting to the master:

sbt app/run -Dspark.master=spark://radtech.io:7077

It starts up, but when the executors are launched they do not include the entry in spark.executor.extraClassPath:

15/05/22 17:35:26 INFO Worker: Asked to launch executor app-20150522173526-/0 for KillrWeatherApp$
15/05/22 17:35:26 INFO ExecutorRunner: Launch command: java -cp /usr/local/spark/conf:/usr/local/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar:/usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark/lib/datanucleus-core-3.2.10.jar:/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/local/spark/conf:/usr/local/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar:/usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark/lib/datanucleus-core-3.2.10.jar:/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar -Dspark.driver.port=55932 -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@192.168.1.3:55932/user/CoarseGrainedScheduler --executor-id 0 --hostname 192.168.1.3 --cores 2 --app-id app-20150522173526- --worker-url akka.tcp://sparkWorker@192.168.1.3:55923/user/Worker

which will then cause the executor to fail with a ClassNotFoundException, which I would expect:

[WARN] [2015-05-22 17:38:18,035] [org.apache.spark.scheduler.TaskSetManager]: Lost task 0.0 in stage 2.0 (TID 23, 192.168.1.3): java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition at
Strange ClassNotFound exception
Hi guys! I have a small Spark application. It queries some data from Postgres, enriches it, and writes it to Elasticsearch. When I deployed it into the Spark container I got a very frustrating error: https://gist.github.com/b0c1/66527e00bada1e4c0dc3

Spark version: 1.3.1
Hadoop version: 2.6.0
Additional info: serialization: kryo; rdd: custom RDD to query

What I don't understand:
1. akka.actor.SelectionPath doesn't exist in 1.3.1
2. I checked all dependencies in my project; I only have org.spark-project.akka:akka-*_2.10:2.3.4-spark:jar, which doesn't have it
3. I haven't found any reference for this...
4. I created my own RDD and it works, but do I need to register it with Kryo? (mapRow using ResultSet, I need to create
5. I used it some months ago and it worked with Spark 1.2... I recompiled with 1.3.1 and got this strange error

Any idea?

--
Skype: boci13, Hangout: boci.b...@gmail.com
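Not an answer to the akka.actor.SelectionPath question, but since point 4 asks about Kryo: a minimal sketch of registering an application class with Kryo (MyRow is a placeholder for whatever the custom RDD actually produces, not something from the thread):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder row type standing in for whatever the custom RDD ships between stages.
case class MyRow(id: Long, payload: String)

val conf = new SparkConf()
  .setAppName("kryo-registration-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRow]))   // register application classes explicitly

val sc = new SparkContext(conf)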
Re: DataFrame groupBy vs RDD groupBy
Hi Michael, This is great info. I am currently using the repartitionandsort function to achieve the same. Is this the recommended way up to 1.3, or is there a better way?

On 23 May 2015 07:38, Michael Armbrust mich...@databricks.com wrote:

DataFrames have a lot more information about the data, so there is a whole class of optimizations that are possible there that we cannot do in RDDs. This is why we are focusing a lot of effort on this part of the project. In Spark 1.4 you can accomplish what you want using the new window function feature. This can be done with SQL as you described, or directly on a DataFrame:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

val df = Seq(("a", 1), ("b", 1), ("c", 2), ("d", 2)).toDF("x", "y")
df.select('x, 'y, rowNumber.over(Window.partitionBy("y").orderBy("x")).as("number")).show()

+---+---+------+
|  x|  y|number|
+---+---+------+
|  a|  1|     1|
|  b|  1|     2|
|  c|  2|     1|
|  d|  2|     2|
+---+---+------+

On Fri, May 22, 2015 at 3:35 AM, gtanguy g.tanguy.claravi...@gmail.com wrote:

Hello everybody, I have two questions in one. I upgraded from Spark 1.1 to 1.3 and some parts of my code using groupBy became really slow.

1/ Why is the groupBy of the RDD really slow in comparison to the groupBy of the DataFrame?

// DataFrame: running in a few seconds
val result = table.groupBy("col1").count

// RDD: taking hours with a lot of "spilling in-memory"
val schemaOriginel = table.schema
val result = table.rdd.groupBy { r =>
  val rs = RowSchema(r, schemaOriginel)
  val col1 = rs.getValueByName("col1")
  col1
}.map(l => (l._1, l._2.size)).count()

2/ My goal is to groupBy on a key, then to order each group over a column, and finally to add the row number in each group. I had this code running before changing to Spark 1.3 and it worked fine, but since I have changed to DataFrame it is really slow.

val schemaOriginel = table.schema
val result = table.rdd.groupBy { r =>
  val rs = RowSchema(r, schemaOriginel)
  val col1 = rs.getValueByName("col1")
  col1
}.flatMap { l =>
  l._2.toList
    .sortBy { u =>
      val rs = RowSchema(u, schemaOriginel)
      val col1 = rs.getValueByName("col1")
      val col2 = rs.getValueByName("col2")
      (col1, col2)
    }
    .zipWithIndex
}

I think the SQL equivalent of what I try to do is: SELECT a, ROW_NUMBER() OVER (PARTITION BY a) AS num FROM table. I don't think I can do this with a GroupedData (result of df.groupBy). Any ideas on how I can speed this up?
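As a small addition to the answer above: once the DataFrame is registered as a table, the same row numbering can presumably also be expressed in SQL (a sketch only, assuming a HiveContext on Spark 1.4, since window functions in SQL go through the Hive parser there; df and the column names follow the example above):

df.registerTempTable("my_table")

val numbered = sqlContext.sql(
  """SELECT x, y,
    |       ROW_NUMBER() OVER (PARTITION BY y ORDER BY x) AS number
    |FROM my_table""".stripMargin)
numbered.show()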
Re: Strange ClassNotFound exception
In my local maven repo, I found:

$ jar tvf /Users/tyu/.m2/repository//org/spark-project/akka/akka-actor_2.10/2.3.4-spark/akka-actor_2.10-2.3.4-spark.jar | grep SelectionPath
  521 Mon Sep 29 12:05:36 PDT 2014 akka/actor/SelectionPathElement.class

Is the above jar in your classpath?

On Sat, May 23, 2015 at 5:05 PM, boci boci.b...@gmail.com wrote: Hi guys! I have a small Spark application. It queries some data from Postgres, enriches it, and writes it to Elasticsearch. When I deployed it into the Spark container I got a very frustrating error: https://gist.github.com/b0c1/66527e00bada1e4c0dc3

Spark version: 1.3.1
Hadoop version: 2.6.0
Additional info: serialization: kryo; rdd: custom RDD to query

What I don't understand:
1. akka.actor.SelectionPath doesn't exist in 1.3.1
2. I checked all dependencies in my project; I only have org.spark-project.akka:akka-*_2.10:2.3.4-spark:jar, which doesn't have it
3. I haven't found any reference for this...
4. I created my own RDD and it works, but do I need to register it with Kryo? (mapRow using ResultSet, I need to create
5. I used it some months ago and it worked with Spark 1.2... I recompiled with 1.3.1 and got this strange error

Any idea?

--
Skype: boci13, Hangout: boci.b...@gmail.com
SparkSQL can't read S3 path for hive external table
Hello, I am using Spark 1.3 in AWS. SparkSQL can't recognize a Hive external table on S3. The following is the error message. I appreciate any help. Thanks, Okehee

15/05/24 01:02:18 ERROR thriftserver.SparkSQLDriver: Failed in [select count(*) from api_search where pdate='2015-05-08']
java.lang.IllegalArgumentException: Wrong FS: s3://test-emr/datawarehouse/api_s3_perf/api_search/pdate=2015-05-08/phour=00, expected: hdfs://10.128.193.211:9000
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:647)
    at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:467)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:252)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
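Not a confirmed fix, but one workaround that has been reported for this kind of "Wrong FS" error on Parquet-backed Hive tables is to disable Spark's built-in conversion of metastore Parquet tables, so the query goes through the Hive SerDe path (which resolves s3:// locations itself) instead of ParquetRelation2. A sketch, to be verified against your setup:

// In code, before running the query:
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

// or from the spark-sql / Thrift server session:
//   SET spark.sql.hive.convertMetastoreParquet=false;

This trades away Spark's native Parquet reader, so measure the performance impact before relying on it.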
Re: Re: Re: How to use spark to access HBase with Security enabled
Hi, The exception is the same as before, just like the following:

2015-05-23 18:01:40,943 ERROR [hconnection-0x14027b82-shared--pool1-t1] ipc.AbstractRpcClient: SASL authentication failed. The most likely cause is missing or invalid credentials. Consider 'kinit'.
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
    at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:604)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$600(RpcClientImpl.java:153)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:730)
    at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:727)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)

But after many tests, I found that the root cause is:

1. When I try to get an HBase connection in Spark, or call the sc.newAPIHadoopRDD API in Spark in yarn-cluster mode, it uses the UserGroupInformation.getCurrentUser API to get the User object that is used to authenticate.

2. The logic of the UserGroupInformation.getCurrentUser API is as follows:

AccessControlContext context = AccessController.getContext();
Subject subject = Subject.getSubject(context);
return subject != null && !subject.getPrincipals(User.class).isEmpty() ? new UserGroupInformation(subject) : getLoginUser();

3. I printed the subject object to stdout. I found the user info is my Linux OS user "spark", not the principal sp...@bgdt.dev.hrb. This is the reason why I can not pass the authentication. The context of the executor threads spawned by the NodeManager does not contain any principal info, which I am sure I have already set up using the kinit command. But I still don't know why, or how to solve it.

Does anybody know how to set configurations so that the context of threads spawned by the NodeManager contains the principal I set up with the kinit command? Many Thanks!

------------------ Original message ------------------
From: yuzhihong yuzhih...@gmail.com
Date: 2015-05-22 (Fri) 7:25
To: donhoff_h 165612...@qq.com
Cc: Bill Q bill.q@gmail.com; user user@spark.apache.org
Subject: Re: Re: Re: How to use spark to access HBase with Security enabled

Can you share the exception(s) you encountered? Thanks

On May 22, 2015, at 12:33 AM, donhoff_h 165612...@qq.com wrote: Hi, My modified code is listed below; I just added the SecurityUtil API. I don't know which propertyKeys I should use, so I made two of my own propertyKeys to find the keytab and principal.
object TestHBaseRead2 {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val hbConf = HBaseConfiguration.create()
    hbConf.set("dhao.keytab.file", "//etc//spark//keytab//spark.user.keytab")
    hbConf.set("dhao.user.principal", "sp...@bgdt.dev.hrb")
    SecurityUtil.login(hbConf, "dhao.keytab.file", "dhao.user.principal")
    val conn = ConnectionFactory.createConnection(hbConf)
    val tbl = conn.getTable(TableName.valueOf("spark_t01"))
    try {
      val get = new Get(Bytes.toBytes("row01"))
      val res = tbl.get(get)
      println("result:" + res.toString)
    } finally {
      tbl.close()
      conn.close()
      es.shutdown()
    }
    val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))
    val v = rdd.sum()
    println("Value=" + v)
    sc.stop()
  }
}

------------------ Original message ------------------
From: yuzhihong yuzhih...@gmail.com
Date: 2015-05-22 (Fri) 3:25
To: donhoff_h 165612...@qq.com
Cc: Bill Q bill.q@gmail.com; user user@spark.apache.org
Subject: Re: Re: How to use spark to access HBase with Security enabled

Can you post the modified code? Thanks

On May 21, 2015, at 11:11 PM, donhoff_h 165612...@qq.com wrote: Hi, Thanks very much for the reply. I have tried the SecurityUtil. I can see from the log that this statement executed successfully, but I still can not pass the authentication of HBase. And with more experiments, I found a new interesting scenario: if I run the program in yarn-client mode, the driver can pass the authentication, but the executors can not. If I run the program in yarn-cluster mode, both the driver and the executors can not pass the authentication. Can anybody give me some clue with this info? Many Thanks!

------------------ Original message ------------------
From: yuzhihong yuzhih...@gmail.com
Date: 2015-05-22 (Fri) 5:29
To: donhoff_h 165612...@qq.com
Cc: Bill Q bill.q@gmail.com;
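One direction that is sometimes suggested for this situation (a sketch only, not a verified fix for this thread) is to perform the keytab login inside the code that actually runs on the executors, so each executor JVM authenticates itself instead of relying on a ticket cache that only exists where kinit was run. The principal, keytab path, row keys, and table name below are placeholders; the keytab must be readable on every node (e.g. shipped with --files):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.security.UserGroupInformation

val keys = sc.parallelize(Seq("row01", "row02"))   // placeholder row keys

keys.foreachPartition { partition =>
  // This closure runs in the executor JVM, not the driver,
  // so the login happens where the HBase RPCs are made.
  UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.REALM", "/path/to/user.keytab")
  val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val tbl = conn.getTable(TableName.valueOf("spark_t01"))
  try {
    partition.foreach { key =>
      val res = tbl.get(new Get(Bytes.toBytes(key)))
      println("result:" + res)
    }
  } finally {
    tbl.close()
    conn.close()
  }
}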
Re: [Streaming] Non-blocking recommendation in custom receiver documentation and KinesisReceiver's worker.run blocking call
Hi TD, Unfortunately, I am off for a week, so I won't be able to test this until next week. Will keep you posted. Aniket

On Sat, May 23, 2015, 6:16 AM Tathagata Das t...@databricks.com wrote: Hey Aniket, I just checked in the fix in Spark master and branch-1.4. Could you download Spark and test it out?

On Thu, May 21, 2015 at 1:43 AM, Tathagata Das t...@databricks.com wrote: Thanks for the JIRA. I will look into this issue. TD

On Thu, May 21, 2015 at 1:31 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I ran into one of the issues that are potentially caused by this and have logged a JIRA bug - https://issues.apache.org/jira/browse/SPARK-7788 Thanks, Aniket

On Wed, Sep 24, 2014 at 12:59 PM Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi all, Reading through Spark Streaming's custom receiver documentation, it is recommended that the onStart and onStop methods should not block indefinitely. However, looking at the source code of KinesisReceiver, the onStart method calls worker.run, which blocks until the worker is shut down (via a call to onStop). So, my question is: what are the ramifications of making a blocking call in onStart, and is this something that should be addressed in the KinesisReceiver implementation? Thanks, Aniket
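For reference, the non-blocking pattern the custom-receiver documentation recommends looks roughly like this (a sketch only, not the actual KinesisReceiver fix that went into master/branch-1.4): onStart() starts its own thread and returns immediately, and onStop() lets that thread wind down.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class NonBlockingReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  @volatile private var thread: Thread = _

  override def onStart(): Unit = {
    thread = new Thread("receiver-loop") {
      override def run(): Unit = {
        while (!isStopped()) {
          // poll the source and call store(...) here
        }
      }
    }
    thread.start()   // return immediately; never block inside onStart
  }

  override def onStop(): Unit = {
    // isStopped() turns true once stop is requested, so the loop exits on its own
    if (thread != null) thread.join()
  }
}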
split function on spark sql created rdd
Hi All, I am trying to do a word count over a number of tweets. My first step is to get the data from a table using Spark SQL, and then run a split function on top of it to calculate the word count.

Error:
value split is not a member of org.apache.spark.sql.SchemaRDD

Spark code that doesn't work for the word count:

val distinct_tweets = hiveCtx.sql("select distinct(text) from tweets_table where text <> ''")
val distinct_tweets_List = sc.parallelize(List(distinct_tweets))

// tried split on both RDDs, didn't work
distinct_tweets.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
distinct_tweets_List.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

But when I output the data from Spark SQL to a file, load it again, and run split, it works.

Example code that works:

val distinct_tweets = hiveCtx.sql("select distinct(text) from tweets_table where text <> ''")
val distinct_tweets_op = distinct_tweets.collect()
val rdd = sc.parallelize(distinct_tweets_op)
rdd.saveAsTextFile("/home/cloudera/bdp/op")
val textFile = sc.textFile("/home/cloudera/bdp/op/part-0")
val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
counts.saveAsTextFile("/home/cloudera/bdp/wordcount")

I don't want to write to a file; instead I want to collect into an RDD and apply the filter function on top of the schema RDD. Is there a way? Thanks
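A sketch of one way to avoid the detour through the filesystem (not from the thread): the result of hiveCtx.sql holds Row objects, so pull the string column out of each Row before splitting instead of collecting to the driver and re-parallelizing. The column position and the <> comparison are assumed from the query above.

// On Spark 1.2 the result is a SchemaRDD (an RDD[Row]); on 1.3 it is a DataFrame.
// Mapping over Rows works on both.
import org.apache.spark.SparkContext._   // pair-RDD implicits, needed on pre-1.3 Spark

val tweets = hiveCtx.sql("select distinct(text) from tweets_table where text <> ''")

val counts = tweets
  .map(row => row.getString(0))            // extract the single 'text' column from each Row
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.saveAsTextFile("/home/cloudera/bdp/wordcount")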
Is anyone using Amazon EC2?
I used Spark on EC2 a while ago
Not able to run SparkPi locally
Hello all, This is probably me doing something obviously wrong; I would really appreciate some pointers on how to fix it. I installed spark-1.3.1-bin-hadoop2.6.tgz from the Spark download page [https://spark.apache.org/downloads.html] and just untarred it on a local drive. I am on Mac OS X 10.9.5 and the JDK is 1.8.0_40. I ran the following commands (the first 3 run successfully; I mention them here to rule out any possibility of it being an obviously bad install).

1) laptop$ bin/spark-shell
scala> sc.parallelize(1 to 100).count()
res0: Long = 100
scala> exit

2) laptop$ bin/pyspark
sc.parallelize(range(100)).count()
100
quit()

3) laptop$ bin/spark-submit examples/src/main/python/pi.py
Pi is roughly 3.142800

4) laptop$ bin/run-example SparkPi

This hangs at this line (the full stack trace is provided at the end of this mail):

15/05/23 07:52:10 INFO Executor: Fetching http://10.0.0.5:51575/jars/spark-examples-1.3.1-hadoop2.6.0.jar with timestamp 1432392670140
15/05/23 07:52:10 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.net.SocketTimeoutException: connect timed out

... and finally dies with this message:

Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketTimeoutException: connect timed out

I checked with ifconfig -a on my box; 10.0.0.5 is my IP address on my local network.

en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    ether 34:36:3b:d2:b0:f4
    inet 10.0.0.5 netmask 0xff00 broadcast 10.0.0.255
    media: autoselect
    status: active

I think perhaps there may be some configuration I am missing. Being able to run jobs locally (without HDFS or creating a cluster) is essential for development, and the examples come from the Spark 1.3.1 Quick Start page [https://spark.apache.org/docs/latest/quick-start.html], so this is probably something to do with my environment. Thanks in advance for any help you can provide. -sujit

===== Full output of SparkPi run (including stack trace) follows:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/05/23 08:08:55 INFO SparkContext: Running Spark version 1.3.1
15/05/23 08:08:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/05/23 08:08:57 INFO SecurityManager: Changing view acls to: palsujit
15/05/23 08:08:57 INFO SecurityManager: Changing modify acls to: palsujit
15/05/23 08:08:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(palsujit); users with modify permissions: Set(palsujit)
15/05/23 08:08:57 INFO Slf4jLogger: Slf4jLogger started
15/05/23 08:08:57 INFO Remoting: Starting remoting
15/05/23 08:08:58 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.0.5:52008]
15/05/23 08:08:58 INFO Utils: Successfully started service 'sparkDriver' on port 52008.
15/05/23 08:08:58 INFO SparkEnv: Registering MapOutputTracker
15/05/23 08:08:58 INFO SparkEnv: Registering BlockManagerMaster
15/05/23 08:08:58 INFO DiskBlockManager: Created local directory at /var/folders/z8/s_crq_2j2rqb9mv_4j8djsjnx359l2/T/spark-d97baddf-1b6f-41db-92bb-f82ab5184cb7/blockmgr-4ef3a194-1929-4dd3-a0e5-215175d8e41a
15/05/23 08:08:58 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
15/05/23 08:08:58 INFO HttpFileServer: HTTP File server directory is /var/folders/z8/s_crq_2j2rqb9mv_4j8djsjnx359l2/T/spark-fdf36480-def0-44b7-9942-098d9ef3e2b4/httpd-e494852a-7d61-4441-8b80-566d9f820afb
15/05/23 08:08:58 INFO HttpServer: Starting HTTP Server
15/05/23 08:08:58 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/23 08:08:58 INFO AbstractConnector: Started SocketConnector@0.0.0.0:52009
15/05/23 08:08:58 INFO Utils: Successfully started service 'HTTP file server' on port 52009.
15/05/23 08:08:58 INFO SparkEnv: Registering OutputCommitCoordinator
15/05/23 08:08:58 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/23 08:08:58 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/05/23 08:08:58 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/05/23 08:08:58 INFO SparkUI: Started SparkUI at http://10.0.0.5:4040
15/05/23 08:08:58 INFO SparkContext: Added JAR file:/Users/palsujit/Software/spark-1.3.1-bin-hadoop2.6/lib/spark-examples-1.3.1-hadoop2.6.0.jar at http://10.0.0.5:52009/jars/spark-examples-1.3.1-hadoop2.6.0.jar with timestamp 1432393738514
15/05/23 08:08:58 INFO Executor: Starting executor ID driver on host localhost
15/05/23 08:08:58 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@10.0.0.5:52008/user/HeartbeatReceiver
15/05/23 08:08:58 INFO NettyBlockTransferService: Server created on 52010
15/05/23 08:08:58 INFO BlockManagerMaster: Trying to register BlockManager
15/05/23 08:08:58 INFO
Re: Doubts about SparkSQL
Yes it does... you can try out the following example (the People dataset that comes with Spark). There is an inner query that filters on age and an outer query that filters on name. The physical plan applies a single composite filter on name and age, as you can see below:

sqlContext.sql("select * from (select * from people where age = 13)A where A.name = 'Justin'").explain

== Physical Plan ==
Filter ((age#1 = 13) && (name#0 = Justin))
 PhysicalRDD [name#0,age#1], MapPartitionsRDD[8] at rddToDataFrameHolder at <console>:26

On Sat, May 23, 2015 at 9:52 AM, Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com wrote: Hi all, I have some doubts about the latest SparkSQL.

1. In the paper about SparkSQL it has been stated that "The physical planner also performs rule-based physical optimizations, such as pipelining projections or filters into one Spark map operation." ... If dealing with a query of the form:

select * from (
  select * from tableA where date1 '19-12-2015'
)A
where attribute1 = 'valueA' and attribute2 = 'valueB'

Could I be sure that both filters are applied sequentially in memory, i.e. first applying the date filter and then, over that result set, the attribute filter? Or will two different Map-only operations be spawned?

2. Is the Catalyst query optimizer aware of how the data was partitioned, or does it not make any assumptions about this?

Thanks in advance! Renato M.
Re: Not able to run SparkPi locally
Replying to my own email in case someone has the same or a similar issue. On a hunch I ran this against my Linux (Ubuntu 14.04 with JDK 8) box. Not only did bin/run-example SparkPi run without any problems, it also provided a very helpful message in the output:

15/05/23 08:35:15 WARN Utils: Your hostname, tsunami resolves to a loopback address: 127.0.1.1; using 10.0.0.10 instead (on interface wlan0)
15/05/23 08:35:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address

So I went back to my Mac, set SPARK_LOCAL_IP=127.0.0.1, and everything runs fine now. To make this permanent I put this in conf/spark-env.sh. -sujit

On Sat, May 23, 2015 at 8:14 AM, Sujit Pal sujitatgt...@gmail.com wrote: Hello all, This is probably me doing something obviously wrong; I would really appreciate some pointers on how to fix it. I installed spark-1.3.1-bin-hadoop2.6.tgz from the Spark download page [https://spark.apache.org/downloads.html] and just untarred it on a local drive. I am on Mac OS X 10.9.5 and the JDK is 1.8.0_40. I ran the following commands (the first 3 run successfully; I mention them here to rule out any possibility of it being an obviously bad install).

1) laptop$ bin/spark-shell
scala> sc.parallelize(1 to 100).count()
res0: Long = 100
scala> exit

2) laptop$ bin/pyspark
sc.parallelize(range(100)).count()
100
quit()

3) laptop$ bin/spark-submit examples/src/main/python/pi.py
Pi is roughly 3.142800

4) laptop$ bin/run-example SparkPi

This hangs at this line (the full stack trace is provided at the end of this mail):

15/05/23 07:52:10 INFO Executor: Fetching http://10.0.0.5:51575/jars/spark-examples-1.3.1-hadoop2.6.0.jar with timestamp 1432392670140
15/05/23 07:52:10 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.net.SocketTimeoutException: connect timed out

... and finally dies with this message:

Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketTimeoutException: connect timed out

I checked with ifconfig -a on my box; 10.0.0.5 is my IP address on my local network.

en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    ether 34:36:3b:d2:b0:f4
    inet 10.0.0.5 netmask 0xff00 broadcast 10.0.0.255
    media: autoselect
    status: active

I think perhaps there may be some configuration I am missing. Being able to run jobs locally (without HDFS or creating a cluster) is essential for development, and the examples come from the Spark 1.3.1 Quick Start page [https://spark.apache.org/docs/latest/quick-start.html], so this is probably something to do with my environment. Thanks in advance for any help you can provide. -sujit

===== Full output of SparkPi run (including stack trace) follows:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/05/23 08:08:55 INFO SparkContext: Running Spark version 1.3.1
15/05/23 08:08:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform...
Doubts about SparkSQL
Hi all, I have some doubts about the latest SparkSQL.

1. In the paper about SparkSQL it has been stated that "The physical planner also performs rule-based physical optimizations, such as pipelining projections or filters into one Spark map operation." ... If dealing with a query of the form:

select * from (
  select * from tableA where date1 '19-12-2015'
)A
where attribute1 = 'valueA' and attribute2 = 'valueB'

Could I be sure that both filters are applied sequentially in memory, i.e. first applying the date filter and then, over that result set, the attribute filter? Or will two different Map-only operations be spawned?

2. Is the Catalyst query optimizer aware of how the data was partitioned, or does it not make any assumptions about this?

Thanks in advance! Renato M.
Re: Help reading Spark UI tea leaves..
Thanks! I was getting a little confused by this partitioner business. I thought that by default a pair RDD would be partitioned by a HashPartitioner? Was this possibly the case in 0.9.3 but not in 1.x? In any case, I tried your suggestion and the shuffle was removed. Cheers. One small question though: is it important that the same hash partitioner instance be used? E.g. could I have written instead:

val d3 = sc.parallelize(1 to 100).map { x => (x % 10) -> x }.partitionBy(new org.apache.spark.HashPartitioner(10))

(0 until 5).foreach { idx =>
  val otherData = sc.parallelize(1 to (idx * 100)).map { x => (x % 10) -> x }.partitionBy(new org.apache.spark.HashPartitioner(10))
  println(idx + " --- " + otherData.join(d3).count())
}

On Fri, May 22, 2015 at 1:10 PM, Imran Rashid iras...@cloudera.com wrote:

This is a pretty interesting case. I think the fact that stage 519 says it is doing the "map at reconstruct.scala:153" is somewhat misleading -- it's just the best ID it has for what it's working on. Really, it's just writing the map output data for the shuffle. You can see that the stage generated 8.4 GB of shuffle output. But it's not actually recomputing your RDD; it's just spending that time taking the data from memory and writing the appropriate shuffle outputs to disk.

Here's an example which shows that the RDD isn't really getting recomputed each time. I generate some data with a fake slow function (sleep for 5 seconds) and then cache it. Then that dataset is joined against a few other datasets each time.

val d1 = sc.parallelize(1 to 100).mapPartitions { itr =>
  Thread.sleep(5000)
  itr.map { x => (x % 10) -> x }
}.cache()
d1.count()

(0 until 5).foreach { idx =>
  val otherData = sc.parallelize(1 to (idx * 100)).map { x => (x % 10) -> x }
  println(idx + " --- " + otherData.join(d1).count())
}

If you run this and look at the UI, the stages that are run are very similar to what you see in your example. It seems that d1 gets run many times, but actually it's just generating the shuffle map output many times. You will only have that long 5-second sleep happen once.

But you can actually do even better in this case. You can use the idea of narrow dependencies to make Spark write the shuffle output for the shared dataset only one time. Consider this example instead:

val partitioner = new org.apache.spark.HashPartitioner(10)
val d3 = sc.parallelize(1 to 100).map { x => (x % 10) -> x }.partitionBy(partitioner)

(0 until 5).foreach { idx =>
  val otherData = sc.parallelize(1 to (idx * 100)).map { x => (x % 10) -> x }.partitionBy(partitioner)
  println(idx + " --- " + otherData.join(d3).count())
}

This time, d3 isn't even cached. But because d3 and otherData use the same partitioner, Spark knows it doesn't need to re-sort d3 each time. It can use the existing shuffle output it already has sitting on disk. So you'll see the stage is skipped in the UI (except for the first job):

On Fri, May 22, 2015 at 11:59 AM, Shay Seng s...@urbanengines.com wrote: Hi. I have an RDD that I use repeatedly through many iterations of an algorithm. To prevent recomputation, I persist the RDD (and incidentally I also persist and checkpoint its parents):

val consCostConstraintMap = consCost.join(constraintMap).map {
  case (cid, (costs, (mid1, _, mid2, _, _))) => (cid, (costs, mid1, mid2))
}
consCostConstraintMap.setName("consCostConstraintMap")
consCostConstraintMap.persist(MEMORY_AND_DISK_SER)

... later on, in an iterative loop:

val update = updatedTrips.join(consCostConstraintMap).flatMap {
...
}.treeReduce()

- I can see from the UI that consCostConstraintMap is in storage:

RDD Name              | Storage Level                   | Cached Partitions | Fraction Cached | Size in Memory | Size in Tachyon | Size on Disk
consCostConstraintMap | Memory Serialized 1x Replicated | 600               | 100%            | 15.2 GB        | 0.0 B           | 0.0 B

- In the Jobs list, I see the following pattern, where each treeReduce line corresponds to one iteration of the loop:

Job Id | Description                         | Submitted           | Duration | Stages: Succeeded/Total | Tasks (for all stages): Succeeded/Total
13     | treeReduce at reconstruct.scala:243 | 2015/05/22 16:27:11 | 2.9 min  | 16/16 (194 skipped)     | 9024/9024 (109225 skipped)
12     | treeReduce at reconstruct.scala:243 | 2015/05/22 16:24:16 | 2.9 min  | 16/16 (148 skipped)     | 9024/9024 (82725 skipped)
11     | treeReduce at reconstruct.scala:243 | 2015/05/22 16:21:21 | 2.9 min  | 16/16 (103 skipped)     | 9024/9024 (56280 skipped)
10     | treeReduce at reconstruct.scala:243
Re: Is anyone using Amazon EC2?
Yes. We're looking at bootstrapping in EMR... On Sat, May 23, 2015 at 07:21 Joe Wass jw...@crossref.org wrote: I used Spark on EC2 a while ago
Re: Is anyone using Amazon EC2?
Yes - Spark EC2 cluster. Looking into migrating to Spark EMR. Adding more EC2 instances is not possible AFAIK.

On May 23, 2015 11:22 AM, Johan Beisser j...@caustic.org wrote: Yes. We're looking at bootstrapping in EMR...

On Sat, May 23, 2015 at 07:21 Joe Wass jw...@crossref.org wrote: I used Spark on EC2 a while ago
Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)
Yup, and since I have only one core per executor, that explains why only one executor was utilized. I'll need to investigate which EC2 instance type is going to be the best fit. Thanks Evo.

On Fri, May 22, 2015 at 3:47 PM, Evo Eftimov evo.efti...@isecc.com wrote: A receiver occupies a CPU core; an executor is simply a JVM instance, and as such it can be granted any number of cores and any amount of RAM. So check how many cores you have per executor.

-------- Original message --------
From: Mike Trienis
Date: 2015/05/22 21:51 (GMT+00:00)
To: user@spark.apache.org
Subject: Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

I guess each receiver occupies an executor. So there was only one executor available for processing the job.

On Fri, May 22, 2015 at 1:24 PM, Mike Trienis mike.trie...@orcsol.com wrote: Hi All, I have a cluster of four nodes (three workers and one master, with one core each) which consumes data from Kinesis at 15-second intervals using two streams (i.e. receivers). The job simply grabs the latest batch and pushes it to MongoDB. I believe that the problem is that all tasks are executed on a single worker node and never distributed to the others. This is true even after I set the number of concurrentJobs to 3. Overall, I would really like to increase throughput (i.e. more than 500 records / second) and understand why all executors are not being utilized.

Here are some parameters I have set:
- spark.streaming.blockInterval 200
- spark.locality.wait 500
- spark.streaming.concurrentJobs 3

This is the code that's actually doing the writing:

def write(rdd: RDD[Data], time: Time): Unit = {
  val result = doSomething(rdd, time)
  result.foreachPartition { i =>
    i.foreach(record => connection.insert(record))
  }
}

def doSomething(rdd: RDD[Data]): RDD[MyObject] = {
  rdd.flatMap(MyObject)
}

Any ideas as to how to improve the throughput? Thanks, Mike.
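As a rough follow-up sketch (not from the thread): one way to spread the writes over more than one executor is to repartition each batch before the foreachPartition, so the insert tasks are not all colocated with the receiver. RDD[Data], Time, doSomething and connection are taken from the snippet above; the partition count is purely illustrative.

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

def write(rdd: RDD[Data], time: Time): Unit = {
  val result = doSomething(rdd, time)
  result
    .repartition(3)                              // roughly one partition per worker
    .foreachPartition { partition =>
      partition.foreach(record => connection.insert(record))
    }
}

Whether the extra shuffle pays for itself depends on the batch size and the cost of the inserts, so it is worth measuring against the single-executor baseline.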
Re: spark.executor.extraClassPath - Values not picked up by executors
Hi Yana, Yes, typo in the email; the file name is correct, spark-defaults.conf. Thanks though. So it appears to work if in the driver I specify it as part of the SparkConf:

val conf = new SparkConf().setAppName(getClass.getSimpleName)
  .set("spark.executor.extraClassPath", "/projects/spark-cassandra-connector/spark-cassandra-connetor/target/scala-2.10/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar")

I thought spark-defaults would be applied regardless of whether it was a spark-submit (driver) or a custom driver as in my case, but apparently I am mistaken. This will work fine, as I can ensure that all hosts participating in the cluster have access to a common directory with the dependencies and then just set spark.executor.extraClassPath to /some/shared/directory/lib/*.jar. If there is a better way to address this, let me know. As for the spark-cassandra-connector 1.3.0-SNAPSHOT, I am building that from master. Haven't hit any issue with it yet. -Todd

On Fri, May 22, 2015 at 9:39 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote: Todd, I don't have any answers for you... other than the file is actually named spark-defaults.conf (not sure if you made a typo in the email or misnamed the file...). Do any other options from that file get read? I also wanted to ask if you built the spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar from trunk or if they published a 1.3 drop somewhere -- I'm just starting out with Cassandra and discovered https://datastax-oss.atlassian.net/browse/SPARKC-98 is still open...

On Fri, May 22, 2015 at 6:15 PM, Todd Nist tsind...@gmail.com wrote: I'm using the spark-cassandra-connector from DataStax in a Spark Streaming job launched from my own driver. It is connecting to a standalone cluster on my local box which has two workers running. This is Spark 1.3.1 and spark-cassandra-connector-1.3.0-SNAPSHOT. I have added the following entry to my $SPARK_HOME/conf/spark-default.conf:

spark.executor.extraClassPath /projects/spark-cassandra-connector/spark-cassandra-connector/target/scala-2.10/spark-cassandra-connector-assembly-1.3.0-SNAPSHOT.jar

When I start the master with $SPARK_HOME/sbin/start-master.sh, it comes up just fine.
As do the two workers with the following commands:

Worker 1, port 8081:
radtech:spark $ ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://radtech.io:7077 --webui-port 8081 --cores 2

Worker 2, port 8082:
radtech:spark $ ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://radtech.io:7077 --webui-port 8082 --cores 2

When I execute the driver connecting to the master:

sbt app/run -Dspark.master=spark://radtech.io:7077

It starts up, but when the executors are launched they do not include the entry in spark.executor.extraClassPath:

15/05/22 17:35:26 INFO Worker: Asked to launch executor app-20150522173526-/0 for KillrWeatherApp$
15/05/22 17:35:26 INFO ExecutorRunner: Launch command: java -cp /usr/local/spark/conf:/usr/local/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar:/usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark/lib/datanucleus-core-3.2.10.jar:/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/local/spark/conf:/usr/local/spark/lib/spark-assembly-1.3.1-hadoop2.6.0.jar:/usr/local/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/local/spark/lib/datanucleus-core-3.2.10.jar:/usr/local/spark/lib/datanucleus-rdbms-3.2.9.jar -Dspark.driver.port=55932 -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@192.168.1.3:55932/user/CoarseGrainedScheduler --executor-id 0 --hostname 192.168.1.3 --cores 2 --app-id app-20150522173526- --worker-url akka.tcp://sparkWorker@192.168.1.3:55923/user/Worker

which will then cause the executor to fail with a ClassNotFoundException, which I would expect:

[WARN] [2015-05-22 17:38:18,035] [org.apache.spark.scheduler.TaskSetManager]: Lost task 0.0 in stage 2.0 (TID 23, 192.168.1.3): java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition
    at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:344)
    at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:65)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
    at
Re: SparkSQL failing while writing into S3 for 'insert into table'
It seems it generated the query results into a tmp dir first, and tried to rename them into the right folder at the end, but failed while renaming. This problem exists not only in SparkSQL but also in other Hadoop tools (e.g. Hive, Pig, etc.) when used with S3. Usually, it is better to write task outputs to local disk and copy them to the final S3 location in the task commit phase. In fact, this is how EMR Hive does insert overwrite, and that's why EMR Hive works well with S3 while Apache Hive doesn't. If you look at SparkHiveWriterContainer, you will see how it mimics a Hadoop task. Basically, you can modify that code to make it write to local disk first and then commit to the final S3 location. Actually, I am doing the same at work in the 1.4 branch.

On Fri, May 22, 2015 at 5:50 PM, ogoh oke...@gmail.com wrote: Hello, I am using Spark 1.3 and Hive 0.13.1 in AWS. From Spark-SQL, when running a Hive query to export the Hive query result into AWS S3, it failed with the following message:

org.apache.hadoop.hive.ql.metadata.HiveException: checkPaths: s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-1 has nested directory s3://test-dev/tmp/hive-hadoop/hive_2015-05-23_00-33-06_943_4594473380941885173-1/-ext-1/_temporary
    at org.apache.hadoop.hive.ql.metadata.Hive.checkPaths(Hive.java:2157)
    at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2298)
    at org.apache.hadoop.hive.ql.metadata.Table.copyFiles(Table.java:686)
    at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1469)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:230)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:124)
    at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.execute(InsertIntoHiveTable.scala:249)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1088)
    at org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1088)

The query tested is:

spark-sql> create external table s3_dwserver_sql_t1 (q string) location 's3://test-dev/s3_dwserver_sql_t1')
spark-sql> insert into table s3_dwserver_sql_t1 select q from api_search where pdate='2015-05-12' limit 100;

It seems it generated the query results into a tmp dir first, and tried to rename them into the right folder at the end, but failed while renaming. I appreciate any advice. Thanks, Okehee
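A workaround along the lines described above, until the task-commit behaviour is changed, is to let the INSERT write to an HDFS-backed table and copy the finished files to S3 afterwards (a sketch only; the staging table name and paths are invented, and s3-dist-cp is the EMR copy tool rather than anything from this thread):

// Write the query result to an HDFS-backed staging table first, where the
// tmp-dir rename works normally.
sqlContext.sql(
  "insert into table staging_api_search_hdfs " +
  "select q from api_search where pdate='2015-05-12' limit 100")

// Then copy the committed files to S3 outside of the Spark job, e.g. on EMR:
//   s3-dist-cp --src hdfs:///warehouse/staging_api_search_hdfs \
//              --dest s3://test-dev/s3_dwserver_sql_t1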