Re: Multi-tenancy for Spark (Streaming) Applications
Hi, by now I have understood a bit better how spark-submit and YARN play together, and how the Spark driver and slaves interact on YARN. For my use case, as described on https://spark.apache.org/docs/latest/submitting-applications.html, I would probably have an end-user-facing gateway that submits my Spark (Streaming) application to the YARN cluster in yarn-cluster mode. I have a couple of questions regarding that setup:
* That gateway does not need to be written in Scala or Java; it actually has no contact with the Spark libraries and just executes a program on the command line (./spark-submit ...), right?
* Since my application is a streaming application, it won't finish by itself. What is the best way to terminate the application on the cluster from my gateway program? Can I just send SIGTERM to the spark-submit process? Is that recommended?
* I guess there are many possibilities to achieve this, but what is a good way to send commands/instructions to the running Spark application? If I want to push commands from the gateway to the Spark driver, I guess I need its IP address - how do I get it? If I want the Spark driver to pull its instructions instead, what is a good way to do so?
Any suggestions? Thanks, Tobias
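One way for the driver to pull a stop instruction - sketched below as an untested idea, with made-up paths and helper names - is a small watchdog thread that polls for a marker file the gateway creates on HDFS, so the gateway never needs the driver's IP address:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.streaming.StreamingContext

    // Hypothetical helper: the gateway creates `marker` on HDFS when the job should stop.
    def watchForShutdown(ssc: StreamingContext, marker: String): Unit = {
      val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
      new Thread("shutdown-watcher") {
        override def run(): Unit = {
          while (!fs.exists(new Path(marker))) Thread.sleep(10000)
          // Stop gracefully so batches that were already received get processed before exit.
          ssc.stop(stopSparkContext = true, stopGracefully = true)
        }
      }.start()
    }

The gateway side then only needs an HDFS client call (or hadoop fs -touchz) plus the YARN application id reported by spark-submit, rather than a direct connection to the driver.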
Re: can fileStream() or textFileStream() remember state?
When you get a stream from sc.fileStream(), Spark will only process files whose timestamp is newer than the current timestamp, so data already in HDFS should not be processed again. You may have another problem - Spark will not process files that were moved into your HDFS folder between your application restarts. To avoid this you should use checkpoints as described here: https://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-the-driver-node akso wrote: When streaming from HDFS through either sc.fileStream() or sc.textFileStream(), how can state info be saved so that it won't process duplicate data? When the app is stopped and restarted, all data from HDFS is processed again. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/can-fileStream-or-textFileStream-remember-state-tp9105p13950.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
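A minimal sketch of the checkpoint-based recovery that the linked guide describes (the directories, app name and batch interval below are placeholders, not taken from the original post):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///user/app/checkpoints"   // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("FileStreamApp")
      val ssc = new StreamingContext(conf, Seconds(30))
      ssc.checkpoint(checkpointDir)
      // Placeholder input directory; use the folder your files are moved into.
      val lines = ssc.textFileStream("hdfs:///user/app/incoming")
      lines.count().print()
      ssc
    }

    // On restart, rebuild the context from the checkpoint instead of creating a fresh one;
    // a fresh context only looks at files newer than its own start time.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()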
SchemaRDD saveToCassandra
Hi, My requirement is to extract certain fields from json files, run queries on them and save the result to cassandra. I was able to parse json , filter the result and save the rdd(regular) to cassandra. Now, when I try to read the json file through sqlContext , execute some queries on the same and then save the SchemaRDD to cassandra using saveToCassandra function, I am getting the following error: java.lang.NoSuchMethodException: Cannot resolve any suitable constructor for class org.apache.spark.sql.catalyst.expressions.Row Pls let me know if a spark SchemaRDD can be directly saved to cassandra just like the regular rdd? If that is not possible, is there any way to convert the schema RDD to a regular RDD ? Please advise. Regards, lmk -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-saveToCassandra-tp13951.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
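In case a direct save of the SchemaRDD is not supported by the connector version in use, here is a rough sketch of the conversion route (the case class, field positions, keyspace and table names below are made-up examples):

    import com.datastax.spark.connector._

    case class Email(id: String, subject: String)   // example schema

    val schemaRdd = sqlContext.jsonFile("emails.json")   // or the SchemaRDD returned by a sqlContext.sql(...) query
    // Map the catalyst Rows into plain case-class objects, giving a regular RDD...
    val plainRdd = schemaRdd.map(row => Email(row.getString(0), row.getString(1)))
    // ...which the connector can save by matching case-class fields to column names.
    plainRdd.saveToCassandra("email_keyspace", "emails")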
Spark not installed + no access to web UI
Hi, I have been launching Spark in the same ways for the past months, but I have only recently started to have problems with it. I launch Spark using spark-ec2 script, but then I cannot access the web UI when I type address:8080 into the browser (it doesn't work with lynx either from the master node), and I cannot find pyspark in the usual spark/bin/pyspark folder. Any hints as to what might be happening? Thanks in advance! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-not-installed-no-access-to-web-UI-tp13952.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Unpersist
I want to create a temporary variable in my Spark code. Can I do this? for (i <- num) { val temp = .. { do something } temp.unpersist() } Thank You
Re: Spark not installed + no access to web UI
Which version of Spark are you using? Thanks Best Regards On Thu, Sep 11, 2014 at 3:10 PM, mrm ma...@skimlinks.com wrote: Hi, I have been launching Spark in the same ways for the past months, but I have only recently started to have problems with it. I launch Spark using spark-ec2 script, but then I cannot access the web UI when I type address:8080 into the browser (it doesn't work with lynx either from the master node), and I cannot find pyspark in the usual spark/bin/pyspark folder. Any hints as to what might be happening? Thanks in advance! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-not-installed-no-access-to-web-UI-tp13952.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Unpersist
like this? var temp = ... for (i <- num) { temp = .. { do something } temp.unpersist() } Thanks Best Regards On Thu, Sep 11, 2014 at 3:26 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: I want to create a temporary variable in my Spark code. Can I do this? for (i <- num) { val temp = .. { do something } temp.unpersist() } Thank You
Re: Spark not installed + no access to web UI
I tried 1.0.0, 1.0.1 and 1.0.2. I also tried the latest github commit. After several hours trying to launch it, now it seems to be working. This is what I did (not sure if any of these steps helped):
1/ clone the spark repo into the master node
2/ run sbt/sbt assembly
3/ copy spark and spark-ec2 directories to my slaves
4/ launch the cluster again with --resume
Now I can finally access the web UI and spark is properly installed! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-not-installed-no-access-to-web-UI-tp13952p13957.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
JMXSink for YARN deployment
Hello, we at Sematext (https://apps.sematext.com/) are writing a monitoring tool for Spark and we came across one question: how to enable JMX metrics for a YARN deployment? We put *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink into the file $SPARK_HOME/conf/metrics.properties but it doesn't work. Everything works in Standalone mode, but not in YARN mode. Can somebody help? Thx! PS: I also found https://stackoverflow.com/questions/23529404/spark-on-yarn-how-to-send-metrics-to-graphite-sink/25786112 which has no answer.
Re: How to scale more consumer to Kafka stream
Thanks all, I'm going to check both solutions. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-scale-more-consumer-to-Kafka-stream-tp13883p13959.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark streaming stops computing while the receiver keeps running without any errors reported
Hi all, I am trying to run a Kinesis Spark Streaming application on a standalone Spark cluster. The job works fine in local mode, but when I submit it (using spark-submit) it doesn't do anything. I enabled logs for the org.apache.spark.streaming.kinesis package and I regularly get the following in worker logs: 14/09/11 11:41:25 DEBUG KinesisRecordProcessor: Stored: Worker x.x.x.x:b88a9210-cbb9-4c31-8da7-35fd92faba09 stored 34 records for shardId shardId- 14/09/11 11:41:25 DEBUG KinesisRecordProcessor: Stored: Worker x.x.x.x:b2e9c20f-470a-44fe-a036-630c776919fb stored 31 records for shardId shardId-0001 But the job does not perform any operations defined on the DStream. To investigate this further, I changed kinesis-asl's KinesisUtils to perform the following computation on the DStream created using ssc.receiverStream(new KinesisReceiver...): stream.count().foreachRDD(rdd => rdd.foreach(tuple => logInfo("Emitted " + tuple))) Even the above line does not result in any corresponding log entries in either the driver or worker logs. The only relevant logs that I could find in the driver logs are: 14/09/11 11:40:58 INFO DAGScheduler: Stage 2 (foreach at KinesisUtils.scala:68) finished in 0.398 s 14/09/11 11:40:58 INFO SparkContext: Job finished: foreach at KinesisUtils.scala:68, took 4.926449985 s 14/09/11 11:40:58 INFO JobScheduler: Finished job streaming job 1410435653000 ms.0 from job set of time 1410435653000 ms 14/09/11 11:40:58 INFO JobScheduler: Starting job streaming job 1410435653000 ms.1 from job set of time 1410435653000 ms 14/09/11 11:40:58 INFO SparkContext: Starting job: foreach at KinesisUtils.scala:68 14/09/11 11:40:58 INFO DAGScheduler: Registering RDD 13 (union at DStream.scala:489) 14/09/11 11:40:58 INFO DAGScheduler: Got job 3 (foreach at KinesisUtils.scala:68) with 2 output partitions (allowLocal=false) 14/09/11 11:40:58 INFO DAGScheduler: Final stage: Stage 5(foreach at KinesisUtils.scala:68) After the above logs, nothing shows up corresponding to KinesisUtils. I am out of ideas on this one and any help would be greatly appreciated. Thanks, Aniket
problem in using Spark-Cassandra connector
Hi, I am new to spark. I encountered an issue when trying to connect to Cassandra using Spark Cassandra connector. Can anyone help me. Following are the details. 1) Following Spark and Cassandra versions I am using on LUbuntu12.0. i)spark-1.0.2-bin-hadoop2 ii) apache-cassandra-2.0.10 2) In the Cassandra, i created a key space, table and inserted some data. 3)Following libs are specified when starting the spark-shell. antworks@INHN1I-DW1804:$ spark-shell --jars /home/antworks/lib-spark-cassandra/apache-cassandra-clientutil-2.0.10.jar,/home/antworks/lib-spark-cassandra/apache-cassandra-thrift-2.0.10.jar,/home/antworks/lib-spark-cassandra/cassandra-driver-core-2.0.2.jar,/home/antworks/lib-spark-cassandra/guava-15.0.jar,/home/antworks/lib-spark-cassandra/joda-convert-1.6.jar,/home/antworks/lib-spark-cassandra/joda-time-2.3.jar,/home/antworks/lib-spark-cassandra/libthrift-0.9.1.jar,/home/antworks/lib-spark-cassandra/spark-cassandra-connector_2.10-1.0.0-rc3.jar 4) when running the stmt val rdd = sc.cassandraTable(EmailKeySpace, Emails)encountered following issue. My application connecting to Cassandra and immediately disconnecting and throwing java.io.IOException: Table not found: EmailKeySpace.Emails Here is the stack trace. scala import com.datastax.spark.connector._ import com.datastax.spark.connector._ scala val rdd = sc.cassandraTable(EmailKeySpace, Emails) 14/09/11 23:06:51 WARN FrameCompressor: Cannot find LZ4 class, you should make sure the LZ4 library is in the classpath if you intend to use it. LZ4 compression will not be available for the protocol. 14/09/11 23:06:51 INFO Cluster: New Cassandra host /172.23.1.68:9042 added 14/09/11 23:06:51 INFO CassandraConnector: Connected to Cassandra cluster: AWCluster 14/09/11 23:06:52 INFO CassandraConnector: Disconnected from Cassandra cluster: AWCluster java.io.IOException: Table not found: EmailKeySpace.Emails at com.datastax.spark.connector.rdd.CassandraRDD.tableDef$lzycompute(CassandraRDD.scala:208) at com.datastax.spark.connector.rdd.CassandraRDD.tableDef(CassandraRDD.scala:205) at com.datastax.spark.connector.rdd.CassandraRDD.init(CassandraRDD.scala:212) at com.datastax.spark.connector.SparkContextFunctions.cassandraTable(SparkContextFunctions.scala:48) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC$$iwC$$iwC.init(console:22) at $iwC$$iwC$$iwC.init(console:24) at $iwC$$iwC.init(console:26) at $iwC.init(console:28) at init(console:30) at .init(console:34) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601) at 
org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at
Re: problem in using Spark-Cassandra connector
You will have to create create KeySpace and Table. See the message, Table not found: EmailKeySpace.Emails Looks like you have not created the Emails table. On Thu, Sep 11, 2014 at 6:04 PM, Karunya Padala karunya.pad...@infotech-enterprises.com wrote: Hi, I am new to spark. I encountered an issue when trying to connect to Cassandra using Spark Cassandra connector. Can anyone help me. Following are the details. 1) Following Spark and Cassandra versions I am using on LUbuntu12.0. i)spark-1.0.2-bin-hadoop2 ii) apache-cassandra-2.0.10 2) In the Cassandra, i created a key space, table and inserted some data. 3)Following libs are specified when starting the spark-shell. antworks@INHN1I-DW1804:$ spark-shell --jars /home/antworks/lib-spark-cassandra/apache-cassandra-clientutil-2.0.10.jar,/home/antworks/lib-spark-cassandra/apache-cassandra-thrift-2.0.10.jar,/home/antworks/lib-spark-cassandra/cassandra-driver-core-2.0.2.jar,/home/antworks/lib-spark-cassandra/guava-15.0.jar,/home/antworks/lib-spark-cassandra/joda-convert-1.6.jar,/home/antworks/lib-spark-cassandra/joda-time-2.3.jar,/home/antworks/lib-spark-cassandra/libthrift-0.9.1.jar,/home/antworks/lib-spark-cassandra/spark-cassandra-connector_2.10-1.0.0-rc3.jar 4) when running the stmt val rdd = sc.cassandraTable(EmailKeySpace, Emails)encountered following issue. My application connecting to Cassandra and immediately disconnecting and throwing java.io.IOException: Table not found: EmailKeySpace.Emails Here is the stack trace. scala import com.datastax.spark.connector._ import com.datastax.spark.connector._ scala val rdd = sc.cassandraTable(EmailKeySpace, Emails) 14/09/11 23:06:51 WARN FrameCompressor: Cannot find LZ4 class, you should make sure the LZ4 library is in the classpath if you intend to use it. LZ4 compression will not be available for the protocol. 
RE: problem in using Spark-Cassandra connector
I have created key space called EmailKeySpace’and table called Emails and inserted some data in the Cassandra. See my Cassandra console screen shot. [cid:image001.png@01CFCDEB.8FB55CB0] Regards, Karunya. From: Reddy Raja [mailto:areddyr...@gmail.com] Sent: 11 September 2014 18:07 To: Karunya Padala Cc: u...@spark.incubator.apache.org Subject: Re: problem in using Spark-Cassandra connector You will have to create create KeySpace and Table. See the message, Table not found: EmailKeySpace.Emails Looks like you have not created the Emails table. On Thu, Sep 11, 2014 at 6:04 PM, Karunya Padala karunya.pad...@infotech-enterprises.commailto:karunya.pad...@infotech-enterprises.com wrote: Hi, I am new to spark. I encountered an issue when trying to connect to Cassandra using Spark Cassandra connector. Can anyone help me. Following are the details. 1) Following Spark and Cassandra versions I am using on LUbuntu12.0. i)spark-1.0.2-bin-hadoop2 ii) apache-cassandra-2.0.10 2) In the Cassandra, i created a key space, table and inserted some data. 3)Following libs are specified when starting the spark-shell. antworks@INHN1I-DW1804:$ spark-shell --jars /home/antworks/lib-spark-cassandra/apache-cassandra-clientutil-2.0.10.jar,/home/antworks/lib-spark-cassandra/apache-cassandra-thrift-2.0.10.jar,/home/antworks/lib-spark-cassandra/cassandra-driver-core-2.0.2.jar,/home/antworks/lib-spark-cassandra/guava-15.0.jar,/home/antworks/lib-spark-cassandra/joda-convert-1.6.jar,/home/antworks/lib-spark-cassandra/joda-time-2.3.jar,/home/antworks/lib-spark-cassandra/libthrift-0.9.1.jar,/home/antworks/lib-spark-cassandra/spark-cassandra-connector_2.10-1.0.0-rc3.jar 4) when running the stmt val rdd = sc.cassandraTable(EmailKeySpace, Emails)encountered following issue. My application connecting to Cassandra and immediately disconnecting and throwing java.io.IOException: Table not found: EmailKeySpace.Emails Here is the stack trace. scala import com.datastax.spark.connector._ import com.datastax.spark.connector._ scala val rdd = sc.cassandraTable(EmailKeySpace, Emails) 14/09/11 23:06:51 WARN FrameCompressor: Cannot find LZ4 class, you should make sure the LZ4 library is in the classpath if you intend to use it. LZ4 compression will not be available for the protocol. 
Spark on Raspberry Pi?
Has anyone tried using Raspberry Pi for Spark? How efficient is it to use around 10 Pi's for local testing env ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Raspberry-Pi-tp13965.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to scale more consumer to Kafka stream
I agree Gerard. Thanks for pointing this.. Dib On Thu, Sep 11, 2014 at 5:28 PM, Gerard Maas gerard.m...@gmail.com wrote: This pattern works. One note, thought: Use 'union' only if you need to group the data from all RDDs into one RDD for processing (like count distinct or need a groupby). If your process can be parallelized over every stream of incoming data, I suggest you just apply the required transformations on every dstream and avoid 'union' altogether. -kr, Gerard. On Wed, Sep 10, 2014 at 8:17 PM, Tim Smith secs...@gmail.com wrote: How are you creating your kafka streams in Spark? If you have 10 partitions for a topic, you can call createStream ten times to create 10 parallel receivers/executors and then use union to combine all the dStreams. On Wed, Sep 10, 2014 at 7:16 AM, richiesgr richie...@gmail.com wrote: Hi (my previous post as been used by someone else) I'm building a application the read from kafka stream event. In production we've 5 consumers that share 10 partitions. But on spark streaming kafka only 1 worker act as a consumer then distribute the tasks to workers so I can have only 1 machine acting as consumer but I need more because only 1 consumer means Lags. Do you've any idea what I can do ? Another point is interresting the master is not loaded at all I can get up more than 10 % CPU I've tried to increase the queued.max.message.chunks on the kafka client to read more records thinking it'll speed up the read but I only get ERROR consumer.ConsumerFetcherThread: [ConsumerFetcherThread-SparkEC2_ip-10-138-59-194.ec2.internal-1410182950783-5c49c8e8-0-174167372], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 73; ClientId: SparkEC2-ConsumerFetcherThread-SparkEC2_ip-10-138-59-194.ec2.internal-1410182950783-5c49c8e8-0-174167372; ReplicaId: -1; MaxWait: 100 ms; MinBytes: 1 bytes; RequestInfo: [IA2,7] - PartitionFetchInfo(929838589,1048576),[IA2,6] - PartitionFetchInfo(929515796,1048576),[IA2,9] - PartitionFetchInfo(929577946,1048576),[IA2,8] - PartitionFetchInfo(930751599,1048576),[IA2,2] - PartitionFetchInfo(926457704,1048576),[IA2,5] - PartitionFetchInfo(930774385,1048576),[IA2,0] - PartitionFetchInfo(929913213,1048576),[IA2,3] - PartitionFetchInfo(929268891,1048576),[IA2,4] - PartitionFetchInfo(929949877,1048576),[IA2,1] - PartitionFetchInfo(930063114,1048576) java.lang.OutOfMemoryError: Java heap space Is someone have ideas ? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-scale-more-consumer-to-Kafka-stream-tp13883.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
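For reference, a rough sketch of the createStream-per-partition pattern Tim describes, assuming an existing StreamingContext `ssc` (the ZooKeeper quorum, consumer group and topic map below are placeholders):

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numReceivers = 10   // roughly one receiver per Kafka partition, as suggested above
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("mytopic" -> 1))
    }
    // Union them when the processing needs all partitions together (e.g. a global distinct count)...
    val unioned = ssc.union(kafkaStreams)
    // ...or, per Gerard's note, apply the same transformations to each stream separately.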
Fwd: Spark on Raspberry Pi?
Pi's bus speed, memory size and access speed, and processing ability are limited. The only benefit could be the power consumption. On Thu, Sep 11, 2014 at 8:04 AM, Sandeep Singh sand...@techaddict.me wrote: Has anyone tried using Raspberry Pi for Spark? How efficient is it to use around 10 Pi's for local testing env ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Raspberry-Pi-tp13965.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: JMXSink for YARN deployment
Hi, I’m guessing the problem is that driver or executor cannot get the metrics.properties configuration file in the yarn container, so metrics system cannot load the right sinks. Thanks Jerry From: Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com] Sent: Thursday, September 11, 2014 7:30 PM To: user@spark.apache.org Subject: JMXSink for YARN deployment Hello, we are in Sematext (https://apps.sematext.com/) are writing Monitoring tool for Spark and we came across one question: How to enable JMX metrics for YARN deployment? We put *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink to file $SPARK_HOME/conf/metrics.properties but it doesn't work. Everything works in Standalone mode, but not in YARN mode. Can somebody help? Thx! PS: I've found also https://stackoverflow.com/questions/23529404/spark-on-yarn-how-to-send-metrics-to-graphite-sink/25786112 without answer.
unable to create new native thread
Hi I am trying the Spark sample program “SparkPi”, I got an error unable to create new native thread, how to resolve this? 14/09/11 21:36:16 INFO scheduler.DAGScheduler: Completed ResultTask(0, 644) 14/09/11 21:36:16 INFO scheduler.TaskSetManager: Finished TID 643 in 43 ms on node1 (progress: 636/10) 14/09/11 21:36:16 INFO scheduler.DAGScheduler: Completed ResultTask(0, 643) 14/09/11 21:36:16 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [spark-akka.actor.default-dispatcher-16] shutting down ActorSystem [spark] java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) at scala.concurrent.forkjoin.ForkJoinPool.tryAddWorker(ForkJoinPool.java:1672) at scala.concurrent.forkjoin.ForkJoinPool.signalWork(ForkJoinPool.java:1966) at scala.concurrent.forkjoin.ForkJoinPool.externalPush(ForkJoinPool.java:1829) at scala.concurrent.forkjoin.ForkJoinPool.execute(ForkJoinPool.java:2955) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool.execute(AbstractDispatcher.scala:374) at akka.dispatch.ExecutorServiceDelegate$class.execute(ThreadPoolBuilder.scala:212) at akka.dispatch.Dispatcher$LazyExecutorServiceDelegate.execute(Dispatcher.scala:43) at akka.dispatch.Dispatcher.registerForExecution(Dispatcher.scala:118) at akka.dispatch.Dispatcher.dispatch(Dispatcher.scala:59) at akka.actor.dungeon.Dispatch$class.sendMessage(Dispatch.scala:120) at akka.actor.ActorCell.sendMessage(ActorCell.scala:338) at akka.actor.Cell$class.sendMessage(ActorCell.scala:259) at akka.actor.ActorCell.sendMessage(ActorCell.scala:338) at akka.actor.LocalActorRef.$bang(ActorRef.scala:389) at akka.actor.ActorRef.tell(ActorRef.scala:125) at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:489) at akka.actor.ActorCell.invoke(ActorCell.scala:455) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Regards Arthur
Re: JMXSink for YARN deployment
Hi Shao, thx for explanation, any ideas how to fix it? Where should I put metrics.properties file? On Thu, Sep 11, 2014 at 4:18 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, I’m guessing the problem is that driver or executor cannot get the metrics.properties configuration file in the yarn container, so metrics system cannot load the right sinks. Thanks Jerry *From:* Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com] *Sent:* Thursday, September 11, 2014 7:30 PM *To:* user@spark.apache.org *Subject:* JMXSink for YARN deployment Hello, we are in Sematext (https://apps.sematext.com/) are writing Monitoring tool for Spark and we came across one question: How to enable JMX metrics for YARN deployment? We put *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink to file $SPARK_HOME/conf/metrics.properties but it doesn't work. Everything works in Standalone mode, but not in YARN mode. Can somebody help? Thx! PS: I've found also https://stackoverflow.com/questions/23529404/spark-on-yarn-how-to-send-metrics-to-graphite-sink/25786112 without answer.
RE: JMXSink for YARN deployment
I think you can try to use ” spark.metrics.conf” to manually specify the path of metrics.properties, but the prerequisite is that each container should find this file in their local FS because this file is loaded locally. Besides I think this might be a kind of workaround, a better solution is to fix this by some other solutions. Thanks Jerry From: Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com] Sent: Thursday, September 11, 2014 10:08 PM Cc: user@spark.apache.org Subject: Re: JMXSink for YARN deployment Hi Shao, thx for explanation, any ideas how to fix it? Where should I put metrics.properties file? On Thu, Sep 11, 2014 at 4:18 PM, Shao, Saisai saisai.s...@intel.commailto:saisai.s...@intel.com wrote: Hi, I’m guessing the problem is that driver or executor cannot get the metrics.properties configuration file in the yarn container, so metrics system cannot load the right sinks. Thanks Jerry From: Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.commailto:vladimir.tretya...@sematext.com] Sent: Thursday, September 11, 2014 7:30 PM To: user@spark.apache.orgmailto:user@spark.apache.org Subject: JMXSink for YARN deployment Hello, we are in Sematext (https://apps.sematext.com/) are writing Monitoring tool for Spark and we came across one question: How to enable JMX metrics for YARN deployment? We put *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink to file $SPARK_HOME/conf/metrics.properties but it doesn't work. Everything works in Standalone mode, but not in YARN mode. Can somebody help? Thx! PS: I've found also https://stackoverflow.com/questions/23529404/spark-on-yarn-how-to-send-metrics-to-graphite-sink/25786112 without answer.
Re: Unpersist
After every loop iteration I want the temp variable to cease to exist. On Thu, Sep 11, 2014 at 4:33 PM, Akhil Das ak...@sigmoidanalytics.com wrote: like this? var temp = ... for (i <- num) { temp = .. { do something } temp.unpersist() } Thanks Best Regards On Thu, Sep 11, 2014 at 3:26 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: I want to create a temporary variable in my Spark code. Can I do this? for (i <- num) { val temp = .. { do something } temp.unpersist() } Thank You
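For what it's worth, a val declared inside the loop body already goes out of scope at the end of each iteration; unpersist() only matters if the RDD was cached. A minimal sketch, where `data`, `num` and the transformation are placeholders:

    for (i <- 1 to num) {
      val temp = data.map(x => x * i).cache()   // example per-iteration RDD
      temp.count()                              // use temp for this iteration
      temp.unpersist()                          // free its cached blocks before the next pass
    }
    // `temp` is local to the loop body, so the reference itself disappears after every iteration.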
Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..
Hi, Can you attach more logs to see if there is some entry from ContextCleaner? I met very similar issue before…but haven’t get resolved Best, -- Nan Zhu On Thursday, September 11, 2014 at 10:13 AM, Dibyendu Bhattacharya wrote: Dear All, Not sure if this is a false alarm. But wanted to raise to this to understand what is happening. I am testing the Kafka Receiver which I have written (https://github.com/dibbhatt/kafka-spark-consumer) which basically a low level Kafka Consumer implemented custom Receivers for every Kafka topic partitions and pulling data in parallel. Individual streams from all topic partitions are then merged to create Union stream which used for further processing. The custom Receiver working fine in normal load with no issues. But when I tested this with huge amount of backlog messages from Kafka ( 50 million + messages), I see couple of major issue in Spark Streaming. Wanted to get some opinion on this I am using latest Spark 1.1 taken from the source and built it. Running in Amazon EMR , 3 m1.xlarge Node Spark cluster running in Standalone Mode. Below are two main question I have.. 1. What I am seeing when I run the Spark Streaming with my Kafka Consumer with a huge backlog in Kafka ( around 50 Million), Spark is completely busy performing the Receiving task and hardly schedule any processing task. Can you let me if this is expected ? If there is large backlog, Spark will take long time pulling them . Why Spark not doing any processing ? Is it because of resource limitation ( say all cores are busy puling ) or it is by design ? I am setting the executor-memory to 10G and driver-memory to 4G . 2. This issue seems to be more serious. I have attached the Driver trace with this email. What I can see very frequently Block are selected to be Removed...This kind of entries are all over the place. But when a Block is removed , below problem happen May be this issue cause the issue 1 that no Jobs are getting processed .. INFO : org.apache.spark.storage.MemoryStore - 1 blocks selected for dropping INFO : org.apache.spark.storage.BlockManager - Dropping block input-0-1410443074600 from memory INFO : org.apache.spark.storage.MemoryStore - Block input-0-1410443074600 of size 12651900 dropped from memory (free 21220667) INFO : org.apache.spark.storage.BlockManagerInfo - Removed input-0-1410443074600 on ip-10-252-5-113.asskickery.us:53752 (http://ip-10-252-5-113.asskickery.us:53752) in memory (size: 12.1 MB, free: 100.6 MB) ... INFO : org.apache.spark.storage.BlockManagerInfo - Removed input-0-1410443074600 on ip-10-252-5-62.asskickery.us:37033 (http://ip-10-252-5-62.asskickery.us:37033) in memory (size: 12.1 MB, free: 154.6 MB) .. WARN : org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 7.0 (TID 118, ip-10-252-5-62.asskickery.us (http://ip-10-252-5-62.asskickery.us)): java.lang.Exception: Could not compute split, block input-0-1410443074600 not found ... 
INFO : org.apache.spark.scheduler.TaskSetManager - Lost task 0.1 in stage 7.0 (TID 126) on executor ip-10-252-5-62.asskickery.us (http://ip-10-252-5-62.asskickery.us): java.lang.Exception (Could not compute split, block input-0-1410443074600 not found) [duplicate 1] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 139, ip-10-252-5-62.asskickery.us (http://ip-10-252-5-62.asskickery.us)): java.lang.Exception: Could not compute split, block input-0-1410443074600 not found org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:744) Regards, Dibyendu - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org (mailto:user-unsubscr...@spark.apache.org) For additional commands, e-mail: user-h...@spark.apache.org (mailto:user-h...@spark.apache.org) Attachments: - driver-trace.txt
Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..
This is my case about broadcast variable: 14/07/21 19:49:13 INFO Executor: Running task ID 4 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 2) 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 2 in 95 ms on localhost (progress: 3/106) 14/07/21 19:49:13 INFO TableOutputFormat: Created table instance for hdfstest_customers 14/07/21 19:49:13 INFO Executor: Serialized size of result for 3 is 596 14/07/21 19:49:13 INFO Executor: Sending result for 3 directly to driver 14/07/21 19:49:13 INFO BlockManager: Found block broadcast_0 locally 14/07/21 19:49:13 INFO Executor: Finished task ID 3 14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on executor localhost: localhost (PROCESS_LOCAL) 14/07/21 19:49:13 INFO TaskSetManager: Serialized task 0.0:5 as 11885 bytes in 0 ms 14/07/21 19:49:13 INFO Executor: Running task ID 5 14/07/21 19:49:13 INFO BlockManager: Removing broadcast 0 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 3) 14/07/21 19:49:13 INFO ContextCleaner: Cleaned broadcast 0 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 3 in 97 ms on localhost (progress: 4/106) 14/07/21 19:49:13 INFO BlockManager: Found block broadcast_0 locally 14/07/21 19:49:13 INFO BlockManager: Removing block broadcast_0 14/07/21 19:49:13 INFO MemoryStore: Block broadcast_0 of size 202564 dropped from memory (free 886623436) 14/07/21 19:49:13 INFO ContextCleaner: Cleaned shuffle 0 14/07/21 19:49:13 INFO ShuffleBlockManager: Deleted all files for shuffle 0 14/07/21 19:49:13 INFO HadoopRDD: Input split: hdfs://172.31.34.184:9000/etltest/hdfsData/customer.csv:25+5 14/07/21 19:49:13 INFO HadoopRDD: Input split: hdfs://172.31.34.184:9000/etltest/hdfsData/customer.csv:20+5 14/07/21 19:49:13 INFO TableOutputFormat: Created table instance for hdfstest_customers 14/07/21 19:49:13 INFO Executor: Serialized size of result for 4 is 596 14/07/21 19:49:13 INFO Executor: Sending result for 4 directly to driver 14/07/21 19:49:13 INFO Executor: Finished task ID 4 14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on executor localhost: localhost (PROCESS_LOCAL) 14/07/21 19:49:13 INFO TaskSetManager: Serialized task 0.0:6 as 11885 bytes in 0 ms 14/07/21 19:49:13 INFO Executor: Running task ID 6 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 4) 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 4 in 80 ms on localhost (progress: 5/106) 14/07/21 19:49:13 INFO TableOutputFormat: Created table instance for hdfstest_customers 14/07/21 19:49:13 INFO Executor: Serialized size of result for 5 is 596 14/07/21 19:49:13 INFO Executor: Sending result for 5 directly to driver 14/07/21 19:49:13 INFO Executor: Finished task ID 5 14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:7 as TID 7 on executor localhost: localhost (PROCESS_LOCAL) 14/07/21 19:49:13 INFO TaskSetManager: Serialized task 0.0:7 as 11885 bytes in 0 ms 14/07/21 19:49:13 INFO Executor: Running task ID 7 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 5) 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 5 in 77 ms on localhost (progress: 6/106) 14/07/21 19:49:13 INFO HttpBroadcast: Started reading broadcast variable 0 14/07/21 19:49:13 INFO HttpBroadcast: Started reading broadcast variable 0 14/07/21 19:49:13 ERROR Executor: Exception in task ID 6 java.io.FileNotFoundException: http://172.31.34.174:52070/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream (http://www.protocol.http.HttpURLConnection.getInputStream)(HttpURLConnection.java:1624) at 
org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:196) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:89) at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at
Re[2]: HBase 0.96+ with Spark 1.0+
Hi guys, any luck with this issue, anyone? I aswell tried all the possible exclusion combos to a no avail. thanks for your ideas reinis -Original-Nachricht- Von: Stephen Boesch java...@gmail.com An: user user@spark.apache.org Datum: 28-06-2014 15:12 Betreff: Re: HBase 0.96+ with Spark 1.0+ Hi Siyuan, Thanks for the input. We are preferring to use the SparkBuild.scala instead of maven. I did not see any protobuf.version related settings in that file. But - as noted by Sean Owen - in any case the issue we are facing presently is about the duplicate incompatible javax.servlet entries - apparently from the org.mortbay artifacts. 2014-06-28 6:01 GMT-07:00 Siyuan he hsy...@gmail.com: Hi Stephen, I am using spark1.0+ HBase0.96.2. This is what I did: 1) rebuild spark using: mvn -Dhadoop.version=2.3.0 -Dprotobuf.version=2.5.0 -DskipTests clean package 2) In spark-env.sh, set SPARK_CLASSPATH = /path-to/hbase-protocol-0.96.2-hadoop2.jar Hopefully it can help. Siyuan On Sat, Jun 28, 2014 at 8:52 AM, Stephen Boesch java...@gmail.com wrote: Thanks Sean. I had actually already added exclusion rule for org.mortbay.jetty - and that had not resolved it. Just in case I used your precise formulation: val excludeMortbayJetty = ExclusionRule(organization = org.mortbay.jetty) .. ,(org.apache.spark % spark-core_2.10 % sparkVersion withSources()).excludeAll(excludeMortbayJetty) ,(org.apache.spark % spark-sql_2.10 % sparkVersion withSources()).excludeAll(excludeMortbayJetty) However the same error still recurs: 14/06/28 05:48:35 INFO HttpServer: Starting HTTP Server [error] (run-main-0) java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package 2014-06-28 4:22 GMT-07:00 Sean Owen so...@cloudera.com: This sounds like an instance of roughly the same item as in https://issues.apache.org/jira/browse/SPARK-1949 Have a look at adding that exclude to see if it works. On Fri, Jun 27, 2014 at 10:21 PM, Stephen Boesch java...@gmail.com wrote: The present trunk is built and tested against HBase 0.94. I have tried various combinations of versions of HBase 0.96+ and Spark 1.0+ and all end up with 14/06/27 20:11:15 INFO HttpServer: Starting HTTP Server [error] (run-main-0) java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:952) I have tried a number of different ways to exclude javax.servlet related jars. But none have avoided this error. Anyone have a (small-ish) build.sbt that works with later versions of HBase? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
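Since the list archive has stripped the quotes out of the sbt snippets quoted above, here is roughly what the exclusion under discussion looks like in a build.sbt (versions and the extra javax.servlet rule are examples; whether this resolves the signer error depends on which artifact actually drags in the conflicting classes):

    val sparkVersion = "1.0.0"   // example
    val excludeMortbayJetty = ExclusionRule(organization = "org.mortbay.jetty")
    val excludeJavaxServlet = ExclusionRule(organization = "javax.servlet")   // sometimes needed as well

    libraryDependencies ++= Seq(
      ("org.apache.spark" % "spark-core_2.10" % sparkVersion withSources())
        .excludeAll(excludeMortbayJetty, excludeJavaxServlet),
      ("org.apache.spark" % "spark-sql_2.10" % sparkVersion withSources())
        .excludeAll(excludeMortbayJetty, excludeJavaxServlet)
    )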
Re: JMXSink for YARN deployment
Hi again, yeah , I've tried to use ” spark.metrics.conf” before my question in ML, had no luck:( Any other ideas from somebody? Seems nobody use metrics in YARN deployment mode. How about Mesos? I didn't try but maybe Spark has the same difficulties on Mesos? PS: Spark is great thing in general, will be nice to see metrics in YARN/Mesos mode, not only in Standalone:) On Thu, Sep 11, 2014 at 5:25 PM, Shao, Saisai saisai.s...@intel.com wrote: I think you can try to use ” spark.metrics.conf” to manually specify the path of metrics.properties, but the prerequisite is that each container should find this file in their local FS because this file is loaded locally. Besides I think this might be a kind of workaround, a better solution is to fix this by some other solutions. Thanks Jerry *From:* Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com] *Sent:* Thursday, September 11, 2014 10:08 PM *Cc:* user@spark.apache.org *Subject:* Re: JMXSink for YARN deployment Hi Shao, thx for explanation, any ideas how to fix it? Where should I put metrics.properties file? On Thu, Sep 11, 2014 at 4:18 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, I’m guessing the problem is that driver or executor cannot get the metrics.properties configuration file in the yarn container, so metrics system cannot load the right sinks. Thanks Jerry *From:* Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com] *Sent:* Thursday, September 11, 2014 7:30 PM *To:* user@spark.apache.org *Subject:* JMXSink for YARN deployment Hello, we are in Sematext (https://apps.sematext.com/) are writing Monitoring tool for Spark and we came across one question: How to enable JMX metrics for YARN deployment? We put *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink to file $SPARK_HOME/conf/metrics.properties but it doesn't work. Everything works in Standalone mode, but not in YARN mode. Can somebody help? Thx! PS: I've found also https://stackoverflow.com/questions/23529404/spark-on-yarn-how-to-send-metrics-to-graphite-sink/25786112 without answer.
Re: JMXSink for YARN deployment
Hi Vladimir How about use --files option with spark-submit? - Kousuke (2014/09/11 23:43), Vladimir Tretyakov wrote: Hi again, yeah , I've tried to use ” spark.metrics.conf” before my question in ML, had no luck:( Any other ideas from somebody? Seems nobody use metrics in YARN deployment mode. How about Mesos? I didn't try but maybe Spark has the same difficulties on Mesos? PS: Spark is great thing in general, will be nice to see metrics in YARN/Mesos mode, not only in Standalone:) On Thu, Sep 11, 2014 at 5:25 PM, Shao, Saisai saisai.s...@intel.com mailto:saisai.s...@intel.com wrote: I think you can try to use ”spark.metrics.conf” to manually specify the path of metrics.properties, but the prerequisite is that each container should find this file in their local FS because this file is loaded locally. Besides I think this might be a kind of workaround, a better solution is to fix this by some other solutions. Thanks Jerry *From:*Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com mailto:vladimir.tretya...@sematext.com] *Sent:* Thursday, September 11, 2014 10:08 PM *Cc:* user@spark.apache.org mailto:user@spark.apache.org *Subject:* Re: JMXSink for YARN deployment Hi Shao, thx for explanation, any ideas how to fix it? Where should I put metrics.properties file? On Thu, Sep 11, 2014 at 4:18 PM, Shao, Saisai saisai.s...@intel.com mailto:saisai.s...@intel.com wrote: Hi, I’m guessing the problem is that driver or executor cannot get the metrics.properties configuration file in the yarn container, so metrics system cannot load the right sinks. Thanks Jerry *From:*Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com mailto:vladimir.tretya...@sematext.com] *Sent:* Thursday, September 11, 2014 7:30 PM *To:* user@spark.apache.org mailto:user@spark.apache.org *Subject:* JMXSink for YARN deployment Hello, we are in Sematext (https://apps.sematext.com/) are writing Monitoring tool for Spark and we came across one question: How to enable JMX metrics for YARN deployment? We put *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink to file $SPARK_HOME/conf/metrics.properties but it doesn't work. Everything works in Standalone mode, but not in YARN mode. Can somebody help? Thx! PS: I've found also https://stackoverflow.com/questions/23529404/spark-on-yarn-how-to-send-metrics-to-graphite-sink/25786112 without answer.
Python execution support on clusters
Is there some doc that I missed that describes which execution engines Python is supported on with Spark? If we use spark-submit with a YARN cluster, an error is produced saying 'Error: Cannot currently run Python driver programs on cluster'. Thanks in advance David -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-execution-support-on-clusters-tp13977.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Table not found: using jdbc console to query sparksql hive thriftserver
Thank you!! I can do this using saveAsTable with the schemaRDD, right? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Table-not-found-using-jdbc-console-to-query-sparksql-hive-thriftserver-tp13840p13979.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
compiling spark source code
Hi, can someone please tell me how to compile the Spark source code so that my changes to the source take effect? I was trying to ship the jars to all the slaves, but in vain. -Karthik
Out of memory with Spark Streaming
I am running a simple Spark Streaming program that pulls in data from Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps data and persists to a store. The program is running in local mode right now and runs out of memory after a while. I am yet to investigate heap dumps but I think Spark isn't releasing memory after processing is complete. I have even tried changing storage level to disk only. Help! Thanks, Aniket
Re: efficient zipping of lots of RDDs
filed jira SPARK-3489 https://issues.apache.org/jira/browse/SPARK-3489 On Thu, Sep 4, 2014 at 9:36 AM, Mohit Jaggi mohitja...@gmail.com wrote: Folks, I sent an email announcing https://github.com/AyasdiOpenSource/df This dataframe is basically a map of RDDs of columns(along with DSL sugar), as column based operations seem to be most common. But row operations are not uncommon. To get rows out of columns right now I zip the column RDDs together. I use RDD.zip then flatten the tuples I get. I realize that RDD.zipPartitions might be faster. However, I believe an even better approach should be possible. Surely we can have a zip method that can combine a large variable number of RDDs? Can that be added to Spark-core? Or is there an alternative equally good or better approach? Cheers, Mohit.
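For anyone following along, a toy sketch of the zip-then-flatten approach described above (the column contents are made up; the RDDs must have the same number of partitions and the same element counts per partition):

    import org.apache.spark.rdd.RDD

    val ages: RDD[Int]     = sc.parallelize(Seq(30, 25, 41), 2)    // one "column"
    val names: RDD[String] = sc.parallelize(Seq("a", "b", "c"), 2) // another "column"

    // RDD.zip pairs the two columns element-wise into row tuples.
    val rows: RDD[(Int, String)] = ages.zip(names)

    // zipPartitions combines the iterators directly and avoids some per-element overhead,
    // but like zip it only accepts a small, fixed number of RDDs at a time.
    val rows2: RDD[(Int, String)] = ages.zipPartitions(names) { (it1, it2) => it1.zip(it2) }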
Re: Setting up jvm in pyspark from shell
The heap size of JVM can not been changed dynamically, so you need to config it before running pyspark. If you run it in local mode, you should config spark.driver.memory (in 1.1 or master). Or, you can use --driver-memory 2G (should work in 1.0+) On Wed, Sep 10, 2014 at 10:43 PM, Mohit Singh mohit1...@gmail.com wrote: Hi, I am using pyspark shell and am trying to create an rdd from numpy matrix rdd = sc.parallelize(matrix) I am getting the following error: JVMDUMP039I Processing dump event systhrow, detail java/lang/OutOfMemoryError at 2014/09/10 22:41:44 - please wait. JVMDUMP032I JVM requested Heap dump using '/global/u2/m/msingh/heapdump.20140910.224144.29660.0005.phd' in response to an event JVMDUMP010I Heap dump written to /global/u2/m/msingh/heapdump.20140910.224144.29660.0005.phd JVMDUMP032I JVM requested Java dump using '/global/u2/m/msingh/javacore.20140910.224144.29660.0006.txt' in response to an event JVMDUMP010I Java dump written to /global/u2/m/msingh/javacore.20140910.224144.29660.0006.txt JVMDUMP032I JVM requested Snap dump using '/global/u2/m/msingh/Snap.20140910.224144.29660.0007.trc' in response to an event JVMDUMP010I Snap dump written to /global/u2/m/msingh/Snap.20140910.224144.29660.0007.trc JVMDUMP013I Processed dump event systhrow, detail java/lang/OutOfMemoryError. Exception AttributeError: 'SparkContext' object has no attribute '_jsc' in bound method SparkContext.__del__ of pyspark.context.SparkContext object at 0x11f9450 ignored Traceback (most recent call last): File stdin, line 1, in module File /usr/common/usg/spark/1.0.2/python/pyspark/context.py, line 271, in parallelize jrdd = readRDDFromFile(self._jsc, tempFile.name, numSlices) File /usr/common/usg/spark/1.0.2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 537, in __call__ File /usr/common/usg/spark/1.0.2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile. : java.lang.OutOfMemoryError: Java heap space at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:279) at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:88) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55) at java.lang.reflect.Method.invoke(Method.java:618) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:804) I did try to setSystemProperty sc.setSystemProperty(spark.executor.memory, 20g) How do i increase jvm heap from the shell? -- Mohit When you want success as badly as you want the air, then you will get it. There is no other secret of success. -Socrates - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
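Concretely, the second option mentioned above would look something like this when launching the shell (2g is just an example value):

    ./bin/pyspark --driver-memory 2g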
Re: compiling spark source code
In the spark source folder, execute `sbt/sbt assembly` On Thu, Sep 11, 2014 at 8:27 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, Can someone please tell me how to compile the Spark source code so that my changes to the source take effect? I was trying to ship the jars to all the slaves, but in vain. -Karthik
Re: JMXSink for YARN deployment
Hi, Kousuke, Can you please explain in a bit more detail what you mean? I am new to Spark; I looked at https://spark.apache.org/docs/latest/submitting-applications.html and there seems to be no '--files' option. Do I just have to add '--files /path-to-metrics.properties'? Is that an undocumented ability? Thanks for the answer. On Thu, Sep 11, 2014 at 5:55 PM, Kousuke Saruta saru...@oss.nttdata.co.jp wrote: Hi Vladimir How about using the --files option with spark-submit? - Kousuke (2014/09/11 23:43), Vladimir Tretyakov wrote: Hi again, yeah, I've tried to use "spark.metrics.conf" before my question on the ML, but had no luck :( Any other ideas from somebody? It seems nobody uses metrics in YARN deployment mode. How about Mesos? I didn't try, but maybe Spark has the same difficulties on Mesos? PS: Spark is a great thing in general; it will be nice to see metrics in YARN/Mesos mode, not only in Standalone :) On Thu, Sep 11, 2014 at 5:25 PM, Shao, Saisai saisai.s...@intel.com wrote: I think you can try to use "spark.metrics.conf" to manually specify the path of metrics.properties, but the prerequisite is that each container should find this file in its local FS, because this file is loaded locally. Besides, I think this is a kind of workaround; a better solution would be to fix this some other way. Thanks Jerry *From:* Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com] *Sent:* Thursday, September 11, 2014 10:08 PM *Cc:* user@spark.apache.org *Subject:* Re: JMXSink for YARN deployment Hi Shao, thanks for the explanation, any ideas how to fix it? Where should I put the metrics.properties file? On Thu, Sep 11, 2014 at 4:18 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, I'm guessing the problem is that the driver or executor cannot get the metrics.properties configuration file in the YARN container, so the metrics system cannot load the right sinks. Thanks Jerry *From:* Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com] *Sent:* Thursday, September 11, 2014 7:30 PM *To:* user@spark.apache.org *Subject:* JMXSink for YARN deployment Hello, we at Sematext (https://apps.sematext.com/) are writing a monitoring tool for Spark and we came across one question: how to enable JMX metrics for a YARN deployment? We put *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink into the file $SPARK_HOME/conf/metrics.properties but it doesn't work. Everything works in Standalone mode, but not in YARN mode. Can somebody help? Thanks! PS: I've also found https://stackoverflow.com/questions/23529404/spark-on-yarn-how-to-send-metrics-to-graphite-sink/25786112 which has no answer.
Re: Spark on Raspberry Pi?
Limited memory could also cause you some problems and limit usability. If you're looking for a local testing environment, vagrant boxes may serve you much better. On Thu, Sep 11, 2014 at 6:18 AM, Chen He airb...@gmail.com wrote: Pi's bus speed, memory size and access speed, and processing ability are limited. The only benefit could be the power consumption. On Thu, Sep 11, 2014 at 8:04 AM, Sandeep Singh sand...@techaddict.me wrote: Has anyone tried using Raspberry Pi for Spark? How efficient is it to use around 10 Pi's for local testing env ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Raspberry-Pi-tp13965.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Out of memory with Spark Streaming
I did change it to 1 GB. It still ran out of memory, just a little later. The streaming job isn't handling a lot of data; it doesn't get more than 50 records every 2 seconds, and each record is no more than 500 bytes. On Sep 11, 2014 10:54 PM, Bharat Venkat bvenkat.sp...@gmail.com wrote: You could set spark.executor.memory to something bigger than the default (512 MB) On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am running a simple Spark Streaming program that pulls in data from Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps the data and persists it to a store. The program is running in local mode right now and runs out of memory after a while. I am yet to investigate heap dumps, but I think Spark isn't releasing memory after processing is complete. I have even tried changing the storage level to disk only. Help! Thanks, Aniket
Re: Spark on Raspberry Pi?
Just curious... What's the use case you are looking to implement? On Sep 11, 2014 10:50 PM, Daniil Osipov daniil.osi...@shazam.com wrote: Limited memory could also cause you some problems and limit usability. If you're looking for a local testing environment, vagrant boxes may serve you much better. On Thu, Sep 11, 2014 at 6:18 AM, Chen He airb...@gmail.com wrote: Pi's bus speed, memory size and access speed, and processing ability are limited. The only benefit could be the power consumption. On Thu, Sep 11, 2014 at 8:04 AM, Sandeep Singh sand...@techaddict.me wrote: Has anyone tried using Raspberry Pi for Spark? How efficient is it to use around 10 Pi's for a local testing env? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Raspberry-Pi-tp13965.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark on Raspberry Pi?
We've found that the Raspberry Pi is not enough for Hadoop/Spark, mainly because of the memory consumption. What we've built is a cluster formed from 22 Cubieboards, each containing 1 GB of RAM. Best regards, -chanwit -- Chanwit Kaewkasi linkedin.com/in/chanwit On Thu, Sep 11, 2014 at 8:04 PM, Sandeep Singh sand...@techaddict.me wrote: Has anyone tried using Raspberry Pi for Spark? How efficient is it to use around 10 Pi's for a local testing env? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Raspberry-Pi-tp13965.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark SQL JDBC
Even when I comment out those 3 lines, I still get the same error. Did someone solve this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-JDBC-tp11369p13992.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark on yarn history server + hdfs permissions issue
To answer my own question, in case someone else runs into this: the spark user needs to be in the same group on the namenode, and HDFS caches that information for what seems like at least an hour. It magically started working on its own. Greg From: Greg greg.h...@rackspace.com Date: Tuesday, September 9, 2014 2:30 PM To: user@spark.apache.org Subject: spark on yarn history server + hdfs permissions issue I am running Spark on YARN with the HDP 2.1 technical preview. I'm having issues getting the Spark history server permission to read the Spark event logs from HDFS. Both sides are configured to write/read logs from: hdfs:///apps/spark/events The history server is running as user spark, the jobs are running as user lavaqe. Both users are in the hdfs group on all the nodes in the cluster. That root logs folder is globally writeable, but owned by the spark user: drwxrwxrwx - spark hdfs 0 2014-09-09 18:19 /apps/spark/events All good so far. Spark jobs create subfolders and put their event logs in there just fine. The problem is that the history server, running as the spark user, cannot read those logs. They're written as the user that initiates the job, but still in the same hdfs group: drwxrwx--- - lavaqe hdfs 0 2014-09-09 19:24 /apps/spark/events/spark-pi-1410290714996 The files are group readable/writable, but this is the error I get: Permission denied: user=spark, access=READ_EXECUTE, inode=/apps/spark/events/spark-pi-1410290714996:lavaqe:hdfs:drwxrwx--- So, two questions, I guess: 1. Do group permissions just plain not work in hdfs or am I missing something? 2. Is there a way to tell Spark to log with more permissive permissions so the history server can read the generated logs? Greg
Re: Re[2]: HBase 0.96+ with Spark 1.0+
Dependency hell... My fav problem :). I had run into a similar issue with hbase and jetty. I cant remember thw exact fix, but is are excerpts from my dependencies that may be relevant: val hadoop2Common = org.apache.hadoop % hadoop-common % hadoop2Version excludeAll( ExclusionRule(organization = javax.servlet), ExclusionRule(organization = javax.servlet.jsp), ExclusionRule(organization = org.mortbay.jetty) ) val hadoop2MapRedClient = org.apache.hadoop % hadoop-mapreduce-client-core % hadoop2Version val hbase = org.apache.hbase % hbase % hbaseVersion excludeAll( ExclusionRule(organization = org.apache.maven.wagon), ExclusionRule(organization = org.jboss.netty), ExclusionRule(organization = org.mortbay.jetty), ExclusionRule(organization = org.jruby) // Don't need HBASE's jruby. It pulls in whole lot of other dependencies like joda-time. ) val sparkCore = org.apache.spark %% spark-core % sparkVersion val sparkStreaming = org.apache.spark %% spark-streaming % sparkVersion val sparkSQL = org.apache.spark %% spark-sql % sparkVersion val sparkHive = org.apache.spark %% spark-hive % sparkVersion val sparkRepl = org.apache.spark %% spark-repl % sparkVersion val sparkAll = Seq ( sparkCore excludeAll( ExclusionRule(organization = org.apache.hadoop)), // We assume hadoop 2 and hence omit hadoop 1 dependencies sparkSQL, sparkStreaming, hadoop2MapRedClient, hadoop2Common, org.mortbay.jetty % servlet-api % 3.0.20100224 ) On Sep 11, 2014 8:05 PM, sp...@orbit-x.de wrote: Hi guys, any luck with this issue, anyone? I aswell tried all the possible exclusion combos to a no avail. thanks for your ideas reinis -Original-Nachricht- Von: Stephen Boesch java...@gmail.com An: user user@spark.apache.org Datum: 28-06-2014 15:12 Betreff: Re: HBase 0.96+ with Spark 1.0+ Hi Siyuan, Thanks for the input. We are preferring to use the SparkBuild.scala instead of maven. I did not see any protobuf.version related settings in that file. But - as noted by Sean Owen - in any case the issue we are facing presently is about the duplicate incompatible javax.servlet entries - apparently from the org.mortbay artifacts. 2014-06-28 6:01 GMT-07:00 Siyuan he hsy...@gmail.com: Hi Stephen, I am using spark1.0+ HBase0.96.2. This is what I did: 1) rebuild spark using: mvn -Dhadoop.version=2.3.0 -Dprotobuf.version=2.5.0 -DskipTests clean package 2) In spark-env.sh, set SPARK_CLASSPATH = /path-to/hbase-protocol-0.96.2-hadoop2.jar Hopefully it can help. Siyuan On Sat, Jun 28, 2014 at 8:52 AM, Stephen Boesch java...@gmail.com wrote: Thanks Sean. I had actually already added exclusion rule for org.mortbay.jetty - and that had not resolved it. Just in case I used your precise formulation: val excludeMortbayJetty = ExclusionRule(organization = org.mortbay.jetty) .. 
,(org.apache.spark % spark-core_2.10 % sparkVersion withSources()).excludeAll(excludeMortbayJetty) ,(org.apache.spark % spark-sql_2.10 % sparkVersion withSources()).excludeAll(excludeMortbayJetty) However the same error still recurs: 14/06/28 05:48:35 INFO HttpServer: Starting HTTP Server [error] (run-main-0) java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package 2014-06-28 4:22 GMT-07:00 Sean Owen so...@cloudera.com: This sounds like an instance of roughly the same item as in https://issues.apache.org/jira/browse/SPARK-1949 Have a look at adding that exclude to see if it works. On Fri, Jun 27, 2014 at 10:21 PM, Stephen Boesch java...@gmail.com wrote: The present trunk is built and tested against HBase 0.94. I have tried various combinations of versions of HBase 0.96+ and Spark 1.0+ and all end up with 14/06/27 20:11:15 INFO HttpServer: Starting HTTP Server [error] (run-main-0) java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:952) I have tried a number of different ways to exclude javax.servlet related jars. But none have avoided this error. Anyone have a (small-ish) build.sbt that works with later versions of HBase? - To unsubscribe, e-mail:
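For readers hitting the same thing: the dependency snippets quoted in this thread lost their quoting in the mail formatting. A cleaned-up sketch of the same exclusion idea for an sbt build (version strings are placeholders, not recommendations) looks roughly like:

    // build.sbt sketch; adjust versions to your cluster.
    val hadoop2Version = "2.3.0"
    val hbaseVersion   = "0.96.2-hadoop2"
    val sparkVersion   = "1.0.2"

    libraryDependencies ++= Seq(
      ("org.apache.hadoop" % "hadoop-common" % hadoop2Version).excludeAll(
        ExclusionRule(organization = "javax.servlet"),
        ExclusionRule(organization = "javax.servlet.jsp"),
        ExclusionRule(organization = "org.mortbay.jetty")
      ),
      "org.apache.hadoop" % "hadoop-mapreduce-client-core" % hadoop2Version,
      ("org.apache.hbase" % "hbase" % hbaseVersion).excludeAll(
        ExclusionRule(organization = "org.mortbay.jetty"),
        ExclusionRule(organization = "org.jruby")  // HBase's jruby pulls in many extra dependencies
      ),
      ("org.apache.spark" %% "spark-core" % sparkVersion).excludeAll(
        ExclusionRule(organization = "org.mortbay.jetty")
      )
    )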
Network requirements between Driver, Master, and Slave
Hello all, I'm trying to run a Driver on my local network with a deployment on EC2 and it's not working. I was wondering if either the master or slave instances (in standalone) connect back to the driver program. I outlined the details of my observations in a previous post but here is what I'm seeing: I have v1.1.0 installed (the new tag) on ec2 using the spark-ec2 script. I have the same version of the code built locally. I edited the master security group to allow inbound access from anywhere to 7077 and 8080. I see a connection take place. I see the workers fail with a timeout when any job is run. The master eventually removes the driver's job. I supposed this makes sense if there's a requirement for either the worker or the master to be on the same network as the driver. Is that the case? Thanks Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Network-requirements-between-Driver-Master-and-Slave-tp13997.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
SparkSQL HiveContext TypeTag compile error
Hi, I have the following code snippet. It works fine in spark-shell, but in a standalone app it reports "No TypeTag available for MySchema" at compile time when calling hc.createSchemaRDD(rdd). Does anybody know what might be missing? Thanks, Du -- import org.apache.spark.sql.hive.HiveContext case class MySchema(key: Int, value: String) val rdd = sc.parallelize((1 to 10).map(i => MySchema(i, s"val$i"))) val schemaRDD = hc.createSchemaRDD(rdd) schemaRDD.registerTempTable("data") val rows = hc.sql("select * from data") rows.collect.foreach(println)
Re: SchemaRDD saveToCassandra
This might be a better question to ask on the cassandra mailing list as I believe that is where the exception is coming from. On Thu, Sep 11, 2014 at 2:37 AM, lmk lakshmi.muralikrish...@gmail.com wrote: Hi, My requirement is to extract certain fields from json files, run queries on them and save the result to cassandra. I was able to parse json , filter the result and save the rdd(regular) to cassandra. Now, when I try to read the json file through sqlContext , execute some queries on the same and then save the SchemaRDD to cassandra using saveToCassandra function, I am getting the following error: java.lang.NoSuchMethodException: Cannot resolve any suitable constructor for class org.apache.spark.sql.catalyst.expressions.Row Pls let me know if a spark SchemaRDD can be directly saved to cassandra just like the regular rdd? If that is not possible, is there any way to convert the schema RDD to a regular RDD ? Please advise. Regards, lmk -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-saveToCassandra-tp13951.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
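On the "convert the SchemaRDD to a regular RDD" part of the question: a SchemaRDD is an RDD of Row objects, so one workaround is to map the rows into a plain case-class RDD before calling saveToCassandra. A sketch only, assuming the spark-cassandra-connector is on the classpath, that sqlContext is the SQLContext already created in the app, and with made-up column, keyspace and table names:

    import com.datastax.spark.connector._   // brings saveToCassandra into scope

    case class Record(id: Int, name: String)

    // Hypothetical query; the schema (id: Int, name: String) is assumed.
    val schemaRdd = sqlContext.sql("SELECT id, name FROM data")

    // Map the Rows into a regular RDD of case class instances.
    val plainRdd = schemaRdd.map(row => Record(row.getInt(0), row.getString(1)))

    plainRdd.saveToCassandra("my_keyspace", "my_table")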
Reading from multiple sockets
Still fairly new to Spark so please bear with me. I am trying to write a streaming app that has multiple workers that read from sockets and process the data. Here is a very simplified version of what I am trying to do: val carStreamSeq = (1 to 2).map(_ => ssc.socketTextStream(host, port)).toArray val unionCarStream = ssc.union(carStreamSeq) val connectedCars = unionCarStream.count() connectedCars.foreachRDD(r => println("count: " + r.collect().mkString)) I can see the workers are running and the data is coming through, but anything I put inside the 'foreachRDD' does not seem to get executed. I don't see the output of println in either stdout or in the master output file. What am I doing wrong? Varad -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Reading-from-multiple-sockets-tp14000.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: cannot read file from a local path
I am seeing this same issue with Spark 1.0.1 (tried with file:// for local file ) : scala val lines = sc.textFile(file:///home/monir/.bashrc) lines: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at console:12 scala val linecount = lines.count org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/monir/.bashrc at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:175) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) -Original Message- From: wsun Sent: Feb 03, 2014; 12:44pm To: u...@spark.incubator.apache.org Subject: cannot read file form a local path After installing spark 0.8.1 on a EC2 cluster, I launched Spark shell on the master. This is what happened to me: scalaval textFile=sc.textFile(README.md) 14/02/03 20:38:08 INFO storage.MemoryStore: ensureFreeSpace(34380) called with c urMem=0, maxMem=4082116853 14/02/03 20:38:08 INFO storage.MemoryStore: Block broadcast_0 stored as values t o memory (estimated size 33.6 KB, free 3.8 GB) textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at consol e:12 scala textFile.count() 14/02/03 20:38:39 WARN snappy.LoadSnappy: Snappy native library is available 14/02/03 20:38:39 INFO util.NativeCodeLoader: Loaded the native-hadoop library 14/02/03 20:38:39 INFO snappy.LoadSnappy: Snappy native library loaded org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs: //ec2-54-234-136-50.compute-1.amazonaws.com:9000/user/root/README.md at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.j ava:197) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.ja va:208) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:141) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199) at scala.Option.getOrElse(Option.scala:108) at org.apache.spark.rdd.RDD.partitions(RDD.scala:199) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:26) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199) at scala.Option.getOrElse(Option.scala:108) at org.apache.spark.rdd.RDD.partitions(RDD.scala:199) at org.apache.spark.SparkContext.runJob(SparkContext.scala:886) at org.apache.spark.rdd.RDD.count(RDD.scala:698) Spark seems looking for README.md in hdfs. However, I did not specify the file is located in hdfs. I am just wondering if there any configuration in Spark that force Spark to read files from local file system. Thanks in advance for any helps. wp - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark SQL and running parquet tables?
Michael Armbrust wrote You'll need to run parquetFile(path).registerTempTable(name) to refresh the table. I'm not seeing that function on SchemaRDD in 1.0.2, is there something I'm missing? SchemaRDD Scaladoc http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-and-running-parquet-tables-tp13987p14002.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
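For anyone else hitting this: in the 1.0.x line the method on SchemaRDD is spelled registerAsTable; registerTempTable is the 1.1 name (with the old spelling kept, as far as I can tell, as a deprecated alias). A sketch, assuming a path that really contains Parquet data and a sqlContext created elsewhere:

    // Re-read and re-register the Parquet data to refresh the table.
    val parquet = sqlContext.parquetFile("/path/to/table.parquet")

    parquet.registerAsTable("name")     // Spark 1.0.x spelling
    parquet.registerTempTable("name")   // Spark 1.1.0 spelling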
Re: SparkSQL HiveContext TypeTag compile error
Solved it. The problem occurred because the case class was defined within a test case in FunSuite. Moving the case class definition out of the test fixed the problem. From: Du Li l...@yahoo-inc.com.INVALID Date: Thursday, September 11, 2014 at 11:25 AM To: user@spark.apache.org Subject: SparkSQL HiveContext TypeTag compile error Hi, I have the following code snippet. It works fine in spark-shell, but in a standalone app it reports "No TypeTag available for MySchema" at compile time when calling hc.createSchemaRDD(rdd). Does anybody know what might be missing? Thanks, Du -- import org.apache.spark.sql.hive.HiveContext case class MySchema(key: Int, value: String) val rdd = sc.parallelize((1 to 10).map(i => MySchema(i, s"val$i"))) val schemaRDD = hc.createSchemaRDD(rdd) schemaRDD.registerTempTable("data") val rows = hc.sql("select * from data") rows.collect.foreach(println)
single worker vs multiple workers on each machine
Hi There, I am new to Spark. When you have a lot of memory on each machine of the cluster, is it better to run multiple workers with limited memory on each machine, or a single worker with access to the majority of the machine's memory? If the answer is "it depends", would you please elaborate? Thanks, Mike
spark sql - create new_table as select * from table
Hi, I am trying to create a new table from a select query as follows: CREATE TABLE IF NOT EXISTS new_table ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/user/test/new_table' AS select * from table this works in Hive, but in Spark SQL (1.0.2) I am getting Unsupported language features in query error. Could you suggest why I am getting this? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-new-table-as-select-from-table-tp14006.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark sql - create new_table as select * from table
The implementation of SparkSQL is currently incomplete. You may try it out with HiveContext instead of SQLContext. On 9/11/14, 1:21 PM, jamborta jambo...@gmail.com wrote: Hi, I am trying to create a new table from a select query as follows: CREATE TABLE IF NOT EXISTS new_table ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/user/test/new_table' AS select * from table this works in Hive, but in Spark SQL (1.0.2) I am getting Unsupported language features in query error. Could you suggest why I am getting this? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-new-t able-as-select-from-table-tp14006.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SparkSQL HiveContext TypeTag compile error
Just moving it out of the test is not enough. You must move the case class definition to the top level. Otherwise it reports a runtime "task not serializable" error when executing collect(). From: Du Li l...@yahoo-inc.com.INVALID Date: Thursday, September 11, 2014 at 12:33 PM To: user@spark.apache.org Subject: Re: SparkSQL HiveContext TypeTag compile error Solved it. The problem occurred because the case class was defined within a test case in FunSuite. Moving the case class definition out of the test fixed the problem. From: Du Li l...@yahoo-inc.com.INVALID Date: Thursday, September 11, 2014 at 11:25 AM To: user@spark.apache.org Subject: SparkSQL HiveContext TypeTag compile error Hi, I have the following code snippet. It works fine in spark-shell, but in a standalone app it reports "No TypeTag available for MySchema" at compile time when calling hc.createSchemaRDD(rdd). Does anybody know what might be missing? Thanks, Du -- import org.apache.spark.sql.hive.HiveContext case class MySchema(key: Int, value: String) val rdd = sc.parallelize((1 to 10).map(i => MySchema(i, s"val$i"))) val schemaRDD = hc.createSchemaRDD(rdd) schemaRDD.registerTempTable("data") val rows = hc.sql("select * from data") rows.collect.foreach(println)
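A sketch of the arrangement that finally worked, per the messages above: the case class sits at the top level of the file, outside the suite. The suite, app and table names here are invented, and the HiveContext setup is only illustrative:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext
    import org.scalatest.FunSuite

    // Top level of the file -- not inside the suite, not inside a test.
    case class MySchema(key: Int, value: String)

    class MySchemaSuite extends FunSuite {
      private val sc = new SparkContext("local[2]", "MySchemaSuite")
      private val hc = new HiveContext(sc)

      test("createSchemaRDD with a top-level case class") {
        val rdd = sc.parallelize((1 to 10).map(i => MySchema(i, s"val$i")))
        val schemaRDD = hc.createSchemaRDD(rdd)
        schemaRDD.registerTempTable("data")   // registerAsTable on 1.0.x
        hc.sql("select * from data").collect().foreach(println)
      }
    }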
Re: spark sql - create new_table as select * from table
Thanks. This was actually using HiveContext. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-new-table-as-select-from-table-tp14006p14009.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: cannot read file from a local path
It seems starting spark-shell in local mode solves this. But even then it cannot read a file whose name begins with a '.': MASTER=local[4] ./bin/spark-shell . scala val lineCount = sc.textFile(/home/monir/ref).count lineCount: Long = 68 scala val lineCount2 = sc.textFile(/home/monir/.ref).count org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/monir/.ref Though I am OK with running spark-shell in local mode to get basic examples running, I was wondering if getting to local files on the cluster nodes is possible when all of the worker nodes have the file in question in their local file system. Still fairly new to Spark, so bear with me if this is easily tunable by some config params. Bests, -Monir -Original Message- From: Mozumder, Monir Sent: Thursday, September 11, 2014 12:15 PM To: user@spark.apache.org Subject: RE: cannot read file from a local path I am seeing this same issue with Spark 1.0.1 (tried with file:// for a local file): scala val lines = sc.textFile(file:///home/monir/.bashrc) lines: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at console:12 scala val linecount = lines.count org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/monir/.bashrc at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:175) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) -Original Message- From: wsun Sent: Feb 03, 2014; 12:44pm To: u...@spark.incubator.apache.org Subject: cannot read file from a local path After installing spark 0.8.1 on a EC2 cluster, I launched Spark shell on the master.
This is what happened to me: scalaval textFile=sc.textFile(README.md) 14/02/03 20:38:08 INFO storage.MemoryStore: ensureFreeSpace(34380) called with c urMem=0, maxMem=4082116853 14/02/03 20:38:08 INFO storage.MemoryStore: Block broadcast_0 stored as values t o memory (estimated size 33.6 KB, free 3.8 GB) textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at consol e:12 scala textFile.count() 14/02/03 20:38:39 WARN snappy.LoadSnappy: Snappy native library is available 14/02/03 20:38:39 INFO util.NativeCodeLoader: Loaded the native-hadoop library 14/02/03 20:38:39 INFO snappy.LoadSnappy: Snappy native library loaded org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs: //ec2-54-234-136-50.compute-1.amazonaws.com:9000/user/root/README.md at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.j ava:197) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.ja va:208) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:141) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199) at scala.Option.getOrElse(Option.scala:108) at org.apache.spark.rdd.RDD.partitions(RDD.scala:199) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:26) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199) at scala.Option.getOrElse(Option.scala:108) at org.apache.spark.rdd.RDD.partitions(RDD.scala:199) at org.apache.spark.SparkContext.runJob(SparkContext.scala:886) at org.apache.spark.rdd.RDD.count(RDD.scala:698) Spark seems looking for README.md in hdfs. However, I did not specify the file is located in hdfs. I am just wondering if there any configuration in Spark that force Spark to read files from local file system. Thanks in advance for any helps. wp - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re[2]: HBase 0.96+ with Spark 1.0+
Thank you, Aniket for your hint! Alas, I am facing really hellish situation as it seems, because I have integration tests using BOTH spark and HBase (Minicluster). Thus I get either: class javax.servlet.ServletRegistration's signer information does not match signer information of other classes in the same package java.lang.SecurityException: class javax.servlet.ServletRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:943) at java.lang.ClassLoader.preDefineClass(ClassLoader.java:657) at java.lang.ClassLoader.defineClass(ClassLoader.java:785) or: [info] Cause: java.lang.ClassNotFoundException: org.mortbay.jetty.servlet.Context [info] at java.net.URLClassLoader$1.run(URLClassLoader.java:366) [info] at java.net.URLClassLoader$1.run(URLClassLoader.java:355) [info] at java.security.AccessController.doPrivileged(Native Method) [info] at java.net.URLClassLoader.findClass(URLClassLoader.java:354) [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:423) [info] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:356) [info] at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:661) [info] at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:552) [info] at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:720) I am searching the web already for a week trying to figure out how to make this work :-/ all the help or hints are greatly appreciated reinis -Original-Nachricht- Von: Aniket Bhatnagar aniket.bhatna...@gmail.com An: sp...@orbit-x.de Cc: user user@spark.apache.org Datum: 11-09-2014 20:00 Betreff: Re: Re[2]: HBase 0.96+ with Spark 1.0+ Dependency hell... My fav problem :). I had run into a similar issue with hbase and jetty. I cant remember thw exact fix, but is are excerpts from my dependencies that may be relevant: val hadoop2Common = org.apache.hadoop % hadoop-common % hadoop2Version excludeAll( ExclusionRule(organization = javax.servlet), ExclusionRule(organization = javax.servlet.jsp), ExclusionRule(organization = org.mortbay.jetty) ) val hadoop2MapRedClient = org.apache.hadoop % hadoop-mapreduce-client-core % hadoop2Version val hbase = org.apache.hbase % hbase % hbaseVersion excludeAll( ExclusionRule(organization = org.apache.maven.wagon), ExclusionRule(organization = org.jboss.netty), ExclusionRule(organization = org.mortbay.jetty), ExclusionRule(organization = org.jruby) // Don't need HBASE's jruby. It pulls in whole lot of other dependencies like joda-time. ) val sparkCore = org.apache.spark %% spark-core % sparkVersion val sparkStreaming = org.apache.spark %% spark-streaming % sparkVersion val sparkSQL = org.apache.spark %% spark-sql % sparkVersion val sparkHive = org.apache.spark %% spark-hive % sparkVersion val sparkRepl = org.apache.spark %% spark-repl % sparkVersion val sparkAll = Seq ( sparkCore excludeAll( ExclusionRule(organization = org.apache.hadoop)), // We assume hadoop 2 and hence omit hadoop 1 dependencies sparkSQL, sparkStreaming, hadoop2MapRedClient, hadoop2Common, org.mortbay.jetty % servlet-api % 3.0.20100224 ) On Sep 11, 2014 8:05 PM, sp...@orbit-x.de wrote: Hi guys, any luck with this issue, anyone? I aswell tried all the possible exclusion combos to a no avail. 
thanks for your ideas reinis -Original-Nachricht- Von: Stephen Boesch java...@gmail.com An: user user@spark.apache.org Datum: 28-06-2014 15:12 Betreff: Re: HBase 0.96+ with Spark 1.0+ Hi Siyuan, Thanks for the input. We are preferring to use the SparkBuild.scala instead of maven. I did not see any protobuf.version related settings in that file. But - as noted by Sean Owen - in any case the issue we are facing presently is about the duplicate incompatible javax.servlet entries - apparently from the org.mortbay artifacts. 2014-06-28 6:01 GMT-07:00 Siyuan he hsy...@gmail.com: Hi Stephen, I am using spark1.0+ HBase0.96.2. This is what I did: 1) rebuild spark using: mvn -Dhadoop.version=2.3.0 -Dprotobuf.version=2.5.0 -DskipTests clean package 2) In spark-env.sh, set SPARK_CLASSPATH = /path-to/hbase-protocol-0.96.2-hadoop2.jar Hopefully it can help. Siyuan On Sat, Jun 28, 2014 at 8:52 AM, Stephen Boesch java...@gmail.com wrote: Thanks Sean. I had actually already added exclusion rule for org.mortbay.jetty - and that had not resolved it. Just in case I used your precise formulation: val excludeMortbayJetty = ExclusionRule(organization = org.mortbay.jetty) .. ,(org.apache.spark % spark-core_2.10 % sparkVersion withSources()).excludeAll(excludeMortbayJetty) ,(org.apache.spark % spark-sql_2.10 %
Re: Out of memory with Spark Streaming
Which version of spark are you running? If you are running the latest one, then could try running not a window but a simple event count on every 2 second batch, and see if you are still running out of memory? TD On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I did change it to be 1 gb. It still ran out of memory but a little later. The streaming job isnt handling a lot of data. In every 2 seconds, it doesn't get more than 50 records. Each record size is not more than 500 bytes. On Sep 11, 2014 10:54 PM, Bharat Venkat bvenkat.sp...@gmail.com wrote: You could set spark.executor.memory to something bigger than the default (512mb) On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am running a simple Spark Streaming program that pulls in data from Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps data and persists to a store. The program is running in local mode right now and runs out of memory after a while. I am yet to investigate heap dumps but I think Spark isn't releasing memory after processing is complete. I have even tried changing storage level to disk only. Help! Thanks, Aniket
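A minimal sketch of the experiment suggested above: no window, no store, just count each 2-second batch and print it. createKinesisStream below is a hypothetical stand-in for however the job currently builds its input DStream; everything else is deliberately as small as possible:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.dstream.DStream

    val conf = new SparkConf().setAppName("count-only-test").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Hypothetical helper: however the application currently creates its Kinesis DStream.
    def createKinesisStream(ssc: StreamingContext): DStream[Array[Byte]] = ???

    val stream = createKinesisStream(ssc)
    stream.count().print()   // one count per 2-second batch, nothing else

    ssc.start()
    ssc.awaitTermination()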
Re: Spark streaming stops computing while the receiver keeps running without any errors reported
This is very puzzling, given that this works in the local mode. Does running the kinesis example work with your spark-submit? https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala The instructions are present in the streaming guide. https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md If that does not work on cluster, then I would see the streaming UI for the number records that are being received, and the stages page for whether jobs are being executed for every batch or not. Can tell use whether that is working well. Also ccing, chris fregly who wrote Kinesis integration. TD On Thu, Sep 11, 2014 at 4:51 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi all I am trying to run kinesis spark streaming application on a standalone spark cluster. The job works find in local mode but when I submit it (using spark-submit), it doesn't do anything. I enabled logs for org.apache.spark.streaming.kinesis package and I regularly get the following in worker logs: 14/09/11 11:41:25 DEBUG KinesisRecordProcessor: Stored: Worker x.x.x.x:b88a9210-cbb9-4c31-8da7-35fd92faba09 stored 34 records for shardId shardId- 14/09/11 11:41:25 DEBUG KinesisRecordProcessor: Stored: Worker x.x.x.x:b2e9c20f-470a-44fe-a036-630c776919fb stored 31 records for shardId shardId-0001 But the job does not perform any operations defined on DStream. To investigate this further, I changed the kinesis-asl's KinesisUtils to perform the following computation on the DStream created using ssc.receiverStream(new KinesisReceiver...): stream.count().foreachRDD(rdd = rdd.foreach(tuple = logInfo(Emitted + tuple))) Even the above line does not results in any corresponding log entries both in driver and worker logs. The only relevant logs that I could find in driver logs are: 14/09/11 11:40:58 INFO DAGScheduler: Stage 2 (foreach at KinesisUtils.scala:68) finished in 0.398 s 14/09/11 11:40:58 INFO SparkContext: Job finished: foreach at KinesisUtils.scala:68, took 4.926449985 s 14/09/11 11:40:58 INFO JobScheduler: Finished job streaming job 1410435653000 ms.0 from job set of time 1410435653000 ms 14/09/11 11:40:58 INFO JobScheduler: Starting job streaming job 1410435653000 ms.1 from job set of time 1410435653000 ms 14/09/11 11:40:58 INFO SparkContext: Starting job: foreach at KinesisUtils.scala:68 14/09/11 11:40:58 INFO DAGScheduler: Registering RDD 13 (union at DStream.scala:489) 14/09/11 11:40:58 INFO DAGScheduler: Got job 3 (foreach at KinesisUtils.scala:68) with 2 output partitions (allowLocal=false) 14/09/11 11:40:58 INFO DAGScheduler: Final stage: Stage 5(foreach at KinesisUtils.scala:68) After the above logs, nothing shows up corresponding to KinesisUtils. I am out of ideas on this one and any help on this would greatly appreciated. Thanks, Aniket
Re: Re[2]: HBase 0.96+ with Spark 1.0+
This was already answered at the bottom of this same thread -- read below. On Thu, Sep 11, 2014 at 9:51 PM, sp...@orbit-x.de wrote: class javax.servlet.ServletRegistration's signer information does not match signer information of other classes in the same package java.lang.SecurityException: class javax.servlet.ServletRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:943) at java.lang.ClassLoader.preDefineClass(ClassLoader.java:657) at java.lang.ClassLoader.defineClass(ClassLoader.java:785) 2014-06-28 4:22 GMT-07:00 Sean Owen so...@cloudera.com: This sounds like an instance of roughly the same item as in https://issues.apache.org/jira/browse/SPARK-1949 Have a look at adding that exclude to see if it works. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
SparkContext and multi threads
Hi, I'm trying to make Spark work in a multithreaded Java application. What I'm trying to do is: - create a single SparkContext - create multiple SparkILoop and SparkIMain instances - inject the created SparkContext into each SparkIMain interpreter. A thread is created for every user request; it takes a SparkILoop and interprets some code. My problem is: - if a thread takes the first SparkILoop instance, then everything works fine. - if a thread takes any other SparkILoop instance, Spark cannot find the closures / case classes that I defined inside the interpreter. I read some previous topics and I think it's related to SparkEnv and ClosureCleaner. I tried SparkEnv.set(env) with the env I can get right after the SparkContext is created, but I still get the class-not-found exception. Can anyone give me some idea? Thanks. Best, moon
Re: spark sql - create new_table as select * from table
What is the schema of table? On Thu, Sep 11, 2014 at 4:30 PM, jamborta jambo...@gmail.com wrote: thanks. this was actually using hivecontext. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-new-table-as-select-from-table-tp14006p14009.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Re: Spark SQL -- more than two tables for join
1.0.1 does not have the support on outer joins (added in 1.1). Can you try 1.1 branch? On Wed, Sep 10, 2014 at 9:28 PM, boyingk...@163.com boyingk...@163.com wrote: Hi,michael : I think Arthur.hk.chan arthur.hk.c...@gmail.com isn't here now,I Can Show something: 1)my spark version is 1.0.1 2) when I use multiple join ,like this: sql(SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on (youhao_age.rowkey=youhao_totalKiloMeter.rowkey)) youhao_data,youhao_age,youhao_totalKiloMeter were registerAsTable 。 I take the Exception: Exception in thread main java.lang.RuntimeException: [1.90] failure: ``UNION'' expected but `left' found SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on (youhao_age.rowkey=youhao_totalKiloMeter.rowkey) ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:69) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:181) at org.apache.spark.examples.sql.SparkSQLHBaseRelation$.main(SparkSQLHBaseRelation.scala:140) at org.apache.spark.examples.sql.SparkSQLHBaseRelation.main(SparkSQLHBaseRelation.scala) -- boyingk...@163.com *From:* Michael Armbrust mich...@databricks.com *Date:* 2014-09-11 00:28 *To:* arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com *CC:* arunshell87 shell.a...@gmail.com; u...@spark.incubator.apache.org *Subject:* Re: Spark SQL -- more than two tables for join What version of Spark SQL are you running here? I think a lot of your concerns have likely been addressed in more recent versions of the code / documentation. (Spark 1.1 should be published in the next few days) In particular, for serious applications you should use a HiveContext and HiveQL as this is a much more complete implementation of a SQL Parser. The one in SQL context is only suggested if the Hive dependencies conflict with your application. 1) spark sql does not support multiple join This is not true. What problem were you running into? 2) spark left join: has performance issue Can you describe your data and query more? 3) spark sql’s cache table: does not support two-tier query I'm not sure what you mean here. 4) spark sql does not support repartition You can repartition SchemaRDDs in the same way as normal RDDs.
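For reference, a sketch of the same three-table query against a Spark 1.1 SQLContext, which (per the reply above) is where outer-join support landed in the SQL parser; the table and column names are taken from the quoted message, and the joins are spelled out as LEFT OUTER JOIN:

    val joined = sqlContext.sql("""
      SELECT *
      FROM youhao_data
      LEFT OUTER JOIN youhao_age ON youhao_data.rowkey = youhao_age.rowkey
      LEFT OUTER JOIN youhao_totalKiloMeter ON youhao_age.rowkey = youhao_totalKiloMeter.rowkey
    """)
    joined.collect().foreach(println)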
Spark Streaming in 1 hour batch duration RDD files gets lost
Hi, Our spark streaming app is configured to pull data from Kafka in 1 hour batch duration which performs aggregation of data by specific keys and store the related RDDs to HDFS in the transform phase. We have tried checkpoint of 7 days on the DStream of Kafka to ensure that the generated stream does not expire/lost. The first hour gets completed, but on the succeeding hours it always fails with exception: Job aborted due to stage failure: Task 39.0:1 failed 64 times, most recent failure: Exception failure in TID 27578 on host X.ec2.internal: java.io.FileNotFoundException: /data/run/spark/work/spark-local-20140911175744-4ddf/0d/shuffle_3_1_311 (No such file or directory) java.io.FileOutputStream.open(Native Method) java.io.FileOutputStream.init(FileOutputStream.java:221) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177) org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161) org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158) scala.collection.Iterator$class.foreach(Iterator.scala:727) Environment: CDH version: 2.3.0-cdh5.1.0 Spark version: 1.0.0-cdh5.1.0 Spark settings: spark.io.compression.codec : org.apache.spark.io.SnappyCompressionCodec spark.serializer : org.apache.spark.serializer.KryoSerializer spark.kryoserializer.buffer.mb : 2 spark.local.dir : /data/run/spark/work/ spark.scheduler.mode : FAIR spark.rdd.compress : false spark.task.maxFailures : 64 spark.shuffle.use.netty : false spark.shuffle.spill : true spark.streaming.checkpoint.dir : hdfs://X.ec2.internal:8020/user/spark/checkpoints/event-storage spark.akka.threads : 4 spark.cores.max : 4 spark.executor.memory : 3g spark.shuffle.consolidateFiles : false spark.streaming.unpersist : true spark.logConf : true spark.shuffle.spill.compress : true Thanks, JL -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-in-1-hour-batch-duration-RDD-files-gets-lost-tp14027.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Backwards RDD
Iterating over an RDD gives you each partition in order of its split index. I'd like to be able to get each partition in reverse order, but I'm having difficulty implementing the compute() method. I thought I could do something like this: override def getDependencies: Seq[Dependency[_]] = { Seq(new NarrowDependency[T](prev) { def getParents(partitionId: Int): Seq[Int] = { Seq(prev.partitions.size - partitionId - 1) } }) } override def compute(split: Partition, context: TaskContext): Iterator[T] = { firstParent[T].iterator(split, context).toArray.reverseIterator } But that doesn't work. How do I get one split to depend on exactly one split from the parent that does not match indices?
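One way to express "one split depends on a different-index parent split" is to carry the remapped parent partition inside the child partition object, so compute() reads that parent partition instead of reusing its own index. A hedged sketch against the public RDD API (class and field names are invented), broadly similar to how Spark's own CoalescedRDD maps child partitions onto arbitrary parent partitions:

    import scala.reflect.ClassTag
    import org.apache.spark.{Dependency, NarrowDependency, Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    // Each child partition remembers which parent partition it should read.
    case class ReversedPartition(index: Int, parent: Partition) extends Partition

    class ReversedRDD[T: ClassTag](prev: RDD[T]) extends RDD[T](prev.context, Nil) {

      override def getPartitions: Array[Partition] = {
        val parents = prev.partitions
        parents.indices.map { i =>
          ReversedPartition(i, parents(parents.length - i - 1)): Partition
        }.toArray
      }

      override def getDependencies: Seq[Dependency[_]] = Seq(
        new NarrowDependency[T](prev) {
          def getParents(partitionId: Int): Seq[Int] =
            Seq(prev.partitions.length - partitionId - 1)
        }
      )

      override def compute(split: Partition, context: TaskContext): Iterator[T] = {
        // Read the remapped parent partition rather than the one matching the child's own index.
        val p = split.asInstanceOf[ReversedPartition]
        prev.iterator(p.parent, context).toArray.reverseIterator
      }
    }

    // Usage sketch: new ReversedRDD(someRdd).collect()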
Announcing Spark 1.1.0!
I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is the second release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 171 developers! This release brings operational and performance improvements in Spark core including a new implementation of the Spark shuffle designed for very large scale workloads. Spark 1.1 adds significant extensions to the newest Spark modules, MLlib and Spark SQL. Spark SQL introduces a JDBC server, byte code generation for fast expression evaluation, a public types API, JSON support, and other features and optimizations. MLlib introduces a new statistics library along with several new algorithms and optimizations. Spark 1.1 also builds out Spark's Python support and adds new components to the Spark Streaming module. Visit the release notes [1] to read about the new features, or download [2] the release today. [1] http://spark.eu.apache.org/releases/spark-release-1-1-0.html [2] http://spark.eu.apache.org/downloads.html NOTE: SOME ASF DOWNLOAD MIRRORS WILL NOT CONTAIN THE RELEASE FOR SEVERAL HOURS. Please e-mail me directly for any type-o's in the release notes or name listing. Thanks, and congratulations! - Patrick - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Configuring Spark for heterogeneous hardware
So I have a bunch of hardware with different core and memory setups. Is there a way to do one of the following: 1. Express a ratio of cores to memory to retain. The Spark worker config would represent all of the cores and all of the memory usable for any application, and the application would take a fraction that sustains the ratio. Say I have 4 cores and 20 GB of RAM. I'd like the worker to take 4/20 and the executor to take 5 GB for each of the 4 cores, thus maxing both out. If there were only 16 GB with the same ratio requirement, it would only take 3 cores and 12 GB in a single executor and leave the rest. 2. Have the executor take whole-number ratios of what it needs. Say it is configured for 2/8G and the worker has 4/20. So we can give the executor 2/8G (which is true now) or we can instead give it 4/16G, maxing out one of the two parameters. Either way would allow me to get my heterogeneous hardware all participating in the work of my Spark cluster, presumably without endangering Spark's assumption of homogeneous execution environments in the dimensions of memory and cores. If there's any way to do this, please enlighten me.
History server: ERROR ReplayListenerBus: Exception in parsing Spark event log
Hi, I am using Spark 1.0.2 on a mesos cluster. After I run my job, when I try to look at the detailed application stats using a history server@18080, the stats don't show up for some of the jobs even though the job completed successfully and the event logs are written to the log folder. The log from the history server execution is attached below - looks like it is encountering some parsing error when reading the EVENT_LOG file ( I have not modified this file). Basically the line that says Malformed line seems to be truncating the first path (instead of amd64, it shows up as a d64). Does the history server have any String buffer limitations that would be causing this problem? Also, I want to point out that this problem does not happen all the time - during some runs the app details do show up. However this is quite unpredictable. The same job when I ran using Spark 1.0.1 in standalone mode (i.e without using a history server), showed up on the application details page. I am not sure if this is a problem with the history server or specifically with version 1.0.2. Is it possible to fix this problem, as I would like to use the application details? thanks 14/09/11 20:50:55 ERROR ReplayListenerBus: Exception in parsing Spark event log file:/mapr/applogs_spark_mesos/spark_test-1410468489529/EVENT_LOG_1 com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'd64': was expecting at [Source: java.io.StringReader@2d51a56a; line: 1, column: 4] at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1524) at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:557) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._reportInvalidToken(ReaderBasedJsonParser.java:2042) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1412) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:679) at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3024) at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2971) at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2091) at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:19) at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:44) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55) at org.apache.spark.deploy.history.HistoryServer.org$apache$spark$deploy$history$HistoryServer$$renderSparkUI(HistoryServer.scala:182) at org.apache.spark.deploy.history.HistoryServer$$anonfun$checkForLogs$3.apply(HistoryServer.scala:149) at org.apache.spark.deploy.history.HistoryServer$$anonfun$checkForLogs$3.apply(HistoryServer.scala:146) at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.deploy.history.HistoryServer.checkForLogs(HistoryServer.scala:146) at org.apache.spark.deploy.history.HistoryServer$$anon$1$$anonfun$run$1.apply$mcV$sp(HistoryServer.scala:77) at org.apache.spark.deploy.history.HistoryServer$$anon$1$$anonfun$run$1.apply(HistoryServer.scala:74) at org.apache.spark.deploy.history.HistoryServer$$anon$1$$anonfun$run$1.apply(HistoryServer.scala:74) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) at org.apache.spark.deploy.history.HistoryServer$$anon$1.run(HistoryServer.scala:73) ReplayListenerBus: Malformed line: d64/jre/lib/jsse.jar:/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/jce.jar:/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/charsets.jar:/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/rhino.jar:/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/jfr.jar:/usr/lib/jvm/java-7-openjdk-amd64/jre/classes,file.encoding:ISO-8859-1,user.timezone:Etc/UTC,java.specification.vendor:Oracle Corporation,sun.java.launcher:SUN_STANDARD,os.version:3.13.0-32-generic,sun.os.patch.level:unknown,java.vm.specification.vendor:Oracle
RE: Announcing Spark 1.1.0!
I see the binary packages include hadoop 1, 2.3 and 2.4. Does Spark 1.1.0 support hadoop 2.5.0 at below address? http://hadoop.apache.org/releases.html#11+August%2C+2014%3A+Release+2.5.0+available -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Friday, September 12, 2014 8:13 AM To: d...@spark.apache.org; user@spark.apache.org Subject: Announcing Spark 1.1.0! I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is the second release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 171 developers! This release brings operational and performance improvements in Spark core including a new implementation of the Spark shuffle designed for very large scale workloads. Spark 1.1 adds significant extensions to the newest Spark modules, MLlib and Spark SQL. Spark SQL introduces a JDBC server, byte code generation for fast expression evaluation, a public types API, JSON support, and other features and optimizations. MLlib introduces a new statistics library along with several new algorithms and optimizations. Spark 1.1 also builds out Spark's Python support and adds new components to the Spark Streaming module. Visit the release notes [1] to read about the new features, or download [2] the release today. [1] http://spark.eu.apache.org/releases/spark-release-1-1-0.html [2] http://spark.eu.apache.org/downloads.html NOTE: SOME ASF DOWNLOAD MIRRORS WILL NOT CONTAIN THE RELEASE FOR SEVERAL HOURS. Please e-mail me directly for any type-o's in the release notes or name listing. Thanks, and congratulations! - Patrick - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
coalesce on SchemaRDD in pyspark
Hi All, I'm having some trouble with the coalesce and repartition functions for SchemaRDD objects in pyspark. When I run: sqlCtx.jsonRDD(sc.parallelize(['{foo:bar}', '{foo:baz}'])).coalesce(1) I get this error: Py4JError: An error occurred while calling o94.coalesce. Trace: py4j.Py4JException: Method coalesce([class java.lang.Integer, class java.lang.Boolean]) does not exist For context, I have a dataset stored in a parquet file, and I'm using SQLContext to make several queries against the data. I then register the results of these queries as new tables in the SQLContext. Unfortunately each new table has the same number of partitions as the original (despite being much smaller). Hence my interest in coalesce and repartition. Has anybody else encountered this bug? Is there an alternate workflow I should consider? I am running the 1.1.0 binaries released today. best, -Brad
Re: Announcing Spark 1.1.0!
Hi, On Fri, Sep 12, 2014 at 9:12 AM, Patrick Wendell pwend...@gmail.com wrote: I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is the second release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 171 developers! Great, congratulations!! The release notes read great! Seems like if I wait long enough for new Spark releases, my applications will build themselves in the end ;-) Tobias
RE: Announcing Spark 1.1.0!
I’m not sure if I’m completely answering your question here but I’m currently working (on OSX) with Hadoop 2.5 and I used the Spark 1.1 with Hadoop 2.4 without any issues.
Re: Table not found: using jdbc console to query sparksql hive thriftserver
It sort of depends on the definition of efficiently. From a work flow perspective I would agree but from an I/O perspective, wouldn’t there be the same multi-pass from the standpoint of the Hive context needing to push the data into HDFS? Saying this, if you’re pushing the data into HDFS and then creating Hive tables via load (vs. a reference point ala external tables), I would agree with you. And thanks for correcting me, the registerTempTable is in the SqlContext. On September 10, 2014 at 13:47:24, Du Li (l...@yahoo-inc.com) wrote: Hi Denny, There is a related question by the way. I have a program that reads in a stream of RDDs, each of which is to be loaded into a hive table as one partition. Currently I do this by first writing the RDDs to HDFS and then loading them to hive, which requires multiple passes of HDFS I/O and serialization/deserialization. I wonder if it is possible to do it more efficiently with Spark 1.1 streaming + SQL, e.g., by registering the RDDs into a hive context so that the data is loaded directly into the hive table in cache and meanwhile visible to jdbc/odbc clients. In the spark source code, the method registerTempTable you mentioned works on SqlContext instead of HiveContext. Thanks, Du On 9/10/14, 1:21 PM, Denny Lee denny.g@gmail.com wrote: Actually, when registering the table, it is only available within the sc context you are running it in. For Spark 1.1, the method name is changed to registerTempTable to better reflect that. The Thrift server runs under a different process meaning that it cannot see any of the tables generated within the sc context. You would need to save the sc table into Hive and then the Thrift process would be able to see them. HTH! On Sep 10, 2014, at 13:08, alexandria1101 alexandria.shea...@gmail.com wrote: I used the hiveContext to register the tables and the tables are still not being found by the thrift server. Do I have to pass the hiveContext to JDBC somehow? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Table-not-found-using-jdbc-console-to-query-sparksql-hive-thriftserver-tp13840p13922.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
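To make the flow being discussed concrete, here is a minimal sketch of the "register and insert from a stream" pattern, assuming Spark 1.1's Scala API (HiveContext, createSchemaRDD, registerTempTable, insertInto) and a made-up Event case class, source, and table name. Note that insertInto still writes through the Hive table's storage, so by itself it does not remove the HDFS write being asked about.

```scala
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical record type for one element of the stream.
case class Event(id: Long, payload: String)

val ssc = new StreamingContext(sc, Seconds(60))
val hiveContext = new HiveContext(sc)

val events = ssc.socketTextStream("source-host", 9999) // placeholder source
  .map(line => Event(line.hashCode.toLong, line))

events.foreachRDD { rdd =>
  val batch = hiveContext.createSchemaRDD(rdd)
  // Temp table: queryable from this driver's context only (not from the Thrift server).
  batch.registerTempTable("current_batch")
  // Append the batch to an existing Hive table known to the metastore,
  // which is what jdbc/odbc clients will see.
  batch.insertInto("streamed_events")
}

ssc.start()
ssc.awaitTermination()
```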
RE: Announcing Spark 1.1.0!
Please correct me if I’m wrong, but I was under the impression, as per the maven repositories, that it was just to stay more in sync with the various versions of Hadoop. Looking at the latest documentation (https://spark.apache.org/docs/latest/building-with-maven.html), there are multiple Hadoop versions called out. As for the potential differences in Spark, this is more about ensuring the various jars and library dependencies of the correct version of Hadoop are included so there can be proper connectivity to Hadoop from Spark, vs. any differences in Spark itself. Another good reference on this topic is the callout for Hadoop versions within the GitHub repo: https://github.com/apache/spark HTH! On September 11, 2014 at 18:39:10, Haopu Wang (hw...@qilinsoft.com) wrote: Denny, thanks for the response. I raise the question because in Spark 1.0.2, I saw one binary package for hadoop2, but in Spark 1.1.0, there are separate packages for hadoop 2.3 and 2.4. That implies some difference in Spark depending on the Hadoop version.
RE: Announcing Spark 1.1.0!
The web page you pointed out (https://spark.apache.org/docs/latest/building-with-maven.html) says: “Because HDFS is not protocol-compatible across versions, if you want to read from HDFS, you’ll need to build Spark against the specific HDFS version in your environment.” Did you try to read a Hadoop 2.5.0 file using Spark 1.1 built against Hadoop 2.4? Thanks!
Applications status missing when Spark HA(zookeeper) enabled
Hi guys, I configured Spark with the following in spark-env.sh: export SPARK_DAEMON_JAVA_OPTS=-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=host1:2181,host2:2181,host3:2181 -Dspark.deploy.zookeeper.dir=/spark Then I started spark-shell against the active master host1: MASTER=spark://host1:7077,host2:7077 bin/spark-shell I ran stop-master.sh on host1 and then accessed the host2 web UI: the worker successfully registered to the new master host2, but neither the running application nor the completed applications show up. Did I miss anything when configuring Spark HA? Thanks!
RE: Announcing Spark 1.1.0!
Yes, at least for my query scenarios, I have been able to use Spark 1.1 with Hadoop 2.4 against Hadoop 2.5. Note, Hadoop 2.5 is considered a relatively minor release (http://hadoop.apache.org/releases.html#11+August%2C+2014%3A+Release+2.5.0+available) whereas Hadoop 2.4 and 2.3 were considered more significant releases.
RE: Announcing Spark 1.1.0!
Got it, thank you, Denny!
Re: DistCP - Spark-based
I've created SPARK-3499 https://issues.apache.org/jira/browse/SPARK-3499 to track creating a Spark-based distcp utility. Nick On Tue, Aug 12, 2014 at 4:20 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Good question; I don't know of one but I believe people at Cloudera had some thoughts of porting Sqoop to Spark in the future, and maybe they'd consider DistCP as part of this effort. I agree it's missing right now. Matei On August 12, 2014 at 11:04:28 AM, Gary Malouf (malouf.g...@gmail.com) wrote: We are probably still the minority, but our analytics platform based on Spark + HDFS does not have map/reduce installed. I'm wondering if there is a distcp equivalent that leverages Spark to do the work. Our team is trying to find the best way to do cross-datacenter replication of our HDFS data to minimize the impact of outages/dc failure.
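Until such a utility exists, a very rough sketch of the idea is below: list the source files on the driver, then have executors copy them in parallel with the Hadoop FileSystem API. The namenode addresses and paths are made up, the source directory is assumed to be flat, and error handling, permissions, and nested directories are ignored; this is only a sketch of the approach, not of whatever SPARK-3499 eventually becomes.

```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val src = "hdfs://dc1-nn:8020/data/events" // source directory (made up)
val dst = "hdfs://dc2-nn:8020/data/events" // destination directory (made up)

// Enumerate the files once on the driver.
val files = FileSystem.get(new URI(src), new Configuration())
  .listStatus(new Path(src))
  .filter(_.isFile)
  .map(_.getPath.toString)

// One task per file; each task copies its file with the Hadoop FileSystem API.
sc.parallelize(files.toSeq, math.max(files.length, 1)).foreach { file =>
  // Hadoop Configuration is not serializable, so build one per task.
  val conf = new Configuration()
  val srcPath = new Path(file)
  val dstPath = new Path(dst, srcPath.getName)
  FileUtil.copy(srcPath.getFileSystem(conf), srcPath,
    dstPath.getFileSystem(conf), dstPath,
    false /* deleteSource */, conf)
}
```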
Re: Spark SQL JDBC
When you re-ran sbt, did you clear out the packages first and ensure that the datanucleus jars were generated within lib_managed? I remember having to do that when I was testing out different configs. On Thu, Sep 11, 2014 at 10:50 AM, alexandria1101 alexandria.shea...@gmail.com wrote: Even when I comment out those 3 lines, I still get the same error. Did someone solve this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-JDBC-tp11369p13992.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark SQL Thrift JDBC server deployment for production
Could you provide some context about running this in yarn-cluster mode? The Thrift server that's included within Spark 1.1 is based on Hive 0.12. Hive has been able to work against YARN since Hive 0.10. So when you start the thrift server, provided you copied the hive-site.xml over to the Spark conf folder, it should be able to connect to the same Hive metastore and then execute Hive against your YARN cluster. On Wed, Sep 10, 2014 at 11:55 PM, vasiliy zadonsk...@gmail.com wrote: Hi, I have a question about the Spark SQL Thrift JDBC server. Is there a best practice for Spark SQL deployment? If I understand right, the script ./sbin/start-thriftserver.sh starts the Thrift JDBC server in local mode. Are there script options for running this server in yarn-cluster mode? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Thrift-JDBC-server-deployment-for-production-tp13947.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Announcing Spark 1.1.0!
Thanks for all the good work. Very excited about seeing more features and better stability in the framework.
Re: Announcing Spark 1.1.0!
Thanks to everyone who contributed to implementing and testing this release! Matei
Re: compiling spark source code
I have been doing that, but none of my modifications to the code are being compiled in. On Thu, Sep 11, 2014 at 10:45 PM, Daniil Osipov daniil.osi...@shazam.com wrote: In the spark source folder, execute `sbt/sbt assembly` On Thu, Sep 11, 2014 at 8:27 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, Can someone please tell me how to compile the Spark source code so that my changes to the source take effect? I tried shipping the jars to all the slaves, but in vain. -Karthik
RE: Spark SQL JDBC
I copied the 3 datanucleus jars (datanucleus-api-jdo-3.2.1.jar, datanucleus-core-3.2.2.jar, datanucleus-rdbms-3.2.1.jar) to the lib/ folder manually, and it works for me.
Re: Table not found: using jdbc console to query sparksql hive thriftserver
SchemaRDD has a method insertInto(table). When the table is partitioned, it would be more sensible and convenient to extend it with a list of partition keys and values.
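Until insertInto learns about partitions, the closest workaround seems to be routing the insert through HiveQL with a static partition spec. A minimal sketch, assuming Spark 1.1's HiveContext (whose sql() accepts HiveQL against the bundled Hive 0.12, which is worth verifying for this INSERT form), a hypothetical Rec case class, and an existing Hive table events_by_day partitioned by dt:

```scala
import org.apache.spark.sql.hive.HiveContext

// Hypothetical record type; events_by_day is assumed to exist and be partitioned by dt.
case class Rec(id: Long, payload: String)

val hiveContext = new HiveContext(sc)

val batch = sc.parallelize(Seq(Rec(1L, "a"), Rec(2L, "b")))
hiveContext.createSchemaRDD(batch).registerTempTable("one_batch")

// insertInto("events_by_day") can only append to the table as a whole; a static
// partition has to be spelled out in HiveQL instead.
hiveContext.sql(
  "INSERT INTO TABLE events_by_day PARTITION (dt = '2014-09-12') " +
  "SELECT id, payload FROM one_batch")
```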