Re: Multi-tenancy for Spark (Streaming) Applications
Hi, by now I have understood a bit better how spark-submit and YARN play together, and how the Spark driver and slaves interact on YARN. For my use case, as described on https://spark.apache.org/docs/latest/submitting-applications.html, I would probably have an end-user-facing gateway that submits my Spark (Streaming) application to the YARN cluster in yarn-cluster mode. I have a couple of questions regarding that setup:
* That gateway does not need to be written in Scala or Java; it actually has no contact with the Spark libraries and just executes a program on the command line (./spark-submit ...), right?
* Since my application is a streaming application, it won't finish by itself. What is the best way to terminate the application on the cluster from my gateway program? Can I just send SIGTERM to the spark-submit process? Is that recommended?
* I guess there are many possibilities to achieve this, but what is a good way to send commands/instructions to the running Spark application? If I want to push commands from the gateway to the Spark driver, I guess I need its IP address - how do I get it? If I want the Spark driver to pull its instructions instead, what is a good way to do so?
Any suggestions? Thanks, Tobias
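One way for the driver to pull a stop instruction - sketched below as an untested idea, with made-up paths and helper names - is a small watchdog thread that polls for a marker file the gateway creates on HDFS, so the gateway never needs the driver's IP address:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.streaming.StreamingContext

    // Hypothetical helper: the gateway creates `marker` on HDFS when the job should stop.
    def watchForShutdown(ssc: StreamingContext, marker: String): Unit = {
      val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
      new Thread("shutdown-watcher") {
        override def run(): Unit = {
          while (!fs.exists(new Path(marker))) Thread.sleep(10000)
          // Stop gracefully so batches that were already received get processed before exit.
          ssc.stop(stopSparkContext = true, stopGracefully = true)
        }
      }.start()
    }

The gateway side then only needs an HDFS client call (or hadoop fs -touchz) plus the YARN application id reported by spark-submit, rather than a direct connection to the driver.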
Re: can fileStream() or textFileStream() remember state?
When you get a stream from sc.fileStream(), Spark will only process files whose timestamp is newer than the current timestamp, so data already in HDFS should not be processed again. You may have another problem - Spark will not process files that were moved into your HDFS folder between your application restarts. To avoid this you should use checkpoints as described here: https://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-the-driver-node akso wrote: When streaming from HDFS through either sc.fileStream() or sc.textFileStream(), how can state info be saved so that it won't process duplicate data? When the app is stopped and restarted, all data from HDFS is processed again. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/can-fileStream-or-textFileStream-remember-state-tp9105p13950.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
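A minimal sketch of the checkpoint-based recovery that the linked guide describes (the directories, app name and batch interval below are placeholders, not taken from the original post):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///user/app/checkpoints"   // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("FileStreamApp")
      val ssc = new StreamingContext(conf, Seconds(30))
      ssc.checkpoint(checkpointDir)
      // Placeholder input directory; use the folder your files are moved into.
      val lines = ssc.textFileStream("hdfs:///user/app/incoming")
      lines.count().print()
      ssc
    }

    // On restart, rebuild the context from the checkpoint instead of creating a fresh one;
    // a fresh context only looks at files newer than its own start time.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()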
SchemaRDD saveToCassandra
Hi, My requirement is to extract certain fields from json files, run queries on them and save the result to cassandra. I was able to parse json , filter the result and save the rdd(regular) to cassandra. Now, when I try to read the json file through sqlContext , execute some queries on the same and then save the SchemaRDD to cassandra using saveToCassandra function, I am getting the following error: java.lang.NoSuchMethodException: Cannot resolve any suitable constructor for class org.apache.spark.sql.catalyst.expressions.Row Pls let me know if a spark SchemaRDD can be directly saved to cassandra just like the regular rdd? If that is not possible, is there any way to convert the schema RDD to a regular RDD ? Please advise. Regards, lmk -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-saveToCassandra-tp13951.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
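In case a direct save of the SchemaRDD is not supported by the connector version in use, here is a rough sketch of the conversion route (the case class, field positions, keyspace and table names below are made-up examples):

    import com.datastax.spark.connector._

    case class Email(id: String, subject: String)   // example schema

    val schemaRdd = sqlContext.jsonFile("emails.json")   // or the SchemaRDD returned by a sqlContext.sql(...) query
    // Map the catalyst Rows into plain case-class objects, giving a regular RDD...
    val plainRdd = schemaRdd.map(row => Email(row.getString(0), row.getString(1)))
    // ...which the connector can save by matching case-class fields to column names.
    plainRdd.saveToCassandra("email_keyspace", "emails")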
Spark not installed + no access to web UI
Hi, I have been launching Spark in the same ways for the past months, but I have only recently started to have problems with it. I launch Spark using spark-ec2 script, but then I cannot access the web UI when I type address:8080 into the browser (it doesn't work with lynx either from the master node), and I cannot find pyspark in the usual spark/bin/pyspark folder. Any hints as to what might be happening? Thanks in advance! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-not-installed-no-access-to-web-UI-tp13952.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Unpersist
I want to create a temporary variable in my Spark code. Can I do this? for (i <- num) { val temp = .. { do something } temp.unpersist() } Thank You
Re: Spark not installed + no access to web UI
Which version of Spark are you using? Thanks Best Regards On Thu, Sep 11, 2014 at 3:10 PM, mrm ma...@skimlinks.com wrote: Hi, I have been launching Spark in the same ways for the past months, but I have only recently started to have problems with it. I launch Spark using spark-ec2 script, but then I cannot access the web UI when I type address:8080 into the browser (it doesn't work with lynx either from the master node), and I cannot find pyspark in the usual spark/bin/pyspark folder. Any hints as to what might be happening? Thanks in advance! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-not-installed-no-access-to-web-UI-tp13952.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Unpersist
like this? var temp = ... for (i <- num) { temp = .. { do something } temp.unpersist() } Thanks Best Regards On Thu, Sep 11, 2014 at 3:26 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: I want to create a temporary variable in my Spark code. Can I do this? for (i <- num) { val temp = .. { do something } temp.unpersist() } Thank You
Re: Spark not installed + no access to web UI
I tried 1.0.0, 1.0.1 and 1.0.2. I also tried the latest github commit. After several hours trying to launch it, now it seems to be working. This is what I did (not sure if any of these steps helped):
1/ clone the spark repo into the master node
2/ run sbt/sbt assembly
3/ copy spark and spark-ec2 directories to my slaves
4/ launch the cluster again with --resume
Now I can finally access the web UI and spark is properly installed! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-not-installed-no-access-to-web-UI-tp13952p13957.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
JMXSink for YARN deployment
Hello, we at Sematext (https://apps.sematext.com/) are writing a monitoring tool for Spark and we came across one question: how to enable JMX metrics for a YARN deployment? We put *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink into the file $SPARK_HOME/conf/metrics.properties but it doesn't work. Everything works in Standalone mode, but not in YARN mode. Can somebody help? Thx! PS: I also found https://stackoverflow.com/questions/23529404/spark-on-yarn-how-to-send-metrics-to-graphite-sink/25786112 which has no answer.
Re: How to scale more consumer to Kafka stream
Thanks all, I'm going to check both solutions. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-scale-more-consumer-to-Kafka-stream-tp13883p13959.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Spark streaming stops computing while the receiver keeps running without any errors reported
Hi all, I am trying to run a Kinesis Spark Streaming application on a standalone Spark cluster. The job works fine in local mode, but when I submit it (using spark-submit) it doesn't do anything. I enabled logs for the org.apache.spark.streaming.kinesis package and I regularly get the following in worker logs: 14/09/11 11:41:25 DEBUG KinesisRecordProcessor: Stored: Worker x.x.x.x:b88a9210-cbb9-4c31-8da7-35fd92faba09 stored 34 records for shardId shardId- 14/09/11 11:41:25 DEBUG KinesisRecordProcessor: Stored: Worker x.x.x.x:b2e9c20f-470a-44fe-a036-630c776919fb stored 31 records for shardId shardId-0001 But the job does not perform any operations defined on the DStream. To investigate this further, I changed kinesis-asl's KinesisUtils to perform the following computation on the DStream created using ssc.receiverStream(new KinesisReceiver...): stream.count().foreachRDD(rdd => rdd.foreach(tuple => logInfo("Emitted " + tuple))) Even the above line does not result in any corresponding log entries in either the driver or worker logs. The only relevant logs that I could find in the driver logs are: 14/09/11 11:40:58 INFO DAGScheduler: Stage 2 (foreach at KinesisUtils.scala:68) finished in 0.398 s 14/09/11 11:40:58 INFO SparkContext: Job finished: foreach at KinesisUtils.scala:68, took 4.926449985 s 14/09/11 11:40:58 INFO JobScheduler: Finished job streaming job 1410435653000 ms.0 from job set of time 1410435653000 ms 14/09/11 11:40:58 INFO JobScheduler: Starting job streaming job 1410435653000 ms.1 from job set of time 1410435653000 ms 14/09/11 11:40:58 INFO SparkContext: Starting job: foreach at KinesisUtils.scala:68 14/09/11 11:40:58 INFO DAGScheduler: Registering RDD 13 (union at DStream.scala:489) 14/09/11 11:40:58 INFO DAGScheduler: Got job 3 (foreach at KinesisUtils.scala:68) with 2 output partitions (allowLocal=false) 14/09/11 11:40:58 INFO DAGScheduler: Final stage: Stage 5(foreach at KinesisUtils.scala:68) After the above logs, nothing shows up corresponding to KinesisUtils. I am out of ideas on this one and any help would be greatly appreciated. Thanks, Aniket
problem in using Spark-Cassandra connector
Hi, I am new to spark. I encountered an issue when trying to connect to Cassandra using Spark Cassandra connector. Can anyone help me. Following are the details. 1) Following Spark and Cassandra versions I am using on LUbuntu12.0. i)spark-1.0.2-bin-hadoop2 ii) apache-cassandra-2.0.10 2) In the Cassandra, i created a key space, table and inserted some data. 3)Following libs are specified when starting the spark-shell. antworks@INHN1I-DW1804:$ spark-shell --jars /home/antworks/lib-spark-cassandra/apache-cassandra-clientutil-2.0.10.jar,/home/antworks/lib-spark-cassandra/apache-cassandra-thrift-2.0.10.jar,/home/antworks/lib-spark-cassandra/cassandra-driver-core-2.0.2.jar,/home/antworks/lib-spark-cassandra/guava-15.0.jar,/home/antworks/lib-spark-cassandra/joda-convert-1.6.jar,/home/antworks/lib-spark-cassandra/joda-time-2.3.jar,/home/antworks/lib-spark-cassandra/libthrift-0.9.1.jar,/home/antworks/lib-spark-cassandra/spark-cassandra-connector_2.10-1.0.0-rc3.jar 4) when running the stmt val rdd = sc.cassandraTable(EmailKeySpace, Emails)encountered following issue. My application connecting to Cassandra and immediately disconnecting and throwing java.io.IOException: Table not found: EmailKeySpace.Emails Here is the stack trace. scala import com.datastax.spark.connector._ import com.datastax.spark.connector._ scala val rdd = sc.cassandraTable(EmailKeySpace, Emails) 14/09/11 23:06:51 WARN FrameCompressor: Cannot find LZ4 class, you should make sure the LZ4 library is in the classpath if you intend to use it. LZ4 compression will not be available for the protocol. 14/09/11 23:06:51 INFO Cluster: New Cassandra host /172.23.1.68:9042 added 14/09/11 23:06:51 INFO CassandraConnector: Connected to Cassandra cluster: AWCluster 14/09/11 23:06:52 INFO CassandraConnector: Disconnected from Cassandra cluster: AWCluster java.io.IOException: Table not found: EmailKeySpace.Emails at com.datastax.spark.connector.rdd.CassandraRDD.tableDef$lzycompute(CassandraRDD.scala:208) at com.datastax.spark.connector.rdd.CassandraRDD.tableDef(CassandraRDD.scala:205) at com.datastax.spark.connector.rdd.CassandraRDD.init(CassandraRDD.scala:212) at com.datastax.spark.connector.SparkContextFunctions.cassandraTable(SparkContextFunctions.scala:48) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC$$iwC$$iwC.init(console:22) at $iwC$$iwC$$iwC.init(console:24) at $iwC$$iwC.init(console:26) at $iwC.init(console:28) at init(console:30) at .init(console:34) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601) at 
org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at
Re: problem in using Spark-Cassandra connector
You will have to create create KeySpace and Table. See the message, Table not found: EmailKeySpace.Emails Looks like you have not created the Emails table. On Thu, Sep 11, 2014 at 6:04 PM, Karunya Padala karunya.pad...@infotech-enterprises.com wrote: Hi, I am new to spark. I encountered an issue when trying to connect to Cassandra using Spark Cassandra connector. Can anyone help me. Following are the details. 1) Following Spark and Cassandra versions I am using on LUbuntu12.0. i)spark-1.0.2-bin-hadoop2 ii) apache-cassandra-2.0.10 2) In the Cassandra, i created a key space, table and inserted some data. 3)Following libs are specified when starting the spark-shell. antworks@INHN1I-DW1804:$ spark-shell --jars /home/antworks/lib-spark-cassandra/apache-cassandra-clientutil-2.0.10.jar,/home/antworks/lib-spark-cassandra/apache-cassandra-thrift-2.0.10.jar,/home/antworks/lib-spark-cassandra/cassandra-driver-core-2.0.2.jar,/home/antworks/lib-spark-cassandra/guava-15.0.jar,/home/antworks/lib-spark-cassandra/joda-convert-1.6.jar,/home/antworks/lib-spark-cassandra/joda-time-2.3.jar,/home/antworks/lib-spark-cassandra/libthrift-0.9.1.jar,/home/antworks/lib-spark-cassandra/spark-cassandra-connector_2.10-1.0.0-rc3.jar 4) when running the stmt val rdd = sc.cassandraTable(EmailKeySpace, Emails)encountered following issue. My application connecting to Cassandra and immediately disconnecting and throwing java.io.IOException: Table not found: EmailKeySpace.Emails Here is the stack trace. scala import com.datastax.spark.connector._ import com.datastax.spark.connector._ scala val rdd = sc.cassandraTable(EmailKeySpace, Emails) 14/09/11 23:06:51 WARN FrameCompressor: Cannot find LZ4 class, you should make sure the LZ4 library is in the classpath if you intend to use it. LZ4 compression will not be available for the protocol. 
RE: problem in using Spark-Cassandra connector
I have created key space called EmailKeySpace’and table called Emails and inserted some data in the Cassandra. See my Cassandra console screen shot. [cid:image001.png@01CFCDEB.8FB55CB0] Regards, Karunya. From: Reddy Raja [mailto:areddyr...@gmail.com] Sent: 11 September 2014 18:07 To: Karunya Padala Cc: u...@spark.incubator.apache.org Subject: Re: problem in using Spark-Cassandra connector You will have to create create KeySpace and Table. See the message, Table not found: EmailKeySpace.Emails Looks like you have not created the Emails table. On Thu, Sep 11, 2014 at 6:04 PM, Karunya Padala karunya.pad...@infotech-enterprises.commailto:karunya.pad...@infotech-enterprises.com wrote: Hi, I am new to spark. I encountered an issue when trying to connect to Cassandra using Spark Cassandra connector. Can anyone help me. Following are the details. 1) Following Spark and Cassandra versions I am using on LUbuntu12.0. i)spark-1.0.2-bin-hadoop2 ii) apache-cassandra-2.0.10 2) In the Cassandra, i created a key space, table and inserted some data. 3)Following libs are specified when starting the spark-shell. antworks@INHN1I-DW1804:$ spark-shell --jars /home/antworks/lib-spark-cassandra/apache-cassandra-clientutil-2.0.10.jar,/home/antworks/lib-spark-cassandra/apache-cassandra-thrift-2.0.10.jar,/home/antworks/lib-spark-cassandra/cassandra-driver-core-2.0.2.jar,/home/antworks/lib-spark-cassandra/guava-15.0.jar,/home/antworks/lib-spark-cassandra/joda-convert-1.6.jar,/home/antworks/lib-spark-cassandra/joda-time-2.3.jar,/home/antworks/lib-spark-cassandra/libthrift-0.9.1.jar,/home/antworks/lib-spark-cassandra/spark-cassandra-connector_2.10-1.0.0-rc3.jar 4) when running the stmt val rdd = sc.cassandraTable(EmailKeySpace, Emails)encountered following issue. My application connecting to Cassandra and immediately disconnecting and throwing java.io.IOException: Table not found: EmailKeySpace.Emails Here is the stack trace. scala import com.datastax.spark.connector._ import com.datastax.spark.connector._ scala val rdd = sc.cassandraTable(EmailKeySpace, Emails) 14/09/11 23:06:51 WARN FrameCompressor: Cannot find LZ4 class, you should make sure the LZ4 library is in the classpath if you intend to use it. LZ4 compression will not be available for the protocol. 
Spark on Raspberry Pi?
Has anyone tried using Raspberry Pi for Spark? How efficient is it to use around 10 Pi's for local testing env ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Raspberry-Pi-tp13965.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: How to scale more consumer to Kafka stream
I agree Gerard. Thanks for pointing this.. Dib On Thu, Sep 11, 2014 at 5:28 PM, Gerard Maas gerard.m...@gmail.com wrote: This pattern works. One note, thought: Use 'union' only if you need to group the data from all RDDs into one RDD for processing (like count distinct or need a groupby). If your process can be parallelized over every stream of incoming data, I suggest you just apply the required transformations on every dstream and avoid 'union' altogether. -kr, Gerard. On Wed, Sep 10, 2014 at 8:17 PM, Tim Smith secs...@gmail.com wrote: How are you creating your kafka streams in Spark? If you have 10 partitions for a topic, you can call createStream ten times to create 10 parallel receivers/executors and then use union to combine all the dStreams. On Wed, Sep 10, 2014 at 7:16 AM, richiesgr richie...@gmail.com wrote: Hi (my previous post as been used by someone else) I'm building a application the read from kafka stream event. In production we've 5 consumers that share 10 partitions. But on spark streaming kafka only 1 worker act as a consumer then distribute the tasks to workers so I can have only 1 machine acting as consumer but I need more because only 1 consumer means Lags. Do you've any idea what I can do ? Another point is interresting the master is not loaded at all I can get up more than 10 % CPU I've tried to increase the queued.max.message.chunks on the kafka client to read more records thinking it'll speed up the read but I only get ERROR consumer.ConsumerFetcherThread: [ConsumerFetcherThread-SparkEC2_ip-10-138-59-194.ec2.internal-1410182950783-5c49c8e8-0-174167372], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 73; ClientId: SparkEC2-ConsumerFetcherThread-SparkEC2_ip-10-138-59-194.ec2.internal-1410182950783-5c49c8e8-0-174167372; ReplicaId: -1; MaxWait: 100 ms; MinBytes: 1 bytes; RequestInfo: [IA2,7] - PartitionFetchInfo(929838589,1048576),[IA2,6] - PartitionFetchInfo(929515796,1048576),[IA2,9] - PartitionFetchInfo(929577946,1048576),[IA2,8] - PartitionFetchInfo(930751599,1048576),[IA2,2] - PartitionFetchInfo(926457704,1048576),[IA2,5] - PartitionFetchInfo(930774385,1048576),[IA2,0] - PartitionFetchInfo(929913213,1048576),[IA2,3] - PartitionFetchInfo(929268891,1048576),[IA2,4] - PartitionFetchInfo(929949877,1048576),[IA2,1] - PartitionFetchInfo(930063114,1048576) java.lang.OutOfMemoryError: Java heap space Is someone have ideas ? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-scale-more-consumer-to-Kafka-stream-tp13883.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
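For reference, a rough sketch of the createStream-per-partition pattern Tim describes, assuming an existing StreamingContext `ssc` (the ZooKeeper quorum, consumer group and topic map below are placeholders):

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numReceivers = 10   // roughly one receiver per Kafka partition, as suggested above
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("mytopic" -> 1))
    }
    // Union them when the processing needs all partitions together (e.g. a global distinct count)...
    val unioned = ssc.union(kafkaStreams)
    // ...or, per Gerard's note, apply the same transformations to each stream separately.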
Fwd: Spark on Raspberry Pi?
Pi's bus speed, memory size and access speed, and processing ability are limited. The only benefit could be the power consumption. On Thu, Sep 11, 2014 at 8:04 AM, Sandeep Singh sand...@techaddict.me wrote: Has anyone tried using Raspberry Pi for Spark? How efficient is it to use around 10 Pi's for local testing env ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Raspberry-Pi-tp13965.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: JMXSink for YARN deployment
Hi, I’m guessing the problem is that driver or executor cannot get the metrics.properties configuration file in the yarn container, so metrics system cannot load the right sinks. Thanks Jerry From: Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com] Sent: Thursday, September 11, 2014 7:30 PM To: user@spark.apache.org Subject: JMXSink for YARN deployment Hello, we are in Sematext (https://apps.sematext.com/) are writing Monitoring tool for Spark and we came across one question: How to enable JMX metrics for YARN deployment? We put *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink to file $SPARK_HOME/conf/metrics.properties but it doesn't work. Everything works in Standalone mode, but not in YARN mode. Can somebody help? Thx! PS: I've found also https://stackoverflow.com/questions/23529404/spark-on-yarn-how-to-send-metrics-to-graphite-sink/25786112 without answer.
unable to create new native thread
Hi I am trying the Spark sample program “SparkPi”, I got an error unable to create new native thread, how to resolve this? 14/09/11 21:36:16 INFO scheduler.DAGScheduler: Completed ResultTask(0, 644) 14/09/11 21:36:16 INFO scheduler.TaskSetManager: Finished TID 643 in 43 ms on node1 (progress: 636/10) 14/09/11 21:36:16 INFO scheduler.DAGScheduler: Completed ResultTask(0, 643) 14/09/11 21:36:16 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [spark-akka.actor.default-dispatcher-16] shutting down ActorSystem [spark] java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) at scala.concurrent.forkjoin.ForkJoinPool.tryAddWorker(ForkJoinPool.java:1672) at scala.concurrent.forkjoin.ForkJoinPool.signalWork(ForkJoinPool.java:1966) at scala.concurrent.forkjoin.ForkJoinPool.externalPush(ForkJoinPool.java:1829) at scala.concurrent.forkjoin.ForkJoinPool.execute(ForkJoinPool.java:2955) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool.execute(AbstractDispatcher.scala:374) at akka.dispatch.ExecutorServiceDelegate$class.execute(ThreadPoolBuilder.scala:212) at akka.dispatch.Dispatcher$LazyExecutorServiceDelegate.execute(Dispatcher.scala:43) at akka.dispatch.Dispatcher.registerForExecution(Dispatcher.scala:118) at akka.dispatch.Dispatcher.dispatch(Dispatcher.scala:59) at akka.actor.dungeon.Dispatch$class.sendMessage(Dispatch.scala:120) at akka.actor.ActorCell.sendMessage(ActorCell.scala:338) at akka.actor.Cell$class.sendMessage(ActorCell.scala:259) at akka.actor.ActorCell.sendMessage(ActorCell.scala:338) at akka.actor.LocalActorRef.$bang(ActorRef.scala:389) at akka.actor.ActorRef.tell(ActorRef.scala:125) at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:489) at akka.actor.ActorCell.invoke(ActorCell.scala:455) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Regards Arthur
Re: JMXSink for YARN deployment
Hi Shao, thx for explanation, any ideas how to fix it? Where should I put metrics.properties file? On Thu, Sep 11, 2014 at 4:18 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, I’m guessing the problem is that driver or executor cannot get the metrics.properties configuration file in the yarn container, so metrics system cannot load the right sinks. Thanks Jerry *From:* Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com] *Sent:* Thursday, September 11, 2014 7:30 PM *To:* user@spark.apache.org *Subject:* JMXSink for YARN deployment Hello, we are in Sematext (https://apps.sematext.com/) are writing Monitoring tool for Spark and we came across one question: How to enable JMX metrics for YARN deployment? We put *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink to file $SPARK_HOME/conf/metrics.properties but it doesn't work. Everything works in Standalone mode, but not in YARN mode. Can somebody help? Thx! PS: I've found also https://stackoverflow.com/questions/23529404/spark-on-yarn-how-to-send-metrics-to-graphite-sink/25786112 without answer.
RE: JMXSink for YARN deployment
I think you can try to use ” spark.metrics.conf” to manually specify the path of metrics.properties, but the prerequisite is that each container should find this file in their local FS because this file is loaded locally. Besides I think this might be a kind of workaround, a better solution is to fix this by some other solutions. Thanks Jerry From: Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com] Sent: Thursday, September 11, 2014 10:08 PM Cc: user@spark.apache.org Subject: Re: JMXSink for YARN deployment Hi Shao, thx for explanation, any ideas how to fix it? Where should I put metrics.properties file? On Thu, Sep 11, 2014 at 4:18 PM, Shao, Saisai saisai.s...@intel.commailto:saisai.s...@intel.com wrote: Hi, I’m guessing the problem is that driver or executor cannot get the metrics.properties configuration file in the yarn container, so metrics system cannot load the right sinks. Thanks Jerry From: Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.commailto:vladimir.tretya...@sematext.com] Sent: Thursday, September 11, 2014 7:30 PM To: user@spark.apache.orgmailto:user@spark.apache.org Subject: JMXSink for YARN deployment Hello, we are in Sematext (https://apps.sematext.com/) are writing Monitoring tool for Spark and we came across one question: How to enable JMX metrics for YARN deployment? We put *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink to file $SPARK_HOME/conf/metrics.properties but it doesn't work. Everything works in Standalone mode, but not in YARN mode. Can somebody help? Thx! PS: I've found also https://stackoverflow.com/questions/23529404/spark-on-yarn-how-to-send-metrics-to-graphite-sink/25786112 without answer.
Re: Unpersist
After every loop iteration I want the temp variable to cease to exist. On Thu, Sep 11, 2014 at 4:33 PM, Akhil Das ak...@sigmoidanalytics.com wrote: like this? var temp = ... for (i <- num) { temp = .. { do something } temp.unpersist() } Thanks Best Regards On Thu, Sep 11, 2014 at 3:26 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: I want to create a temporary variable in my Spark code. Can I do this? for (i <- num) { val temp = .. { do something } temp.unpersist() } Thank You
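For what it's worth, a val declared inside the loop body already goes out of scope at the end of each iteration; unpersist() only matters if the RDD was cached. A minimal sketch, where `data`, `num` and the transformation are placeholders:

    for (i <- 1 to num) {
      val temp = data.map(x => x * i).cache()   // example per-iteration RDD
      temp.count()                              // use temp for this iteration
      temp.unpersist()                          // free its cached blocks before the next pass
    }
    // `temp` is local to the loop body, so the reference itself disappears after every iteration.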
Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..
Hi, Can you attach more logs to see if there is some entry from ContextCleaner? I met very similar issue before…but haven’t get resolved Best, -- Nan Zhu On Thursday, September 11, 2014 at 10:13 AM, Dibyendu Bhattacharya wrote: Dear All, Not sure if this is a false alarm. But wanted to raise to this to understand what is happening. I am testing the Kafka Receiver which I have written (https://github.com/dibbhatt/kafka-spark-consumer) which basically a low level Kafka Consumer implemented custom Receivers for every Kafka topic partitions and pulling data in parallel. Individual streams from all topic partitions are then merged to create Union stream which used for further processing. The custom Receiver working fine in normal load with no issues. But when I tested this with huge amount of backlog messages from Kafka ( 50 million + messages), I see couple of major issue in Spark Streaming. Wanted to get some opinion on this I am using latest Spark 1.1 taken from the source and built it. Running in Amazon EMR , 3 m1.xlarge Node Spark cluster running in Standalone Mode. Below are two main question I have.. 1. What I am seeing when I run the Spark Streaming with my Kafka Consumer with a huge backlog in Kafka ( around 50 Million), Spark is completely busy performing the Receiving task and hardly schedule any processing task. Can you let me if this is expected ? If there is large backlog, Spark will take long time pulling them . Why Spark not doing any processing ? Is it because of resource limitation ( say all cores are busy puling ) or it is by design ? I am setting the executor-memory to 10G and driver-memory to 4G . 2. This issue seems to be more serious. I have attached the Driver trace with this email. What I can see very frequently Block are selected to be Removed...This kind of entries are all over the place. But when a Block is removed , below problem happen May be this issue cause the issue 1 that no Jobs are getting processed .. INFO : org.apache.spark.storage.MemoryStore - 1 blocks selected for dropping INFO : org.apache.spark.storage.BlockManager - Dropping block input-0-1410443074600 from memory INFO : org.apache.spark.storage.MemoryStore - Block input-0-1410443074600 of size 12651900 dropped from memory (free 21220667) INFO : org.apache.spark.storage.BlockManagerInfo - Removed input-0-1410443074600 on ip-10-252-5-113.asskickery.us:53752 (http://ip-10-252-5-113.asskickery.us:53752) in memory (size: 12.1 MB, free: 100.6 MB) ... INFO : org.apache.spark.storage.BlockManagerInfo - Removed input-0-1410443074600 on ip-10-252-5-62.asskickery.us:37033 (http://ip-10-252-5-62.asskickery.us:37033) in memory (size: 12.1 MB, free: 154.6 MB) .. WARN : org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 7.0 (TID 118, ip-10-252-5-62.asskickery.us (http://ip-10-252-5-62.asskickery.us)): java.lang.Exception: Could not compute split, block input-0-1410443074600 not found ... 
INFO : org.apache.spark.scheduler.TaskSetManager - Lost task 0.1 in stage 7.0 (TID 126) on executor ip-10-252-5-62.asskickery.us (http://ip-10-252-5-62.asskickery.us): java.lang.Exception (Could not compute split, block input-0-1410443074600 not found) [duplicate 1] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 139, ip-10-252-5-62.asskickery.us (http://ip-10-252-5-62.asskickery.us)): java.lang.Exception: Could not compute split, block input-0-1410443074600 not found org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:744) Regards, Dibyendu - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org (mailto:user-unsubscr...@spark.apache.org) For additional commands, e-mail: user-h...@spark.apache.org (mailto:user-h...@spark.apache.org) Attachments: - driver-trace.txt
Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..
This is my case about broadcast variable: 14/07/21 19:49:13 INFO Executor: Running task ID 4 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 2) 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 2 in 95 ms on localhost (progress: 3/106) 14/07/21 19:49:13 INFO TableOutputFormat: Created table instance for hdfstest_customers 14/07/21 19:49:13 INFO Executor: Serialized size of result for 3 is 596 14/07/21 19:49:13 INFO Executor: Sending result for 3 directly to driver 14/07/21 19:49:13 INFO BlockManager: Found block broadcast_0 locally 14/07/21 19:49:13 INFO Executor: Finished task ID 3 14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:5 as TID 5 on executor localhost: localhost (PROCESS_LOCAL) 14/07/21 19:49:13 INFO TaskSetManager: Serialized task 0.0:5 as 11885 bytes in 0 ms 14/07/21 19:49:13 INFO Executor: Running task ID 5 14/07/21 19:49:13 INFO BlockManager: Removing broadcast 0 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 3) 14/07/21 19:49:13 INFO ContextCleaner: Cleaned broadcast 0 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 3 in 97 ms on localhost (progress: 4/106) 14/07/21 19:49:13 INFO BlockManager: Found block broadcast_0 locally 14/07/21 19:49:13 INFO BlockManager: Removing block broadcast_0 14/07/21 19:49:13 INFO MemoryStore: Block broadcast_0 of size 202564 dropped from memory (free 886623436) 14/07/21 19:49:13 INFO ContextCleaner: Cleaned shuffle 0 14/07/21 19:49:13 INFO ShuffleBlockManager: Deleted all files for shuffle 0 14/07/21 19:49:13 INFO HadoopRDD: Input split: hdfs://172.31.34.184:9000/etltest/hdfsData/customer.csv:25+5 14/07/21 19:49:13 INFO HadoopRDD: Input split: hdfs://172.31.34.184:9000/etltest/hdfsData/customer.csv:20+5 14/07/21 19:49:13 INFO TableOutputFormat: Created table instance for hdfstest_customers 14/07/21 19:49:13 INFO Executor: Serialized size of result for 4 is 596 14/07/21 19:49:13 INFO Executor: Sending result for 4 directly to driver 14/07/21 19:49:13 INFO Executor: Finished task ID 4 14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:6 as TID 6 on executor localhost: localhost (PROCESS_LOCAL) 14/07/21 19:49:13 INFO TaskSetManager: Serialized task 0.0:6 as 11885 bytes in 0 ms 14/07/21 19:49:13 INFO Executor: Running task ID 6 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 4) 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 4 in 80 ms on localhost (progress: 5/106) 14/07/21 19:49:13 INFO TableOutputFormat: Created table instance for hdfstest_customers 14/07/21 19:49:13 INFO Executor: Serialized size of result for 5 is 596 14/07/21 19:49:13 INFO Executor: Sending result for 5 directly to driver 14/07/21 19:49:13 INFO Executor: Finished task ID 5 14/07/21 19:49:13 INFO TaskSetManager: Starting task 0.0:7 as TID 7 on executor localhost: localhost (PROCESS_LOCAL) 14/07/21 19:49:13 INFO TaskSetManager: Serialized task 0.0:7 as 11885 bytes in 0 ms 14/07/21 19:49:13 INFO Executor: Running task ID 7 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 5) 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 5 in 77 ms on localhost (progress: 6/106) 14/07/21 19:49:13 INFO HttpBroadcast: Started reading broadcast variable 0 14/07/21 19:49:13 INFO HttpBroadcast: Started reading broadcast variable 0 14/07/21 19:49:13 ERROR Executor: Exception in task ID 6 java.io.FileNotFoundException: http://172.31.34.174:52070/broadcast_0 at sun.net.www.protocol.http.HttpURLConnection.getInputStream (http://www.protocol.http.HttpURLConnection.getInputStream)(HttpURLConnection.java:1624) at 
org.apache.spark.broadcast.HttpBroadcast$.read(HttpBroadcast.scala:196) at org.apache.spark.broadcast.HttpBroadcast.readObject(HttpBroadcast.scala:89) at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at
Re[2]: HBase 0.96+ with Spark 1.0+
Hi guys, any luck with this issue, anyone? I aswell tried all the possible exclusion combos to a no avail. thanks for your ideas reinis -Original-Nachricht- Von: Stephen Boesch java...@gmail.com An: user user@spark.apache.org Datum: 28-06-2014 15:12 Betreff: Re: HBase 0.96+ with Spark 1.0+ Hi Siyuan, Thanks for the input. We are preferring to use the SparkBuild.scala instead of maven. I did not see any protobuf.version related settings in that file. But - as noted by Sean Owen - in any case the issue we are facing presently is about the duplicate incompatible javax.servlet entries - apparently from the org.mortbay artifacts. 2014-06-28 6:01 GMT-07:00 Siyuan he hsy...@gmail.com: Hi Stephen, I am using spark1.0+ HBase0.96.2. This is what I did: 1) rebuild spark using: mvn -Dhadoop.version=2.3.0 -Dprotobuf.version=2.5.0 -DskipTests clean package 2) In spark-env.sh, set SPARK_CLASSPATH = /path-to/hbase-protocol-0.96.2-hadoop2.jar Hopefully it can help. Siyuan On Sat, Jun 28, 2014 at 8:52 AM, Stephen Boesch java...@gmail.com wrote: Thanks Sean. I had actually already added exclusion rule for org.mortbay.jetty - and that had not resolved it. Just in case I used your precise formulation: val excludeMortbayJetty = ExclusionRule(organization = org.mortbay.jetty) .. ,(org.apache.spark % spark-core_2.10 % sparkVersion withSources()).excludeAll(excludeMortbayJetty) ,(org.apache.spark % spark-sql_2.10 % sparkVersion withSources()).excludeAll(excludeMortbayJetty) However the same error still recurs: 14/06/28 05:48:35 INFO HttpServer: Starting HTTP Server [error] (run-main-0) java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package 2014-06-28 4:22 GMT-07:00 Sean Owen so...@cloudera.com: This sounds like an instance of roughly the same item as in https://issues.apache.org/jira/browse/SPARK-1949 Have a look at adding that exclude to see if it works. On Fri, Jun 27, 2014 at 10:21 PM, Stephen Boesch java...@gmail.com wrote: The present trunk is built and tested against HBase 0.94. I have tried various combinations of versions of HBase 0.96+ and Spark 1.0+ and all end up with 14/06/27 20:11:15 INFO HttpServer: Starting HTTP Server [error] (run-main-0) java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:952) I have tried a number of different ways to exclude javax.servlet related jars. But none have avoided this error. Anyone have a (small-ish) build.sbt that works with later versions of HBase? - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
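Since the list archive has stripped the quotes out of the sbt snippets quoted above, here is roughly what the exclusion under discussion looks like in a build.sbt (versions and the extra javax.servlet rule are examples; whether this resolves the signer error depends on which artifact actually drags in the conflicting classes):

    val sparkVersion = "1.0.0"   // example
    val excludeMortbayJetty = ExclusionRule(organization = "org.mortbay.jetty")
    val excludeJavaxServlet = ExclusionRule(organization = "javax.servlet")   // sometimes needed as well

    libraryDependencies ++= Seq(
      ("org.apache.spark" % "spark-core_2.10" % sparkVersion withSources())
        .excludeAll(excludeMortbayJetty, excludeJavaxServlet),
      ("org.apache.spark" % "spark-sql_2.10" % sparkVersion withSources())
        .excludeAll(excludeMortbayJetty, excludeJavaxServlet)
    )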
Re: JMXSink for YARN deployment
Hi again, yeah , I've tried to use ” spark.metrics.conf” before my question in ML, had no luck:( Any other ideas from somebody? Seems nobody use metrics in YARN deployment mode. How about Mesos? I didn't try but maybe Spark has the same difficulties on Mesos? PS: Spark is great thing in general, will be nice to see metrics in YARN/Mesos mode, not only in Standalone:) On Thu, Sep 11, 2014 at 5:25 PM, Shao, Saisai saisai.s...@intel.com wrote: I think you can try to use ” spark.metrics.conf” to manually specify the path of metrics.properties, but the prerequisite is that each container should find this file in their local FS because this file is loaded locally. Besides I think this might be a kind of workaround, a better solution is to fix this by some other solutions. Thanks Jerry *From:* Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com] *Sent:* Thursday, September 11, 2014 10:08 PM *Cc:* user@spark.apache.org *Subject:* Re: JMXSink for YARN deployment Hi Shao, thx for explanation, any ideas how to fix it? Where should I put metrics.properties file? On Thu, Sep 11, 2014 at 4:18 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, I’m guessing the problem is that driver or executor cannot get the metrics.properties configuration file in the yarn container, so metrics system cannot load the right sinks. Thanks Jerry *From:* Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com] *Sent:* Thursday, September 11, 2014 7:30 PM *To:* user@spark.apache.org *Subject:* JMXSink for YARN deployment Hello, we are in Sematext (https://apps.sematext.com/) are writing Monitoring tool for Spark and we came across one question: How to enable JMX metrics for YARN deployment? We put *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink to file $SPARK_HOME/conf/metrics.properties but it doesn't work. Everything works in Standalone mode, but not in YARN mode. Can somebody help? Thx! PS: I've found also https://stackoverflow.com/questions/23529404/spark-on-yarn-how-to-send-metrics-to-graphite-sink/25786112 without answer.
Re: JMXSink for YARN deployment
Hi Vladimir How about use --files option with spark-submit? - Kousuke (2014/09/11 23:43), Vladimir Tretyakov wrote: Hi again, yeah , I've tried to use ” spark.metrics.conf” before my question in ML, had no luck:( Any other ideas from somebody? Seems nobody use metrics in YARN deployment mode. How about Mesos? I didn't try but maybe Spark has the same difficulties on Mesos? PS: Spark is great thing in general, will be nice to see metrics in YARN/Mesos mode, not only in Standalone:) On Thu, Sep 11, 2014 at 5:25 PM, Shao, Saisai saisai.s...@intel.com mailto:saisai.s...@intel.com wrote: I think you can try to use ”spark.metrics.conf” to manually specify the path of metrics.properties, but the prerequisite is that each container should find this file in their local FS because this file is loaded locally. Besides I think this might be a kind of workaround, a better solution is to fix this by some other solutions. Thanks Jerry *From:*Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com mailto:vladimir.tretya...@sematext.com] *Sent:* Thursday, September 11, 2014 10:08 PM *Cc:* user@spark.apache.org mailto:user@spark.apache.org *Subject:* Re: JMXSink for YARN deployment Hi Shao, thx for explanation, any ideas how to fix it? Where should I put metrics.properties file? On Thu, Sep 11, 2014 at 4:18 PM, Shao, Saisai saisai.s...@intel.com mailto:saisai.s...@intel.com wrote: Hi, I’m guessing the problem is that driver or executor cannot get the metrics.properties configuration file in the yarn container, so metrics system cannot load the right sinks. Thanks Jerry *From:*Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com mailto:vladimir.tretya...@sematext.com] *Sent:* Thursday, September 11, 2014 7:30 PM *To:* user@spark.apache.org mailto:user@spark.apache.org *Subject:* JMXSink for YARN deployment Hello, we are in Sematext (https://apps.sematext.com/) are writing Monitoring tool for Spark and we came across one question: How to enable JMX metrics for YARN deployment? We put *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink to file $SPARK_HOME/conf/metrics.properties but it doesn't work. Everything works in Standalone mode, but not in YARN mode. Can somebody help? Thx! PS: I've found also https://stackoverflow.com/questions/23529404/spark-on-yarn-how-to-send-metrics-to-graphite-sink/25786112 without answer.
Python execution support on clusters
Is there some doc that I missed that describes which execution engines Python is supported on with Spark? If we use spark-submit with a YARN cluster, an error is produced saying 'Error: Cannot currently run Python driver programs on cluster'. Thanks in advance David -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Python-execution-support-on-clusters-tp13977.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Table not found: using jdbc console to query sparksql hive thriftserver
Thank you!! I can do this using saveAsTable with the schemaRDD, right? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Table-not-found-using-jdbc-console-to-query-sparksql-hive-thriftserver-tp13840p13979.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
compiling spark source code
Hi, can someone please tell me how to compile the Spark source code so that my changes to the source take effect? I was trying to ship the jars to all the slaves, but in vain. -Karthik
Out of memory with Spark Streaming
I am running a simple Spark Streaming program that pulls in data from Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps data and persists to a store. The program is running in local mode right now and runs out of memory after a while. I am yet to investigate heap dumps but I think Spark isn't releasing memory after processing is complete. I have even tried changing storage level to disk only. Help! Thanks, Aniket
Re: efficient zipping of lots of RDDs
filed jira SPARK-3489 https://issues.apache.org/jira/browse/SPARK-3489 On Thu, Sep 4, 2014 at 9:36 AM, Mohit Jaggi mohitja...@gmail.com wrote: Folks, I sent an email announcing https://github.com/AyasdiOpenSource/df This dataframe is basically a map of RDDs of columns(along with DSL sugar), as column based operations seem to be most common. But row operations are not uncommon. To get rows out of columns right now I zip the column RDDs together. I use RDD.zip then flatten the tuples I get. I realize that RDD.zipPartitions might be faster. However, I believe an even better approach should be possible. Surely we can have a zip method that can combine a large variable number of RDDs? Can that be added to Spark-core? Or is there an alternative equally good or better approach? Cheers, Mohit.
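For anyone following along, a toy sketch of the zip-then-flatten approach described above (the column contents are made up; the RDDs must have the same number of partitions and the same element counts per partition):

    import org.apache.spark.rdd.RDD

    val ages: RDD[Int]     = sc.parallelize(Seq(30, 25, 41), 2)    // one "column"
    val names: RDD[String] = sc.parallelize(Seq("a", "b", "c"), 2) // another "column"

    // RDD.zip pairs the two columns element-wise into row tuples.
    val rows: RDD[(Int, String)] = ages.zip(names)

    // zipPartitions combines the iterators directly and avoids some per-element overhead,
    // but like zip it only accepts a small, fixed number of RDDs at a time.
    val rows2: RDD[(Int, String)] = ages.zipPartitions(names) { (it1, it2) => it1.zip(it2) }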
Re: Setting up jvm in pyspark from shell
The heap size of JVM can not been changed dynamically, so you need to config it before running pyspark. If you run it in local mode, you should config spark.driver.memory (in 1.1 or master). Or, you can use --driver-memory 2G (should work in 1.0+) On Wed, Sep 10, 2014 at 10:43 PM, Mohit Singh mohit1...@gmail.com wrote: Hi, I am using pyspark shell and am trying to create an rdd from numpy matrix rdd = sc.parallelize(matrix) I am getting the following error: JVMDUMP039I Processing dump event systhrow, detail java/lang/OutOfMemoryError at 2014/09/10 22:41:44 - please wait. JVMDUMP032I JVM requested Heap dump using '/global/u2/m/msingh/heapdump.20140910.224144.29660.0005.phd' in response to an event JVMDUMP010I Heap dump written to /global/u2/m/msingh/heapdump.20140910.224144.29660.0005.phd JVMDUMP032I JVM requested Java dump using '/global/u2/m/msingh/javacore.20140910.224144.29660.0006.txt' in response to an event JVMDUMP010I Java dump written to /global/u2/m/msingh/javacore.20140910.224144.29660.0006.txt JVMDUMP032I JVM requested Snap dump using '/global/u2/m/msingh/Snap.20140910.224144.29660.0007.trc' in response to an event JVMDUMP010I Snap dump written to /global/u2/m/msingh/Snap.20140910.224144.29660.0007.trc JVMDUMP013I Processed dump event systhrow, detail java/lang/OutOfMemoryError. Exception AttributeError: 'SparkContext' object has no attribute '_jsc' in bound method SparkContext.__del__ of pyspark.context.SparkContext object at 0x11f9450 ignored Traceback (most recent call last): File stdin, line 1, in module File /usr/common/usg/spark/1.0.2/python/pyspark/context.py, line 271, in parallelize jrdd = readRDDFromFile(self._jsc, tempFile.name, numSlices) File /usr/common/usg/spark/1.0.2/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 537, in __call__ File /usr/common/usg/spark/1.0.2/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile. : java.lang.OutOfMemoryError: Java heap space at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:279) at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:88) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55) at java.lang.reflect.Method.invoke(Method.java:618) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:804) I did try to setSystemProperty sc.setSystemProperty(spark.executor.memory, 20g) How do i increase jvm heap from the shell? -- Mohit When you want success as badly as you want the air, then you will get it. There is no other secret of success. -Socrates - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
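Concretely, the second option mentioned above would look something like this when launching the shell (2g is just an example value):

    ./bin/pyspark --driver-memory 2g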
Re: compiling spark source code
In the spark source folder, execute `sbt/sbt assembly` On Thu, Sep 11, 2014 at 8:27 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, Can someone please tell me how to compile the Spark source code so that my changes to the source take effect? I was trying to ship the jars to all the slaves, but in vain. -Karthik
Re: JMXSink for YARN deployment
Hi, Kousuke, Can you please explain in a bit more detail what you mean? I am new to Spark; I looked at https://spark.apache.org/docs/latest/submitting-applications.html and there seems to be no '--files' option. Do I just have to add '--files /path-to-metrics.properties'? Is that an undocumented ability? Thanks for the answer. On Thu, Sep 11, 2014 at 5:55 PM, Kousuke Saruta saru...@oss.nttdata.co.jp wrote: Hi Vladimir How about using the --files option with spark-submit? - Kousuke (2014/09/11 23:43), Vladimir Tretyakov wrote: Hi again, yeah, I've tried to use "spark.metrics.conf" before my question on the ML, but had no luck :( Any other ideas from somebody? It seems nobody uses metrics in YARN deployment mode. How about Mesos? I didn't try, but maybe Spark has the same difficulties on Mesos? PS: Spark is a great thing in general; it will be nice to see metrics in YARN/Mesos mode, not only in Standalone :) On Thu, Sep 11, 2014 at 5:25 PM, Shao, Saisai saisai.s...@intel.com wrote: I think you can try to use "spark.metrics.conf" to manually specify the path of metrics.properties, but the prerequisite is that each container should find this file in its local FS, because this file is loaded locally. Besides, I think this is a kind of workaround; a better solution would be to fix this some other way. Thanks Jerry *From:* Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com] *Sent:* Thursday, September 11, 2014 10:08 PM *Cc:* user@spark.apache.org *Subject:* Re: JMXSink for YARN deployment Hi Shao, thanks for the explanation, any ideas how to fix it? Where should I put the metrics.properties file? On Thu, Sep 11, 2014 at 4:18 PM, Shao, Saisai saisai.s...@intel.com wrote: Hi, I'm guessing the problem is that the driver or executor cannot get the metrics.properties configuration file in the YARN container, so the metrics system cannot load the right sinks. Thanks Jerry *From:* Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com] *Sent:* Thursday, September 11, 2014 7:30 PM *To:* user@spark.apache.org *Subject:* JMXSink for YARN deployment Hello, we at Sematext (https://apps.sematext.com/) are writing a monitoring tool for Spark and we came across one question: how to enable JMX metrics for a YARN deployment? We put *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink into the file $SPARK_HOME/conf/metrics.properties but it doesn't work. Everything works in Standalone mode, but not in YARN mode. Can somebody help? Thanks! PS: I've also found https://stackoverflow.com/questions/23529404/spark-on-yarn-how-to-send-metrics-to-graphite-sink/25786112 which has no answer.
Re: Spark on Raspberry Pi?
Limited memory could also cause you some problems and limit usability. If you're looking for a local testing environment, vagrant boxes may serve you much better. On Thu, Sep 11, 2014 at 6:18 AM, Chen He airb...@gmail.com wrote: Pi's bus speed, memory size and access speed, and processing ability are limited. The only benefit could be the power consumption. On Thu, Sep 11, 2014 at 8:04 AM, Sandeep Singh sand...@techaddict.me wrote: Has anyone tried using Raspberry Pi for Spark? How efficient is it to use around 10 Pi's for local testing env ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Raspberry-Pi-tp13965.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Out of memory with Spark Streaming
I did change it to 1 GB. It still ran out of memory, just a little later. The streaming job isn't handling a lot of data; it doesn't get more than 50 records every 2 seconds, and each record is no more than 500 bytes. On Sep 11, 2014 10:54 PM, Bharat Venkat bvenkat.sp...@gmail.com wrote: You could set spark.executor.memory to something bigger than the default (512 MB) On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am running a simple Spark Streaming program that pulls in data from Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps the data and persists it to a store. The program is running in local mode right now and runs out of memory after a while. I am yet to investigate heap dumps, but I think Spark isn't releasing memory after processing is complete. I have even tried changing the storage level to disk only. Help! Thanks, Aniket
Re: Spark on Raspberry Pi?
Just curious... What's the use case you are looking to implement? On Sep 11, 2014 10:50 PM, Daniil Osipov daniil.osi...@shazam.com wrote: Limited memory could also cause you some problems and limit usability. If you're looking for a local testing environment, vagrant boxes may serve you much better. On Thu, Sep 11, 2014 at 6:18 AM, Chen He airb...@gmail.com wrote: Pi's bus speed, memory size and access speed, and processing ability are limited. The only benefit could be the power consumption. On Thu, Sep 11, 2014 at 8:04 AM, Sandeep Singh sand...@techaddict.me wrote: Has anyone tried using Raspberry Pi for Spark? How efficient is it to use around 10 Pi's for a local testing env? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Raspberry-Pi-tp13965.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark on Raspberry Pi?
We've found that the Raspberry Pi is not enough for Hadoop/Spark, mainly because of the memory consumption. What we've built is a cluster formed from 22 Cubieboards, each containing 1 GB of RAM. Best regards, -chanwit -- Chanwit Kaewkasi linkedin.com/in/chanwit On Thu, Sep 11, 2014 at 8:04 PM, Sandeep Singh sand...@techaddict.me wrote: Has anyone tried using Raspberry Pi for Spark? How efficient is it to use around 10 Pi's for a local testing env? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Raspberry-Pi-tp13965.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark SQL JDBC
Even when I comment out those 3 lines, I still get the same error. Did someone solve this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-JDBC-tp11369p13992.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark on yarn history server + hdfs permissions issue
To answer my own question, in case someone else runs into this: the spark user needs to be in the same group on the namenode, and HDFS caches that information for what seems like at least an hour. It magically started working on its own. Greg From: Greg greg.h...@rackspace.com Date: Tuesday, September 9, 2014 2:30 PM To: user@spark.apache.org Subject: spark on yarn history server + hdfs permissions issue I am running Spark on YARN with the HDP 2.1 technical preview. I'm having issues getting the Spark history server permission to read the Spark event logs from HDFS. Both sides are configured to write/read logs from: hdfs:///apps/spark/events The history server is running as user spark, the jobs are running as user lavaqe. Both users are in the hdfs group on all the nodes in the cluster. That root logs folder is globally writeable, but owned by the spark user: drwxrwxrwx - spark hdfs 0 2014-09-09 18:19 /apps/spark/events All good so far. Spark jobs create subfolders and put their event logs in there just fine. The problem is that the history server, running as the spark user, cannot read those logs. They're written as the user that initiates the job, but still in the same hdfs group: drwxrwx--- - lavaqe hdfs 0 2014-09-09 19:24 /apps/spark/events/spark-pi-1410290714996 The files are group readable/writable, but this is the error I get: Permission denied: user=spark, access=READ_EXECUTE, inode=/apps/spark/events/spark-pi-1410290714996:lavaqe:hdfs:drwxrwx--- So, two questions, I guess: 1. Do group permissions just plain not work in hdfs or am I missing something? 2. Is there a way to tell Spark to log with more permissive permissions so the history server can read the generated logs? Greg
Re: Re[2]: HBase 0.96+ with Spark 1.0+
Dependency hell... My fav problem :). I had run into a similar issue with hbase and jetty. I cant remember thw exact fix, but is are excerpts from my dependencies that may be relevant: val hadoop2Common = org.apache.hadoop % hadoop-common % hadoop2Version excludeAll( ExclusionRule(organization = javax.servlet), ExclusionRule(organization = javax.servlet.jsp), ExclusionRule(organization = org.mortbay.jetty) ) val hadoop2MapRedClient = org.apache.hadoop % hadoop-mapreduce-client-core % hadoop2Version val hbase = org.apache.hbase % hbase % hbaseVersion excludeAll( ExclusionRule(organization = org.apache.maven.wagon), ExclusionRule(organization = org.jboss.netty), ExclusionRule(organization = org.mortbay.jetty), ExclusionRule(organization = org.jruby) // Don't need HBASE's jruby. It pulls in whole lot of other dependencies like joda-time. ) val sparkCore = org.apache.spark %% spark-core % sparkVersion val sparkStreaming = org.apache.spark %% spark-streaming % sparkVersion val sparkSQL = org.apache.spark %% spark-sql % sparkVersion val sparkHive = org.apache.spark %% spark-hive % sparkVersion val sparkRepl = org.apache.spark %% spark-repl % sparkVersion val sparkAll = Seq ( sparkCore excludeAll( ExclusionRule(organization = org.apache.hadoop)), // We assume hadoop 2 and hence omit hadoop 1 dependencies sparkSQL, sparkStreaming, hadoop2MapRedClient, hadoop2Common, org.mortbay.jetty % servlet-api % 3.0.20100224 ) On Sep 11, 2014 8:05 PM, sp...@orbit-x.de wrote: Hi guys, any luck with this issue, anyone? I aswell tried all the possible exclusion combos to a no avail. thanks for your ideas reinis -Original-Nachricht- Von: Stephen Boesch java...@gmail.com An: user user@spark.apache.org Datum: 28-06-2014 15:12 Betreff: Re: HBase 0.96+ with Spark 1.0+ Hi Siyuan, Thanks for the input. We are preferring to use the SparkBuild.scala instead of maven. I did not see any protobuf.version related settings in that file. But - as noted by Sean Owen - in any case the issue we are facing presently is about the duplicate incompatible javax.servlet entries - apparently from the org.mortbay artifacts. 2014-06-28 6:01 GMT-07:00 Siyuan he hsy...@gmail.com: Hi Stephen, I am using spark1.0+ HBase0.96.2. This is what I did: 1) rebuild spark using: mvn -Dhadoop.version=2.3.0 -Dprotobuf.version=2.5.0 -DskipTests clean package 2) In spark-env.sh, set SPARK_CLASSPATH = /path-to/hbase-protocol-0.96.2-hadoop2.jar Hopefully it can help. Siyuan On Sat, Jun 28, 2014 at 8:52 AM, Stephen Boesch java...@gmail.com wrote: Thanks Sean. I had actually already added exclusion rule for org.mortbay.jetty - and that had not resolved it. Just in case I used your precise formulation: val excludeMortbayJetty = ExclusionRule(organization = org.mortbay.jetty) .. 
,(org.apache.spark % spark-core_2.10 % sparkVersion withSources()).excludeAll(excludeMortbayJetty) ,(org.apache.spark % spark-sql_2.10 % sparkVersion withSources()).excludeAll(excludeMortbayJetty) However the same error still recurs: 14/06/28 05:48:35 INFO HttpServer: Starting HTTP Server [error] (run-main-0) java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package 2014-06-28 4:22 GMT-07:00 Sean Owen so...@cloudera.com: This sounds like an instance of roughly the same item as in https://issues.apache.org/jira/browse/SPARK-1949 Have a look at adding that exclude to see if it works. On Fri, Jun 27, 2014 at 10:21 PM, Stephen Boesch java...@gmail.com wrote: The present trunk is built and tested against HBase 0.94. I have tried various combinations of versions of HBase 0.96+ and Spark 1.0+ and all end up with 14/06/27 20:11:15 INFO HttpServer: Starting HTTP Server [error] (run-main-0) java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package java.lang.SecurityException: class javax.servlet.FilterRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:952) I have tried a number of different ways to exclude javax.servlet related jars. But none have avoided this error. Anyone have a (small-ish) build.sbt that works with later versions of HBase? - To unsubscribe, e-mail:
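For readers hitting the same thing: the dependency snippets quoted in this thread lost their quoting in the mail formatting. A cleaned-up sketch of the same exclusion idea for an sbt build (version strings are placeholders, not recommendations) looks roughly like:

    // build.sbt sketch; adjust versions to your cluster.
    val hadoop2Version = "2.3.0"
    val hbaseVersion   = "0.96.2-hadoop2"
    val sparkVersion   = "1.0.2"

    libraryDependencies ++= Seq(
      ("org.apache.hadoop" % "hadoop-common" % hadoop2Version).excludeAll(
        ExclusionRule(organization = "javax.servlet"),
        ExclusionRule(organization = "javax.servlet.jsp"),
        ExclusionRule(organization = "org.mortbay.jetty")
      ),
      "org.apache.hadoop" % "hadoop-mapreduce-client-core" % hadoop2Version,
      ("org.apache.hbase" % "hbase" % hbaseVersion).excludeAll(
        ExclusionRule(organization = "org.mortbay.jetty"),
        ExclusionRule(organization = "org.jruby")  // HBase's jruby pulls in many extra dependencies
      ),
      ("org.apache.spark" %% "spark-core" % sparkVersion).excludeAll(
        ExclusionRule(organization = "org.mortbay.jetty")
      )
    )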
Network requirements between Driver, Master, and Slave
Hello all, I'm trying to run a Driver on my local network with a deployment on EC2 and it's not working. I was wondering if either the master or slave instances (in standalone) connect back to the driver program. I outlined the details of my observations in a previous post but here is what I'm seeing: I have v1.1.0 installed (the new tag) on ec2 using the spark-ec2 script. I have the same version of the code built locally. I edited the master security group to allow inbound access from anywhere to 7077 and 8080. I see a connection take place. I see the workers fail with a timeout when any job is run. The master eventually removes the driver's job. I supposed this makes sense if there's a requirement for either the worker or the master to be on the same network as the driver. Is that the case? Thanks Jim -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Network-requirements-between-Driver-Master-and-Slave-tp13997.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
SparkSQL HiveContext TypeTag compile error
Hi, I have the following code snippet. It works fine in spark-shell, but in a standalone app it reports "No TypeTag available for MySchema" at compile time when calling hc.createSchemaRDD(rdd). Does anybody know what might be missing? Thanks, Du -- import org.apache.spark.sql.hive.HiveContext case class MySchema(key: Int, value: String) val rdd = sc.parallelize((1 to 10).map(i => MySchema(i, s"val$i"))) val schemaRDD = hc.createSchemaRDD(rdd) schemaRDD.registerTempTable("data") val rows = hc.sql("select * from data") rows.collect.foreach(println)
Re: SchemaRDD saveToCassandra
This might be a better question to ask on the cassandra mailing list as I believe that is where the exception is coming from. On Thu, Sep 11, 2014 at 2:37 AM, lmk lakshmi.muralikrish...@gmail.com wrote: Hi, My requirement is to extract certain fields from json files, run queries on them and save the result to cassandra. I was able to parse json , filter the result and save the rdd(regular) to cassandra. Now, when I try to read the json file through sqlContext , execute some queries on the same and then save the SchemaRDD to cassandra using saveToCassandra function, I am getting the following error: java.lang.NoSuchMethodException: Cannot resolve any suitable constructor for class org.apache.spark.sql.catalyst.expressions.Row Pls let me know if a spark SchemaRDD can be directly saved to cassandra just like the regular rdd? If that is not possible, is there any way to convert the schema RDD to a regular RDD ? Please advise. Regards, lmk -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SchemaRDD-saveToCassandra-tp13951.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
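On the "convert the SchemaRDD to a regular RDD" part of the question: a SchemaRDD is an RDD of Row objects, so one workaround is to map the rows into a plain case-class RDD before calling saveToCassandra. A sketch only, assuming the spark-cassandra-connector is on the classpath, that sqlContext is the SQLContext already created in the app, and with made-up column, keyspace and table names:

    import com.datastax.spark.connector._   // brings saveToCassandra into scope

    case class Record(id: Int, name: String)

    // Hypothetical query; the schema (id: Int, name: String) is assumed.
    val schemaRdd = sqlContext.sql("SELECT id, name FROM data")

    // Map the Rows into a regular RDD of case class instances.
    val plainRdd = schemaRdd.map(row => Record(row.getInt(0), row.getString(1)))

    plainRdd.saveToCassandra("my_keyspace", "my_table")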
Reading from multiple sockets
Still fairly new to Spark so please bear with me. I am trying to write a streaming app that has multiple workers that read from sockets and process the data. Here is a very simplified version of what I am trying to do: val carStreamSeq = (1 to 2).map(_ => ssc.socketTextStream(host, port)).toArray val unionCarStream = ssc.union(carStreamSeq) val connectedCars = unionCarStream.count() connectedCars.foreachRDD(r => println("count: " + r.collect().mkString)) I can see the workers are running and the data is coming through, but anything I put inside the 'foreachRDD' does not seem to get executed. I don't see the output of println in either stdout or in the master output file. What am I doing wrong? Varad -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Reading-from-multiple-sockets-tp14000.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: cannot read file from a local path
I am seeing this same issue with Spark 1.0.1 (tried with file:// for local file ) : scala val lines = sc.textFile(file:///home/monir/.bashrc) lines: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at console:12 scala val linecount = lines.count org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/monir/.bashrc at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:175) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) -Original Message- From: wsun Sent: Feb 03, 2014; 12:44pm To: u...@spark.incubator.apache.org Subject: cannot read file form a local path After installing spark 0.8.1 on a EC2 cluster, I launched Spark shell on the master. This is what happened to me: scalaval textFile=sc.textFile(README.md) 14/02/03 20:38:08 INFO storage.MemoryStore: ensureFreeSpace(34380) called with c urMem=0, maxMem=4082116853 14/02/03 20:38:08 INFO storage.MemoryStore: Block broadcast_0 stored as values t o memory (estimated size 33.6 KB, free 3.8 GB) textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at consol e:12 scala textFile.count() 14/02/03 20:38:39 WARN snappy.LoadSnappy: Snappy native library is available 14/02/03 20:38:39 INFO util.NativeCodeLoader: Loaded the native-hadoop library 14/02/03 20:38:39 INFO snappy.LoadSnappy: Snappy native library loaded org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs: //ec2-54-234-136-50.compute-1.amazonaws.com:9000/user/root/README.md at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.j ava:197) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.ja va:208) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:141) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199) at scala.Option.getOrElse(Option.scala:108) at org.apache.spark.rdd.RDD.partitions(RDD.scala:199) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:26) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199) at scala.Option.getOrElse(Option.scala:108) at org.apache.spark.rdd.RDD.partitions(RDD.scala:199) at org.apache.spark.SparkContext.runJob(SparkContext.scala:886) at org.apache.spark.rdd.RDD.count(RDD.scala:698) Spark seems looking for README.md in hdfs. However, I did not specify the file is located in hdfs. I am just wondering if there any configuration in Spark that force Spark to read files from local file system. Thanks in advance for any helps. wp - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark SQL and running parquet tables?
Michael Armbrust wrote You'll need to run parquetFile(path).registerTempTable(name) to refresh the table. I'm not seeing that function on SchemaRDD in 1.0.2, is there something I'm missing? SchemaRDD Scaladoc http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-and-running-parquet-tables-tp13987p14002.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
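For anyone else hitting this: in the 1.0.x line the method on SchemaRDD is spelled registerAsTable; registerTempTable is the 1.1 name (with the old spelling kept, as far as I can tell, as a deprecated alias). A sketch, assuming a path that really contains Parquet data and a sqlContext created elsewhere:

    // Re-read and re-register the Parquet data to refresh the table.
    val parquet = sqlContext.parquetFile("/path/to/table.parquet")

    parquet.registerAsTable("name")     // Spark 1.0.x spelling
    parquet.registerTempTable("name")   // Spark 1.1.0 spelling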
Re: SparkSQL HiveContext TypeTag compile error
Solved it. The problem occurred because the case class was defined within a test case in FunSuite. Moving the case class definition out of the test fixed the problem. From: Du Li l...@yahoo-inc.com.INVALID Date: Thursday, September 11, 2014 at 11:25 AM To: user@spark.apache.org Subject: SparkSQL HiveContext TypeTag compile error Hi, I have the following code snippet. It works fine in spark-shell, but in a standalone app it reports "No TypeTag available for MySchema" at compile time when calling hc.createSchemaRDD(rdd). Does anybody know what might be missing? Thanks, Du -- import org.apache.spark.sql.hive.HiveContext case class MySchema(key: Int, value: String) val rdd = sc.parallelize((1 to 10).map(i => MySchema(i, s"val$i"))) val schemaRDD = hc.createSchemaRDD(rdd) schemaRDD.registerTempTable("data") val rows = hc.sql("select * from data") rows.collect.foreach(println)
single worker vs multiple workers on each machine
Hi There, I am new to Spark. When you have a lot of memory on each machine of the cluster, is it better to run multiple workers with limited memory on each machine, or a single worker with access to the majority of the machine's memory? If the answer is "it depends", would you please elaborate? Thanks, Mike
spark sql - create new_table as select * from table
Hi, I am trying to create a new table from a select query as follows: CREATE TABLE IF NOT EXISTS new_table ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/user/test/new_table' AS select * from table this works in Hive, but in Spark SQL (1.0.2) I am getting Unsupported language features in query error. Could you suggest why I am getting this? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-new-table-as-select-from-table-tp14006.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: spark sql - create new_table as select * from table
The implementation of SparkSQL is currently incomplete. You may try it out with HiveContext instead of SQLContext. On 9/11/14, 1:21 PM, jamborta jambo...@gmail.com wrote: Hi, I am trying to create a new table from a select query as follows: CREATE TABLE IF NOT EXISTS new_table ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/user/test/new_table' AS select * from table this works in Hive, but in Spark SQL (1.0.2) I am getting Unsupported language features in query error. Could you suggest why I am getting this? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-new-t able-as-select-from-table-tp14006.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: SparkSQL HiveContext TypeTag compile error
Just moving it out of the test is not enough. You must move the case class definition to the top level. Otherwise it reports a runtime "task not serializable" error when executing collect(). From: Du Li l...@yahoo-inc.com.INVALID Date: Thursday, September 11, 2014 at 12:33 PM To: user@spark.apache.org Subject: Re: SparkSQL HiveContext TypeTag compile error Solved it. The problem occurred because the case class was defined within a test case in FunSuite. Moving the case class definition out of the test fixed the problem. From: Du Li l...@yahoo-inc.com.INVALID Date: Thursday, September 11, 2014 at 11:25 AM To: user@spark.apache.org Subject: SparkSQL HiveContext TypeTag compile error Hi, I have the following code snippet. It works fine in spark-shell, but in a standalone app it reports "No TypeTag available for MySchema" at compile time when calling hc.createSchemaRDD(rdd). Does anybody know what might be missing? Thanks, Du -- import org.apache.spark.sql.hive.HiveContext case class MySchema(key: Int, value: String) val rdd = sc.parallelize((1 to 10).map(i => MySchema(i, s"val$i"))) val schemaRDD = hc.createSchemaRDD(rdd) schemaRDD.registerTempTable("data") val rows = hc.sql("select * from data") rows.collect.foreach(println)
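A sketch of the arrangement that finally worked, per the messages above: the case class sits at the top level of the file, outside the suite. The suite, app and table names here are invented, and the HiveContext setup is only illustrative:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext
    import org.scalatest.FunSuite

    // Top level of the file -- not inside the suite, not inside a test.
    case class MySchema(key: Int, value: String)

    class MySchemaSuite extends FunSuite {
      private val sc = new SparkContext("local[2]", "MySchemaSuite")
      private val hc = new HiveContext(sc)

      test("createSchemaRDD with a top-level case class") {
        val rdd = sc.parallelize((1 to 10).map(i => MySchema(i, s"val$i")))
        val schemaRDD = hc.createSchemaRDD(rdd)
        schemaRDD.registerTempTable("data")   // registerAsTable on 1.0.x
        hc.sql("select * from data").collect().foreach(println)
      }
    }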
Re: spark sql - create new_table as select * from table
Thanks. This was actually using HiveContext. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-new-table-as-select-from-table-tp14006p14009.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
RE: cannot read file from a local path
It seems starting spark-shell in local mode solves this. But even then it cannot read a file whose name begins with a '.': MASTER=local[4] ./bin/spark-shell . scala val lineCount = sc.textFile(/home/monir/ref).count lineCount: Long = 68 scala val lineCount2 = sc.textFile(/home/monir/.ref).count org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/monir/.ref Though I am OK with running spark-shell in local mode to get basic examples running, I was wondering if getting to local files on the cluster nodes is possible when all of the worker nodes have the file in question in their local file system. Still fairly new to Spark, so bear with me if this is easily tunable by some config params. Bests, -Monir -Original Message- From: Mozumder, Monir Sent: Thursday, September 11, 2014 12:15 PM To: user@spark.apache.org Subject: RE: cannot read file from a local path I am seeing this same issue with Spark 1.0.1 (tried with file:// for a local file): scala val lines = sc.textFile(file:///home/monir/.bashrc) lines: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at console:12 scala val linecount = lines.count org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/monir/.bashrc at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:175) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202) -Original Message- From: wsun Sent: Feb 03, 2014; 12:44pm To: u...@spark.incubator.apache.org Subject: cannot read file from a local path After installing spark 0.8.1 on a EC2 cluster, I launched Spark shell on the master.
This is what happened to me: scalaval textFile=sc.textFile(README.md) 14/02/03 20:38:08 INFO storage.MemoryStore: ensureFreeSpace(34380) called with c urMem=0, maxMem=4082116853 14/02/03 20:38:08 INFO storage.MemoryStore: Block broadcast_0 stored as values t o memory (estimated size 33.6 KB, free 3.8 GB) textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at consol e:12 scala textFile.count() 14/02/03 20:38:39 WARN snappy.LoadSnappy: Snappy native library is available 14/02/03 20:38:39 INFO util.NativeCodeLoader: Loaded the native-hadoop library 14/02/03 20:38:39 INFO snappy.LoadSnappy: Snappy native library loaded org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs: //ec2-54-234-136-50.compute-1.amazonaws.com:9000/user/root/README.md at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.j ava:197) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.ja va:208) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:141) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199) at scala.Option.getOrElse(Option.scala:108) at org.apache.spark.rdd.RDD.partitions(RDD.scala:199) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:26) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:199) at scala.Option.getOrElse(Option.scala:108) at org.apache.spark.rdd.RDD.partitions(RDD.scala:199) at org.apache.spark.SparkContext.runJob(SparkContext.scala:886) at org.apache.spark.rdd.RDD.count(RDD.scala:698) Spark seems looking for README.md in hdfs. However, I did not specify the file is located in hdfs. I am just wondering if there any configuration in Spark that force Spark to read files from local file system. Thanks in advance for any helps. wp - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re[2]: HBase 0.96+ with Spark 1.0+
Thank you, Aniket for your hint! Alas, I am facing really hellish situation as it seems, because I have integration tests using BOTH spark and HBase (Minicluster). Thus I get either: class javax.servlet.ServletRegistration's signer information does not match signer information of other classes in the same package java.lang.SecurityException: class javax.servlet.ServletRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:943) at java.lang.ClassLoader.preDefineClass(ClassLoader.java:657) at java.lang.ClassLoader.defineClass(ClassLoader.java:785) or: [info] Cause: java.lang.ClassNotFoundException: org.mortbay.jetty.servlet.Context [info] at java.net.URLClassLoader$1.run(URLClassLoader.java:366) [info] at java.net.URLClassLoader$1.run(URLClassLoader.java:355) [info] at java.security.AccessController.doPrivileged(Native Method) [info] at java.net.URLClassLoader.findClass(URLClassLoader.java:354) [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:423) [info] at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) [info] at java.lang.ClassLoader.loadClass(ClassLoader.java:356) [info] at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:661) [info] at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:552) [info] at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:720) I am searching the web already for a week trying to figure out how to make this work :-/ all the help or hints are greatly appreciated reinis -Original-Nachricht- Von: Aniket Bhatnagar aniket.bhatna...@gmail.com An: sp...@orbit-x.de Cc: user user@spark.apache.org Datum: 11-09-2014 20:00 Betreff: Re: Re[2]: HBase 0.96+ with Spark 1.0+ Dependency hell... My fav problem :). I had run into a similar issue with hbase and jetty. I cant remember thw exact fix, but is are excerpts from my dependencies that may be relevant: val hadoop2Common = org.apache.hadoop % hadoop-common % hadoop2Version excludeAll( ExclusionRule(organization = javax.servlet), ExclusionRule(organization = javax.servlet.jsp), ExclusionRule(organization = org.mortbay.jetty) ) val hadoop2MapRedClient = org.apache.hadoop % hadoop-mapreduce-client-core % hadoop2Version val hbase = org.apache.hbase % hbase % hbaseVersion excludeAll( ExclusionRule(organization = org.apache.maven.wagon), ExclusionRule(organization = org.jboss.netty), ExclusionRule(organization = org.mortbay.jetty), ExclusionRule(organization = org.jruby) // Don't need HBASE's jruby. It pulls in whole lot of other dependencies like joda-time. ) val sparkCore = org.apache.spark %% spark-core % sparkVersion val sparkStreaming = org.apache.spark %% spark-streaming % sparkVersion val sparkSQL = org.apache.spark %% spark-sql % sparkVersion val sparkHive = org.apache.spark %% spark-hive % sparkVersion val sparkRepl = org.apache.spark %% spark-repl % sparkVersion val sparkAll = Seq ( sparkCore excludeAll( ExclusionRule(organization = org.apache.hadoop)), // We assume hadoop 2 and hence omit hadoop 1 dependencies sparkSQL, sparkStreaming, hadoop2MapRedClient, hadoop2Common, org.mortbay.jetty % servlet-api % 3.0.20100224 ) On Sep 11, 2014 8:05 PM, sp...@orbit-x.de wrote: Hi guys, any luck with this issue, anyone? I aswell tried all the possible exclusion combos to a no avail. 
thanks for your ideas reinis -Original-Nachricht- Von: Stephen Boesch java...@gmail.com An: user user@spark.apache.org Datum: 28-06-2014 15:12 Betreff: Re: HBase 0.96+ with Spark 1.0+ Hi Siyuan, Thanks for the input. We are preferring to use the SparkBuild.scala instead of maven. I did not see any protobuf.version related settings in that file. But - as noted by Sean Owen - in any case the issue we are facing presently is about the duplicate incompatible javax.servlet entries - apparently from the org.mortbay artifacts. 2014-06-28 6:01 GMT-07:00 Siyuan he hsy...@gmail.com: Hi Stephen, I am using spark1.0+ HBase0.96.2. This is what I did: 1) rebuild spark using: mvn -Dhadoop.version=2.3.0 -Dprotobuf.version=2.5.0 -DskipTests clean package 2) In spark-env.sh, set SPARK_CLASSPATH = /path-to/hbase-protocol-0.96.2-hadoop2.jar Hopefully it can help. Siyuan On Sat, Jun 28, 2014 at 8:52 AM, Stephen Boesch java...@gmail.com wrote: Thanks Sean. I had actually already added exclusion rule for org.mortbay.jetty - and that had not resolved it. Just in case I used your precise formulation: val excludeMortbayJetty = ExclusionRule(organization = org.mortbay.jetty) .. ,(org.apache.spark % spark-core_2.10 % sparkVersion withSources()).excludeAll(excludeMortbayJetty) ,(org.apache.spark % spark-sql_2.10 %
Re: Out of memory with Spark Streaming
Which version of spark are you running? If you are running the latest one, then could try running not a window but a simple event count on every 2 second batch, and see if you are still running out of memory? TD On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I did change it to be 1 gb. It still ran out of memory but a little later. The streaming job isnt handling a lot of data. In every 2 seconds, it doesn't get more than 50 records. Each record size is not more than 500 bytes. On Sep 11, 2014 10:54 PM, Bharat Venkat bvenkat.sp...@gmail.com wrote: You could set spark.executor.memory to something bigger than the default (512mb) On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I am running a simple Spark Streaming program that pulls in data from Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps data and persists to a store. The program is running in local mode right now and runs out of memory after a while. I am yet to investigate heap dumps but I think Spark isn't releasing memory after processing is complete. I have even tried changing storage level to disk only. Help! Thanks, Aniket
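A minimal sketch of the experiment suggested above: no window, no store, just count each 2-second batch and print it. createKinesisStream below is a hypothetical stand-in for however the job currently builds its input DStream; everything else is deliberately as small as possible:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.dstream.DStream

    val conf = new SparkConf().setAppName("count-only-test").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Hypothetical helper: however the application currently creates its Kinesis DStream.
    def createKinesisStream(ssc: StreamingContext): DStream[Array[Byte]] = ???

    val stream = createKinesisStream(ssc)
    stream.count().print()   // one count per 2-second batch, nothing else

    ssc.start()
    ssc.awaitTermination()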
Re: Spark streaming stops computing while the receiver keeps running without any errors reported
This is very puzzling, given that this works in the local mode. Does running the kinesis example work with your spark-submit? https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala The instructions are present in the streaming guide. https://github.com/apache/spark/blob/master/docs/streaming-kinesis-integration.md If that does not work on cluster, then I would see the streaming UI for the number records that are being received, and the stages page for whether jobs are being executed for every batch or not. Can tell use whether that is working well. Also ccing, chris fregly who wrote Kinesis integration. TD On Thu, Sep 11, 2014 at 4:51 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Hi all I am trying to run kinesis spark streaming application on a standalone spark cluster. The job works find in local mode but when I submit it (using spark-submit), it doesn't do anything. I enabled logs for org.apache.spark.streaming.kinesis package and I regularly get the following in worker logs: 14/09/11 11:41:25 DEBUG KinesisRecordProcessor: Stored: Worker x.x.x.x:b88a9210-cbb9-4c31-8da7-35fd92faba09 stored 34 records for shardId shardId- 14/09/11 11:41:25 DEBUG KinesisRecordProcessor: Stored: Worker x.x.x.x:b2e9c20f-470a-44fe-a036-630c776919fb stored 31 records for shardId shardId-0001 But the job does not perform any operations defined on DStream. To investigate this further, I changed the kinesis-asl's KinesisUtils to perform the following computation on the DStream created using ssc.receiverStream(new KinesisReceiver...): stream.count().foreachRDD(rdd = rdd.foreach(tuple = logInfo(Emitted + tuple))) Even the above line does not results in any corresponding log entries both in driver and worker logs. The only relevant logs that I could find in driver logs are: 14/09/11 11:40:58 INFO DAGScheduler: Stage 2 (foreach at KinesisUtils.scala:68) finished in 0.398 s 14/09/11 11:40:58 INFO SparkContext: Job finished: foreach at KinesisUtils.scala:68, took 4.926449985 s 14/09/11 11:40:58 INFO JobScheduler: Finished job streaming job 1410435653000 ms.0 from job set of time 1410435653000 ms 14/09/11 11:40:58 INFO JobScheduler: Starting job streaming job 1410435653000 ms.1 from job set of time 1410435653000 ms 14/09/11 11:40:58 INFO SparkContext: Starting job: foreach at KinesisUtils.scala:68 14/09/11 11:40:58 INFO DAGScheduler: Registering RDD 13 (union at DStream.scala:489) 14/09/11 11:40:58 INFO DAGScheduler: Got job 3 (foreach at KinesisUtils.scala:68) with 2 output partitions (allowLocal=false) 14/09/11 11:40:58 INFO DAGScheduler: Final stage: Stage 5(foreach at KinesisUtils.scala:68) After the above logs, nothing shows up corresponding to KinesisUtils. I am out of ideas on this one and any help on this would greatly appreciated. Thanks, Aniket
Re: Re[2]: HBase 0.96+ with Spark 1.0+
This was already answered at the bottom of this same thread -- read below. On Thu, Sep 11, 2014 at 9:51 PM, sp...@orbit-x.de wrote: class javax.servlet.ServletRegistration's signer information does not match signer information of other classes in the same package java.lang.SecurityException: class javax.servlet.ServletRegistration's signer information does not match signer information of other classes in the same package at java.lang.ClassLoader.checkCerts(ClassLoader.java:943) at java.lang.ClassLoader.preDefineClass(ClassLoader.java:657) at java.lang.ClassLoader.defineClass(ClassLoader.java:785) 2014-06-28 4:22 GMT-07:00 Sean Owen so...@cloudera.com: This sounds like an instance of roughly the same item as in https://issues.apache.org/jira/browse/SPARK-1949 Have a look at adding that exclude to see if it works. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
SparkContext and multi threads
Hi, I'm trying to make Spark work in a multithreaded Java application. What I'm trying to do is: - create a single SparkContext - create multiple SparkILoop and SparkIMain instances - inject the created SparkContext into each SparkIMain interpreter. A thread is created for every user request; it takes a SparkILoop and interprets some code. My problem is: - if a thread takes the first SparkILoop instance, then everything works fine. - if a thread takes any other SparkILoop instance, Spark cannot find the closures / case classes that I defined inside the interpreter. I read some previous topics and I think it's related to SparkEnv and ClosureCleaner. I tried SparkEnv.set(env) with the env I can get right after the SparkContext is created, but I still get the class-not-found exception. Can anyone give me some idea? Thanks. Best, moon
Re: spark sql - create new_table as select * from table
What is the schema of table? On Thu, Sep 11, 2014 at 4:30 PM, jamborta jambo...@gmail.com wrote: thanks. this was actually using hivecontext. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-new-table-as-select-from-table-tp14006p14009.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Re: Spark SQL -- more than two tables for join
1.0.1 does not have the support on outer joins (added in 1.1). Can you try 1.1 branch? On Wed, Sep 10, 2014 at 9:28 PM, boyingk...@163.com boyingk...@163.com wrote: Hi,michael : I think Arthur.hk.chan arthur.hk.c...@gmail.com isn't here now,I Can Show something: 1)my spark version is 1.0.1 2) when I use multiple join ,like this: sql(SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on (youhao_age.rowkey=youhao_totalKiloMeter.rowkey)) youhao_data,youhao_age,youhao_totalKiloMeter were registerAsTable 。 I take the Exception: Exception in thread main java.lang.RuntimeException: [1.90] failure: ``UNION'' expected but `left' found SELECT * FROM youhao_data left join youhao_age on (youhao_data.rowkey=youhao_age.rowkey) left join youhao_totalKiloMeter on (youhao_age.rowkey=youhao_totalKiloMeter.rowkey) ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:69) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:181) at org.apache.spark.examples.sql.SparkSQLHBaseRelation$.main(SparkSQLHBaseRelation.scala:140) at org.apache.spark.examples.sql.SparkSQLHBaseRelation.main(SparkSQLHBaseRelation.scala) -- boyingk...@163.com *From:* Michael Armbrust mich...@databricks.com *Date:* 2014-09-11 00:28 *To:* arthur.hk.c...@gmail.com arthur.hk.c...@gmail.com *CC:* arunshell87 shell.a...@gmail.com; u...@spark.incubator.apache.org *Subject:* Re: Spark SQL -- more than two tables for join What version of Spark SQL are you running here? I think a lot of your concerns have likely been addressed in more recent versions of the code / documentation. (Spark 1.1 should be published in the next few days) In particular, for serious applications you should use a HiveContext and HiveQL as this is a much more complete implementation of a SQL Parser. The one in SQL context is only suggested if the Hive dependencies conflict with your application. 1) spark sql does not support multiple join This is not true. What problem were you running into? 2) spark left join: has performance issue Can you describe your data and query more? 3) spark sql’s cache table: does not support two-tier query I'm not sure what you mean here. 4) spark sql does not support repartition You can repartition SchemaRDDs in the same way as normal RDDs.
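For reference, a sketch of the same three-table query against a Spark 1.1 SQLContext, which (per the reply above) is where outer-join support landed in the SQL parser; the table and column names are taken from the quoted message, and the joins are spelled out as LEFT OUTER JOIN:

    val joined = sqlContext.sql("""
      SELECT *
      FROM youhao_data
      LEFT OUTER JOIN youhao_age ON youhao_data.rowkey = youhao_age.rowkey
      LEFT OUTER JOIN youhao_totalKiloMeter ON youhao_age.rowkey = youhao_totalKiloMeter.rowkey
    """)
    joined.collect().foreach(println)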
Spark Streaming in 1 hour batch duration RDD files gets lost
Hi, Our spark streaming app is configured to pull data from Kafka in 1 hour batch duration which performs aggregation of data by specific keys and store the related RDDs to HDFS in the transform phase. We have tried checkpoint of 7 days on the DStream of Kafka to ensure that the generated stream does not expire/lost. The first hour gets completed, but on the succeeding hours it always fails with exception: Job aborted due to stage failure: Task 39.0:1 failed 64 times, most recent failure: Exception failure in TID 27578 on host X.ec2.internal: java.io.FileNotFoundException: /data/run/spark/work/spark-local-20140911175744-4ddf/0d/shuffle_3_1_311 (No such file or directory) java.io.FileOutputStream.open(Native Method) java.io.FileOutputStream.init(FileOutputStream.java:221) org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:116) org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:177) org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:161) org.apache.spark.scheduler.ShuffleMapTask$$anonfun$runTask$1.apply(ShuffleMapTask.scala:158) scala.collection.Iterator$class.foreach(Iterator.scala:727) Environment: CDH version: 2.3.0-cdh5.1.0 Spark version: 1.0.0-cdh5.1.0 Spark settings: spark.io.compression.codec : org.apache.spark.io.SnappyCompressionCodec spark.serializer : org.apache.spark.serializer.KryoSerializer spark.kryoserializer.buffer.mb : 2 spark.local.dir : /data/run/spark/work/ spark.scheduler.mode : FAIR spark.rdd.compress : false spark.task.maxFailures : 64 spark.shuffle.use.netty : false spark.shuffle.spill : true spark.streaming.checkpoint.dir : hdfs://X.ec2.internal:8020/user/spark/checkpoints/event-storage spark.akka.threads : 4 spark.cores.max : 4 spark.executor.memory : 3g spark.shuffle.consolidateFiles : false spark.streaming.unpersist : true spark.logConf : true spark.shuffle.spill.compress : true Thanks, JL -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-in-1-hour-batch-duration-RDD-files-gets-lost-tp14027.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
Backwards RDD
Iterating over an RDD gives you each partition in order of its split index. I'd like to be able to get each partition in reverse order, but I'm having difficulty implementing the compute() method. I thought I could do something like this: override def getDependencies: Seq[Dependency[_]] = { Seq(new NarrowDependency[T](prev) { def getParents(partitionId: Int): Seq[Int] = { Seq(prev.partitions.size - partitionId - 1) } }) } override def compute(split: Partition, context: TaskContext): Iterator[T] = { firstParent[T].iterator(split, context).toArray.reverseIterator } But that doesn't work. How do I get one split to depend on exactly one split from the parent that does not match indices?
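One way to express "one split depends on a different-index parent split" is to carry the remapped parent partition inside the child partition object, so compute() reads that parent partition instead of reusing its own index. A hedged sketch against the public RDD API (class and field names are invented), broadly similar to how Spark's own CoalescedRDD maps child partitions onto arbitrary parent partitions:

    import scala.reflect.ClassTag
    import org.apache.spark.{Dependency, NarrowDependency, Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    // Each child partition remembers which parent partition it should read.
    case class ReversedPartition(index: Int, parent: Partition) extends Partition

    class ReversedRDD[T: ClassTag](prev: RDD[T]) extends RDD[T](prev.context, Nil) {

      override def getPartitions: Array[Partition] = {
        val parents = prev.partitions
        parents.indices.map { i =>
          ReversedPartition(i, parents(parents.length - i - 1)): Partition
        }.toArray
      }

      override def getDependencies: Seq[Dependency[_]] = Seq(
        new NarrowDependency[T](prev) {
          def getParents(partitionId: Int): Seq[Int] =
            Seq(prev.partitions.length - partitionId - 1)
        }
      )

      override def compute(split: Partition, context: TaskContext): Iterator[T] = {
        // Read the remapped parent partition rather than the one matching the child's own index.
        val p = split.asInstanceOf[ReversedPartition]
        prev.iterator(p.parent, context).toArray.reverseIterator
      }
    }

    // Usage sketch: new ReversedRDD(someRdd).collect()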
Announcing Spark 1.1.0!
I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is the second release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 171 developers! This release brings operational and performance improvements in Spark core including a new implementation of the Spark shuffle designed for very large scale workloads. Spark 1.1 adds significant extensions to the newest Spark modules, MLlib and Spark SQL. Spark SQL introduces a JDBC server, byte code generation for fast expression evaluation, a public types API, JSON support, and other features and optimizations. MLlib introduces a new statistics library along with several new algorithms and optimizations. Spark 1.1 also builds out Spark's Python support and adds new components to the Spark Streaming module. Visit the release notes [1] to read about the new features, or download [2] the release today. [1] http://spark.eu.apache.org/releases/spark-release-1-1-0.html [2] http://spark.eu.apache.org/downloads.html NOTE: SOME ASF DOWNLOAD MIRRORS WILL NOT CONTAIN THE RELEASE FOR SEVERAL HOURS. Please e-mail me directly for any type-o's in the release notes or name listing. Thanks, and congratulations! - Patrick - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Configuring Spark for heterogeneous hardware
So I have a bunch of hardware with different core and memory setups. Is there a way to do one of the following: 1. Express a ratio of cores to memory to retain. The Spark worker config would represent all of the cores and all of the memory usable for any application, and the application would take a fraction that sustains the ratio. Say I have 4 cores and 20 GB of RAM. I'd like the worker to take 4/20 and the executor to take 5 GB for each of the 4 cores, thus maxing both out. If there were only 16 GB with the same ratio requirement, it would only take 3 cores and 12 GB in a single executor and leave the rest. 2. Have the executor take whole-number ratios of what it needs. Say it is configured for 2/8G and the worker has 4/20. So we can give the executor 2/8G (which is true now) or we can instead give it 4/16G, maxing out one of the two parameters. Either way would allow me to get my heterogeneous hardware all participating in the work of my Spark cluster, presumably without endangering Spark's assumption of homogeneous execution environments in the dimensions of memory and cores. If there's any way to do this, please enlighten me.
History server: ERROR ReplayListenerBus: Exception in parsing Spark event log
Hi, I am using Spark 1.0.2 on a mesos cluster. After I run my job, when I try to look at the detailed application stats using a history server@18080, the stats don't show up for some of the jobs even though the job completed successfully and the event logs are written to the log folder. The log from the history server execution is attached below - looks like it is encountering some parsing error when reading the EVENT_LOG file ( I have not modified this file). Basically the line that says Malformed line seems to be truncating the first path (instead of amd64, it shows up as a d64). Does the history server have any String buffer limitations that would be causing this problem? Also, I want to point out that this problem does not happen all the time - during some runs the app details do show up. However this is quite unpredictable. The same job when I ran using Spark 1.0.1 in standalone mode (i.e without using a history server), showed up on the application details page. I am not sure if this is a problem with the history server or specifically with version 1.0.2. Is it possible to fix this problem, as I would like to use the application details? thanks 14/09/11 20:50:55 ERROR ReplayListenerBus: Exception in parsing Spark event log file:/mapr/applogs_spark_mesos/spark_test-1410468489529/EVENT_LOG_1 com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'd64': was expecting at [Source: java.io.StringReader@2d51a56a; line: 1, column: 4] at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1524) at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:557) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._reportInvalidToken(ReaderBasedJsonParser.java:2042) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._handleOddValue(ReaderBasedJsonParser.java:1412) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:679) at com.fasterxml.jackson.databind.ObjectMapper._initForReading(ObjectMapper.java:3024) at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:2971) at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2091) at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:19) at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:44) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69) at org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55) at org.apache.spark.deploy.history.HistoryServer.org$apache$spark$deploy$history$HistoryServer$$renderSparkUI(HistoryServer.scala:182) at org.apache.spark.deploy.history.HistoryServer$$anonfun$checkForLogs$3.apply(HistoryServer.scala:149) at org.apache.spark.deploy.history.HistoryServer$$anonfun$checkForLogs$3.apply(HistoryServer.scala:146) at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.deploy.history.HistoryServer.checkForLogs(HistoryServer.scala:146) at org.apache.spark.deploy.history.HistoryServer$$anon$1$$anonfun$run$1.apply$mcV$sp(HistoryServer.scala:77) at org.apache.spark.deploy.history.HistoryServer$$anon$1$$anonfun$run$1.apply(HistoryServer.scala:74) at org.apache.spark.deploy.history.HistoryServer$$anon$1$$anonfun$run$1.apply(HistoryServer.scala:74) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1160) at org.apache.spark.deploy.history.HistoryServer$$anon$1.run(HistoryServer.scala:73) ReplayListenerBus: Malformed line: d64/jre/lib/jsse.jar:/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/jce.jar:/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/charsets.jar:/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/rhino.jar:/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/jfr.jar:/usr/lib/jvm/java-7-openjdk-amd64/jre/classes,file.encoding:ISO-8859-1,user.timezone:Etc/UTC,java.specification.vendor:Oracle Corporation,sun.java.launcher:SUN_STANDARD,os.version:3.13.0-32-generic,sun.os.patch.level:unknown,java.vm.specification.vendor:Oracle
RE: Announcing Spark 1.1.0!
I see the binary packages include hadoop 1, 2.3 and 2.4. Does Spark 1.1.0 support hadoop 2.5.0 at below address? http://hadoop.apache.org/releases.html#11+August%2C+2014%3A+Release+2.5.0+available -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Friday, September 12, 2014 8:13 AM To: d...@spark.apache.org; user@spark.apache.org Subject: Announcing Spark 1.1.0! I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is the second release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 171 developers! This release brings operational and performance improvements in Spark core including a new implementation of the Spark shuffle designed for very large scale workloads. Spark 1.1 adds significant extensions to the newest Spark modules, MLlib and Spark SQL. Spark SQL introduces a JDBC server, byte code generation for fast expression evaluation, a public types API, JSON support, and other features and optimizations. MLlib introduces a new statistics library along with several new algorithms and optimizations. Spark 1.1 also builds out Spark's Python support and adds new components to the Spark Streaming module. Visit the release notes [1] to read about the new features, or download [2] the release today. [1] http://spark.eu.apache.org/releases/spark-release-1-1-0.html [2] http://spark.eu.apache.org/downloads.html NOTE: SOME ASF DOWNLOAD MIRRORS WILL NOT CONTAIN THE RELEASE FOR SEVERAL HOURS. Please e-mail me directly for any type-o's in the release notes or name listing. Thanks, and congratulations! - Patrick - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
coalesce on SchemaRDD in pyspark
Hi All, I'm having some trouble with the coalesce and repartition functions for SchemaRDD objects in pyspark. When I run: sqlCtx.jsonRDD(sc.parallelize(['{foo:bar}', '{foo:baz}'])).coalesce(1) I get this error: Py4JError: An error occurred while calling o94.coalesce. Trace: py4j.Py4JException: Method coalesce([class java.lang.Integer, class java.lang.Boolean]) does not exist For context, I have a dataset stored in a parquet file, and I'm using SQLContext to make several queries against the data. I then register the results of these queries as new tables in the SQLContext. Unfortunately each new table has the same number of partitions as the original (despite being much smaller). Hence my interest in coalesce and repartition. Has anybody else encountered this bug? Is there an alternate workflow I should consider? I am running the 1.1.0 binaries released today. best, -Brad
Re: Announcing Spark 1.1.0!
Hi, On Fri, Sep 12, 2014 at 9:12 AM, Patrick Wendell pwend...@gmail.com wrote: I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is the second release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 171 developers! Great, congratulations!! The release notes read great! Seems like if I wait long enough for new Spark releases, my applications will build themselves in the end ;-) Tobias
RE: Announcing Spark 1.1.0!
I’m not sure if I’m completely answering your question here but I’m currently working (on OSX) with Hadoop 2.5 and I used the Spark 1.1 with Hadoop 2.4 without any issues.
Re: Table not found: using jdbc console to query sparksql hive thriftserver
It sort of depends on the definition of efficiently. From a work flow perspective I would agree but from an I/O perspective, wouldn’t there be the same multi-pass from the standpoint of the Hive context needing to push the data into HDFS? Saying this, if you’re pushing the data into HDFS and then creating Hive tables via load (vs. a reference point ala external tables), I would agree with you. And thanks for correcting me, the registerTempTable is in the SqlContext. On September 10, 2014 at 13:47:24, Du Li (l...@yahoo-inc.com) wrote: Hi Denny, There is a related question by the way. I have a program that reads in a stream of RDDs, each of which is to be loaded into a hive table as one partition. Currently I do this by first writing the RDDs to HDFS and then loading them to hive, which requires multiple passes of HDFS I/O and serialization/deserialization. I wonder if it is possible to do it more efficiently with Spark 1.1 streaming + SQL, e.g., by registering the RDDs into a hive context so that the data is loaded directly into the hive table in cache and meanwhile visible to jdbc/odbc clients. In the spark source code, the method registerTempTable you mentioned works on SqlContext instead of HiveContext. Thanks, Du On 9/10/14, 1:21 PM, Denny Lee denny.g@gmail.com wrote: Actually, when registering the table, it is only available within the sc context you are running it in. For Spark 1.1, the method name is changed to registerTempTable to better reflect that. The Thrift server runs under a different process meaning that it cannot see any of the tables generated within the sc context. You would need to save the sc table into Hive and then the Thrift process would be able to see them. HTH! On Sep 10, 2014, at 13:08, alexandria1101 alexandria.shea...@gmail.com wrote: I used the hiveContext to register the tables and the tables are still not being found by the thrift server. Do I have to pass the hiveContext to JDBC somehow? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Table-not-found-using-jdbc-console-to-query-sparksql-hive-thriftserver-tp13840p13922.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
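To make the flow being discussed concrete, here is a minimal sketch of the "register and insert from a stream" pattern, assuming Spark 1.1's Scala API (HiveContext, createSchemaRDD, registerTempTable, insertInto) and a made-up Event case class, source, and table name. Note that insertInto still writes through the Hive table's storage, so by itself it does not remove the HDFS write being asked about.

```scala
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical record type for one element of the stream.
case class Event(id: Long, payload: String)

val ssc = new StreamingContext(sc, Seconds(60))
val hiveContext = new HiveContext(sc)

val events = ssc.socketTextStream("source-host", 9999) // placeholder source
  .map(line => Event(line.hashCode.toLong, line))

events.foreachRDD { rdd =>
  val batch = hiveContext.createSchemaRDD(rdd)
  // Temp table: queryable from this driver's context only (not from the Thrift server).
  batch.registerTempTable("current_batch")
  // Append the batch to an existing Hive table known to the metastore,
  // which is what jdbc/odbc clients will see.
  batch.insertInto("streamed_events")
}

ssc.start()
ssc.awaitTermination()
```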
RE: Announcing Spark 1.1.0!
Please correct me if I’m wrong, but I was under the impression, as per the maven repositories, that it was just to stay more in sync with the various versions of Hadoop. Looking at the latest documentation (https://spark.apache.org/docs/latest/building-with-maven.html), there are multiple Hadoop versions called out. As for the potential differences in Spark, this is more about ensuring the various jars and library dependencies of the correct version of Hadoop are included so there can be proper connectivity to Hadoop from Spark, vs. any differences in Spark itself. Another good reference on this topic is the callout for Hadoop versions within the GitHub repo: https://github.com/apache/spark HTH! On September 11, 2014 at 18:39:10, Haopu Wang (hw...@qilinsoft.com) wrote: Denny, thanks for the response. I raise the question because in Spark 1.0.2, I saw one binary package for hadoop2, but in Spark 1.1.0, there are separate packages for hadoop 2.3 and 2.4. That implies some difference in Spark depending on the Hadoop version.
RE: Announcing Spark 1.1.0!
The web page you pointed out (https://spark.apache.org/docs/latest/building-with-maven.html) says: “Because HDFS is not protocol-compatible across versions, if you want to read from HDFS, you’ll need to build Spark against the specific HDFS version in your environment.” Did you try to read a Hadoop 2.5.0 file using Spark 1.1 built against Hadoop 2.4? Thanks!
Applications status missing when Spark HA(zookeeper) enabled
Hi guys, I configured Spark with the following in spark-env.sh: export SPARK_DAEMON_JAVA_OPTS=-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=host1:2181,host2:2181,host3:2181 -Dspark.deploy.zookeeper.dir=/spark Then I started spark-shell against the active master host1: MASTER=spark://host1:7077,host2:7077 bin/spark-shell I ran stop-master.sh on host1 and then accessed the host2 web UI: the worker successfully registered to the new master host2, but neither the running application nor the completed applications show up. Did I miss anything when configuring Spark HA? Thanks!
RE: Announcing Spark 1.1.0!
Yes, at least for my query scenarios, I have been able to use Spark 1.1 with Hadoop 2.4 against Hadoop 2.5. Note, Hadoop 2.5 is considered a relatively minor release (http://hadoop.apache.org/releases.html#11+August%2C+2014%3A+Release+2.5.0+available) whereas Hadoop 2.4 and 2.3 were considered more significant releases.
RE: Announcing Spark 1.1.0!
Got it, thank you, Denny!
Re: DistCP - Spark-based
I've created SPARK-3499 https://issues.apache.org/jira/browse/SPARK-3499 to track creating a Spark-based distcp utility. Nick On Tue, Aug 12, 2014 at 4:20 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Good question; I don't know of one but I believe people at Cloudera had some thoughts of porting Sqoop to Spark in the future, and maybe they'd consider DistCP as part of this effort. I agree it's missing right now. Matei On August 12, 2014 at 11:04:28 AM, Gary Malouf (malouf.g...@gmail.com) wrote: We are probably still the minority, but our analytics platform based on Spark + HDFS does not have map/reduce installed. I'm wondering if there is a distcp equivalent that leverages Spark to do the work. Our team is trying to find the best way to do cross-datacenter replication of our HDFS data to minimize the impact of outages/dc failure.
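Until such a utility exists, a very rough sketch of the idea is below: list the source files on the driver, then have executors copy them in parallel with the Hadoop FileSystem API. The namenode addresses and paths are made up, the source directory is assumed to be flat, and error handling, permissions, and nested directories are ignored; this is only a sketch of the approach, not of whatever SPARK-3499 eventually becomes.

```scala
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val src = "hdfs://dc1-nn:8020/data/events" // source directory (made up)
val dst = "hdfs://dc2-nn:8020/data/events" // destination directory (made up)

// Enumerate the files once on the driver.
val files = FileSystem.get(new URI(src), new Configuration())
  .listStatus(new Path(src))
  .filter(_.isFile)
  .map(_.getPath.toString)

// One task per file; each task copies its file with the Hadoop FileSystem API.
sc.parallelize(files.toSeq, math.max(files.length, 1)).foreach { file =>
  // Hadoop Configuration is not serializable, so build one per task.
  val conf = new Configuration()
  val srcPath = new Path(file)
  val dstPath = new Path(dst, srcPath.getName)
  FileUtil.copy(srcPath.getFileSystem(conf), srcPath,
    dstPath.getFileSystem(conf), dstPath,
    false /* deleteSource */, conf)
}
```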
Re: Spark SQL JDBC
When you re-ran sbt, did you clear out the packages first and ensure that the datanucleus jars were generated within lib_managed? I remember having to do that when I was testing out different configs. On Thu, Sep 11, 2014 at 10:50 AM, alexandria1101 alexandria.shea...@gmail.com wrote: Even when I comment out those 3 lines, I still get the same error. Did someone solve this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-JDBC-tp11369p13992.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Spark SQL Thrift JDBC server deployment for production
Could you provide some context about running this in yarn-cluster mode? The Thrift server that's included within Spark 1.1 is based on Hive 0.12. Hive has been able to work against YARN since Hive 0.10. So when you start the thrift server, provided you copied the hive-site.xml over to the Spark conf folder, it should be able to connect to the same Hive metastore and then execute Hive against your YARN cluster. On Wed, Sep 10, 2014 at 11:55 PM, vasiliy zadonsk...@gmail.com wrote: Hi, I have a question about the Spark SQL Thrift JDBC server. Is there a best practice for Spark SQL deployment? If I understand right, the script ./sbin/start-thriftserver.sh starts the Thrift JDBC server in local mode. Are there script options for running this server in yarn-cluster mode? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Thrift-JDBC-server-deployment-for-production-tp13947.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org
Re: Announcing Spark 1.1.0!
Thanks for all the good work. Very excited about seeing more features and better stability in the framework.
Re: Announcing Spark 1.1.0!
Thanks to everyone who contributed to implementing and testing this release! Matei
Re: compiling spark source code
I have been doing that, but none of my modifications to the code are being compiled in. On Thu, Sep 11, 2014 at 10:45 PM, Daniil Osipov daniil.osi...@shazam.com wrote: In the spark source folder, execute `sbt/sbt assembly` On Thu, Sep 11, 2014 at 8:27 AM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, Can someone please tell me how to compile the Spark source code so that my changes to the source take effect? I tried shipping the jars to all the slaves, but in vain. -Karthik
RE: Spark SQL JDBC
I copied the 3 datanucleus jars (datanucleus-api-jdo-3.2.1.jar, datanucleus-core-3.2.2.jar, datanucleus-rdbms-3.2.1.jar) to the lib/ folder manually, and it works for me.
Re: Table not found: using jdbc console to query sparksql hive thriftserver
SchemaRDD has a method insertInto(table). When the table is partitioned, it would be more sensible and convenient to extend it with a list of partition keys and values.
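Until insertInto learns about partitions, the closest workaround seems to be routing the insert through HiveQL with a static partition spec. A minimal sketch, assuming Spark 1.1's HiveContext (whose sql() accepts HiveQL against the bundled Hive 0.12, which is worth verifying for this INSERT form), a hypothetical Rec case class, and an existing Hive table events_by_day partitioned by dt:

```scala
import org.apache.spark.sql.hive.HiveContext

// Hypothetical record type; events_by_day is assumed to exist and be partitioned by dt.
case class Rec(id: Long, payload: String)

val hiveContext = new HiveContext(sc)

val batch = sc.parallelize(Seq(Rec(1L, "a"), Rec(2L, "b")))
hiveContext.createSchemaRDD(batch).registerTempTable("one_batch")

// insertInto("events_by_day") can only append to the table as a whole; a static
// partition has to be spelled out in HiveQL instead.
hiveContext.sql(
  "INSERT INTO TABLE events_by_day PARTITION (dt = '2014-09-12') " +
  "SELECT id, payload FROM one_batch")
```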