Re: Remote jar file

2014-12-11 Thread rahulkumar-aws
Put the jar file in HDFS; the URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes. - Software Developer SigmoidAnalytics, Bangalore

Re: custom spark app name in yarn-cluster mode

2014-12-11 Thread Sandy Ryza
Hi Tomer, In yarn-cluster mode, the application has already been submitted to YARN by the time the SparkContext is created, so it's too late to set the app name there. I believe passing it with the --name option to spark-submit should work. -Sandy On Thu, Dec 11, 2014 at 10:28 AM, Tomer Benyam
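A minimal sketch of that suggestion (the class name, jar, and app name below are placeholders):

  spark-submit \
    --master yarn-cluster \
    --name myCustomName \
    --class com.example.MyApp \
    my-app.jar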

Re: KryoSerializer exception in Spark Streaming JAVA

2014-12-11 Thread Tathagata Das
Also, please make sure you are specifying the fully qualified name of the registrator class correctly in the SparkConf configuration. On Dec 11, 2014 5:57 PM, "bonnahu" wrote: > class MyRegistrator implements KryoRegistrator { > > public void registerClasses(Kryo kryo) { > kryo
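A minimal sketch of the configuration TD is referring to (the registrator class name here is hypothetical):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("kryo-example")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // the fully qualified name of the registrator class
    .set("spark.kryo.registrator", "com.example.MyRegistrator")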

Re: Mllib Error

2014-12-11 Thread MEETHU MATHEW
Hi, Try this. Change spark-mllib to spark-mllib_2.10: libraryDependencies ++= Seq( "org.apache.spark" % "spark-core_2.10" % "1.1.1", "org.apache.spark" % "spark-mllib_2.10" % "1.1.1" ) Thanks & Regards, Meethu M On Friday, 12 December 2014 12:22 PM, amin mohebbi wrote: I'm trying to bu

Mllib Error

2014-12-11 Thread amin mohebbi
I'm trying to build a very simple Scala standalone app using MLlib, but I get the following error when trying to build the program: "Object Mllib is not a member of package org.apache.spark". Then, I realized that I have to add MLlib as a dependency as follows: libraryDependencies ++= Seq( "org.apa

Adding a column to a SchemaRDD

2014-12-11 Thread Nathan Kronenfeld
Hi, there. I'm trying to understand how to augment data in a SchemaRDD. I can see how to do it if I can express the added values in SQL - just run "SELECT *, valueCalculation AS newColumnName FROM table". I've been searching all over for how to do this if my added value is a Scala function, with no
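One way to do this against the Spark 1.1/1.2 SQLContext API is to register the Scala function as a SQL UDF and call it in the SELECT; a hedged sketch (the table, column, and function names are made up, and sc is assumed to exist):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  // register an ordinary Scala function as a SQL UDF
  sqlContext.registerFunction("toUpper", (s: String) => s.toUpperCase)
  // use it to add a derived column to a registered table
  val augmented = sqlContext.sql("SELECT *, toUpper(name) AS nameUpper FROM myTable")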

Using Spark at the U.S. Treasury

2014-12-11 Thread Max Funk
Kindly take a moment to look over this proposal to bring Spark into the U.S. Treasury: http://www.systemaccounting.org/sparking_the_data_driven_republic

GroupBy and nested Top on

2014-12-11 Thread sparkuser2014
I'm currently new to pyspark, thank you for your patience in advance - my current problem is the following: I have an RDD composed of the fields A, B, and count => result1 = rdd.map(lambda x: ((A, B), 1)).reduceByKey(lambda a, b: a + b) Then I wanted to group the results based on 'A', so I did t

Re: monitoring for spark standalone

2014-12-11 Thread Otis Gospodnetic
Hi Judy, SPM monitors Spark. Here are some screenshots: http://blog.sematext.com/2014/10/07/apache-spark-monitoring/ Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & Elasticsearch Support * http://sematext.com/ On Mon, Dec 8, 2014 at 2:35 AM, Judy Nash wro

Error on JavaSparkContext.stop()

2014-12-11 Thread 김태연
(Sorry if this mail is a duplicate, but it seems that my previous mail could not reach the mailing list.) Hi, When my spark program calls JavaSparkContext.stop(), the following errors occur. 14/12/11 16:24:19 INFO Main: sc.stop { 14/12/11 16:24:20 ERROR ConnectionManag

Re: what is the best way to implement mini batches?

2014-12-11 Thread Ilya Ganelin
Hi all. I've been working on a similar problem. One solution that is straightforward (if suboptimal) is to do the following: A.zipWithIndex().filter { case (_, i) => i >= range_start && i < range_end }. Lastly, just put that in a for loop. I've found that this approach scales very well. As Matei said another o
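A cleaned-up sketch of that pattern (the batch size and the per-batch work are placeholders):

  import scala.reflect.ClassTag
  import org.apache.spark.rdd.RDD

  def miniBatches[T: ClassTag](data: RDD[T], batchSize: Long): Unit = {
    val indexed = data.zipWithIndex().cache()   // pair each element with its index
    val n = indexed.count()
    var start = 0L
    while (start < n) {
      val end = start + batchSize
      val batch = indexed
        .filter { case (_, i) => i >= start && i < end }
        .map(_._1)
      // train on / process `batch` here
      start = end
    }
  }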

Re: Spark Streaming in Production

2014-12-11 Thread Tathagata Das
Spark Streaming takes care of restarting receivers if they fail. Regarding the fault-tolerance properties and deployment options, we made some improvements in the upcoming Spark 1.2. Here is a staged version of the Spark Streaming programming guide that you can read for the up-to-date explanation of

Re: what is the best way to implement mini batches?

2014-12-11 Thread Imran Rashid
Minor correction: I think you want iterator.grouped(10) for non-overlapping mini batches On Dec 11, 2014 1:37 PM, "Matei Zaharia" wrote: > You can just do mapPartitions on the whole RDD, and then called sliding() > on the iterator in each one to get a sliding window. One problem is that > you wi
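A hedged sketch of both variants (sc is assumed; the downstream processing is omitted):

  val rdd = sc.parallelize(1 to 100)

  // non-overlapping mini batches of 10, as Imran suggests
  val batches = rdd.mapPartitions(_.grouped(10))

  // overlapping windows of 10, as in Matei's sliding() suggestion
  val windows = rdd.mapPartitions(_.sliding(10))

Both stay within partition boundaries, which is the caveat Matei mentions.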

Re: KryoSerializer exception in Spark Streaming JAVA

2014-12-11 Thread bonnahu
class MyRegistrator implements KryoRegistrator { public void registerClasses(Kryo kryo) { kryo.register(ImpressionFactsValue.class); } } Change this class to public and give it a try.

Re: KryoRegistrator exception and Kryo class not found while compiling

2014-12-11 Thread bonnahu
Is the class com.dataken.spark.examples.MyRegistrator public? If not, change it to public and give it a try.

Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Andy Wagner
This is showing a factor of 200 between python and scala and 1400 when distributed. Is this really accurate? If not, what is the real performance difference expected on average between the 3 cases? On Thu, Dec 11, 2014 at 11:33 AM, Duy Huynh wrote: > just to give some reference point. with th

Re: Spark Server - How to implement

2014-12-11 Thread Marcelo Vanzin
Hi Manoj, I'm not aware of any public projects that do something like that, except for the Ooyala server which you say doesn't cover your needs. We've been playing with something like that inside Hive, though: On Thu, Dec 11, 2014 at 5:33 PM, Manoj Samel wrote: > Hi, > > If spark based services

Re: Spark Server - How to implement

2014-12-11 Thread Marcelo Vanzin
Oops, sorry, fat fingers. We've been playing with something like that inside Hive: https://github.com/apache/hive/tree/spark/spark-client That seems to have at least a few of the characteristics you're looking for; but it's a very young project, and at this moment we're not developing it as a pub

Spark Server - How to implement

2014-12-11 Thread Manoj Samel
Hi, If Spark-based services are to be exposed as a continuously available server, what are the options? * The API exposed to the client will be proprietary and fine grained (RPC style ..), not a job-level API * The client API need not be SQL, so the Thrift JDBC server does not seem to be an option .. but

Error on JavaSparkContext.stop()

2014-12-11 Thread Taeyun Kim
(Sorry if this mail is duplicate, but it seems that my previous mail could not reach the mailing list.) Hi, When my spark program calls JavaSparkContext.stop(), the following errors occur. 14/12/11 16:24:19 INFO Main: sc.stop { 14/12/11 16:24:20 ERROR Conne

Spark Streaming in Production

2014-12-11 Thread twizansk
Hi, I'm looking for resources and examples for the deployment of spark streaming in production. Specifically, I would like to know how high availability and fault tolerance of receivers is typically achieved. The workers are managed by the spark framework and are therefore fault tolerant out

Re: Specifying number of executors in Mesos

2014-12-11 Thread Andrew Ash
Gerard, Are you familiar with spark.deploy.spreadOut in Standalone mode? It sounds like you want the same thing in Mesos mode. On Thu, Dec 11, 2014 at 6:48 AM, Tathagata Das wrote: > Not that I am aware of. Spark will try to spread th

Job status from Python

2014-12-11 Thread Michael Nazario
In PySpark, is there a way to get the status of a job which is currently running? My use case is that I have a long running job that users may not know whether or not the job is still running. It would be nice to have an idea of whether or not the job is progressing even if it isn't very granula

Running spark-submit from a remote machine using a YARN application

2014-12-11 Thread ryaminal
We are trying to submit a Spark application from a Tomcat application running our business logic. The Tomcat app lives in a separate non-Hadoop cluster. We first were doing this by using the spark-yarn package to directly call Client#runApp(), but found that the API we were using in Spark is being m

Native library error when trying to use Spark with Snappy files

2014-12-11 Thread Rich Haase
I am running a Hadoop cluster with Spark on YARN. The cluster is running the CDH 5.2 distribution. When I try to run Spark jobs against Snappy-compressed files I receive the following error: java.lang.UnsatisfiedLinkError: org.apache.hadoop.util.NativeCodeLoader.buildSupportsSnappy()Z org

Re: RDDs being cleaned too fast

2014-12-11 Thread Ranga
I was having similar issues with my persistent RDDs. After some digging around, I noticed that the partitions were not balanced evenly across the available nodes. After a "repartition", the RDD was spread evenly across all available memory. Not sure if that is something that would help your use-cas

Re: broadcast: OutOfMemoryError

2014-12-11 Thread Sameer Farooqui
Is the OOM happening to the Driver JVM or one of the Executor JVMs? What memory size is each JVM? How large is the data you're trying to broadcast? If it's large enough, you may want to consider just persisting the data to distributed storage (like HDFS) and read it in through the normal read RDD

Re: Key not valid / already cancelled using Spark Streaming

2014-12-11 Thread Tathagata Das
Aah yes, that makes sense. You could write first to HDFS, and when that works, copy from HDFS to S3. That should work as it won't depend on the temporary files being in S3. I am not sure how much you can customize just for S3 in Spark code. In Spark, since we just use the Hadoop API to write, there isn't

Re: Error: Spark-streaming to Cassandra

2014-12-11 Thread Tathagata Das
These seem to be compilation errors. The second one seems to be that you are using CassandraJavaUtil.javafunctions incorrectly. Look at the documentation and set the parameter list correctly. TD On Mon, Dec 8, 2014 at 9:47 AM, wrote: > Hi, > > I am intending to save the streaming data from kafka int

Re: Does filter on an RDD scan every data item ?

2014-12-11 Thread dsiegel
Also, you may want to use .lookup() instead of .filter() def lookup(key: K): Seq[V] Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to. You might want to partition your first
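A sketch of that pattern with a made-up pair RDD (sc is assumed):

  import org.apache.spark.HashPartitioner
  import org.apache.spark.SparkContext._

  val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    .partitionBy(new HashPartitioner(8))   // gives the RDD a known partitioner
    .cache()

  // only the partition that key 2 hashes to is searched
  val values: Seq[String] = pairs.lookup(2)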

Re: Is there an efficient way to append new data to a registered Spark SQL Table?

2014-12-11 Thread Rakesh Nair
TD, While looking at the API reference (version 1.1.0) for SchemaRDD, I did find these two methods: def insertInto(tableName: String): Unit def insertInto(tableName: String, overwrite: Boolean): Unit Wouldn't these be a nicer way of appending RDDs to a table, or are these not recommended as of now? Also

Re: what is the best way to implement mini batches?

2014-12-11 Thread Duy Huynh
the dataset i'm working on has about 100,000 records. the batch that we're training on has a size around 10. can you repartition(10,000) into 10,000 partitions? On Thu, Dec 11, 2014 at 2:36 PM, Matei Zaharia wrote: > You can just do mapPartitions on the whole RDD, and then called sliding() > o

Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Sean Owen
In general, you would not expect a distributed computation framework to perform nearly as fast as a non-distributed one, when both are run on one machine. Spark has so much more overhead that doesn't go away just because it's on one machine. Of course, that's the very reason it scales past one mach

Proper way to check SparkContext's status within code

2014-12-11 Thread Edwin
Hi, Is there a way to check the status of the SparkContext regarding whether it's alive or not through the code, not through UI or anything else? Thanks Edwin

Re: what is the best way to implement mini batches?

2014-12-11 Thread Matei Zaharia
You can just do mapPartitions on the whole RDD, and then call sliding() on the iterator in each one to get a sliding window. One problem is that you will not be able to slide "forward" into the next partition at partition boundaries. If this matters to you, you need to do something more compli

Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Duy Huynh
just to give some reference point, with the same algorithm running on the mnist dataset: 1. python implementation: ~10 milliseconds per iteration (can be faster if i switch to gpu) 2. local version (scala + breeze): ~2 seconds per iteration 3. distributed version (spark + scala + breeze): 15 s

Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Duy Huynh
both. first, the distributed version is so much slower than python. i tried a few things like broadcasting variables, replacing Seq with Array, and a few other little things. it helps to improve the performance, but still slower than the python code. so, i wrote a local version that's pretty mu

Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Natu Lauchande
Are you using Scala in a distributed environment or in standalone mode? Natu On Thu, Dec 11, 2014 at 8:23 PM, ll wrote: > hi.. i'm converting some of my machine learning python code into scala + > spark. i haven't been able to run it on large dataset yet, but on small > datasets (like http:/

Re: Spark streaming: missing classes when kafka consumer classes

2014-12-11 Thread Flávio Santos
Hi Mario, Try to include this to your libraryDependencies (in your sbt file): "org.apache.kafka" % "kafka_2.10" % "0.8.0" exclude("javax.jms", "jms") exclude("com.sun.jdmk", "jmxtools") exclude("com.sun.jmx", "jmxri") exclude("org.slf4j", "slf4j-simple") Regards, *--Flávio R.

Re: RDD.aggregate?

2014-12-11 Thread Gerard Maas
There's some explanation and an example here: http://stackoverflow.com/questions/26611471/spark-data-processing-with-grouping/26612246#26612246 -kr, Gerard. On Thu, Dec 11, 2014 at 7:15 PM, ll wrote: > any explaination on how aggregate works would be much appreciated. i > already > looked at t
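A small worked example of aggregate: seqOp folds each element into a partition-local accumulator, and combOp merges the accumulators from different partitions. Here the accumulator is a (sum, count) pair (sc is assumed):

  val rdd = sc.parallelize(1 to 100)

  val (sum, count) = rdd.aggregate((0, 0))(
    // seqOp: add one element to the partition-local (sum, count)
    (acc, x) => (acc._1 + x, acc._2 + 1),
    // combOp: merge two partition accumulators
    (a, b) => (a._1 + b._1, a._2 + b._2)
  )
  // sum = 5050, count = 100; the mean is sum.toDouble / count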

Re: custom spark app name in yarn-cluster mode

2014-12-11 Thread Tomer Benyamini
On Thu, Dec 11, 2014 at 8:27 PM, Tomer Benyamini wrote: > Hi, > > I'm trying to set a custom spark app name when running a java spark app in > yarn-cluster mode. > > SparkConf sparkConf = new SparkConf(); > > sparkConf.setMaster(System.getProperty("spark.master")); > > sparkConf.setAppName("my

custom spark app name in yarn-cluster mode

2014-12-11 Thread Tomer Benyamini
Hi, I'm trying to set a custom spark app name when running a java spark app in yarn-cluster mode. SparkConf sparkConf = new SparkConf(); sparkConf.setMaster(System.getProperty("spark.master")); sparkConf.setAppName("myCustomName"); sparkConf.set("spark.logConf", "true"); JavaSparkContext

why is spark + scala code so slow, compared to python?

2014-12-11 Thread ll
hi.. i'm converting some of my machine learning python code into scala + spark. i haven't been able to run it on large dataset yet, but on small datasets (like http://yann.lecun.com/exdb/mnist/), my spark + scala code is much slower than my python code (5 to 10 times slower than python) i already

Re: what is the best way to implement mini batches?

2014-12-11 Thread ll
any advice/comment on this would be much appreciated.

Re: RDD.aggregate?

2014-12-11 Thread ll
any explanation on how aggregate works would be much appreciated. i already looked at the spark example and am still confused about the seqop and combop... thanks.

broadcast: OutOfMemoryError

2014-12-11 Thread ll
hi. i'm running into this OutOfMemory issue when i'm broadcasting a large array. what is the best way to handle this? should i split the array into smaller arrays before broadcasting, and then combine them locally at each node? thanks!
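If splitting does turn out to be the route taken, a hedged sketch of that idea; the names bigArray, rdd, and doSomething are hypothetical stand-ins, the chunk size is arbitrary, and whether this helps depends on where the OOM actually happens (the question Sameer asks earlier in this thread):

  val chunks = bigArray.grouped(100000).toVector        // split the large array eagerly
  val bcChunks = chunks.map(chunk => sc.broadcast(chunk))

  val result = rdd.mapPartitions { iter =>
    // reassemble once per partition on the executor
    val localArray = bcChunks.flatMap(_.value).toArray
    iter.map(x => doSomething(x, localArray))
  }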

Different Vertex Ids in Graph and Edges

2014-12-11 Thread th0rsten
Hello all, I'm using GraphX (1.1.0) to process RDF data. I want to build a graph out of the data from the Berlin Benchmark (BSBM). The steps that I'm doing to load the data into a graph are: 1. Split the RDF triples

Exception using amazonaws library

2014-12-11 Thread Albert Manyà
Hi, I've made a simple script in Scala that, after doing a Spark SQL query, sends the result to AWS CloudWatch. I've tested both parts individually (the Spark SQL one and the CloudWatch one) and they worked fine. The trouble comes when I execute the script through spark-submit, which gives me th

Re: Spark streaming: missing classes when kafka consumer classes

2014-12-11 Thread Akhil Das
Last time i did an sbt assembly and this is how i added the dependencies. libraryDependencies ++= Seq( ("org.apache.spark" % "spark-streaming_2.10" % "1.1.0" % "provided"). exclude("org.eclipse.jetty.orbit", "javax.transaction"). exclude("org.eclipse.jetty.orbit", "javax.mail"). exc

RE: "Session" for connections?

2014-12-11 Thread Ashic Mahtab
Yeah, using that currently. Doing: dstream.foreachRDD(x => x.foreachPartition(y => cassandraConnector.withSessionDo(session =>{ val myHelper = MyHelper(session) y.foreach(m =>{ processMessage(m, myHelper) }) })) Is there a better approach? From: gerar

Newbie Question

2014-12-11 Thread Fernando O.
Hi guys, I'm planning to use Spark on a project and I'm facing a problem: I couldn't find a log that explains what's wrong with what I'm doing. I have 2 VMs that run a small Hadoop (2.6.0) cluster. I added a file that has 50 lines of JSON data. Compiled Spark, all tests passed, I run some si

Re: Error outputing to CSV file

2014-12-11 Thread Muhammad Ahsan
Hi, saveAsTextFile is a member of RDD, whereas fields.map(_.mkString("|")).mkString("\n") is a String. You have to transform it into an RDD using something like sc.parallelize(...) before calling saveAsTextFile. Thanks
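A hedged sketch of that suggestion (fields is whatever local collection the original thread builds; the output path is a placeholder):

  val lines = fields.map(_.mkString("|"))          // one pipe-delimited string per record
  sc.parallelize(lines).saveAsTextFile("hdfs:///tmp/output-csv")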

Re: parquet file not loading (spark v 1.1.0)

2014-12-11 Thread Muhammad Ahsan
Hi It worked for me like this. Just define the case class outside of any class to write to parquet format successfully. I am using Spark version 1.1.1. case class person(id: Int, name: String, fathername: String, officeid: Int) object Program { def main (args: Array[String]) { val co
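A condensed sketch of that pattern against the Spark 1.1 API (field names mirror the snippet above; the output path is a placeholder); the case class sits at the top level, outside any enclosing class or object:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  case class Person(id: Int, name: String, fathername: String, officeid: Int)

  object Program {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("parquet-example"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.createSchemaRDD   // implicitly turns RDD[Person] into a SchemaRDD

      val people = sc.parallelize(Seq(Person(1, "a", "b", 10)))
      people.saveAsParquetFile("/tmp/people.parquet")
    }
  }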

Re: Spark-SQL JDBC driver

2014-12-11 Thread Denny Lee
Yes, that is correct. A quick reference on this is the post https://www.linkedin.com/pulse/20141007143323-732459-an-absolutely-unofficial-way-to-connect-tableau-to-sparksql-spark-1-1?_mSplash=1 with the pertinent section being: It is important to note that when you create Spark tables (for example

Re: Key not valid / already cancelled using Spark Streaming

2014-12-11 Thread Flávio Santos
Hello guys, Thank you for your prompt reply. I followed Akhil's suggestion with no success. Then, I tried again replacing S3 with HDFS and the job seems to work properly. TD, I'm not using speculative execution. I think I've just realized what is happening. Due to S3 eventual consistency, these tempo

Re: Specifying number of executors in Mesos

2014-12-11 Thread Tathagata Das
Not that I am aware of. Spark will try to spread the tasks evenly across executors; it's not aware of the workers at all. So if the executor-to-worker allocation is uneven, I am not sure what can be done. Maybe others can get some ideas. On Tue, Dec 9, 2014 at 6:20 AM, Gerard Maas wrote: > Hi, >

Re: Is there an efficient way to append new data to a registered Spark SQL Table?

2014-12-11 Thread Tathagata Das
First of all, how long do you want to keep doing this? The data is going to increase infinitely and without any bounds; it's going to get too big for any cluster to handle. If all that is within bounds, then try the following. - Maintain a global variable having the current RDD storing all the log

Re: Locking for shared RDDs

2014-12-11 Thread Tathagata Das
Aditya, I think you have the mental model of Spark Streaming a little off the mark. Unlike traditional streaming systems, where any kind of state is mutable, Spark Streaming is designed on Spark's immutable RDDs. Streaming data is received and divided into immutable blocks, which then form immutable RDDs,

Re: Trouble with cache() and parquet

2014-12-11 Thread Yana Kadiyska
I see -- they are the same in design but the difference comes from partitioned Hive tables: when the RDD is generated by querying an external Hive metastore, the partition is appended as part of the row, and shows up as part of the schema. Can you shed some light on why this is a problem: last2Ho

Re: "Session" for connections?

2014-12-11 Thread Gerard Maas
>I'm doing the same thing for using Cassandra, For Cassandra, use the Spark-Cassandra connector [1], which does the Session management, as described by TD, for you. [1] https://github.com/datastax/spark-cassandra-connector -kr, Gerard. On Thu, Dec 11, 2014 at 1:55 PM, Ashic Mahtab wrote: > Th

Re: "Session" for connections?

2014-12-11 Thread Tathagata Das
Also, this is covered in the streaming programming guide in bits and pieces. http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On Thu, Dec 11, 2014 at 4:55 AM, Ashic Mahtab wrote: > That makes sense. I'll try that. > > Thanks :) > >> From: t

Re: Spark streaming: missing classes when kafka consumer classes

2014-12-11 Thread Mario Pastorelli
Thanks akhil for the answer. I am using sbt assembly and the build.sbt is in the first email. Do you know why those classes are included in that way? Thanks, Mario On 11.12.2014 14:51, Akhil Das wrote: Yes. You can do/use *sbt assembly* and create a big fat jar with all dependencies bundled

Re: Spark streaming: missing classes when kafka consumer classes

2014-12-11 Thread Akhil Das
Yes. You can do/use *sbt assembly* and create a big fat jar with all dependencies bundled inside it. Thanks Best Regards On Thu, Dec 11, 2014 at 7:10 PM, Mario Pastorelli < mario.pastore...@teralytics.ch> wrote: > In this way it works but it's not portable and the idea of having a fat > jar is

Re: Spark streaming: missing classes when kafka consumer classes

2014-12-11 Thread Mario Pastorelli
In this way it works but it's not portable and the idea of having a fat jar is to avoid exactly this. Is there any system to create a self-contained portable fatJar? On 11.12.2014 13:57, Akhil Das wrote: Add these jars while creating the Context. val sc = new SparkContext(conf) sc.add

Standalone app: IOException due to broadcast.destroy()

2014-12-11 Thread Alberto Garcia
Hello. I'm pretty new to Spark. I am developing a Spark application, conducting tests locally prior to deploying it on a cluster. I have a problem with a broadcast variable. The application raises "Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Ta

Re: Can spark job have side effects (write files to FileSystem)

2014-12-11 Thread Daniel Darabos
Yes, this is perfectly "legal". This is what RDD.foreach() is for! You may be encountering an IO exception while writing, and maybe using() suppresses it. (?) I'd try writing the files with java.nio.file.Files.write() -- I'd expect there is less that can go wrong with that simple call. On Thu, Dec
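A hedged sketch of that suggestion, writing one local file per partition rather than one per line (the path is a placeholder, and the files land on the local filesystem of whichever executor runs each task):

  import java.nio.charset.StandardCharsets
  import java.nio.file.{Files, Paths}

  val lines = sc.parallelize(1 to 100).map(n => s"this is line $n")

  val written = lines.mapPartitionsWithIndex { (idx, it) =>
    val path = Paths.get(s"/tmp/lines-part-$idx.txt")
    Files.write(path, it.mkString("\n").getBytes(StandardCharsets.UTF_8))
    Iterator.single(path.toString)
  }.collect()                                      // an action forces the writes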

Spark SQL Vs CQL performance on Cassandra

2014-12-11 Thread Ajay
Hi, To test Spark SQL Vs CQL performance on Cassandra, I did the following: 1) Cassandra standalone server (1 server in a cluster) 2) Spark Master and 1 Worker Both running in a Thinkpad laptop with 4 cores and 8GB RAM. 3) Written Spark SQL code using Cassandra-Spark Driver from Cassandra (JavaAp

Re: KafkaUtils explicit acks

2014-12-11 Thread Tathagata Das
I am updating the docs right now. Here is a staged copy that you can have a sneak peek of. This will be part of Spark 1.2. http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html The updated fault-tolerance section tries to simplify the explanation of when and what data c

Re: Spark streaming: missing classes when kafka consumer classes

2014-12-11 Thread Akhil Das
Add these jars while creating the Context. val sc = new SparkContext(conf) sc.addJar("/home/akhld/.ivy2/cache/org.apache.spark/spark-streaming-kafka_2.10/jars/ *spark-streaming-kafka_2.10-1.1.0.jar*") sc.addJar("/home/akhld/.ivy2/cache/com.101tec/zkclient/jars/ *zkclient-0.3.jar*"

RE: "Session" for connections?

2014-12-11 Thread Ashic Mahtab
That makes sense. I'll try that. Thanks :) > From: tathagata.das1...@gmail.com > Date: Thu, 11 Dec 2014 04:53:01 -0800 > Subject: Re: "Session" for connections? > To: as...@live.com > CC: user@spark.apache.org > > You could create a lazily initialized singleton factory and connection > pool. Whe

Re: "Session" for connections?

2014-12-11 Thread Tathagata Das
You could create a lazily initialized singleton factory and connection pool. Whenever an executor starts running the first task that needs to push out data, it will create the connection pool as a singleton. And subsequent tasks running on the executor are going to use the connection pool. You will a
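A hedged sketch of that pattern; the Connection class below is just an illustrative stand-in for a real connection or pool (RabbitMQ, JDBC, ...):

  import org.apache.spark.streaming.dstream.DStream

  // stand-in for a real connection type; purely illustrative
  class Connection {
    def send(msg: String): Unit = println(msg)
  }

  // lazily initialized once per executor JVM, by the first task that needs it
  object ConnectionPool {
    lazy val connection: Connection = new Connection
  }

  def pushOut(dstream: DStream[String]): Unit =
    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { events =>
        val conn = ConnectionPool.connection   // reused by every task on this executor
        events.foreach(conn.send)
      }
    }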

Spark streaming: missing classes when kafka consumer classes

2014-12-11 Thread Mario Pastorelli
Hi, I'm trying to use spark-streaming with kafka but I get a strange error about classes that are missing. I would like to ask if my way of building the fat jar is correct or not. My program is val kafkaStream = KafkaUtils.createStream(ssc, zookeeperQuorum, kafkaGroupId, kafkaTopicsWithThreads)

Re: ERROR YarnClientClusterScheduler: Lost executor Akka client disassociated

2014-12-11 Thread Muhammad Ahsan
-- Code -- scala> import org.apache.spark.SparkContext._ import org.apache.spark.SparkContext._ scala> import org.apache.spark.rdd.RDD import org.apache.spark.rdd.RDD scala> import org.apache.spark.sql.SchemaR

"Session" for connections?

2014-12-11 Thread Ashic Mahtab
Hi, I was wondering if there's any way of having long running session type behaviour in spark. For example, let's say we're using Spark Streaming to listen to a stream of events. Upon receiving an event, we process it, and if certain conditions are met, we wish to send a message to rabbitmq. Now

Re: Standalone spark cluster. Can't submit job programmatically -> java.io.InvalidClassException

2014-12-11 Thread sivarani
Not able to get it, how exactly did you fix it? I am using a Maven build. I downloaded Spark 1.1.1 and then packaged it with mvn -Dhadoop.version=1.2.1 -DskipTests clean package, but I keep getting invalid class exceptions.

Re: Spark streaming: work with collect() but not without collect()

2014-12-11 Thread Tathagata Das
What does process do? Maybe when this process function is being run in the Spark executor, it is causing some static initialization, which fails, causing this exception. Per the Oracle documentation, an ExceptionInInitializerError is thrown to indicate that an exception occurred during evaluation of

RE: Spark-SQL JDBC driver

2014-12-11 Thread Anas Mosaad
Actually I came to the conclusion that RDDs have to be persisted in Hive in order to be accessible through Thrift. Hope I didn't end up with an incorrect conclusion. Please, someone correct me if I am wrong. On Dec 11, 2014 8:53 AM, "Judy Nash" wrote: > Looks like you are wondering why you cannot s

Re: Key not valid / already cancelled using Spark Streaming

2014-12-11 Thread Tathagata Das
Following Gerard's thoughts, here are possible things that could be happening. 1. Is there another process in the background that is deleting files in the directory where you are trying to write? It seems like the temporary file generated by one of the tasks is getting deleted before it is renamed to

Re: SchemaRDD partition on specific column values?

2014-12-11 Thread nitin
Can we take this as a performance improvement task in Spark 1.2.1? I can help contribute to this.

Can spark job have side effects (write files to FileSystem)

2014-12-11 Thread Paweł Szulc
Imagine a simple Spark job that will store each line of the RDD to a separate file: val lines = sc.parallelize(1 to 100).map(n => s"this is line $n") lines.foreach(line => writeToFile(line)) def writeToFile(line: String) = { def filePath = "file://..." val file = new File(new URI(path).get

spark logging issue

2014-12-11 Thread Sourav Chandra
Hi, I am using Spark 1.1.0 and setting the below properties while creating the Spark context: spark.executor.logs.rolling.maxRetainedFiles = 10 spark.executor.logs.rolling.size.maxBytes = 104857600 spark.executor.logs.rolling.strategy = size Even though I am setting it to roll over after 100 MB, the

Error on JavaSparkContext.stop()

2014-12-11 Thread Taeyun Kim
Hi, When my spark program calls JavaSparkContext.stop(), the following errors occur. 14/12/11 16:24:19 INFO Main: sc.stop { 14/12/11 16:24:20 ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(cluster02,38918) not found

Re: Key not valid / already cancelled using Spark Streaming

2014-12-11 Thread Gerard Maas
If the timestamps in the logs are to be trusted, it looks like your driver is dying with that java.io.FileNotFoundException, and therefore the workers lose their connection and close down. -kr, Gerard. On Thu, Dec 11, 2014 at 7:39 AM, Akhil Das wrote: > Try to add the following to the sparkCo

Re: Spark streaming: work with collect() but not without collect()

2014-12-11 Thread Gerard Maas
Have you tried kafkaStream.foreachRDD(rdd => { rdd.foreach(...) })? Would that make a difference? On Thu, Dec 11, 2014 at 10:24 AM, david wrote: > Hi, > > We use the following Spark Streaming code to collect and process Kafka > events: > > kafkaStream.foreachRDD(rdd => { > rdd.c

Spark streaming: work with collect() but not without collect()

2014-12-11 Thread david
Hi, We use the following Spark Streaming code to collect and process Kafka events: kafkaStream.foreachRDD(rdd => { rdd.collect().foreach(event => { process(event._1, event._2) }) }) This works fine. But without the collect() function, the following exception is rais

Re: Decision Tree with libsvmtools datasets

2014-12-11 Thread Sean Owen
The implementation assumes classes are 0-indexed, not 1-indexed. You should set numClasses = 3 and change your labels to 0, 1, 2. On Thu, Dec 11, 2014 at 3:40 AM, Ge, Yao (Y.) wrote: > I am testing decision tree using iris.scale data set > (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/m
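A hedged sketch of that fix against the MLlib 1.1 API (the input path is a placeholder, and sc is assumed):

  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.mllib.tree.DecisionTree
  import org.apache.spark.mllib.util.MLUtils

  val raw = MLUtils.loadLibSVMFile(sc, "data/iris.scale")            // labels 1.0, 2.0, 3.0
  val data = raw.map(lp => LabeledPoint(lp.label - 1, lp.features))  // shift to 0.0, 1.0, 2.0

  val numClasses = 3
  val categoricalFeaturesInfo = Map[Int, Int]()
  val model = DecisionTree.trainClassifier(
    data, numClasses, categoricalFeaturesInfo, "gini", 5, 32)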

Re: Compare performance of sqlContext.jsonFile and sqlContext.jsonRDD

2014-12-11 Thread Cheng Lian
There are several overloaded versions of both jsonFile and jsonRDD. Schema inference is kind of expensive since it requires an extra Spark job. You can avoid schema inference by storing the inferred schema and then using it together with the following two methods: * def jsonFile(path: String
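A hedged sketch of that approach (paths are placeholders; sc is assumed): infer the schema once on a small sample, then pass it explicitly so later loads skip the extra inference job.

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)

  // pay the inference cost once, on a small sample file
  val sample = sqlContext.jsonFile("data/sample.json")
  val schema = sample.schema

  // later loads reuse the stored schema and skip the inference job
  val full = sqlContext.jsonFile("data/full.json", schema)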