Standalone cluster on Windows

2014-07-09 Thread Chitturi Padma
Hi, I wanted to set up a standalone cluster on a Windows machine, but unfortunately the spark-master.cmd file is not available. Can someone suggest how to proceed, or is the spark-master.cmd file missing from spark-1.0.0?

Re: spark Driver

2014-07-09 Thread Akhil Das
Can you try setting SPARK_MASTER_IP in the spark-env.sh file? Thanks Best Regards On Wed, Jul 9, 2014 at 10:58 AM, amin mohebbi aminn_...@yahoo.com wrote: Hi all, I have one master and two slave node, I did not set any ip for spark driver because I thought it uses its default (

Re: spark Driver

2014-07-09 Thread Akhil Das
Can you also paste a little bit more stacktrace? Thanks Best Regards On Wed, Jul 9, 2014 at 12:05 PM, amin mohebbi aminn_...@yahoo.com wrote: I have the following in spark-env.sh SPARK_MASTER_IP=master SPARK_MASTER_port=7077 Best Regards

Re: Spark Installation

2014-07-09 Thread 田毅
Hi Srikrishna, the reason for this issue is that you uploaded the assembly jar to HDFS twice. Pasting your command would allow a better diagnosis. 田毅 (Tian Yi) === Orange Cloud Platform Product Line, Big Data Product Dept., AsiaInfo-Linkage (China) Co., Ltd. Mobile: 13910177261 Tel: 010-82166322 Fax: 010-82166617 QQ: 20057509

Re: spark Driver

2014-07-09 Thread amin mohebbi
This is exactly what I got: Spark Executor Command: java -cp ::/usr/local/spark-1.0.0/conf:/usr/local/spark-1.0.0/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.2.1.jar:/usr/local/hadoop/conf -XX:MaxPermSize=128m -Xms512M -Xmx512M

Need advice to create an objectfile of set of images from Spark

2014-07-09 Thread Jaonary Rabarisoa
Hi all, I need to run a Spark job that takes a set of images as input. I need something that loads these images as an RDD, but I just don't know how to do that. Does anyone have an idea? Cheers, Jao

Re: Requirements for Spark cluster

2014-07-09 Thread Akhil Das
You can use the spark-ec2/bdutil scripts to set it up on the AWS/GCE cloud quickly. If you want to set it up on your own then these are the things that you will need to do: 1. Make sure you have java (7) installed on all machines. 2. Install and configure spark (add all slave nodes in

How to clear the list of Completed Applications in Spark web UI?

2014-07-09 Thread Haopu Wang
Besides restarting the Master, is there any other way to clear the Completed Applications in Master web UI?

Re: Spark Streaming using File Stream in Java

2014-07-09 Thread Akhil Das
Try this out: JavaStreamingContext ssc = new JavaStreamingContext(...); JavaDStream<String> lines = ssc.fileStream("whatever"); JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() { public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); } });

Re: error when spark access hdfs with Kerberos enable

2014-07-09 Thread Sandy Ryza
Hi Cheney, I haven't heard of anybody deploying non-secure YARN on top of secure HDFS. It's conceivable that you might be able to get it to work, but my guess is that you'd run into some issues. Also, without authentication on in YARN, you could be leaving your HDFS tokens exposed, which others could

Re: Re: Pig 0.13, Spark, Spork

2014-07-09 Thread Akhil Das
Hi Bertrand, We've updated the document http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.9.0 This is our working Github repo https://github.com/sigmoidanalytics/spork/tree/spork-0.9 Feel free to open issues over here https://github.com/sigmoidanalytics/spork/issues

Re: How to clear the list of Completed Applications in Spark web UI?

2014-07-09 Thread Patrick Wendell
There isn't currently a way to do this, but it will start dropping older applications once more than 200 are stored. On Wed, Jul 9, 2014 at 4:04 PM, Haopu Wang hw...@qilinsoft.com wrote: Besides restarting the Master, is there any other way to clear the Completed Applications in Master web UI?

how to host the driver node

2014-07-09 Thread aminn_524
I have one master and two slave nodes, and I did not set any IP for the Spark driver. My question is: should I set an IP for the Spark driver, and can I host the driver inside the cluster on the master node? If so, how do I host it? Will it be hosted automatically on the node we submit the application by

How to host spark driver

2014-07-09 Thread amin mohebbi
I have one master and two slave nodes, and I did not set any IP for the Spark driver. My question is: should I set an IP for the Spark driver, and can I host the driver inside the cluster on the master node? If so, how do I host it? Will it be hosted automatically on the node we submit the application by

Re: Purpose of spark-submit?

2014-07-09 Thread Patrick Wendell
It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of

Re: Why doesn't the driver node do any work?

2014-07-09 Thread aminn_524
I have one master and two slave nodes, and I did not set any IP for the Spark driver. My question is: should I set an IP for the Spark driver, and can I host the driver inside the cluster on the master node? If so, how do I host it? Will it be hosted automatically on the node we submit the application by

Re: Comparative study

2014-07-09 Thread Sean Owen
On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote: Impala is *not* built on map/reduce, though it was built to replace Hive, which is map/reduce based. It has its own distributed query engine, though it does load data from HDFS, and is part of the hadoop ecosystem. Impala

Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-09 Thread Juan Rodríguez Hortalá
Hi Jerry, it's all clear to me now, I will try with something like Apache DBCP for the connection pool Thanks a lot for your help! 2014-07-09 3:08 GMT+02:00 Shao, Saisai saisai.s...@intel.com: Yes, that would be the Java equivalence to use static class member, but you should carefully
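A minimal sketch of the pooling pattern discussed in this thread, assuming Commons DBCP 1.x, placeholder JDBC settings, and an existing DStream called myDStream: the pool lives in a singleton object so each executor JVM creates it lazily instead of serializing it from the driver.

    import java.sql.Connection
    import org.apache.commons.dbcp.BasicDataSource

    object ConnectionPool {
      private lazy val dataSource: BasicDataSource = {
        val ds = new BasicDataSource()
        ds.setDriverClassName("com.mysql.jdbc.Driver")   // placeholder driver
        ds.setUrl("jdbc:mysql://db-host:3306/mydb")      // placeholder URL
        ds.setUsername("user")
        ds.setPassword("password")
        ds.setMaxActive(10)
        ds
      }
      def getConnection: Connection = dataSource.getConnection
    }

    myDStream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn = ConnectionPool.getConnection   // one connection per partition, taken from the pool
        try {
          records.foreach { r => /* write r using conn */ }
        } finally {
          conn.close()                            // returns the connection to the pool
        }
      }
    }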

Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-09 Thread Martin Gammelsæter
Thanks for your input, Koert and DB. Rebuilding with 9.x didn't seem to work. For now we've downgraded dropwizard to 0.6.2 which uses a compatible version of jetty. Not optimal, but it works for now. On Tue, Jul 8, 2014 at 7:04 PM, DB Tsai dbt...@dbtsai.com wrote: We're doing similar thing to

Re: issues with ./bin/spark-shell for standalone mode

2014-07-09 Thread Patrick Wendell
Hey Mikhail, I think (hope?) the -em and -dm options were never in an official Spark release. They were just in the master branch at some point. Did you use these during a previous Spark release or were you just on master? - Patrick On Wed, Jul 9, 2014 at 9:18 AM, Mikhail Strebkov

Initial job has not accepted any resources means many things

2014-07-09 Thread Martin Gammelsæter
It seems like the "Initial job has not accepted any resources" message shows up for a wide variety of different errors (for example the obvious one where you've requested more memory than is available), but also, for example, in the case where the worker nodes do not have the appropriate code on their class

RE: Kryo is slower, and the size saving is minimal

2014-07-09 Thread innowireless TaeYun Kim
Thank you for your response. Maybe that applies to my case. In my test case, The types of almost all of the data are either primitive types, joda DateTime, or String. But I'm somewhat disappointed with the speed. At least it should not be slower than Java default serializer... -Original
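For reference, a minimal sketch of how classes were registered with Kryo in the Spark 1.0 era; MyRecord is a placeholder application type, and in a real project the class name passed to spark.kryo.registrator must be fully qualified.

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    case class MyRecord(id: Long, name: String)        // placeholder application class

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[MyRecord])
        kryo.register(classOf[org.joda.time.DateTime]) // joda type mentioned above
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator")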

Re: TaskContext stageId = 0

2014-07-09 Thread silvermast
Oh well, never mind. The problem is that ResultTask's stageId is immutable and is used to construct the Task superclass. Anyway, my solution now is to use this.id for the rddId and to gather all rddIds using a spark listener on stage completed to clean up for any activity registered for those

How to run a job on all workers?

2014-07-09 Thread silvermast
Is it possible to run a job that assigns work to every worker in the system? My bootleg right now is to have a spark listener hear whenever a block manager is added and to increase a split count by 1. It runs a spark job with that split count and hopes that it will at least run on the newest

Error using MLlib-NaiveBayes : Matrices are not aligned

2014-07-09 Thread Rahul Bhojwani
I am using Naive Bayes in MLlib. Below I have printed the log of *model.theta* after training on the training data. You can check that it contains 9 features for 2-class classification. print numpy.log(model.theta) [[ 0.31618962 0.16636852 0.07200358 0.05411449 0.08542039 0.17620751 0.03711986

Filtering data during the read

2014-07-09 Thread Konstantin Kudryavtsev
Hi all, I wondered if you could help me to clarify the following situation: in the classic example val file = spark.textFile("hdfs://...") val errors = file.filter(line => line.contains("ERROR")) As I understand it, the data is first read into memory and after that the filtering is applied. Is there any way to

Re: Filtering data during the read

2014-07-09 Thread Mayur Rustagi
Hi, Spark does that out of the box for you :) It compresses down the execution steps as much as possible. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi On Wed, Jul 9, 2014 at 3:15 PM, Konstantin Kudryavtsev
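A minimal sketch of the laziness being described, assuming an existing SparkContext sc: textFile and filter only build a plan, and reading plus filtering happen line by line when an action runs.

    val file = sc.textFile("hdfs://...")                      // no I/O yet
    val errors = file.filter(line => line.contains("ERROR"))  // still no I/O
    val numErrors = errors.count()                            // read and filter happen here, line by line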

Docker Scripts

2014-07-09 Thread dmpour23
Hi, Regarding the docker scripts, I know I can change the base image easily, but is there any specific reason why the base image is hadoop_1.2.1? Why is this preferred to Hadoop 2 (HDP2, CDH5) distributions? Now that Amazon supports docker, could this replace the ec2-scripts? Kind regards Dimitri

FW: memory question

2014-07-09 Thread michael.lewis
Hi, Does anyone know if it is possible to call the MetadataCleaner on demand? i.e. rather than set spark.cleaner.ttl and have this run periodically, I'd like to run it on demand. The problem with periodic cleaning is that it can remove RDDs that we still require (some calcs are short, others very

Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$

2014-07-09 Thread gil...@gmail.com
Hello, While trying to run the example below I am getting errors. I have built Spark using the following command: $ SPARK_HADOOP_VERSION=2.4.0 SPARK_YARN=true SPARK_HIVE=true sbt/sbt clean assembly - Running the example using

Re: Purpose of spark-submit?

2014-07-09 Thread Koert Kuipers
not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch

Re: Purpose of spark-submit?

2014-07-09 Thread Surendranauth Hiraman
Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs

Re: Need advice to create an objectfile of set of images from Spark

2014-07-09 Thread Jaonary Rabarisoa
The idea is to run a job that uses images as input so that each worker will process a subset of the images. On Wed, Jul 9, 2014 at 2:30 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: RDD can only keep objects. How do you plan to encode these images so that they can be loaded? Keeping the whole
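A minimal sketch of one way to do this, assuming an existing SparkContext sc, that the listed image files are readable from every worker (e.g. a shared filesystem), and Java 7 for Files.readAllBytes; the paths are placeholders.

    import java.nio.file.{Files, Paths}

    val imagePaths = Seq("/data/images/a.png", "/data/images/b.png")   // placeholder paths
    val images = sc.parallelize(imagePaths).map { path =>
      val bytes = Files.readAllBytes(Paths.get(path))                  // read the raw image on the worker
      (path, bytes)
    }
    images.saveAsObjectFile("hdfs:///tmp/images-objectfile")           // matches the thread subject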

Re: Spark job tracker.

2014-07-09 Thread Mayur Rustagi
var sem = 0 sc.addSparkListener(new SparkListener { override def onTaskStart(taskStart: SparkListenerTaskStart) { sem += 1 } }) // sc = spark context Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi https://twitter.com/mayur_rustagi
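A fleshed-out version of the snippet above, as a minimal sketch assuming an existing SparkContext sc; an AtomicInteger is used here instead of a plain var so concurrent task events are counted safely.

    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskStart, SparkListenerTaskEnd}

    val started = new AtomicInteger(0)
    val finished = new AtomicInteger(0)

    sc.addSparkListener(new SparkListener {
      override def onTaskStart(taskStart: SparkListenerTaskStart) { started.incrementAndGet() }
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd) { finished.incrementAndGet() }
    })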

Cassandra driver Spark question

2014-07-09 Thread RodrigoB
Hi all, I am currently trying to save to Cassandra after some Spark Streaming computation. I call myDStream.foreachRDD so that I can collect each RDD in the driver app runtime, and inside I do something like this: myDStream.foreachRDD(rdd => { var someCol = Seq[MyType]() foreach(kv => {

how to convert JavaDStream&lt;String&gt; to JavaRDD&lt;String&gt;

2014-07-09 Thread Madabhattula Rajesh Kumar
Hi Team, Could you please help me to resolve the query below. My use case is: I'm using JavaStreamingContext to read text files from a Hadoop HDFS directory JavaDStream&lt;String&gt; lines_2 = ssc.textFileStream("hdfs://localhost:9000/user/rajesh/EventsDirectory/"); How to convert the JavaDStream&lt;String&gt; result

Re: Cassandra driver Spark question

2014-07-09 Thread Luis Ángel Vicente Sánchez
Is MyType serializable? Everything inside the foreachRDD closure has to be serializable. 2014-07-09 14:24 GMT+01:00 RodrigoB rodrigo.boav...@aspect.com: Hi all, I am currently trying to save to Cassandra after some Spark Streaming computation. I call a myDStream.foreachRDD so that I can

Re: Purpose of spark-submit?

2014-07-09 Thread Robert James
+1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a

Re: controlling the time in spark-streaming

2014-07-09 Thread Laeeq Ahmed
Hi, For QueueRDD, have a look here. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala   Regards, Laeeq On Friday, May 23, 2014 10:33 AM, Mayur Rustagi mayur.rust...@gmail.com wrote: Well its hard to use text data

RDD Cleanup

2014-07-09 Thread premdass
Hi, I am using Spark 1.0.0 with the Ooyala Job Server for a low-latency query system. Basically a long-running context is created, which enables running multiple jobs under the same context and hence sharing of the data. It was working fine in 0.9.1. However in the spark 1.0 release, the RDDs created

Re: how to convert JavaDStream&lt;String&gt; to JavaRDD&lt;String&gt;

2014-07-09 Thread Laeeq Ahmed
Hi, First use foreachRDD and then use collect, as in DStream.foreachRDD(rdd => { rdd.collect.foreach({ Also it's better to use Scala. Less verbose. Regards, Laeeq On Wednesday, July 9, 2014 3:29 PM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi Team, Could you please
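A minimal sketch of the completed pattern, assuming lines is an existing DStream[String]: each micro-batch arrives as an RDD inside foreachRDD, and collect() pulls that batch back to the driver.

    lines.foreachRDD { rdd =>
      rdd.collect().foreach { line =>
        println(line)   // process each line of the current batch on the driver
      }
    }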

Re: window analysis with Spark and Spark streaming

2014-07-09 Thread Laeeq Ahmed
Hi, For QueueRDD, have a look here. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/QueueStream.scala Regards, Laeeq, PhD candidate, KTH, Stockholm.   On Sunday, July 6, 2014 10:20 AM, alessandro finamore alessandro.finam...@polito.it

Re: Purpose of spark-submit?

2014-07-09 Thread Jerry Lam
+1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM,

Re: Cassandra driver Spark question

2014-07-09 Thread RodrigoB
Hi Luis, Yes, it's actually an output of the previous RDD. Have you ever used the Cassandra Spark Driver on the driver app? I believe these limitations go around that - it's designed to save RDDs from the nodes. tnks, Rod

Re: RDD Cleanup

2014-07-09 Thread Koert Kuipers
did you explicitly cache the rdd? we cache rdds and share them between jobs just fine within one context in spark 1.0.x. but we do not use the ooyala job server... On Wed, Jul 9, 2014 at 10:03 AM, premdass premdas...@yahoo.co.in wrote: Hi, I using spark 1.0.0 , using Ooyala Job Server, for

Re: Cassandra driver Spark question

2014-07-09 Thread Luis Ángel Vicente Sánchez
Yes, I'm using it to count concurrent users from a kafka stream of events without problems. I'm currently testing it using the local mode but any serialization problem would have already appeared so I don't expect any serialization issue when I deployed to my cluster. 2014-07-09 15:39 GMT+01:00

Re: Spark Streaming and Storm

2014-07-09 Thread Dan H.
Xichen_tju, I recently evaluated Storm for a period of months (using 2Us, 2.4GHz CPU, 24GB RAM with 3 servers) and was not able to achieve a realistic scale for my business domain needs. Storm is really only a framework, which allows you to put in code to do whatever it is you need for a

Re: RDD Cleanup

2014-07-09 Thread premdass
Hi, Yes, I am caching the RDDs by calling the cache method. May I ask how you are sharing RDDs across jobs in the same context? By the RDD name? I tried printing the RDDs of the Spark context, and when referenceTracking is enabled, I get an empty list after the clean up. Thanks, Prem --

Re: RDD Cleanup

2014-07-09 Thread Koert Kuipers
we simply hold on to the reference to the rdd after it has been cached. so we have a single Map[String, RDD[X]] for cached RDDs for the application On Wed, Jul 9, 2014 at 11:00 AM, premdass premdas...@yahoo.co.in wrote: Hi, Yes . I am caching the RDD's by calling cache method.. May i
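A minimal sketch of such a registry, assuming one long-lived SparkContext; callers look a dataset up by name, and it is built and cached only on the first request.

    import org.apache.spark.rdd.RDD
    import scala.collection.mutable

    object RddRegistry {
      private val cached = mutable.Map[String, RDD[_]]()

      def getOrCreate[T](name: String)(build: => RDD[T]): RDD[T] = synchronized {
        cached.getOrElseUpdate(name, build.cache()).asInstanceOf[RDD[T]]
      }
    }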

Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread John Omernik
I am trying to get my head around using Spark on Yarn from a perspective of a cluster. I can start a Spark Shell no issues in Yarn. Works easily. This is done in yarn-client mode and it all works well. In multiple examples, I see instances where people have setup Spark Clusters in Stand Alone

Re: Purpose of spark-submit?

2014-07-09 Thread Andrei
Another +1. For me it's a question of embedding. With SparkConf/SparkContext I can easily create larger projects with Spark as a separate service (just like MySQL and JDBC, for example). With spark-submit I'm bound to Spark as a main framework that defines what my application should look like.

Re: Purpose of spark-submit?

2014-07-09 Thread Sandy Ryza
Spark still supports the ability to submit jobs programmatically without shell scripts. Koert, The main reason that the unification can't be a part of SparkContext is that YARN and standalone support deploy modes where the driver runs in a managed process on the cluster. In this case, the

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread Ron Gonzalez
The idea behind YARN is that you can run different application types like MapReduce, Storm and Spark. I would recommend that you build your spark jobs in the main method without specifying how you deploy it. Then you can use spark-submit to tell Spark how you would want to deploy to it using

Mechanics of passing functions to Spark?

2014-07-09 Thread Seref Arikan
Greetings, The documentation at http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark says: Note that while it is also possible to pass a reference to a method in a class instance (as opposed to a singleton object), this requires sending the object that contains
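A minimal sketch of the difference the guide is describing; MyClass, the field, and the method names are illustrative.

    import org.apache.spark.rdd.RDD

    class MyClass {
      val prefix = ">> "
      def addPrefix(s: String): String = prefix + s

      // passing a method reference ships the whole MyClass instance to the executors
      def badTransform(rdd: RDD[String]): RDD[String] = rdd.map(addPrefix)

      // copying the field into a local variable ships only that value
      def goodTransform(rdd: RDD[String]): RDD[String] = {
        val localPrefix = prefix
        rdd.map(s => localPrefix + s)
      }
    }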

Re: Purpose of spark-submit?

2014-07-09 Thread Koert Kuipers
sandy, that makes sense. however i had trouble doing programmatic execution on yarn in client mode as well. the application-master in yarn came up but then bombed because it was looking for jars that dont exist (it was looking in the original file paths on the driver side, which are not available

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread Sandy Ryza
To add to Ron's answer, this post explains what it means to run Spark against a YARN cluster, the difference between yarn-client and yarn-cluster mode, and the reason spark-shell only works in yarn-client mode.

Error with Stream Kafka Kryo

2014-07-09 Thread richiesgr
Hi, My setup is localMode standalone, Spark 1.0.0 release version, scala 2.10.4. I made a job that receives serialized objects from a Kafka broker. The objects are serialized using kryo. The code: val sparkConf = new SparkConf().setMaster("local[4]").setAppName("SparkTest")

Re: Purpose of spark-submit?

2014-07-09 Thread Jerry Lam
Sandy, I experienced the similar behavior as Koert just mentioned. I don't understand why there is a difference between using spark-submit and programmatic execution. Maybe there is something else we need to add to the spark conf/spark context in order to launch spark jobs programmatically that

Apache Spark, Hadoop 2.2.0 without Yarn Integration

2014-07-09 Thread Nick R. Katsipoulakis
Hello, I am currently learning Apache Spark and I want to see how it integrates with an existing Hadoop Cluster. My current Hadoop configuration is version 2.2.0 without Yarn. I have built Apache Spark (v1.0.0) following the instructions in the README file. Only setting the

Re: Purpose of spark-submit?

2014-07-09 Thread Ron Gonzalez
Koert, Yeah I had the same problems trying to do programmatic submission of spark jobs to my Yarn cluster. I was ultimately able to resolve it by reviewing the classpath and debugging through all the different things that the Spark Yarn client (Client.scala) did for submitting to Yarn (like env

Re: Purpose of spark-submit?

2014-07-09 Thread Ron Gonzalez
I am able to use Client.scala or LauncherExecutor.scala as my programmatic entry point for Yarn. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 7:14 AM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we

Re: issues with ./bin/spark-shell for standalone mode

2014-07-09 Thread Mikhail Strebkov
Hi Patrick, I used 1.0 branch, but it was not an official release, I just git pulled whatever was there and compiled. Thanks, Mikhail -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issues-with-bin-spark-shell-for-standalone-mode-tp9107p9206.html Sent

Re: Terminal freeze during SVM

2014-07-09 Thread AlexanderRiggers
By latest branch you mean Apache Spark 1.0.0 ? and what do you mean by master? Because I am using v 1.0.0 - Alex -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9208.html Sent from the Apache Spark User List

Spark 0.9.1 implementation of MLlib-NaiveBayes has a bug.

2014-07-09 Thread Rahul Bhojwani
In my view there is a bug in the MLlib Naive Bayes implementation in Spark 0.9.1. Whom should I report this to, or with whom should I discuss it? I can discuss this over a call as well. My Skype ID : rahul.bhijwani Phone no: +91-9945197359 Thanks, -- Rahul K Bhojwani 3rd Year B.Tech Computer

Re: Spark 0.9.1 implementation of MLlib-NaiveBayes has a bug.

2014-07-09 Thread Sean Owen
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Make a JIRA with enough detail to reproduce the error ideally: https://issues.apache.org/jira/browse/SPARK and then even more ideally open a PR with a fix: https://github.com/apache/spark On Wed, Jul 9, 2014 at 5:57 PM,

Re: Terminal freeze during SVM

2014-07-09 Thread DB Tsai
It means pulling the code from latest development branch from git repository. On Jul 9, 2014 9:45 AM, AlexanderRiggers alexander.rigg...@gmail.com wrote: By latest branch you mean Apache Spark 1.0.0 ? and what do you mean by master? Because I am using v 1.0.0 - Alex -- View this message in

Compilation error in Spark 1.0.0

2014-07-09 Thread Silvina Caíno Lores
Hi everyone, I am new to Spark and I'm having problems making my code compile. I have the feeling I might be misunderstanding the functions, so I would be very glad to get some insight into what could be wrong. The problematic code is the following: JavaRDD&lt;Body&gt; bodies = lines.map(l -> { Body b =

Re: Comparative study

2014-07-09 Thread Keith Simmons
Good point. Shows how personal use cases color how we interpret products. On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen so...@cloudera.com wrote: On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons ke...@pulse.io wrote: Impala is *not* built on map/reduce, though it was built to replace Hive, which

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread John Omernik
Thank you for the link. In that link the following is written: For those familiar with the Spark API, an application corresponds to an instance of the SparkContext class. An application can be used for a single batch job, an interactive session with multiple jobs spaced apart, or a long-lived

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread John Omernik
So basically, I have Spark on Yarn running (spark shell) how do I connect to it with another tool I am trying to test using the spark://IP:7077 URL it's expecting? If that won't work with spark shell, or yarn-client mode, how do I setup Spark on Yarn to be able to handle that? Thanks! On

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread Sandy Ryza
Spark doesn't currently offer you anything special to do this. I.e. if you want to write a Spark application that fires off jobs on behalf of remote processes, you would need to implement the communication between those remote processes and your Spark application code yourself. On Wed, Jul 9,

Re: Execution stalls in LogisticRegressionWithSGD

2014-07-09 Thread Xiangrui Meng
We have maven-enforcer-plugin defined in the pom. I don't know why it didn't work for you. Could you try rebuild with maven2 and confirm that there is no error message? If that is the case, please create a JIRA for it. Thanks! -Xiangrui On Wed, Jul 9, 2014 at 3:53 AM, Bharath Ravi Kumar

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread John Omernik
So how do I do the long-lived server continually satisfying requests in the Cloudera application? I am very confused by that at this point. On Wed, Jul 9, 2014 at 12:49 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Spark doesn't currently offer you anything special to do this. I.e. if you

Re: Spark SQL - java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$

2014-07-09 Thread Michael Armbrust
At first glance that looks like an error with the class shipping in the spark shell (i.e. the lines that you type into the spark shell are compiled into classes and then shipped to the executors where they run). Are you able to run other spark examples with closures in the same shell? Michael

Spark Streaming - two questions about the streamingcontext

2014-07-09 Thread Yan Fang
I am using the Spark Streaming and have the following two questions: 1. If more than one output operations are put in the same StreamingContext (basically, I mean, I put all the output operations in the same class), are they processed one by one as the order they appear in the class? Or they are

How should I add a jar?

2014-07-09 Thread Nick Chammas
I’m just starting to use the Scala version of Spark’s shell, and I’d like to add in a jar I believe I need to access Twitter data live, twitter4j http://twitter4j.org/en/index.html. I’m confused over where and how to add this jar in. SPARK-1089 https://issues.apache.org/jira/browse/SPARK-1089

Re: Spark Streaming - two questions about the streamingcontext

2014-07-09 Thread Tathagata Das
1. Multiple output operations are processed in the order they are defined. That is because by default only one output operation is processed at a time. This *can* be parallelized using an undocumented config parameter, spark.streaming.concurrentJobs, which is by default set to 1. 2. Yes, the output

Re: Spark Streaming - two questions about the streamingcontext

2014-07-09 Thread Yan Fang
Great. Thank you! Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Wed, Jul 9, 2014 at 11:45 AM, Tathagata Das tathagata.das1...@gmail.com wrote: 1. Multiple output operations are processed in the order they are defined. That is because by default each one output operation is processed at

Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-09 Thread M Singh
Hi Folks: I am working on an application which uses spark streaming (version 1.1.0 snapshot on a standalone cluster) to process text files and save counters in Cassandra based on fields in each row. I am testing the application in two modes: * Process each row and save the counter

RE: How should I add a jar?

2014-07-09 Thread Sameer Tilak
Hi Nicholas, I am using Spark 1.0 and I use this method to specify the additional jars. First jar is the dependency and the second one is my application. Hope this will work for you. ./spark-shell --jars

Some question about SQL and streaming

2014-07-09 Thread hsy...@gmail.com
Hi guys, I'm a new user to spark. I would like to know whether there is an example of how to use spark SQL and spark streaming together? My use case is that I want to do some SQL on the input stream from kafka. Thanks! Best, Siyuan

Re: How should I add a jar?

2014-07-09 Thread Nicholas Chammas
Awww ye. That worked! Thank you Sameer. Is this documented somewhere? I feel there's a slight doc deficiency here. Nick On Wed, Jul 9, 2014 at 2:50 PM, Sameer Tilak ssti...@live.com wrote: Hi Nicholas, I am using Spark 1.0 and I use this method to specify the additional jars.

Re: Compilation error in Spark 1.0.0

2014-07-09 Thread Silvina Caíno Lores
Right, the compile error is a casting issue telling me I cannot assign a JavaPairRDD&lt;Partition, Body&gt; to a JavaPairRDD&lt;Object, Object&gt;. It happens in the mapToPair() method. On 9 July 2014 19:52, Sean Owen so...@cloudera.com wrote: You forgot the compile error! On Wed, Jul 9, 2014 at 6:14 PM,

SPARK_CLASSPATH Warning

2014-07-09 Thread Nick R. Katsipoulakis
Hello, I have installed Apache Spark v1.0.0 on a machine with a proprietary Hadoop Distribution installed (v2.2.0 without yarn). Because the Hadoop Distribution that I am using relies on a list of jars, I make the following changes to conf/spark-env.sh #!/usr/bin/env bash export

Understanding how to install in HDP

2014-07-09 Thread Abel Coronado Iruegas
Hi everybody, We have a Hortonworks cluster with many nodes, and we want to test a deployment of Spark. What's the recommended path to follow? I mean, we can compile the sources on the Name Node, but I don't really understand how to pass the executable jar and configuration to the rest of the nodes.

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Bill Jay
Hi Tobias, I was using Spark 0.9 before and the master I used was yarn-standalone. In Spark 1.0, the master will be either yarn-cluster or yarn-client. I am not sure whether it is the reason why more machines do not provide better scalability. What is the difference between these two modes in

Re: Apache Spark, Hadoop 2.2.0 without Yarn Integration

2014-07-09 Thread Nick R. Katsipoulakis
Krishna, Ok, thank you. I just wanted to make sure that this can be done. Cheers, Nick On Wed, Jul 9, 2014 at 3:30 PM, Krishna Sankar ksanka...@gmail.com wrote: Nick, AFAIK, you can compile with yarn=true and still run spark in stand alone cluster mode. Cheers k/ On Wed, Jul 9,

Re: Requirements for Spark cluster

2014-07-09 Thread Krishna Sankar
I rsync the spark-1.0.1 directory to all the nodes. Yep, one needs Spark in all the nodes irrespective of Hadoop/YARN. Cheers k/ On Tue, Jul 8, 2014 at 6:24 PM, Robert James srobertja...@gmail.com wrote: I have a Spark app which runs well on local master. I'm now ready to put it on a

Number of executors change during job running

2014-07-09 Thread Bill Jay
Hi all, I have a Spark streaming job running on yarn. It consumes data from Kafka and groups the data by a certain field. The data size is 480k lines per minute and the batch size is 1 minute. For some batches, the program sometimes takes more than 3 minutes to finish the groupBy operation, which

CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-09 Thread Sameer Tilak
Hi, This time instead of manually starting the worker node using ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT I used the start-slaves script on the master node. I also enabled -v (verbose flag) in ssh. Here is the o/p that I see. The log file for the worker node was not

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-09 Thread vs
The Hortonworks Tech Preview of Spark is for Spark on YARN. It does not require Spark to be installed on all nodes manually. When you submit the Spark assembly jar it will have all its dependencies. YARN will instantiate Spark App Master Containers based on this jar. -- View this message in

spark1.0 principal component analysis

2014-07-09 Thread fintis
Hi, Can anyone please shed more light on the PCA implementation in spark? The documentation is a bit lacking, as I am not sure I understand the output. According to the docs, the output is a local matrix with the columns as principal components and the columns sorted in descending order of
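For context, a minimal sketch of the 1.0-era MLlib PCA API, assuming an existing SparkContext sc; the output is a local matrix whose k columns are the principal components.

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 10.0)
    ))
    val mat = new RowMatrix(rows)
    val pc = mat.computePrincipalComponents(2)   // 3 x 2 local matrix, one principal component per column
    val projected = mat.multiply(pc)             // rows projected into the 2-D principal subspace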

Re: How should I add a jar?

2014-07-09 Thread Nicholas Chammas
Public service announcement: If you're trying to do some stream processing on Twitter data, you'll need version 3.0.6 of twitter4j http://twitter4j.org/archive/. That should work with the Spark Streaming 1.0.0 Twitter library. The latest version of twitter4j, 4.0.2, appears to have breaking

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Tobias Pfeiffer
Bill, I haven't worked with Yarn, but I would try adding a repartition() call after you receive your data from Kafka. I would be surprised if that didn't help. On Thu, Jul 10, 2014 at 6:23 AM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi Tobias, I was using Spark 0.9 before and the master

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Bill Jay
Hi Tobias, Now I did the re-partition and ran the program again. I find a bottleneck of the whole program. In the streaming, there is a stage marked as *combineByKey at ShuffledDStream.scala:42 *in spark UI. This stage is repeatedly executed. However, during some batches, the number of executors

Re: Some question about SQL and streaming

2014-07-09 Thread Tobias Pfeiffer
Siyuan, I do it like this: // get data from Kafka val ssc = new StreamingContext(...) val kvPairs = KafkaUtils.createStream(...) // we need to wrap the data in a case class for registerAsTable() to succeed val lines = kvPairs.map(_._2).map(s => StringWrapper(s)) val result = lines.transform((rdd,
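A minimal sketch that completes the pattern above, assuming Spark 1.0-era APIs (registerAsTable, createSchemaRDD) and placeholder Kafka parameters; StringWrapper is the case class mentioned in the message.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    case class StringWrapper(line: String)

    val sparkConf = new SparkConf().setAppName("StreamingSQL")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    val sqlContext = new SQLContext(ssc.sparkContext)
    import sqlContext.createSchemaRDD   // implicit conversion so registerAsTable works on RDDs of case classes

    val kvPairs = KafkaUtils.createStream(ssc, "zkHost:2181", "myGroup", Map("myTopic" -> 1)) // placeholders
    val lines = kvPairs.map(_._2).map(s => StringWrapper(s))

    val result = lines.transform { rdd =>
      rdd.registerAsTable("lines")                  // each batch becomes a temporary table
      sqlContext.sql("SELECT COUNT(*) FROM lines")  // run SQL over the current batch
    }
    result.print()

    ssc.start()
    ssc.awaitTermination()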

Re: Use Spark Streaming to update result whenever data come

2014-07-09 Thread Tobias Pfeiffer
Bill, good to know you found your bottleneck. Unfortunately, I don't know how to solve this; until now, I have used Spark only with embarrassingly parallel operations such as map or filter. I hope someone else might provide more insight here. Tobias On Thu, Jul 10, 2014 at 9:57 AM, Bill Jay

Spark Streaming - What does Spark Streaming checkpoint?

2014-07-09 Thread Yan Fang
Hi guys, I am a little confused by the checkpointing in Spark Streaming. It checkpoints the intermediate data for the stateful operations for sure. Does it also checkpoint the information of the StreamingContext? Because it seems we can recreate the SC from the checkpoint after a driver node failure

Restarting a Streaming Context

2014-07-09 Thread Nick Chammas
So I do this from the Spark shell: // set things up // snipped ssc.start() // let things happen for a few minutes ssc.stop(stopSparkContext = false, stopGracefully = true) Then I want to restart the Streaming Context: ssc.start() // still in the shell; Spark Context is still alive Which

Re: Purpose of spark-submit?

2014-07-09 Thread Andrew Or
I don't see why using SparkSubmit.scala as your entry point would be any different, because all that does is invoke the main class of Client.scala (e.g. for Yarn) after setting up all the class paths and configuration options. (Though I haven't tried this myself) 2014-07-09 9:40 GMT-07:00 Ron
