How to automatically relaunch a Driver program after crashes?

2015-08-18 Thread Spark Enthusiast
Folks, As I see it, the Driver program is a single point of failure. I have seen ways to make it recover from failures on a restart (using checkpointing), but I have not seen anything on how to restart it automatically if it crashes. Will running the Driver as a Hadoop Yarn Applica

Re: broadcast variable of Kafka producer throws ConcurrentModificationException

2015-08-18 Thread Shenghua(Daniel) Wan
+1 I wish I had read this blog earlier. I am using Java and have just implemented a singleton producer per executor/JVM earlier today. Yes, I did see that NonSerializableException when I was debugging the code ... Thanks for sharing. On Tue, Aug 18, 2015 at 10:59 PM, Tathagata Das wrote: > I

Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-18 Thread Rick Moritz
Dear list, I am observing a very strange difference in behaviour between a Spark 1.4.0-rc4 REPL (locally compiled with Java 7) and a Spark 1.4.0 zeppelin interpreter (compiled with Java 6 and sourced from maven central). The workflow loads data from Hive, applies a number of transformations (incl

Re: Spark Interview Questions

2015-08-18 Thread Sandeep Giri
Thank you All. I have updated it to a little better version. Regards, Sandeep Giri, +1 347 781 4573 (US) +91-953-899-8962 (IN) www.KnowBigData.com. Phone: +1-253-397-1945 (Office)

SaveAsTable changes the order of rows

2015-08-18 Thread Kevin Jung
I have a simple RDD with Key/Value and do val partitioned = rdd.partitionBy(new HashPartitioner(400)) val row = partitioned.first I can get a key "G2726" from a returned row. This first row is located on a partition #0 because "G2726".hashCode is 67114000 and 67114000%400 is 0. But the order of

Re: What am I missing that's preventing javac from finding the libraries (CLASSPATH is setup...)?

2015-08-18 Thread UMESH CHAUDHARY
Just add spark_1.4.1_yarn_shuffle.jar to the classpath, or create a new Maven project using the dependencies below: org.apache.spark spark-core_2.11 1.4.1 org.apache.spark spark-sql_2.11 1.4.1 On Tue, Aug 18, 2015 at 11:51 PM, Jerry wrote: > So from what I understand, those usually pull dependenc

Re: broadcast variable of Kafka producer throws ConcurrentModificationException

2015-08-18 Thread Tathagata Das
It's a cool blog post! Tweeted it! Broadcasting the configuration necessary for lazily instantiating the producer is a good idea. Nitpick: The first code example has an extra `}` ;) On Tue, Aug 18, 2015 at 10:49 PM, Marcin Kuthan wrote: > As long as Kafka producent is thread-safe you don't need

Repartitioning external table in Spark sql

2015-08-18 Thread James Pirz
I am using Spark 1.4.1 , in stand-alone mode, on a cluster of 3 nodes. Using Spark sql and Hive Context, I am trying to run a simple scan query on an existing Hive table (which is an external table consisting of rows in text files stored in HDFS - it is NOT parquet, ORC or any other richer format)

SparkR csv without headers

2015-08-18 Thread Franc Carter
Hi, Does anyone have an example of how to create a DataFrame in SparkR which specifies the column names - the csv files I have do not have column names in the first row. I can read a csv nicely with com.databricks:spark-csv_2.10:1.0.3, but I end up with column names C1, C2, C3 etc thanks -

Re: broadcast variable of Kafka producer throws ConcurrentModificationException

2015-08-18 Thread Marcin Kuthan
As long as the Kafka producer is thread-safe you don't need any pool at all. Just share a single producer on every executor. Please look at my blog post for more details. http://allegro.tech/spark-kafka-integration.html 19 Aug 2015 2:00 AM "Shenghua(Daniel) Wan" wrote: > All of you are right. > >

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread andy petrella
Hey, Actually, for Scala, I'd rather use https://github.com/andypetrella/spark-notebook/ It's deployed at several places like *Alibaba*, *EBI*, *Cray* and is supported by both the Scala community and the company Data Fellas. For instance, it was part of the Big Scala Pipeline training given thi

Re: Too many files/dirs in hdfs

2015-08-18 Thread UMESH CHAUDHARY
Of course, Java or Scala can do that: 1) Create a FileWriter with append or roll over option 2) For each RDD create a StringBuilder after applying your filters 3) Write this StringBuilder to File when you want to write (The duration can be defined as a condition) On Tue, Aug 18, 2015 at 11:05 PM,
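A rough sketch of that approach, assuming a DStream named stream whose filtered batches are small enough to collect on the driver (all names and the output path are hypothetical):

    import java.io.FileWriter

    stream.foreachRDD { rdd =>
      // apply your filters first, then bring the (small) batch to the driver
      val lines = rdd.filter(_ != null).map(_.toString).collect()
      if (lines.nonEmpty) {
        val writer = new FileWriter("/tmp/streaming-output.log", true) // append = true
        try lines.foreach(l => writer.write(l + "\n"))
        finally writer.close()
      }
    }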

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Guru Medasani
For python it is really great. There is some work in progress in bringing Scala support to Jupyter as well. https://github.com/hohonuuli/sparknotebook https://github.com/alexarchambault/jupyter-scala

Failed to fetch block error

2015-08-18 Thread swetha
Hi, I see the following error in my Spark Job even after using like 100 cores and 16G memory. Did any of you experience the same problem earlier? 15/08/18 21:51:23 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block input-0-1439959114400, and will not retry (0 retries) java.lang.RuntimeExce

Re: to retrieve full stack trace

2015-08-18 Thread Koert Kuipers
if your error is on the executors, you need to check the executor logs for the full stack trace On Tue, Aug 18, 2015 at 10:01 PM, satish chandra j wrote: > HI All, > Please let me know if there are any arguments to be passed on the CLI to retrieve the FULL > STACK TRACE in Apache Spark > > I am stuck on an issue for which it

to retrieve full stack trace

2015-08-18 Thread satish chandra j
HI All, Please let me know if there are any arguments to be passed on the CLI to retrieve the FULL STACK TRACE in Apache Spark. I am stuck on an issue for which it would be helpful to analyze the full stack trace. Regards, Satish Chandra

Re: Is this a BUG?: Why is the Spark Flume Streaming job not deploying the Receiver to the specified host?

2015-08-18 Thread Tathagata Das
I don't think there is a super clean way of doing this. Here is an idea. Run a dummy job with a large number of partitions/tasks, which will access SparkEnv.get.blockManager().blockManagerId().host() and return it. sc.makeRDD(1 to 100, 100).map { _ => SparkEnv.get.blockManager().blockManagerId().host
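The complete expression TD is sketching would look roughly like this (a best-effort sketch for discovering executor hosts, not an official API for locating receivers):

    import org.apache.spark.SparkEnv

    // Runs one trivial task per partition and records the executor host it landed on.
    val executorHosts = sc.makeRDD(1 to 100, 100)
      .map { _ => SparkEnv.get.blockManager.blockManagerId.host }
      .collect()
      .distinct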

Re: Is this a BUG?: Why is the Spark Flume Streaming job not deploying the Receiver to the specified host?

2015-08-18 Thread diplomatic Guru
Thank you Tathagata for your response. Yes, I'm using the push model on Spark 1.2. For my scenario I do prefer the push model. Is this also the case in the later version 1.4? I think I can find a workaround for this issue, but only if I know how to obtain the worker (executor) ID. I can get the detail o

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
Hi Prabeesh, That's even better! Thanks for sharing Jerry On Tue, Aug 18, 2015 at 1:31 PM, Prabeesh K. wrote: > Refer this post > http://blog.prabeeshk.com/blog/2015/06/19/pyspark-notebook-with-docker/ > > Spark + Jupyter + Docker > > On 18 August 2015 at 21:29, Jerry Lam wrote: > >> Hi Gur

Re: broadcast variable of Kafka producer throws ConcurrentModificationException

2015-08-18 Thread Shenghua(Daniel) Wan
All of you are right. I was trying to create too many producers. My idea was to create a pool (for now the pool contains only one producer) shared by all the executors. After I realized it was related to the serializable issues (though I did not find clear clues in the source code to indicate the b

[mllib] Random forest maxBins and confidence in training points

2015-08-18 Thread Mark Alen
Hi everyone,  I have two questions regarding the random forest implementation in mllib 1- maxBins: Say the value of a feature is between [0,100]. In my dataset there are a lot of data points between [0,10] and one datapoint at 100 and nothing between (10, 100). I am wondering how does the binning

Re: broadcast variable of Kafka producer throws ConcurrentModificationException

2015-08-18 Thread Tathagata Das
Why are you even trying to broadcast a producer? A broadcast variable is some immutable piece of serializable DATA that can be used for processing on the executors. A Kafka producer is neither DATA nor immutable, and definitely not serializable. The right way to do this is to create the producer in
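A minimal sketch of that pattern in Scala (object, topic, and broker names are hypothetical; the producer is created lazily, once per executor JVM, instead of being broadcast):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object ProducerHolder {
      // A lazy val in an object is initialized at most once per JVM, i.e. once per executor.
      lazy val producer: KafkaProducer[String, String] = {
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092") // assumption: your broker list
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        new KafkaProducer[String, String](props)
      }
    }

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val p = ProducerHolder.producer // no serialization of the producer itself
        records.foreach(r => p.send(new ProducerRecord("my-topic", r.toString)))
      }
    }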

Re: What is the reason for ExecutorLostFailure?

2015-08-18 Thread Corey Nolet
Usually more information as to the cause of this will be found down in your logs. I generally see this happen when an out of memory exception has occurred for one reason or another on an executor. It's possible your memory settings are too small per executor or the concurrent number of tasks you ar

What is the reason for ExecutorLostFailure?

2015-08-18 Thread VIJAYAKUMAR JAWAHARLAL
Hi All, Why am I getting ExecutorLostFailure, with executors completely lost for the rest of the processing? Eventually it makes the job fail. One thing for sure is that a lot of shuffling happens across executors in my program. Is there a way to understand and debug ExecutorLostFailure? Any pointers

Re: broadcast variable of Kafka producer throws ConcurrentModificationException

2015-08-18 Thread Cody Koeninger
I wouldn't expect a kafka producer to be serializable at all... among other things, it has a background thread On Tue, Aug 18, 2015 at 4:55 PM, Shenghua(Daniel) Wan wrote: > Hi, > Did anyone see java.util.ConcurrentModificationException when using > broadcast variables? > I encountered this exce

broadcast variable of Kafka producer throws ConcurrentModificationException

2015-08-18 Thread Shenghua(Daniel) Wan
Hi, Did anyone see java.util.ConcurrentModificationException when using broadcast variables? I encountered this exception when wrapping a Kafka producer like this in the spark streaming driver. Here is what I did. KafkaProducer producer = new KafkaProducer(properties); final Broadcast bCastProduce

Re: Is this a BUG?: Why is the Spark Flume Streaming job not deploying the Receiver to the specified host?

2015-08-18 Thread Tathagata Das
Are you using the Flume polling stream or the older stream? Such problems of binding used to occur in the older push-based approach, hence we built the polling stream (pull-based). On Tue, Aug 18, 2015 at 4:45 AM, diplomatic Guru wrote: > I'm testing the Flume + Spark integration example (flum

Re: Scala: How to match a java object????

2015-08-18 Thread Sujit Pal
Hi Saif, Would this work? import scala.collection.JavaConversions._ new java.math.BigDecimal(5) match { case x: java.math.BigDecimal => x.doubleValue } It gives me the following on the Scala console: res9: Double = 5.0 Assuming you had a stream of BigDecimals, you could just call map on it. myBigDecimals.
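A cleaned-up, self-contained version of that snippet (the JavaConversions import is not actually needed for the match itself):

    val value: Any = new java.math.BigDecimal(5)

    val asDouble: Double = value match {
      case x: java.math.BigDecimal => x.doubleValue // typed pattern: binds the value and converts it
      case other => sys.error(s"unexpected value: $other")
    }
    // asDouble == 5.0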

Re: dse spark-submit multiple jars issue

2015-08-18 Thread Andrew Or
Hi Satish, The problem is that `--jars` accepts a comma-delimited list of jars! E.g. spark-submit ... --jars lib1.jar,lib2.jar,lib3.jar main.jar where main.jar is your main application jar (the one that starts a SparkContext), and lib*.jar refer to additional libraries that your main application

Re: Difference between Sort based and Hash based shuffle

2015-08-18 Thread Andrew Or
Hi Muhammad, On a high level, in hash-based shuffle each mapper M writes R shuffle files, one for each reducer where R is the number of reduce partitions. This results in M * R shuffle files. Since it is not uncommon for M and R to be O(1000), this quickly becomes expensive. An optimization with h
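To make the file-count arithmetic concrete: with M = 1000 map tasks and R = 1000 reduce partitions, hash-based shuffle writes on the order of M × R = 1,000,000 shuffle files, while sort-based shuffle writes one sorted data file (plus an index file) per map task, i.e. roughly 1,000 files.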

Spark scala addFile retrieving file with incorrect size

2015-08-18 Thread Bernardo Vecchia Stein
Hi all, I'm trying to run a spark job (written in scala) that uses addFile to download some small files to each node. However, one of the downloaded files has an incorrect size (the other ones are ok), which causes an error when using it in the code. I have looked more into the issue and hexdump'

Re: Json Serde used by Spark Sql

2015-08-18 Thread Michael Armbrust
Under the covers we use Jackson's Streaming API as of Spark 1.4. On Tue, Aug 18, 2015 at 1:12 PM, Udit Mehta wrote: > Hi, > > I was wondering what json serde does spark sql use. I created a JsonRDD > out of a json file and then registered it as a temp table to query. I can > then query the table

Re: Programmatically create SparkContext on YARN

2015-08-18 Thread Andrew Or
Hi Andreas, I believe the distinction is not between standalone and YARN mode, but between client and cluster mode. In client mode, your Spark submit JVM runs your driver code. In cluster mode, one of the workers (or NodeManagers if you're using YARN) in the cluster runs your driver code. In the

NaN in GraphX PageRank answer

2015-08-18 Thread Khaled Ammar
Hi all, I was trying to use GraphX to compute pagerank and found that pagerank value for several vertices is NaN. I am using Spark 1.3. Any idea how to fix that? -- Thanks, -Khaled

Re: how do I execute a job on a single worker node in standalone mode

2015-08-18 Thread Andrew Or
Hi Axel, You can try setting `spark.deploy.spreadOut` to false (through your conf/spark-defaults.conf file). What this does is essentially try to schedule as many cores on one worker as possible before spilling over to other workers. Note that you *must* restart the cluster through the sbin script

RE: Evaluating spark + Cassandra for our use cases

2015-08-18 Thread Benjamin Ross
Hi Jorn, Of course we're planning on doing a proof of concept here - the difficulty is that our timeline is short, so we cannot afford too many PoCs before we have to make a decision. We also need to figure out *which* databases to proof of concept. Note that one tricky aspect of our problem i

Re: Why doesn't standalone mode allow setting num-executors?

2015-08-18 Thread Andrew Or
Hi Canan, This is mainly for legacy reasons. The default behavior in standalone mode is that the application grabs all available resources in the cluster. This effectively means we want one executor per worker, where each executor grabs all the available cores and memory on that worker. In this

Re: Scala: How to match a java object????

2015-08-18 Thread Marcelo Vanzin
On Tue, Aug 18, 2015 at 1:19 PM, wrote: > Hi, Can you please elaborate? I am confused :-) You did note that the two pieces of code are different, right? See http://docs.scala-lang.org/tutorials/tour/pattern-matching.html for how to match things in Scala, especially the "typed pattern" example.

RE: Scala: How to match a java object????

2015-08-18 Thread Saif.A.Ellafi
Hi, Can you please elaborate? I am confused :-) Saif -Original Message- From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Tuesday, August 18, 2015 5:15 PM To: Ellafi, Saif A. Cc: wrbri...@gmail.com; user@spark.apache.org Subject: Re: Scala: How to match a java object On Tue, A

Re: Evaluating spark + Cassandra for our use cases

2015-08-18 Thread Jörn Franke
Hi, First you need to make your SLAs clear. It does not sound to me like they are defined very well, or that your solution is necessary for the scenario. I also find it hard to believe that 1 customer has 100 million transactions per month. Time series data is easy to precalculate - you do not necessari

Re: Scala: How to match a java object????

2015-08-18 Thread Marcelo Vanzin
On Tue, Aug 18, 2015 at 12:59 PM, wrote: > > 5 match { case java.math.BigDecimal => 2 } 5 match { case _: java.math.BigDecimal => 2 } -- Marcelo - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional comman

Json Serde used by Spark Sql

2015-08-18 Thread Udit Mehta
Hi, I was wondering what json serde does spark sql use. I created a JsonRDD out of a json file and then registered it as a temp table to query. I can then query the table using dot notation for nested structs/arrays. I was wondering how does spark sql deserialize the json data based on the query.

RE: Scala: How to match a java object????

2015-08-18 Thread Saif.A.Ellafi
Hi, thank you for the further assistance. You can reproduce this by simply running 5 match { case java.math.BigDecimal => 2 } In my personal case, I am applying a map action to a Seq[Any], so the elements inside are of type Any, to which I need to apply a proper .asInstanceOf[WhoYouShouldBe]. Saif

Re: Spark Job Hangs on our production cluster

2015-08-18 Thread Imran Rashid
sorry, by "repl" I mean "spark-shell", I guess I'm used to them being used interchangeably. From that thread dump, the one thread that isn't stuck is trying to get classes specifically related to the shell / repl: java.lang.Thread.State: RUNNABLE > at java.net.SocketInputStream.socketR

Re: Scala: How to match a java object????

2015-08-18 Thread William Briggs
Could you share your pattern matching expression that is failing? On Tue, Aug 18, 2015, 3:38 PM wrote: > Hi all, > > I am trying to run a spark job, in which I receive *java.math.BigDecimal* > objects, > instead of the scala equivalents, and I am trying to convert them into > Doubles. > If I t

Scala: How to match a java object????

2015-08-18 Thread Saif.A.Ellafi
Hi all, I am trying to run a spark job, in which I receive java.math.BigDecimal objects, instead of the scala equivalents, and I am trying to convert them into Doubles. If I try to match-case this object class, I get: "error: object java.math.BigDecimal is not a value" How could I get around m

RE: Spark Job Hangs on our production cluster

2015-08-18 Thread java8964
Hi, Imran: Thanks for your reply. I am not sure what you mean by "repl". Can you give more detail about that? This only happens when Spark 1.2.2 tries to scan a big data set, and it cannot be reproduced when it scans a smaller dataset. FYI, I have to build and deploy Spark 1.3.1 on our production cluster.

Evaluating spark + Cassandra for our use cases

2015-08-18 Thread Benjamin Ross
My company is interested in building a real-time time-series querying solution using Spark and Cassandra. Specifically, we're interested in setting up a Spark system against Cassandra running a hive thrift server. We need to be able to perform real-time queries on time-series data - things lik

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Shushant Arora
But KafkaRDD[K, V, U, T, R] is not a subclass of RDD[R]; since Java does not support covariant generic inheritance, a derived class cannot return a differently parameterized generic subclass from an overridden method. On Wed, Aug 19, 2015 at 12:02 AM, Cody Koeninger wrote: > Option is covariant and KafkaRDD is a subclass o

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Cody Koeninger
Option is covariant and KafkaRDD is a subclass of RDD On Tue, Aug 18, 2015 at 1:12 PM, Shushant Arora wrote: > Is it that in scala its allowed for derived class to have any return type ? > > And streaming jar is originally created in scala so its allowed for > DirectKafkaInputDStream to return
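A tiny illustration of why the covariance argument works (plain Scala, unrelated to the Spark classes themselves):

    class Animal
    class Dog extends Animal

    val someDog: Option[Dog] = Some(new Dog)
    // Compiles because Option[+A] is covariant: Option[Dog] <: Option[Animal].
    // By the same logic, an Option[KafkaRDD[...]] is an Option[RDD[R]].
    val someAnimal: Option[Animal] = someDog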

Re: What am I missing that's preventing javac from finding the libraries (CLASSPATH is setup...)?

2015-08-18 Thread Ted Yu
Normally people would set up a Maven project with Spark dependencies, or use sbt. Can you go with either approach? Cheers On Tue, Aug 18, 2015 at 10:28 AM, Jerry wrote: > Hello, > > So I setup Spark to run on my local machine to see if I can reproduce the > issue I'm having with data frames,

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Shushant Arora
Is it that in Scala a derived class is allowed to have a different return type? And the streaming jar is originally written in Scala, so DirectKafkaInputDStream is allowed to return Option[KafkaRDD[K, V, U, T, R]] from its compute method? On Tue, Aug 18, 2015 at 8:36 PM, Shushant Arora wrote: > looking a

Re: What am I missing that's preventing javac from finding the libraries (CLASSPATH is setup...)?

2015-08-18 Thread Jerry
So from what I understand, those usually pull dependencies for a given project? I'm able to run the spark shell, so I'd assume I have everything. What am I missing from the big picture, and what directory do I run maven in? Thanks, Jerry On Tue, Aug 18, 2015 at 11:15 AM, Ted Yu wrote: > N

COMPUTE STATS on hive table - NoSuchTableException

2015-08-18 Thread VIJAYAKUMAR JAWAHARLAL
Hi I am trying to compute stats on a lookup table from spark which resides in hive. I am invoking spark API as follows. It gives me NoSuchTableException. Table is double verified and subsequent statement “sqlContext.sql(“select * from cpatext.lkup”)” picks up the table correctly. I am wondering

Re: Too many files/dirs in hdfs

2015-08-18 Thread Mohit Anchlia
Is there a way to store all the results in one file and keep the file roll over separate than the spark streaming batch interval? On Mon, Aug 17, 2015 at 2:39 AM, UMESH CHAUDHARY wrote: > In Spark Streaming you can simply check whether your RDD contains any > records or not and if records are th

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Prabeesh K.
Refer this post http://blog.prabeeshk.com/blog/2015/06/19/pyspark-notebook-with-docker/ Spark + Jupyter + Docker On 18 August 2015 at 21:29, Jerry Lam wrote: > Hi Guru, > > Thanks! Great to hear that someone tried it in production. How do you like > it so far? > > Best Regards, > > Jerry > > >

Re: Left outer joining big data set with small lookups

2015-08-18 Thread VIJAYAKUMAR JAWAHARLAL
Nope. The count action did not help to choose a broadcast join. All of my tables are Hive external tables. So, I tried to trigger compute statistics from sqlContext.sql. It gives me an error saying “no such table”. I am not sure whether that is due to the following bug in 1.4.1 https://issues.apache.org/jira/br

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
Hi Guru, Thanks! Great to hear that someone tried it in production. How do you like it so far? Best Regards, Jerry On Tue, Aug 18, 2015 at 11:38 AM, Guru Medasani wrote: > Hi Jerry, > > Yes. I’ve seen customers using this in production for data science work. > I’m currently using this for on

What am I missing that's preventing javac from finding the libraries (CLASSPATH is setup...)?

2015-08-18 Thread Jerry
Hello, So I setup Spark to run on my local machine to see if I can reproduce the issue I'm having with data frames, but I'm running into issues with the compiler. Here's what I got: $ echo $CLASSPATH /usr/lib/jvm/java-6-oracle/lib:/home/adminz/dev/spark/spark-1.4.1/lib/spark-assembly-1.4.1-hadoo

Re: Spark executor lost because of GC overhead limit exceeded even though using 20 executors using 25GB each

2015-08-18 Thread Ted Yu
Do you mind providing a bit more information: the release of Spark, a code snippet of your app, and the version of Java? Thanks On Tue, Aug 18, 2015 at 8:57 AM, unk1102 wrote: > Hi this GC overhead limit error is making me crazy. I have 20 executors > using > 25 GB each I dont understand at all how can it t

Re: Spark Job Hangs on our production cluster

2015-08-18 Thread Imran Rashid
just looking at the thread dump from your original email, the 3 executor threads are all trying to load classes. (One thread is actually loading some class, and the others are blocked waiting to load a class, most likely trying to load the same thing.) That is really weird, definitely not somethi

Spark executor lost because of GC overhead limit exceeded even though using 20 executors using 25GB each

2015-08-18 Thread unk1102
Hi, this GC overhead limit error is making me crazy. I have 20 executors using 25 GB each and I don't understand at all how it can throw GC overhead errors; my datasets are not that big either. Once this GC error occurs in an executor it gets lost, and slowly other executors get lost because of IOException, R

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Guru Medasani
Hi Jerry, Yes. I’ve seen customers using this in production for data science work. I’m currently using this for one of my projects on a cluster as well. Also, here is a blog that describes how to configure this. http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spa

Spark and ActorSystem

2015-08-18 Thread maxdml
Hi, I'd like to know where I could find more information related to the deprecation of the actor system in Spark (from 1.4.x). I'm interested in the reasons for this decision. Cheers -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-ActorSystem-t

Issues with S3 paths that contain colons

2015-08-18 Thread bstempi
Hi, I'm running Spark on Amazon EMR (Spark 1.4.1, Hadoop 2.6.0). I'm seeing the exception below when encountering file names that contain colons. Any idea on how to get around this? scala> val files = sc.textFile("s3a://redactedbucketname/*") 2015-08-18 04:38:34,567 INFO [main] storage.MemoryS

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Shushant Arora
looking at source code of org.apache.spark.streaming.kafka.DirectKafkaInputDStream override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = { val untilOffsets = clamp(latestLeaderOffsets(maxRetries)) val rdd = KafkaRDD[K, V, U, T, R]( context.sparkContext, kafkaParams

Re: Transform KafkaRDD to KafkaRDD, not plain RDD, or how to keep OffsetRanges after transformation

2015-08-18 Thread Cody Koeninger
Python version has been available since 1.4. It should be close to feature parity with the jvm version in 1.5 On Tue, Aug 18, 2015 at 9:36 AM, ayan guha wrote: > Hi Cody > > A non-related question. Any idea when Python-version of direct receiver is > expected? Me personally looking forward to i

Spark works with the data in another cluster(Elasticsearch)

2015-08-18 Thread gen tang
Hi, Currently, I have my data in the cluster of Elasticsearch and I try to use spark to analyse those data. The cluster of Elasticsearch and the cluster of spark are two different clusters. And I use hadoop input format(es-hadoop) to read data in ES. I am wondering how this environment affect the

Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-18 Thread Rick Moritz
Dear list, I am observing a very strange behaviour between a Spark 1.4.0-rc4 REPL (locally compiled with Java 7) and a Spark 1.4.0 zeppelin interpreter (compiled with Java 6 and sourced from maven central). The workflow loads data from Hive, applies a number of transformations (including quite a

Re: Transform KafkaRDD to KafkaRDD, not plain RDD, or how to keep OffsetRanges after transformation

2015-08-18 Thread ayan guha
Hi Cody A non-related question. Any idea when Python-version of direct receiver is expected? Me personally looking forward to it :) On Wed, Aug 19, 2015 at 12:02 AM, Cody Koeninger wrote: > The solution you found is also in the docs: > > http://spark.apache.org/docs/latest/streaming-kafka-integ

Re: Spark 1.4.1 - Mac OSX Yosemite

2015-08-18 Thread Charlie Hack
Looks like Scala 2.11.6 and Java 1.7.0_79. ✔ ~ 09:17 $ scala Welcome to Scala version 2.11.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_79). Type in expressions to have them evaluated. Type :help for more information. scala> ✔ ~ 09:26 $ echo $JAVA_HOME /Library/Java/JavaVirtualMachines/jdk1.

Re: Java 8 lambdas

2015-08-18 Thread Sean Owen
Yes, it should Just Work. Lambdas can be used for any method that takes an instance of an interface with one method, and that describes Function, PairFunction, etc. On Tue, Aug 18, 2015 at 3:23 PM, Kristoffer Sjögren wrote: > Hi > > Is there a way to execute spark jobs with Java 8 lambdas instead

Java 8 lambdas

2015-08-18 Thread Kristoffer Sjögren
Hi Is there a way to execute spark jobs with Java 8 lambdas instead of using anonymous inner classes as seen in the examples? I think I remember seeing real lambdas in the examples before and in articles [1]? Cheers, -Kristoffer [1] http://blog.cloudera.com/blog/2014/04/making-apache-spark-eas

Re: Transform KafkaRDD to KafkaRDD, not plain RDD, or how to keep OffsetRanges after transformation

2015-08-18 Thread Cody Koeninger
The solution you found is also in the docs: http://spark.apache.org/docs/latest/streaming-kafka-integration.html Java uses an atomic reference because Java doesn't allow you to close over non-final references. I'm not clear on your other question. On Tue, Aug 18, 2015 at 3:43 AM, Petr Novak wr
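The Scala version of that documented pattern looks roughly like this (assuming a direct stream of (key, value) pairs named directKafkaStream):

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    var offsetRanges = Array.empty[OffsetRange]

    directKafkaStream.transform { rdd =>
      // Capture the ranges here, before any transformation loses the KafkaRDD type.
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.map(_._2).foreachRDD { rdd =>
      offsetRanges.foreach { o =>
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
      }
    }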

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Cody Koeninger
The superclass method in DStream is defined as returning an Option[RDD[T]] On Tue, Aug 18, 2015 at 8:07 AM, Shushant Arora wrote: > Getting compilation error while overriding compute method of > DirectKafkaInputDStream. > > > [ERROR] CustomDirectKafkaInputDstream.java:[51,83] > compute(org.apach

Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
Hi spark users and developers, Did anyone have IPython Notebook (Jupyter) deployed in production that uses Spark as the computational engine? I know Databricks Cloud provides similar features with deeper integration with Spark. However, Databricks Cloud has to be hosted by Databricks so we cannot

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-18 Thread Shushant Arora
Getting compilation error while overriding compute method of DirectKafkaInputDStream. [ERROR] CustomDirectKafkaInputDstream.java:[51,83] compute(org.apache.spark.streaming.Time) in CustomDirectKafkaInputDstream cannot override compute(org.apache.spark.streaming.Time) in org.apache.spark.streaming

Re: how to write any data (non RDD) to a file inside closure?

2015-08-18 Thread Robineast
Still not sure what you are trying to achieve. If you could post some code that doesn’t work the community can help you understand where the error (syntactic or conceptual) is. > On 17 Aug 2015, at 17:42, dianweih001 [via Apache Spark User List] > wrote: > > Hi Robin, > > I know how to write

Re: how do I execute a job on a single worker node in standalone mode

2015-08-18 Thread Igor Berman
By default, standalone mode creates 1 executor on every worker machine per application. The overall number of cores is configured with --total-executor-cores, so in general if you specify --total-executor-cores=1 then there will be only 1 core on some executor and you'll get what you want. On the other han

Is this a BUG?: Why is the Spark Flume Streaming job not deploying the Receiver to the specified host?

2015-08-18 Thread diplomatic Guru
I'm testing the Flume + Spark integration example (flume count). I'm deploying the job using yarn cluster mode. I first logged into the Yarn cluster, then submitted the job and passed in a specific worker node's IP to deploy the job. But when I checked the WebUI, it failed to bind to the specifie

Re: global variable in spark streaming with no dependency on key

2015-08-18 Thread Joanne Contact
Thanks. I tried. The problem is I have to use updateStateByKey to maintain other states related to keys. I am not sure where to pass this accumulator variable into updateStateByKey. On Tue, Aug 18, 2015 at 2:17 AM, Hemant Bhanawat wrote: > See if SparkContext.accumulator helps. > > On Tue, Aug 18, 2015

Re: Why are there overlapping tasks on the EventTimeline UI

2015-08-18 Thread Todd
I think I found the answer. On the UI, the recorded time of each task is when it is put into the thread pool; then the UI makes sense. At 2015-08-18 17:40:07, "Todd" wrote: Hi, Following is copied from the Spark EventTimeline UI. I don't understand why there is overlap between tasks?

Why are there overlapping tasks on the EventTimeline UI

2015-08-18 Thread Todd
Hi, Following is copied from the Spark EventTimeline UI. I don't understand why there is overlap between tasks? I think they should run sequentially, one by one, in one executor (there is one core per executor). The blue part of each task is the scheduler delay time. Does it mean it is the d

Why doesn't standalone mode allow setting num-executors?

2015-08-18 Thread canan chen
num-executors only works in YARN mode. In standalone mode, I have to set --total-executor-cores and --executor-cores. Isn't this way less intuitive? Any reason for that?

Re: global variable in spark streaming with no dependency on key

2015-08-18 Thread Hemant Bhanawat
See if SparkContext.accumulator helps. On Tue, Aug 18, 2015 at 2:27 PM, Joanne Contact wrote: > Hi Gurus, > > Please help. > > But please don't tell me to use updateStateByKey because I need a > global variable (something like the clock time) across the micro > batches but not depending on key.
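For reference, a minimal accumulator sketch (a counter shared across executors and readable on the driver after each batch; whether it fits the "global clock" use case is up to the OP, and the stream name is hypothetical):

    // Spark 1.x accumulator API
    val eventCount = sc.accumulator(0L, "eventCount")

    stream.foreachRDD { rdd =>
      rdd.foreach { _ => eventCount += 1L }        // executors add to it
      println(s"seen so far: ${eventCount.value}") // only the driver may read the value
    }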

Re: Regarding rdd.collect()

2015-08-18 Thread Hemant Bhanawat
On Tue, Aug 18, 2015 at 1:16 PM, Dawid Wysakowicz < wysakowicz.da...@gmail.com> wrote: > No, the data is not stored between two jobs. But it is stored for a > lifetime of a job. Job can have multiple actions run. > I too thought so but wanted to confirm. Thanks. > > For a matter of sharing an rdd

global variable in spark streaming with no dependency on key

2015-08-18 Thread Joanne Contact
Hi Gurus, Please help. But please don't tell me to use updateStateByKey because I need a global variable (something like the clock time) across the micro batches but not depending on the key. For my case, it is not acceptable to maintain a state for each key since each key comes in at different times. Y

Re: Transform KafkaRDD to KafkaRDD, not plain RDD, or how to keep OffsetRanges after transformation

2015-08-18 Thread Petr Novak
The solution how to share offsetRanges after DirectKafkaInputStream is transformed is in: https://github.com/apache/spark/blob/master/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala https://github.com/apache/spark/blob/master/external/kafka/src/test/java/

Spark SQL Partition discovery - schema evolution

2015-08-18 Thread Guy Hadash
Hi all, I'm using Spark SQL with data from Openstack Swift. I'm trying to load parquet files with partition discovery, but I can't do it when the partitions don't match between two objects. For example, a container which contains: /zone=0/object2 /zone=0/area=0/object1 won't load, and will result

Re: Running Spark on user-provided Hadoop installation

2015-08-18 Thread gauravsehgal
Refer: http://spark.apache.org/docs/latest/hadoop-provided.html Specifically if you want to refer s3a paths. Please edit spark-env.sh and add following lines at end: SPARK_DIST_CLASSPATH=$(/path/to/hadoop/hadoop-2.7.1/bin/hadoop classpath) export SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:/path/t

Re: Regarding rdd.collect()

2015-08-18 Thread ayan guha
I think you are mixing the notion of a job from the Hadoop MapReduce world with Spark. In Spark, RDDs are immutable and transformations are lazy. So the first time an rdd actually fills up memory is when you run the first transformation. After that, it stays in memory until either the application is stopped

Re: Changed Column order in DataFrame.Columns call and insertIntoJDBC

2015-08-18 Thread Todd
Take a look at the doc for the method: /** * Applies a schema to an RDD of Java Beans. * * WARNING: Since there is no guaranteed ordering for fields in a Java Bean, * SELECT * queries will return the columns in an undefined order. * @group dataframes * @since 1.3

Re: Regarding rdd.collect()

2015-08-18 Thread Dawid Wysakowicz
No, the data is not stored between two jobs. But it is stored for the lifetime of a job. A job can have multiple actions run. For a matter of sharing an rdd between jobs you can have a look at Spark Job Server (spark-jobserver) or some in-memory storage: Tach

Re: Re: Regarding rdd.collect()

2015-08-18 Thread Todd
One Spark application can have many jobs, e.g., first call rdd.count and then call rdd.collect. At 2015-08-18 15:37:14, "Hemant Bhanawat" wrote: It is still in memory for future rdd transformations and actions. This is interesting. You mean Spark holds the data in memory between two job execut

Changed Column order in DataFrame.Columns call and insertIntoJDBC

2015-08-18 Thread MooseSpark
I have an RDD which I am using to create a data frame based on one POJO, but when the DataFrame is created, the order of the columns gets changed. DataFrame df=sqlCtx.createDataFrame(rdd, Pojo.class); String[] columns=df.columns(); // columns here are in a different order from what has been defined in po

Re: Regarding rdd.collect()

2015-08-18 Thread Hemant Bhanawat
It is still in memory for future rdd transformations and actions. This is interesting. You mean Spark holds the data in memory between two job executions. How does the second job get the handle of the data in memory? I am interested in knowing more about it. Can you forward me a spark article or

Re: Difference btw MEMORY_ONLY and MEMORY_AND_DISK

2015-08-18 Thread Sabarish Sasidharan
MEMORY_ONLY will fail if there is not enough memory but MEMORY_AND_DISK will spill to disk Regards Sab On Tue, Aug 18, 2015 at 12:45 PM, Harsha HN <99harsha.h@gmail.com> wrote: > Hello Sparkers, > > I would like to understand difference btw these Storage levels for a RDD > portion that doesn
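For completeness, the two levels are selected like this (a small sketch with hypothetical paths; see the StorageLevel docs for the precise spill/recompute semantics):

    import org.apache.spark.storage.StorageLevel

    val rddA = sc.textFile("hdfs:///path/a").persist(StorageLevel.MEMORY_ONLY)      // keep partitions in memory only
    val rddB = sc.textFile("hdfs:///path/b").persist(StorageLevel.MEMORY_AND_DISK)  // overflow partitions go to disk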

Re: registering an empty RDD as a temp table in a PySpark SQL context

2015-08-18 Thread Hemant Bhanawat
It is definitely not the case for Spark SQL. A temporary table (much like a DataFrame) is just a logical plan with a name, and it is not iterated over unless a query is fired on it. I am not sure if using rdd.take in the py code to verify the schema is the right approach, as it creates a Spark job. BTW, why w

Re: issue Running Spark Job on Yarn Cluster

2015-08-18 Thread MooseSpark
Please check logs in your hadoop yarn cluster, there you would get precise error or exception. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issue-Running-Spark-Job-on-Yarn-Cluster-tp21779p24308.html Sent from the Apache Spark User List mailing list archi

Re: Regarding rdd.collect()

2015-08-18 Thread Sabarish Sasidharan
It is still in memory for future rdd transformations and actions. What you get in driver is a copy of the data. Regards Sab On Tue, Aug 18, 2015 at 12:02 PM, praveen S wrote: > When I do an rdd.collect().. The data moves back to driver Or is still > held in memory across the executors? > --
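A small sketch of that distinction (names hypothetical):

    val cached = sc.parallelize(1 to 1000000).cache()
    cached.count()                    // materializes the cache on the executors
    val localCopy = cached.collect()  // copies the data to the driver; the executor-side cache is untouched
    cached.map(_ * 2).count()         // later actions still reuse the cached partitions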
