Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-06 Thread Tathagata Das
Why isn't a simple window function sufficient? eventData.window(Minutes(15), Seconds(3)) will keep generating RDDs every 3 seconds, each containing the last 15 minutes of data. TD On Wed, Aug 6, 2014 at 3:43 PM, salemi alireza.sal...@udo.edu wrote: Hi, I have a DStream called eventData and it
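
For reference, a minimal sketch of the window-only approach TD describes; eventData stands in for whatever DStream the original poster already has, and the printed count is illustrative.

```scala
import org.apache.spark.streaming.{Minutes, Seconds}
import org.apache.spark.streaming.dstream.DStream

// Every 3 seconds, emit an RDD covering the last 15 minutes of received data.
def windowedCounts(eventData: DStream[String]): Unit =
  eventData.window(Minutes(15), Seconds(3)).foreachRDD { rdd =>
    println(s"Events in the last 15 minutes: ${rdd.count()}")
  }
```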

Re: RDD to DStream

2014-08-06 Thread Tathagata Das
Hey Aniket, Great thoughts! I understand the use case. But as you have realized yourself, it is not trivial to cleanly stream an RDD as a DStream. Since RDD operations are defined to be scan-based, it is not efficient to define an RDD based on slices of data within a partition of another RDD, using

Re: Stopping StreamingContext does not kill receiver

2014-08-06 Thread Tathagata Das
I narrowed down the error. Unfortunately this is not a quick fix. I have opened a JIRA for this. https://issues.apache.org/jira/browse/SPARK-2892 On Wed, Aug 6, 2014 at 3:59 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Okay let me give it a shot. On Wed, Aug 6, 2014 at 3:57 PM

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-07 Thread Tathagata Das
Okay, going back to your original question, it wasn't clear what reduce function you are trying to implement. Going by the 2nd example using the window() operation, followed by a count+filter (using SQL), I am guessing you are trying to maintain a count of all the active states in the

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
It could be because of the variable enableOpStat. Since it's defined outside foreachRDD, referring to it inside the rdd.foreach is probably causing the whole streaming context to be included in the closure. Scala funkiness. Try this, see if it works. msgCount.join(ddCount).foreachRDD((rdd:

Re: spark streaming multiple file output paths

2014-08-07 Thread Tathagata Das
The problem boils down to how to write an RDD in that way. You could use the HDFS FileSystem API to write each partition directly. pairRDD.groupByKey().foreachPartition(iterator => iterator.map { case (key, values) => // Open an output stream to destination file base-path/key/whatever
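
A minimal sketch of the per-key write TD outlines; the base path, file-naming scheme, and String value type are assumptions, not from the thread.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

def writeByKey(pairRDD: RDD[(String, String)], basePath: String): Unit = {
  pairRDD.groupByKey().foreachPartition { iterator =>
    // One FileSystem handle per partition, created on the executor.
    val fs = FileSystem.get(new Configuration())
    iterator.foreach { case (key, values) =>
      val out = fs.create(new Path(s"$basePath/$key/part-${java.util.UUID.randomUUID()}"))
      try values.foreach(v => out.write((v + "\n").getBytes("UTF-8")))
      finally out.close()
    }
  }
}
```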

Re: Spark Streaming- Input from Kafka, output to HBase

2014-08-07 Thread Tathagata Das
For future reference in this thread, a better set of examples than the MetricAggregatorHBase on the JIRA to look at are here https://github.com/tmalaska/SparkOnHBase On Thu, Aug 7, 2014 at 1:41 AM, Khanderao Kand khanderao.k...@gmail.com wrote: I hope this has been resolved, were u

Re: spark streaming actor receiver doesn't play well with kryoserializer

2014-08-07 Thread Tathagata Das
for BlockGenerator called at time 1406336129800 14/07/25 17:55:30 [DEBUG] RecurringTimer: Callback for BlockGenerator called at time 140633613 On Jul 25, 2014, at 3:20 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Is this error on the executor or on the driver? Can you provide a larger

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
, Padmanabhan, Mahesh (contractor) mahesh.padmanab...@twc-contractor.com wrote: Thanks TD but unfortunately that did not work. From: Tathagata Das tathagata.das1...@gmail.com Date: Thursday, August 7, 2014 at 10:55 AM To: Mahesh Padmanabhan mahesh.padmanab...@twc-contractor.com Cc: user

Re: Spark Streaming Workflow Validation

2014-08-07 Thread Tathagata Das
I am not sure if it is a typo or not, but how are you using groupByKey to get the summed_values? Assuming you meant reduceByKey(), these workflows seem pretty efficient. TD On Thu, Aug 7, 2014 at 10:18 AM, Dan H. dch.ema...@gmail.com wrote: I wanted to post for validation to understand

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
From the extended info, I see that you have a function called createStreamingContext() in your code. Somehow that is getting referenced in the foreach function. Is the whole foreachRDD code inside the createStreamingContext() function? Did you try marking the ssc field as transient? Here is a

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
= KafkaConsumer.messageStream(ssc) messageStream.foreachRDD(rdd => { A.func1(ssc.sparkContext) } Seems like the call A.func1(ssc.sparkContext) above is the cause of the exception. Thanks, Mahesh From: Tathagata Das tathagata.das1...@gmail.com Date: Thursday, August 7, 2014 at 1:11 PM To: amit amit.codenam

Re: Spark 1.0.1 NotSerialized exception (a bit of a head scratcher)

2014-08-07 Thread Tathagata Das
LOL! Glad it solved it. TD On Thu, Aug 7, 2014 at 2:23 PM, Padmanabhan, Mahesh (contractor) mahesh.padmanab...@twc-contractor.com wrote: Slap my head moment – using rdd.context solved it! Thanks TD, Mahesh From: Tathagata Das tathagata.das1...@gmail.com Date: Thursday, August 7, 2014
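
Sketch of the fix this thread converged on: inside foreachRDD, reach the SparkContext through the RDD itself (rdd.context) instead of capturing the StreamingContext's field in the closure. A.func1 stands in for whatever helper needs a SparkContext.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.dstream.DStream

object A {
  def func1(sc: SparkContext): Unit = { /* driver-side work that needs the SparkContext */ }
}

def wire(messageStream: DStream[String]): Unit =
  messageStream.foreachRDD { rdd =>
    A.func1(rdd.context)   // avoids pulling the enclosing, non-serializable object into the closure
  }
```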

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-07 Thread Tathagata Das
Are you running on a cluster but giving a local path in ssc.checkpoint(...) ? TD On Thu, Aug 7, 2014 at 3:24 PM, salemi alireza.sal...@udo.edu wrote: Hi, Thank you for your help. With the new code I am getting the following error in the driver. What is going wrong here? 14/08/07 13:22:28

Re: Spark Streaming + reduceByWindow(reduceFunc, invReduceFunc, windowDuration, slideDuration

2014-08-07 Thread Tathagata Das
That is required for driver fault-tolerance, as well as for some transformations like updateStateByKey that persist information across batches. It must be an HDFS directory when running on a cluster. TD On Thu, Aug 7, 2014 at 4:25 PM, salemi alireza.sal...@udo.edu wrote: That is correct. I do

Re: Use SparkStreaming to find the max of a dataset?

2014-08-07 Thread Tathagata Das
You can do the following. var globalMax = ... dstreamOfNumericalType.foreachRDD( rdd => { globalMax = math.max(rdd.max, globalMax) }) globalMax will keep getting updated after every batch. TD On Thu, Aug 7, 2014 at 5:31 PM, bumble123 tc1...@att.com wrote: I can't figure out how to use
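
A runnable sketch of that pattern, assuming the stream is a DStream[Double]; the empty-RDD guard is an addition, since rdd.max() fails on an empty batch.

```scala
import org.apache.spark.streaming.dstream.DStream

// Track a running maximum on the driver, updated once per batch.
def trackMax(dstreamOfNumericalType: DStream[Double]): Unit = {
  var globalMax = Double.MinValue
  dstreamOfNumericalType.foreachRDD { rdd =>
    if (rdd.count() > 0) {
      globalMax = math.max(rdd.max(), globalMax)
    }
    println(s"Max seen so far: $globalMax")
  }
}
```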

Re: Shared variable in Spark Streaming

2014-08-08 Thread Tathagata Das
Do you mean that you want a continuously updated count as more events/records are received in the DStream (remember, DStream is a continuous stream of data)? Assuming that is what you want, you can use a global counter var globalCount = 0L dstream.count().foreachRDD(rdd => { globalCount +=

Re: Custom Transformations in Spark

2014-08-08 Thread Tathagata Das
You can always define an arbitrary RDD-to-RDD function and use it from both Spark and Spark Streaming. For example, def myTransformation(rdd: RDD[X]): RDD[Y] = { } In Spark you can obviously apply it to an RDD. In Spark Streaming, you can apply it to the RDDs of a DStream by
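
A sketch of that shared-transformation pattern via DStream.transform; the parsing logic is illustrative only.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// One RDD-to-RDD function, reused in batch code and in streaming code.
def myTransformation(rdd: RDD[String]): RDD[(String, Int)] =
  rdd.map(_.split(","))
     .filter(_.length >= 2)
     .map(fields => (fields(0), fields(1).toInt))

// Batch: apply directly to an RDD.
def batchUsage(lines: RDD[String]): RDD[(String, Int)] = myTransformation(lines)

// Streaming: apply to every RDD of a DStream.
def streamingUsage(lines: DStream[String]): DStream[(String, Int)] =
  lines.transform(myTransformation _)
```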

Re: java.lang.StackOverflowError when calling count()

2014-08-12 Thread Tathagata Das
The long lineage causes a long/deep Java object tree (DAG of RDD objects), which needs to be serialized as part of the task creation. When serializing, the whole object DAG needs to be traversed, leading to the stack overflow error. TD On Mon, Aug 11, 2014 at 7:14 PM, randylu randyl...@gmail.com
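
The snippet explains the cause; a common mitigation (not stated in this excerpt) is to checkpoint periodically so the lineage is truncated. A sketch, with the checkpoint directory, iteration count, and interval as placeholders:

```scala
import org.apache.spark.SparkContext

def iterate(sc: SparkContext): Unit = {
  sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // placeholder path
  var data = sc.parallelize(1 to 1000000).map(_.toLong)
  for (i <- 1 to 200) {
    data = data.map(_ + 1)
    if (i % 10 == 0) {
      data.checkpoint()   // cut the lineage every 10 iterations
      data.count()        // force materialization so the checkpoint happens
    }
  }
  println(data.count())
}
```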

Re: how to use the method saveAsTextFile of a RDD like javaRDDmyOwnClass[]

2014-08-14 Thread Tathagata Das
FlatMap the JavaRDD<BooleanPair[]> to JavaRDD<BooleanPair>. Then it should work. TD On Thu, Aug 14, 2014 at 1:23 AM, Gefei Li gefeili.2...@gmail.com wrote: Hello, I wrote a class named BooleanPair: public static class BooleanPairet implements Serializable{ public Boolean

Re: spark streaming - lamda architecture

2014-08-14 Thread Tathagata Das
Can you be a bit more specific about what you mean by lambda architecture? On Thu, Aug 14, 2014 at 2:27 PM, salemi alireza.sal...@udo.edu wrote: Hi, How would you implement the batch layer of lamda architecture with spark/spark streaming? Thanks, Ali -- View this message in context:

Re: Spark Streaming: DStream - zipWithIndex

2014-08-28 Thread Tathagata Das
If you just want an arbitrary unique id attached to each record in a DStream (no ordering, etc.), then why not generate and attach a UUID to each record? On Wed, Aug 27, 2014 at 4:18 PM, Soumitra Kumar kumar.soumi...@gmail.com wrote: I see a issue here. If rdd.id is 1000 then rdd.id *
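
A sketch of the UUID suggestion, tagging every record with an arbitrary unique id (no ordering guarantees, unlike zipWithIndex):

```scala
import java.util.UUID
import org.apache.spark.streaming.dstream.DStream

def withIds(records: DStream[String]): DStream[(String, String)] =
  records.map(record => (UUID.randomUUID().toString, record))
```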

Re: Spark Streaming: DStream - zipWithIndex

2014-08-28 Thread Tathagata Das
: Yes, that is an option. I started with a function of batch time, and index to generate id as long. This may be faster than generating UUID, with added benefit of sorting based on time. - Original Message - From: Tathagata Das tathagata.das1...@gmail.com To: Soumitra Kumar

Re: Converting a DStream's RDDs to SchemaRDDs

2014-08-28 Thread Tathagata Das
Try using local[n] with n > 1, instead of local. Since receivers take up 1 slot, and local is basically 1 slot, there is no slot left to process the data. That's why nothing gets printed. TD On Thu, Aug 28, 2014 at 10:28 AM, Verma, Rishi (398J) rishi.ve...@jpl.nasa.gov wrote: Hi Folks, I'd

Re: Failed to run runJob at ReceiverTracker.scala

2014-08-28 Thread Tathagata Das
Do you see this error right in the beginning or after running for some time? The root cause seems to be that somehow your Spark executors got killed, which killed receivers and caused further errors. Please try to take a look at the executor logs of the lost executor to find what is the root cause

Re: DStream repartitioning, performance tuning processing

2014-08-28 Thread Tathagata Das
If you are repartitioning to 8 partitions, and your nodes happen to have at least 4 cores each, it's possible that all 8 partitions are assigned to only 2 nodes. Try increasing the number of partitions. Also make sure you have executors (allocated by YARN) running on more than two nodes if you want

Re: Failed to run runJob at ReceiverTracker.scala

2014-08-28 Thread Tathagata Das
worker? On Thu, Aug 28, 2014 at 4:12 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Do you see this error right in the beginning or after running for sometime? The root cause seems to be that somehow your Spark executors got killed, which killed receivers and caused further errors

Re: [Streaming] Akka-based receiver with messages defined in uploaded jar

2014-08-29 Thread Tathagata Das
...@genesys.com] *Sent:* Wednesday, August 27, 2014 6:46 PM *To:* Tathagata Das *Cc:* user@spark.apache.org *Subject:* RE: [Streaming] Akka-based receiver with messages defined in uploaded jar Sorry for the delay with answer – was on vacation. As I said I was using modified version of launcher from

Re: [Streaming] Akka-based receiver with messages defined in uploaded jar

2014-09-02 Thread Tathagata Das
magic at either Spark or application code? *From:* Tathagata Das [mailto:tathagata.das1...@gmail.com] *Sent:* Friday, August 29, 2014 7:21 PM *To:* Anton Brazhnyk *Cc:* user@spark.apache.org *Subject:* Re: [Streaming] Akka-based receiver with messages defined in uploaded jar Can you

Re: Multi-tenancy for Spark (Streaming) Applications

2014-09-03 Thread Tathagata Das
In the current state of Spark Streaming, creating separate Java processes each having a streaming context is probably the best approach to dynamically adding and removing input sources. All of these should be able to use a YARN cluster for resource allocation. On Wed, Sep 3, 2014 at 6:30

Re: Spark Streaming into HBase

2014-09-03 Thread Tathagata Das
This is some issue with how Scala computes closures. Here, because of the function blah, it is trying to serialize the whole function that this code is part of. Can you define the function blah outside the main function? In fact, you can try putting the function in a serializable object. object
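
A sketch of that suggestion: move the helper into a standalone object (the Blaher name appears later in the thread) so the closure no longer drags in the enclosing, non-serializable class. A top-level object is referenced statically, so it does not itself need to be serialized. The body of blah is illustrative.

```scala
import org.apache.spark.streaming.dstream.DStream

object Blaher {
  def blah(record: (String, String)): String = s"${record._1} -> ${record._2}"
}

def wire(stream: DStream[(String, String)]): Unit = {
  stream.map(Blaher.blah).print()   // only Blaher is referenced, not the outer class
}
```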

Re: RDDs

2014-09-04 Thread Tathagata Das
Yes, Raymond is right. You can always run two jobs on the same cached RDD, and they can run in parallel (assuming you launch the 2 jobs from two different threads). However, with one copy of each RDD partition, the tasks of the two jobs will experience some slot contention. So if you replicate it, you
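
A sketch of running two jobs concurrently on one cached RDD by launching them from separate threads (here via Futures); the replication level and the jobs themselves are illustrative.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

def twoJobs(sc: SparkContext): Unit = {
  val data = sc.parallelize(1 to 1000000)
    .persist(StorageLevel.MEMORY_ONLY_2)   // 2 replicas to reduce slot contention
  data.count()                             // materialize the cache first

  val sum = Future { data.reduce(_ + _) }        // job 1, thread 1
  val max = Future { data.reduce(math.max) }     // job 2, thread 2
  println(s"sum=${Await.result(sum, Duration.Inf)} max=${Await.result(max, Duration.Inf)}")
}
```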

Re: Spark Streaming into HBase

2014-09-05 Thread Tathagata Das
needs to be Serializable, but the Blaher object doesn't. On Wed, Sep 3, 2014 at 7:59 PM, Tathagata Das [via Apache Spark User List] wrote: This is some issue with how Scala computes closures. Here because of the function blah

Re: [Spark Streaming] Tracking/solving 'block input not found'

2014-09-05 Thread Tathagata Das
Hey Gerard, Spark Streaming should just queue the processing and not delete the block data. There are reports of this error and I am still unable to reproduce the problem. One workaround you can try is the configuration spark.streaming.unpersist = false. This stops Spark Streaming from cleaning up

Re: spark-streaming-kafka with broadcast variable

2014-09-05 Thread Tathagata Das
I am not sure if there is a good, clean way to do that - broadcast variables are not designed to be used outside Spark job closures. You could try a bit of hacky stuff where you write the serialized variable to a file in HDFS / NFS / a distributed file system, and then use a custom decoder class

Re: Shared variable in Spark Streaming

2014-09-05 Thread Tathagata Das
' work for DStream? I think similar construct won't work for RDD, that's why there is accumulator. On Fri, Aug 8, 2014 at 12:52 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Do you mean that you want a continuously updated count as more events/records are received in the DStream

Re: Out of memory with Spark Streaming

2014-09-11 Thread Tathagata Das
Which version of Spark are you running? If you are running the latest one, then could you try running not a window but a simple event count on every 2-second batch, and see if you are still running out of memory? TD On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com

Re: Spark streaming stops computing while the receiver keeps running without any errors reported

2014-09-11 Thread Tathagata Das
This is very puzzling, given that this works in the local mode. Does running the kinesis example work with your spark-submit? https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala The instructions are present

Re: Spark Streaming compilation error: algebird not a member of package com.twitter

2014-09-21 Thread Tathagata Das
There is no artifact called spark-streaming-algebird. To use algebird, you will have to add the following dependency (in Maven format): <dependency> <groupId>com.twitter</groupId> <artifactId>algebird-core_${scala.binary.version}</artifactId> <version>0.1.11</version> </dependency> This is
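
If you build with sbt rather than Maven, the equivalent line would look roughly like this (version and Scala binary suffix as in the Maven snippet above; adjust to your build):

```scala
libraryDependencies += "com.twitter" %% "algebird-core" % "0.1.11"
```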

Re: How to initialize updateStateByKey operation

2014-09-23 Thread Tathagata Das
At a high level, the suggestion sounds good to me. However, regarding code, it's best to submit a pull request on the Spark GitHub page for community review. You will find more information here. https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark On Tue, Sep 23, 2014 at 10:11 PM,

Re: NullPointerException on reading checkpoint files

2014-09-23 Thread Tathagata Das
This is actually very tricky, as there are two pretty big challenges that need to be solved. (i) Checkpointing for broadcast variables: Unlike RDDs, broadcast variables don't have checkpointing support (that is, you cannot write the content of a broadcast variable to HDFS and recover it automatically

Re: RDD data checkpoint cleaning

2014-09-23 Thread Tathagata Das
I am not sure what you mean by the data checkpoint continuously increasing, leading to the recovery process taking time. Do you mean that in HDFS you are seeing RDD checkpoint files being continuously written but never being deleted? On Tue, Sep 23, 2014 at 2:40 AM, RodrigoB rodrigo.boav...@aspect.com

Re: Multiple exceptions in Spark Streaming

2014-09-30 Thread Tathagata Das
Are these the logs of the worker where the failure occurs? I think issues similar to these have since been solved in later versions of Spark. TD On Tue, Sep 30, 2014 at 11:33 AM, Shaikh Riyaz shaikh@gmail.com wrote: Dear All, We are using Spark streaming version 1.0.0 in our Cloudera Hadoop

Re: Multiple exceptions in Spark Streaming

2014-09-30 Thread Tathagata Das
] - Your support will be highly appreciated. Regards, Riyaz On Wed, Oct 1, 2014 at 1:16 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Is this the logs of the worker where the failure occurs? I think issues similar to these have since been

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-23 Thread Tathagata Das
Hey Gerard, This is a very good question! *TL;DR: *The performance should be the same, except in the case of shuffle-based operations where the number of reducers is not explicitly specified. Let me answer in more detail by dividing the set of DStream operations into three categories. *1. Map-like

Re: Spark Streaming Applications

2014-10-23 Thread Tathagata Das
Cc'ing Helena for more information on this. TD On Thu, Oct 23, 2014 at 6:30 AM, Saiph Kappa saiph.ka...@gmail.com wrote: What is the application about? I couldn't find any proper description regarding the purpose of killrweather ( I mean, other than just integrating Spark with Cassandra). Do

Re: About Memory usage in the Spark UI

2014-10-23 Thread Tathagata Das
The memory usage of blocks of data received through Spark Streaming is not reflected in the Spark UI. It only shows the memory usage due to cached RDDs. I didn't find a JIRA for this, so I opened a new one. https://issues.apache.org/jira/browse/SPARK-4072 TD On Thu, Oct 23, 2014 at 12:47 AM,

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread Tathagata Das
, Tathagata Das tathagata.das1...@gmail.com wrote: Hey Gerard, This is a very good question! *TL;DR: *The performance should be same, except in case of shuffle-based operations where the number of reducers is not explicitly specified. Let me answer in more detail by dividing the set of DStream

Re: MEMORY_ONLY_SER question

2014-11-04 Thread Tathagata Das
It is deserialized in a streaming manner as the iterator moves over the partition. This is a functionality of core Spark, and Spark Streaming just uses it as is. What do you want to customize it to? On Tue, Nov 4, 2014 at 9:22 AM, Mohit Jaggi mohitja...@gmail.com wrote: Folks, If I have an RDD

Re: Streaming window operations not producing output

2014-11-04 Thread Tathagata Das
Didn't you get any errors in the log4j logs, saying that you have to enable checkpointing? TD On Tue, Nov 4, 2014 at 7:20 AM, diogo di...@uken.com wrote: So, to answer my own n00b question, if case anyone ever needs it. You have to enable checkpointing (by ssc.checkpoint(hdfsPath)). Windowed

Re: Store DStreams into Hive using Hive Streaming

2014-11-06 Thread Tathagata Das
Ted, any pointers? On Thu, Nov 6, 2014 at 4:46 PM, Luiz Geovani Vier lgv...@gmail.com wrote: Hello, Is there a built-in way or connector to store DStream results into an existing Hive ORC table using the Hive/HCatalog Streaming API? Otherwise, do you have any suggestions regarding the

Re: Any patterns for multiplexing the streaming data

2014-11-07 Thread Tathagata Das
I am not aware of any obvious existing pattern that does exactly this. Generally, this sort of computation (subsetting, denormalization) sounds generic but actually has very specific requirements, so it is hard to refer to a design pattern without more requirement info. If you

Re: JavaKafkaWordCount not working under Spark Streaming

2014-11-10 Thread Tathagata Das
What is the Spark master that you are using? Use local[4], not local, if you are running locally. On Mon, Nov 10, 2014 at 3:01 PM, Something Something mailinglist...@gmail.com wrote: I am embarrassed to admit but I can't get a basic 'word count' to work under Kafka/Spark streaming. My code

Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Tathagata Das
You could get all the tweets in the stream, and then apply a filter transformation on the DStream of tweets to filter away non-English tweets. The tweets in the DStream are of type twitter4j.Status, which has a field describing the language. You can use that in the filter. Though in practice, a lot
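
A sketch of that filter; the exact accessor depends on your twitter4j version (getLang in newer releases, getIsoLanguageCode in some older ones), so adjust accordingly.

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.twitter.TwitterUtils
import twitter4j.Status

def englishTweets(ssc: StreamingContext): DStream[Status] =
  TwitterUtils.createStream(ssc, None)          // None = credentials via system properties
    .filter(status => status.getLang == "en")   // keep only tweets marked as English
```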

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Tathagata Das
On Wed, Nov 26, 2014 at 4:02 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Let me further clarify Lalit's point on when RDDs generated by DStreams are destroyed, and hopefully that will answer your original questions. 1. How spark (streaming) guarantees that all the actions

Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Tathagata Das
some solutions for this. Thanks! Bill On Wed, Nov 26, 2014 at 5:35 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Can you elaborate on the usage pattern that lead to cannot compute split ? Are you using the RDDs generated by DStream, outside the DStream logic? Something like

Re: Trying to understand a basic difference between these two configurations

2014-12-05 Thread Tathagata Das
That depends! See inline. I am assuming that when you said replacing local disk with HDFS in case 1, you are connected to a separate HDFS cluster (like case 1) with a single 10G link. Also assuming that all nodes (1 in case 1, and 6 in case 2) are the worker nodes, and the Spark application

Re: Key not valid / already cancelled using Spark Streaming

2014-12-11 Thread Tathagata Das
Following Gerard's thoughts, here are possible things that could be happening. 1. Is there another process in the background that is deleting files in the directory where you are trying to write? Seems like the temporary file generated by one of the tasks is getting deleted before it is renamed to

Re: Spark steaming : work with collect() but not without collect()

2014-12-11 Thread Tathagata Das
What does process do? Maybe when this process function is being run in the Spark executor, it is causing some static initialization, which fails, causing this exception. Per the Oracle documentation, an ExceptionInInitializerError is thrown to indicate that an exception occurred during evaluation

Re: Session for connections?

2014-12-11 Thread Tathagata Das
You could create a lazily initialized singleton factory and connection pool. Whenever an executor starts running the first task that needs to push out data, it will create the connection pool as a singleton. And subsequent tasks running on the executor are going to use the connection pool. You will
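
A sketch of that per-executor singleton, assuming a JDBC sink; a Scala object is instantiated once per JVM (i.e. once per executor), so the connection is created by the first task that needs it and reused afterwards. The JDBC URL is a placeholder, and a real pool would manage several connections rather than one.

```scala
import java.sql.{Connection, DriverManager}
import org.apache.spark.streaming.dstream.DStream

object ConnectionPool {
  lazy val connection: Connection =
    DriverManager.getConnection("jdbc:postgresql://db-host:5432/mydb", "user", "password")
}

def pushOut(records: DStream[String]): Unit = {
  records.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val conn = ConnectionPool.connection    // created lazily, on the executor
      partition.foreach { record =>
        // write record using conn ...
      }
    }
  }
}
```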

Re: KafkaUtils explicit acks

2014-12-11 Thread Tathagata Das
I am updating the docs right now. Here is a staged copy that you can have a sneak peek of. This will be part of Spark 1.2. http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html The updated fault-tolerance section tries to simplify the explanation of when and what data

Re: Session for connections?

2014-12-11 Thread Tathagata Das
Also, this is covered in the streaming programming guide in bits and pieces. http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On Thu, Dec 11, 2014 at 4:55 AM, Ashic Mahtab as...@live.com wrote: That makes sense. I'll try that. Thanks :)

Re: Locking for shared RDDs

2014-12-11 Thread Tathagata Das
Aditya, I think you have the mental model of Spark Streaming a little off the mark. Unlike traditional streaming systems, where any kind of state is mutable, Spark Streaming is designed on Spark's immutable RDDs. Streaming data is received and divided into immutable blocks, which then form immutable RDDs,

Re: Is there an efficient way to append new data to a registered Spark SQL Table?

2014-12-11 Thread Tathagata Das
First of all, how long do you want to keep doing this? The data is going to increase infinitely, and without any bounds it's going to get too big for any cluster to handle. If all that is within bounds, then try the following. - Maintain a global variable holding the current RDD storing all the log
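
A rough sketch of that pattern, written against the Spark 1.3+ DataFrame API for brevity (the thread itself predates it): keep unioning new batches into a driver-side RDD variable and re-register it as a table. The case class, table name, and storage level are assumptions, and unbounded growth remains a real risk, as noted above.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream

case class LogLine(line: String)

def accumulateLogs(stream: DStream[String], sqlContext: SQLContext): Unit = {
  import sqlContext.implicits._
  var allLogs: RDD[LogLine] = sqlContext.sparkContext.emptyRDD[LogLine]
  stream.foreachRDD { rdd =>
    // Union the new batch into the running RDD and refresh the registered table.
    allLogs = allLogs.union(rdd.map(LogLine(_))).persist(StorageLevel.MEMORY_AND_DISK)
    allLogs.toDF().registerTempTable("logs")
  }
}
```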

Re: Specifying number of executors in Mesos

2014-12-11 Thread Tathagata Das
Not that I am aware of. Spark will try to spread the tasks evenly across executors; it's not aware of the workers at all. So if the executor-to-worker allocation is uneven, I am not sure what can be done. Maybe others have some ideas. On Tue, Dec 9, 2014 at 6:20 AM, Gerard Maas

Re: Error: Spark-streaming to Cassandra

2014-12-11 Thread Tathagata Das
These seem to be compilation errors. The second one seems to be that you are using CassandraJavaUtil.javafunctions wrong. Look at the documentation and set the parameter list correctly. TD On Mon, Dec 8, 2014 at 9:47 AM, m.sar...@accenture.com wrote: Hi, I am intending to save the streaming

Re: Key not valid / already cancelled using Spark Streaming

2014-12-11 Thread Tathagata Das
www.chaordic.com.br +55 48 3232.3200 On Thu, Dec 11, 2014 at 10:03 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Following Gerard's thoughts, here are possible things that could be happening. 1. Is there another process in the background that is deleting files in the directory

Re: Spark Streaming in Production

2014-12-11 Thread Tathagata Das
Spark Streaming takes care of restarting receivers if they fail. Regarding the fault-tolerance properties and deployment options, we made some improvements in the upcoming Spark 1.2. Here is a staged version of the Spark Streaming programming guide that you can read for the up-to-date explanation

Re: Read data from SparkStreaming from Java socket.

2014-12-12 Thread Tathagata Das
Yes, socketTextStream starts a TCP client that tries to connect to a TCP server (localhost: in your case). If there is a server running on that port that can send data to connected TCP connections, then you will receive data in the stream. Did you check out the quick example in the streaming
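
A minimal sketch of the client/server pairing: socketTextStream is the client, so something must be listening on the port. For a quick local test you can run `nc -lk 9999` as the server and type lines into it; the host, port, and app name here are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketQuickStart {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("SocketQuickStart").setMaster("local[2]"), Seconds(1))
    ssc.socketTextStream("localhost", 9999).print()   // print whatever the server sends
    ssc.start()
    ssc.awaitTermination()
  }
}
```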

Re: Help with updateStateByKey

2014-12-18 Thread Tathagata Das
Another place to start playing with updateStateByKey is the example StatefulNetworkWordCount. See the streaming examples directory in the Spark repository. TD On Thu, Dec 18, 2014 at 6:07 AM, Pierce Lamb richard.pierce.l...@gmail.com wrote: I am trying to run stateful Spark Streaming
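
A sketch of the updateStateByKey pattern used in that example: keep a running count per word across batches (checkpointing must be enabled via ssc.checkpoint(...)).

```scala
import org.apache.spark.streaming.dstream.DStream

def runningWordCounts(words: DStream[String]): DStream[(String, Long)] = {
  // newValues: counts arriving in this batch; state: the running total so far.
  val updateFunc = (newValues: Seq[Int], state: Option[Long]) =>
    Some(state.getOrElse(0L) + newValues.sum)
  words.map(word => (word, 1)).updateStateByKey[Long](updateFunc)
}
```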

Re: Spark Streaming Python APIs?

2014-12-18 Thread Tathagata Das
A more updated version of the streaming programming guide is here http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html Please refer to this until we make the official release of Spark 1.2 TD On Tue, Dec 16, 2014 at 3:50 PM, smallmonkey...@hotmail.com

Re: Spark Streaming: HiveContext within Custom Actor

2014-12-30 Thread Tathagata Das
I am not sure that can be done. Receivers are designed to be run only on the executors/workers, whereas a SQLContext (for using Spark SQL) can only be defined on the driver. On Mon, Dec 29, 2014 at 6:45 PM, sranga sra...@gmail.com wrote: Hi Could Spark-SQL be used from within a custom actor

Re: word count aggregation

2014-12-30 Thread Tathagata Das
For windows that large (1 hour), you will probably also have to increase the batch interval for efficiency. TD On Mon, Dec 29, 2014 at 12:16 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You can use reduceByKeyAndWindow for that. Here's a pretty clean example

Re: Kafka + Spark streaming

2014-12-30 Thread Tathagata Das
1. Of course, a single block / partition has many Kafka messages, and from different Kafka topics interleaved together. The message count is not related to the block count. Any message received within a particular block interval will go in the same block. 2. Yes, the receiver will be started on

Re: SPARK-streaming app running 10x slower on YARN vs STANDALONE cluster

2014-12-30 Thread Tathagata Das
That is kind of expected due to data locality. Though you should see some tasks running on the executors as the data gets replicated to other nodes and can therefore run tasks based on locality. You have two solutions: 1. kafkaStream.repartition() to explicitly repartition the received data

Re: Gradual slow down of the Streaming job (getCallSite at DStream.scala:294)

2014-12-30 Thread Tathagata Das
Which version of Spark Streaming are you using? When the batch processing time increases to 15-20 seconds, could you compare the task times to the task times when the application is just launched? Basically, is the increase from 6 seconds to 15-20 seconds caused by an increase in

Re: Big performance difference between client and cluster deployment mode; is this expected?

2014-12-31 Thread Tathagata Das
What are your spark-submit commands in both cases? Is it Spark Standalone or YARN (both support client and cluster)? Accordingly, what is the number of executors/cores requested? TD On Wed, Dec 31, 2014 at 10:36 AM, Enno Shioji eshi...@gmail.com wrote: Also the job was deployed from the master

Re: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Tathagata Das
This is not normal. It's a huge scheduling delay!! Can you tell me more about the application? - cluster setup, number of receivers, what's the computation, etc. On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote: Hate to do this...but...erm...bump? Would really appreciate input

Re: How to make spark partition sticky, i.e. stay with node?

2015-01-23 Thread Tathagata Das
Hello mingyu, That is a reasonable way of doing this. Spark Streaming natively does not support sticky partitioning because Spark launches tasks based on data locality. If there is no locality (for example, reduce tasks can run anywhere), the location is randomly assigned. So the cogroup or join introduces a locality

Re: how to send JavaDStream RDD using foreachRDD using Java

2015-02-02 Thread Tathagata Das
Hello Sachin, While Akhil's solution is correct, this is not sufficient for your use case. RDD.foreach (that Akhil is using) will run on the workers, but you are creating the Producer object on the driver. This will not work; a producer created on the driver cannot be used from the worker/executor.
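
A sketch of the worker-side pattern implied here: create the producer inside foreachPartition so it is instantiated on the executor rather than shipped from the driver. Shown with the newer Kafka client API purely for illustration; the broker address and topic are placeholders.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

def publish(stream: DStream[String]): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val props = new Properties()
      props.put("bootstrap.servers", "broker-host:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)   // created on the executor
      partition.foreach(msg => producer.send(new ProducerRecord[String, String]("output-topic", msg)))
      producer.close()
    }
  }
}
```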

Re: Error when Spark streaming consumes from Kafka

2015-02-02 Thread Tathagata Das
This is an issue that is hard to resolve without rearchitecting the whole Kafka Receiver. There are some workarounds worth looking into. http://mail-archives.apache.org/mod_mbox/kafka-users/201312.mbox/%3CCAFbh0Q38qQ0aAg_cj=jzk-kbi8xwf+1m6xlj+fzf6eetj9z...@mail.gmail.com%3E On Mon, Feb 2, 2015

Re: StreamingContext getOrCreate with queueStream

2015-02-05 Thread Tathagata Das
I don't think your screenshots came through in the email. Nonetheless, queueStream will not work with getOrCreate. It's mainly for testing (by generating your own RDDs) and not really useful for production usage (where you really need checkpoint-based recovery). TD On Thu, Feb 5, 2015 at 4:12

Re: spark-local dir running out of space during long ALS run

2015-02-16 Thread Tathagata Das
Correct, brute-force cleanup is not useful. Since Spark 1.0, Spark can do automatic cleanup of files based on which RDDs are used/garbage collected by the JVM. That would be the best way, but it depends on the JVM GC characteristics. If you force a GC periodically in the driver, that might help you get

Re: Concurrent batch processing

2015-02-12 Thread Tathagata Das
So you have come across spark.streaming.concurrentJobs already :) Yeah, that is an undocumented feature that does allow multiple output operations to be submitted in parallel. However, this is not made public for the exact reasons that you realized - the semantics in case of stateful operations is

Re: How to broadcast a variable read from a file in yarn-cluster mode?

2015-02-12 Thread Tathagata Das
Can you give me the whole logs? TD On Tue, Feb 10, 2015 at 10:48 AM, Jon Gregg jonrgr...@gmail.com wrote: OK that worked and getting close here ... the job ran successfully for a bit and I got output for the first couple buckets before getting a java.lang.Exception: Could not compute split,

Re: Streaming scheduling delay

2015-02-12 Thread Tathagata Das
Thanks for looking into it. On Thu, Feb 12, 2015 at 8:10 PM, Tathagata Das t...@databricks.com wrote: Hey Tim, Let me get the key points. 1. If you are not writing back to Kafka, the delay is stable? That is, instead of foreachRDD { // write to kafka } if you do dstream.count

Re: Spark streaming job throwing ClassNotFound exception when recovering from checkpointing

2015-02-12 Thread Tathagata Das
Could you come up with a minimal example through which I can reproduce the problem? On Tue, Feb 10, 2015 at 12:30 PM, conor fennell.co...@gmail.com wrote: I am getting the following error when I kill the spark driver and restart the job: 15/02/10 17:31:05 INFO CheckpointReader: Attempting to

Re: streaming joining multiple streams

2015-02-12 Thread Tathagata Das
Sorry for the late response. With the amount of data you are planning to join, any system would take time. However, between Hive's MapReduce joins, Spark's basic shuffle, and Spark SQL's join, the latter wins hands down. Furthermore, with the APIs of Spark and Spark Streaming, you will have to do

Re: Streaming scheduling delay

2015-02-12 Thread Tathagata Das
Hey Tim, Let me get the key points. 1. If you are not writing back to Kafka, the delay is stable? That is, instead of foreachRDD { // write to kafka } if you do dstream.count, then the delay is stable. Right? 2. If so, then Kafka is the bottleneck. Is the number of partitions, that you spoke of

Re: In a Spark Streaming application, what might be the potential causes for util.AkkaUtils: Error sending message in 1 attempts and java.util.concurrent.TimeoutException: Futures timed out and

2015-02-19 Thread Tathagata Das
What version of Spark are you using? TD On Thu, Feb 19, 2015 at 2:45 AM, Emre Sevinc emre.sev...@gmail.com wrote: Hello, We have a Spark Streaming application that watches an input directory, and as files are copied there the application reads them and sends the contents to a RESTful web

Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

2015-01-28 Thread Tathagata Das
Ohhh nice! Would be great if you can share some code with us soon. It is indeed a very complicated problem and there is probably no single solution that fits all use cases. So having one way of doing things would be a great reference. Looking forward to that! On Wed, Jan 28, 2015 at 4:52 PM, Tobias

Re: Error reporting/collecting for users

2015-01-28 Thread Tathagata Das
You could use foreachRDD to do the operations and then inside the foreach create an accumulator to gather all the errors together dstream.foreachRDD { rdd => val accumulator = new Accumulator[] rdd.map { . }.count // whatever operation is error-prone // gather all errors
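
A fleshed-out sketch of that idea, counting failures with an accumulator while the real (error-prone) operation runs, then reporting the total per batch. Uses the sc.accumulator API of the Spark 1.x era; the parsing step is illustrative.

```scala
import org.apache.spark.streaming.dstream.DStream

def countParseErrors(lines: DStream[String]): Unit = {
  lines.foreachRDD { rdd =>
    val errors = rdd.sparkContext.accumulator(0L)
    val parsed = rdd.map { line =>
      try Some(line.toInt)
      catch { case _: NumberFormatException => errors += 1L; None }
    }
    parsed.count()                              // action triggers the job
    println(s"Errors in this batch: ${errors.value}")
  }
}
```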

Re: Build error

2015-01-30 Thread Tathagata Das
That is a known issue uncovered last week. It fails on certain environments, not on Jenkins, which is our testing environment. There is already a PR up to fix it. For now you can build using mvn package -DskipTests TD On Fri, Jan 30, 2015 at 8:59 PM, Andrew Musselman andrew.mussel...@gmail.com

Re: NaiveBayes classifier causes ShuffleDependency class cast exception

2015-02-13 Thread Tathagata Das
You cannot have two SparkContexts active in the same JVM at the same time. Just create one SparkContext and then use it for both purposes. TD On Fri, Feb 6, 2015 at 8:49 PM, VISHNU SUBRAMANIAN johnfedrickena...@gmail.com wrote: Can you try creating just a single spark context and then try

Re: Interact with streams in a non-blocking way

2015-02-13 Thread Tathagata Das
Here is an example of how you can do it. Let's say myDStream contains the data that you may want to asynchronously query, say using Spark SQL. val sqlContext = new SQLContext(streamingContext.sparkContext) myDStream.foreachRDD { rdd => // rdd is an RDD of case class
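
A fleshed-out sketch of that pattern, written against the Spark 1.3+ DataFrame API (the original thread predates it); the case class, table name, and query are illustrative.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

case class Event(user: String, value: Int)

def registerForAdHocQueries(myDStream: DStream[Event], ssc: StreamingContext): Unit = {
  val sqlContext = new SQLContext(ssc.sparkContext)
  import sqlContext.implicits._
  myDStream.foreachRDD { rdd =>
    // Refresh a temp table every batch; other threads can query it asynchronously.
    rdd.toDF().registerTempTable("events")
  }
  // Elsewhere (e.g. another thread):
  // sqlContext.sql("SELECT user, SUM(value) FROM events GROUP BY user").show()
}
```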

Re: why generateJob is a private API?

2015-03-16 Thread Tathagata Das
It was not really meant to be public and overridden. Because anything you want to do to generate jobs from RDDs can be done using DStream.foreachRDD On Sun, Mar 15, 2015 at 11:14 PM, madhu phatak phatak@gmail.com wrote: Hi, I am trying to create a simple subclass of DStream. If I

Re: MappedStream vs Transform API

2015-03-16 Thread Tathagata Das
It's mostly for legacy reasons. First we added all the MappedDStream, etc., and then later we realized we needed to expose something more generic for arbitrary RDD-to-RDD transformations. It can be easily replaced. However, there is a slight value in having MappedDStream, for developers to

Re: MappedStream vs Transform API

2015-03-17 Thread Tathagata Das
etc so they are consistent with other API's?. Regards, Madhukara Phatak http://datamantra.io/ On Mon, Mar 16, 2015 at 11:42 PM, Tathagata Das t...@databricks.com wrote: It's mostly for legacy reasons. First we had added all the MappedDStream, etc. and then later we realized we need to expose

Re: problems with spark-streaming-kinesis-asl and sbt assembly (different file contents found)

2015-03-16 Thread Tathagata Das
If you are creating an assembly, make sure spark-streaming is marked as provided. spark-streaming is already part of the Spark installation, so it will be present at run time. That might solve some of these, maybe!? TD On Mon, Mar 16, 2015 at 11:30 AM, Kelly, Jonathan jonat...@amazon.com wrote:
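
A sketch of the sbt settings implied here: mark the Spark artifacts that ship with the cluster as "provided" so they stay out of the assembly jar, while bundling the Kinesis connector, which is not part of the standard installation. The versions are placeholders.

```scala
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming"             % "1.3.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"
)
```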

Re: problems with spark-streaming-kinesis-asl and sbt assembly (different file contents found)

2015-03-16 Thread Tathagata Das
-kinesis-asl.) Jonathan Kelly Elastic MapReduce - SDE Port 99 (SEA35) 08.220.C2 From: Tathagata Das t...@databricks.com Date: Monday, March 16, 2015 at 12:45 PM To: Jonathan Kelly jonat...@amazon.com Cc: user@spark.apache.org user@spark.apache.org Subject: Re: problems with spark
