Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Tathagata Das
(streamTuple._1, SaveMode.Append) } } On Tue, Jul 28, 2015 at 3:42 PM, Tathagata Das t...@databricks.com wrote: I dont think any one has really run 500 text streams. And parSequences do nothing out there, you are only parallelizing the setup code which does not really compute anything. Also

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Tathagata Das
I dont think any one has really run 500 text streams. And parSequences do nothing out there, you are only parallelizing the setup code which does not really compute anything. Also it setsup 500 foreachRDD operations that will get executed in each batch sequentially, so does not make sense. The

Re: Spark Streaming Json file groupby function

2015-07-28 Thread Tathagata Das
If you are trying to keep such long term state, it will be more robust in the long term to use a dedicated data store (cassandra/HBase/etc.) that is designed for long term storage. On Tue, Jul 28, 2015 at 4:37 PM, swetha swethakasire...@gmail.com wrote: Hi TD, We have a requirement to

Re: broadcast variable and accumulators issue while spark streaming checkpoint recovery

2015-07-29 Thread Tathagata Das
Rather than using accumulator directly, what you can do is something like this to lazily create an accumulator and use it (will get lazily recreated if driver restarts from checkpoint) dstream.transform { rdd = val accum = SingletonObject.getOrCreateAccumulator() // single object method to

Re: restart from last successful stage

2015-07-29 Thread Tathagata Das
. sc.saveAs) and then run a modified version of the job that skips Stage 0, assuming you have a full understanding of the breakdown of stages in your job. On Tue, Jul 28, 2015 at 9:28 PM, Tathagata Das t...@databricks.com wrote: Okay, may I am confused on the word would be useful to *restart

Re: Spark Streaming Kafka could not find leader offset for Set()

2015-07-29 Thread Tathagata Das
There is a known issue that Kafka cannot return leader if there is not data in the topic. I think it was raised in another thread in this forum. Is that the issue? On Wed, Jul 29, 2015 at 10:38 AM, unk1102 umesh.ka...@gmail.com wrote: Hi I have Spark Streaming code which streams from Kafka

Re: Problems with JobScheduler

2015-07-30 Thread Tathagata Das
Yes, and that is indeed the problem. It is trying to process all the data in Kafka, and therefore taking 60 seconds. You need to set the rate limits for that. On Thu, Jul 30, 2015 at 8:51 AM, Cody Koeninger c...@koeninger.org wrote: If you don't set it, there is no maximum rate, it will get

Re: Does Spark Streaming need to list all the files in a directory?

2015-07-30 Thread Tathagata Das
For the first time it needs to list them. AFter that the list should be cached by the file stream implementation (as far as I remember). On Thu, Jul 30, 2015 at 3:55 PM, Brandon White bwwintheho...@gmail.com wrote: Is this a known bottle neck for Spark Streaming textFileStream? Does it need

Re: How RDD lineage works

2015-07-30 Thread Tathagata Das
You have to read the original Spark paper to understand how RDD lineage works. https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf On Thu, Jul 30, 2015 at 9:25 PM, Ted Yu yuzhih...@gmail.com wrote: Please take a look at: core/src/test/scala/org/apache/spark/CheckpointSuite.scala

Re: Heatmap with Spark Streaming

2015-07-30 Thread Tathagata Das
I do suggest that the non-spark related discussions be taken to a different this forum as it does not directly contribute to the contents of this user list. On Thu, Jul 30, 2015 at 8:52 PM, UMESH CHAUDHARY umesh9...@gmail.com wrote: Thanks for the valuable suggestion. I also started with

Re: Getting the number of slaves

2015-07-30 Thread Tathagata Das
To clarify, that is the number of executors requested by the SparkContext from the cluster manager. On Tue, Jul 28, 2015 at 5:18 PM, amkcom amk...@gmail.com wrote: try sc.getConf.getInt(spark.executor.instances, 1) -- View this message in context:

Re: Re: How RDD lineage works

2015-07-30 Thread Tathagata Das
. -- bit1...@163.com *From:* bit1...@163.com *Date:* 2015-07-31 13:11 *To:* Tathagata Das tathagata.das1...@gmail.com; yuzhihong yuzhih...@gmail.com *CC:* user user@spark.apache.org *Subject:* Re: Re: How RDD lineage works Thanks TD and Zhihong for the guide

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-31 Thread Tathagata Das
: Tathagata, Could the bottleneck possibility be the number of executor nodes in our cluster? Since we are creating 500 Dstreams based off 500 textfile directories, do we need at least 500 executors / nodes to be receivers for each one of the streams? On Tue, Jul 28, 2015 at 6:09 PM, Tathagata Das t

Re: Writing streaming data to cassandra creates duplicates

2015-07-28 Thread Tathagata Das
You have to partition that data on the Spark Streaming by the primary key, and then make sure insert data into Cassandra atomically per key, or per set of keys in the partition. You can use the combination of the (batch time, and partition Id) of the RDD inside foreachRDD as the unique id for the

Re: restart from last successful stage

2015-07-28 Thread Tathagata Das
If you are using the same RDDs in the both the attempts to run the job, the previous stage outputs generated in the previous job will indeed be reused. This applies to core though. For dataframes, depending on what you do, the physical plan may get generated again leading to new RDDs which may

Re: Checkpoint file not found

2015-08-03 Thread Tathagata Das
Can you tell us more about streaming app? DStream operation that you are using? On Sun, Aug 2, 2015 at 9:14 PM, Anand Nalya anand.na...@gmail.com wrote: Hi, I'm writing a Streaming application in Spark 1.3. After running for some time, I'm getting following execption. I'm sure, that no other

Re: Graceful shutdown for Spark Streaming

2015-07-29 Thread Tathagata Das
StreamingContext.stop(stopGracefully = true) stops the streaming context gracefully. Then you can safely terminate the Spark cluster. They are two different steps and needs to be done separately ensuring that the driver process has been completely terminated before the Spark cluster is the

Re: broadcast variable and accumulators issue while spark streaming checkpoint recovery

2015-07-29 Thread Tathagata Das
previous batch interval's offsets.. On Thu, Jul 30, 2015 at 2:49 AM, Tathagata Das t...@databricks.com wrote: Rather than using accumulator directly, what you can do is something like this to lazily create an accumulator and use it (will get lazily recreated if driver restarts from

Re: Graceful shutdown for Spark Streaming

2015-07-30 Thread Tathagata Das
in logic , in my case sleep(t.s.) is not working .) So i used to kill coarseGrained job at each slave by script .Please suggest something . On Thu, Jul 30, 2015 at 5:14 AM, Tathagata Das t...@databricks.com wrote: StreamingContext.stop(stopGracefully = true) stops the streaming context

Re: Streaming of WordCount example

2015-08-10 Thread Tathagata Das
If you are running on a cluster, the listening is occurring on one of the executors, not in the driver. On Mon, Aug 10, 2015 at 10:29 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am trying to run this program as a yarn-client. The job seems to be submitting successfully however I don't

Re: Streaming of WordCount example

2015-08-10 Thread Tathagata Das
: I am running as a yarn-client which probably means that the program that submitted the job is where the listening is also occurring? I thought that the yarn is only used to negotiate resources in yarn-client master mode. On Mon, Aug 10, 2015 at 11:34 AM, Tathagata Das t...@databricks.com wrote

Re: ERROR ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver

2015-08-10 Thread Tathagata Das
I think this may be expected. When the streaming context is stopped without the SparkContext, then the receivers are stopped by the driver. The receiver sends back the message that it has been stopped. This is being (probably incorrectly) logged with ERROR level. On Sun, Aug 9, 2015 at 12:52 AM,

Re: Streaming of WordCount example

2015-08-10 Thread Tathagata Das
on or not. On Mon, Aug 10, 2015 at 12:43 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I've verified all the executors and I don't see a process listening on the port. However, the application seem to show as running in the yarn UI On Mon, Aug 10, 2015 at 11:56 AM, Tathagata Das t

Re: Streaming of WordCount example

2015-08-10 Thread Tathagata Das
-with-dependencies.jar localhost /user/ec2-user/checkpoint/ /user/ec2-user/out Even though I am running as local I see it being scheduled and managed by yarn. On Mon, Aug 10, 2015 at 12:56 PM, Tathagata Das t...@databricks.com wrote: Is it receiving any data? If so, then it must

Re: stopping spark stream app

2015-08-10 Thread Tathagata Das
In general, it is a little risky to put long running stuff in a shutdown hook as it may delay shutdown of the process which may delay other things. That said, you could try it out. A better way to explicitly shutdown gracefully is to use an RPC to signal the driver process to start shutting down,

Re: Graceful shutdown for Spark Streaming

2015-08-10 Thread Tathagata Das
July 2015 at 08:19, anshu shukla anshushuk...@gmail.com wrote: Yes I was doing same , if You mean that this is the correct way to do Then I will verify it once more in my case . On Thu, Jul 30, 2015 at 1:02 PM, Tathagata Das t...@databricks.com wrote: How is sleep not working? Are you

Re: stopping spark stream app

2015-08-10 Thread Tathagata Das
at end of each batch and if set then gracefully stop the application ? On Tue, Aug 11, 2015 at 12:43 AM, Tathagata Das t...@databricks.com wrote: In general, it is a little risky to put long running stuff in a shutdown hook as it may delay shutdown of the process which may delay other things

Re: Spark Kinesis Checkpointing/Processing Delay

2015-08-10 Thread Tathagata Das
You are correct. The earlier Kinesis receiver (as of Spark 1.4) was not saving checkpoints correctly and was in general not reliable (even with WAL enabled). We have improved this in Spark 1.5 with updated Kinesis receiver, that keeps track of the Kinesis sequence numbers as part of the Spark

Re: QueueStream Does Not Support Checkpointing

2015-08-14 Thread Tathagata Das
A hacky workaround is to create a customer InputDStream that creates the right RDDs based on a function. The TestInputDStream https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala#L61 does something similar for Spark Streaming unit

Re: stopping spark stream app

2015-08-12 Thread Tathagata Das
way to kill the app.after stopping context? On Tue, Aug 11, 2015 at 2:55 AM, Shushant Arora shushantaror...@gmail.com wrote: Thanks! On Tue, Aug 11, 2015 at 1:34 AM, Tathagata Das t...@databricks.com wrote: 1. RPC can be done in many ways, and a web service is one of many ways. A even

Re: Partitioning in spark streaming

2015-08-12 Thread Tathagata Das
Yes. On Wed, Aug 12, 2015 at 12:12 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Thanks! To write to hdfs I do need to use saveAs method? On Wed, Aug 12, 2015 at 12:01 PM, Tathagata Das t...@databricks.com wrote: This is how Spark does. It writes the task output to a uniquely-named

Re: Reliable Streaming Receiver

2015-08-05 Thread Tathagata Das
You could very easily strip out the BlockGenerator code from the Spark source code and use it directly in the same way the Reliable Kafka Receiver uses it. BTW, you should know that we will be deprecating the receiver based approach for the Direct Kafka approach. That is quite flexible, can give

Re: user threads in executors

2015-07-22 Thread Tathagata Das
underhood? If compete batch is not loaded will using custim size of 100-200 request in one batch and post will help instead of whole partition ? On Wed, Jul 22, 2015 at 12:18 AM, Tathagata Das t...@databricks.com wrote: If you can post multiple items at a time, then use foreachPartition

Re: spark streaming 1.3 issues

2015-07-22 Thread Tathagata Das
For Java, do OffsetRange[] offsetRanges = ((HasOffsetRanges)rdd*.rdd()*).offsetRanges(); If you fix that error, you should be seeing data. You can call arbitrary RDD operations on a DStream, using DStream.transform. Take a look at the docs. For the direct kafka approach you are doing, - tasks

Re: spark streaming 1.3 coalesce on kafkadirectstream

2015-07-22 Thread Tathagata Das
With DirectKafkaStream there are two approaches. 1. you increase the number of KAfka partitions Spark will automatically read in parallel 2. if that's not possible, then explicitly repartition only if there are more cores in the cluster than the number of Kafka partitions, AND the first map-like

Re: Convert Simple Kafka Consumer to standalone Spark JavaStream Consumer

2015-07-21 Thread Tathagata Das
From what I understand about your code, it is getting data from different partitions of a topic - get all data from partition 1, then from partition 2, etc. Though you have configured it to read from just one partition (topicCount has count = 1). So I am not sure what your intention is, read all

Re: Does Spark streaming support is there with RabbitMQ

2015-07-22 Thread Tathagata Das
You could contact the authors of the spark-packages.. maybe that will help? On Mon, Jul 20, 2015 at 6:41 AM, Jeetendra Gangele gangele...@gmail.com wrote: Thanks Todd, I m not sure whether somebody has used it or not. can somebody confirm if this integrate nicely with Spark streaming? On

Re: user threads in executors

2015-07-21 Thread Tathagata Das
If you can post multiple items at a time, then use foreachPartition to post the whole partition in a single request. On Tue, Jul 21, 2015 at 9:35 AM, Richard Marscher rmarsc...@localytics.com wrote: You can certainly create threads in a map transformation. We do this to do concurrent DB

Re: spark streaming disk hit

2015-07-21 Thread Tathagata Das
Most shuffle files are really kept around in the OS's buffer/disk cache, so it is still pretty much in memory. If you are concerned about performance, you have to do a holistic comparison for end-to-end performance. You could take a look at this.

Re: How to deal with the spark streaming application while upgrade spark

2015-07-23 Thread Tathagata Das
Currently that is the best way. On Thu, Jul 23, 2015 at 12:51 AM, JoneZhang joyoungzh...@gmail.com wrote: My spark streaming on kafka application is running in spark 1.3. I want upgrade spark to 1.4 now. How to deal with the spark streaming application? Save the kafka topic partition

Re: Master vs. Slave Nodes Clarification

2015-07-14 Thread Tathagata Das
Yep :) On Tue, Jul 14, 2015 at 2:44 PM, algermissen1971 algermissen1...@icloud.com wrote: On 14 Jul 2015, at 23:26, Tathagata Das t...@databricks.com wrote: Just to be clear, you mean the Spark Standalone cluster manager's master and not the applications driver, right. Sorry, by now I

Re: Master vs. Slave Nodes Clarification

2015-07-14 Thread Tathagata Das
Just to be clear, you mean the Spark Standalone cluster manager's master and not the applications driver, right. In that case, the earlier responses are correct. TD On Tue, Jul 14, 2015 at 11:26 AM, Mohammed Guller moham...@glassbeam.com wrote: The master node does not have to be similar to

Re: Spark streaming Processing time keeps increasing

2015-07-16 Thread Tathagata Das
MAke sure you provide the filterFunction with the invertible reduceByKeyAndWindow. Otherwise none of the keys will get removed, and the key space will continue increase. This is what is leading to the lag. So use the filtering function to filter out the keys that are not needed any more. On Thu,

Re: Sessionization using updateStateByKey

2015-07-14 Thread Tathagata Das
[Apologies for repost, for those who have seen this response already in the dev mailing list] 1. When you set ssc.checkpoint(checkpointDir), the spark streaming periodically saves the state RDD (which is a snapshot of all the state data) to HDFS using RDD checkpointing. In fact, a streaming app

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
, Chen Song chen.song...@gmail.com wrote: Thanks TD. As for 1), if timing is not guaranteed, how does exactly once semantics supported? It feels like exactly once receiving is not necessarily exactly once processing. Chen On Tue, Jul 14, 2015 at 10:16 PM, Tathagata Das t...@databricks.com

Re: Stopping StreamingContext before receiver has started

2015-07-14 Thread Tathagata Das
This is a known race condition - root cause of SPARK-5681 https://issues.apache.org/jira/browse/SPARK-5681 On Mon, Jul 13, 2015 at 3:35 AM, Juan Rodríguez Hortalá juan.rodriguez.hort...@gmail.com wrote: Hi, I have noticed that when StreamingContext.stop is called when no receiver has

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
PM, Tathagata Das t...@databricks.com wrote: In the receiver based approach, If the receiver crashes for any reason (receiver crashed or executor crashed) the receiver should get restarted on another executor and should start reading data from the offset present in the zookeeper

Re: Spark Streaming - Inserting into Tables

2015-07-14 Thread Tathagata Das
Why is .remember not ideal? On Sun, Jul 12, 2015 at 7:22 PM, Brandon White bwwintheho...@gmail.com wrote: Hi Yin, Yes there were no new rows. I fixed it by doing a .remember on the context. Obviously, this is not ideal. On Sun, Jul 12, 2015 at 6:31 PM, Yin Huai yh...@databricks.com wrote:

Re: Is IndexedRDD available in Spark 1.4.0?

2015-07-14 Thread Tathagata Das
I do not recommend using IndexRDD for state management in Spark Streaming. What it does not solve out-of-the-box is checkpointing of indexRDDs, which important because long running streaming jobs can lead to infinite chain of RDDs. Spark Streaming solves it for the updateStateByKey operation which

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
High Level API. Regards, Dibyendu On Sat, Jun 27, 2015 at 3:04 PM, Tathagata Das t...@databricks.com wrote: In the receiver based approach, If the receiver crashes for any reason (receiver crashed or executor crashed) the receiver should get restarted on another executor and should

Re: Ordering of Batches in Spark streaming

2015-07-14 Thread Tathagata Das
This has been discussed in a number of threads in this mailing list. Here is a summary. 1. Processing of batch T+1 always starts after all the processing of batch T has completed. But here a batch is defined by data of all the receivers running the in the system receiving within the batch

Re: rest on streaming

2015-07-14 Thread Tathagata Das
You can do this. // global variable to keep track of latest stuff var latestTime = _ var latestRDD = _ dstream.foreachRDD((rdd: RDD[..], time: Time) = { latestTime = time latestRDD = rdd }) Now you can asynchronously access the latest RDD. However if you are going to run jobs on the

Re: NotSerializableException in spark 1.4.0

2015-07-15 Thread Tathagata Das
Your streaming job may have been seemingly running ok, but the DStream checkpointing must have been failing in the background. It would have been visible in the log4j logs. In 1.4.0, we enabled fast-failure for that so that checkpointing failures dont get hidden in the background. The fact that

Re: NotSerializableException in spark 1.4.0

2015-07-15 Thread Tathagata Das
/SPARK-7180 and squashes the following commits: On Wed, Jul 15, 2015 at 2:37 PM, Chen Song chen.song...@gmail.com wrote: Thanks Can you point me to the patch to fix the serialization stack? Maybe I can pull it in and rerun my job. Chen On Wed, Jul 15, 2015 at 4:40 PM, Tathagata Das t

Re: Spark streaming Processing time keeps increasing

2015-07-17 Thread Tathagata Das
to provide sufficient information as part of the value so that you can take that decision in the filter function. As always, thanks for your help Nikunj On Thu, Jul 16, 2015 at 1:16 AM, Tathagata Das t...@databricks.com wrote: MAke sure you provide the filterFunction with the invertible

Re: it seem like the exactly once feature not work on spark1.4

2015-07-17 Thread Tathagata Das
Yes. More information in my talk - https://www.youtube.com/watch?v=d5UJonrruHk On Fri, Jul 17, 2015 at 1:15 AM, JoneZhang joyoungzh...@gmail.com wrote: I see now. There are three steps in SparkStreaming + Kafka date processing 1.Receiving the data 2.Transforming the data 3.Pushing out the

Re: Problems after upgrading to spark 1.4.0

2015-07-13 Thread Tathagata Das
Spark 1.4.0 added shutdown hooks in the driver to cleanly shutdown the Sparkcontext in the driver, which would shutdown the executors. I am not sure whether this is related or not, but somehow the executor's shutdown hook is being called. Can you check the driver logs to see if driver's shutdown

Re: fileStream with old files

2015-07-14 Thread Tathagata Das
It was added, but its not documented publicly. I am planning to change the name of the conf to spark.streaming.fileStream.minRememberDuration to make it easier to understand On Mon, Jul 13, 2015 at 9:43 PM, Terry Hole hujie.ea...@gmail.com wrote: A new configuration named

Re: Performance - Python streaming v/s Scala streaming

2015-08-24 Thread Tathagata Das
The scala version of the Kafka is something that we have been working on for a while, and is likely to be more optimized than the python one. The python one definitely requires pass the data back and forth between JVM and Python VM and decoding the raw bytes to the Python strings (probably less

Re: How to check whether the RDD is empty or not

2015-10-21 Thread Tathagata Das
What do you mean by checking when a "DStream is empty"? DStream represents an endless stream of data, and at point of time checking whether it is empty or not does not make sense. FYI, there is RDD.isEmpty() On Wed, Oct 21, 2015 at 10:03 AM, diplomatic Guru wrote: >

Re: Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread Tathagata Das
If you just want to control the number of reducers, then setting the numPartitions is sufficient. If you want to control how exact partitioning scheme (that is some other scheme other than hash-based) then you need to implement a custom partitioner. It can be used to improve data skews, etc. which

Re: expected Kinesis checkpoint behavior when driver restarts

2015-10-27 Thread Tathagata Das
Your observation is correct! The current implementation of checkpointing to DynamoDB is tied to the presence of new data from Kinesis (I think that emulates the KCL behavior), if there is no data for while, the checkpointing does not occur. That explains your observation. I have filed a JIRA to

Re: Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread Tathagata Das
(Long, String)])] = > { val grpdRecs = rdd.groupByKey(); val srtdRecs = > grpdRecs.mapValues[(List[(Long, String)])](iter => > iter.toList.sortBy(_._1)) srtdRecs } > > rdd.reduceByKey((a, b) => { > (Math.max(a._1, b._1), (a._2 ++ b._2)) > }) > > > > On Tue, Oct 2

Re: Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread Tathagata Das
, String)])](iter => > iter.toList.sortBy(_._1)) srtdRecs } > > > On Tue, Oct 27, 2015 at 6:47 PM, Tathagata Das <t...@databricks.com> > wrote: > >> if you specify the same partitioner (custom or otherwise) for both >> partitionBy and groupBy, then may be it wi

Re: [Spark Streaming] Connect to Database only once at the start of Streaming job

2015-10-28 Thread Tathagata Das
Yeah, of course. Just create an RDD from jdbc, call cache()/persist(), then force it to be evaluated using something like count(). Once it is cached, you can use it in a StreamingContext. Because of the cache it should not access JDBC any more. On Tue, Oct 27, 2015 at 12:04 PM, diplomatic Guru

Re: [Spark Streaming] Connect to Database only once at the start of Streaming job

2015-10-28 Thread Tathagata Das
However, if your executor dies. Then it may reconnect to JDBC to reconstruct the RDD partitions that were lost. To prevent that you can checkpoint the RDD to a HDFS-like filesystem (using rdd.checkpoint()). Then you are safe, it wont reconnect to JDBC. On Tue, Oct 27, 2015 at 11:17 PM, Tathagata

Re: SPARKONHBase checkpointing issue

2015-10-28 Thread Tathagata Das
Yes, the workaround is the same that has been suggested in the JIRA for accumulator and broadcast variables. Basically make a singleton object which lazily initializes the HBaseContext. Because of singleton, it wont get serialized through checkpoint. After recovering, it will be reinitialized

Re: [Spark Streaming] Why are some uncached RDDs are growing?

2015-10-28 Thread Tathagata Das
UpdateStateByKey automatically caches its RDDs. On Tue, Oct 27, 2015 at 8:05 AM, diplomatic Guru wrote: > > Hello All, > > When I checked my running Stream job on WebUI, I can see that some RDDs > are being listed that were not requested to be cached. What more is that

Re: java.lang.ClassNotFoundException: org.apache.spark.streaming.twitter.TwitterReceiver

2015-11-09 Thread Tathagata Das
to > run it from eclipse. > > Is there any problem running the application from eclipse ? > > > > On 9 November 2015 at 12:27, Tathagata Das <t...@databricks.com> wrote: > >> How are you submitting the spark application? >> You are supposed to submit the f

Re: dynamic allocation w/ spark streaming on mesos?

2015-11-11 Thread Tathagata Das
The reason the existing dynamic allocation does not work out of the box for spark streaming is because the heuristics used for decided when to scale up/down is not the right one for micro-batch workloads. It works great for typical batch workloads. However you can use the underlying developer API

Re: java.lang.ClassNotFoundException: org.apache.spark.streaming.twitter.TwitterReceiver

2015-11-09 Thread Tathagata Das
How are you submitting the spark application? You are supposed to submit the fat-jar of the application that include the spark-streaming-twitter dependency (and its subdeps) but not spark-streaming and spark-core. On Mon, Nov 9, 2015 at 1:02 AM, أنس الليثي wrote: > I

Re: Stack overflow error caused by long lineage RDD created after many recursions

2015-10-30 Thread Tathagata Das
You have to run some action after rdd.checkpointi() for the checkpointing to actually occur. Have you done that? On Fri, Oct 30, 2015 at 3:10 PM, Panos Str wrote: > Hi all! > > Here's a part of a Scala recursion that produces a stack overflow after > many > recursions. I've

Re: How to unpersist a DStream in Spark Streaming

2015-11-05 Thread Tathagata Das
Spark streaming automatically takes care of unpersisting any RDDs generated by DStream. You can set the StreamingContext.remember() to set the minimum persistence duration. Any persisted RDD older than that will be automatically unpersisted On Thu, Nov 5, 2015 at 9:12 AM, swetha kasireddy

Re: kinesis batches hang after YARN automatic driver restart

2015-11-03 Thread Tathagata Das
The Kinesis integration underneath uses the KCL libraries which takes a minute or so sometimes to spin up the threads and start getting data from Kinesis. That is under normal conditions. In your case, it could be happening that because of your killing and restarting, the restarted KCL may be

Re: Spark 1.5 java.net.ConnectException: Connection refused

2015-10-14 Thread Tathagata Das
y around >> lossless processing of kinesis records in Spark-1.5 + kinesis-asl-1.5 when >> jobs are aborted. Any pointers or quick explanation would be very helpful. >> >> >> On Tue, Oct 13, 2015 at 4:04 PM, Tathagata Das <t...@databricks.com> >> wrote: >>

Re: Graceful shutdown drops processing in Spark Streaming

2015-10-07 Thread Tathagata Das
Aaah, interesting, you are doing 15 minute slide duration. Yeah, internally the streaming scheduler waits for the last "batch" interval which has data to be processed, but if there is a sliding interval (i.e. 15 mins) that is higher than batch interval, then that might not be run. This is indeed a

Re: Get the previous state string in Spark streaming

2015-10-16 Thread Tathagata Das
Its hard to help without any stacktrace associated with UnsupportedOperationException. On Thu, Oct 15, 2015 at 10:40 PM, Chandra Mohan, Ananda Vel Murugan < ananda.muru...@honeywell.com> wrote: > One of my co-worker(Yogesh) was trying to get this posted in spark mailing > and it seems it did not

Re: Get the previous state string in Spark streaming

2015-10-16 Thread Tathagata Das
o-generated method stub > > if(state.toString()==null) > >return Optional.of(events); > > else { > > //UnsupportedOperationException here > >return Optional.of(events.add(state.toString());); > >

Re: PySpark + Streaming + DataFrames

2015-10-19 Thread Tathagata Das
RDD and DF are not compatible data types. So you cannot return a DF when you have to return an RDD. What rather you can do is return the underlying RDD of the dataframe by dataframe.rdd(). On Fri, Oct 16, 2015 at 12:07 PM, Jason White wrote: > Hi Ken, thanks for

Re: How to put an object in cache for ever in Streaming

2015-10-19 Thread Tathagata Das
all the executors or just the same executor? > > On Fri, Oct 16, 2015 at 5:49 PM, Tathagata Das <t...@databricks.com> > wrote: > >> Setting a ttl is not recommended any more as Spark works with Java GC to >> clean up stuff (RDDs, shuffles, broadcasts,etc.) that are not

Re: Issue in spark batches

2015-10-19 Thread Tathagata Das
If cassandra is down, does saveToCassandra throw an exception? If it does, you can catch that exception and write your own logic to retry and/or no update. Once the foreachRDD function completes, that batch will be internally marked as completed. TD On Mon, Oct 19, 2015 at 5:48 AM, varun sharma

Re: Dynamic Allocation & Spark Streaming

2015-10-19 Thread Tathagata Das
Unfortunately the title on the JIRA is extremely confusing. I have fixed it. The reason why dynamic allocation does not work well with streaming is that the heuristic that is used to automatically scale up or down the number of executors works for the pattern of task schedules in batch jobs, not

Re: Is one batch created by Streaming Context always equal to one RDD?

2015-10-19 Thread Tathagata Das
Each DStream creates one RDD per batch. On Mon, Oct 19, 2015 at 4:39 AM, vaibhavrtk wrote: > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Is-one-batch-created-by-Streaming-Context-always-equal-to-one-RDD-tp25117.html >

Re: PySpark + Streaming + DataFrames

2015-10-19 Thread Tathagata Das
ersion from RDD -> DF involves a `.take(10)` in PySpark, even if > you provide the schema, so I was avoiding back-and-forth conversions. I’ll > see if I can create a ‘trusted’ conversion that doesn’t involve the `take`. > > -- > Jason > > On October 19, 2015 at 5:23:59

Re: Multiple Spark Streaming Jobs on Single Master

2015-10-19 Thread Tathagata Das
You can set the max cores for the first submitted job such that it does not take all the resources from the master. See http://spark.apache.org/docs/latest/submitting-applications.html # Run on a Spark standalone cluster in client deploy mode ./bin/spark-submit \ --class

Re: Issue in spark batches

2015-10-20 Thread Tathagata Das
again. >> >> Thanks >> Varun >> >> On Tue, Oct 20, 2015 at 2:50 AM, Tathagata Das <t...@databricks.com> >> wrote: >> >>> If cassandra is down, does saveToCassandra throw an exception? If it >>> does, you can catch that exception and

Re: Issue in spark batches

2015-10-21 Thread Tathagata Das
eep on retrying? > Also If some batch fail, I want to block further batches to be processed > as it would create inconsistency in updation of zookeeper offsets and maybe > kill the job itself after lets say 3 retries. > > Any pointers to achieve same are appreciated. > > On Wed, Oct 2

Re: How to put an object in cache for ever in Streaming

2015-10-16 Thread Tathagata Das
Setting a ttl is not recommended any more as Spark works with Java GC to clean up stuff (RDDs, shuffles, broadcasts,etc.) that are not in reference any more. So you can keep an RDD cached in Spark, and every minute uncache the previous one, and cache a new one. TD On Fri, Oct 16, 2015 at 12:02

Re: Job splling to disk and memory in Spark Streaming

2015-10-21 Thread Tathagata Das
Well, reduceByKey needs to shutffle if your intermediate data is not already partitioned in the same way as reduceByKey's partitioning. reduceByKey() has other signatures that take in a partitioner, or simply number of partitions. So you can set the same partitioner as your previous stage.

Re: Spark Streaming: Doing operation in Receiver vs RDD

2015-10-08 Thread Tathagata Das
Since it is about encryption and decryption, its also good know how the raw data is actually saved in disk. If the write ahead log is enabled, then the raw data will be saved to the WAL in HDFS. You probably do not want to save decrypted data in that. So its better not to decrupt in the receiver,

Re: Why Spark Stream job stops producing outputs after a while?

2015-10-12 Thread Tathagata Das
Are you sure that there are not log4j errors in the driver logs? What if you try enabling debug level? And what does the streaming UI say? On Mon, Oct 12, 2015 at 12:50 PM, Uthayan Suthakar < uthayan.sutha...@gmail.com> wrote: > Any suggestions? Is there anyway that I could debug this issue? >

Re: Broadcast var is null

2015-10-05 Thread Tathagata Das
Make sure the broadcast variable works independent of the streaming application. Then make sure it work without have StreamingContext.getOrCreate(). That will disambiguate whether that error is thrown when starting a new context, or when recovering a context from checkpoint (as getOrCreate is

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-06 Thread Tathagata Das
he executor (since the number of messages > in the rdd is known already) > > do a foreachPartition and println or count the iterator manually. > > On Tue, Oct 6, 2015 at 6:02 PM, Tathagata Das <t...@databricks.com> wrote: > >> Are sure that this is not related to Cassandra in

Re: does KafkaCluster can be public ?

2015-10-06 Thread Tathagata Das
Given the interest, I am also inclining towards making it a public developer API. Maybe even experimental. Cody, mind submitting a patch? On Tue, Oct 6, 2015 at 7:45 AM, Sean Owen wrote: > For what it's worth, I also use this class in an app, but it happens > to be from

Re: Weird performance pattern of Spark Streaming (1.4.1) + direct Kafka

2015-10-06 Thread Tathagata Das
Are sure that this is not related to Cassandra inserts? Could you just do foreachRDD { _.count } instead to keep Cassandra out of the picture and then test this agian. On Tue, Oct 6, 2015 at 12:33 PM, Adrian Tanase wrote: > Also check if the Kafka cluster is still balanced.

Re: [Spark 1.5] Kinesis receivers not starting

2015-10-08 Thread Tathagata Das
How many executors and cores do you acquire? td On Thu, Oct 8, 2015 at 6:11 PM, Bharath Mukkati wrote: > Hi Spark Users, > > I am testing my application on Spark 1.5 and kinesis-asl-1.5. The > streaming application starts but I see a ton of stages scheduled for >

Re: Kafka and Spark combination

2015-10-09 Thread Tathagata Das
http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/ Recently it has been merged into HBase https://issues.apache.org/jira/browse/HBASE-13992 There are other options to use. See spark-packages.org. On Fri, Oct 9, 2015 at 4:33 PM, Xiao Li wrote: >

Re: Spark checkpoint restore failure due to s3 consistency issue

2015-10-09 Thread Tathagata Das
gt; > On Fri, Oct 9, 2015 at 2:18 PM, Tathagata Das <t...@databricks.com> wrote: > >> Can you provide the before stop and after restart log4j logs for this? >> >> On Fri, Oct 9, 2015 at 2:13 PM, Spark Newbie <sparknewbie1...@gmail.com> >> wrote: >&

Re: Spark checkpoint restore failure due to s3 consistency issue

2015-10-09 Thread Tathagata Das
Can you provide the before stop and after restart log4j logs for this? On Fri, Oct 9, 2015 at 2:13 PM, Spark Newbie wrote: > Hi Spark Users, > > I'm seeing checkpoint restore failures causing the application startup to > fail with the below exception. When I do "ls"

Re: Spark 1.5 java.net.ConnectException: Connection refused

2015-10-13 Thread Tathagata Das
Is this happening too often? Is it slowing things down or blocking progress. Failures once in a while is part of the norm, and the system should take care of itself. On Tue, Oct 13, 2015 at 2:47 PM, Spark Newbie wrote: > Hi Spark users, > > I'm seeing the below

<    1   2   3   4   5   6   7   8   9   >