Re: RE: Spark or Storm

2015-06-19 Thread Tathagata Das
.com *From:* Haopu Wang hw...@qilinsoft.com *Date:* 2015-06-19 18:47 *To:* Enno Shioji eshi...@gmail.com; Tathagata Das t...@databricks.com *CC:* prajod.vettiyat...@wipro.com; Cody Koeninger c...@koeninger.org; bit1...@163.com; Jordan Pilat jrpi...@gmail.com; Will Briggs wrbri...@gmail.com

Re: createDirectStream and Stats

2015-06-19 Thread Tathagata Das
, this is with CDH 5.4.1 and Spark 1.3. On Thu, Jun 18, 2015 at 7:05 PM, Tim Smith secs...@gmail.com wrote: Thanks for the super-fast response, TD :) I will now go bug my hadoop vendor to upgrade from 1.3 to 1.4. Cloudera, are you listening? :D On Thu, Jun 18, 2015 at 7:02 PM, Tathagata Das

Re: Assigning number of workers in spark streaming

2015-06-19 Thread Tathagata Das
deploy mode ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master spark://207.184.161.138:7077 \ --executor-memory 20G \ --total-executor-cores 100 \ /path/to/examples.jar \ 1000 On Sat, Jun 20, 2015 at 3:18 AM, Tathagata Das t...@databricks.com wrote

Re: Serial batching with Spark Streaming

2015-06-19 Thread Tathagata Das
())) { rdd.collect().forEach(s -> out.println(s)); } return null; }); context.start(); context.awaitTermination(); } On 17 June 2015 at 17:25, Tathagata Das t...@databricks.com wrote: The default behavior should be that batch X + 1 starts processing only

Re: Assigning number of workers in spark streaming

2015-06-19 Thread Tathagata Das
Depends on what cluster manager you are using. It's all pretty well documented in the online documentation. http://spark.apache.org/docs/latest/submitting-applications.html On Fri, Jun 19, 2015 at 2:29 PM, anshu shukla anshushuk...@gmail.com wrote: Hey, *[For Client Mode]* 1- Is there any

Re: createDirectStream and Stats

2015-06-19 Thread Tathagata Das
I don't think there were any enhancements that could change this behavior. On Fri, Jun 19, 2015 at 6:16 PM, Tim Smith secs...@gmail.com wrote: On Fri, Jun 19, 2015 at 5:15 PM, Tathagata Das t...@databricks.com wrote: Also, can you find from the spark UI the break up of the stages in each batch's

Re: createDirectStream and Stats

2015-06-19 Thread Tathagata Das
vs days. On Fri, Jun 19, 2015 at 1:40 PM, Tathagata Das t...@databricks.com wrote: Yes, please tell us what operation you are using. TD On Fri, Jun 19, 2015 at 11:42 AM, Cody Koeninger c...@koeninger.org wrote: Is there any more info you can provide / relevant code? On Fri, Jun 19

Re: RE: Spark or Storm

2015-06-19 Thread Tathagata Das
If the current documentation is confusing, we can definitely improve the documentation. However, I do not understand why the term transactional is confusing. If your output operation has to add 5, then the user has to implement the following mechanism 1. If the unique id of the batch of data is
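A minimal sketch of the transactional idea described above, in Scala. ExternalStore below is a hypothetical stand-in (an in-memory set) for a store that would record the batch id and the update in one transaction; the count-as-delta logic is illustrative, not the original code:

    import org.apache.spark.streaming.Time
    import org.apache.spark.streaming.dstream.DStream
    import scala.collection.mutable

    // Hypothetical stand-in for a transactional store: a real system would write the
    // batch id and the update in ONE database transaction.
    object ExternalStore {
      private val seenBatches = mutable.Set[Long]()
      private var total = 0L
      def commitIfNotSeen(batchId: Long, delta: Long): Unit = synchronized {
        if (!seenBatches.contains(batchId)) {  // a replayed batch is simply skipped
          total += delta
          seenBatches += batchId
        }
      }
    }

    def exactlyOnceOutput(stream: DStream[String]): Unit = {
      stream.foreachRDD { (rdd, batchTime) =>
        val batchId = batchTime.milliseconds  // unique id of this batch of data
        val delta = rdd.count()               // e.g. the "+5" the output has to add
        ExternalStore.commitIfNotSeen(batchId, delta)
      }
    }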

Re: understanding on the waiting batches and scheduling delay in Streaming UI

2015-06-18 Thread Tathagata Das
Also, could you give a screenshot of the streaming UI. Even better, could you run it on Spark 1.4 which has a new streaming UI and then use that for debugging/screenshot? TD On Thu, Jun 18, 2015 at 3:05 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Which version of spark? and what is your

Re: Latency between the RDD in Streaming

2015-06-18 Thread Tathagata Das
is approx 1500 tuples/sec). On Fri, Jun 19, 2015 at 2:50 AM, Tathagata Das t...@databricks.com wrote: Couple of ways. 1. Easy but approx way: Find scheduling delay and processing time using StreamingListener interface, and then calculate end-to-end delay = 0.5 * batch interval + scheduling
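A rough Scala sketch of the StreamingListener approach outlined above; the 0.5 * batch interval term comes from the formula in the email, while the class name and the println are illustrative:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Approximate end-to-end latency per completed batch:
    //   0.5 * batch interval + scheduling delay + processing time
    class LatencyListener(batchIntervalMs: Long) extends StreamingListener {
      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        val info = batchCompleted.batchInfo
        val schedulingDelay = info.schedulingDelay.getOrElse(0L)
        val processingTime = info.processingDelay.getOrElse(0L)
        val approxEndToEnd = 0.5 * batchIntervalMs + schedulingDelay + processingTime
        println(s"Batch ${info.batchTime}: approx end-to-end delay = $approxEndToEnd ms")
      }
    }

    // Usage, assuming ssc is an existing StreamingContext with a 1 second batch interval:
    // ssc.addStreamingListener(new LatencyListener(1000))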

Re: createDirectStream and Stats

2015-06-18 Thread Tathagata Das
Are you using Spark 1.3.x? That explains it. This issue has been fixed in Spark 1.4.0. Bonus: you get a fancy new streaming UI with more awesome stats. :) On Thu, Jun 18, 2015 at 7:01 PM, Tim Smith secs...@gmail.com wrote: Hi, I just switched from createStream to the createDirectStream API for

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-18 Thread Tathagata Das
the off-heap allocation of netty? Best Regards, Shixiong Zhu 2015-06-03 11:58 GMT+08:00 Ji ZHANG zhangj...@gmail.com: Hi, Thanks for your information. I'll give Spark 1.4 a try when it's released. On Wed, Jun 3, 2015 at 11:31 AM, Tathagata Das t...@databricks.com wrote: Could you try it out

Re: [Spark Streaming] Runtime Error in call to max function for JavaPairRDD

2015-06-18 Thread Tathagata Das
I think you may be including a different version of Spark Streaming in your assembly. Please mark spark-core and spark-streaming as provided dependencies. Any installation of Spark will automatically provide Spark in the classpath so you do not have to bundle it. On Thu, Jun 18, 2015 at 8:44 AM,
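For reference, marking the Spark artifacts as provided in an sbt build looks roughly like the fragment below (versions are illustrative); the assembly then contains only your application code and non-Spark libraries:

    // build.sbt fragment (illustrative versions)
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % "1.4.0" % "provided",
      "org.apache.spark" %% "spark-streaming" % "1.4.0" % "provided"
      // application-specific libraries (e.g. Kafka/Kinesis integration) stay in the assembly
    )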

Re: Latency between the RDD in Streaming

2015-06-18 Thread Tathagata Das
It's not clear what you are asking. Find what among RDDs? On Thu, Jun 18, 2015 at 11:24 AM, anshu shukla anshushuk...@gmail.com wrote: Is there any fixed way to find among RDD in stream processing systems, in the distributed set-up. -- Thanks Regards, Anshu Shukla

Re: Latency between the RDD in Streaming

2015-06-18 Thread Tathagata Das
initial D-STREAM to final/last D-STREAM . Help Please !! On Fri, Jun 19, 2015 at 12:40 AM, Tathagata Das t...@databricks.com wrote: Its not clear what you are asking. Find what among RDD? On Thu, Jun 18, 2015 at 11:24 AM, anshu shukla anshushuk...@gmail.com wrote: Is there any fixed way

Re: Serial batching with Spark Streaming

2015-06-17 Thread Tathagata Das
The default behavior should be that batch X + 1 starts processing only after batch X completes. If you are using Spark 1.4.0, could you show us a screenshot of the streaming tab, especially the list of batches? And could you also tell us if you are setting any SparkConf configurations? On Wed,

Re: Spark or Storm

2015-06-17 Thread Tathagata Das
To add more information beyond what Matei said and answer the original question, here are other things to consider when comparing Spark Streaming and Storm. * Unified programming model and semantics - On most occasions you have to process the same data again in batch jobs. If you have two

Re: Spark Streaming reads from stdin or output from command line utility

2015-06-12 Thread Tathagata Das
stdin. Also I'm working with Scala. It would be great if you could talk about multi stdin case as well! Thanks. From: Tathagata Das t...@databricks.com Date: Thursday, June 11, 2015 at 8:11 PM To: Heath Guo heath...@fb.com Cc: user user@spark.apache.org Subject: Re: Spark Streaming reads from

Re: Re: How to keep a SQLContext instance alive in a spark streaming application's life cycle?

2015-06-12 Thread Tathagata Das
BTW, in Spark 1.4 announced today, I added SQLContext.getOrCreate. So you don't need to create the singleton yourself. On Wed, Jun 10, 2015 at 3:21 AM, Sergio Jiménez Barrio drarse.a...@gmail.com wrote: Note: CCing user@spark.apache.org First, you must check if the RDD is empty:
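A small sketch of using that helper inside a streaming job (Scala, Spark 1.4); the Record case class and the table name are illustrative:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.dstream.DStream

    case class Record(word: String, count: Int)   // illustrative schema

    // One SQLContext reused across batches via the getOrCreate helper added in Spark 1.4.
    def saveWordCounts(stream: DStream[(String, Int)]): Unit = {
      stream.foreachRDD { rdd =>
        val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
        import sqlContext.implicits._
        val df = rdd.map { case (w, c) => Record(w, c) }.toDF()
        df.registerTempTable("word_counts")   // illustrative table name
      }
    }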

Re: FW: Re: Autoscaling Spark cluster based on topic sizes/rate of growth in Kafka or Spark's metrics?

2015-06-11 Thread Tathagata Das
Let me try to add some clarity to the different thought directions going on in this thread. 1. HOW TO DETECT THE NEED FOR MORE CLUSTER RESOURCES? If there are no rate limits set up, the most reliable way to detect whether the current Spark cluster is insufficient to handle the data

Re: Spark Streaming reads from stdin or output from command line utility

2015-06-11 Thread Tathagata Das
Are you going to receive data from one stdin from one machine, or many stdins on many machines? On Thu, Jun 11, 2015 at 7:25 PM, foobar heath...@fb.com wrote: Hi, I'm new to Spark Streaming, and I want to create a application where Spark Streaming could create DStream from stdin. Basically I

Re: Shutdown with streaming driver running in cluster broke master web UI permanently

2015-06-11 Thread Tathagata Das
Do you have the event logging enabled? TD On Thu, Jun 11, 2015 at 11:24 AM, scar0909 scar0...@gmail.com wrote: I have the same problem. i realized that the master spark becomes unresponsive when we kill the leader zookeeper (of course i assigned the leader election task to the zookeeper).

Re: Join between DStream and Periodically-Changing-RDD

2015-06-11 Thread Tathagata Das
Another approach not mentioned is to use a function to get the RDD that is to be joined. Not sure, but you can try something like this: kvDstream.foreachRDD(rdd => { val rdd = getOrUpdateRDD(params...) rdd.join(kvFile) }) The
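A hedged Scala sketch of that pattern: getOrUpdate below is a hypothetical helper that reloads and caches the reference RDD only when it is older than a chosen age, so the join picks up periodic changes; the file format and refresh policy are assumptions:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // Hypothetical refresh logic: reload the reference data at most once per maxAgeMs,
    // otherwise keep returning the cached RDD.
    object ReferenceData {
      @volatile private var cached: RDD[(String, String)] = _
      @volatile private var lastLoad = 0L
      def getOrUpdate(sc: SparkContext, path: String, maxAgeMs: Long): RDD[(String, String)] = synchronized {
        val now = System.currentTimeMillis()
        if (cached == null || now - lastLoad > maxAgeMs) {
          if (cached != null) cached.unpersist()
          cached = sc.textFile(path).map { line =>
            val Array(k, v) = line.split(",", 2)   // illustrative file format
            (k, v)
          }.cache()
          lastLoad = now
        }
        cached
      }
    }

    def joinWithReference(kvStream: DStream[(String, String)], path: String): DStream[(String, (String, String))] = {
      kvStream.transform { rdd =>
        rdd.join(ReferenceData.getOrUpdate(rdd.sparkContext, path, maxAgeMs = 10 * 60 * 1000))
      }
    }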

Re: Spark SQL and Streaming Results

2015-06-05 Thread Tathagata Das
You could take a look at the RDD *async operations and their source code. Maybe that can help in getting some early results. TD On Fri, Jun 5, 2015 at 8:41 AM, Pietro Gentile pietro.gentile89.develo...@gmail.com wrote: Hi all, what is the best way to perform Spark SQL queries and obtain the result

Re: Spark SQL and Streaming Results

2015-06-05 Thread Tathagata Das
have missed it. @TD Is this project still active? I'm not sure what the status is but it may provide some insights on how to achieve what you're looking to do. On Fri, Jun 5, 2015 at 6:34 PM, Tathagata Das t...@databricks.com wrote: You could take a look at the RDD *async operations, their source code

Re: Roadmap for Spark with Kafka on Scala 2.11?

2015-06-04 Thread Tathagata Das
But compile scope is supposed to be added to the assembly. https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope On Thu, Jun 4, 2015 at 1:24 PM, algermissen1971 algermissen1...@icloud.com wrote: Hi Iulian, On 26 May 2015, at 13:04, Iulian

Re: Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-06-02 Thread Tathagata Das
the official website. I could use spark1.0 + yarn, but I can't find a way to handle the logs, level and rolling, so it'll explode the harddrive. Currently I'll stick to spark1.0 + standalone, until our ops team decides to upgrade cdh. On Tue, Jun 2, 2015 at 2:58 PM, Tathagata Das t...@databricks.com

Re: Spark updateStateByKey fails with class leak when using case classes - resend

2015-06-01 Thread Tathagata Das
Interesting, only in local[*]! In the github repo you pointed to, what is the main that you were running? TD On Mon, May 25, 2015 at 9:23 AM, rsearle eggsea...@verizon.net wrote: Further experimentation indicates these problems only occur when master is local[*]. There are no issues if a

Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Tathagata Das
In the receiver-less direct approach, there is no concept of consumer group as we don't use the Kafka high-level consumer (that uses ZK). Instead Spark Streaming manages offsets on its own, giving tighter guarantees. If you want to monitor the progress of the processing of offsets, you will have to
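One way to monitor the offsets processed by the direct stream is to read them off each batch's RDD, which implements HasOffsetRanges (spark-streaming-kafka, Spark 1.3+). A sketch, with broker and topic parameters left illustrative:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    def monitoredDirectStream(ssc: StreamingContext, brokers: String, topics: Set[String]): Unit = {
      val kafkaParams = Map("metadata.broker.list" -> brokers)
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topics)

      stream.foreachRDD { rdd =>
        // RDDs produced by the direct stream carry their Kafka offset ranges.
        val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        ranges.foreach { r =>
          println(s"${r.topic} partition ${r.partition}: ${r.fromOffset} -> ${r.untilOffset}")
        }
        // ... normal processing of rdd ...
      }
    }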

Re: Recommended Scala version

2015-05-31 Thread Tathagata Das
in the jar I've specified with the --jars argument on the command line are available in the REPL. Cheers Alex On Thu, May 28, 2015 at 8:38 AM, Tathagata Das t...@databricks.com wrote: Would be great if you guys can test out the Spark 1.4.0 RC2 (RC3 coming out soon) with Scala 2.11 and report

Re: Recommended Scala version

2015-05-28 Thread Tathagata Das
Would be great if you guys can test out the Spark 1.4.0 RC2 (RC3 coming out soon) with Scala 2.11 and report issues. TD On Tue, May 26, 2015 at 9:15 AM, Koert Kuipers ko...@tresata.com wrote: we are still running into issues with spark-shell not working on 2.11, but we are running on somewhat

Re: Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread Tathagata Das
You can throttle the no-receiver direct Kafka stream using spark.streaming.kafka.maxRatePerPartition http://spark.apache.org/docs/latest/configuration.html#spark-streaming On Wed, May 27, 2015 at 4:34 PM, Ted Yu yuzhih...@gmail.com wrote: Have you seen
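For example, setting the limit on the SparkConf; the value of 1000 records/sec/partition is illustrative:

    import org.apache.spark.SparkConf

    // Cap the direct Kafka stream at N records per second per Kafka partition.
    // With a 2-second batch and 4 partitions, 1000 here bounds each batch at
    // roughly 2 * 4 * 1000 = 8000 records.
    val conf = new SparkConf()
      .setAppName("ThrottledDirectKafka")
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")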

Re: Spark Streaming - Design considerations/Knobs

2015-05-24 Thread Tathagata Das
, Hemant On Thu, May 21, 2015 at 2:21 AM, Tathagata Das t...@databricks.com wrote: Correcting the ones that are incorrect or incomplete. BUT this is good list for things to remember about Spark Streaming. On Wed, May 20, 2015 at 3:40 AM, Hemant Bhanawat hemant9...@gmail.com wrote: Hi, I have

Re: Trying to connect to many topics with several DirectConnect

2015-05-22 Thread Tathagata Das
Can you show us the rest of the program? When are you starting, or stopping the context? Is the exception occurring right after start or stop? What about the log4j logs, what do they say? On Fri, May 22, 2015 at 7:12 AM, Cody Koeninger c...@koeninger.org wrote: I just verified that the following

Re: Storing spark processed output to Database asynchronously.

2015-05-22 Thread Tathagata Das
to push the data into database. Meanwhile, Spark is unable to receive data probably because the process is blocked. On Thu, May 21, 2015 at 5:28 PM, Tathagata Das t...@databricks.com wrote: Can you elaborate on how the data loss is occurring? On Thu, May 21, 2015 at 1:10 AM, Gautam Bajaj

Re: [Streaming] Non-blocking recommendation in custom receiver documentation and KinesisReceiver's worker.run blocking calll

2015-05-21 Thread Tathagata Das
Thanks for the JIRA. I will look into this issue. TD On Thu, May 21, 2015 at 1:31 AM, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: I ran into one of the issues that are potentially caused because of this and have logged a JIRA bug - https://issues.apache.org/jira/browse/SPARK-7788

Re: Storing spark processed output to Database asynchronously.

2015-05-21 Thread Tathagata Das
If you cannot push data as fast as you are generating it, then async isn't going to help either. The work is just going to keep piling up as many, many async jobs even though your batch processing times will be low, as that processing time does not reflect how much overall work is pending

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-05-21 Thread Tathagata Das
Looks like somehow the file size reported by the FSInputDStream of Tachyon's FileSystem interface, is returning zero. On Mon, May 11, 2015 at 4:38 AM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Just to follow up this thread further . I was doing some fault tolerant testing

Re: Connecting to an inmemory database from Spark

2015-05-21 Thread Tathagata Das
Doesn't seem like a Cassandra-specific issue. Could you give us more information (code, errors, stack traces)? On Thu, May 21, 2015 at 1:33 PM, tshah77 tejasrs...@gmail.com wrote: TD, Do you have any example about reading from Cassandra using Spark Streaming in Java? I am trying to connect

Re: Spark Streaming - Design considerations/Knobs

2015-05-20 Thread Tathagata Das
Correcting the ones that are incorrect or incomplete. BUT this is a good list of things to remember about Spark Streaming. On Wed, May 20, 2015 at 3:40 AM, Hemant Bhanawat hemant9...@gmail.com wrote: Hi, I have compiled a list (from online sources) of knobs/design considerations that need to

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-20 Thread Tathagata Das
If you are talking about handling driver crash failures, then all bets are off anyways! Adding a shutdown hook in the hope of handling driver process failure handles only some cases (Ctrl-C), but does not handle cases like SIGKILL (does not run JVM shutdown hooks) or driver machine crash. So

Re: [Unit Test Failure] Test org.apache.spark.streaming.JavaAPISuite.testCount failed

2015-05-20 Thread Tathagata Das
Has this been fixed for you now? There have been a number of patches since then and it may have been fixed. On Thu, May 14, 2015 at 7:20 AM, Wangfei (X) wangf...@huawei.com wrote: Yes, it fails repeatedly on my local Jenkins. Sent from my iPhone. On 14 May 2015, at 18:30, Tathagata Das t...@databricks.com wrote

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-19 Thread Tathagata Das
If you wanted to stop it gracefully, then why are you not calling ssc.stop(stopGracefully = true, stopSparkContext = true)? Then it doesn't matter whether the shutdown hook was called or not. TD On Mon, May 18, 2015 at 9:43 PM, Dibyendu Bhattacharya dibyendu.bhattach...@gmail.com wrote: Hi,

Re: [SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-19 Thread Tathagata Das
If you want the fileStream to start only after a certain event has happened, why not start the streamingContext after that event? TD On Sun, May 17, 2015 at 7:51 PM, Haopu Wang hw...@qilinsoft.com wrote: I want to use file stream as input. And I look at SparkStreaming document again, it's

Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-16 Thread Tathagata Das
Marscher [mailto:rmarsc...@localytics.com] *Sent:* Friday, May 15, 2015 7:20 PM *To:* Evo Eftimov *Cc:* Tathagata Das; user *Subject:* Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond The doc is a bit confusing IMO, but at least for my application I had to use a fair pool

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Tathagata Das
What is the error you are seeing? TD On Thu, May 14, 2015 at 9:00 AM, Erich Ess er...@simplerelevance.com wrote: Hi, Is it possible to setup streams from multiple Kinesis streams and process them in a single job? From what I have read, this should be possible, however, the Kinesis layer

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Tathagata Das
? it should work the same way - including union() of streams from totally different source types (kafka, kinesis, flume). On Thu, May 14, 2015 at 2:07 PM, Tathagata Das t...@databricks.com wrote: What is the error you are seeing? TD On Thu, May 14, 2015 at 9:00 AM, Erich Ess er

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-14 Thread Tathagata Das
It would be good if you can tell what I should add to the documentation to make it easier to understand. I can update the docs for 1.4.0 release. On Tue, May 12, 2015 at 9:52 AM, Lee McFadden splee...@gmail.com wrote: Thanks for explaining Sean and Cody, this makes sense now. I'd like to help

Re: [Unit Test Failure] Test org.apache.spark.streaming.JavaAPISuite.testCount failed

2015-05-14 Thread Tathagata Das
Do you get this failure repeatedly? On Thu, May 14, 2015 at 12:55 AM, kf wangf...@huawei.com wrote: Hi, all, i got following error when i run unit test of spark by dev/run-tests on the latest branch-1.4 branch. the latest commit id: commit d518c0369fa412567855980c3f0f426cde5c190d

Re: how to use rdd.countApprox

2015-05-13 Thread Tathagata Das
On Wednesday, May 13, 2015 12:02 PM, Tathagata Das t...@databricks.com wrote: That is a good question. I don't see a direct way to do that. You could try the following val jobGroupId = group-id-based-on-current-time rdd.sparkContext.setJobGroup(jobGroupId) val approxCount

Re: how to use rdd.countApprox

2015-05-13 Thread Tathagata Das
. Thanks, Du On Wednesday, May 13, 2015 1:12 PM, Tathagata Das t...@databricks.com wrote: That is not supposed to happen :/ That is probably a bug. If you have the log4j logs, would be good to file a JIRA. This may be worth debugging. On Wed, May 13, 2015 at 12:13 PM, Du Li l...@yahoo

Re: DStream Union vs. StreamingContext Union

2015-05-12 Thread Tathagata Das
@Vadim What happened when you tried unioning using DStream.union in python? TD On Tue, May 12, 2015 at 9:53 AM, Evo Eftimov evo.efti...@isecc.com wrote: I can confirm it does work in Java *From:* Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com] *Sent:* Tuesday, May 12, 2015 5:53 PM

Re: DStream Union vs. StreamingContext Union

2015-05-12 Thread Tathagata Das
On Tue, May 12, 2015 at 12:57 PM, Tathagata Das tathagata.das1...@gmail.com wrote: @Vadim What happened when you tried unioning using DStream.union in python? TD On Tue, May 12, 2015 at 9:53 AM, Evo Eftimov evo.efti...@isecc.com wrote: I can confirm it does work in Java *From:* Vadim

Re: how to use rdd.countApprox

2015-05-12 Thread Tathagata Das
From the code it seems that as soon as the rdd.countApprox(5000) returns, you can call pResult.initialValue() to get the approximate count at that point of time (that is after timeout). Calling pResult.getFinalValue() will further block until the job is over, and give the final correct values
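A small Scala sketch illustrating that behavior; the timeout and confidence values are illustrative:

    import org.apache.spark.rdd.RDD

    // initialValue returns whatever estimate is available once the timeout expires;
    // getFinalValue() blocks until the count job finishes and returns the exact result.
    def approxThenExact(rdd: RDD[String]): Unit = {
      val pResult = rdd.countApprox(timeout = 5000, confidence = 0.90)
      val early = pResult.initialValue      // BoundedDouble available after ~5 seconds
      println(s"approx count: ${early.mean} (low=${early.low}, high=${early.high})")
      val exact = pResult.getFinalValue()   // blocks until the job completes
      println(s"final count: ${exact.mean}")
    }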

Re: Receiver Fault Tolerance

2015-05-06 Thread Tathagata Das
Incorrect. The receiver runs in an executor just like a any other tasks. In the cluster mode, the driver runs in a worker, however it launches executors in OTHER workers in the cluster. Its those executors running in other workers that run tasks, and also the receivers. On Wed, May 6, 2015 at

Re: How update counter in cassandra

2015-05-06 Thread Tathagata Das
This may help. http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala On Wed, May 6, 2015 at 5:35 PM, Sergio Jiménez Barrio drarse.a...@gmail.com wrote: I have a counter column family in Cassandra. I want to update these counters

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Tathagata Das
Is the function ingestToMysql running on the driver or on the executors? Accordingly you can try debugging while running in a distributed manner, with and without calling the function. If you don't get too many open files without calling ingestToMysql(), the problem is likely to be in

Re: Join between Streaming data vs Historical Data in spark

2015-04-29 Thread Tathagata Das
Have you taken a look at the join section in the streaming programming guide? http://spark.apache.org/docs/latest/streaming-programming-guide.html#stream-dataset-joins On Wed, Apr 29, 2015 at 7:11 AM, Rendy Bambang Junior rendy.b.jun...@gmail.com wrote: Let say I have transaction data and

Re: implicit function in SparkStreaming

2015-04-29 Thread Tathagata Das
I believe that the implicit def is pulling the enclosing class (in which the def is defined) into the closure, which is not serializable. On Wed, Apr 29, 2015 at 4:20 AM, guoqing0...@yahoo.com.hk guoqing0...@yahoo.com.hk wrote: Hi guys, I'm puzzled why I can't use the implicit function in

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Tathagata Das
Also cc'ing Cody. @Cody maybe there is a reason for doing connection pooling even if there is no performance difference. TD On Wed, Apr 29, 2015 at 4:06 PM, Tathagata Das t...@databricks.com wrote: Is the function ingestToMysql running on the driver or on the executors? Accordingly you can
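A common shape for this kind of output operation is one connection per partition (or a borrow from a pool) rather than one per record, which keeps the number of open sockets/files bounded. A sketch in Scala, with the JDBC URL and table purely illustrative and not the original ingestToMysql code:

    import java.sql.{Connection, DriverManager}
    import org.apache.spark.streaming.dstream.DStream

    def ingestToDatabase(stream: DStream[String], jdbcUrl: String): Unit = {
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          // One connection per partition; a real job would borrow from a pool here.
          val conn: Connection = DriverManager.getConnection(jdbcUrl)
          try {
            val stmt = conn.prepareStatement("INSERT INTO events(payload) VALUES (?)")
            records.foreach { r =>
              stmt.setString(1, r)
              stmt.addBatch()
            }
            stmt.executeBatch()
          } finally {
            conn.close()   // or return the connection to the pool
          }
        }
      }
    }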

Re: Driver memory leak?

2015-04-29 Thread Tathagata Das
It could be related to this. https://issues.apache.org/jira/browse/SPARK-6737 This was fixed in Spark 1.3.1. On Wed, Apr 29, 2015 at 8:38 AM, Sean Owen so...@cloudera.com wrote: Not sure what you mean. It's already in CDH since 5.4 = 1.3.0 (This isn't the place to ask about CDH) I also

Re: Re: implicit function in SparkStreaming

2015-04-29 Thread Tathagata Das
in the enclosing class ? *From:* Tathagata Das t...@databricks.com *Date:* 2015-04-30 07:00 *To:* guoqing0...@yahoo.com.hk *CC:* user user@spark.apache.org *Subject:* Re: implicit function in SparkStreaming I believe that the implicit def is pulling in the enclosing class (in which the def

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-23 Thread Tathagata Das
What was the state of your streaming application? Was it falling behind with a large increasing scheduling delay? TD On Thu, Apr 23, 2015 at 11:31 AM, N B nb.nos...@gmail.com wrote: Thanks for the response, Conor. I tried with those settings and for a while it seemed like it was cleaning up

Re: Convert DStream to DataFrame

2015-04-22 Thread Tathagata Das
Did you check out the latest streaming programming guide? http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations You also need to be aware that to convert JSON RDDs to a DataFrame, sqlContext has to make a pass over the data to learn the schema. This will

Re: Map Question

2015-04-22 Thread Tathagata Das
Is the mylist present on every executor? If not, then you have to pass it on. And broadcasts are the best way to pass them on. But note that once broadcasted, it will be immutable at the executors, and if you update the list at the driver, you will have to broadcast it again. TD On Wed, Apr 22, 2015
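The thread itself is about the Python API, so the following is only a Scala sketch of the same idea: broadcast the read-only list once and read it from the executors:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Ship a read-only lookup list to every executor once instead of closing over it.
    // If the list is updated at the driver, create a new broadcast (as noted above).
    def tagWithList(sc: SparkContext, rdd: RDD[String], myList: Seq[String]): RDD[(String, Seq[String])] = {
      val listBc = sc.broadcast(myList)
      rdd.map(x => (x, listBc.value))
    }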

Re: Not able run multiple tasks in parallel, spark streaming

2015-04-22 Thread Tathagata Das
Furthermore, just to explain, doing arr.par.foreach does not help because it is not really running anything; it is only doing the setup of the computation. Doing the setup in parallel does not mean that the jobs will be done concurrently. Also, from your code it seems like your pairs of dstreams don't

Re: Map Question

2015-04-22 Thread Tathagata Das
Absolutely. The same code would work for local as well as distributed mode! On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: Can I use broadcast vars in local mode? ᐧ On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das t...@databricks.com wrote: Yep

Re: Map Question

2015-04-22 Thread Tathagata Das
it locally...and I plan to move it to production on EC2. The way I fixed it is by doing myrdd.map(lambda x: (x, mylist)).map(myfunc) but I don't think it's efficient? mylist is filled only once at the start and never changes. Vadim ᐧ On Wed, Apr 22, 2015 at 1:42 PM, Tathagata Das t

Re: Spark Streaming updatyeStateByKey throws OutOfMemory Error

2015-04-22 Thread Tathagata Das
It could very well be that your executor memory is not enough to store the state RDDs AND operate on the data. 1G per executor is quite low. Definitely give more memory. And have you tried increasing the number of partitions (specify number of partitions in updateStateByKey) ? On Wed, Apr 22,
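For reference, a minimal sketch of passing an explicit partition count to updateStateByKey (Scala); the update function is the usual running count and the partition count is whatever you choose:

    import org.apache.spark.streaming.dstream.DStream

    // Running count per key; numPartitions controls how the state RDDs are partitioned,
    // spreading the state over more, smaller partitions.
    def runningCounts(pairs: DStream[(String, Int)], numPartitions: Int): DStream[(String, Int)] = {
      val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
        (values, state) => Some(values.sum + state.getOrElse(0))
      pairs.updateStateByKey(updateFunc, numPartitions)   // requires ssc.checkpoint(...) to be set
    }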

Re: Map Question

2015-04-22 Thread Tathagata Das
: Great. Will try to modify the code. Always room to optimize! ᐧ On Wed, Apr 22, 2015 at 2:11 PM, Tathagata Das t...@databricks.com wrote: Absolutely. The same code would work for local as well as distributed mode! On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy vadim.bichuts

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-22 Thread Tathagata Das
is unexpectedly null when DStream is being serialized must mean something. Under which circumstances would such an exception trigger? On Tue, Apr 21, 2015 at 4:12 PM, Tathagata Das t...@databricks.com wrote: Yeah, I am not sure what is going on. The only way to figure it out is to take a look

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-21 Thread Tathagata Das
is being serialized must mean something. Under which circumstances would such an exception trigger? On Tue, Apr 21, 2015 at 4:12 PM, Tathagata Das t...@databricks.com wrote: Yeah, I am not sure what is going on. The only way to figure it out is to take a look at the disassembled bytecodes using javap

Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-21 Thread Tathagata Das
that. The puzzling thing is why removing the context bounds solves the problem... What does this exception mean in general? On Mon, Apr 20, 2015 at 1:33 PM, Tathagata Das t...@databricks.com wrote: When are you getting this exception? After starting the context? TD On Mon, Apr 20, 2015 at 10:44

Re: SparkStreaming onStart not being invoked on CustomReceiver attached to master with multiple workers

2015-04-20 Thread Tathagata Das
Responses inline. On Mon, Apr 20, 2015 at 3:27 PM, Ankit Patel patel7...@hotmail.com wrote: What you said is correct and I am expecting the printlns to be in my console or my SparkUI. I do not see it in either places. Can you actually login into the machine running the executor which runs the

Re: RAM management during cogroup and join

2015-04-15 Thread Tathagata Das
Significant optimizations can be made by doing the joining/cogroup in a smart way. If you have to join streaming RDDs with the same batch RDD, then you can first partition the batch RDD using a partitioner and cache it, and then use the same partitioner on the streaming RDDs. That would make sure
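A sketch of that idea in Scala: partition and cache the batch RDD once with an explicit partitioner, then apply the same partitioner to each streaming RDD before joining. The file path, field separator, and partition count are illustrative:

    import org.apache.spark.{HashPartitioner, SparkContext}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    def joinWithStatic(sc: SparkContext, stream: DStream[(String, Int)], path: String): DStream[(String, (Int, String))] = {
      val partitioner = new HashPartitioner(32)   // illustrative partition count
      // Partition and cache the static side ONCE, outside the per-batch code.
      val staticRdd: RDD[(String, String)] = sc.textFile(path)
        .map { line => val Array(k, v) = line.split("\t", 2); (k, v) }
        .partitionBy(partitioner)
        .cache()

      // Using the same partitioner on each streaming RDD avoids reshuffling the static side.
      stream.transform { rdd =>
        rdd.partitionBy(partitioner).join(staticRdd)
      }
    }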

Re: RAM management during cogroup and join

2015-04-15 Thread Tathagata Das
and RAM allocated for the result RDD The optimizations mentioned still don’t change the total number of elements included in the result RDD and RAM allocated – right? *From:* Tathagata Das [mailto:t...@databricks.com] *Sent:* Wednesday, April 15, 2015 9:25 PM *To:* Evo Eftimov *Cc:* user

Re: RAM management during cogroup and join

2015-04-15 Thread Tathagata Das
:* Tathagata Das [mailto:t...@databricks.com] *Sent:* Wednesday, April 15, 2015 9:48 PM *To:* Evo Eftimov *Cc:* user *Subject:* Re: RAM management during cogroup and join Agreed. On Wed, Apr 15, 2015 at 1:29 PM, Evo Eftimov evo.efti...@isecc.com wrote: That has been done Sir

Re: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-04-15 Thread Tathagata Das
asynchronously--- window1 --- window2 While writing it, I start to believe they can, because windows are time-triggered, not triggered when previous window has finished... But it's better to ask:) 2015-04-15 2:08 GMT+02:00 Tathagata Das t

Re: How to do dispatching in Streaming?

2015-04-15 Thread Tathagata Das
It may be worthwhile to architect the computation in a different way. dstream.foreachRDD { rdd => rdd.foreach { record => // do different things for each record based on filters } } TD On Sun, Apr 12, 2015 at 7:52 PM, Jianshi Huang jianshi.hu...@gmail.com wrote: Hi, I have a
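Fleshed out slightly in Scala, with the record prefixes and the println "sinks" as purely illustrative placeholders for the real dispatch targets:

    import org.apache.spark.streaming.dstream.DStream

    // Branch per record inside one output operation instead of creating one
    // filtered DStream per destination.
    def dispatch(stream: DStream[String]): Unit = {
      stream.foreachRDD { rdd =>
        rdd.foreach { record =>
          record match {
            case r if r.startsWith("ERROR")  => println(s"route to error sink: $r")
            case r if r.startsWith("METRIC") => println(s"route to metrics sink: $r")
            case r                           => println(s"default route: $r")
          }
        }
      }
    }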

Re: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-04-14 Thread Tathagata Das
Fundamentally, stream processing systems are designed for processing streams of data, not for storing large volumes of data for a long period of time. So if you have to maintain that much state for months, then it's best to use another system that is designed for long term storage (like Cassandra)

Re: Seeing message about receiver not being de-registered on invoking Streaming context stop

2015-04-14 Thread Tathagata Das
What version of Spark are you using? There was a known bug which could be causing this. It got fixed in Spark 1.3 TD On Mon, Apr 13, 2015 at 11:44 PM, Akhil Das ak...@sigmoidanalytics.com wrote: When you say done fetching documents, does it mean that you are stopping the streamingContext?

Re: sbt-assembly spark-streaming-kinesis-asl error

2015-04-14 Thread Tathagata Das
Have you tried marking only spark-streaming-kinesis-asl as not provided, and the rest as provided? Then you will not even need to add kinesis-asl.jar in the spark-submit. TD On Tue, Apr 14, 2015 at 2:27 PM, Mike Trienis mike.trie...@orcsol.com wrote: Richard, Your response was very helpful

Re: not found: value SQLContextSingleton

2015-04-11 Thread Tathagata Das
Have you created a class called SQLContextSingleton ? If so, is it in the compile class path? On Fri, Apr 10, 2015 at 6:47 AM, Mukund Ranjan (muranjan) muran...@cisco.com wrote: Hi All, Any idea why I am getting this error? wordsTenSeconds.foreachRDD((rdd: RDD[String], time: Time)

Re: coalesce(*, false) problem

2015-04-10 Thread Tathagata Das
Coalesce tries to reduce the number of partitions into a smaller number of partitions, without moving the data around (as much as possible). Since most of the received data is on a few machines (those running receivers), coalesce just makes bigger merged partitions on those. Without coalesce Machine

Re: Lookup / Access of master data in spark streaming

2015-04-09 Thread Tathagata Das
Responses inline. Hope they help. On Thu, Apr 9, 2015 at 8:20 AM, Amit Assudani aassud...@impetus.com wrote: Hi Friends, I am trying to solve a use case in spark streaming, I need help on getting to right approach on lookup / update the master data. Use case ( simplified ) I’ve a

Re: Could not compute split, block not found in Spark Streaming Simple Application

2015-04-09 Thread Tathagata Das
, Tathagata Das t...@databricks.com wrote: If it is deterministically reproducible, could you generate full DEBUG level logs, from the driver and the workers and give it to me? Basically I want to trace through what is happening to the block that is not being found. And can you tell what Cluster

Re: Continuous WARN messages from BlockManager about block replication

2015-04-09 Thread Tathagata Das
Well, you are running in local mode, so it cannot find another peer to replicate the blocks received from receivers. That's it. It's not a real concern and that error will go away when you run it in a cluster. On Thu, Apr 9, 2015 at 11:24 AM, Nandan Tammineedi nan...@defend7.com wrote: Hi,

Re: Timeout errors from Akka in Spark 1.2.1

2015-04-08 Thread Tathagata Das
There are a couple of options. Increase the timeout (see Spark configuration). Also see past mails in the mailing list. Another option you may try (I have a gut feeling that it may work, but I am not sure) is calling GC on the driver periodically. The cleaning up of stuff is tied to GCing of RDD objects

Re: Timeout errors from Akka in Spark 1.2.1

2015-04-08 Thread Tathagata Das
as the driver? Thanks NB On Wed, Apr 8, 2015 at 1:29 PM, Tathagata Das t...@databricks.com wrote: It does take effect on the executors, not on the driver. Which is okay because executors have all the data and therefore have GC issues, which is not usually the case for the driver. If you want to double

Re: Empty RDD?

2015-04-08 Thread Tathagata Das
Aah yes. The jsonRDD method needs to walk through the whole RDD to understand the schema, and does not work if there is no data in it. Making sure there is data in it using take(1) should work. TD

Re: Empty RDD?

2015-04-08 Thread Tathagata Das
What is the computation you are doing in the foreachRDD that is throwing the exception? One way to guard against it is to do a take(1) to see if you get back any data. If there is none, then don't do anything with the RDD. TD On Wed, Apr 8, 2015 at 1:08 PM, Vadim Bichutskiy
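A sketch of that guard around a jsonRDD conversion (Scala, Spark 1.3/1.4 API); the temp table name is illustrative:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.dstream.DStream

    // Skip empty batches: jsonRDD has to scan the data to infer a schema and
    // fails if the RDD has no data, so check with take(1) first.
    def saveJsonBatches(stream: DStream[String], sqlContext: SQLContext): Unit = {
      stream.foreachRDD { rdd =>
        if (rdd.take(1).nonEmpty) {
          val df = sqlContext.jsonRDD(rdd)
          df.registerTempTable("events")   // illustrative table name
        }
      }
    }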

Re: Spark + Kinesis

2015-04-06 Thread Tathagata Das
, InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2) } /* Union all the streams */ val unionStreams = ssc.union(kinesisStreams).map(byteArray => new String(byteArray)); unionStreams.print(); ssc.start(); ssc.awaitTermination() } } On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das t

Re: WordCount example

2015-04-06 Thread Tathagata Das
, Apr 6, 2015 at 9:55 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Interesting, I see 0 cores in the UI? - *Cores:* 0 Total, 0 Used On Fri, Apr 3, 2015 at 2:55 PM, Tathagata Das t...@databricks.com wrote: What does the Spark Standalone UI at port 8080 say about number of cores

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Tathagata Das
So you want to sort based on the total count of all the records received through the receiver? In that case, you have to combine all the counts using updateStateByKey (

Re: Spark Application Stages and DAG

2015-04-03 Thread Tathagata Das
What he meant is that you should look it up in the Spark UI, specifically in the Stage tab, to see what is taking so long. And yes, a code snippet helps us debug. TD On Fri, Apr 3, 2015 at 12:47 AM, Akhil Das ak...@sigmoidanalytics.com wrote: You need to open the Stage's page which is taking time, and see how

Re: About Waiting batches on the spark streaming UI

2015-04-03 Thread Tathagata Das
Very good question! This is because the current code is written such that the UI considers a batch as waiting only when it has actually started being processed. That is, batches waiting in the job queue are not considered in the calculation. It is arguable that it may be more intuitive to count that

Re: About Waiting batches on the spark streaming UI

2015-04-03 Thread Tathagata Das
, Tathagata Das t...@databricks.com wrote: Very good question! This is because the current code is written such that the UI considers a batch as waiting only when it has actually started being processed. That is, batches waiting in the job queue are not considered in the calculation. It is arguable

Re: WordCount example

2015-04-03 Thread Tathagata Das
: 3 processor : 4 processor : 5 processor : 6 processor : 7 On Fri, Apr 3, 2015 at 2:33 PM, Tathagata Das t...@databricks.com wrote: How many cores are present in the works allocated to the standalone cluster spark://ip-10-241-251-232:7077 ? On Fri, Apr 3, 2015

Re: Simple but faster data streaming

2015-04-03 Thread Tathagata Das
I am afraid not. The whole point of Spark Streaming is to make it easy to do complicated processing on streaming data while interoperating with core Spark, MLlib, and SQL, without the operational overhead of maintaining 4 different systems. As a slight cost of achieving that unification, there may be some

Re: Spark Streaming FileStream Nested File Support

2015-04-03 Thread Tathagata Das
be added in at some point. On Fri, Apr 3, 2015 at 12:47 PM, Tathagata Das t...@databricks.com wrote: A sort-of-hacky workaround is to use a queueStream where you can manually create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note that this is for testing only as queueStream does
