Re: Spark streaming on ec2

2014-02-27 Thread Tathagata Das
Yes! Spark Streaming programs are just like any Spark program, and so any EC2 cluster set up using the spark-ec2 scripts can be used to run Spark Streaming programs as well. On Thu, Feb 27, 2014 at 10:11 AM, Aureliano Buendia buendia...@gmail.com wrote: Hi, Does the ec2 support for spark 0.9

Re: Spark streaming on ec2

2014-02-27 Thread Tathagata Das
means for daemonizing and monitoring spark streaming apps which are supposed to run 24/7? If not, any suggestions for how to do this? On Thu, Feb 27, 2014 at 8:23 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Zookeeper is automatically set up in the cluster as Spark uses Zookeeper

Re: NoSuchMethodError - Akka - Props

2014-03-06 Thread Tathagata Das
Are you launching your application using the scala or java command? The scala command brings in a version of Akka that we have found to conflict with Spark's version of Akka. So it's best to launch using Java. TD On Thu, Mar 6, 2014 at 3:45 PM, Deepak Nulu deepakn...@gmail.com wrote: I see the

Re: NoSuchMethodError in KafkaReciever

2014-03-06 Thread Tathagata Das
I don't have an Eclipse setup so I am not sure what is going on here. I would try to use Maven on the command line with a pom to see if this compiles. Also, try to clean up your system Maven cache. Who knows if it had pulled in a wrong version of Kafka 0.8 and has been using it all the time. Blowing away the

Re: Explain About Logs NetworkWordcount.scala

2014-03-07 Thread Tathagata Das
I am not sure how to debug this without any more information about the source. Can you monitor on the receiver side whether data is being accepted by the receiver but not reported? TD On Wed, Mar 5, 2014 at 7:23 AM, eduardocalfaia e.costaalf...@unibs.it wrote: Hi TD, I have seen in the web UI

Re: Sliding Window operations do not work as documented

2014-03-24 Thread Tathagata Das
Hello Sanjay, Yes, your understanding of lazy semantics is correct. But ideally every batch should read based on the batch interval provided in the StreamingContext. Can you open a JIRA on this? On Mon, Mar 24, 2014 at 7:45 AM, Sanjay Awatramani sanjay_a...@yahoo.com wrote: Hi All, I found

Re: [bug?] streaming window unexpected behaviour

2014-03-24 Thread Tathagata Das
Yes, I believe that is the current behavior. Essentially, the first few RDDs will be partial windows (assuming window duration > sliding interval). TD On Mon, Mar 24, 2014 at 1:12 PM, Adrian Mocanu amoc...@verticalscope.com wrote: I have what I would call unexpected behaviour when using window on a

Re: spark executor/driver log files management

2014-03-25 Thread Tathagata Das
correct? Thanks, Sourav On Tue, Mar 25, 2014 at 3:26 AM, Tathagata Das tathagata.das1...@gmail.com wrote: You can use RollingFileAppenders in log4j.properties. http://logging.apache.org/log4j/extras/apidocs/org/apache/log4j/rolling/RollingFileAppender.html You can have other scripts

Re: Spark Streaming ZeroMQ Java Example

2014-03-25 Thread Tathagata Das
Unfortunately there isn't one right now. But it is probably not too hard to start with the JavaNetworkWordCount (https://github.com/apache/incubator-spark/blob/master/examples/src/main/java/org/apache/spark/streaming/examples/JavaNetworkWordCount.java), and use the ZeroMQUtils in the same way as the

Re: [bug?] streaming window unexpected behaviour

2014-03-25 Thread Tathagata Das
to count how many windows there are. -A -Original Message- From: Tathagata Das [mailto:tathagata.das1...@gmail.com] Sent: March-24-14 6:55 PM To: user@spark.apache.org Cc: u...@spark.incubator.apache.org Subject: Re: [bug?] streaming window unexpected behaviour Yes, I believe

Re: Spark Streaming - Shared hashmaps

2014-03-26 Thread Tathagata Das
When you say "launch long-running tasks", do you mean long-running Spark jobs/tasks, or long-running tasks in another system? If the rate of requests from Kafka is not low (in terms of records per second), you could collect the records in the driver, and maintain the shared bag in the driver. A

Re: streaming questions

2014-03-26 Thread Tathagata Das
*Answer 1:* Make sure you are using master as local[n] with n > 1 (assuming you are running it in local mode). The way Spark Streaming works is that it assigns a core to the data receiver, and so if you run the program with only one core (i.e., with local or local[1]), then it won't have resources to
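
A minimal sketch (app name, socket source, and batch interval are assumptions) of setting the master to local[n] with n > 1, so the receiver and the batch processing each get a core:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // local[2]: one core for the socket receiver, one core for processing batches
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCountSketch")
    val ssc  = new StreamingContext(conf, Seconds(1))

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()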

Re: rdd.saveAsTextFile problem

2014-03-26 Thread Tathagata Das
Can you give us the more detailed exception + stack trace from the log? It should be in the driver log. If not, please take a look at the executor logs, through the web UI, to find the stack trace. TD On Tue, Mar 25, 2014 at 10:43 PM, gaganbm gagan.mis...@gmail.com wrote: Hi Folks, Is this

Re: spark streaming and the spark shell

2014-03-27 Thread Tathagata Das
Seems like the configuration of the Spark worker is not right. Either the worker has not been given enough memory or the allocation of the memory to the RDD storage needs to be fixed. If configured correctly, the Spark workers should not get OOMs. On Thu, Mar 27, 2014 at 2:52 PM, Evgeny

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-28 Thread Tathagata Das
27, 2014 at 3:47 PM, Evgeny Shishkin itparan...@gmail.com wrote: On 28 Mar 2014, at 01:44, Tathagata Das tathagata.das1...@gmail.com wrote: The more I think about it, the problem is not about /tmp; it's more about the workers not having enough memory. Blocks of received data could be falling

Re: CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Tathagata Das
A few things that would be helpful. 1. Environment settings - you can find them on the Environment tab in the Spark application UI. 2. Are you setting the HDFS configuration correctly in your Spark program? For example, can you write an HDFS file from a Spark program (say spark-shell) to your HDFS

Re: CheckpointRDD has different number of partitions than original RDD

2014-04-08 Thread Tathagata Das
-examples_2.10-assembly-0.9.0-incubating.jar System Classpath http://10.10.41.67:40368/jars/spark-examples_2.10-assembly-0.9.0-incubating.jar Added By User From: Tathagata Das [mailto:tathagata.das1...@gmail.com] Sent: Monday, April 07, 2014 7:54 PM To: user@spark.apache.org

Spark 0.9.1 released

2014-04-09 Thread Tathagata Das
, Matei Zaharia, Nan Zhu, Nick Lanham, Patrick Wendell, Prabin Banka, Prashant Sharma, Qiuzhuang, Raymond Liu, Reynold Xin, Sandy Ryza, Sean Owen, Shixiong Zhu, shiyun.wxm, Stevo Slavić, Tathagata Das, Tom Graves, Xiangrui Meng TD

Re: Spark 0.9.1 released

2014-04-09 Thread Tathagata Das
A small additional note: Please use the direct download links on the Spark Downloads page (http://spark.apache.org/downloads.html). The Apache mirrors take a day or so to sync from the main repo, so they may not work immediately. TD On Wed, Apr 9, 2014 at 2:54 PM, Tathagata Das tathagata.das1

Re: Strange behaviour of different SSCs with same Kafka topic

2014-04-14 Thread Tathagata Das
Does this happen at low event rate for that topic as well, or only for a high volume rate? TD On Wed, Apr 9, 2014 at 11:24 PM, gaganbm gagan.mis...@gmail.com wrote: I am really at my wits' end here. I have different Streaming contexts, lets say 2, and both listening to same Kafka topics. I

Re: Strange behaviour of different SSCs with same Kafka topic

2014-04-21 Thread Tathagata Das
help me in how to use some external data to manipulate the RDD records? Thanks and regards Gagan B Mishra *Programmer* *560034, Bangalore* *India* On Tue, Apr 15, 2014 at 4:09 AM, Tathagata Das [via Apache Spark User List] [hidden email]

Re: question about the SocketReceiver

2014-04-21 Thread Tathagata Das
As long as the socket server sends data through the same connection, the existing code is going to work. The socket.getInputStream returns an input stream which will continuously allow you to pull data sent over the connection. The bytesToObject function continuously reads data from the input

Re: checkpointing without streaming?

2014-04-21 Thread Tathagata Das
Diana, that is a good question. When you persist an RDD, the system still remembers the whole lineage of parent RDDs that created that RDD. If one of the executors fails, and the persisted data is lost (both local disk and memory data will get lost), then the lineage is used to recreate the RDD. The

Re: Bind exception while running FlumeEventCount

2014-04-22 Thread Tathagata Das
Hello Neha, This is the result of a known bug in 0.9. Can you try running the latest Spark master branch to see if this problem is resolved? TD On Tue, Apr 22, 2014 at 2:48 AM, NehaS Singh nehas.si...@lntinfotech.com wrote: Hi, I have installed

Re: Java Spark Streaming - SparkFlumeEvent

2014-04-28 Thread Tathagata Das
You can get the internal AvroFlumeEvent inside the SparkFlumeEvent using SparkFlumeEvent.event. That should probably give you all the original text data. On Mon, Apr 28, 2014 at 5:46 AM, Kulkarni, Vikram vikram.kulka...@hp.com wrote: Hi Spark-users, Within my Spark Streaming program, I am
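
A minimal sketch (the stream name and UTF-8 payload are assumptions) of pulling the wrapped AvroFlumeEvent out of each SparkFlumeEvent and decoding its body:

    import java.nio.charset.StandardCharsets
    import org.apache.spark.streaming.flume.SparkFlumeEvent

    // flumeStream: DStream[SparkFlumeEvent], e.g. from FlumeUtils.createStream(...)
    val lines = flumeStream.map { (sparkEvent: SparkFlumeEvent) =>
      val avroEvent = sparkEvent.event          // the wrapped AvroFlumeEvent
      val body = avroEvent.getBody              // ByteBuffer holding the payload
      val bytes = new Array[Byte](body.remaining())
      body.get(bytes)
      new String(bytes, StandardCharsets.UTF_8) // the original text data
    }
    lines.print()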

Re: What is Seq[V] in updateStateByKey?

2014-04-29 Thread Tathagata Das
You may have already seen it, but I will mention it anyways. This example may help. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/StatefulNetworkWordCount.scala Here the state is essentially a running count of the words seen. So the value

Re: Spark's behavior

2014-04-29 Thread Tathagata Das
are not using stream context with master local, we have 1 Master and 8 Workers and 1 word source. The command line that we are using is: bin/run-example org.apache.spark.streaming.examples.JavaNetworkWordCount spark://192.168.0.13:7077 On Apr 30, 2014, at 0:09, Tathagata Das tathagata.das1

Re: What is Seq[V] in updateStateByKey?

2014-04-30 Thread Tathagata Das
Yeah, I remember changing fold to sum in a few places, probably in testsuites, but missed this example I guess. On Wed, Apr 30, 2014 at 1:29 PM, Sean Owen so...@cloudera.com wrote: S is the previous count, if any. Seq[V] are potentially many new counts. All of them have to be added together

Re: Spark streaming

2014-05-01 Thread Tathagata Das
Take a look at the RDD.pipe() operation. That allows you to pipe the data in an RDD to any external shell command (just like a Unix shell pipe). On May 1, 2014 10:46 AM, Mohit Singh mohit1...@gmail.com wrote: Hi, I guess Spark is using streaming in context of streaming live data but what I mean
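
A minimal sketch of RDD.pipe(): each element of the RDD is written as one line to the external command's stdin, and each line the command prints becomes an element of the resulting RDD (the command and data here are just illustrations):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("PipeSketch").setMaster("local[2]"))

    val lines = sc.parallelize(Seq("info start", "error disk full", "info done"))
    val upper = lines.pipe(Seq("tr", "a-z", "A-Z"))   // run `tr` once per partition
    upper.collect().foreach(println)                  // INFO START, ERROR DISK FULL, INFO DONE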

Re: What is Seq[V] in updateStateByKey?

2014-05-01 Thread Tathagata Das
Depends on your code. Referring to the earlier example, if you do words.map(x => (x,1)).updateStateByKey() then for a particular word, if a batch contains 6 occurrences of that word, the Seq[V] will be [1, 1, 1, 1, 1, 1]. Instead, if you do words.map(x => (x,1)).reduceByKey(_ +
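
A minimal sketch (assuming a DStream[String] named words, as in the example above) of the update function that receives the Seq[V] of new values together with the previous state:

    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations

    val pairs = words.map(word => (word, 1))

    val updateFunc = (newValues: Seq[Int], state: Option[Int]) => {
      // if a word appeared 6 times in this batch, newValues is Seq(1, 1, 1, 1, 1, 1)
      Some(state.getOrElse(0) + newValues.sum)
    }

    // running count per word; requires ssc.checkpoint("...") to have been set
    val runningCounts = pairs.updateStateByKey[Int](updateFunc)
    runningCounts.print()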

Re: range partitioner with updateStateByKey

2014-05-01 Thread Tathagata Das
Ordered by what? arrival order? sort order? TD On Thu, May 1, 2014 at 2:35 PM, Adrian Mocanu amoc...@verticalscope.com wrote: If I use a range partitioner, will this make updateStateByKey take the tuples in order? Right now I see them not being taken in order (most of them are ordered

Re: another updateStateByKey question

2014-05-02 Thread Tathagata Das
Could be a bug. Can you share the code, with data that I can use to reproduce this? TD On May 2, 2014 9:49 AM, Adrian Mocanu amoc...@verticalscope.com wrote: Has anyone else noticed that *sometimes* the same tuple calls update state function twice? I have 2 tuples with the same key in 1 RDD

Re: spark run issue

2014-05-04 Thread Tathagata Das
with the version spark uses. Sorry for spamming. Weide On Sat, May 3, 2014 at 7:04 PM, Tathagata Das tathagata.das1...@gmail.com wrote: I am a little confused about the version of Spark you are using. Are you using Spark 0.9.1 that uses scala 2.10.3 ? TD On Sat, May 3, 2014 at 6:16 PM, Weide

Re: sbt run with spark.ContextCleaner ERROR

2014-05-04 Thread Tathagata Das
Can you tell which version of Spark you are using? Spark 1.0 RC3, or something intermediate? And do you call sparkContext.stop at the end of your application? If so, does this error occur before or after the stop()? TD On Sun, May 4, 2014 at 2:40 AM, wxhsdp wxh...@gmail.com wrote: Hi, all i

Re: spark streaming question

2014-05-05 Thread Tathagata Das
One main reason why Spark Streaming can achieve higher throughput than Storm is that Spark Streaming operates on coarser-grained batches - second-scale massive batches - which reduce per-tuple overheads in shuffles and other kinds of data movement, etc. Note that it is also true that

Re: Spark Streaming and JMS

2014-05-05 Thread Tathagata Das
A few high-level suggestions. 1. I recommend using the new Receiver API in almost-released Spark 1.0 (see branch-1.0 / master branch on github). It's a slightly better version of the earlier NetworkReceiver, as it hides away the BlockGenerator (which needed to be unnecessarily manually started and

Re: How to use spark-submit

2014-05-07 Thread Tathagata Das
Doesn't the run-example script work for you? Also, are you on the latest commit of branch-1.0? TD On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Yes, I'm struggling with a similar problem where my classes are not found on the worker nodes. I'm using

Re: sbt run with spark.ContextCleaner ERROR

2014-05-07 Thread Tathagata Das
Okay, this needs to be fixed. Thanks for reporting this! On Mon, May 5, 2014 at 11:00 PM, wxhsdp wxh...@gmail.com wrote: Hi, TD i tried on v1.0.0-rc3 and still got the error

Re: missing method in my slf4j after excluding Spark ZK log4j

2014-05-12 Thread Tathagata Das
This gives the dependency tree in SBT (Spark uses this): https://github.com/jrudolph/sbt-dependency-graph TD On Mon, May 12, 2014 at 4:55 PM, Sean Owen so...@cloudera.com wrote: It sounds like you are doing everything right. NoSuchMethodError suggests it's finding log4j, just not the right

Re: streaming on hdfs can detected all new file, but the sum of all the rdd.count() not equals which had detected

2014-05-12 Thread Tathagata Das
A very crucial thing to remember when using file streams is that the files must be written to the monitored directory atomically. That is, once the file system shows the file in its listing, the file should not be appended to / updated after that. That often causes this kind of issue, as Spark
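
A minimal sketch (paths are assumptions) of writing a file "atomically" for a monitored directory: write it fully into a staging directory on the same filesystem, then move it into the monitored directory, so the listing never shows a half-written file:

    import java.nio.file.{Files, Paths, StandardCopyOption}

    val staging   = Paths.get("/data/staging/events-0001.log")    // hypothetical staging path
    val monitored = Paths.get("/data/incoming/events-0001.log")   // directory Spark Streaming watches

    Files.write(staging, "line1\nline2\n".getBytes("UTF-8"))       // finish writing first
    Files.move(staging, monitored, StandardCopyOption.ATOMIC_MOVE) // then move atomically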

Re: Average of each RDD in Stream

2014-05-12 Thread Tathagata Das
Use DStream.foreachRDD to do an operation on the final RDD of every batch. val sumandcount = numbers.map(n => (n.toDouble, 1)).reduce { (a, b) => (a._1 + b._1, a._2 + b._2) } sumandcount.foreachRDD { rdd => val first: (Double, Int) = rdd.take(1) ; ... } DStream.reduce creates a DStream whose RDDs
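
A minimal sketch (assuming a DStream[String] of numeric strings named numbers) of the per-batch average described above:

    val sumAndCount = numbers.map(n => (n.toDouble, 1))
                             .reduce((a, b) => (a._1 + b._1, a._2 + b._2))

    sumAndCount.foreachRDD { rdd =>
      // the RDD holds at most one element: (sum, count) for this batch
      rdd.take(1).foreach { case (sum, count) =>
        println(s"average of this batch = ${sum / count}")
      }
    }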

Re: Proper way to stop Spark stream processing

2014-05-12 Thread Tathagata Das
Since you are using the latest Spark code and not Spark 0.9.1 (guessed from the log messages), you can actually do a graceful shutdown of a streaming context. This ensures that the receivers are properly stopped and all received data is processed, and then the system terminates (stop() stays blocked

Re: Equivalent of collect() on DStream

2014-05-16 Thread Tathagata Das
Doesn't DStream.foreach() suffice? anyDStream.foreach { rdd => // do something with rdd } On Wed, May 14, 2014 at 9:33 PM, Stephen Boesch java...@gmail.com wrote: Looking further it appears the functionality I am seeking is in the following *private[spark] * class ForEachdStream (version
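
A minimal sketch (assuming a DStream[String] named anyDStream) of a collect()-like view of each batch: foreachRDD hands over the batch's RDD, which can be collected back to the driver when batches are small:

    anyDStream.foreachRDD { rdd =>
      val batch: Array[String] = rdd.collect()   // only sensible if the batch is small
      println(s"batch of ${batch.length} records: ${batch.mkString(", ")}")
    }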

Re: life if an executor

2014-05-19 Thread Tathagata Das
That's one of the main motivations for using Tachyon ;) http://tachyon-project.org/ It gives off-heap in-memory caching. And starting with Spark 0.9, you can cache any RDD in Tachyon just by specifying the appropriate StorageLevel. TD On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.com

Re: question about the license of akka and Spark

2014-05-20 Thread Tathagata Das
Akka is under the Apache 2 license too. http://doc.akka.io/docs/akka/snapshot/project/licenses.html On Tue, May 20, 2014 at 2:16 AM, YouPeng Yang yypvsxf19870...@gmail.com wrote: Hi Just know akka is under a commercial license, however Spark is under the apache license. Is there any problem?

Re: any way to control memory usage when streaming input's speed is faster than the speed of handled by spark streaming ?

2014-05-21 Thread Tathagata Das
Unfortunately, there is no API support for this right now. You could implement it yourself by implementing your own receiver and controlling the rate at which objects are received. If you are using any of the standard receivers (Flume, Kafka, etc.), I recommend looking at the source code of the
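
A minimal sketch (the socket source and the rate limit are assumptions) of a custom receiver, written against the Spark 1.0 receiver API, that throttles how fast records are handed to Spark Streaming:

    import java.io.{BufferedReader, InputStreamReader}
    import java.net.Socket
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class ThrottledSocketReceiver(host: String, port: Int, maxRecordsPerSec: Int)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        new Thread("throttled-receiver") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = { /* the receive loop exits once isStopped() is true */ }

      private def receive(): Unit = {
        val socket = new Socket(host, port)
        val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))
        var line = reader.readLine()
        while (!isStopped() && line != null) {
          store(line)                              // hand one record to Spark Streaming
          Thread.sleep(1000L / maxRecordsPerSec)   // crude rate control
          line = reader.readLine()
        }
        socket.close()
      }
    }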

Re: any way to control memory usage when streaming input's speed is faster than the speed of handled by spark streaming ?

2014-05-21 Thread Tathagata Das
this in a generic way so that it can be used for all receivers ;) TD On Wed, May 21, 2014 at 12:57 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Unfortunately, there is no API support for this right now. You could implement it yourself by implementing your own receiver and controlling the rate

Re: tests that run locally fail when run through bamboo

2014-05-21 Thread Tathagata Das
This does happen sometimes, but it is only a warning because Spark is designed to try successive ports until it succeeds. So unless a crazy number of successive ports are blocked (runaway processes?? insufficient clearing of ports by the OS??), these errors should not be a problem for tests passing. On

Re: I want to filter a stream by a subclass.

2014-05-21 Thread Tathagata Das
You could do records.filter { _.isInstanceOf[Orange] }.map { _.asInstanceOf[Orange] } On Wed, May 21, 2014 at 3:28 PM, Ian Holsman i...@holsman.com.au wrote: Hi. Firstly I'm a newb (to both Scala and Spark). I have a stream that contains multiple types of records, and I would like to
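
A minimal sketch (the Fruit/Orange classes and the records stream are assumptions) of keeping only one subclass of record from a mixed stream:

    sealed trait Fruit
    case class Orange(weight: Double) extends Fruit
    case class Apple(weight: Double) extends Fruit

    // records: DStream[Fruit]
    val oranges = records.filter(_.isInstanceOf[Orange]).map(_.asInstanceOf[Orange])

    // equivalent, and a bit more idiomatic Scala
    val oranges2 = records.flatMap {
      case o: Orange => Seq(o)
      case _         => Seq.empty[Orange]
    }
    oranges.count().print()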

Re: Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory

2014-05-21 Thread Tathagata Das
Are you running a vanilla Hadoop 2.3.0 or the one that comes with CDH5 / HDP(?) ? We may be able to reproduce this in that case. TD On Wed, May 21, 2014 at 8:35 PM, Tom Graves tgraves...@yahoo.com wrote: It sounds like something is closing the hdfs filesystem before everyone is really done

Re: Unable to run a Standalone job

2014-05-22 Thread Tathagata Das
How are you launching the application? sbt run? spark-submit? Local mode or a Spark standalone cluster? Are you packaging all your code into a jar? It looks to me like you have Spark classes in your execution environment but are missing some of Spark's dependencies. TD On Thu, May 22, 2014 at

Re: How to turn off MetadataCleaner?

2014-05-22 Thread Tathagata Das
The cleaner should remain up while the SparkContext is still active (not stopped). However, since here it seems you are stopping the SparkContext (ssc.stop(true)), the cleaner should be stopped. However, there was a bug earlier where some of the cleaners may not have been stopped when the context is

Re: Unable to run a Standalone job

2014-05-22 Thread Tathagata Das
persists. On Thu, May 22, 2014 at 3:59 PM, Shrikar archak shrika...@gmail.com wrote: I am running as sbt run. I am running it locally. Thanks, Shrikar On Thu, May 22, 2014 at 3:53 PM, Tathagata Das tathagata.das1...@gmail.com wrote: How are you launching the application? sbt run ? spark

Re: Interactive modification of DStreams

2014-06-02 Thread Tathagata Das
Currently Spark Streaming does not support addition/deletion/modification of DStreams after the streaming context has been started. Nor can you restart a stopped streaming context. Also, multiple Spark contexts (and therefore multiple streaming contexts) cannot be run concurrently in the same JVM.

Re: Window slide duration

2014-06-02 Thread Tathagata Das
I am assuming that you are referring to the "OneForOneStrategy: key not found: 1401753992000 ms" error, and not to the previous "Time 1401753992000 ms is invalid". Those two seem a little unrelated to me. Can you give us the stack trace associated with the key-not-found error? TD On Mon, Jun 2,

Re: NoSuchElementException: key not found

2014-06-02 Thread Tathagata Das
Do you have the info-level logs of the application? Can you grep for the value 32855 to find any references to it? Also, what version of Spark are you using (so that I can match the stack trace; it does not seem to match Spark 1.0)? TD On Mon, Jun 2, 2014 at 3:27 PM, Michael Chang

Re: Window slide duration

2014-06-02 Thread Tathagata Das
Can you give all the logs? I would like to see what is clearing the key "1401754908000 ms". TD On Mon, Jun 2, 2014 at 5:38 PM, Vadim Chekan kot.bege...@gmail.com wrote: Ok, it seems like "Time ... is invalid" is part of the normal workflow, where the window DStream will ignore RDDs at moments in time when

Re: Failed to remove RDD error

2014-06-03 Thread Tathagata Das
, Michael Chang m...@tellapart.com wrote: Thanks Tathagata, Thanks for all your hard work! In the future, is it possible to mark experimental features as such on the online documentation? Thanks, Michael On Mon, Jun 2, 2014 at 6:12 PM, Tathagata Das tathagata.das1...@gmail.com wrote

Re: NoSuchElementException: key not found

2014-06-03 Thread Tathagata Das
. TD On Tue, Jun 3, 2014 at 10:09 AM, Michael Chang m...@tellapart.com wrote: I only had the warning level logs, unfortunately. There were no other references of 32855 (except a repeated stack trace, I believe). I'm using Spark 0.9.1 On Mon, Jun 2, 2014 at 5:50 PM, Tathagata Das

Re: NoSuchElementException: key not found

2014-06-03 Thread Tathagata Das
wrote: Hi Tathagata, Thanks for your help! By not using coalesced RDD, do you mean not repartitioning my DStream? Thanks, Mike On Tue, Jun 3, 2014 at 12:03 PM, Tathagata Das tathagata.das1...@gmail.com wrote: I think I know what is going on! This is probably a race condition

Re: custom receiver in java

2014-06-04 Thread Tathagata Das
Yes, thanks for updating this old thread! We heard our community's demands and added support for Java receivers! TD On Wed, Jun 4, 2014 at 12:15 PM, lbustelo g...@bustelos.com wrote: Note that what TD was referring to above is already in 1.0.0

Re: running Spark Streaming just once and stop it

2014-06-12 Thread Tathagata Das
You should be able to see the Streaming tab in the Spark web UI (running on port 4040) if you have created a StreamingContext and you are using Spark 1.0. TD On Thu, Jun 12, 2014 at 1:06 AM, Ravi Hemnani raviiihemn...@gmail.com wrote: Hey, I did

Re: Spark Streaming not processing file with particular number of entries

2014-06-13 Thread Tathagata Das
This is very odd. If it is running fine on Mesos, I don't see an obvious reason why it won't work on a Spark standalone cluster. Is the .4 million file already present in the monitored directory when the context is started? In that case, the file will not be picked up (unless textFileStream is created

Re: Long running Spark Streaming Job increasing executing time per batch

2014-06-19 Thread Tathagata Das
That's quite odd. Yes, with checkpointing the lineage does not increase. Can you tell which stage in the processing of each batch is causing the increase in the processing time? Also, what are the batch interval and checkpoint interval? TD On Thu, Jun 19, 2014 at 8:45 AM, Skogberg, Fredrik

Re: Issue while trying to aggregate with a sliding window

2014-06-19 Thread Tathagata Das
zeroTime marks the time when the streaming job started, and the first batch of data is from zeroTime to zeroTime + slideDuration. The validity check of (time - zeroTime) being a multiple of slideDuration is to ensure that for a given DStream, it generates RDDs at the right times. For example, say the

Re: How to achieve reasonable performance on Spark Streaming?

2014-06-19 Thread Tathagata Das
Hello all, Apologies for the late response, this thread went under my radar. There are a number of things that can be done to improve the performance. Here are some of them off the top of my head, based on what you have mentioned. Most of them are mentioned in the streaming guide's performance

Re: Possible approaches for adding extra metadata (Spark Streaming)?

2014-06-20 Thread Tathagata Das
If the metadata is directly related to each individual record, then it can be done either way. Since I am not sure how easy or hard it will be for you to add tags before putting the data into Spark Streaming, it's hard to recommend one method over the other. However, if the metadata is related to

Re: Could not compute split, block not found

2014-07-01 Thread Tathagata Das
Are you by any chance using only memory in the storage level of the input streams? TD On Mon, Jun 30, 2014 at 5:53 PM, Tobias Pfeiffer t...@preferred.jp wrote: Bill, let's say the processing time is t' and the window size t. Spark does not *require* t' < t. In fact, for *temporary* peaks in

Re: Spark Streaming - two questions about the streamingcontext

2014-07-09 Thread Tathagata Das
1. Multiple output operations are processed in the order they are defined. That is because, by default, only one output operation is processed at a time. This *can* be parallelized using an undocumented config parameter, spark.streaming.concurrentJobs, which is set to 1 by default. 2. Yes, the output
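
A minimal sketch (app name and batch interval are assumptions) of setting the undocumented spark.streaming.concurrentJobs parameter mentioned above so that two output operations can run at the same time:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("ConcurrentJobsSketch")
      .set("spark.streaming.concurrentJobs", "2")   // default is "1"

    val ssc = new StreamingContext(conf, Seconds(10))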

Re: Restarting a Streaming Context

2014-07-10 Thread Tathagata Das
I confirm that is indeed the case. It is designed to be so because it keeps things simpler - fewer chances of issues related to cleanup when stop() is called. Also it keeps things consistent with the Spark context - once a Spark context is stopped it cannot be used any more. You can create a new

Re: All of the tasks have been completed but the Stage is still shown as Active?

2014-07-10 Thread Tathagata Das
Do you see any errors in the logs of the driver? On Thu, Jul 10, 2014 at 1:21 AM, Haopu Wang hw...@qilinsoft.com wrote: I'm running an App for hours in a standalone cluster. From the data injector and Streaming tab of web ui, it's running well. However, I see quite a lot of Active stages in

Re: KMeans code is rubbish

2014-07-10 Thread Tathagata Das
I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your dataset as well, and I got the expected answer. And I believe that even though initialization is done using sampling, the example actually sets the seed to a constant 42, so the result should always be the same no matter how

Re: Getting Persistent Connection using socketStream?

2014-07-10 Thread Tathagata Das
The implementation of the input-stream-to-iterator function in #2 is incorrect. The function should be such that, when hasNext is called on the iterator, it should try to read from the buffered reader. If an object (that is, a line) can be read, then return it; otherwise block and wait for data
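
A minimal sketch (one object per line is an assumption) of an input-stream-to-iterator function along the lines described above: hasNext blocks on the buffered reader until a line can be read or the connection is closed:

    import java.io.{BufferedReader, InputStream, InputStreamReader}

    def streamToIterator(in: InputStream): Iterator[String] = new Iterator[String] {
      private val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
      private var nextLine: String = null

      override def hasNext: Boolean = {
        if (nextLine == null) {
          nextLine = reader.readLine()   // blocks until a line arrives or the stream closes
        }
        nextLine != null                 // null means the connection was closed
      }

      override def next(): String = {
        if (!hasNext) throw new NoSuchElementException("end of stream")
        val line = nextLine
        nextLine = null
        line
      }
    }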

Re: What version of twitter4j should I use with Spark Streaming?

2014-07-10 Thread Tathagata Das
Spark Streaming uses twitter4j 3.0.3. 3.0.6 should probably work fine. The exception that you are seeing is something that should be looked into. Can you give us more logs (especially executor logs) with the stack traces that contain the error? TD On Thu, Jul 10, 2014 at 2:42 PM, Nick Chammas

Re: Number of executors change during job running

2014-07-10 Thread Tathagata Das
Are you specifying the number of reducers in all the DStream *ByKey operations? If it is not set, then the number of reducers used in the stages can keep changing across batches. TD On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I have a
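
A minimal sketch (the pairs stream and the value 8 are assumptions) of pinning the number of reducers in a ByKey operation so it stays the same in every batch:

    // pairs: DStream[(String, Int)]
    val counts = pairs.reduceByKey((a: Int, b: Int) => a + b, 8)   // always 8 reduce partitions
    // groupByKey and similar operations take an analogous numPartitions argument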

Re: NoSuchElementException: key not found when changing the window lenght and interval in Spark Streaming

2014-07-10 Thread Tathagata Das
This bug has been fixed. Either use the master branch of Spark, or maybe wait a few days for Spark 1.0.1 to be released (voting has successfully closed). TD On Thu, Jul 10, 2014 at 2:33 AM, richiesgr richie...@gmail.com wrote: Hi I get exactly the same problem here, do you've found the

Re: Spark Streaming using File Stream in Java

2014-07-10 Thread Tathagata Das
The fileStream is not designed to work with a continuously updating file, as one of the main design goals of Spark is immutability (to guarantee fault-tolerance by recomputation), and a file that is being appended to (mutating) defeats that. It is rather designed to pick up new files added atomically (using

Re: Number of executors change during job running

2014-07-10 Thread Tathagata Das
then 3 minutes. Thanks! Bill On Thu, Jul 10, 2014 at 4:28 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Are you specifying the number of reducers in all the DStream.ByKey operations? If the reduce by key is not set, then the number of reducers used in the stages can keep

Re: Some question about SQL and streaming

2014-07-10 Thread Tathagata Das
Yeah, the right solution is to have something like SchemaDStream, where the schema of all the SchemaRDDs generated by it can be stored. Something I would really like to see happen in the future :) TD On Thu, Jul 10, 2014 at 6:37 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, I think it

Re: Getting Persistent Connection using socketStream?

2014-07-10 Thread Tathagata Das
Right, this uses NextIterator, which is elsewhere in the repo. It just makes it cleaner to implement a custom iterator. But I guess you got the high-level point, so it's okay. TD On Thu, Jul 10, 2014 at 7:21 PM, kytay kaiyang@gmail.com wrote: Hi TD Thanks. I have problem understanding

Re: Join two Spark Streaming

2014-07-10 Thread Tathagata Das
Do you want to continuously maintain the set of unique integers seen since the beginning of the stream? var uniqueValuesRDD: RDD[Int] = ... dstreamOfIntegers.transform(newDataRDD => { val newUniqueValuesRDD = newDataRDD.union(distinctValues).distinct uniqueValuesRDD = newUniqueValuesRDD //
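
A minimal sketch (following the variable names in the snippet above, and assuming a StreamingContext named ssc with a checkpoint directory set) of maintaining the set of unique integers seen so far:

    import org.apache.spark.rdd.RDD

    var uniqueValuesRDD: RDD[Int] = ssc.sparkContext.emptyRDD[Int]

    val cumulativeUniques = dstreamOfIntegers.transform { newDataRDD =>
      val updated = newDataRDD.union(uniqueValuesRDD).distinct()
      updated.checkpoint()          // truncate the ever-growing lineage (needs a checkpoint dir)
      uniqueValuesRDD = updated
      updated
    }
    cumulativeUniques.count().print()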

Re: Join two Spark Streaming

2014-07-11 Thread Tathagata Das
instead of accumulative number of unique integers. I do have two questions about your code: 1. Why do we need uniqueValuesRDD? Why do we need to call uniqueValuesRDD.checkpoint()? 2. Where is distinctValues defined? Bill On Thu, Jul 10, 2014 at 8:46 PM, Tathagata Das tathagata.das1

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread Tathagata Das
for this. Thanks. On Thursday, July 10, 2014 7:24 PM, Tathagata Das tathagata.das1...@gmail.com wrote: How are you supplying the text file? On Wed, Jul 9, 2014 at 11:51 AM, M Singh mans6si...@yahoo.com wrote: Hi Folks: I am working on an application which uses spark streaming (version

Re: writing FLume data to HDFS

2014-07-11 Thread Tathagata Das
What is the error you are getting when you say "I was trying to write the data to hdfs..but it fails"? TD On Thu, Jul 10, 2014 at 1:36 PM, Sundaram, Muthu X. muthu.x.sundaram@sabre.com wrote: I am new to spark. I am trying to do the following: Netcat -> Flume -> Spark Streaming (process Flume

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread Tathagata Das
...@yahoo.com wrote: So, is it expected for the process to generate stages/tasks even after processing a file ? Also, is there a way to figure out the file that is getting processed and when that process is complete ? Thanks On Friday, July 11, 2014 1:51 PM, Tathagata Das tathagata.das1

Re: How are the executors used in Spark Streaming in terms of receiver and driver program?

2014-07-11 Thread Tathagata Das
The same executor can be used for both receiving and processing, irrespective of the deployment mode (YARN, Spark standalone, etc.). It boils down to the number of cores / task slots that the executor has. Each receiver is like a long-running task, so each of them occupies a slot. If there are free slots

Re: Number of executors change during job running

2014-07-11 Thread Tathagata Das
that the computation can be efficiently finished. I am not sure how to achieve this. Thanks! Bill On Thu, Jul 10, 2014 at 5:59 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Can you try setting the number-of-partitions in all the shuffle-based DStream operations, explicitly. It may

Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread Tathagata Das
You don't get any exception from twitter.com, saying credential error or something? I have seen this happen once when someone was behind a VPN to their office, and Twitter was probably blocked in their office. You could be having a similar issue. TD On Fri, Jul 11, 2014 at 2:57 PM, SK

Re: Some question about SQL and streaming

2014-07-11 Thread Tathagata Das
Yes, even though we don't have immediate plans, I would definitely like to see it happen some time in the not-so-distant future. TD On Thu, Jul 10, 2014 at 7:55 PM, Shao, Saisai saisai.s...@intel.com wrote: No specific plans to do so, since there is some functional loss like time based

Re: Generic Interface between RDD and DStream

2014-07-11 Thread Tathagata Das
I totally agree that if we are able to do this it will be very cool. However, this requires having a common trait / interface between RDD and DStream, which we don't have as of now. It would be very cool though. On my wish list for sure. TD On Thu, Jul 10, 2014 at 11:53 AM, mshah

Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread Tathagata Das
Does nothing get printed on the screen? If you are not getting any tweets but Spark Streaming is running successfully, you should at least get counts printed every batch (which would be zero). If they are not being printed either, check the Spark web UI to see whether stages are running or not. If

Re: How are the executors used in Spark Streaming in terms of receiver and driver program?

2014-07-11 Thread Tathagata Das
? Best, Fang, Yan yanfang...@gmail.com +1 (206) 849-4108 On Fri, Jul 11, 2014 at 1:45 PM, Tathagata Das tathagata.das1...@gmail.com wrote: The same executor can be used for both receiving and processing, irrespective of the deployment mode (yarn, spark standalone, etc.) It boils down

Re: Number of executors change during job running

2014-07-11 Thread Tathagata Das
days. No findings why there are always 2 executors for the groupBy stage. Thanks a lot! Bill On Fri, Jul 11, 2014 at 1:50 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Can you show us the program that you are running. If you are setting number of partitions in the XYZ-ByKey operation

Re: Spark Streaming timing considerations

2014-07-11 Thread Tathagata Das
This is not in the current streaming API. Queue stream is useful for testing with generated RDDs, but not for actual data. For an actual data stream, the slack time can be implemented by doing DStream.window on a larger window that takes the slack time into consideration, and then the required
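
A minimal sketch (the durations and the stream name are assumptions) of the larger window suggested above, where the extra window width provides the slack time before the data for an interval is treated as complete:

    import org.apache.spark.streaming.Seconds

    // a 60-second window recomputed every 30 seconds over the stream
    val windowed = stream.window(Seconds(60), Seconds(30))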

Re: Generic Interface between RDD and DStream

2014-07-11 Thread Tathagata Das
know it's just an intermediate hack, but still ;-) greetz, aℕdy ℙetrella about.me/noootsab http://about.me/noootsab On Sat, Jul 12, 2014 at 12:57 AM, Tathagata Das tathagata.das1...@gmail.com wrote: I totally agree that if we are able to do

Re: Stopping StreamingContext does not kill receiver

2014-07-12 Thread Tathagata Das
Yes, that's a bug I just discovered. It's a race condition in the Twitter receiver; will fix ASAP. Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-2464 TD On Sat, Jul 12, 2014 at 3:21 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: To add a potentially relevant piece of

Re: can't print DStream after reduce

2014-07-13 Thread Tathagata Das
Try doing DStream.foreachRDD and then printing the RDD count and further inspecting the RDD. On Jul 13, 2014 1:03 AM, Walrus theCat walrusthe...@gmail.com wrote: Hi, I have a DStream that works just fine when I say: dstream.print If I say: dstream.map(_,1).print that works, too.

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-13 Thread Tathagata Das
In case you still have issues with duplicate files in the uber jar, here is a reference sbt file with the assembly plugin that deals with duplicates: https://github.com/databricks/training/blob/sparkSummit2014/streaming/scala/build.sbt On Fri, Jul 11, 2014 at 10:06 AM, Bill Jay

Re: Spark Streaming with Kafka NoClassDefFoundError

2014-07-13 Thread Tathagata Das
at 5:48 PM, Tathagata Das tathagata.das1...@gmail.com wrote: In case you still have issues with duplicate files in uber jar, here is a reference sbt file with assembly plugin that deals with duplicates https://github.com/databricks/training/blob/sparkSummit2014/streaming/scala/build.sbt

Re: writing FLume data to HDFS

2014-07-14 Thread Tathagata Das
()); System.out.println("LOG RECORD = " + logRecord); "I was trying to write the data to hdfs..but it fails…" *From:* Tathagata Das [mailto:tathagata.das1...@gmail.com] *Sent:* Friday, July 11, 2014 1:43 PM *To:* user@spark.apache.org *Cc:* u
