Re: NoSuchElementException: key not found

2014-06-03 Thread Tathagata Das
. TD On Tue, Jun 3, 2014 at 10:09 AM, Michael Chang m...@tellapart.com wrote: I only had the warning level logs, unfortunately. There were no other references to 32855 (except a repeated stack trace, I believe). I'm using Spark 0.9.1 On Mon, Jun 2, 2014 at 5:50 PM, Tathagata Das

Re: NoSuchElementException: key not found

2014-06-03 Thread Tathagata Das
wrote: Hi Tathagata, Thanks for your help! By not using coalesced RDD, do you mean not repartitioning my DStream? Thanks, Mike On Tue, Jun 3, 2014 at 12:03 PM, Tathagata Das tathagata.das1...@gmail.com wrote: I think I know what is going on! This is probably a race condition

Re: Interactive modification of DStreams

2014-06-02 Thread Tathagata Das
Currently Spark Streaming does not support addition/deletion/modification of DStreams after the streaming context has been started. Nor can you restart a stopped streaming context. Also, multiple Spark contexts (and therefore multiple streaming contexts) cannot be run concurrently in the same JVM.

Re: Window slide duration

2014-06-02 Thread Tathagata Das
I am assuming that you are referring to the "OneForOneStrategy: key not found: 1401753992000 ms" error, and not to the previous "Time 1401753992000 ms is invalid" message. Those two seem a little unrelated to me. Can you give us the stacktrace associated with the key-not-found error? TD On Mon, Jun 2,

Re: NoSuchElementException: key not found

2014-06-02 Thread Tathagata Das
Do you have the info-level logs of the application? Can you grep for the value 32855 to find any references to it? Also, what version of Spark are you using (so that I can match the stack trace; it does not seem to match Spark 1.0)? TD On Mon, Jun 2, 2014 at 3:27 PM, Michael Chang

Re: Window slide duration

2014-06-02 Thread Tathagata Das
Can you give all the logs? I would like to see what is clearing the key 1401754908000 ms. TD On Mon, Jun 2, 2014 at 5:38 PM, Vadim Chekan kot.bege...@gmail.com wrote: Ok, it seems like "Time ... is invalid" is part of the normal workflow, when the window DStream will ignore RDDs at moments in time when

Re: Unable to run a Standalone job

2014-05-22 Thread Tathagata Das
How are you launching the application? sbt run? spark-submit? Local mode or a Spark standalone cluster? Are you packaging all your code into a jar? It looks to me like you have Spark classes in your execution environment but are missing some of Spark's dependencies. TD On Thu, May 22, 2014 at

Re: How to turn off MetadataCleaner?

2014-05-22 Thread Tathagata Das
The cleaner should remain up while the SparkContext is still active (not stopped). Here, however, it seems you are stopping the SparkContext (ssc.stop(true)), so the cleaner should be stopped as well. That said, there was a bug earlier where some of the cleaners may not have been stopped when the context is

Re: Unable to run a Standalone job

2014-05-22 Thread Tathagata Das
persists. On Thu, May 22, 2014 at 3:59 PM, Shrikar archak shrika...@gmail.com wrote: I am running as sbt run. I am running it locally . Thanks, Shrikar On Thu, May 22, 2014 at 3:53 PM, Tathagata Das tathagata.das1...@gmail.com wrote: How are you launching the application? sbt run ? spark

Re: any way to control memory usage when the streaming input's speed is faster than the speed at which it can be handled by Spark Streaming?

2014-05-21 Thread Tathagata Das
Unfortunately, there is no API support for this right now. You could implement it yourself by implementing your own receiver and controlling the rate at which objects are received. If you are using any of the standard receivers (Flume, Kafka, etc.), I recommend looking at the source code of the

Re: any way to control memory usage when the streaming input's speed is faster than the speed at which it can be handled by Spark Streaming?

2014-05-21 Thread Tathagata Das
this in a generic way so that it can be used for all receivers ;) TD On Wed, May 21, 2014 at 12:57 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Unfortunately, there is no API support for this right now. You could implement it yourself by implementing your own receiver and controlling the rate

Re: tests that run locally fail when run through bamboo

2014-05-21 Thread Tathagata Das
This does happen sometimes, but it is a warning because Spark is designed to try successive ports until it succeeds. So unless a crazy number of successive ports are blocked (runaway processes?? insufficient clearing of ports by the OS??), these errors should not be a problem for tests passing. On

Re: I want to filter a stream by a subclass.

2014-05-21 Thread Tathagata Das
You could do records.filter { _.isInstanceOf[Orange] }.map { _.asInstanceOf[Orange] } On Wed, May 21, 2014 at 3:28 PM, Ian Holsman i...@holsman.com.au wrote: Hi. Firstly I'm a newb (to both Scala and Spark). I have a stream that contains multiple types of records, and I would like to
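A rough sketch of that suggestion, assuming `records` is a DStream[Fruit] with an Orange subclass (both type names are illustrative); the transform/collect variant does the same thing with a partial function:

    import org.apache.spark.streaming.dstream.DStream

    // records is assumed to be a DStream[Fruit]; Fruit and Orange are illustrative types.
    val oranges: DStream[Orange] =
      records.filter(_.isInstanceOf[Orange]).map(_.asInstanceOf[Orange])

    // Equivalent, using a partial function on each RDD of the stream.
    val oranges2: DStream[Orange] =
      records.transform(rdd => rdd.collect { case o: Orange => o })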

Re: Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory

2014-05-21 Thread Tathagata Das
Are you running a vanilla Hadoop 2.3.0 or the one that comes with CDH5 / HDP(?) ? We may be able to reproduce this in that case. TD On Wed, May 21, 2014 at 8:35 PM, Tom Graves tgraves...@yahoo.com wrote: It sounds like something is closing the hdfs filesystem before everyone is really done

Re: question about the license of akka and Spark

2014-05-20 Thread Tathagata Das
Akka is under the Apache 2 license too. http://doc.akka.io/docs/akka/snapshot/project/licenses.html On Tue, May 20, 2014 at 2:16 AM, YouPeng Yang yypvsxf19870...@gmail.com wrote: Hi Just know akka is under a commercial license, however Spark is under the Apache license. Is there any problem?

Re: life if an executor

2014-05-19 Thread Tathagata Das
That's one of the main motivations for using Tachyon ;) http://tachyon-project.org/ It gives off-heap in-memory caching. And starting with Spark 0.9, you can cache any RDD in Tachyon just by specifying the appropriate StorageLevel. TD On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.com

Re: Equivalent of collect() on DStream

2014-05-16 Thread Tathagata Das
Doesn't DStream.foreach() suffice? anyDStream.foreach { rdd => // do something with rdd } On Wed, May 14, 2014 at 9:33 PM, Stephen Boesch java...@gmail.com wrote: Looking further it appears the functionality I am seeking is in the following *private[spark]* class ForEachDStream (version
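A minimal sketch of that pattern, assuming the per-batch data is small enough to bring back to the driver (anyDStream as in the reply above):

    anyDStream.foreachRDD { rdd =>
      // collect() pulls the whole batch to the driver -- the closest analogue
      // of an RDD collect() for a DStream, so keep the batches small.
      val batch = rdd.collect()
      batch.foreach(println)
    }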

Re: missing method in my slf4j after excluding Spark ZK log4j

2014-05-12 Thread Tathagata Das
This gives the dependency tree in SBT (Spark uses this). https://github.com/jrudolph/sbt-dependency-graph TD On Mon, May 12, 2014 at 4:55 PM, Sean Owen so...@cloudera.com wrote: It sounds like you are doing everything right. NoSuchMethodError suggests it's finding log4j, just not the right

Re: streaming on HDFS can detect all new files, but the sum of all the rdd.count() values does not equal what was detected

2014-05-12 Thread Tathagata Das
A very crucial thing to remember when using a file stream is that the files must be written to the monitored directory atomically. That is, once the file system shows the file in its listing, the file should not be appended to / updated after that. That often causes this kind of issue, as spark
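One way to satisfy that requirement, sketched under the assumption that both paths are placeholders on the same filesystem: write the file somewhere else first, then move it into the monitored directory so it appears atomically.

    import java.nio.file.{Files, Paths, StandardCopyOption}

    val tmp = Paths.get("/data/tmp/part-00000")
    Files.write(tmp, "line1\nline2\n".getBytes("UTF-8"))

    // The move makes the complete file visible to the file stream in one step,
    // so Spark Streaming never picks up a partially written file.
    Files.move(tmp, Paths.get("/data/incoming/part-00000"), StandardCopyOption.ATOMIC_MOVE)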

Re: Average of each RDD in Stream

2014-05-12 Thread Tathagata Das
Use DStream.foreachRDD to do an operation on the final RDD of every batch. val sumandcount = numbers.map(n => (n.toDouble, 1)).reduce{ (a, b) => (a._1 + b._1, a._2 + b._2) } sumandcount.foreachRDD { rdd => val first: (Double, Int) = rdd.take(1)(0) ; ... } DStream.reduce creates a DStream whose RDDs
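A cleaned-up sketch of that per-batch average (the numbers DStream and the printed output are illustrative):

    val sumAndCount = numbers.map(n => (n.toDouble, 1))
                             .reduce((a, b) => (a._1 + b._1, a._2 + b._2))

    sumAndCount.foreachRDD { rdd =>
      // DStream.reduce yields an RDD with at most one element per batch.
      rdd.take(1).foreach { case (sum, count) =>
        println(s"batch average = ${sum / count}")
      }
    }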

Re: Proper way to stop Spark stream processing

2014-05-12 Thread Tathagata Das
Since you are using the latest Spark code and not Spark 0.9.1 (guessed from the log messages), you can actually do graceful shutdown of a streaming context. This ensures that the receivers are properly stopped and all received data is processed and then the system terminates (stop() stays blocked
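A minimal sketch of that graceful shutdown, assuming ssc is the StreamingContext and using the two-flag stop() overload available after 0.9.1:

    // Stop the receivers, let all received data be processed, then stop the SparkContext.
    ssc.stop(stopSparkContext = true, stopGracefully = true)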

Re: How to use spark-submit

2014-05-07 Thread Tathagata Das
Doesn't the run-example script work for you? Also, are you on the latest commit of branch-1.0? TD On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Yes, I'm struggling with a similar problem where my classes are not found on the worker nodes. I'm using

Re: sbt run with spark.ContextCleaner ERROR

2014-05-07 Thread Tathagata Das
Okay, this needs to be fixed. Thanks for reporting this! On Mon, May 5, 2014 at 11:00 PM, wxhsdp wxh...@gmail.com wrote: Hi TD, I tried on v1.0.0-rc3 and still got the error

Re: spark streaming question

2014-05-05 Thread Tathagata Das
One main reason why Spark Streaming can achieve higher throughput than Storm is because Spark Streaming operates in coarser-grained batches - second-scale massive batches - which reduce per-tuple overheads in shuffles and other kinds of data movement, etc. Note that it is also true that

Re: Spark Streaming and JMS

2014-05-05 Thread Tathagata Das
A few high-level suggestions. 1. I recommend using the new Receiver API in the almost-released Spark 1.0 (see branch-1.0 / master branch on github). It's a slightly better version of the earlier NetworkReceiver, as it hides away the BlockGenerator (which needed to be unnecessarily manually started and
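A very rough sketch of what a JMS receiver might look like on that Spark 1.0 Receiver API; the class name, constructor arguments, and the JMS consume loop (left as comments) are all illustrative, and only onStart/onStop/store/isStopped are the actual Receiver API.

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class JmsReceiver(brokerUrl: String, queueName: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

      def onStart(): Unit = {
        // Start a background thread that connects to the broker and pushes messages into Spark.
        new Thread("JMS Receiver") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = {
        // Close the JMS connection/session here; the consume loop will then exit.
      }

      private def receive(): Unit = {
        // Illustrative consume loop -- the real version would block on a JMS consumer:
        // while (!isStopped()) {
        //   val msg: String = ... next JMS message body ...
        //   store(msg)   // hand each record to Spark Streaming
        // }
      }
    }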

Re: spark run issue

2014-05-04 Thread Tathagata Das
with the version spark uses. Sorry for spamming. Weide On Sat, May 3, 2014 at 7:04 PM, Tathagata Das tathagata.das1...@gmail.com wrote: I am a little confused about the version of Spark you are using. Are you using Spark 0.9.1 that uses scala 2.10.3 ? TD On Sat, May 3, 2014 at 6:16 PM, Weide

Re: sbt run with spark.ContextCleaner ERROR

2014-05-04 Thread Tathagata Das
Can you tell which version of Spark you are using? Spark 1.0 RC3, or something intermediate? And do you call sparkContext.stop at the end of your application? If so, does this error occur before or after the stop()? TD On Sun, May 4, 2014 at 2:40 AM, wxhsdp wxh...@gmail.com wrote: Hi, all i

Re: another updateStateByKey question

2014-05-02 Thread Tathagata Das
Could be a bug. Can you share code and data that I can use to reproduce this? TD On May 2, 2014 9:49 AM, Adrian Mocanu amoc...@verticalscope.com wrote: Has anyone else noticed that *sometimes* the same tuple calls the update state function twice? I have 2 tuples with the same key in 1 RDD

Re: Spark streaming

2014-05-01 Thread Tathagata Das
Take a look at the RDD.pipe() operation. That allows you to pipe the data in an RDD to any external shell command (just like a Unix shell pipe). On May 1, 2014 10:46 AM, Mohit Singh mohit1...@gmail.com wrote: Hi, I guess Spark is using streaming in context of streaming live data but what I mean
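A tiny sketch of RDD.pipe, assuming an existing SparkContext sc: each partition's elements are written to the command's stdin one per line, and the command's stdout lines become the new RDD.

    val upper = sc.parallelize(Seq("spark", "streaming")).pipe("tr 'a-z' 'A-Z'")
    upper.collect().foreach(println)   // SPARK, STREAMING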

Re: What is Seq[V] in updateStateByKey?

2014-05-01 Thread Tathagata Das
Depends on your code. Referring to the earlier example, if you do words.map(x => (x, 1)).updateStateByKey(...) then for a particular word, if a batch contains 6 occurrences of that word, then the Seq[V] will be [1, 1, 1, 1, 1, 1]. Instead, if you do words.map(x => (x, 1)).reduceByKey(_ +
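A sketch of a running word count built on updateStateByKey, showing where that Seq[V] comes in (words is assumed to be a DStream[String], ssc is the StreamingContext, and stateful operations need a checkpoint directory; the path is a placeholder):

    import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations

    ssc.checkpoint("hdfs:///checkpoints")

    val counts = words.map(x => (x, 1)).updateStateByKey[Int] {
      (newValues: Seq[Int], state: Option[Int]) =>
        // newValues is e.g. Seq(1, 1, 1, 1, 1, 1) if the word appeared 6 times in this batch
        Some(state.getOrElse(0) + newValues.sum)
    }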

Re: range partitioner with updateStateByKey

2014-05-01 Thread Tathagata Das
Ordered by what? arrival order? sort order? TD On Thu, May 1, 2014 at 2:35 PM, Adrian Mocanu amoc...@verticalscope.com wrote: If I use a range partitioner, will this make updateStateByKey take the tuples in order? Right now I see them not being taken in order (most of them are ordered

Re: What is Seq[V] in updateStateByKey?

2014-04-30 Thread Tathagata Das
Yeah, I remember changing fold to sum in a few places, probably in testsuites, but missed this example I guess. On Wed, Apr 30, 2014 at 1:29 PM, Sean Owen so...@cloudera.com wrote: S is the previous count, if any. Seq[V] are potentially many new counts. All of them have to be added together

Re: What is Seq[V] in updateStateByKey?

2014-04-29 Thread Tathagata Das
You may have already seen it, but I will mention it anyways. This example may help. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/StatefulNetworkWordCount.scala Here the state is essentially a running count of the words seen. So the value

Re: Spark's behavior

2014-04-29 Thread Tathagata Das
are not using stream context with master local, we have 1 Master and 8 Workers and 1 word source. The command line that we are using is: bin/run-example org.apache.spark.streaming.examples.JavaNetworkWordCount spark://192.168.0.13:7077 On Apr 30, 2014, at 0:09, Tathagata Das tathagata.das1

Re: Java Spark Streaming - SparkFlumeEvent

2014-04-28 Thread Tathagata Das
You can get the internal AvroFlumeEvent inside the SparkFlumeEvent using SparkFlumeEvent.event. That should probably give you all the original text data. On Mon, Apr 28, 2014 at 5:46 AM, Kulkarni, Vikram vikram.kulka...@hp.com wrote: Hi Spark-users, Within my Spark Streaming program, I am
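A small Scala sketch of that, assuming a flumeStream created with FlumeUtils and UTF-8 text in the event bodies (the charset is an assumption):

    // Pull the underlying AvroFlumeEvent out of each SparkFlumeEvent and decode its body.
    val lines = flumeStream.map { sparkEvent =>
      val avroEvent = sparkEvent.event
      new String(avroEvent.getBody.array(), "UTF-8")
    }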

Re: Bind exception while running FlumeEventCount

2014-04-22 Thread Tathagata Das
Hello Neha, This is the result of a known bug in 0.9. Can you try running the latest Spark master branch to see if this problem is resolved? TD On Tue, Apr 22, 2014 at 2:48 AM, NehaS Singh nehas.si...@lntinfotech.com wrote: Hi, I have installed

Re: Strange behaviour of different SSCs with same Kafka topic

2014-04-21 Thread Tathagata Das
help me in how to use some external data to manipulate the RDD records ? Thanks and regards Gagan B Mishra *Programmer* *560034, Bangalore* *India* On Tue, Apr 15, 2014 at 4:09 AM, Tathagata Das [via Apache Spark User List]

Re: question about the SocketReceiver

2014-04-21 Thread Tathagata Das
As long as the socket server sends data through the same connection, the existing code is going to work. The socket.getInputStream returns an input stream which will continuously allow you to pull data sent over the connection. The bytesToObject function continuously reads data from the input

Re: checkpointing without streaming?

2014-04-21 Thread Tathagata Das
Diana, that is a good question. When you persist an RDD, the system still remembers the whole lineage of parent RDDs that created that RDD. If one of the executors fails, and the persisted data is lost (both local disk and memory data will get lost), then the lineage is used to recreate the RDD. The
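A brief sketch of the distinction, assuming an existing SparkContext sc (the path and the transformation are placeholders): persist keeps the data but retains the lineage, while checkpoint writes the data to reliable storage and truncates the lineage.

    sc.setCheckpointDir("hdfs:///checkpoints")

    val derived = inputRdd.map(expensiveTransform)   // inputRdd / expensiveTransform are illustrative
    derived.persist()       // cached, but the full lineage is still kept for recovery
    derived.checkpoint()    // will be saved to the checkpoint dir ...
    derived.count()         // ... when this action materializes it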

Re: Strange behaviour of different SSCs with same Kafka topic

2014-04-14 Thread Tathagata Das
Does this happen at low event rate for that topic as well, or only for a high volume rate? TD On Wed, Apr 9, 2014 at 11:24 PM, gaganbm gagan.mis...@gmail.com wrote: I am really at my wits' end here. I have different Streaming contexts, lets say 2, and both listening to same Kafka topics. I

Spark 0.9.1 released

2014-04-09 Thread Tathagata Das
, Matei Zaharia, Nan Zhu, Nick Lanham, Patrick Wendell, Prabin Banka, Prashant Sharma, Qiuzhuang, Raymond Liu, Reynold Xin, Sandy Ryza, Sean Owen, Shixiong Zhu, shiyun.wxm, Stevo Slavić, Tathagata Das, Tom Graves, Xiangrui Meng TD

Re: Spark 0.9.1 released

2014-04-09 Thread Tathagata Das
A small additional note: Please use the direct download links in the Spark Downloads http://spark.apache.org/downloads.html page. The Apache mirrors take a day or so to sync from the main repo, so may not work immediately. TD On Wed, Apr 9, 2014 at 2:54 PM, Tathagata Das tathagata.das1

Re: CheckpointRDD has different number of partitions than original RDD

2014-04-08 Thread Tathagata Das
-examples_2.10-assembly-0.9.0-incubating.jar System Classpath http://10.10.41.67:40368/jars/spark-examples_2.10-assembly-0.9.0-incubating.jar Added By User From: Tathagata Das [mailto:tathagata.das1...@gmail.com] Sent: Monday, April 07, 2014 7:54 PM To: user@spark.apache.org

Re: CheckpointRDD has different number of partitions than original RDD

2014-04-07 Thread Tathagata Das
A few things that would be helpful. 1. Environment settings - you can find them on the environment tab in the Spark application UI 2. Are you setting the HDFS configuration correctly in your Spark program? For example, can you write an HDFS file from a Spark program (say spark-shell) to your HDFS

Re: Spark Streaming + Kafka + Mesos/Marathon strangeness

2014-03-28 Thread Tathagata Das
27, 2014 at 3:47 PM, Evgeny Shishkin itparan...@gmail.com wrote: On 28 Mar 2014, at 01:44, Tathagata Das tathagata.das1...@gmail.com wrote: The more I think about it, the problem is not about /tmp, it's more about the workers not having enough memory. Blocks of received data could be falling

Re: spark streaming and the spark shell

2014-03-27 Thread Tathagata Das
Seems like the configuration of the Spark worker is not right. Either the worker has not been given enough memory or the allocation of the memory to the RDD storage needs to be fixed. If configured correctly, the Spark workers should not get OOMs. On Thu, Mar 27, 2014 at 2:52 PM, Evgeny

Re: Spark Streaming - Shared hashmaps

2014-03-26 Thread Tathagata Das
When you say "launch long-running tasks", does it mean long-running Spark jobs/tasks, or long-running tasks in another system? If the rate of requests from Kafka is not low (in terms of records per second), you could collect the records in the driver, and maintain the shared bag in the driver. A

Re: streaming questions

2014-03-26 Thread Tathagata Das
*Answer 1:* Make sure you are using the master as local[n] with n > 1 (assuming you are running it in local mode). The way Spark Streaming works is that it assigns a core to the data receiver, and so if you run the program with only one core (i.e., with local or local[1]), then it won't have resources to
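A minimal sketch of that setup (the application name and batch interval are illustrative): use local[2] or more, so one core can run the receiver and at least one core is left for processing.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingExample")
    val ssc = new StreamingContext(conf, Seconds(1))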

Re: rdd.saveAsTextFile problem

2014-03-26 Thread Tathagata Das
Can you give us a more detailed exception + stack trace from the log? It should be in the driver log. If not, please take a look at the executor logs, through the web UI, to find the stack trace. TD On Tue, Mar 25, 2014 at 10:43 PM, gaganbm gagan.mis...@gmail.com wrote: Hi Folks, Is this

Re: spark executor/driver log files management

2014-03-25 Thread Tathagata Das
correct? Thanks, Sourav On Tue, Mar 25, 2014 at 3:26 AM, Tathagata Das tathagata.das1...@gmail.com wrote: You can use RollingFileAppenders in log4j.properties. http://logging.apache.org/log4j/extras/apidocs/org/apache/log4j/rolling/RollingFileAppender.html You can have other scripts

Re: Spark Streaming ZeroMQ Java Example

2014-03-25 Thread Tathagata Das
Unfortunately there isn't one right now. But it is probably not too hard to start with the JavaNetworkWordCount https://github.com/apache/incubator-spark/blob/master/examples/src/main/java/org/apache/spark/streaming/examples/JavaNetworkWordCount.java, and use the ZeroMQUtils in the same way as the

Re: [bug?] streaming window unexpected behaviour

2014-03-25 Thread Tathagata Das
to count how many windows there are. -A -Original Message- From: Tathagata Das [mailto:tathagata.das1...@gmail.com] Sent: March-24-14 6:55 PM To: user@spark.apache.org Cc: u...@spark.incubator.apache.org Subject: Re: [bug?] streaming window unexpected behaviour Yes, I believe

Re: Sliding Window operations do not work as documented

2014-03-24 Thread Tathagata Das
Hello Sanjay, Yes, your understanding of lazy semantics is correct. But ideally every batch should be read based on the batch interval provided in the StreamingContext. Can you open a JIRA on this? On Mon, Mar 24, 2014 at 7:45 AM, Sanjay Awatramani sanjay_a...@yahoo.com wrote: Hi All, I found

Re: [bug?] streaming window unexpected behaviour

2014-03-24 Thread Tathagata Das
Yes, I believe that is current behavior. Essentially, the first few RDDs will be partial windows (assuming window duration > sliding interval). TD On Mon, Mar 24, 2014 at 1:12 PM, Adrian Mocanu amoc...@verticalscope.com wrote: I have what I would call unexpected behaviour when using window on a
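For instance, in the sketch below (stream is the input DStream and the durations are illustrative) the 30-second window slides every 10 seconds, so the first couple of emitted RDDs cover only 10 and 20 seconds of data:

    val windowed = stream.window(Seconds(30), Seconds(10))
    windowed.count().print()   // the early counts reflect partial windows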

Re: Explain About Logs NetworkWordcount.scala

2014-03-07 Thread Tathagata Das
I am not sure how to debug this without any more information about the source. Can you monitor on the receiver side that data is being accepted by the receiver but not reported? TD On Wed, Mar 5, 2014 at 7:23 AM, eduardocalfaia e.costaalf...@unibs.itwrote: Hi TD, I have seen in the web UI

Re: NoSuchMethodError - Akka - Props

2014-03-06 Thread Tathagata Das
Are you launching your application using the scala or java command? The scala command brings in a version of Akka that we have found to cause conflicts with Spark's version of Akka. So it's best to launch using java. TD On Thu, Mar 6, 2014 at 3:45 PM, Deepak Nulu deepakn...@gmail.com wrote: I see the

Re: NoSuchMethodError in KafkaReciever

2014-03-06 Thread Tathagata Das
I don't have an Eclipse setup, so I am not sure what is going on here. I would try to use maven on the command line with a pom to see if this compiles. Also, try to clean up your system maven cache. Who knows if it had pulled in a wrong version of kafka 0.8 and is using it all the time. Blowing away the

Re: Spark streaming on ec2

2014-02-27 Thread Tathagata Das
Yes! Spark Streaming programs are just like any Spark program, and so any ec2 cluster set up using the spark-ec2 scripts can be used to run Spark Streaming programs as well. On Thu, Feb 27, 2014 at 10:11 AM, Aureliano Buendia buendia...@gmail.com wrote: Hi, Does the ec2 support for spark 0.9

Re: Spark streaming on ec2

2014-02-27 Thread Tathagata Das
means for daemonizing and monitoring spark streaming apps which are supposed to run 24/7? If not, any suggestions for how to do this? On Thu, Feb 27, 2014 at 8:23 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Zookeeper is automatically set up in the cluster as Spark uses Zookeeper
