Yes! Spark Streaming programs are just like any Spark program, so any
EC2 cluster set up using the spark-ec2 scripts can be used to run Spark
Streaming programs as well.
On Thu, Feb 27, 2014 at 10:11 AM, Aureliano Buendia buendia...@gmail.com wrote:
Hi,
Does the EC2 support for Spark 0.9
mean
for daemonizing and monitoring Spark Streaming apps which are supposed to
run 24/7? If not, any suggestions for how to do this?
On Thu, Feb 27, 2014 at 8:23 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Zookeeper is automatically set up in the cluster as Spark uses Zookeeper
Are you launching your application using the scala or java command? The scala
command brings in a version of Akka that we have found to cause conflicts
with Spark's version of Akka. So it's best to launch using java.
TD
On Thu, Mar 6, 2014 at 3:45 PM, Deepak Nulu deepakn...@gmail.com wrote:
I see the
I don't have an Eclipse setup so I am not sure what is going on here. I would
try to use Maven on the command line with a pom to see if this compiles.
Also, try cleaning up your system Maven cache; who knows if it had pulled in
a wrong version of Kafka 0.8 and been using it all along. Blowing away the
I am not sure how to debug this without any more information about the
source. Can you monitor on the receiver side that data is being accepted by
the receiver but not reported?
TD
On Wed, Mar 5, 2014 at 7:23 AM, eduardocalfaia e.costaalf...@unibs.it wrote:
Hi TD,
I have seen in the web UI
Hello Sanjay,
Yes, your understanding of lazy semantics is correct. But ideally
every batch should read based on the batch interval provided in the
StreamingContext. Can you open a JIRA on this?
On Mon, Mar 24, 2014 at 7:45 AM, Sanjay Awatramani
sanjay_a...@yahoo.com wrote:
Hi All,
I found
Yes, I believe that is the current behavior. Essentially, the first few
RDDs will be partial windows (assuming window duration > sliding
interval).
TD
On Mon, Mar 24, 2014 at 1:12 PM, Adrian Mocanu
amoc...@verticalscope.com wrote:
I have what I would call unexpected behaviour when using window on a
correct?
Thanks,
Sourav
On Tue, Mar 25, 2014 at 3:26 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
You can use RollingFileAppenders in log4j.properties.
http://logging.apache.org/log4j/extras/apidocs/org/apache/log4j/rolling/RollingFileAppender.html
You can have other scripts
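A minimal log4j.properties sketch using the extras RollingFileAppender might look like the following; the appender name, file pattern, and layout here are illustrative assumptions, not taken from the thread:

```properties
# Sketch only: assumes the log4j "extras" jar is on the classpath.
log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.rolling.RollingFileAppender
log4j.appender.rolling.rollingPolicy=org.apache.log4j.rolling.TimeBasedRollingPolicy
log4j.appender.rolling.rollingPolicy.fileNamePattern=logs/spark-%d{yyyy-MM-dd}.log.gz
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d %p %c: %m%n
```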
Unfortunately there isn't one right now. But it is probably not too hard to
start with the JavaNetworkWordCount
(https://github.com/apache/incubator-spark/blob/master/examples/src/main/java/org/apache/spark/streaming/examples/JavaNetworkWordCount.java),
and use the ZeroMQUtils in the same way as the
to count how many
windows there are.
-A
-Original Message-
From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: March-24-14 6:55 PM
To: user@spark.apache.org
Cc: u...@spark.incubator.apache.org
Subject: Re: [bug?] streaming window unexpected behaviour
Yes, I believe
When you say launch long-running tasks does it mean long running Spark
jobs/tasks, or long-running tasks in another system?
If the rate of records from Kafka is not too high (in terms of records per
second), you could collect the records in the driver, and maintain the
shared bag in the driver. A
*Answer 1:* Make sure you are using master as local[n] with n > 1
(assuming you are running it in local mode). The way Spark Streaming
works is that it assigns a core to the data receiver, and so if you
run the program with only one core (i.e., with local or local[1]),
then it won't have resources to
Can you give us a more detailed exception + stack trace in the log? It
should be in the driver log. If not, please take a look at the executor
logs, through the web ui to find the stack trace.
TD
On Tue, Mar 25, 2014 at 10:43 PM, gaganbm gagan.mis...@gmail.com wrote:
Hi Folks,
Is this
Seems like the configuration of the Spark worker is not right. Either the
worker has not been given enough memory or the allocation of the memory to
the RDD storage needs to be fixed. If configured correctly, the Spark
workers should not get OOMs.
On Thu, Mar 27, 2014 at 2:52 PM, Evgeny
27, 2014 at 3:47 PM, Evgeny Shishkin itparan...@gmail.com wrote:
On 28 Mar 2014, at 01:44, Tathagata Das tathagata.das1...@gmail.com
wrote:
The more I think about it, the problem is not about /tmp; it's more about
the workers not having enough memory. Blocks of received data could be
falling
Few things that would be helpful.
1. Environment settings - you can find them on the environment tab in the
Spark application UI
2. Are you setting the HDFS configuration correctly in your Spark program?
For example, can you write a HDFS file from a Spark program (say
spark-shell) to your HDFS
-examples_2.10-assembly-0.9.0-incubating.jar
System Classpath
http://10.10.41.67:40368/jars/spark-examples_2.10-assembly-0.9.0-incubating.jar
Added By User
From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Monday, April 07, 2014 7:54 PM
To: user@spark.apache.org
, Matei Zaharia, Nan Zhu, Nick Lanham,
Patrick Wendell, Prabin Banka, Prashant Sharma, Qiuzhuang,
Raymond Liu, Reynold Xin, Sandy Ryza, Sean Owen, Shixiong Zhu,
shiyun.wxm, Stevo Slavić, Tathagata Das, Tom Graves, Xiangrui Meng
TD
A small additional note: Please use the direct download links in the Spark
Downloads http://spark.apache.org/downloads.html page. The Apache mirrors
take a day or so to sync from the main repo, so they may not work immediately.
TD
On Wed, Apr 9, 2014 at 2:54 PM, Tathagata Das
tathagata.das1
Does this happen at low event rate for that topic as well, or only for a
high volume rate?
TD
On Wed, Apr 9, 2014 at 11:24 PM, gaganbm gagan.mis...@gmail.com wrote:
I am really at my wits' end here.
I have different Streaming contexts, lets say 2, and both listening to same
Kafka topics. I
help me in how to use some external data to manipulate the RDD
records ?
Thanks and regards
Gagan B Mishra
*Programmer*
*560034, Bangalore*
*India*
On Tue, Apr 15, 2014 at 4:09 AM, Tathagata Das [via Apache Spark User
List] [hidden email]
As long as the socket server sends data through the same connection, the
existing code is going to work. The socket.getInputStream returns a input
stream which will continuously allow you to pull data sent over the
connection. The bytesToObject function continuously reads data from the
input
Diana, that is a good question.
When you persist an RDD, the system still remembers the whole lineage of
parent RDDs that created that RDD. If one of the executors fails, and the
persisted data is lost (both local disk and memory data will get lost), then
the lineage is used to recreate the RDD. The
Hello Neha,
This is the result of a known bug in 0.9. Can you try running the latest
Spark master branch to see if this problem is resolved?
TD
On Tue, Apr 22, 2014 at 2:48 AM, NehaS Singh nehas.si...@lntinfotech.com wrote:
Hi,
I have installed
You can get the internal AvroFlumeEvent inside the SparkFlumeEvent using
SparkFlumeEvent.event. That should probably give you all the original text
data.
On Mon, Apr 28, 2014 at 5:46 AM, Kulkarni, Vikram vikram.kulka...@hp.com wrote:
Hi Spark-users,
Within my Spark Streaming program, I am
You may have already seen it, but I will mention it anyways. This example
may help.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/streaming/examples/StatefulNetworkWordCount.scala
Here the state is essentially a running count of the words seen. So the
value
are not using stream context with master local, we have 1 Master and 8
Workers and 1 word source. The command line that we are using is:
bin/run-example org.apache.spark.streaming.examples.JavaNetworkWordCount
spark://192.168.0.13:7077
On Apr 30, 2014, at 0:09, Tathagata Das tathagata.das1
Yeah, I remember changing fold to sum in a few places, probably in
testsuites, but missed this example I guess.
On Wed, Apr 30, 2014 at 1:29 PM, Sean Owen so...@cloudera.com wrote:
S is the previous count, if any. Seq[V] are potentially many new
counts. All of them have to be added together
Take a look at the RDD.pipe() operation. That allows you to pipe the data
in a RDD to any external shell command (just like Unix Shell pipe).
On May 1, 2014 10:46 AM, Mohit Singh mohit1...@gmail.com wrote:
Hi,
I guess Spark is using streaming in context of streaming live data but
what I mean
Depends on your code. Referring to the earlier example, if you do
words.map(x => (x, 1)).updateStateByKey()
then for a particular word, if a batch contains 6 occurrences of that word,
then the Seq[V] will be [1, 1, 1, 1, 1, 1]
Instead if you do
words.map(x => (x, 1)).reduceByKey(_ + _)
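The update-function semantics described above can be sketched in plain Scala, without Spark; `updateFunc` here is a hypothetical function of the shape updateStateByKey expects:

```scala
// Hypothetical update function: new values for a key in this batch,
// plus the previous state (running count), giving the new state.
val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))

// A batch with 6 occurrences of a word, no previous state:
val first = updateFunc(Seq(1, 1, 1, 1, 1, 1), None)   // Some(6)
// The next batch with 2 more occurrences of the same word:
val second = updateFunc(Seq(1, 1), first)             // Some(8)
```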
Ordered by what? arrival order? sort order?
TD
On Thu, May 1, 2014 at 2:35 PM, Adrian Mocanu amoc...@verticalscope.com wrote:
If I use a range partitioner, will this make updateStateByKey take the
tuples in order?
Right now I see them not being taken in order (most of them are ordered
Could be a bug. Can you share a code with data that I can use to reproduce
this?
TD
On May 2, 2014 9:49 AM, Adrian Mocanu amoc...@verticalscope.com wrote:
Has anyone else noticed that *sometimes* the same tuple calls update
state function twice?
I have 2 tuples with the same key in 1 RDD
with
the version spark uses. Sorry for spamming.
Weide
On Sat, May 3, 2014 at 7:04 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
I am a little confused about the version of Spark you are using. Are you
using Spark 0.9.1 that uses scala 2.10.3 ?
TD
On Sat, May 3, 2014 at 6:16 PM, Weide
Can you tell which version of Spark you are using? Spark 1.0 RC3, or
something intermediate?
And do you call sparkContext.stop at the end of your application? If so,
does this error occur before or after the stop()?
TD
On Sun, May 4, 2014 at 2:40 AM, wxhsdp wxh...@gmail.com wrote:
Hi, all
i
One main reason why Spark Streaming can achieve higher throughput than
Storm is because Spark Streaming operates in coarser-grained batches -
second-scale massive batches - which reduce per-tuple overheads in
shuffles and other kinds of data movements.
Note that it is also true that
A few high-level suggestions.
1. I recommend using the new Receiver API in the almost-released Spark 1.0 (see
branch-1.0 / master branch on github). It's a slightly better version of the
earlier NetworkReceiver, as it hides away the BlockGenerator (which needed to
be unnecessarily manually started and
Doesn't the run-example script work for you? Also, are you on the latest
commit of branch-1.0?
TD
On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta soumya.sima...@gmail.com wrote:
Yes, I'm struggling with a similar problem where my class are not found on
the worker nodes. I'm using
Okay, this needs to be fixed. Thanks for reporting this!
On Mon, May 5, 2014 at 11:00 PM, wxhsdp wxh...@gmail.com wrote:
Hi, TD
i tried on v1.0.0-rc3 and still got the error
This gives the dependency tree in SBT (Spark uses this).
https://github.com/jrudolph/sbt-dependency-graph
TD
On Mon, May 12, 2014 at 4:55 PM, Sean Owen so...@cloudera.com wrote:
It sounds like you are doing everything right.
NoSuchMethodError suggests it's finding log4j, just not the right
A very crucial thing to remember when using file stream is that the files
must be written to the monitored directory atomically. That is, when the
file system shows the file in its listing, the file should not be appended to /
updated after that. That often causes this kind of issue, as Spark
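One common way to satisfy this, sketched below under the assumption that the temp file and the monitored directory are on the same filesystem: write the file somewhere else first, then atomically rename it into the monitored directory (paths and names here are made up for illustration).

```scala
import java.nio.file.{Files, StandardCopyOption}

// Write the data to a temp file first...
val tmp = Files.createTempFile("events-", ".tmp")
Files.write(tmp, "record 1\nrecord 2\n".getBytes("UTF-8"))

// ...then move it into the monitored directory in one atomic step,
// so the file stream never sees a partially written file.
val monitoredDir = Files.createTempDirectory("monitored") // stand-in for the real dir
val target = monitoredDir.resolve("events-0001.txt")
Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
```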
Use DStream.foreachRDD to do an operation on the final RDD of every batch.
val sumandcount = numbers.map(n => (n.toDouble, 1)).reduce { (a, b) => (a._1
+ b._1, a._2 + b._2) }
sumandcount.foreachRDD { rdd => val first: (Double, Int) = rdd.take(1)(0) ;
... }
DStream.reduce creates a DStream whose RDDs
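The (sum, count) reduce used above is ordinary Scala; a Spark-free sketch of the same per-batch computation:

```scala
// Same logic as the DStream snippet, on a plain collection:
// pair each number with a 1, then add sums and counts together.
val numbers = List(1.0, 2.0, 3.0, 4.0)
val (sum, count) = numbers.map(n => (n, 1))
  .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
val mean = sum / count   // 2.5
```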
Since you are using the latest Spark code and not Spark 0.9.1 (guessed from
the log messages), you can actually do graceful shutdown of a streaming
context. This ensures that the receivers are properly stopped and all
received data is processed and then the system terminates (stop() stays
blocked
Doesn't DStream.foreach() suffice?
anyDStream.foreach { rdd =>
// do something with rdd
}
On Wed, May 14, 2014 at 9:33 PM, Stephen Boesch java...@gmail.com wrote:
Looking further it appears the functionality I am seeking is in the
following *private[spark]* class ForEachDStream
(version
That's one of the main motivations for using Tachyon ;)
http://tachyon-project.org/
It gives off-heap in-memory caching. And starting with Spark 0.9, you can cache
any RDD in Tachyon just by specifying the appropriate StorageLevel.
TD
On Mon, May 19, 2014 at 10:22 PM, Mohit Jaggi mohitja...@gmail.com
Akka is under Apache 2 license too.
http://doc.akka.io/docs/akka/snapshot/project/licenses.html
On Tue, May 20, 2014 at 2:16 AM, YouPeng Yang yypvsxf19870...@gmail.com wrote:
Hi
Just learned that Akka is under a commercial license; however, Spark is under the
Apache
license.
Is there any problem?
Unfortunately, there is no API support for this right now. You could
implement it yourself by implementing your own receiver and controlling the
rate at which objects are received. If you are using any of the standard
receivers (Flume, Kafka, etc.), I recommend looking at the source code of
the
this in a generic way so that it can be used for all receivers ;)
TD
On Wed, May 21, 2014 at 12:57 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Unfortunately, there is no API support for this right now. You could
implement it yourself by implementing your own receiver and controlling the
rate
This does happen sometimes, but it is a warning because Spark is designed to
try successive ports until it succeeds. So unless a crazy number of
successive ports are blocked (runaway processes?? insufficient clearing of
ports by OS??), these errors should not be a problem for tests passing.
On
You could do
records.filter { _.isInstanceOf[Orange] } .map { _.asInstanceOf[Orange] }
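A self-contained sketch of that pattern, with hypothetical Fruit/Orange/Apple types standing in for the record classes:

```scala
// Hypothetical record hierarchy for illustration.
sealed trait Fruit
case class Orange(id: Int) extends Fruit
case class Apple(id: Int) extends Fruit

val records: List[Fruit] = List(Orange(1), Apple(2), Orange(3))

// Keep only the Orange records, then downcast them.
val oranges = records.filter(_.isInstanceOf[Orange]).map(_.asInstanceOf[Orange])

// An equivalent, more idiomatic alternative using collect:
val oranges2 = records.collect { case o: Orange => o }
```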
On Wed, May 21, 2014 at 3:28 PM, Ian Holsman i...@holsman.com.au wrote:
Hi.
Firstly I'm a newb (to both Scala Spark).
I have a stream, that contains multiple types of records, and I would like
to
Are you running a vanilla Hadoop 2.3.0 or the one that comes with CDH5 /
HDP(?) ? We may be able to reproduce this in that case.
TD
On Wed, May 21, 2014 at 8:35 PM, Tom Graves tgraves...@yahoo.com wrote:
It sounds like something is closing the hdfs filesystem before everyone is
really done
How are you launching the application? sbt run ? spark-submit? local
mode or Spark standalone cluster? Are you packaging all your code into
a jar?
Looks to me that you seem to have spark classes in your execution
environment but missing some of Spark's dependencies.
TD
On Thu, May 22, 2014 at
The cleaner should remain up while the SparkContext is still active (not
stopped). However, here it seems you are stopping the SparkContext
(ssc.stop(true)), so the cleaner should be stopped. However, there was a bug
earlier where some of the cleaners may not have been stopped when the
context is
persists.
On Thu, May 22, 2014 at 3:59 PM, Shrikar archak shrika...@gmail.com wrote:
I am running as sbt run. I am running it locally .
Thanks,
Shrikar
On Thu, May 22, 2014 at 3:53 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
How are you launching the application? sbt run ? spark
Currently Spark Streaming does not support addition/deletion/modification
of DStream after the streaming context has been started.
Nor can you restart a stopped streaming context.
Also, multiple spark contexts (and therefore multiple streaming contexts)
cannot be run concurrently in the same JVM.
I am assuming that you are referring to the "OneForOneStrategy: key not
found: 1401753992000 ms" error, and not to the previous "Time 1401753992000
ms is invalid". Those two seem a little unrelated to me. Can you give
us the stack trace associated with the key-not-found error?
TD
On Mon, Jun 2,
Do you have the info level logs of the application? Can you grep the
value 32855
to find any references to it? Also, what version of Spark are you using
(so that I can match the stack trace; it does not seem to match with Spark
1.0)?
TD
On Mon, Jun 2, 2014 at 3:27 PM, Michael Chang
Can you give all the logs? Would like to see what is clearing the key
1401754908000
ms
TD
On Mon, Jun 2, 2014 at 5:38 PM, Vadim Chekan kot.bege...@gmail.com wrote:
Ok, it seems like "Time ... is invalid" is part of the normal workflow, when
window DStream will ignore RDDs at moments in time when
, Michael Chang m...@tellapart.com wrote:
Thanks Tathagata,
Thanks for all your hard work! In the future, is it possible to mark
experimental features as such on the online documentation?
Thanks,
Michael
On Mon, Jun 2, 2014 at 6:12 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
TD
On Tue, Jun 3, 2014 at 10:09 AM, Michael Chang m...@tellapart.com wrote:
I only had the warning level logs, unfortunately. There were no other
references of 32855 (except a repeated stack trace, I believe). I'm using
Spark 0.9.1
On Mon, Jun 2, 2014 at 5:50 PM, Tathagata Das
wrote:
Hi Tathagata,
Thanks for your help! By not using coalesced RDD, do you mean not
repartitioning my Dstream?
Thanks,
Mike
On Tue, Jun 3, 2014 at 12:03 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
I think I know what is going on! This is probably a race condition
Yes, thanks for updating this old thread! We heard our community's demands and
added support for Java receivers!
TD
On Wed, Jun 4, 2014 at 12:15 PM, lbustelo g...@bustelos.com wrote:
Note that what TD was referring to above is already in 1.0.0
You should be able to see the streaming tab in the Spark web UI (running on
port 4040) if you have created a StreamingContext and you are using Spark 1.0.
TD
On Thu, Jun 12, 2014 at 1:06 AM, Ravi Hemnani raviiihemn...@gmail.com
wrote:
Hey,
I did
This is very odd. If it is running fine on Mesos, I don't see an obvious
reason why it won't work on a Spark standalone cluster.
Is the .4 million file already present in the monitored directory when the
context is started? In that case, the file will not be picked up (unless
textFileStream is created
That's quite odd. Yes, with checkpointing the lineage does not increase. Can
you tell which stage of the processing of each batch is causing the
increase in the processing time?
Also, what is the batch interval, and the checkpoint interval?
TD
On Thu, Jun 19, 2014 at 8:45 AM, Skogberg, Fredrik
zeroTime marks the time when the streaming job started, and the first batch
of data is from zeroTime to zeroTime + slideDuration. The validity check of
(time - zeroTime) being a multiple of slideDuration is to ensure that for a
given dstream, it generates RDDs at the right times. For example, say the
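The check described above amounts to simple modular arithmetic; a sketch with times as plain milliseconds (the values are illustrative):

```scala
// A dstream only generates RDDs at zeroTime + k * slideDuration.
val zeroTime = 1000L        // when the streaming job started (ms)
val slideDuration = 2000L   // slide interval (ms)

def isValidTime(time: Long): Boolean =
  (time - zeroTime) % slideDuration == 0

val valid = isValidTime(5000L)    // 5000 - 1000 = 4000, a multiple of 2000
val invalid = isValidTime(4000L)  // 4000 - 1000 = 3000, not a multiple
```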
Hello all,
Apologies for the late response, this thread went below my radar. There are
a number of things that can be done to improve the performance. Here are
some of them of the top of my head based on what you have mentioned. Most
of them are mentioned in the streaming guide's performance
If the metadata is directly related to each individual record, then it can
be done either way. Since I am not sure how easy or hard it will be for
you to add tags before putting the data into Spark Streaming, it's hard to
recommend one method over the other.
However, if the metadata is related to
Are you by any chance using only memory in the storage level of the input
streams?
TD
On Mon, Jun 30, 2014 at 5:53 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Bill,
let's say the processing time is t' and the window size t. Spark does not
*require* t' < t. In fact, for *temporary* peaks in
1. Multiple output operations are processed in the order they are defined.
That is because by default only one output operation is processed at a
time. This *can* be parallelized using an undocumented config parameter
spark.streaming.concurrentJobs, which is by default set to 1.
2. Yes, the output
I confirm that is indeed the case. It is designed to be so because it
keeps things simpler - less chances of issues related to cleanup when
stop() is called. Also it keeps things consistent with the spark context -
once a spark context is stopped it cannot be used any more.
You can create a new
Do you see any errors in the logs of the driver?
On Thu, Jul 10, 2014 at 1:21 AM, Haopu Wang hw...@qilinsoft.com wrote:
I'm running an App for hours in a standalone cluster. From the data
injector and Streaming tab of web ui, it's running well.
However, I see quite a lot of Active stages in
I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with
your dataset as well, I got the expected answer. And I believe that even
though initialization is done using sampling, the example actually sets the
seed to a constant 42, so the result should always be the same no matter
how
The implementation of the input-stream-to-iterator function in #2 is
incorrect. The function should be such that, when the hasNext is called on
the iterator, it should try to read from the buffered reader. If an object
(that is, line) can be read, then return it, otherwise block and wait for
data
Spark Streaming uses twitter4j 3.0.3; 3.0.6 should probably work fine. The
exception that you are seeing is something that should be looked into. Can
you give us more logs (especially executor logs) with the stack traces that
have the error?
TD
On Thu, Jul 10, 2014 at 2:42 PM, Nick Chammas
Are you specifying the number of reducers in all the DStream.ByKey
operations? If the reduce by key is not set, then the number of reducers
used in the stages can keep changing across batches.
TD
On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay bill.jaypeter...@gmail.com wrote:
Hi all,
I have a
This bug has been fixed. Either use the master branch of Spark, or maybe
wait a few days for Spark 1.0.1 to be released (voting has successfully
closed).
TD
On Thu, Jul 10, 2014 at 2:33 AM, richiesgr richie...@gmail.com wrote:
Hi
I get exactly the same problem here, do you've found the
The fileStream is not designed to work with continuously updating files, as
one of the main design goals of Spark is immutability (to guarantee
fault-tolerance by recomputation), and files that are appended to (mutated)
defeat that. It is rather designed to pick up new files added atomically
(using
then 3 minutes. Thanks!
Bill
On Thu, Jul 10, 2014 at 4:28 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Are you specifying the number of reducers in all the DStream.ByKey
operations? If the reduce by key is not set, then the number of reducers
used in the stages can keep
Yeah, the right solution is to have something like SchemaDStream, where the
schema of all the schemaRDD generated by it can be stored. Something I
really would like to see happen in the future :)
TD
On Thu, Jul 10, 2014 at 6:37 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
I think it
Right, this uses NextIterator, which is elsewhere in the repo. It just makes
it cleaner to implement a custom iterator. But I guess you got the high
level point, so it's okay.
TD
On Thu, Jul 10, 2014 at 7:21 PM, kytay kaiyang@gmail.com wrote:
Hi TD
Thanks.
I have problem understanding
Do you want to continuously maintain the set of unique integers seen since
the beginning of stream?
var uniqueValuesRDD: RDD[Int] = ...
dstreamOfIntegers.transform(newDataRDD => {
val newUniqueValuesRDD = newDataRDD.union(distinctValues).distinct
uniqueValuesRDD = newUniqueValuesRDD
//
instead of accumulative number of unique integers.
I do have two questions about your code:
1. Why do we need uniqueValuesRDD? Why do we need to call
uniqueValuesRDD.checkpoint()?
2. Where is distinctValues defined?
Bill
On Thu, Jul 10, 2014 at 8:46 PM, Tathagata Das
tathagata.das1
for this.
Thanks.
On Thursday, July 10, 2014 7:24 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
How are you supplying the text file?
On Wed, Jul 9, 2014 at 11:51 AM, M Singh mans6si...@yahoo.com wrote:
Hi Folks:
I am working on an application which uses spark streaming (version
What is the error you are getting when you say "I was trying to write the
data to hdfs.. but it fails"?
TD
On Thu, Jul 10, 2014 at 1:36 PM, Sundaram, Muthu X.
muthu.x.sundaram@sabre.com wrote:
I am new to spark. I am trying to do the following.
Netcat → Flume → Spark Streaming (process Flume
...@yahoo.com wrote:
So, is it expected for the process to generate stages/tasks even after
processing a file ?
Also, is there a way to figure out the file that is getting processed and
when that process is complete ?
Thanks
On Friday, July 11, 2014 1:51 PM, Tathagata Das
tathagata.das1
The same executor can be used for both receiving and processing,
irrespective of the deployment mode (yarn, spark standalone, etc.) It boils
down to the number of cores / task slots that executor has. Each receiver
is like a long running task, so each of them occupy a slot. If there are
free slots
that the computation can be efficiently finished. I am
not sure how to achieve this.
Thanks!
Bill
On Thu, Jul 10, 2014 at 5:59 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Can you try setting the number-of-partitions in all the shuffle-based
DStream operations, explicitly. It may
You don't get any exception from twitter.com, saying credential error or
something?
I have seen this happen when one was behind a VPN to his office, and
probably twitter was blocked in their office.
You could be having a similar issue.
TD
On Fri, Jul 11, 2014 at 2:57 PM, SK
Yes, even though we don't have immediate plans, I definitely would like to
see it happen some time in the not-so-distant future.
TD
On Thu, Jul 10, 2014 at 7:55 PM, Shao, Saisai saisai.s...@intel.com wrote:
No specific plans to do so, since there is some functional loss, like
time-based
I totally agree that if we are able to do this it will be very cool.
However, this requires having a common trait / interface between RDD and
DStream, which we don't have as of now. It would be very cool though. On my
wish list for sure.
TD
On Thu, Jul 10, 2014 at 11:53 AM, mshah
Does nothing get printed on the screen? If you are not getting any tweets
but Spark Streaming is running successfully, you should at least get counts
being printed every batch (which would be zero). But if they are not being
printed either, check the Spark web UI to see whether stages are running or not. If
?
Best,
Fang, Yan
yanfang...@gmail.com
+1 (206) 849-4108
On Fri, Jul 11, 2014 at 1:45 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
The same executor can be used for both receiving and processing,
irrespective of the deployment mode (yarn, spark standalone, etc.) It boils
down
days. No findings why there are
always 2 executors for the groupBy stage. Thanks a lot!
Bill
On Fri, Jul 11, 2014 at 1:50 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Can you show us the program that you are running. If you are setting
number of partitions in the XYZ-ByKey operation
This is not in the current streaming API.
Queue stream is useful for testing with generated RDDs, but not for actual
data. For actual data stream, the slack time can be implemented by doing
DStream.window on a larger window that take slack time in consideration,
and then the required
know it's just an intermediate hack, but still ;-)
greetz,
aℕdy ℙetrella
about.me/noootsab
On Sat, Jul 12, 2014 at 12:57 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
I totally agree that if we are able to do
Yes, that's a bug I just discovered. Race condition in the Twitter Receiver;
will fix asap.
Here is the JIRA https://issues.apache.org/jira/browse/SPARK-2464
TD
On Sat, Jul 12, 2014 at 3:21 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
To add a potentially relevant piece of
Try doing DStream.foreachRDD and then printing the RDD count and further
inspecting the RDD.
On Jul 13, 2014 1:03 AM, Walrus theCat walrusthe...@gmail.com wrote:
Hi,
I have a DStream that works just fine when I say:
dstream.print
If I say:
dstream.map((_, 1)).print
that works, too.
In case you still have issues with duplicate files in uber jar, here is a
reference sbt file with assembly plugin that deals with duplicates
https://github.com/databricks/training/blob/sparkSummit2014/streaming/scala/build.sbt
On Fri, Jul 11, 2014 at 10:06 AM, Bill Jay
at 5:48 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
In case you still have issues with duplicate files in uber jar, here is a
reference sbt file with assembly plugin that deals with duplicates
https://github.com/databricks/training/blob/sparkSummit2014/streaming/scala/build.sbt
());
System.out.println("LOG RECORD = " + logRecord);
"I was trying to write the data to hdfs.. but
it fails"
*From:* Tathagata Das [mailto:tathagata.das1...@gmail.com]
*Sent:* Friday, July 11, 2014 1:43 PM
*To:* user@spark.apache.org
*Cc:* u