Why isnt a simple window function sufficient?
eventData.window(Minutes(15), Seconds(3)) will keep generating RDDs every 3
second, each containing last 15 minutes of data.
TD
On Wed, Aug 6, 2014 at 3:43 PM, salemi alireza.sal...@udo.edu wrote:
Hi,
I have a DStream called eventData and it
Hey Aniket,
Great thoughts! I understand the usecase. But as you have realized yourself
it is not trivial to cleanly stream a RDD as a DStream. Since RDD
operations are defined to be scan based, it is not efficient to define RDD
based on slices of data within a partition of another RDD, using
I narrowed down the error. Unfortunately this is not quick fix. I have
opened a JIRA for this.
https://issues.apache.org/jira/browse/SPARK-2892
On Wed, Aug 6, 2014 at 3:59 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Okay let me give it a shot.
On Wed, Aug 6, 2014 at 3:57 PM
Okay, going back to your origin question, it wasnt clear what is the reduce
function that you are trying to implement. Going by the 2nd example using
window() operation, following by a count+filter (using sql), I am guessing
you are trying to maintain a count of the all the active states in the
It could be because of the variable enableOpStat. Since its defined
outside foreachRDD, referring to it inside the rdd.foreach is probably
causing the whole streaming context being included in the closure. Scala
funkiness. Try this, see if it works.
msgCount.join(ddCount).foreachRDD((rdd:
The problem boils down to how to write an RDD in that way. You could use
the HDFS Filesystem API to write each partition directly.
pairRDD.groupByKey().foreachPartition(iterator =
iterator.map { case (key, values) =
// Open an output stream to destination file
base-path/key/whatever
For future reference in this thread, a better set of examples than the
MetricAggregatorHBase
on the JIRA to look at are here
https://github.com/tmalaska/SparkOnHBase
On Thu, Aug 7, 2014 at 1:41 AM, Khanderao Kand khanderao.k...@gmail.com
wrote:
I hope this has been resolved, were u
for BlockGenerator
called at time 1406336129800
14/07/25 17:55:30 [DEBUG] RecurringTimer: Callback for BlockGenerator
called at time 140633613
On Jul 25, 2014, at 3:20 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Is this error on the executor or on the driver? Can you provide a larger
, Padmanabhan, Mahesh (contractor)
mahesh.padmanab...@twc-contractor.com wrote:
Thanks TD but unfortunately that did not work.
From: Tathagata Das tathagata.das1...@gmail.com
Date: Thursday, August 7, 2014 at 10:55 AM
To: Mahesh Padmanabhan mahesh.padmanab...@twc-contractor.com
Cc: user
I am not sure if it is a typo-error or not, but how are you using
groupByKey to get the summed_values? Assuming you meant reduceByKey(),
these workflows seems pretty efficient.
TD
On Thu, Aug 7, 2014 at 10:18 AM, Dan H. dch.ema...@gmail.com wrote:
I wanted to post for validation to understand
From the extended info, I see that you have a function called
createStreamingContext() in your code. Somehow that is getting referenced
in in the foreach function. Is the whole foreachRDD code inside the
createStreamingContext() function? Did you try marking the ssc field as
transient?
Here is a
= KafkaConsumer.messageStream(ssc)
messageStream.foreachRDD(rdd = {
A.func1(ssc.sparkContext)
}
Seems like the call A.func1(ssc.sparkContext) above is the cause of the
exception.
Thanks,
Mahesh
From: Tathagata Das tathagata.das1...@gmail.com
Date: Thursday, August 7, 2014 at 1:11 PM
To: amit amit.codenam
LOL! Glad it solved it.
TD
On Thu, Aug 7, 2014 at 2:23 PM, Padmanabhan, Mahesh (contractor)
mahesh.padmanab...@twc-contractor.com wrote:
Slap my head moment ā using rdd.context solved it!
Thanks TD,
Mahesh
From: Tathagata Das tathagata.das1...@gmail.com
Date: Thursday, August 7, 2014
Are you running on a cluster but giving a local path in ssc.checkpoint(...)
?
TD
On Thu, Aug 7, 2014 at 3:24 PM, salemi alireza.sal...@udo.edu wrote:
Hi,
Thank you or your help. With the new code I am getting the following error
in the driver. What is going wrong here?
14/08/07 13:22:28
That is required for driver fault-tolerance, as well as for some
transformations like updateSTateByKey that persist information across
batches. It must be a HDFS directory when running on a cluster.
TD
On Thu, Aug 7, 2014 at 4:25 PM, salemi alireza.sal...@udo.edu wrote:
That is correct. I do
You can do the following.
var globalMax = ...
dstreamOfNumericalType.foreachRDD( rdd = {
globalMax = math.max(rdd.max, globalMax)
})
globalMax will keep getting updated after every batch
TD
On Thu, Aug 7, 2014 at 5:31 PM, bumble123 tc1...@att.com wrote:
I can't figure out how to use
Do you mean that you want a continuously updated count as more
events/records are received in the DStream (remember, DStream is a
continuous stream of data)? Assuming that is what you want, you can use a
global counter
var globalCount = 0L
dstream.count().foreachRDD(rdd = { globalCount +=
You can always define an arbitrary RDD-to-RDD function, use it from both
Spark and Spark Streaming. For example,
def myTransofmration(rdd: RDD[X]): RDD[Y] = { }
In spark you can obvious apply it on an RDD. In spark streaming, you can
apply on the RDDs of a DStream by
The long lineage causes a long/deep Java object tree (DAG of RDD objects),
which needs to be serialized as part of the task creation. When
serializing, the whole object DAG needs to be traversed leading to the
stackoverflow error.
TD
On Mon, Aug 11, 2014 at 7:14 PM, randylu randyl...@gmail.com
FlatMap the JavaRDDBooleanPair[] to JavaRDDBooleanPair. Then it should
work.
TD
On Thu, Aug 14, 2014 at 1:23 AM, Gefei Li gefeili.2...@gmail.com wrote:
Hello,
I wrote a class named BooleanPair:
public static class BooleanPairet implements Serializable{
public Boolean
Can you be a bit more specific about what you mean by lambda architecture?
On Thu, Aug 14, 2014 at 2:27 PM, salemi alireza.sal...@udo.edu wrote:
Hi,
How would you implement the batch layer of lamda architecture with
spark/spark streaming?
Thanks,
Ali
--
View this message in context:
If just want arbitrary unique id attached to each record in a dstream (no
ordering etc), then why not create generate and attach an UUID to each
record?
On Wed, Aug 27, 2014 at 4:18 PM, Soumitra Kumar kumar.soumi...@gmail.com
wrote:
I see a issue here.
If rdd.id is 1000 then rdd.id *
:
Yes, that is an option.
I started with a function of batch time, and index to generate id as long.
This may be faster than generating UUID, with added benefit of sorting
based on time.
- Original Message -
From: Tathagata Das tathagata.das1...@gmail.com
To: Soumitra Kumar
Try using local[n] with n 1, instead of local. Since receivers take up
1 slot, and local is basically 1 slot, there is no slot left to process
the data. That's why nothing gets printed.
TD
On Thu, Aug 28, 2014 at 10:28 AM, Verma, Rishi (398J)
rishi.ve...@jpl.nasa.gov wrote:
Hi Folks,
Iād
Do you see this error right in the beginning or after running for sometime?
The root cause seems to be that somehow your Spark executors got killed,
which killed receivers and caused further errors. Please try to take a look
at the executor logs of the lost executor to find what is the root cause
If you are repartitioning to 8 partitions, and your node happen to have at
least 4 cores each, its possible that all 8 partitions are assigned to only
2 nodes. Try increasing the number of partitions. Also make sure you have
executors (allocated by YARN) running on more than two nodes if you want
worker?
On Thu, Aug 28, 2014 at 4:12 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Do you see this error right in the beginning or after running for
sometime?
The root cause seems to be that somehow your Spark executors got killed,
which killed receivers and caused further errors
...@genesys.com]
*Sent:* Wednesday, August 27, 2014 6:46 PM
*To:* Tathagata Das
*Cc:* user@spark.apache.org
*Subject:* RE: [Streaming] Akka-based receiver with messages defined in
uploaded jar
Sorry for the delay with answer ā was on vacation.
As I said I was using modified version of launcher from
magic
at either Spark or application code?
*From:* Tathagata Das [mailto:tathagata.das1...@gmail.com]
*Sent:* Friday, August 29, 2014 7:21 PM
*To:* Anton Brazhnyk
*Cc:* user@spark.apache.org
*Subject:* Re: [Streaming] Akka-based receiver with messages defined in
uploaded jar
Can you
In the current state of Spark Streaming, creating separate Java processes
each having a streaming context is probably the best approach to
dynamically adding and removing of input sources. All of these should be
able to to use a YARN cluster for resource allocation.
On Wed, Sep 3, 2014 at 6:30
This is some issue with how Scala computes closures. Here because of the
function blah it is trying the serialize the whole function that this code
is part of. Can you define the function blah outside the main function? In
fact you canTry putting the function in a serializable object.
object
Yes Raymond is right. You can always run two jobs on the same cached RDD,
and they can run in parallel (assuming you launch the 2 jobs from two
different threads). However, with one copy of each RDD partition, the tasks
of two jobs will experience some slot contentions. So if you replicate it,
you
needs to be
Serializable, but the Blaher object doesn't.
On Wed, Sep 3, 2014 at 7:59 PM, Tathagata Das [via Apache Spark User List]
[hidden email] http://user/SendEmail.jtp?type=nodenode=13478i=0
wrote:
This is some issue with how Scala computes closures. Here because of the
function blah
Hey Gerard,
Spark Streaming should just queue the processing and not delete the block
data. There are reports of this error and I am still unable to reproduce
the problem. One workaround you can try the configuration
spark.streaming.unpersist = false . This stops Spark Streaming from
cleaning up
I am not sure if there is a good, clean way to do that - broadcasts
variables are not designed to be used out side spark job closures. You
could try a bit of a hacky stuff where you write the serialized variable to
file in HDFS / NFS / distributed files sytem, and then use a custom decoder
class
' work for DStream? I think similar construct won't
work for RDD, that's why there is accumulator.
On Fri, Aug 8, 2014 at 12:52 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Do you mean that you want a continuously updated count as more
events/records are received in the DStream
Which version of spark are you running?
If you are running the latest one, then could try running not a window but
a simple event count on every 2 second batch, and see if you are still
running out of memory?
TD
On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar
aniket.bhatna...@gmail.com
This is very puzzling, given that this works in the local mode.
Does running the kinesis example work with your spark-submit?
https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala
The instructions are present
There is not artifact call spark-streaming-algebird . To use the algebird,
you will have add the following dependency (in maven format)
dependency
groupIdcom.twitter/groupId
artifactIdalgebird-core_${scala.binary.version}/artifactId
version0.1.11/version
/dependency
This is
At a high-level, the suggestion sounds good to me. However regarding code,
its best to submit a Pull Request on Spark github page for community
reviewing. You will find more information here.
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
On Tue, Sep 23, 2014 at 10:11 PM,
This is actually a very tricky as their two pretty big challenges that need
to be solved.
(i) Checkpointing for broadcast variables: Unlike RDDs, broadcasts variable
dont have checkpointing support (that is you cannot write the content of a
broadcast variable to HDFS and recover it automatically
I am not sure what you mean by data checkpoint continuously increase,
leading to recovery process taking time? Do you mean that in HDFS you are
seeing rdd checkpoint files being continuously written but never being
deleted?
On Tue, Sep 23, 2014 at 2:40 AM, RodrigoB rodrigo.boav...@aspect.com
Is this the logs of the worker where the failure occurs? I think issues
similar to these have since been solved in later versions of Spark.
TD
On Tue, Sep 30, 2014 at 11:33 AM, Shaikh Riyaz shaikh@gmail.com wrote:
Dear All,
We are using Spark streaming version 1.0.0 in our Cloudea Hadoop
]
-
Your support will be highly appreciated.
Regards,
Riyaz
On Wed, Oct 1, 2014 at 1:16 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Is this the logs of the worker where the failure occurs? I think issues
similar to these have since been
Hey Gerard,
This is a very good question!
*TL;DR: *The performance should be same, except in case of shuffle-based
operations where the number of reducers is not explicitly specified.
Let me answer in more detail by dividing the set of DStream operations into
three categories.
*1. Map-like
Cc'ing Helena for more information on this.
TD
On Thu, Oct 23, 2014 at 6:30 AM, Saiph Kappa saiph.ka...@gmail.com wrote:
What is the application about? I couldn't find any proper description
regarding the purpose of killrweather ( I mean, other than just integrating
Spark with Cassandra). Do
The memory usage of blocks of data received through Spark Streaming is not
reflected in the Spark UI. It only shows the memory usage due to cached
RDDs.
I didnt find a JIRA for this, so I opened a new one.
https://issues.apache.org/jira/browse/SPARK-4072
TD
On Thu, Oct 23, 2014 at 12:47 AM,
, Tathagata Das
tathagata.das1...@gmail.com wrote:
Hey Gerard,
This is a very good question!
*TL;DR: *The performance should be same, except in case of shuffle-based
operations where the number of reducers is not explicitly specified.
Let me answer in more detail by dividing the set of DStream
It it deserialized in a streaming manner as the iterator moves over the
partition. This is a functionality of core Spark, and Spark Streaming just
uses it as is.
What do you want to customize it to?
On Tue, Nov 4, 2014 at 9:22 AM, Mohit Jaggi mohitja...@gmail.com wrote:
Folks,
If I have an RDD
Didnt oyu get any errors in the log4j logs, saying that you have to enable
checkpointing?
TD
On Tue, Nov 4, 2014 at 7:20 AM, diogo di...@uken.com wrote:
So, to answer my own n00b question, if case anyone ever needs it. You have
to enable checkpointing (by ssc.checkpoint(hdfsPath)). Windowed
Ted, any pointers?
On Thu, Nov 6, 2014 at 4:46 PM, Luiz Geovani Vier lgv...@gmail.com wrote:
Hello,
Is there a built-in way or connector to store DStream results into an
existing Hive ORC table using the Hive/HCatalog Streaming API?
Otherwise, do you have any suggestions regarding the
I am not aware of any obvious existing pattern that does exactly this.
Generally this sort of computation (subset, denormalization) things are so
generic sounding terms but actually have very specific requirements that it
hard to refer to a design pattern without more requirement info.
If you
What is the Spark master that you are using. Use local[4], not local
if you are running locally.
On Mon, Nov 10, 2014 at 3:01 PM, Something Something
mailinglist...@gmail.com wrote:
I am embarrassed to admit but I can't get a basic 'word count' to work under
Kafka/Spark streaming. My code
You could get all the tweets in the stream, and then apply filter
transformation on the DStream of tweets to filter away non-english
tweets. The tweets in the DStream is of type twitter4j.Status which
has a field describing the language. You can use that in the filter.
Though in practice, a lot
On Wed, Nov 26, 2014 at 4:02 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Let me further clarify Lalit's point on when RDDs generated by
DStreams are destroyed, and hopefully that will answer your original
questions.
1. How spark (streaming) guarantees that all the actions
some solutions for this.
Thanks!
Bill
On Wed, Nov 26, 2014 at 5:35 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Can you elaborate on the usage pattern that lead to cannot compute
split ? Are you using the RDDs generated by DStream, outside the
DStream logic? Something like
That depends! See inline. I am assuming that when you said replacing
local disk with HDFS in case 1, you are connected to a separate HDFS
cluster (like case 1) with a single 10G link. Also assumign that all
nodes (1 in case 1, and 6 in case 2) are the worker nodes, and the
spark application
Following Gerard's thoughts, here are possible things that could be happening.
1. Is there another process in the background that is deleting files
in the directory where you are trying to write? Seems like the
temporary file generated by one of the tasks is getting delete before
it is renamed to
What does process do? Maybe when this process function is being run in
the Spark executor, it is causing the some static initialization,
which fails causing this exception. For Oracle documentation,
an ExceptionInInitializerError is thrown to indicate that an exception
occurred during evaluation
You could create a lazily initialized singleton factory and connection
pool. Whenever an executor starts running the firt task that needs to
push out data, it will create the connection pool as a singleton. And
subsequent tasks running on the executor is going to use the
connection pool. You will
I am updating the docs right now. Here is a staged copy that you can
have sneak peek of. This will be part of the Spark 1.2.
http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html
The updated fault-tolerance section tries to simplify the explanation
of when and what data
Also, this is covered in the streaming programming guide in bits and pieces.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
On Thu, Dec 11, 2014 at 4:55 AM, Ashic Mahtab as...@live.com wrote:
That makes sense. I'll try that.
Thanks :)
Aditya, I think you have the mental model of spark streaming a little
off the mark. Unlike traditional streaming systems, where any kind of
state is mutable, SparkStreaming is designed on Sparks immutable RDDs.
Streaming data is received and divided into immutable blocks, then
form immutable RDDs,
First of all, how long do you want to keep doing this? The data is
going to increase infinitely and without any bounds, its going to get
too big for any cluster to handle. If all that is within bounds, then
try the following.
- Maintain a global variable having the current RDD storing all the
log
Not that I am aware of. Spark will try to spread the tasks evenly
across executors, its not aware of the workers at all. So if the
executors to worker allocation is uneven, I am not sure what can be
done. Maybe others can get smoe ideas.
On Tue, Dec 9, 2014 at 6:20 AM, Gerard Maas
This seems to be compilation errors. The second one seems to be that
you are using CassandraJavaUtil.javafunctions wrong. Look at the
documentation and set the parameter list correctly.
TD
On Mon, Dec 8, 2014 at 9:47 AM, m.sar...@accenture.com wrote:
Hi,
I am intending to save the streaming
www.chaordic.com.br
+55 48 3232.3200
On Thu, Dec 11, 2014 at 10:03 AM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Following Gerard's thoughts, here are possible things that could be
happening.
1. Is there another process in the background that is deleting files
in the directory
Spark Streaming takes care of restarting receivers if it fails.
Regarding the fault-tolerance properties and deployment options, we
made some improvements in the upcoming Spark 1.2. Here is a staged
version of the Spark Streaming programming guide that you can read for
the up-to-date explanation
Yes, socketTextStream starts a TCP client that tries to connect to a
TCP server (localhost: in your case). If there is a server running
on that port that can send data to connected TCP connections, then you
will receive data in the stream.
Did you check out the quick example in the streaming
Another point to start playing with updateStateByKey is the example
StatefulNetworkWordCount. See the streaming examples directory in the
Spark repository.
TD
On Thu, Dec 18, 2014 at 6:07 AM, Pierce Lamb
richard.pierce.l...@gmail.com wrote:
I am trying to run stateful Spark Streaming
A more updated version of the streaming programming guide is here
http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html
Please refer to this until we make the official release of Spark 1.2
TD
On Tue, Dec 16, 2014 at 3:50 PM, smallmonkey...@hotmail.com
I am not sure that can be done. Receivers are designed to be run only
on the executors/workers, whereas a SQLContext (for using Spark SQL)
can only be defined on the driver.
On Mon, Dec 29, 2014 at 6:45 PM, sranga sra...@gmail.com wrote:
Hi
Could Spark-SQL be used from within a custom actor
For windows that large (1 hour), you will probably also have to
increase the batch interval for efficiency.
TD
On Mon, Dec 29, 2014 at 12:16 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
You can use reduceByKeyAndWindow for that. Here's a pretty clean example
1. Of course, a single block / partition has many Kafka messages, and
from different Kafka topics interleaved together. The message count is
not related to the block count. Any message received within a
particular block interval will go in the same block.
2. Yes, the receiver will be started on
Thats is kind of expected due to data locality. Though you should see
some tasks running on the executors as the data gets replicated to
other nodes and can therefore run tasks based on locality. You have
two solutions
1. kafkaStream.repartition() to explicitly repartition the received
data
Which version of Spark Streaming are you using.
When the batch processing time increases to 15-20 seconds, could you
compare the task times compared to the tasks time when the application
is just launched? Basically is the increase from 6 seconds to 15-20
seconds is caused by increase in
Whats your spark-submit commands in both cases? Is it Spark Standalone or
YARN (both support client and cluster)? Accordingly what is the number of
executors/cores requested?
TD
On Wed, Dec 31, 2014 at 10:36 AM, Enno Shioji eshi...@gmail.com wrote:
Also the job was deployed from the master
This is not normal. Its a huge scheduling delay!! Can you tell me more
about the application?
- cluser setup, number of receivers, whats the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:
Hate to do this...but...erm...bump? Would really appreciate input
Hello mingyu,
That is a reasonable way of doing this. Spark Streaming natively does
not support sticky because Spark launches tasks based on data
locality. If there is no locality (example reduce tasks can run
anywhere), location is randomly assigned. So the cogroup or join
introduces a locality
Hello Sachin,
While Akhil's solution is correct, this is not sufficient for your usecase.
RDD.foreach (that Akhil is using) will run on the workers, but you are
creating the Producer object on the driver. This will not work, a producer
create on the driver cannot be used from the worker/executor.
This is an issue that is hard to resolve without rearchitecting the whole
Kafka Receiver. There are some workarounds worth looking into.
http://mail-archives.apache.org/mod_mbox/kafka-users/201312.mbox/%3CCAFbh0Q38qQ0aAg_cj=jzk-kbi8xwf+1m6xlj+fzf6eetj9z...@mail.gmail.com%3E
On Mon, Feb 2, 2015
I dont think your screenshots came through in the email. None the less,
queueStream will not work with getOrCreate. Its mainly for testing (by
generating your own RDDs) and not really useful for production usage (where
you really need to checkpoint-based recovery).
TD
On Thu, Feb 5, 2015 at 4:12
Correct, brute force clean up is not useful. Since Spark 1.0, Spark can do
automatic cleanup of files based on which RDDs are used/garbage collected
by JVM. That would be the best way, but depends on the JVM GC
characteristics. If you force a GC periodically in the driver that might
help you get
So you have come across spark.streaming.concurrentJobs already :)
Yeah, that is an undocumented feature that does allow multiple output
operations to submitted in parallel. However, this is not made public for
the exact reasons that you realized - the semantics in case of stateful
operations is
Can you give me the whole logs?
TD
On Tue, Feb 10, 2015 at 10:48 AM, Jon Gregg jonrgr...@gmail.com wrote:
OK that worked and getting close here ... the job ran successfully for a
bit and I got output for the first couple buckets before getting a
java.lang.Exception: Could not compute split,
Thanks for looking into it.
On Thu, Feb 12, 2015 at 8:10 PM, Tathagata Das t...@databricks.com
wrote:
Hey Tim,
Let me get the key points.
1. If you are not writing back to Kafka, the delay is stable? That is,
instead of foreachRDD { // write to kafka } if you do dstream.count
Could you come up with a minimal example through which I can reproduce the
problem?
On Tue, Feb 10, 2015 at 12:30 PM, conor fennell.co...@gmail.com wrote:
I am getting the following error when I kill the spark driver and restart
the job:
15/02/10 17:31:05 INFO CheckpointReader: Attempting to
Sorry for the late response. With the amount of data you are planning join,
any system would take time. However, between Hive's MapRduce joins, and
Spark's basic shuffle, and Spark SQL's join, the latter wins hands down.
Furthermore, with the APIs of Spark and Spark Streaming, you will have to
do
Hey Tim,
Let me get the key points.
1. If you are not writing back to Kafka, the delay is stable? That is,
instead of foreachRDD { // write to kafka } if you do dstream.count,
then the delay is stable. Right?
2. If so, then Kafka is the bottleneck. Is the number of partitions, that
you spoke of
What version of Spark are you using?
TD
On Thu, Feb 19, 2015 at 2:45 AM, Emre Sevinc emre.sev...@gmail.com wrote:
Hello,
We have a Spark Streaming application that watches an input directory, and
as files are copied there the application reads them and sends the contents
to a RESTful web
Ohhh nice! Would be great if you can share us some code soon. It is
indeed a very complicated problem and there is probably no single
solution that fits all usecases. So having one way of doing things
would be a great reference. Looking forward to that!
On Wed, Jan 28, 2015 at 4:52 PM, Tobias
You could use foreachRDD to do the operations and then inside the
foreach create an accumulator to gather all the errors together
dstream.foreachRDD { rdd =
val accumulator = new Accumulator[]
rdd.map { . }.count // whatever operation that is error prone
// gather all errors
That is a known issue uncovered last week. It fails on certain
environments, not on Jenkins which is our testing environment.
There is already a PR up to fix it. For now you can build using mvn
package -DskipTests
TD
On Fri, Jan 30, 2015 at 8:59 PM, Andrew Musselman
andrew.mussel...@gmail.com
You cannot have two Spark Contexts in the same JVM active at the same time.
Just create one SparkContext and then use it for both purpose.
TD
On Fri, Feb 6, 2015 at 8:49 PM, VISHNU SUBRAMANIAN
johnfedrickena...@gmail.com wrote:
Can you try creating just a single spark context and then try
Here is an example of how you can do. Lets say myDStream contains the
data that you may want to asynchornously query, say using, Spark SQL.
val sqlContext = new SqlContext(streamingContext.sparkContext)
myDStream.foreachRDD { rdd = // rdd is a RDD of case class
It was not really meant to be pubic and overridden. Because anything you
want to do to generate jobs from RDDs can be done using DStream.foreachRDD
On Sun, Mar 15, 2015 at 11:14 PM, madhu phatak phatak@gmail.com wrote:
Hi,
I am trying to create a simple subclass of DStream. If I
It's mostly for legacy reasons. First we had added all the MappedDStream,
etc. and then later we realized we need to expose something that is more
generic for arbitrary RDD-RDD transformations. It can be easily replaced.
However, there is a slight value in having MappedDStream, for developers to
etc so they are consistent with other API's?.
Regards,
Madhukara Phatak
http://datamantra.io/
On Mon, Mar 16, 2015 at 11:42 PM, Tathagata Das t...@databricks.com
wrote:
It's mostly for legacy reasons. First we had added all the MappedDStream,
etc. and then later we realized we need to expose
If you are creating an assembly, make sure spark-streaming is marked as
provided. spark-streaming is already part of the spark installation so will
be present at run time. That might solve some of these, may be!?
TD
On Mon, Mar 16, 2015 at 11:30 AM, Kelly, Jonathan jonat...@amazon.com
wrote:
-kinesis-asl.)
Jonathan Kelly
Elastic MapReduce - SDE
Port 99 (SEA35) 08.220.C2
From: Tathagata Das t...@databricks.com
Date: Monday, March 16, 2015 at 12:45 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org user@spark.apache.org
Subject: Re: problems with spark
201 - 300 of 861 matches
Mail list logo