I am afraid not. The whole point of Spark Streaming is to make it easy to
do complicated processing on streaming data while interoperating with core
Spark, MLlib, and SQL without the operational overhead of maintaining 4
different systems. As a slight cost of achieving that unification, there may be some
be added in at some point.
On Fri, Apr 3, 2015 at 12:47 PM, Tathagata Das t...@databricks.com
wrote:
A sort-of-hacky workaround is to use a queueStream where you can manually
create RDDs (using sparkContext.hadoopFile) and insert into the queue. Note
that this is for testing only as queueStream does
How many cores are present in the workers allocated to the standalone cluster
spark://ip-10-241-251-232:7077 ?
On Fri, Apr 3, 2015 at 2:18 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:
If I use local[2] instead of the URL spark://ip-10-241-251-232:7077, this
seems to work. I don't understand
Just remove "provided" for spark-streaming-kinesis-asl:
libraryDependencies += "org.apache.spark" %% "spark-streaming-kinesis-asl" % "1.3.0"
On Fri, Apr 3, 2015 at 12:45 PM, Vadim Bichutskiy
vadim.bichuts...@gmail.com wrote:
Thanks. So how do I fix it?
On Fri, Apr 3, 2015 at 3:43 PM, Kelly,
A sort-of-hacky workaround is to use a queueStream where you can manually
create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note
that this is for testing only, as queueStream does not work with driver
fault recovery.
TD
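For concreteness, a minimal sketch of that workaround, assuming sc is an existing SparkContext and using textFile instead of hadoopFile for brevity (path and batch interval are made up):

import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))

// Queue of manually created RDDs; queueStream pulls one RDD per batch by default.
val rddQueue = new mutable.Queue[RDD[String]]()
val inputStream = ssc.queueStream(rddQueue)
inputStream.print()

// Create RDDs yourself and push them into the queue.
// Testing only: queueStream does not work with driver fault recovery.
rddQueue += sc.textFile("hdfs:///tmp/test-input")   // hypothetical path

ssc.start()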
On Fri, Apr 3, 2015 at 12:23 PM, adamgerst
Are you saying that even with the spark.cleaner.ttl set your files are not
getting cleaned up?
TD
On Thu, Apr 2, 2015 at 8:23 AM, andrem amesa...@gmail.com wrote:
Apparently Spark Streaming 1.3.0 is not cleaning up its internal files and
the worker nodes eventually run out of inodes.
We see
We should be able to support that use case in the direct API. It may be as
simple as allowing the users to pass on a function that returns the set of
topic+partitions to read from.
That is, a function (Time) => Set[TopicAndPartition]. This gets called every
batch interval before the offsets are
It's not a built-in component of Spark. However, there is a spark-package for
an Apache Camel receiver which can integrate with JMS.
http://spark-packages.org/package/synsys/spark
I have not tried it but do check it out.
TD
On Wed, Apr 1, 2015 at 4:38 AM, danila danila.erma...@gmail.com wrote:
Hi
In the current state, yes, there will be performance issues. It can be done
much more efficiently and we are working on doing that.
TD
On Wed, Apr 1, 2015 at 7:49 AM, Vinoth Chandar vin...@uber.com wrote:
Hi all,
As I understand from docs and talks, the streaming state is in memory as
RDD
, kafka metadata, etc) => Option[KafkaRDD]
I think it's more straightforward to give access to that additional state
via subclassing than it is to add in more callbacks for every possible use
case.
On Wed, Apr 1, 2015 at 2:01 PM, Tathagata Das t...@databricks.com
wrote:
We should be able
If it is deterministically reproducible, could you generate full DEBUG
level logs, from the driver and the workers and give it to me? Basically I
want to trace through what is happening to the block that is not being
found.
And can you tell me what cluster manager you are using? Spark Standalone,
Do you have the logs of the driver? Does that give any exceptions?
TD
On Fri, Mar 27, 2015 at 12:24 PM, Chen Song chen.song...@gmail.com wrote:
I ran a spark streaming job.
100 executors
30G heap per executor
4 cores per executor
The version I used is 1.3.0-cdh5.1.0.
The job is reading
-spark. From the console I see that Spark is trying to
replicate to nodes - nodes show up in Mesos active tasks ... but they
always fail with ClassNotFoundException.
2015-03-27 0:52 GMT+01:00 Tathagata Das t...@databricks.com:
Could you try running a simpler spark streaming program with receiver
(may
(StorageLevel.MEMORY_ONLY_2
etc).
2015-03-27 19:06 GMT+01:00 Tathagata Das t...@databricks.com:
Does it fail with just Spark jobs (using storage levels) on non-coarse
mode?
TD
On Fri, Mar 27, 2015 at 4:39 AM, Ondrej Smola ondrej.sm...@gmail.com
wrote:
More info
when using *spark.mesos.coarse
Yes, that is the correct understanding. There are undocumented parameters
that allow that, but I do not recommend using those :)
TD
On Wed, Mar 25, 2015 at 6:57 AM, Luis Ángel Vicente Sánchez
langel.gro...@gmail.com wrote:
I have a simple and probably dumb question about foreachRDD.
We are
etc. so they are consistent with other APIs?
Regards,
Madhukara Phatak
http://datamantra.io/
On Mon, Mar 16, 2015 at 11:42 PM, Tathagata Das t...@databricks.com
wrote:
It's mostly for legacy reasons. First we had added all the MappedDStream,
etc. and then later we realized we need to expose
It was not really meant to be public and overridden. Because anything you
want to do to generate jobs from RDDs can be done using DStream.foreachRDD
On Sun, Mar 15, 2015 at 11:14 PM, madhu phatak phatak@gmail.com wrote:
Hi,
I am trying to create a simple subclass of DStream. If I
It's mostly for legacy reasons. First we had added all the MappedDStream,
etc. and then later we realized we need to expose something that is more
generic for arbitrary RDD-RDD transformations. It can be easily replaced.
However, there is a slight value in having MappedDStream, for developers to
If you are creating an assembly, make sure spark-streaming is marked as
provided. spark-streaming is already part of the Spark installation so it will
be present at run time. That might solve some of these, maybe!?
TD
On Mon, Mar 16, 2015 at 11:30 AM, Kelly, Jonathan jonat...@amazon.com
wrote:
-kinesis-asl.)
Jonathan Kelly
Elastic MapReduce - SDE
Port 99 (SEA35) 08.220.C2
From: Tathagata Das t...@databricks.com
Date: Monday, March 16, 2015 at 12:45 PM
To: Jonathan Kelly jonat...@amazon.com
Cc: user@spark.apache.org user@spark.apache.org
Subject: Re: problems with spark
.
I really appreciate your help, but it looks like I’m back to the drawing
board on this one.
*From:* Tathagata Das [mailto:t...@databricks.com]
*Sent:* Thursday, March 12, 2015 7:53 PM
*To:* Jose Fernandez
*Cc:* user@spark.apache.org
*Subject:* Re: Handling worker batch processing during
Are you running the driver program (that is your application process) in
your desktop and trying to run it on the cluster in EC2?
It could very well be a hostname mismatch in some way due to all the
public hostname, private hostname, private IP, firewall craziness of EC2.
You have to probably
Is the number of top K elements you want to keep small? That is, is K
small? In which case, you can
1. either do it in the driver on the array
dstream.foreachRDD { rdd =>
  val topK = rdd.top(K)
  // use top K
}
2. Or, you can use the topK to create another RDD using sc.makeRDD
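A rough sketch of both options, assuming K and an ordering for the element type are defined elsewhere:

dstream.foreachRDD { rdd =>
  // Option 1: bring the top K elements to the driver as a small local array.
  val topK = rdd.top(K)   // requires an implicit Ordering for the element type
  // Option 2: turn that small array back into an RDD for further distributed work.
  val topKRDD = rdd.sparkContext.makeRDD(topK)
}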
If you want to access the keys in an RDD that is partitioned by key, then you
can use RDD.mapPartitions(), which gives you access to the whole partition
as an iterator of (key, value) pairs. You have the option of maintaining the
partitioning information or not by setting the preservePartitioning flag in
, Tathagata Das t...@databricks.com
wrote:
If you want to access the keys in an RDD that is partitioned by key, then
you can use RDD.mapPartitions(), which gives you access to the whole
partition as an iterator of (key, value) pairs. You have the option of
maintaining the partitioning information or not by setting
-the-core-java-api-for-traversing-your-graph-data-base-instead-of-cypher-query-language/
On Thu, Mar 12, 2015 at 5:09 PM, Tathagata Das t...@databricks.com
wrote:
Well the answers you got there are correct as well.
Unfortunately I am not familiar with Neo4j enough to comment any more
I am not sure if you realized, but the code snippet is pretty mangled up in
the email we received. It might be a good idea to put the code in pastebin
or gist; that is much, much easier for everyone to read.
On Thu, Mar 12, 2015 at 12:48 AM, d34th4ck3r gautam1...@gmail.com wrote:
I'm trying to use Neo4j
at 12:58 AM, Gautam Bajaj gautam1...@gmail.com wrote:
Alright, I have also asked this question in StackOverflow:
http://stackoverflow.com/questions/28896898/using-neo4j-with-apache-spark
The code there is pretty neat.
On Thu, Mar 12, 2015 at 4:55 PM, Tathagata Das t...@databricks.com
wrote
. As for each partition, I'd need
to restart the server, that was the basic reason I was creating graphDb
object outside this loop.
On Fri, Mar 13, 2015 at 5:34 AM, Tathagata Das t...@databricks.com
wrote:
(Putting user@spark back in the to list)
In the gist, you are creating graphDB object way
using Spark
1.2 on CDH 5.3. I stop the application with yarn application -kill appID.
*From:* Tathagata Das [mailto:t...@databricks.com]
*Sent:* Thursday, March 12, 2015 1:29 PM
*To:* Jose Fernandez
*Cc:* user@spark.apache.org
*Subject:* Re: Handling worker batch processing during driver
the AVRO data to multiple files.
Thanks,
Sam
On Mar 12, 2015, at 4:09 AM, Tathagata Das t...@databricks.com wrote:
Why do you have to write a single file?
On Wed, Mar 11, 2015 at 1:00 PM, SamyaMaiti samya.maiti2...@gmail.com
wrote:
Hi Experts,
I have a scenario, where in I want to write
Can you access the batcher directly? That is, is there a handle to get
access to the batchers on the executors by running a task on that executor?
If so, after the streamingContext has been stopped (not the SparkContext),
then you can use `sc.makeRDD()` to run a dummy task like this.
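A sketch of such a dummy task, assuming a hypothetical executor-local singleton exposes the batcher (host name and singleton are made up):

// Pin a one-element task to a specific executor host via a locality preference
// (a preference, not a guarantee), so the closure runs on that executor.
val targetHost = "worker-host-1"                      // hypothetical host
sc.makeRDD(Seq(1 -> Seq(targetHost))).foreach { _ =>
  BatcherRegistry.get().flushAll()                    // hypothetical handle to the local batcher
}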
Why are you repartitioning to 1 partition? That would obviously be slow; you are
converting a distributed operation into a single-node operation.
Also consider using RDD.top(). If you define the ordering right (based on
the count), then you will get the top K across them without doing a shuffle for
sortByKey. Much
Why do you have to write a single file?
On Wed, Mar 11, 2015 at 1:00 PM, SamyaMaiti samya.maiti2...@gmail.com
wrote:
Hi Experts,
I have a scenario wherein I want to write to an Avro file from a streaming
job that reads data from Kafka.
But the issue is, as there are multiple executors
Can you show us the code that you are using?
This might help. This is the updated streaming programming guide for 1.3,
soon to be up, this is a quick preview.
http://people.apache.org/~tdas/spark-1.3.0-temp-docs/streaming-programming-guide.html#dataframe-and-sql-operations
TD
On Wed, Mar 11,
If you are using tools like SBT/Maven/Gradle/etc, they figure out all the
recursive dependencies and includes them in the class path. I haven't
touched Eclipse in years so I am not sure off the top of my head what's
going on instead. Just in case you only downloaded the
spark-streaming_2.10.jar
You have to include Scala libraries in the Eclipse dependencies.
TD
On Tue, Mar 10, 2015 at 10:54 AM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I am trying out streaming example as documented and I am using spark 1.2.1
streaming from maven for Java.
When I add this code I get compilation
Do you have event logging enabled?
That could be the problem. The Master tries to aggressively recreate the
web ui of the completed job with the event logs (when it is enabled)
causing the Master to stall.
I created a JIRA for this.
https://issues.apache.org/jira/browse/SPARK-6270
On Tue, Mar 10,
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.2.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.2.1</version>
  </dependency>
</dependencies>
On Tue, Mar 10, 2015 at 11:06 AM, Tathagata Das t...@databricks.com
wrote:
If you
Spark Streaming has StreamingContext.socketStream()
http://spark.apache.org/docs/1.2.1/api/java/org/apache/spark/streaming/StreamingContext.html#socketStream(java.lang.String,
int, scala.Function1, org.apache.spark.storage.StorageLevel,
scala.reflect.ClassTag)
TD
On Mon, Mar 9, 2015 at 11:37 AM,
data from RDBMs, for the details you can
refer to the docs.
Thanks
Jerry
*From:* Cui Lin [mailto:cui@hds.com]
*Sent:* Tuesday, March 10, 2015 8:36 AM
*To:* Tathagata Das
*Cc:* user@spark.apache.org
*Subject:* Re: Spark Streaming input data source list
Tathagata,
Thanks
*From:* Tathagata Das [mailto:t...@databricks.com]
*Sent:* Tuesday, March 03, 2015 11:11 PM
*To:* Nastooh Avessta (navesta)
*Cc:* user@spark.apache.org
*Subject:* Re: Spark Streaming Switchover Time
I am confused. Are you killing the 1st worker
Hint: print() just gives a sample of what is in the data, and does not
enforce the processing of all the data (only the first partition of the RDD
is computed to get 10 items). count() actually processes all the data. This
is all due to lazy evaluation; if you don't need to use all the data, don't
That could be a corner case bug. How do you add the 3rd party library to
the class path of the driver? Through spark-submit? Could you give the
command you used?
TD
On Wed, Mar 4, 2015 at 12:42 AM, Emre Sevinc emre.sev...@gmail.com wrote:
I've also tried the following:
Configuration
The file stream does not use a receiver. Maybe that was not clear in the
programming guide. I am updating it for the 1.3 release right now; I will make
it more clear.
And file stream has full reliability. Read this in the programming guide.
You can use DStream.transform() to do any arbitrary RDD transformations on
the RDDs generated by a DStream.
val coalescedDStream = myDStream.transform { _.coalesce(...) }
On Tue, Mar 3, 2015 at 1:47 PM, Saiph Kappa saiph.ka...@gmail.com wrote:
Sorry I made a mistake in my code. Please ignore
Can you elaborate on what is this switchover time?
TD
On Tue, Mar 3, 2015 at 9:57 PM, Nastooh Avessta (navesta) nave...@cisco.com
wrote:
Hi
On a standalone, Spark 1.0.0, with 1 master and 2 workers, operating in
client mode, running a udp streaming application, I am noting around 2
*From:* Tathagata Das [mailto:t...@databricks.com]
*Sent:* Tuesday, March 03, 2015 10:24 PM
*To:* Nastooh Avessta (navesta)
*Cc:* user@spark.apache.org
*Subject:* Re
Are you sure the multiple invocations are not from previous runs of the
program?
TD
On Fri, Feb 27, 2015 at 12:16 PM, Nastooh Avessta (navesta)
nave...@cisco.com wrote:
Hi
Under Spark 1.0.0, standalone, client mode, I am trying to invoke a 3rd-party
UDP traffic generator from the streaming
*From:* Tathagata Das [mailto:t
26, 2015 at 1:36 AM, Tathagata Das t...@databricks.com
wrote:
Yes. # tuples processed in a batch = sum of all the tuples received by
all the receivers.
In screen shot, there was a batch with 69.9K records, and there was a
batch which took 1 s 473 ms. These two batches can be the same, can
Hey Mike,
I quickly looked through the example and I found a major performance issue.
You are collecting the RDDs to the driver and then sending them to Mongo in
a foreach. Why not do a distributed push to Mongo?
WHAT YOU HAVE:
val mongoConnection = ...
WHAT YOU SHOULD DO:
rdd.foreachPartition
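In full, the suggested shape would be roughly the following (the Mongo connection helpers are placeholders, not a specific driver API):

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Create the connection on the executor, once per partition, not on the driver.
    val mongoConnection = createMongoConnection()                        // hypothetical helper
    records.foreach(record => insertIntoMongo(mongoConnection, record))  // hypothetical helper
    mongoConnection.close()
  }
}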
Yes. # tuples processed in a batch = sum of all the tuples received by all
the receivers.
In screen shot, there was a batch with 69.9K records, and there was a batch
which took 1 s 473 ms. These two batches can be the same, can be different
batches.
TD
On Wed, Feb 25, 2015 at 10:11 AM, Josh J
Spark Streaming has a new Kafka direct stream, to be released as an
experimental feature with 1.3. That uses a low-level consumer. Not sure if
it satisfies your purpose.
If you want more control, it's best to create your own Receiver with the
low-level Kafka API.
TD
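For reference, a sketch of how that direct stream is created with the 1.3 API, assuming an existing StreamingContext ssc (broker list and topic name are made up):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // hypothetical broker
val topics = Set("events")                                        // hypothetical topic
val directStream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)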
On Tue, Feb 24, 2015 at 12:09 AM,
There are different kinds of checkpointing going on. updateStateByKey
requires RDD checkpointing, which can be enabled only by calling
sparkContext.setCheckpointDir(). But that does not enable Spark
Streaming driver checkpoints, which are necessary for recovering from driver
failures. That is
, I'm thinking along that line.
The problem is how to send the query to the backend. Bundle an HTTP
server into a Spark Streaming job that will accept the parameters?
--
Nikhil Bafna
On Mon, Feb 23, 2015 at 2:04 PM, Tathagata Das t...@databricks.com
wrote:
You will have to build a split
You could do something like this.
def rddTransformationUsingBroadcast(rdd: RDD[...]): RDD[...] = {
  val broadcastToUse = getBroadcast()  // get the reference to a broadcast variable, new or existing
  rdd.map { ... }                      // use broadcast variable
}
Could you find the executor logs on the executor where that task was
scheduled? That may provide more information on what caused the error.
Also take a look at where the block in question was stored, and where the
task was scheduled.
You will need to enable log4j INFO level logs for this
You will have to build a split infrastructure - a front end that takes the
queries from the UI and sends them to the backend, and the backend (running
the Spark Streaming app) will actually run the queries on tables created in
the contexts. The RPCs necessary between the frontend and backend will
The default persistence level is MEMORY_AND_DISK, so the LRU policy would
spill the blocks to disk, so the streaming app will not fail. However,
since things will get constantly read in and out of disk as windows are
processed, the performance won't be great. So it is best to have sufficient
Unless I am unaware of some recent changes, the Spark UI shows stages and
jobs, not accumulator results. And the UI is not designed to be pluggable for
showing user-defined stuff.
TD
On Fri, Feb 20, 2015 at 12:25 AM, Tim Smith secs...@gmail.com wrote:
On Spark 1.2:
I am trying to capture # records
-24 12:58
*To:* Tathagata Das t...@databricks.com
*CC:* user user@spark.apache.org; bit1129 bit1...@163.com
*Subject:* Re: About FlumeUtils.createStream
I see, thanks for the clarification TD.
On 24 Feb 2015 09:56, Tathagata Das t...@databricks.com wrote:
Akhil, that is incorrect.
Spark
In case this mystery has not been solved, DStream.print() essentially does
an RDD.take(10) on each RDD, which computes only a subset of the partitions
in the RDD. But collect() forces the evaluation of all the RDDs. Since you
are writing to JSON in the map() function, this could be the reason.
TD
Exactly, that is the reason.
To avoid that, in Spark 1.3 to-be-released, we have added a new Kafka API
(called direct stream) which does not use Zookeeper at all to keep track of
progress, and maintains offset within Spark Streaming. That can guarantee
all records being received exactly-once. Its
felixcheun...@hotmail.com wrote:
Kafka 0.8.2 has built-in offset management, how would that affect direct
stream in spark?
Please see KAFKA-1012
--- Original Message ---
From: Tathagata Das t...@databricks.com
Sent: February 23, 2015 9:53 PM
To: V Dineshkumar developer.dines...@gmail.com
Cc
Spark Streaming already directly supports Kafka
http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources
Is there any reason why that is not sufficient?
TD
On Sun, Feb 22, 2015 at 5:18 PM, mykidong mykid...@gmail.com wrote:
In java, you can see this example:
What version of Spark are you using?
TD
On Thu, Feb 19, 2015 at 2:45 AM, Emre Sevinc emre.sev...@gmail.com wrote:
Hello,
We have a Spark Streaming application that watches an input directory, and
as files are copied there the application reads them and sends the contents
to a RESTful web
Correct, brute force clean up is not useful. Since Spark 1.0, Spark can do
automatic cleanup of files based on which RDDs are used/garbage collected
by JVM. That would be the best way, but depends on the JVM GC
characteristics. If you force a GC periodically in the driver that might
help you get
You cannot have two SparkContexts active in the same JVM at the same time.
Just create one SparkContext and then use it for both purposes.
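A minimal sketch of sharing one context with the Spark 1.x APIs:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext(new SparkConf().setAppName("shared-context"))
// Reuse the single SparkContext for both streaming and SQL.
val ssc = new StreamingContext(sc, Seconds(10))
val sqlContext = new SQLContext(sc)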
TD
On Fri, Feb 6, 2015 at 8:49 PM, VISHNU SUBRAMANIAN
johnfedrickena...@gmail.com wrote:
Can you try creating just a single spark context and then try
Here is an example of how you can do this. Let's say myDStream contains the
data that you may want to asynchronously query, say using Spark SQL.
val sqlContext = new SQLContext(streamingContext.sparkContext)
myDStream.foreachRDD { rdd => // rdd is an RDD of a case class
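The snippet is cut off above; a hedged completion of the idea, using the Spark 1.3 DataFrame API with a made-up case class and table name, could look like:

case class Record(word: String, count: Int)            // hypothetical case class
import sqlContext.implicits._

myDStream.foreachRDD { rdd =>
  // rdd assumed to be an RDD[Record]; register each batch as a temp table so
  // it can be queried asynchronously (e.g. from another thread) with Spark SQL.
  rdd.toDF().registerTempTable("records")
}

// Elsewhere, possibly from a different thread:
val top10 = sqlContext.sql("SELECT word, count FROM records ORDER BY count DESC LIMIT 10")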
So you have come across spark.streaming.concurrentJobs already :)
Yeah, that is an undocumented feature that does allow multiple output
operations to be submitted in parallel. However, this is not made public for
the exact reasons that you realized - the semantics in case of stateful
operations is
Can you give me the whole logs?
TD
On Tue, Feb 10, 2015 at 10:48 AM, Jon Gregg jonrgr...@gmail.com wrote:
OK that worked and getting close here ... the job ran successfully for a
bit and I got output for the first couple buckets before getting a
java.lang.Exception: Could not compute split,
Thanks for looking into it.
On Thu, Feb 12, 2015 at 8:10 PM, Tathagata Das t...@databricks.com
wrote:
Hey Tim,
Let me get the key points.
1. If you are not writing back to Kafka, the delay is stable? That is,
instead of foreachRDD { // write to kafka } if you do dstream.count
Could you come up with a minimal example through which I can reproduce the
problem?
On Tue, Feb 10, 2015 at 12:30 PM, conor fennell.co...@gmail.com wrote:
I am getting the following error when I kill the spark driver and restart
the job:
15/02/10 17:31:05 INFO CheckpointReader: Attempting to
Sorry for the late response. With the amount of data you are planning to join,
any system would take time. However, between Hive's MapReduce joins,
Spark's basic shuffle, and Spark SQL's join, the latter wins hands down.
Furthermore, with the APIs of Spark and Spark Streaming, you will have to
do
Hey Tim,
Let me get the key points.
1. If you are not writing back to Kafka, the delay is stable? That is,
instead of foreachRDD { // write to kafka } if you do dstream.count,
then the delay is stable. Right?
2. If so, then Kafka is the bottleneck. Is the number of partitions, that
you spoke of
I don't think your screenshots came through in the email. Nonetheless,
queueStream will not work with getOrCreate. It's mainly for testing (by
generating your own RDDs) and not really useful for production usage (where
you really need checkpoint-based recovery).
TD
On Thu, Feb 5, 2015 at 4:12
Hello Sachin,
While Akhil's solution is correct, this is not sufficient for your use case.
RDD.foreach (that Akhil is using) will run on the workers, but you are
creating the Producer object on the driver. This will not work; a producer
created on the driver cannot be used from the worker/executor.
This is an issue that is hard to resolve without rearchitecting the whole
Kafka Receiver. There are some workarounds worth looking into.
http://mail-archives.apache.org/mod_mbox/kafka-users/201312.mbox/%3CCAFbh0Q38qQ0aAg_cj=jzk-kbi8xwf+1m6xlj+fzf6eetj9z...@mail.gmail.com%3E
On Mon, Feb 2, 2015
That is a known issue uncovered last week. It fails on certain
environments, not on Jenkins which is our testing environment.
There is already a PR up to fix it. For now you can build using mvn
package -DskipTests
TD
On Fri, Jan 30, 2015 at 8:59 PM, Andrew Musselman
andrew.mussel...@gmail.com
Ohhh nice! Would be great if you can share us some code soon. It is
indeed a very complicated problem and there is probably no single
solution that fits all usecases. So having one way of doing things
would be a great reference. Looking forward to that!
On Wed, Jan 28, 2015 at 4:52 PM, Tobias
You could use foreachRDD to do the operations and then inside the
foreachRDD create an accumulator to gather all the errors together
dstream.foreachRDD { rdd =>
  val accumulator = new Accumulator[...]
  rdd.map { ... }.count() // whatever operation that is error prone
  // gather all errors
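Filled out a bit, under the assumption that the per-record processing is a placeholder:

dstream.foreachRDD { rdd =>
  // Accumulator to count failures in this batch; its value is readable on the driver.
  val errorCount = rdd.sparkContext.accumulator(0L, "errors")
  rdd.foreach { record =>
    try {
      process(record)                       // hypothetical, error-prone operation
    } catch {
      case e: Exception => errorCount += 1L
    }
  }
  println(s"Errors in this batch: ${errorCount.value}")
}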
Hello mingyu,
That is a reasonable way of doing this. Spark Streaming natively does
not support sticky because Spark launches tasks based on data
locality. If there is no locality (example reduce tasks can run
anywhere), location is randomly assigned. So the cogroup or join
introduces a locality
This is not normal. It's a huge scheduling delay!! Can you tell me more
about the application?
- cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:
Hate to do this...but...erm...bump? Would really appreciate input
Whats your spark-submit commands in both cases? Is it Spark Standalone or
YARN (both support client and cluster)? Accordingly what is the number of
executors/cores requested?
TD
On Wed, Dec 31, 2014 at 10:36 AM, Enno Shioji eshi...@gmail.com wrote:
Also the job was deployed from the master
I am not sure that can be done. Receivers are designed to be run only
on the executors/workers, whereas a SQLContext (for using Spark SQL)
can only be defined on the driver.
On Mon, Dec 29, 2014 at 6:45 PM, sranga sra...@gmail.com wrote:
Hi
Could Spark-SQL be used from within a custom actor
For windows that large (1 hour), you will probably also have to
increase the batch interval for efficiency.
TD
On Mon, Dec 29, 2014 at 12:16 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
You can use reduceByKeyAndWindow for that. Here's a pretty clean example
1. Of course, a single block / partition has many Kafka messages, and
from different Kafka topics interleaved together. The message count is
not related to the block count. Any message received within a
particular block interval will go in the same block.
2. Yes, the receiver will be started on
That is kind of expected due to data locality. Though you should see
some tasks running on the executors as the data gets replicated to
other nodes, which can therefore run tasks based on locality. You have
two solutions:
1. kafkaStream.repartition() to explicitly repartition the received
data
Which version of Spark Streaming are you using?
When the batch processing time increases to 15-20 seconds, could you
compare the task times to the task times when the application
is just launched? Basically, is the increase from 6 seconds to 15-20
seconds caused by an increase in
Another good starting point for playing with updateStateByKey is the example
StatefulNetworkWordCount. See the streaming examples directory in the
Spark repository.
TD
On Thu, Dec 18, 2014 at 6:07 AM, Pierce Lamb
richard.pierce.l...@gmail.com wrote:
I am trying to run stateful Spark Streaming
A more updated version of the streaming programming guide is here
http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html
Please refer to this until we make the official release of Spark 1.2
TD
On Tue, Dec 16, 2014 at 3:50 PM, smallmonkey...@hotmail.com
Yes, socketTextStream starts a TCP client that tries to connect to a
TCP server (localhost: in your case). If there is a server running
on that port that can send data to connected TCP connections, then you
will receive data in the stream.
Did you check out the quick example in the streaming
Following Gerard's thoughts, here are possible things that could be happening.
1. Is there another process in the background that is deleting files
in the directory where you are trying to write? Seems like the
temporary file generated by one of the tasks is getting deleted before
it is renamed to
What does process do? Maybe when this process function is being run in
the Spark executor, it is causing some static initialization,
which fails, causing this exception. Per the Oracle documentation,
an ExceptionInInitializerError is thrown to indicate that an exception
occurred during evaluation
You could create a lazily initialized singleton factory and connection
pool. Whenever an executor starts running the first task that needs to
push out data, it will create the connection pool as a singleton. And
subsequent tasks running on the executor are going to use the
connection pool. You will
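A minimal sketch of that pattern, using a single lazily created JDBC connection as a stand-in for a real pool (URL and SQL are made up):

import java.sql.{Connection, DriverManager}

// One lazily initialized connection (or pool) per executor JVM: the object is
// instantiated on first use by a task, and reused by subsequent tasks.
object ConnectionFactory {
  lazy val connection: Connection =
    DriverManager.getConnection("jdbc:postgresql://db-host/mydb")   // hypothetical URL
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val conn = ConnectionFactory.connection        // created once per executor, then reused
    records.foreach { record =>
      val stmt = conn.prepareStatement("INSERT INTO events(value) VALUES (?)")  // hypothetical table
      stmt.setString(1, record.toString)
      stmt.executeUpdate()
      stmt.close()
    }
  }
}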
I am updating the docs right now. Here is a staged copy that you can
have sneak peek of. This will be part of the Spark 1.2.
http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html
The updated fault-tolerance section tries to simplify the explanation
of when and what data
Also, this is covered in the streaming programming guide in bits and pieces.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
On Thu, Dec 11, 2014 at 4:55 AM, Ashic Mahtab as...@live.com wrote:
That makes sense. I'll try that.
Thanks :)
Aditya, I think you have the mental model of spark streaming a little
off the mark. Unlike traditional streaming systems, where any kind of
state is mutable, Spark Streaming is designed on Spark's immutable RDDs.
Streaming data is received and divided into immutable blocks, which then
form immutable RDDs,
First of all, how long do you want to keep doing this? The data is
going to increase infinitely and without any bounds; it's going to get
too big for any cluster to handle. If all that is within bounds, then
try the following.
- Maintain a global variable having the current RDD storing all the
log