Spark Streaming has StreamingContext.socketStream()
http://spark.apache.org/docs/1.2.1/api/java/org/apache/spark/streaming/StreamingContext.html#socketStream(java.lang.String, int, scala.Function1, org.apache.spark.storage.StorageLevel, scala.reflect.ClassTag)
TD
On Mon, Mar 9, 2015 at 11:37 AM,
data from RDBMSs; for the details you can refer to the docs.
Thanks
Jerry
*From:* Cui Lin [mailto:cui@hds.com]
*Sent:* Tuesday, March 10, 2015 8:36 AM
*To:* Tathagata Das
*Cc:* user@spark.apache.org
*Subject:* Re: Spark Streaming input data source list
Tathagata,
Thanks
-the-core-java-api-for-traversing-your-graph-data-base-instead-of-cypher-query-language/
On Thu, Mar 12, 2015 at 5:09 PM, Tathagata Das t...@databricks.com
wrote:
Well the answers you got there are correct as well.
Unfortunately I am not familiar with Neo4j enough to comment any more
I am not sure if you realized, but the code snippet is pretty mangled up in
the email we received. It might be a good idea to put the code in a pastebin
or gist; it would be much easier for everyone to read.
On Thu, Mar 12, 2015 at 12:48 AM, d34th4ck3r gautam1...@gmail.com wrote:
I'm trying to use Neo4j
at 12:58 AM, Gautam Bajaj gautam1...@gmail.com wrote:
Alright, I have also asked this question in StackOverflow:
http://stackoverflow.com/questions/28896898/using-neo4j-with-apache-spark
The code there is pretty neat.
On Thu, Mar 12, 2015 at 4:55 PM, Tathagata Das t...@databricks.com
wrote
Why do you have to write a single file?
On Wed, Mar 11, 2015 at 1:00 PM, SamyaMaiti samya.maiti2...@gmail.com
wrote:
Hi Experts,
I have a scenario wherein I want to write to an Avro file from a streaming
job that reads data from Kafka.
But the issue is, as there are multiple executors
Hint: print() just gives a sample of what is in the data, and does not
enforce the processing of all the data (only the first partition of the RDD
is computed to get 10 items). count() actually processes all the data. This
is all due to lazy evaluation: if you don't need to use all the data, don't
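For example, a rough (untested) sketch, assuming a DStream[String] called lines:
lines.print()          // computes only enough partitions to show 10 records per batch
lines.count().print()  // count() runs over every partition, forcing full evaluation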
*From:* Tathagata Das [mailto:t...@databricks.com]
*Sent:* Tuesday, March 03, 2015 11:11 PM
*To:* Nastooh Avessta (navesta)
*Cc:* user@spark.apache.org
*Subject:* Re: Spark Streaming Switchover Time
I am confused. Are you killing the 1st worker
.
I really appreciate your help, but it looks like I’m back to the drawing
board on this one.
*From:* Tathagata Das [mailto:t...@databricks.com]
*Sent:* Thursday, March 12, 2015 7:53 PM
*To:* Jose Fernandez
*Cc:* user@spark.apache.org
*Subject:* Re: Handling worker batch processing during
Are you running the driver program (that is, your application process) on
your desktop and trying to run it on the cluster in EC2?
It could very well be a hostname mismatch in some way due to all the
public hostname, private hostname, private IP, firewall craziness of EC2.
You probably have to
Is the number of top K elements you want to keep small? That is, is K
small? In which case, you can
1. either do it in the driver on the array
dstream.foreachRDD { rdd =>
  val topK = rdd.top(K)
  // use topK
}
2. Or, you can use the topK to create another RDD using sc.makeRDD
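For concreteness, a rough sketch (untested; assumes counts is a DStream[(String, Int)] and K is defined):
counts.foreachRDD { rdd =>
  val topK = rdd.top(K)(Ordering.by[(String, Int), Int](_._2)) // small array on the driver
  // Option 2: turn the small driver-side array back into an RDD for further processing.
  val topKRDD = rdd.sparkContext.makeRDD(topK)
  topKRDD.foreach(println)
}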
If you want to access the keys in an RDD that is partitioned by key, then you
can use RDD.mapPartitions(), which gives you access to the whole partition
as an iterator of (key, value) pairs. You have the option of maintaining the
partitioning information or not by setting the preservesPartitioning flag in
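A minimal, untested sketch of that, assuming an RDD of (String, Int) pairs already partitioned by key:
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

val sc = new SparkContext(new SparkConf().setAppName("mapPartitionsExample").setMaster("local[2]"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .partitionBy(new HashPartitioner(4))

// mapPartitions hands you the whole partition as an Iterator[(K, V)];
// preservesPartitioning = true tells Spark the keys were not changed,
// so the existing partitioner is kept.
val result = pairs.mapPartitions(
  iter => iter.map { case (k, v) => (k, v * 10) },
  preservesPartitioning = true)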
, Tathagata Das t...@databricks.com
wrote:
If you want to access the keys in an RDD that is partitioned by key, then
you can use RDD.mapPartitions(), which gives you access to the whole
partition as an iterator of (key, value) pairs. You have the option of maintaining the
partitioning information or not by setting
If you are using tools like SBT/Maven/Gradle/etc., they figure out all the
recursive dependencies and include them in the class path. I haven't
touched Eclipse in years, so I am not sure off the top of my head what's
going on there. Just in case you only downloaded the
spark-streaming_2.10.jar:
you have to include the Scala libraries in the Eclipse dependencies.
TD
On Tue, Mar 10, 2015 at 10:54 AM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I am trying out the streaming example as documented, and I am using Spark 1.2.1
streaming from Maven for Java.
When I add this code I get a compilation
Do you have event logging enabled?
That could be the problem. The Master tries to aggressively recreate the
web UI of the completed job with the event logs (when it is enabled),
causing the Master to stall.
I created a JIRA for this.
https://issues.apache.org/jira/browse/SPARK-6270
On Tue, Mar 10,
  <artifactId>spark-streaming_2.10</artifactId>
  <version>1.2.0</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.1</version>
</dependency>
</dependencies>
On Tue, Mar 10, 2015 at 11:06 AM, Tathagata Das t...@databricks.com
wrote:
If you
As I'd need to restart the server for each partition, that was the basic
reason I was creating the graphDb object outside this loop.
On Fri, Mar 13, 2015 at 5:34 AM, Tathagata Das t...@databricks.com
wrote:
(Putting user@spark back in the to list)
In the gist, you are creating graphDB object way
using Spark
1.2 on CDH 5.3. I stop the application with yarn application -kill appID.
*From:* Tathagata Das [mailto:t...@databricks.com]
*Sent:* Thursday, March 12, 2015 1:29 PM
*To:* Jose Fernandez
*Cc:* user@spark.apache.org
*Subject:* Re: Handling worker batch processing during driver
the AVRO data to multiple files.
Thanks,
Sam
On Mar 12, 2015, at 4:09 AM, Tathagata Das t...@databricks.com wrote:
Why do you have to write a single file?
On Wed, Mar 11, 2015 at 1:00 PM, SamyaMaiti samya.maiti2...@gmail.com
wrote:
Hi Experts,
I have a scenario, where in I want to write
Can you access the batcher directly? Like, is there a handle to get
access to the batchers on the executors by running a task on that executor?
If so, after the streamingContext has been stopped (not the SparkContext),
then you can use `sc.makeRDD()` to run a dummy task like this.
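Something along these lines (a rough, untested sketch of the idea; host names are placeholders):
val executorHosts = Seq("worker-host-1", "worker-host-2")
// makeRDD accepts (value, preferredLocations) pairs, so each element is scheduled
// preferentially on the given host.
val dummyRDD = sc.makeRDD(executorHosts.map(host => (host, Seq(host))))
dummyRDD.foreach { host =>
  // runs on (preferably) the executor on that host; flush/access the batcher here
}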
Why are you repartitioning to 1? That would obviously be slow; you are
converting a distributed operation into a single-node operation.
Also consider using RDD.top(). If you define the ordering right (based on
the count), then you will get the top K across them without doing a shuffle for
sortByKey. Much
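A rough, untested sketch, assuming counts is an RDD[(String, Int)] and K is defined:
// top() keeps only K elements per partition and merges them on the driver,
// so no full shuffle is needed, unlike sortByKey().
val topK: Array[(String, Int)] = counts.top(K)(Ordering.by[(String, Int), Int](_._2))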
Can you show us the code that you are using?
This might help. This is the updated streaming programming guide for 1.3,
soon to be up; this is a quick preview.
http://people.apache.org/~tdas/spark-1.3.0-temp-docs/streaming-programming-guide.html#dataframe-and-sql-operations
TD
On Wed, Mar 11,
Are you sure the multiple invocations are not from previous runs of the
program?
TD
On Fri, Feb 27, 2015 at 12:16 PM, Nastooh Avessta (navesta)
nave...@cisco.com wrote:
Hi
Under Spark 1.0.0, standalone, client mode, I am trying to invoke a 3rd-party
UDP traffic generator from the streaming
*From:* Tathagata Das [mailto:t
You can use DStream.transform() to do any arbitrary RDD transformations on
the RDDs generated by a DStream.
val coalescedDStream = myDStream.transform { _.coalesce(...) }
On Tue, Mar 3, 2015 at 1:47 PM, Saiph Kappa saiph.ka...@gmail.com wrote:
Sorry I made a mistake in my code. Please ignore
Can you elaborate on what is this switchover time?
TD
On Tue, Mar 3, 2015 at 9:57 PM, Nastooh Avessta (navesta) nave...@cisco.com
wrote:
Hi
On a standalone Spark 1.0.0 cluster, with 1 master and 2 workers, operating in
client mode, running a UDP streaming application, I am noting around 2
*From:* Tathagata Das [mailto:t...@databricks.com]
*Sent:* Tuesday, March 03, 2015 10:24 PM
*To:* Nastooh Avessta (navesta)
*Cc:* user@spark.apache.org
*Subject:* Re
That could be a corner case bug. How do you add the 3rd party library to
the class path of the driver? Through spark-submit? Could you give the
command you used?
TD
On Wed, Mar 4, 2015 at 12:42 AM, Emre Sevinc emre.sev...@gmail.com wrote:
I've also tried the following:
Configuration
The file stream does not use a receiver. Maybe that was not clear in the
programming guide. I am updating it for the 1.3 release right now; I will make
it clearer.
And the file stream has full reliability. Read this in the programming guide.
There are different kinds of checkpointing going on. updateStateByKey
requires RDD checkpointing, which can be enabled only by calling
sparkContext.setCheckpointDir. But that does not enable Spark
Streaming driver checkpoints, which are necessary for recovering from driver
failures. That is
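To make the distinction concrete, a rough, untested sketch (paths and host/port are placeholders):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoints"

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(new SparkConf().setAppName("CheckpointExample"), Seconds(10))
  // Enables metadata checkpointing and the RDD/state checkpointing that updateStateByKey needs.
  ssc.checkpoint(checkpointDir)
  ssc.socketTextStream("localhost", 9999).count().print()
  ssc
}

// Driver fault tolerance additionally requires creating the context through
// getOrCreate, so a restarted driver can recover from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()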
, I'm thinking on that line.
The problem is how to send the query to the backend? Bundle an HTTP
server into a Spark Streaming job that will accept the parameters?
--
Nikhil Bafna
On Mon, Feb 23, 2015 at 2:04 PM, Tathagata Das t...@databricks.com
wrote:
You will have to build a split
Spark Streaming already directly supports Kafka
http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources
Is there any reason why that is not sufficient?
TD
On Sun, Feb 22, 2015 at 5:18 PM, mykidong mykid...@gmail.com wrote:
In java, you can see this example:
You could do something like this.
def rddTransformationUsingBroadcast(rdd: RDD[...]): RDD[...] = {
  val broadcastToUse = getBroadcast() // get the reference to a broadcast variable, new or existing
  rdd.map { ... } // use the broadcast variable
}
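For concreteness, a rough, untested sketch of how such a transformation could be wired into a DStream (the blacklist example and names are made up):
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

def filterWithBroadcast(blacklist: Broadcast[Set[String]])(rdd: RDD[String]): RDD[String] =
  rdd.filter(word => !blacklist.value.contains(word)) // the broadcast is read on the executors

def wire(ssc: StreamingContext, words: DStream[String]): DStream[String] = {
  val blacklist = ssc.sparkContext.broadcast(Set("spam", "noise"))
  // transform() applies the RDD-to-RDD function to every batch of the DStream.
  words.transform(filterWithBroadcast(blacklist) _)
}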
Could you find the executor logs on the executor where that task was
scheduled? That may provide more information on what caused the error.
Also take a look at where the block in question was stored, and where the
task was scheduled.
You will need to enable log4j INFO level logs for this
You will have to build a split infrastructure - a front end that takes the
queries from the UI and sends them to the backend, and the backend (running
the Spark Streaming app) will actually run the queries on tables created in
the contexts. The RPCs necessary between the frontend and backend will
The default persistence level is MEMORY_AND_DISK, so the LRU policy would
discard the blocks to disk, so the streaming app will not fail. However,
since things will get constantly read in and out of disk as windows are
processed, the performance won't be great. So it is best to have sufficient
Unless I am unaware of some recent changes, the SparkUI shows stages and
jobs, not accumulator results. And the UI is not designed to be pluggable for
showing user-defined stuff.
TD
On Fri, Feb 20, 2015 at 12:25 AM, Tim Smith secs...@gmail.com wrote:
On Spark 1.2:
I am trying to capture # records
-24 12:58
*To:* Tathagata Das t...@databricks.com
*CC:* user user@spark.apache.org; bit1129 bit1...@163.com
*Subject:* Re: About FlumeUtils.createStream
I see, thanks for the clarification TD.
On 24 Feb 2015 09:56, Tathagata Das t...@databricks.com wrote:
Akhil, that is incorrect.
Spark
In case this mystery has not been solved, DStream.print() essentially does
an RDD.take(10) on each RDD, which computes only a subset of the partitions
in the RDD. But collect() forces the evaluation of all the RDDs. Since you
are writing to JSON in the mapI() function, this could be the reason.
TD
Exactly, that is the reason.
To avoid that, in the to-be-released Spark 1.3, we have added a new Kafka API
(called the direct stream) which does not use Zookeeper at all to keep track of
progress, and maintains offsets within Spark Streaming. That can guarantee
all records being received exactly once. Its
felixcheun...@hotmail.com wrote:
Kafka 0.8.2 has built-in offset management, how would that affect direct
stream in spark?
Please see KAFKA-1012
--- Original Message ---
From: Tathagata Das t...@databricks.com
Sent: February 23, 2015 9:53 PM
To: V Dineshkumar developer.dines...@gmail.com
Cc
26, 2015 at 1:36 AM, Tathagata Das t...@databricks.com
wrote:
Yes. # tuples processed in a batch = sum of all the tuples received by
all the receivers.
In screen shot, there was a batch with 69.9K records, and there was a
batch which took 1 s 473 ms. These two batches can be the same, can
Yes. # tuples processed in a batch = sum of all the tuples received by all
the receivers.
In the screenshot, there was a batch with 69.9K records, and there was a batch
which took 1 s 473 ms. These two batches can be the same or different
batches.
TD
On Wed, Feb 25, 2015 at 10:11 AM, Josh J
Spark Streaming has a new Kafka direct stream, to be released as an
experimental feature with 1.3. That uses a low-level consumer. Not sure if
it satisfies your purpose.
If you want more control, it's best to create your own Receiver with the
low-level Kafka API.
TD
On Tue, Feb 24, 2015 at 12:09 AM,
Hey Mike,
I quickly looked through the example and I found a major performance issue.
You are collecting the RDDs to the driver and then sending them to Mongo in
a foreach. Why not do a distributed push to Mongo?
WHAT YOU HAVE
val mongoConnection = ...
WHAT YOU SHOULD DO
rdd.foreachPartition
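A rough, untested sketch of the distributed version; MongoConnection and createMongoConnection are placeholders standing in for a real Mongo client, not an actual API:
trait MongoConnection { def insert(doc: String): Unit; def close(): Unit }
def createMongoConnection(): MongoConnection = ???

rdd.foreachPartition { records =>
  // one connection per partition, created on the executor instead of the driver
  val connection = createMongoConnection()
  try records.foreach(r => connection.insert(r.toString))
  finally connection.close()
}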
If it is deterministically reproducible, could you generate full DEBUG
level logs, from the driver and the workers and give it to me? Basically I
want to trace through what is happening to the block that is not being
found.
Also, can you tell which cluster manager you are using? Spark Standalone,
Do you have the logs of the driver? Does that give any exceptions?
TD
On Fri, Mar 27, 2015 at 12:24 PM, Chen Song chen.song...@gmail.com wrote:
I ran a spark streaming job.
100 executors
30G heap per executor
4 cores per executor
The version I used is 1.3.0-cdh5.1.0.
The job is reading
-spark . From console I see that spark is trying to
replicate to nodes - nodes show up in Mesos active tasks ... but they
always fail with ClassNotFoundE.
2015-03-27 0:52 GMT+01:00 Tathagata Das t...@databricks.com:
Could you try running a simpler spark streaming program with receiver
(may
(StorageLevel.MEMORY_ONLY_2
etc).
2015-03-27 19:06 GMT+01:00 Tathagata Das t...@databricks.com:
Does it fail with just Spark jobs (using storage levels) on non-coarse
mode?
TD
On Fri, Mar 27, 2015 at 4:39 AM, Ondrej Smola ondrej.sm...@gmail.com
wrote:
More info
when using *spark.mesos.coarse
Yes, that is the correct understanding. There are undocumented parameters
that allow that, but I do not recommend using those :)
TD
On Wed, Mar 25, 2015 at 6:57 AM, Luis Ángel Vicente Sánchez
langel.gro...@gmail.com wrote:
I have a simple and probably dumb question about foreachRDD.
We are
Are you saying that even with the spark.cleaner.ttl set your files are not
getting cleaned up?
TD
On Thu, Apr 2, 2015 at 8:23 AM, andrem amesa...@gmail.com wrote:
Apparently Spark Streaming 1.3.0 is not cleaning up its internal files and
the worker nodes eventually run out of inodes.
We see
What he meant is to look it up in the Spark UI, specifically in the Stages
tab, to see what is taking so long. And yes, a code snippet helps us debug.
TD
On Fri, Apr 3, 2015 at 12:47 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You need to open the Stage's page which is taking time, and see how
Very good question! This is because the current code is written such that
the UI considers a batch as waiting only when it has actually started being
processed. That is, batches waiting in the job queue are not considered in the
calculation. It is arguable that it may be more intuitive to count that
, Tathagata Das t...@databricks.com
wrote:
Very good question! This is because the current code is written such that
the UI considers a batch as waiting only when it has actually started being
processed. That is, batches waiting in the job queue are not considered in the
calculation. It is arguable
Responses inline.
On Mon, Apr 20, 2015 at 3:27 PM, Ankit Patel patel7...@hotmail.com wrote:
What you said is correct, and I am expecting the printlns to be in my
console or my SparkUI. I do not see it in either place.
Can you actually log in to the machine running the executor which runs the
What was the state of your streaming application? Was it falling behind
with a large increasing scheduling delay?
TD
On Thu, Apr 23, 2015 at 11:31 AM, N B nb.nos...@gmail.com wrote:
Thanks for the response, Conor. I tried with those settings and for a
while it seemed like it was cleaning up
Did you check out the latest streaming programming guide?
http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations
You also need to be aware that to convert JSON RDDs to a DataFrame,
sqlContext has to make a pass on the data to learn the schema. This will
is being serialized must mean something. Under which
circumstances would such an exception trigger?
On Tue, Apr 21, 2015 at 4:12 PM, Tathagata Das t...@databricks.com
wrote:
Yeah, I am not sure what is going on. The only way to figure it out is to take a
look at the disassembled bytecode using javap
Is the mylist present on every executor? If not, then you have to pass it
on. And broadcasts are the best way to pass them on. But note that once
broadcasted it will be immutable at the executors, and if you update the list
at the driver, you will have to broadcast it again.
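A rough, untested sketch of that (in Scala, although the thread is PySpark; assumes sc and rdd exist):
var myListBroadcast = sc.broadcast(Seq("a", "b", "c"))

val withList = rdd.map { x =>
  (x, myListBroadcast.value) // executors read the broadcast value; later driver-side changes are not visible
}

// If the list changes on the driver, a new broadcast has to be created and used from then on.
myListBroadcast = sc.broadcast(Seq("a", "b", "c", "d"))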
TD
On Wed, Apr 22, 2015
Furthermore, just to explain, doing arr.par.foreach does not help because
it is not really running anything; it is only doing the setup of the computation.
Doing the setup in parallel does not mean that the jobs will be done
concurrently.
Also, from your code it seems like your pairs of dstreams don't
that.
The puzzling thing is why removing the context bounds solves the
problem... What does this exception mean in general?
On Mon, Apr 20, 2015 at 1:33 PM, Tathagata Das t...@databricks.com
wrote:
When are you getting this exception? After starting the context?
TD
On Mon, Apr 20, 2015 at 10:44
Absolutely. The same code would work for local as well as distributed mode!
On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy
vadim.bichuts...@gmail.com wrote:
Can I use broadcast vars in local mode?
On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das t...@databricks.com
wrote:
Yep
it locally...and I plan to move it to production
on EC2.
The way I fixed it is by doing myrdd.map(lambda x: (x,
mylist)).map(myfunc) but I don't think it's efficient?
mylist is filled only once at the start and never changes.
Vadim
On Wed, Apr 22, 2015 at 1:42 PM, Tathagata Das t
It could very well be that your executor memory is not enough to store the
state RDDs AND operate on the data. 1G per executor is quite low.
Definitely give more memory. And have you tried increasing the number of
partitions (specify number of partitions in updateStateByKey) ?
On Wed, Apr 22,
:
Great. Will try to modify the code. Always room to optimize!
On Wed, Apr 22, 2015 at 2:11 PM, Tathagata Das t...@databricks.com
wrote:
Absolutely. The same code would work for local as well as distributed
mode!
On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy
vadim.bichuts
is unexpectedly null
when DStream is being serialized must mean something. Under which
circumstances would such an exception trigger?
On Tue, Apr 21, 2015 at 4:12 PM, Tathagata Das t...@databricks.com
wrote:
Yeah, I am not sure what is going on. The only way to figure it out is to take a
look
Is the function ingestToMysql running on the driver or on the executors?
Accordingly you can try debugging while running in a distributed manner,
with and without calling the function.
If you don't get too many open files without calling ingestToMysql(), the
problem is likely to be in
Have you taken a look at the join section in the streaming programming
guide?
http://spark.apache.org/docs/latest/streaming-programming-guide.html#stream-dataset-joins
On Wed, Apr 29, 2015 at 7:11 AM, Rendy Bambang Junior
rendy.b.jun...@gmail.com wrote:
Let say I have transaction data and
I believe that the implicit def is pulling the enclosing class (in which
the def is defined) into the closure, and that class is not serializable.
On Wed, Apr 29, 2015 at 4:20 AM, guoqing0...@yahoo.com.hk
guoqing0...@yahoo.com.hk wrote:
Hi guys,
I'm puzzled why I can't use the implicit function in
Also cc'ing Cody.
@Cody, maybe there is a reason for doing connection pooling even if there is
no performance difference.
TD
On Wed, Apr 29, 2015 at 4:06 PM, Tathagata Das t...@databricks.com wrote:
Is the function ingestToMysql running on the driver or on the executors?
Accordingly you can
It could be related to this.
https://issues.apache.org/jira/browse/SPARK-6737
This was fixed in Spark 1.3.1.
On Wed, Apr 29, 2015 at 8:38 AM, Sean Owen so...@cloudera.com wrote:
Not sure what you mean. It's already in CDH since 5.4 = 1.3.0
(This isn't the place to ask about CDH)
I also
in the enclosing class ?
*From:* Tathagata Das t...@databricks.com
*Date:* 2015-04-30 07:00
*To:* guoqing0...@yahoo.com.hk
*CC:* user user@spark.apache.org
*Subject:* Re: implicit function in SparkStreaming
I believe that the implicit def is pulling in the enclosing class (in
which the def
Incorrect. The receiver runs in an executor just like any other task. In
cluster mode, the driver runs in a worker; however, it launches
executors in OTHER workers in the cluster. It's those executors running in
other workers that run tasks, and also the receivers.
On Wed, May 6, 2015 at
This may help.
http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
On Wed, May 6, 2015 at 5:35 PM, Sergio Jiménez Barrio drarse.a...@gmail.com
wrote:
I have a counter column family in Cassandra. I want to update these counters
@Vadim What happened when you tried unioning using DStream.union in python?
TD
On Tue, May 12, 2015 at 9:53 AM, Evo Eftimov evo.efti...@isecc.com wrote:
I can confirm it does work in Java
*From:* Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com]
*Sent:* Tuesday, May 12, 2015 5:53 PM
On Tue, May 12, 2015 at 12:57 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
@Vadim What happened when you tried unioning using DStream.union in
python?
TD
On Tue, May 12, 2015 at 9:53 AM, Evo Eftimov evo.efti...@isecc.com
wrote:
I can confirm it does work in Java
*From:* Vadim
On Wednesday, May 13, 2015 12:02 PM, Tathagata Das t...@databricks.com
wrote:
That is a good question. I don't see a direct way to do that.
You could try the following:
val jobGroupId = "group-id-based-on-current-time"
rdd.sparkContext.setJobGroup(jobGroupId, "approximate count")
val approxCount
.
Thanks,
Du
On Wednesday, May 13, 2015 1:12 PM, Tathagata Das t...@databricks.com
wrote:
That is not supposed to happen :/ That is probably a bug.
If you have the log4j logs, it would be good to file a JIRA. This may be
worth debugging.
On Wed, May 13, 2015 at 12:13 PM, Du Li l...@yahoo
What is the error you are seeing?
TD
On Thu, May 14, 2015 at 9:00 AM, Erich Ess er...@simplerelevance.com
wrote:
Hi,
Is it possible to setup streams from multiple Kinesis streams and process
them in a single job? From what I have read, this should be possible,
however, the Kinesis layer
?
it should work the same way - including union() of streams from totally
different source types (kafka, kinesis, flume).
On Thu, May 14, 2015 at 2:07 PM, Tathagata Das t...@databricks.com
wrote:
What is the error you are seeing?
TD
On Thu, May 14, 2015 at 9:00 AM, Erich Ess er
Marscher [mailto:rmarsc...@localytics.com]
*Sent:* Friday, May 15, 2015 7:20 PM
*To:* Evo Eftimov
*Cc:* Tathagata Das; user
*Subject:* Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond
The doc is a bit confusing IMO, but at least for my application I had to
use a fair pool
It would be good if you could tell me what I should add to the documentation to
make it easier to understand. I can update the docs for the 1.4.0 release.
On Tue, May 12, 2015 at 9:52 AM, Lee McFadden splee...@gmail.com wrote:
Thanks for explaining Sean and Cody, this makes sense now. I'd like to
help
Do you get this failure repeatedly?
On Thu, May 14, 2015 at 12:55 AM, kf wangf...@huawei.com wrote:
Hi all, I got the following error when I ran the unit tests of Spark via
dev/run-tests
on the latest branch-1.4 branch.
the latest commit id:
commit d518c0369fa412567855980c3f0f426cde5c190d
If you wanted to stop it gracefully, then why are you not calling
ssc.stop(stopGracefully = true, stopSparkContext = true)? Then it doesn't
matter whether the shutdown hook was called or not.
TD
On Mon, May 18, 2015 at 9:43 PM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Hi,
If you don't want the fileStream to start until a certain event has
happened, why not start the streamingContext after that event?
TD
On Sun, May 17, 2015 at 7:51 PM, Haopu Wang hw...@qilinsoft.com wrote:
I want to use a file stream as input. And I looked at the Spark Streaming
document again; it's
From the code it seems that as soon as the rdd.countApprox(5000)
returns, you can call pResult.initialValue() to get the approximate count
at that point of time (that is after timeout). Calling
pResult.getFinalValue() will further block until the job is over, and
give the final correct values
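A compact, untested sketch of the two calls being discussed, assuming an existing RDD called rdd:
// countApprox launches the counting job and returns after at most 5000 ms.
val pResult = rdd.countApprox(5000, 0.90)
// Available as soon as countApprox returns, i.e. after at most the timeout.
val approx = pResult.initialValue // a BoundedDouble with confidence bounds
// Blocks until the counting job finishes and returns the exact result.
val exact = pResult.getFinalValue()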
Significant optimizations can be made by doing the joining/cogroup in a
smart way. If you have to join streaming RDDs with the same batch RDD, then
you can first partition the batch RDD using a partitioner and cache it, and
then use the same partitioner on the streaming RDDs. That would make sure
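A rough, untested sketch of that idea (names are made up; assumes sc and a DStream[(String, String)] called streamingPairs):
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(32)

// Partition the static batch RDD once with that partitioner and cache it.
val batchRdd = sc.parallelize(Seq(("user1", "US"), ("user2", "DE")))
  .partitionBy(partitioner)
  .cache()

// Use the SAME partitioner on each streaming RDD, so the join does not
// need to re-shuffle the batch side for every batch.
val joined = streamingPairs.transform { rdd =>
  rdd.partitionBy(partitioner).join(batchRdd)
}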
and RAM allocated for the result RDD
The optimizations mentioned still don’t change the total number of
elements included in the result RDD and RAM allocated – right?
*From:* Tathagata Das [mailto:t...@databricks.com]
*Sent:* Wednesday, April 15, 2015 9:25 PM
*To:* Evo Eftimov
*Cc:* user
:* Tathagata Das [mailto:t...@databricks.com]
*Sent:* Wednesday, April 15, 2015 9:48 PM
*To:* Evo Eftimov
*Cc:* user
*Subject:* Re: RAM management during cogroup and join
Agreed.
On Wed, Apr 15, 2015 at 1:29 PM, Evo Eftimov evo.efti...@isecc.com
wrote:
That has been done Sir
asynchronously--- window1
--- window2
While writing it, I start to believe they can, because windows are
time-triggered, not triggered when previous window has finished... But it's
better to ask:)
2015-04-15 2:08 GMT+02:00 Tathagata Das t
It may be worthwhile to architect the computation in a different way.
dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // do different things for each record based on filters
  }
}
TD
On Sun, Apr 12, 2015 at 7:52 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
I have a
Have you created a class called SQLContextSingleton ? If so, is it in the
compile class path?
On Fri, Apr 10, 2015 at 6:47 AM, Mukund Ranjan (muranjan)
muran...@cisco.com wrote:
Hi All,
Any idea why I am getting this error?
wordsTenSeconds.foreachRDD((rdd: RDD[String], time: Time)
Fundamentally, stream processing systems are designed for processing
streams of data, not for storing large volumes of data for a long period of
time. So if you have to maintain that much state for months, then it's best
to use another system that is designed for long-term storage (like
Cassandra)
What version of Spark are you using? There was a known bug which could be
causing this. It got fixed in Spark 1.3
TD
On Mon, Apr 13, 2015 at 11:44 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
When you say "done fetching documents", do you mean that you are stopping
the streamingContext?
Have you tried marking only spark-streaming-kinesis-asl as not provided,
and the rest as provided? Then you will not even need to add
kinesis-asl.jar in the spark-submit.
TD
On Tue, Apr 14, 2015 at 2:27 PM, Mike Trienis mike.trie...@orcsol.com
wrote:
Richard,
Your response was very helpful
, InitialPositionInStream.LATEST,
      StorageLevel.MEMORY_AND_DISK_2) }

    /* Union all the streams */
    val unionStreams = ssc.union(kinesisStreams).map(byteArray => new String(byteArray))
    unionStreams.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das t
, Apr 6, 2015 at 9:55 AM, Mohit Anchlia mohitanch...@gmail.com
wrote:
Interesting, I see 0 cores in the UI?
- *Cores:* 0 Total, 0 Used
On Fri, Apr 3, 2015 at 2:55 PM, Tathagata Das t...@databricks.com wrote:
What does the Spark Standalone UI at port 8080 say about number of cores
So you want to sort based on the total count of all the records
received through the receiver? In that case, you have to combine all the counts
using updateStateByKey (
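For example, a minimal, untested sketch of keeping a running count per key; it assumes pairs is a DStream[(String, Int)] of (key, 1) records and that ssc.checkpoint(...) has been set (required by updateStateByKey):
// newValues: counts seen in the current batch for a key;
// runningTotal: the accumulated count so far for that key.
def updateTotal(newValues: Seq[Int], runningTotal: Option[Long]): Option[Long] =
  Some(runningTotal.getOrElse(0L) + newValues.sum)

val totals = pairs.updateStateByKey(updateTotal _)

// Sort by the accumulated count within each batch (fine for small result sets).
totals.foreachRDD { rdd =>
  rdd.map(_.swap).sortByKey(ascending = false).take(10).foreach(println)
}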
: 3
processor : 4
processor : 5
processor : 6
processor : 7
On Fri, Apr 3, 2015 at 2:33 PM, Tathagata Das t...@databricks.com wrote:
How many cores are present in the workers allocated to the standalone
cluster spark://ip-10-241-251-232:7077?
On Fri, Apr 3, 2015