.com
*From:* Haopu Wang hw...@qilinsoft.com
*Date:* 2015-06-19 18:47
*To:* Enno Shioji eshi...@gmail.com; Tathagata Das
t...@databricks.com
*CC:* prajod.vettiyat...@wipro.com; Cody Koeninger c...@koeninger.org;
bit1...@163.com; Jordan Pilat jrpi...@gmail.com; Will Briggs
wrbri...@gmail.com
, this is with
CDH 5.4.1 and Spark 1.3.
On Thu, Jun 18, 2015 at 7:05 PM, Tim Smith secs...@gmail.com wrote:
Thanks for the super-fast response, TD :)
I will now go bug my hadoop vendor to upgrade from 1.3 to 1.4. Cloudera,
are you listening? :D
On Thu, Jun 18, 2015 at 7:02 PM, Tathagata Das
deploy mode
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
On Sat, Jun 20, 2015 at 3:18 AM, Tathagata Das t...@databricks.com
wrote
())) {
rdd.collect().forEach(s -> out.println(s));
}
return null;
});
context.start();
context.awaitTermination();
}
On 17 June 2015 at 17:25, Tathagata Das t...@databricks.com wrote:
The default behavior should be that batch X + 1 starts processing only
Depends on what cluster manager you are using. It's all pretty well
documented in the online documentation.
http://spark.apache.org/docs/latest/submitting-applications.html
On Fri, Jun 19, 2015 at 2:29 PM, anshu shukla anshushuk...@gmail.com
wrote:
Hey ,
*[For Client Mode]*
1- Is there any
I don't think there were any enhancements that can change this behavior.
On Fri, Jun 19, 2015 at 6:16 PM, Tim Smith secs...@gmail.com wrote:
On Fri, Jun 19, 2015 at 5:15 PM, Tathagata Das t...@databricks.com
wrote:
Also, can you find from the spark UI the break up of the stages in each
batch's
vs days.
On Fri, Jun 19, 2015 at 1:40 PM, Tathagata Das t...@databricks.com
wrote:
Yes, please tell us what operation you are using.
TD
On Fri, Jun 19, 2015 at 11:42 AM, Cody Koeninger c...@koeninger.org
wrote:
Is there any more info you can provide / relevant code?
On Fri, Jun 19
If the current documentation is confusing, we can definitely improve the
documentation. However, I do not understand why the term
transactional is confusing. If your output operation has to add 5, then the
user has to implement the following mechanism:
1. If the unique id of the batch of data is
Also, could you give a screenshot of the streaming UI. Even better, could
you run it on Spark 1.4 which has a new streaming UI and then use that for
debugging/screenshot?
TD
On Thu, Jun 18, 2015 at 3:05 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Which version of Spark? And what is your
is approx 1500
tuples/sec).
On Fri, Jun 19, 2015 at 2:50 AM, Tathagata Das t...@databricks.com
wrote:
Couple of ways.
1. Easy but approx way: Find scheduling delay and processing time using
StreamingListener interface, and then calculate end-to-end delay = 0.5 *
batch interval + scheduling
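A minimal sketch of such a listener (assuming the Spark 1.x StreamingListener API; the estimate simply combines the scheduling delay and processing time mentioned above with half a batch interval):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Rough per-batch end-to-end delay estimate:
    //   0.5 * batch interval + scheduling delay + processing time
    class DelayListener(batchIntervalMs: Long) extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        val schedulingDelay = info.schedulingDelay.getOrElse(0L)
        val processingTime = info.processingDelay.getOrElse(0L)
        val endToEnd = 0.5 * batchIntervalMs + schedulingDelay + processingTime
        println(s"Batch ${info.batchTime}: approx end-to-end delay = $endToEnd ms")
      }
    }

    // Register it on an existing StreamingContext (the 2000 ms batch interval is illustrative):
    // ssc.addStreamingListener(new DelayListener(2000))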
Are you using Spark 1.3.x? That explains it. This issue has been fixed in
Spark 1.4.0. As a bonus, you get a fancy new streaming UI with more awesome
stats. :)
On Thu, Jun 18, 2015 at 7:01 PM, Tim Smith secs...@gmail.com wrote:
Hi,
I just switched from createStream to the createDirectStream API for
the off-heap allocation of netty?
Best Regards,
Shixiong Zhu
2015-06-03 11:58 GMT+08:00 Ji ZHANG zhangj...@gmail.com:
Hi,
Thanks for your information. I'll give Spark 1.4 a try when it's released.
On Wed, Jun 3, 2015 at 11:31 AM, Tathagata Das t...@databricks.com
wrote:
Could you try it out
I think you may be including a different version of Spark Streaming in your
assembly. Please mark spark-core and spark-streaming as provided
dependencies. Any installation of Spark will automatically provide Spark in
the classpath so you do not have to bundle it.
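For example, in an sbt-assembly build this could look roughly like the following (versions and connector artifacts are illustrative):

    // build.sbt: keep Spark itself out of the assembly jar
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % "1.4.0" % "provided",
      "org.apache.spark" %% "spark-streaming" % "1.4.0" % "provided",
      // connector libraries such as spark-streaming-kafka stay in compile scope
      "org.apache.spark" %% "spark-streaming-kafka" % "1.4.0"
    )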
On Thu, Jun 18, 2015 at 8:44 AM,
It's not clear what you are asking. Find what among RDD?
On Thu, Jun 18, 2015 at 11:24 AM, anshu shukla anshushuk...@gmail.com
wrote:
Is there any fixed way to find among RDD in stream processing systems,
in the distributed set-up.
--
Thanks Regards,
Anshu Shukla
initial
D-STREAM to final/last D-STREAM.
Help Please !!
On Fri, Jun 19, 2015 at 12:40 AM, Tathagata Das t...@databricks.com
wrote:
It's not clear what you are asking. Find what among RDD?
On Thu, Jun 18, 2015 at 11:24 AM, anshu shukla anshushuk...@gmail.com
wrote:
Is there any fixed way
The default behavior should be that batch X + 1 starts processing only
after batch X completes. If you are using Spark 1.4.0, could you show us a
screenshot of the streaming tab, especially the list of batches? And could
you also tell us if you are setting any SparkConf configurations?
On Wed,
To add more information beyond what Matei said and answer the original
question, here are other things to consider when comparing between Spark
Streaming and Storm.
* Unified programming model and semantics - On most occasions you have to
process the same data again in batch jobs. If you have two
stdin. Also I'm working with Scala.
It would be great if you could talk about the multi-stdin case as well!
Thanks.
From: Tathagata Das t...@databricks.com
Date: Thursday, June 11, 2015 at 8:11 PM
To: Heath Guo heath...@fb.com
Cc: user user@spark.apache.org
Subject: Re: Spark Streaming reads from
BTW, in Spark 1.4 announced today, I added SQLContext.getOrCreate. So you
don't need to create the singleton yourself.
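A minimal sketch of using it from a streaming job (the DStream[String] input and the Record case class are just placeholders):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.dstream.DStream

    case class Record(word: String)

    def process(stream: DStream[String]): Unit = {
      stream.foreachRDD { rdd =>
        // Reuses one SQLContext per SparkContext instead of a hand-rolled singleton
        val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
        import sqlContext.implicits._
        val df = rdd.map(Record(_)).toDF()
        df.registerTempTable("records")
        sqlContext.sql("SELECT word, count(*) FROM records GROUP BY word").show()
      }
    }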
On Wed, Jun 10, 2015 at 3:21 AM, Sergio Jiménez Barrio
drarse.a...@gmail.com wrote:
Note: CCing user@spark.apache.org
First, you must check if the RDD is empty:
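The snippet is cut off here in the archive; the usual check is a take(1) (or rdd.isEmpty() on newer versions) before doing anything else. A hedged sketch:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // Run the supplied work only for batches that actually carry data
    def processNonEmpty[T](stream: DStream[T])(work: RDD[T] => Unit): Unit = {
      stream.foreachRDD { rdd =>
        if (rdd.take(1).nonEmpty) {   // rdd.isEmpty() also works on Spark 1.3+
          work(rdd)
        }
      }
    }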
Let me try to add some clarity to the different thought directions that are
going on in this thread.
1. HOW TO DETECT THE NEED FOR MORE CLUSTER RESOURCES?
If there are no rate limits set up, the most reliable way to detect
whether the current Spark cluster is insufficient to handle the data
Are you going to receive data from one stdin from one machine, or many
stdins on many machines?
On Thu, Jun 11, 2015 at 7:25 PM, foobar heath...@fb.com wrote:
Hi, I'm new to Spark Streaming, and I want to create an application where
Spark Streaming could create DStream from stdin. Basically I
Do you have the event logging enabled?
TD
On Thu, Jun 11, 2015 at 11:24 AM, scar0909 scar0...@gmail.com wrote:
I have the same problem. I realized that the Spark master becomes
unresponsive when we kill the ZooKeeper leader (of course I assigned the
leader election task to ZooKeeper).
Another approach not mentioned is to use a function to get the RDD that is
to be joined. Something like this.
Not sure, but you can try something like this also:
kvDstream.foreachRDD(rdd => {
val rdd = getOrUpdateRDD(params...)
rdd.join(kvFile)
})
The
You could take a look at the RDD *async operations and their source code. Maybe that
can help with getting some early results.
TD
On Fri, Jun 5, 2015 at 8:41 AM, Pietro Gentile
pietro.gentile89.develo...@gmail.com wrote:
Hi all,
what is the best way to perform Spark SQL queries and obtain the result
have missed it.
@TD Is this project still active?
I'm not sure what the status is but it may provide some insights on how to
achieve what you're looking to do.
On Fri, Jun 5, 2015 at 6:34 PM, Tathagata Das t...@databricks.com wrote:
You could take a look at the RDD *async operations, their source code
But compile scope is supposed to be added to the assembly.
https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope
On Thu, Jun 4, 2015 at 1:24 PM, algermissen1971 algermissen1...@icloud.com
wrote:
Hi Iulian,
On 26 May 2015, at 13:04, Iulian
the official website.
I could use Spark 1.0 + YARN, but I can't find a way to handle the logs
(level and rolling), so it'll explode the hard drive.
Currently I'll stick to Spark 1.0 + standalone, until our ops team decides
to upgrade CDH.
On Tue, Jun 2, 2015 at 2:58 PM, Tathagata Das t...@databricks.com
Interesting, only in local[*]! In the GitHub repo you pointed to, what is the
main that you were running?
TD
On Mon, May 25, 2015 at 9:23 AM, rsearle eggsea...@verizon.net wrote:
Further experimentation indicates these problems only occur when master is
local[*].
There are no issues if a
In the receiver-less direct approach, there is no concept of consumer
group, as we don't use the Kafka High Level consumer (that uses ZK). Instead
Spark Streaming manages offsets on its own, giving tighter guarantees. If
you want to monitor the progress of the processing of offsets, you will
have to
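With the direct stream, the offset ranges each batch covers can be read off the RDDs themselves; a hedged sketch (assuming the spark-streaming-kafka HasOffsetRanges interface available since Spark 1.3):

    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    // Log the Kafka offset range each partition of each micro-batch covers
    def logOffsets[T](directStream: DStream[T]): Unit = {
      directStream.foreachRDD { rdd =>
        val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        ranges.foreach { o =>
          println(s"topic=${o.topic} partition=${o.partition} from=${o.fromOffset} until=${o.untilOffset}")
        }
        // process rdd here, then persist the ranges externally if you need to track progress
      }
    }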
in the jar I've specified with the --jars argument on the command
line are available in the REPL.
Cheers
Alex
On Thu, May 28, 2015 at 8:38 AM, Tathagata Das t...@databricks.com
wrote:
Would be great if you guys can test out the Spark 1.4.0 RC2 (RC3 coming
out soon) with Scala 2.11 and report
Would be great if you guys can test out the Spark 1.4.0 RC2 (RC3 coming out
soon) with Scala 2.11 and report issues.
TD
On Tue, May 26, 2015 at 9:15 AM, Koert Kuipers ko...@tresata.com wrote:
we are still running into issues with spark-shell not working on 2.11, but
we are running on somewhat
You can throttle the no receiver direct Kafka stream using
spark.streaming.kafka.maxRatePerPartition
http://spark.apache.org/docs/latest/configuration.html#spark-streaming
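This is a plain SparkConf entry, for example (the 1000 records/sec/partition figure is just illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("ThrottledDirectStream")
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")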
On Wed, May 27, 2015 at 4:34 PM, Ted Yu yuzhih...@gmail.com wrote:
Have you seen
,
Hemant
On Thu, May 21, 2015 at 2:21 AM, Tathagata Das t...@databricks.com
wrote:
Correcting the ones that are incorrect or incomplete. BUT this is a good
list of things to remember about Spark Streaming.
On Wed, May 20, 2015 at 3:40 AM, Hemant Bhanawat hemant9...@gmail.com
wrote:
Hi,
I have
Can you show us the rest of the program? When are you starting or stopping
the context? Is the exception occurring right after start or stop? What
about the log4j logs, what do they say?
On Fri, May 22, 2015 at 7:12 AM, Cody Koeninger c...@koeninger.org wrote:
I just verified that the following
to push the data into
database. Meanwhile, Spark is unable to receive data probably because the
process is blocked.
On Thu, May 21, 2015 at 5:28 PM, Tathagata Das t...@databricks.com
wrote:
Can you elaborate on how the data loss is occurring?
On Thu, May 21, 2015 at 1:10 AM, Gautam Bajaj
Thanks for the JIRA. I will look into this issue.
TD
On Thu, May 21, 2015 at 1:31 AM, Aniket Bhatnagar
aniket.bhatna...@gmail.com wrote:
I ran into one of the issues that are potentially caused by this
and have logged a JIRA bug -
https://issues.apache.org/jira/browse/SPARK-7788
If you cannot push data as fast as you are generating it, then async isn't
going to help either. The work is just going to keep piling up as many,
many async jobs, even though your batch processing times will be low, as that
processing time is not going to reflect how much of the overall work is pending.
Looks like somehow the file size reported by the FSInputDStream of
Tachyon's FileSystem interface is returning zero.
On Mon, May 11, 2015 at 4:38 AM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Just to follow up on this thread further.
I was doing some fault-tolerance testing
Doesn't seem like a Cassandra-specific issue. Could you give us more
information (code, errors, stack traces)?
On Thu, May 21, 2015 at 1:33 PM, tshah77 tejasrs...@gmail.com wrote:
TD,
Do you have any example of reading from Cassandra using Spark Streaming
in Java?
I am trying to connect
Correcting the ones that are incorrect or incomplete. BUT this is good list
for things to remember about Spark Streaming.
On Wed, May 20, 2015 at 3:40 AM, Hemant Bhanawat hemant9...@gmail.com
wrote:
Hi,
I have compiled a list (from online sources) of knobs/design
considerations that need to
If you are talking about handling driver crash failures, then all bets are
off anyways! Adding a shutdown hook in the hope of handling driver process
failure handles only some cases (Ctrl-C), but does not handle cases like
SIGKILL (does not run JVM shutdown hooks) or driver machine crash. So
Has this been fixed for you now? There have been a number of patches since
then and it may have been fixed.
On Thu, May 14, 2015 at 7:20 AM, Wangfei (X) wangf...@huawei.com wrote:
Yes, it happens repeatedly on my local Jenkins.
Sent from my iPhone
On May 14, 2015, at 18:30, Tathagata Das t...@databricks.com wrote:
If you wanted to stop it gracefully, then why are you not calling
ssc.stop(stopGracefully = true, stopSparkContext = true)? Then it doesn't
matter whether the shutdown hook was called or not.
TD
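A minimal sketch of that call, optionally wired to a shutdown hook (which, as noted above, still does not cover SIGKILL or machine crashes):

    import org.apache.spark.streaming.StreamingContext

    def shutdown(ssc: StreamingContext): Unit = {
      // Finish processing everything already received, then tear down the SparkContext too
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }

    // Optionally trigger it on Ctrl-C / SIGTERM:
    // sys.addShutdownHook(shutdown(ssc))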
On Mon, May 18, 2015 at 9:43 PM, Dibyendu Bhattacharya
dibyendu.bhattach...@gmail.com wrote:
Hi,
If you don't want the fileStream to start before a certain event has
happened, why not start the StreamingContext after that event?
TD
On Sun, May 17, 2015 at 7:51 PM, Haopu Wang hw...@qilinsoft.com wrote:
I want to use a file stream as input. And I looked at the Spark Streaming
documentation again; it's
Marscher [mailto:rmarsc...@localytics.com]
*Sent:* Friday, May 15, 2015 7:20 PM
*To:* Evo Eftimov
*Cc:* Tathagata Das; user
*Subject:* Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond
The doc is a bit confusing IMO, but at least for my application I had to
use a fair pool
What is the error you are seeing?
TD
On Thu, May 14, 2015 at 9:00 AM, Erich Ess er...@simplerelevance.com
wrote:
Hi,
Is it possible to set up streams from multiple Kinesis streams and process
them in a single job? From what I have read, this should be possible,
however, the Kinesis layer
?
it should work the same way - including union() of streams from totally
different source types (kafka, kinesis, flume).
On Thu, May 14, 2015 at 2:07 PM, Tathagata Das t...@databricks.com
wrote:
What is the error you are seeing?
TD
On Thu, May 14, 2015 at 9:00 AM, Erich Ess er
It would be good if you could tell me what I should add to the documentation to
make it easier to understand. I can update the docs for 1.4.0 release.
On Tue, May 12, 2015 at 9:52 AM, Lee McFadden splee...@gmail.com wrote:
Thanks for explaining Sean and Cody, this makes sense now. I'd like to
help
Do you get this failure repeatedly?
On Thu, May 14, 2015 at 12:55 AM, kf wangf...@huawei.com wrote:
Hi all, I got the following error when I ran the unit tests of Spark via
dev/run-tests
on the latest branch-1.4 branch.
the latest commit id:
commit d518c0369fa412567855980c3f0f426cde5c190d
On Wednesday, May 13, 2015 12:02 PM, Tathagata Das t...@databricks.com
wrote:
That is a good question. I don't see a direct way to do that.
You could try the following:
val jobGroupId = "group-id-based-on-current-time"
rdd.sparkContext.setJobGroup(jobGroupId)
val approxCount
.
Thanks,
Du
On Wednesday, May 13, 2015 1:12 PM, Tathagata Das t...@databricks.com
wrote:
That is not supposed to happen :/ That is probably a bug.
If you have the log4j logs, it would be good to file a JIRA. This may be
worth debugging.
On Wed, May 13, 2015 at 12:13 PM, Du Li l...@yahoo
@Vadim What happened when you tried unioning using DStream.union in python?
TD
On Tue, May 12, 2015 at 9:53 AM, Evo Eftimov evo.efti...@isecc.com wrote:
I can confirm it does work in Java
*From:* Vadim Bichutskiy [mailto:vadim.bichuts...@gmail.com]
*Sent:* Tuesday, May 12, 2015 5:53 PM
On Tue, May 12, 2015 at 12:57 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
@Vadim What happened when you tried unioning using DStream.union in
python?
TD
On Tue, May 12, 2015 at 9:53 AM, Evo Eftimov evo.efti...@isecc.com
wrote:
I can confirm it does work in Java
*From:* Vadim
From the code it seems that as soon as the rdd.countApprox(5000)
returns, you can call pResult.initialValue() to get the approximate count
at that point of time (that is after timeout). Calling
pResult.getFinalValue() will further block until the job is over, and
give the final correct values
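A small sketch of that flow (the 5000 ms timeout mirrors the example above):

    import org.apache.spark.rdd.RDD

    def approxThenExact(rdd: RDD[_]): Unit = {
      val pResult = rdd.countApprox(timeout = 5000)
      // Available as soon as countApprox returns, i.e. after the timeout
      println(s"approximate count so far: ${pResult.initialValue.mean}")
      // Blocks until the underlying job finishes, then gives the exact value
      println(s"final count: ${pResult.getFinalValue().mean}")
    }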
Incorrect. The receiver runs in an executor just like any other task. In
cluster mode, the driver runs in a worker; however, it launches
executors in OTHER workers in the cluster. It's those executors running in
other workers that run the tasks, and also the receivers.
On Wed, May 6, 2015 at
This may help.
http://www.slideshare.net/helenaedelson/lambda-architecture-with-spark-spark-streaming-kafka-cassandra-akka-and-scala
On Wed, May 6, 2015 at 5:35 PM, Sergio Jiménez Barrio drarse.a...@gmail.com
wrote:
I have a counter column family in Cassandra. I want to update these counters
Is the function ingestToMysql running on the driver or on the executors?
Accordingly you can try debugging while running in a distributed manner,
with and without calling the function.
If you don't get too many open files without calling ingestToMysql(), the
problem is likely to be in
Have you taken a look at the join section in the streaming programming
guide?
http://spark.apache.org/docs/latest/streaming-programming-guide.html#stream-dataset-joins
On Wed, Apr 29, 2015 at 7:11 AM, Rendy Bambang Junior
rendy.b.jun...@gmail.com wrote:
Let say I have transaction data and
I believe that the implicit def is pulling in the enclosing class (in which
the def is defined) into the closure, and that class is not serializable.
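A common workaround is to move the implicit out of the enclosing class into a standalone object and import it where needed, so the closure no longer drags the class along. A rough sketch (names are made up):

    object StringImplicits extends Serializable {
      implicit class RichString(val s: String) {
        def shout: String = s.toUpperCase + "!"
      }
    }

    // Inside a transformation:
    // import StringImplicits._
    // dstream.map(_.shout)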
On Wed, Apr 29, 2015 at 4:20 AM, guoqing0...@yahoo.com.hk
guoqing0...@yahoo.com.hk wrote:
Hi guys,
I'm puzzled why I can't use the implicit function in
Also cc'ing Cody.
@Cody maybe there is a reason for doing connection pooling even if there is
no performance difference.
TD
On Wed, Apr 29, 2015 at 4:06 PM, Tathagata Das t...@databricks.com wrote:
Is the function ingestToMysql running on the driver or on the executors?
Accordingly you can
It could be related to this.
https://issues.apache.org/jira/browse/SPARK-6737
This was fixed in Spark 1.3.1.
On Wed, Apr 29, 2015 at 8:38 AM, Sean Owen so...@cloudera.com wrote:
Not sure what you mean. It's already in CDH since 5.4 = 1.3.0
(This isn't the place to ask about CDH)
I also
in the enclosing class ?
*From:* Tathagata Das t...@databricks.com
*Date:* 2015-04-30 07:00
*To:* guoqing0...@yahoo.com.hk
*CC:* user user@spark.apache.org
*Subject:* Re: implicit function in SparkStreaming
I believe that the implicit def is pulling in the enclosing class (in
which the def
What was the state of your streaming application? Was it falling behind
with a large increasing scheduling delay?
TD
On Thu, Apr 23, 2015 at 11:31 AM, N B nb.nos...@gmail.com wrote:
Thanks for the response, Conor. I tried with those settings and for a
while it seemed like it was cleaning up
Did you checkout the latest streaming programming guide?
http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations
You also need to be aware that to convert JSON RDDs to a DataFrame,
sqlContext has to make a pass over the data to learn the schema. This will
Is the mylist present on every executor? If not, then you have to pass it
on. And broadcasts are the best way to pass them on. But note that once
broadcasted it will be immutable at the executors, and if you update the list
at the driver, you will have to broadcast it again.
TD
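A minimal sketch of that pattern (myList and the refresh trigger are placeholders):

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    def run(sc: SparkContext, myList: Seq[String]): Unit = {
      var listBc: Broadcast[Seq[String]] = sc.broadcast(myList)

      // In transformations, read it with listBc.value, e.g. rdd.map(x => (x, listBc.value))

      // If the list changes on the driver, broadcast a fresh copy; the old one stays immutable
      def refresh(newList: Seq[String]): Unit = {
        listBc.unpersist()
        listBc = sc.broadcast(newList)
      }
    }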
On Wed, Apr 22, 2015
Furthermore, just to explain, doing arr.par.foreach does not help because
it is not really running anything, it is only doing the setup of the computation.
Doing the setup in parallel does not mean that the jobs will be done
concurrently.
Also, from your code it seems like your pairs of dstreams don't
Absolutely. The same code would work for local as well as distributed mode!
On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy
vadim.bichuts...@gmail.com wrote:
Can I use broadcast vars in local mode?
ᐧ
On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das t...@databricks.com
wrote:
Yep
it locally...and I plan to move it to production
on EC2.
The way I fixed it is by doing myrdd.map(lambda x: (x,
mylist)).map(myfunc) but I don't think it's efficient?
mylist is filled only once at the start and never changes.
Vadim
ᐧ
On Wed, Apr 22, 2015 at 1:42 PM, Tathagata Das t
It could very well be that your executor memory is not enough to store the
state RDDs AND operate on the data. 1G per executor is quite low.
Definitely give more memory. And have you tried increasing the number of
partitions (specify the number of partitions in updateStateByKey)?
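For reference, updateStateByKey has an overload that takes the partition count directly; a small sketch (64 partitions is an arbitrary example):

    import org.apache.spark.streaming.StreamingContext._
    import org.apache.spark.streaming.dstream.DStream

    // Running count per key, spreading the state RDD over more partitions.
    // Note: updateStateByKey also requires checkpointing, e.g. ssc.checkpoint("...").
    def runningCounts(pairs: DStream[(String, Int)]): DStream[(String, Int)] = {
      val update: (Seq[Int], Option[Int]) => Option[Int] =
        (values, state) => Some(values.sum + state.getOrElse(0))
      pairs.updateStateByKey(update, numPartitions = 64)
    }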
On Wed, Apr 22,
:
Great. Will try to modify the code. Always room to optimize!
ᐧ
On Wed, Apr 22, 2015 at 2:11 PM, Tathagata Das t...@databricks.com
wrote:
Absolutely. The same code would work for local as well as distributed
mode!
On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy
vadim.bichuts
is unexpectedly null
when DStream is being serialized must mean something. Under which
circumstances, such an exception would trigger?
On Tue, Apr 21, 2015 at 4:12 PM, Tathagata Das t...@databricks.com
wrote:
Yeah, I am not sure what is going on. The only way to figure it out is to take a
look
is being serialized must mean something. Under which
circumstances, such an exception would trigger?
On Tue, Apr 21, 2015 at 4:12 PM, Tathagata Das t...@databricks.com
wrote:
Yeah, I am not sure what is going on. The only way to figure it out is to take a
look at the disassembled bytecodes using javap
that.
The puzzling thing is why removing the context bounds solves the
problem... What does this exception mean in general?
On Mon, Apr 20, 2015 at 1:33 PM, Tathagata Das t...@databricks.com
wrote:
When are you getting this exception? After starting the context?
TD
On Mon, Apr 20, 2015 at 10:44
Responses inline.
On Mon, Apr 20, 2015 at 3:27 PM, Ankit Patel patel7...@hotmail.com wrote:
What you said is correct and I am expecting the printlns to be in my
console or my SparkUI. I do not see it in either places.
Can you actually log in to the machine running the executor which runs the
Significant optimizations can be made by doing the joining/cogroup in a
smart way. If you have to join streaming RDDs with the same batch RDD, then
you can first partition the batch RDDs using a partitioner and cache them, and
then use the same partitioner on the streaming RDDs. That would make sure
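Roughly, the pattern looks like this (types and the partition count are illustrative):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    def joinWithStatic(
        stream: DStream[(String, String)],
        static: RDD[(String, String)]): DStream[(String, (String, String))] = {
      val partitioner = new HashPartitioner(32)
      // Partition the batch side once and cache it, so it is not reshuffled on every batch
      val staticPartitioned = static.partitionBy(partitioner).cache()

      stream.transform { batchRdd =>
        batchRdd.partitionBy(partitioner).join(staticPartitioned)
      }
    }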
and RAM allocated for the result RDD
The optimizations mentioned still don’t change the total number of
elements included in the result RDD and RAM allocated – right?
*From:* Tathagata Das [mailto:t...@databricks.com]
*Sent:* Wednesday, April 15, 2015 9:25 PM
*To:* Evo Eftimov
*Cc:* user
:* Tathagata Das [mailto:t...@databricks.com]
*Sent:* Wednesday, April 15, 2015 9:48 PM
*To:* Evo Eftimov
*Cc:* user
*Subject:* Re: RAM management during cogroup and join
Agreed.
On Wed, Apr 15, 2015 at 1:29 PM, Evo Eftimov evo.efti...@isecc.com
wrote:
That has been done Sir
asynchronously--- window1
--- window2
While writing it, I start to believe they can, because windows are
time-triggered, not triggered when previous window has finished... But it's
better to ask:)
2015-04-15 2:08 GMT+02:00 Tathagata Das t
It may be worthwhile to architect the computation in a different way.
dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // do different things for each record based on filters
  }
}
TD
On Sun, Apr 12, 2015 at 7:52 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
I have a
Fundamentally, stream processing systems are designed for processing
streams of data, not for storing large volumes of data for a long period of
time. So if you have to maintain that much state for months, then it's best
to use another system that is designed for long term storage (like
Cassandra)
What version of Spark are you using? There was a known bug which could be
causing this. It got fixed in Spark 1.3
TD
On Mon, Apr 13, 2015 at 11:44 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
When you say done fetching documents, does it mean that you are stopping
the streamingContext?
Have you tried marking only spark-streaming-kinesis-asl as not provided,
and the rest as provided? Then you will not even need to add
kinesis-asl.jar in the spark-submit.
TD
On Tue, Apr 14, 2015 at 2:27 PM, Mike Trienis mike.trie...@orcsol.com
wrote:
Richard,
Your response was very helpful
Have you created a class called SQLContextSingleton ? If so, is it in the
compile class path?
On Fri, Apr 10, 2015 at 6:47 AM, Mukund Ranjan (muranjan)
muran...@cisco.com wrote:
Hi All,
Any idea why I am getting this error?
wordsTenSeconds.foreachRDD((rdd: RDD[String], time: Time)
Coalesce tries to reduce the number of partitions into smaller number of
partitions, without moving the data around (as much as possible). Since
most of received data is in a few machines (those running receivers),
coalesce just makes bigger merged partitions in those.
Without coalesce
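For the opposite effect, a sketch of spreading the received data across the cluster with repartition (which, unlike coalesce, does a shuffle):

    import org.apache.spark.streaming.dstream.DStream

    // Shuffle each batch across `parallelism` partitions so work is not stuck on receiver machines
    def spreadOut[T](received: DStream[T], parallelism: Int): DStream[T] =
      received.repartition(parallelism)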
Machine
Responses inline. Hope they help.
On Thu, Apr 9, 2015 at 8:20 AM, Amit Assudani aassud...@impetus.com wrote:
Hi Friends,
I am trying to solve a use case in Spark Streaming; I need help
getting to the right approach for looking up / updating the master data.
Use case ( simplified )
I’ve a
, Tathagata Das t...@databricks.com
wrote:
If it is deterministically reproducible, could you generate full DEBUG
level logs, from the driver and the workers and give it to me? Basically I
want to trace through what is happening to the block that is not being
found.
And can you tell what Cluster
Well, you are running in local mode, so it cannot find another peer to
replicate the blocks received from receivers. That's it. It's not a real
concern, and that error will go away when you run it in a cluster.
On Thu, Apr 9, 2015 at 11:24 AM, Nandan Tammineedi nan...@defend7.com
wrote:
Hi,
There are a couple of options. Increase timeout (see Spark configuration).
Also see past mails in the mailing list.
Another option you may try (I have a gut feeling that it may work, but I am not
sure) is calling GC on the driver periodically. The cleaning up of stuff is
tied to GCing of RDD objects
as the driver?
Thanks
NB
On Wed, Apr 8, 2015 at 1:29 PM, Tathagata Das t...@databricks.com wrote:
It does take effect on the executors, not on the driver. That is okay
because executors have all the data and therefore have GC issues, not so
usually for the driver. If you want to double
Aah yes. The jsonRDD method needs to walk through the whole RDD to
understand the schema, and does not work if there is no data in it. Making
sure there is data in it using take(1) should work.
TD
What is the computation you are doing in the foreachRDD, that is throwing
the exception?
One way to guard against it is to do a take(1) to see if you get back any
data. If there is none, then don't do anything with the RDD.
TD
On Wed, Apr 8, 2015 at 1:08 PM, Vadim Bichutskiy
, InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
}
/* Union all the streams */
val unionStreams = ssc.union(kinesisStreams).map(byteArray => new String(byteArray))
unionStreams.print()
ssc.start()
ssc.awaitTermination()
}
}
On Fri, Apr 3, 2015 at 3:48 PM, Tathagata Das t
, Apr 6, 2015 at 9:55 AM, Mohit Anchlia mohitanch...@gmail.com
wrote:
Interesting, I see 0 cores in the UI?
- *Cores:* 0 Total, 0 Used
On Fri, Apr 3, 2015 at 2:55 PM, Tathagata Das t...@databricks.com wrote:
What does the Spark Standalone UI at port 8080 say about number of cores
So you want to sort based on the total count of all the records
received through the receiver? In that case, you have to combine all the counts
using updateStateByKey (
What he meant is: look it up in the Spark UI, specifically in the Stages
tab, to see what is taking so long. And yes, a code snippet helps us debug.
TD
On Fri, Apr 3, 2015 at 12:47 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You need to open the page of the stage which is taking time, and see how
Very good question! This is because the current code is written such that
the UI considers a batch as waiting only when it has actually started being
processed. That is, batches waiting in the job queue are not considered in the
calculation. It is arguable that it may be more intuitive to count that
, Tathagata Das t...@databricks.com
wrote:
Very good question! This is because the current code is written such that
the UI considers a batch as waiting only when it has actually started being
processed. That is, batches waiting in the job queue are not considered in the
calculation. It is arguable
: 3
processor : 4
processor : 5
processor : 6
processor : 7
On Fri, Apr 3, 2015 at 2:33 PM, Tathagata Das t...@databricks.com wrote:
How many cores are present in the workers allocated to the standalone
cluster spark://ip-10-241-251-232:7077 ?
On Fri, Apr 3, 2015
I am afraid not. The whole point of Spark Streaming is to make it easy to
do complicated processing on streaming data while interoperating with core
Spark, MLlib, and SQL without the operational overheads of maintaining 4 different
systems. As a slight cost of achieving that unification, there may be some
be added in at some point.
On Fri, Apr 3, 2015 at 12:47 PM, Tathagata Das t...@databricks.com
wrote:
A sort-of-hacky workaround is to use a queueStream where you can manually
create RDDs (using sparkContext.hadoopFile) and insert them into the queue. Note
that this is for testing only as queueStream does