(streamTuple._1, SaveMode.Append)
}
}
On Tue, Jul 28, 2015 at 3:42 PM, Tathagata Das t...@databricks.com
wrote:
I don't think anyone has really run 500 text streams.
And parSequences do nothing out there, you are only parallelizing the setup
code which does not really compute anything. Also it sets up 500 foreachRDD
operations that will get executed in each batch sequentially, so it does not
make sense. The
If you are trying to keep such long term state, it will be more robust in
the long term to use a dedicated data store (cassandra/HBase/etc.) that is
designed for long term storage.
On Tue, Jul 28, 2015 at 4:37 PM, swetha swethakasire...@gmail.com wrote:
Hi TD,
We have a requirement to
Rather than using accumulator directly, what you can do is something like
this to lazily create an accumulator and use it (will get lazily recreated
if driver restarts from checkpoint)
dstream.transform { rdd =>
  val accum = SingletonObject.getOrCreateAccumulator() // singleton object method to ...
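A fuller sketch of that singleton pattern (the object and accumulator names are made up for this example, not from the original message):

import org.apache.spark.{Accumulator, SparkContext}

object DroppedRecordsCounter {
  @volatile private var instance: Accumulator[Long] = null
  def getOrCreate(sc: SparkContext): Accumulator[Long] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          // created lazily, so it is rebuilt (not restored from checkpoint) after a driver restart
          instance = sc.accumulator(0L, "DroppedRecords")
        }
      }
    }
    instance
  }
}

dstream.transform { rdd =>
  val accum = DroppedRecordsCounter.getOrCreate(rdd.sparkContext)
  rdd.map { r => accum += 1; r }   // example use: count records seen per batch
}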
sc.saveAs) and then run a modified version of the job that skips Stage
0, assuming you have a full understanding of the breakdown of stages in
your job.
On Tue, Jul 28, 2015 at 9:28 PM, Tathagata Das t...@databricks.com
wrote:
Okay, maybe I am confused on the word would be useful to *restart
There is a known issue that Kafka cannot return the leader if there is no data
in the topic. I think it was raised in another thread in this forum. Is
that the issue?
On Wed, Jul 29, 2015 at 10:38 AM, unk1102 umesh.ka...@gmail.com wrote:
Hi I have Spark Streaming code which streams from Kafka
Yes, and that is indeed the problem. It is trying to process all the data
in Kafka, and therefore taking 60 seconds. You need to set the rate limits
for that.
On Thu, Jul 30, 2015 at 8:51 AM, Cody Koeninger c...@koeninger.org wrote:
If you don't set it, there is no maximum rate, it will get
For the first time it needs to list them. After that, the list should be
cached by the file stream implementation (as far as I remember).
On Thu, Jul 30, 2015 at 3:55 PM, Brandon White bwwintheho...@gmail.com
wrote:
Is this a known bottleneck for Spark Streaming textFileStream? Does it
need
You have to read the original Spark paper to understand how RDD lineage
works.
https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
On Thu, Jul 30, 2015 at 9:25 PM, Ted Yu yuzhih...@gmail.com wrote:
Please take a look at:
core/src/test/scala/org/apache/spark/CheckpointSuite.scala
I do suggest that the non-Spark-related discussions be taken to a different
forum, as they do not directly contribute to the contents of this user
list.
On Thu, Jul 30, 2015 at 8:52 PM, UMESH CHAUDHARY umesh9...@gmail.com
wrote:
Thanks for the valuable suggestion.
I also started with
To clarify, that is the number of executors requested by the SparkContext
from the cluster manager.
On Tue, Jul 28, 2015 at 5:18 PM, amkcom amk...@gmail.com wrote:
try sc.getConf.getInt("spark.executor.instances", 1)
--
bit1...@163.com
*From:* bit1...@163.com
*Date:* 2015-07-31 13:11
*To:* Tathagata Das tathagata.das1...@gmail.com; yuzhihong
yuzhih...@gmail.com
*CC:* user user@spark.apache.org
*Subject:* Re: Re: How RDD lineage works
Thanks TD and Zhihong for the guide
:
Tathagata,
Could the bottleneck possibly be the number of executor nodes in our
cluster? Since we are creating 500 DStreams based off 500 textFile
directories, do we need at least 500 executors / nodes to be receivers for
each one of the streams?
On Tue, Jul 28, 2015 at 6:09 PM, Tathagata Das t
You have to partition the data in Spark Streaming by the primary key,
and then make sure you insert the data into Cassandra atomically per key, or per
set of keys in the partition. You can use the combination of the (batch
time, partition id) of the RDD inside foreachRDD as the unique id for
the
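A minimal sketch of that idea (the actual write call is hypothetical; the (batch time, partition id) bookkeeping is the point):

dstream.foreachRDD { (rdd, batchTime) =>
  rdd.foreachPartition { records =>
    val partitionId = org.apache.spark.TaskContext.get.partitionId()
    val uniqueId = s"${batchTime.milliseconds}-$partitionId"
    // upsert the partition's records keyed by uniqueId, so a re-executed task
    // overwrites instead of duplicating, e.g. writeAtomically(uniqueId, records)
  }
}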
If you are using the same RDDs in both attempts to run the job, the
previous stage outputs generated in the previous job will indeed be reused.
This applies to core Spark, though. For DataFrames, depending on what you do, the
physical plan may get generated again, leading to new RDDs which may
Can you tell us more about the streaming app? Which DStream operations are you
using?
On Sun, Aug 2, 2015 at 9:14 PM, Anand Nalya anand.na...@gmail.com wrote:
Hi,
I'm writing a Streaming application in Spark 1.3. After running for some
time, I'm getting the following exception. I'm sure that no other
StreamingContext.stop(stopGracefully = true) stops the streaming context
gracefully.
Then you can safely terminate the Spark cluster. They are two different
steps and need to be done separately, ensuring that the driver process has
been completely terminated before the Spark cluster is the
previous batch interval's
offsets..
On Thu, Jul 30, 2015 at 2:49 AM, Tathagata Das t...@databricks.com
wrote:
Rather than using accumulator directly, what you can do is something like
this to lazily create an accumulator and use it (will get lazily recreated
if driver restarts from
in logic, in my case sleep(t.s.) is not
working.) So I used to kill the coarse-grained job at each slave by script.
Please suggest something.
On Thu, Jul 30, 2015 at 5:14 AM, Tathagata Das t...@databricks.com
wrote:
StreamingContext.stop(stopGracefully = true) stops the streaming context
If you are running on a cluster, the listening is occurring on one of the
executors, not in the driver.
On Mon, Aug 10, 2015 at 10:29 AM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I am trying to run this program as a yarn-client. The job seems to be
submitting successfully however I don't
:
I am running as a yarn-client, which probably means that the program that
submitted the job is where the listening is also occurring? I thought that
YARN is only used to negotiate resources in yarn-client master mode.
On Mon, Aug 10, 2015 at 11:34 AM, Tathagata Das t...@databricks.com
wrote
I think this may be expected. When the streaming context is stopped without
the SparkContext, then the receivers are stopped by the driver. The
receiver sends back the message that it has been stopped. This is being
(probably incorrectly) logged with ERROR level.
On Sun, Aug 9, 2015 at 12:52 AM,
on or not.
On Mon, Aug 10, 2015 at 12:43 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:
I've verified all the executors and I don't see a process listening on the
port. However, the application seems to show as running in the YARN UI
On Mon, Aug 10, 2015 at 11:56 AM, Tathagata Das t
-with-dependencies.jar localhost
/user/ec2-user/checkpoint/ /user/ec2-user/out
Even though I am running as local I see it being scheduled and managed by
yarn.
On Mon, Aug 10, 2015 at 12:56 PM, Tathagata Das t...@databricks.com
wrote:
Is it receiving any data? If so, then it must
In general, it is a little risky to put long running stuff in a shutdown
hook as it may delay shutdown of the process which may delay other things.
That said, you could try it out.
A better way to explicitly shutdown gracefully is to use an RPC to signal
the driver process to start shutting down,
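One common variant of that external signal, sketched here with a hypothetical HDFS marker file and polling interval:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.streaming.StreamingContext

def awaitShutdownMarker(ssc: StreamingContext, markerPath: String): Unit = {
  new Thread("shutdown-poller") {
    override def run(): Unit = {
      val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
      while (!fs.exists(new Path(markerPath))) Thread.sleep(10000)
      // stop accepting new batches, let in-flight batches finish, then exit
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }
  }.start()
}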
July 2015 at 08:19, anshu shukla anshushuk...@gmail.com wrote:
Yes, I was doing the same. If you mean that this is the correct way to do it,
then I will verify it once more in my case.
On Thu, Jul 30, 2015 at 1:02 PM, Tathagata Das t...@databricks.com
wrote:
How is sleep not working? Are you
at end of each batch and if set then
gracefully stop the application ?
On Tue, Aug 11, 2015 at 12:43 AM, Tathagata Das t...@databricks.com
wrote:
In general, it is a little risky to put long running stuff in a shutdown
hook as it may delay shutdown of the process which may delay other things
You are correct. The earlier Kinesis receiver (as of Spark 1.4) was not
saving checkpoints correctly and was in general not reliable (even with WAL
enabled). We have improved this in Spark 1.5 with updated Kinesis receiver,
that keeps track of the Kinesis sequence numbers as part of the Spark
A hacky workaround is to create a custom InputDStream that creates the
right RDDs based on a function. The TestInputDStream
https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala#L61
does something similar for Spark Streaming unit
way to kill the app after stopping the context?
On Tue, Aug 11, 2015 at 2:55 AM, Shushant Arora
shushantaror...@gmail.com wrote:
Thanks!
On Tue, Aug 11, 2015 at 1:34 AM, Tathagata Das t...@databricks.com
wrote:
1. RPC can be done in many ways, and a web service is one of many ways.
An even
Yes.
On Wed, Aug 12, 2015 at 12:12 PM, Mohit Anchlia mohitanch...@gmail.com
wrote:
Thanks! To write to HDFS, do I need to use the saveAs method?
On Wed, Aug 12, 2015 at 12:01 PM, Tathagata Das t...@databricks.com
wrote:
This is how Spark does it. It writes the task output to a uniquely-named
You could very easily strip out the BlockGenerator code from the Spark
source code and use it directly in the same way the Reliable Kafka Receiver
uses it. BTW, you should know that we will be deprecating the receiver
based approach for the Direct Kafka approach. That is quite flexible, can
give
under the hood? If
the complete batch is not loaded, will using a custom size of 100-200 requests in
one batch and posting help, instead of the whole partition?
On Wed, Jul 22, 2015 at 12:18 AM, Tathagata Das t...@databricks.com
wrote:
If you can post multiple items at a time, then use foreachPartition
For Java, do
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
If you fix that error, you should be seeing data.
You can call arbitrary RDD operations on a DStream, using
DStream.transform. Take a look at the docs.
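For example, a small sketch of transform (the lines stream is hypothetical); sortByKey is an RDD-only operation, which is exactly what transform gives you access to:

val sortedCounts = lines.transform { rdd =>
  rdd.flatMap(_.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)
     .sortByKey()   // arbitrary RDD operation, not available on DStream directly
}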
For the direct kafka approach you are doing,
- tasks
With DirectKafkaStream there are two approaches.
1. You increase the number of Kafka partitions; Spark will automatically
read in parallel.
2. If that's not possible, then explicitly repartition only if there are
more cores in the cluster than the number of Kafka partitions, AND the
first map-like
From what I understand about your code, it is getting data from different
partitions of a topic - get all data from partition 1, then from partition
2, etc. Though you have configured it to read from just one partition
(topicCount has count = 1). So I am not sure what your intention is, read
all
You could contact the authors of the spark-packages.. maybe that will help?
On Mon, Jul 20, 2015 at 6:41 AM, Jeetendra Gangele gangele...@gmail.com
wrote:
Thanks Todd,
I'm not sure whether somebody has used it or not. Can somebody confirm whether
it integrates nicely with Spark Streaming?
On
If you can post multiple items at a time, then use foreachPartition to post
the whole partition in a single request.
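A sketch of that pattern (the post call itself is hypothetical):

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val batch = records.toSeq
    if (batch.nonEmpty) {
      // postBatch(batch)   // one request for the whole partition instead of one per record
    }
  }
}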
On Tue, Jul 21, 2015 at 9:35 AM, Richard Marscher rmarsc...@localytics.com
wrote:
You can certainly create threads in a map transformation. We do this to do
concurrent DB
Most shuffle files are really kept around in the OS's buffer/disk cache, so
it is still pretty much in memory. If you are concerned about performance,
you have to do a holistic comparison for end-to-end performance. You could
take a look at this.
Currently that is the best way.
On Thu, Jul 23, 2015 at 12:51 AM, JoneZhang joyoungzh...@gmail.com wrote:
My Spark Streaming on Kafka application is running on Spark 1.3.
I want to upgrade Spark to 1.4 now.
How to deal with the spark streaming application?
Save the kafka topic partition
Yep :)
On Tue, Jul 14, 2015 at 2:44 PM, algermissen1971 algermissen1...@icloud.com
wrote:
On 14 Jul 2015, at 23:26, Tathagata Das t...@databricks.com wrote:
Just to be clear, you mean the Spark Standalone cluster manager's
master and not the application's driver, right.
Sorry, by now I
Just to be clear, you mean the Spark Standalone cluster manager's master
and not the application's driver, right.
In that case, the earlier responses are correct.
TD
On Tue, Jul 14, 2015 at 11:26 AM, Mohammed Guller moham...@glassbeam.com
wrote:
The master node does not have to be similar to
Make sure you provide the filterFunction with the invertible
reduceByKeyAndWindow. Otherwise none of the keys will get removed, and the
key space will continue to increase. This is what is leading to the lag. So
use the filtering function to filter out the keys that are not needed any
more.
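For reference, a minimal sketch of the invertible form with a filter function (the counts stream and window sizes are assumptions):

import org.apache.spark.streaming.Seconds

val windowedCounts = counts.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // add counts entering the window
  (a: Int, b: Int) => a - b,   // subtract counts leaving the window
  Seconds(600), Seconds(60),
  filterFunc = (kv: (String, Int)) => kv._2 > 0   // drop keys whose count has fallen to zero
)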
On Thu,
[Apologies for repost, for those who have seen this response already in the
dev mailing list]
1. When you set ssc.checkpoint(checkpointDir), the spark streaming
periodically saves the state RDD (which is a snapshot of all the state
data) to HDFS using RDD checkpointing. In fact, a streaming app
, Chen Song chen.song...@gmail.com wrote:
Thanks TD.
As for 1), if timing is not guaranteed, how are exactly-once semantics
supported? It feels like exactly-once receiving is not necessarily
exactly-once processing.
Chen
On Tue, Jul 14, 2015 at 10:16 PM, Tathagata Das t...@databricks.com
This is a known race condition - root cause of SPARK-5681
https://issues.apache.org/jira/browse/SPARK-5681
On Mon, Jul 13, 2015 at 3:35 AM, Juan Rodríguez Hortalá
juan.rodriguez.hort...@gmail.com wrote:
Hi,
I have noticed that when StreamingContext.stop is called when no receiver
has
PM, Tathagata Das t...@databricks.com
wrote:
In the receiver based approach, If the receiver crashes for any
reason (receiver crashed or executor crashed) the receiver should get
restarted on another executor and should start reading data from the
offset
present in the zookeeper
Why is .remember not ideal?
On Sun, Jul 12, 2015 at 7:22 PM, Brandon White bwwintheho...@gmail.com
wrote:
Hi Yin,
Yes there were no new rows. I fixed it by doing a .remember on the
context. Obviously, this is not ideal.
On Sun, Jul 12, 2015 at 6:31 PM, Yin Huai yh...@databricks.com wrote:
I do not recommend using IndexRDD for state management in Spark Streaming.
What it does not solve out-of-the-box is checkpointing of indexRDDs, which is
important because long-running streaming jobs can lead to an infinite chain of
RDDs. Spark Streaming solves it for the updateStateByKey operation which
High Level API.
Regards,
Dibyendu
On Sat, Jun 27, 2015 at 3:04 PM, Tathagata Das
t...@databricks.com wrote:
In the receiver based approach, If the receiver crashes for any
reason (receiver crashed or executor crashed) the receiver should
get
restarted on another executor and should
This has been discussed in a number of threads in this mailing list. Here
is a summary.
1. Processing of batch T+1 always starts after all the processing of batch
T has completed. But here a batch is defined by the data of all the receivers
running in the system receiving within the batch
You can do this.
// global variables to keep track of the latest batch
var latestTime: Time = null
var latestRDD: RDD[_] = null
dstream.foreachRDD((rdd: RDD[_], time: Time) => {
  latestTime = time
  latestRDD = rdd
})
Now you can asynchronously access the latest RDD. However if you are going
to run jobs on the
Your streaming job may have been seemingly running ok, but the DStream
checkpointing must have been failing in the background. It would have been
visible in the log4j logs. In 1.4.0, we enabled fast-failure for that so
that checkpointing failures don't get hidden in the background.
The fact that
/SPARK-7180 and squashes the following commits:
On Wed, Jul 15, 2015 at 2:37 PM, Chen Song chen.song...@gmail.com wrote:
Thanks
Can you point me to the patch to fix the serialization stack? Maybe I can
pull it in and rerun my job.
Chen
On Wed, Jul 15, 2015 at 4:40 PM, Tathagata Das t
to provide sufficient
information as part of the value so that you can take that decision in
the filter function.
As always, thanks for your help
Nikunj
On Thu, Jul 16, 2015 at 1:16 AM, Tathagata Das t...@databricks.com
wrote:
MAke sure you provide the filterFunction with the invertible
Yes. More information in my talk -
https://www.youtube.com/watch?v=d5UJonrruHk
On Fri, Jul 17, 2015 at 1:15 AM, JoneZhang joyoungzh...@gmail.com wrote:
I see now.
There are three steps in Spark Streaming + Kafka data processing:
1. Receiving the data
2. Transforming the data
3. Pushing out the
Spark 1.4.0 added shutdown hooks in the driver to cleanly shutdown the
Sparkcontext in the driver, which would shutdown the executors. I am not
sure whether this is related or not, but somehow the executor's shutdown
hook is being called.
Can you check the driver logs to see if driver's shutdown
It was added, but its not documented publicly. I am planning to change the
name of the conf to spark.streaming.fileStream.minRememberDuration to make
it easier to understand
On Mon, Jul 13, 2015 at 9:43 PM, Terry Hole hujie.ea...@gmail.com wrote:
A new configuration named
The Scala version of the Kafka integration is something that we have been working on
for a while, and is likely to be more optimized than the Python one. The
Python one definitely requires passing the data back and forth between the JVM and
the Python VM and decoding the raw bytes to Python strings (probably less
What do you mean by checking when a "DStream is empty"? A DStream represents
an endless stream of data, and at any point of time checking whether it is
empty or not does not make sense.
FYI, there is RDD.isEmpty()
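So the check is done per batch, e.g.:

dstream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // process only the non-empty batches
  }
}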
On Wed, Oct 21, 2015 at 10:03 AM, diplomatic Guru
wrote:
>
If you just want to control the number of reducers, then setting the
numPartitions is sufficient. If you want to control the exact partitioning
scheme (that is, some scheme other than hash-based), then you need to
implement a custom partitioner. It can be used to improve data skew, etc.,
which
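A minimal sketch of a custom partitioner (the prefix-based rule is just an example, not from the thread):

import org.apache.spark.Partitioner

class PrefixPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    val h = key match {
      case s: String => s.take(2).hashCode   // route by a short prefix of the key
      case other     => other.hashCode
    }
    ((h % numPartitions) + numPartitions) % numPartitions   // keep the result non-negative
  }
}

// pairRdd.reduceByKey(new PrefixPartitioner(32), _ + _)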
Your observation is correct! The current implementation of checkpointing to
DynamoDB is tied to the presence of new data from Kinesis (I think that
emulates the KCL behavior), if there is no data for while, the
checkpointing does not occur. That explains your observation.
I have filed a JIRA to
(Long, String)])] =
> { val grpdRecs = rdd.groupByKey(); val srtdRecs =
> grpdRecs.mapValues[(List[(Long, String)])](iter =>
> iter.toList.sortBy(_._1)) srtdRecs }
>
> rdd.reduceByKey((a, b) => {
> (Math.max(a._1, b._1), (a._2 ++ b._2))
> })
>
>
>
> On Tue, Oct 2
, String)])](iter =>
> iter.toList.sortBy(_._1)) srtdRecs }
>
>
> On Tue, Oct 27, 2015 at 6:47 PM, Tathagata Das <t...@databricks.com>
> wrote:
>
>> if you specify the same partitioner (custom or otherwise) for both
>> partitionBy and groupBy, then may be it wi
Yeah, of course. Just create an RDD from jdbc, call cache()/persist(), then
force it to be evaluated using something like count(). Once it is cached,
you can use it in a StreamingContext. Because of the cache it should not
access JDBC any more.
On Tue, Oct 27, 2015 at 12:04 PM, diplomatic Guru
However, if your executor dies, then it may reconnect to JDBC to
reconstruct the RDD partitions that were lost. To prevent that, you can
checkpoint the RDD to an HDFS-like filesystem (using rdd.checkpoint()). Then
you are safe; it won't reconnect to JDBC.
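Putting both replies together, a sketch of the pattern (connection URL, query, bounds, and row mapping are all hypothetical):

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

val lookup = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:postgresql://host/db"),
  "SELECT id, name FROM ref WHERE id >= ? AND id <= ?",
  1L, 1000000L, 10,   // lower bound, upper bound, number of partitions
  rs => (rs.getLong(1), rs.getString(2)))
lookup.cache()
lookup.checkpoint()   // requires sc.setCheckpointDir(...) to have been set
lookup.count()        // force evaluation so the cache and checkpoint are populated
// The cached/checkpointed RDD can now be reused inside DStream operations.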
On Tue, Oct 27, 2015 at 11:17 PM, Tathagata
Yes, the workaround is the same as has been suggested in the JIRA for
accumulator and broadcast variables. Basically, make a singleton object
which lazily initializes the HBaseContext. Because it is a singleton, it won't
get serialized through the checkpoint. After recovering, it will be
reinitialized
UpdateStateByKey automatically caches its RDDs.
On Tue, Oct 27, 2015 at 8:05 AM, diplomatic Guru
wrote:
>
> Hello All,
>
> When I checked my running Stream job on WebUI, I can see that some RDDs
> are being listed that were not requested to be cached. What more is that
to
> run it from eclipse.
>
> Is there any problem running the application from eclipse ?
>
>
>
> On 9 November 2015 at 12:27, Tathagata Das <t...@databricks.com> wrote:
>
>> How are you submitting the spark application?
>> You are supposed to submit the f
The reason the existing dynamic allocation does not work out of the box for
Spark Streaming is that the heuristic used for deciding when to scale
up/down is not the right one for micro-batch workloads. It works great for
typical batch workloads. However, you can use the underlying developer API
How are you submitting the spark application?
You are supposed to submit the fat-jar of the application that include the
spark-streaming-twitter dependency (and its subdeps) but not
spark-streaming and spark-core.
On Mon, Nov 9, 2015 at 1:02 AM, أنس الليثي wrote:
> I
You have to run some action after rdd.checkpoint() for the checkpointing
to actually occur. Have you done that?
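That is, something like the following (the checkpoint directory is hypothetical):

sc.setCheckpointDir("hdfs:///tmp/checkpoints")
rdd.checkpoint()
rdd.count()   // checkpoint files are only written once an action forces evaluation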
On Fri, Oct 30, 2015 at 3:10 PM, Panos Str wrote:
> Hi all!
>
> Here's a part of a Scala recursion that produces a stack overflow after
> many
> recursions. I've
Spark Streaming automatically takes care of unpersisting any RDDs generated
by a DStream. You can use StreamingContext.remember() to set the minimum
persistence duration. Any persisted RDD older than that will be
automatically unpersisted
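For example, to keep at least the last 10 minutes of generated RDDs around (the duration is just an example):

import org.apache.spark.streaming.Minutes
ssc.remember(Minutes(10))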
On Thu, Nov 5, 2015 at 9:12 AM, swetha kasireddy
The Kinesis integration underneath uses the KCL libraries which takes a
minute or so sometimes to spin up the threads and start getting data from
Kinesis. That is under normal conditions. In your case, it could be
happening that because of your killing and restarting, the restarted KCL
may be
y around
>> lossless processing of kinesis records in Spark-1.5 + kinesis-asl-1.5 when
>> jobs are aborted. Any pointers or quick explanation would be very helpful.
>>
>>
>> On Tue, Oct 13, 2015 at 4:04 PM, Tathagata Das <t...@databricks.com>
>> wrote:
>>
Aaah, interesting, you are doing 15 minute slide duration. Yeah, internally
the streaming scheduler waits for the last "batch" interval which has data
to be processed, but if there is a sliding interval (i.e. 15 mins) that is
higher than batch interval, then that might not be run. This is indeed a
Its hard to help without any stacktrace associated
with UnsupportedOperationException.
On Thu, Oct 15, 2015 at 10:40 PM, Chandra Mohan, Ananda Vel Murugan <
ananda.muru...@honeywell.com> wrote:
> One of my co-worker(Yogesh) was trying to get this posted in spark mailing
> and it seems it did not
o-generated method stub
>
> if (state.toString() == null)
>     return Optional.of(events);
> else {
>     // UnsupportedOperationException here
>     return Optional.of(events.add(state.toString()););
>
>
RDD and DF are not compatible data types. So you cannot return a DF when
you have to return an RDD. What you can do instead is return the underlying
RDD of the DataFrame via dataframe.rdd().
On Fri, Oct 16, 2015 at 12:07 PM, Jason White
wrote:
> Hi Ken, thanks for
all the executors or just the same executor?
>
> On Fri, Oct 16, 2015 at 5:49 PM, Tathagata Das <t...@databricks.com>
> wrote:
>
>> Setting a ttl is not recommended any more as Spark works with Java GC to
>> clean up stuff (RDDs, shuffles, broadcasts,etc.) that are not
If Cassandra is down, does saveToCassandra throw an exception? If it does,
you can catch that exception and write your own logic to retry and/or not
update. Once the foreachRDD function completes, that batch will be
internally marked as completed.
TD
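A sketch of that retry logic (saveToCassandra is from the Cassandra connector and assumes its implicits are imported; the retry count and back-off are arbitrary):

dstream.foreachRDD { rdd =>
  var attempts = 0
  var done = false
  while (!done && attempts < 3) {
    try {
      rdd.saveToCassandra("ks", "table")
      done = true
    } catch {
      case e: Exception =>
        attempts += 1
        Thread.sleep(5000)   // back off and retry; the batch is not marked complete until this function returns
    }
  }
}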
On Mon, Oct 19, 2015 at 5:48 AM, varun sharma
Unfortunately the title on the JIRA is extremely confusing. I have fixed it.
The reason why dynamic allocation does not work well with streaming is that
the heuristic that is used to automatically scale up or down the number of
executors works for the pattern of task schedules in batch jobs, not
Each DStream creates one RDD per batch.
On Mon, Oct 19, 2015 at 4:39 AM, vaibhavrtk wrote:
ersion from RDD -> DF involves a `.take(10)` in PySpark, even if
> you provide the schema, so I was avoiding back-and-forth conversions. I’ll
> see if I can create a ‘trusted’ conversion that doesn’t involve the `take`.
>
> --
> Jason
>
> On October 19, 2015 at 5:23:59
You can set the max cores for the first submitted job such that it does not
take all the resources from the master. See
http://spark.apache.org/docs/latest/submitting-applications.html
# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
--class
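Equivalently in code, a sketch of capping an application's cores on a standalone cluster (the app name and the value 8 are just examples):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("FirstApp")
  .set("spark.cores.max", "8")   // leave cores free for later applications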
again.
>>
>> Thanks
>> Varun
>>
>> On Tue, Oct 20, 2015 at 2:50 AM, Tathagata Das <t...@databricks.com>
>> wrote:
>>
>>> If cassandra is down, does saveToCassandra throw an exception? If it
>>> does, you can catch that exception and
eep on retrying?
> Also If some batch fail, I want to block further batches to be processed
> as it would create inconsistency in updation of zookeeper offsets and maybe
> kill the job itself after lets say 3 retries.
>
> Any pointers to achieve same are appreciated.
>
> On Wed, Oct 2
Setting a TTL is not recommended any more, as Spark works with the Java GC to
clean up stuff (RDDs, shuffles, broadcasts, etc.) that is no longer referenced.
So you can keep an RDD cached in Spark, and every minute uncache the
previous one, and cache a new one.
TD
On Fri, Oct 16, 2015 at 12:02
Well, reduceByKey needs to shuffle if your intermediate data is not
already partitioned in the same way as reduceByKey's partitioning.
reduceByKey() has other signatures that take in a partitioner, or simply
number of partitions. So you can set the same partitioner as your previous
stage.
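A sketch of reusing the partitioner (the keying of records and the partition count are assumptions):

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(64)
val keyed = records.map(r => (r.userId, 1)).partitionBy(partitioner).cache()
val counts = keyed.reduceByKey(partitioner, _ + _)   // same partitioner, so no extra shuffle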
Since it is about encryption and decryption, it's also good to know how the raw
data is actually saved on disk. If the write-ahead log is enabled, then the
raw data will be saved to the WAL in HDFS. You probably do not want to save
decrypted data in that. So it's better not to decrypt in the receiver,
Are you sure that there are not log4j errors in the driver logs? What if
you try enabling debug level? And what does the streaming UI say?
On Mon, Oct 12, 2015 at 12:50 PM, Uthayan Suthakar <
uthayan.sutha...@gmail.com> wrote:
> Any suggestions? Is there anyway that I could debug this issue?
>
Make sure the broadcast variable works independent of the streaming
application. Then make sure it works without using
StreamingContext.getOrCreate(). That will disambiguate whether that error
is thrown when starting a new context, or when recovering a context from
checkpoint (as getOrCreate is
he executor (since the number of messages
> in the rdd is known already)
>
> do a foreachPartition and println or count the iterator manually.
>
> On Tue, Oct 6, 2015 at 6:02 PM, Tathagata Das <t...@databricks.com> wrote:
>
>> Are sure that this is not related to Cassandra in
Given the interest, I am also inclining towards making it a public
developer API. Maybe even experimental. Cody, mind submitting a patch?
On Tue, Oct 6, 2015 at 7:45 AM, Sean Owen wrote:
> For what it's worth, I also use this class in an app, but it happens
> to be from
Are you sure that this is not related to the Cassandra inserts? Could you just do
foreachRDD { _.count } instead to keep Cassandra out of the picture and
then test this again.
On Tue, Oct 6, 2015 at 12:33 PM, Adrian Tanase wrote:
> Also check if the Kafka cluster is still balanced.
How many executors and cores do you acquire?
td
On Thu, Oct 8, 2015 at 6:11 PM, Bharath Mukkati
wrote:
> Hi Spark Users,
>
> I am testing my application on Spark 1.5 and kinesis-asl-1.5. The
> streaming application starts but I see a ton of stages scheduled for
>
http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
Recently it has been merged into HBase
https://issues.apache.org/jira/browse/HBASE-13992
There are other options to use. See spark-packages.org.
On Fri, Oct 9, 2015 at 4:33 PM, Xiao Li wrote:
>
>
> On Fri, Oct 9, 2015 at 2:18 PM, Tathagata Das <t...@databricks.com> wrote:
>
>> Can you provide the before stop and after restart log4j logs for this?
>>
>> On Fri, Oct 9, 2015 at 2:13 PM, Spark Newbie <sparknewbie1...@gmail.com>
>> wrote:
>&
Can you provide the before stop and after restart log4j logs for this?
On Fri, Oct 9, 2015 at 2:13 PM, Spark Newbie
wrote:
> Hi Spark Users,
>
> I'm seeing checkpoint restore failures causing the application startup to
> fail with the below exception. When I do "ls"
Is this happening too often? Is it slowing things down or blocking
progress? Failures once in a while are part of the norm, and the system
should take care of itself.
On Tue, Oct 13, 2015 at 2:47 PM, Spark Newbie
wrote:
> Hi Spark users,
>
> I'm seeing the below