You said you downloaded a prebuilt version.
You shouldn't have to mess with maven or building spark at all. All
you need is a jvm, which it looks like you already have installed.
You should be able to follow the instructions at
http://spark.apache.org/docs/latest/
and
No, looks like you'd have to catch them in the serializer and have the
serializer return an Option or something. The new consumer builds a buffer
full of records, not one at a time.
On Mar 8, 2016 4:43 AM, "Marius Soutier" <mps@gmail.com> wrote:
>
> > On 04.03.2016, at
Wanted to survey what people are using the direct stream
messageHandler for, besides just extracting key / value / offset.
Would your use case still work if that argument was removed, and the
stream just contained ConsumerRecord objects?
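For context, a minimal sketch of how the messageHandler argument gets used today (Spark 1.x direct API; ssc, kafkaParams, and fromOffsets are assumed to already be defined):

    // pull topic / partition / offset / value out of each message
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, Int, Long, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) =>
        (mmd.topic, mmd.partition, mmd.offset, mmd.message))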
> processing further stages and fetching
> next batch.
>
> I will start with a higher number of executor cores and see how it goes.
>
> --
> Thanks
> Jatin Kumar | Rocket Scientist
> +91-7696741743 m
>
> On Tue, Mar 1, 2016 at 9:07 PM, Cody Koeninger <c...@koeninger.org> wrote:
What code is triggering the stack overflow?
On Mon, Feb 29, 2016 at 11:13 PM, Vinti Maheshwari
wrote:
> Hi All,
>
> I am getting the below error in my spark-streaming application; I am using kafka
> for the input stream. When I was doing it with a socket, it was working fine. But
> when I
> "How do I keep a balance of executors which receive data from Kafka and
which process data"
I think you're misunderstanding how the direct stream works. The executor
which receives data is also the executor which processes data, there aren't
separate receivers. If it's a single stage worth of
Does this issue involve Spark at all? Otherwise you may have better luck
on a perl or kafka related list.
On Mon, Feb 29, 2016 at 3:26 PM, Vinti Maheshwari
wrote:
> Hi All,
>
> I wrote a kafka producer using the kafka perl api, but I am getting an error when I
> am passing
You're getting confused about what code is running on the driver vs what
code is running on the executor. Read
http://spark.apache.org/docs/latest/programming-guide.html#understanding-closures-a-nameclosureslinka
On Mon, Feb 29, 2016 at 8:00 AM, franco barrientos <
Spark in general isn't a good fit if you're trying to make sure that
certain tasks only run on certain executors.
You can look at overriding getPreferredLocations and increasing the value
of spark.locality.wait, but even then, what do you do when an executor
fails?
On Fri, Feb 26, 2016 at 8:08
in Kafka?
>
> Sent from WPS Mail client
> On Feb 25, 2016 at 11:58 AM, Cody Koeninger <c...@koeninger.org> wrote:
>
> The per partition offsets are part of the rdd as defined on the driver.
> Have you read
>
> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>
>
The per partition offsets are part of the rdd as defined on the driver.
Have you read
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
and/or watched
https://www.youtube.com/watch?v=fXnNEq1v3VA
On Wed, Feb 24, 2016 at 9:05 PM, Yuhang Chen wrote:
he/javax.activation/activation/jars/activation-1.1.jar:javax/activation/ActivationDataFlavor.class*
>
> Here is complete error log:
> https://gist.github.com/Vibhuti/07c24d2893fa6e520d4c
>
>
> Regards,
> ~Vinti
>
> On Wed, Feb 24, 2016 at 12:16 PM, Cody Koeninger <c.
y. I need to check how to use sbt
> assembly.
>
> Regards,
> ~Vinti
>
> On Wed, Feb 24, 2016 at 11:10 AM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> Are you using sbt assembly? That's what will include all of the
>> non-provided dependencies in a s
> libraryDependencies ++= Seq(
> "org.apache.spark" % "spark-streaming_2.10" % "1.5.2",
> "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.5.2"
> )
>
>
>
> Regards,
> ~Vinti
>
> On Wed, Feb 24, 2016 at 9:33 AM, Cody K
Spark streaming is provided; kafka is not.
This build file
https://github.com/koeninger/kafka-exactly-once/blob/master/build.sbt
includes some hacks for ivy issues that may no longer be strictly
necessary, but try that build and see if it works for you.
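For reference, the relevant lines of a build.sbt look roughly like this (the version number is just an example; spark-streaming is provided, the kafka integration jar is not):

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-streaming" % "1.5.2" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.5.2"
    )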
On Wed, Feb 24, 2016 at 11:14 AM, Vinti
That's correct, when you create a direct stream, you specify the
topicpartitions you want to be a part of the stream (the other method for
creating a direct stream is just a convenience wrapper).
On Wed, Feb 24, 2016 at 2:15 AM, 陈宇航 wrote:
> Here I use the
The direct stream will let you do both of those things. Is there a reason
you want to use receivers?
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
http://spark.apache.org/docs/latest/configuration.html#spark-streaming
look for maxRatePerPartition
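For illustration, a minimal direct stream setup (Spark 1.x; the broker address and topic names are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    // one stream can cover multiple topics; rate limiting is per partition
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("topicA", "topicB"))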
On Mon, Feb 22, 2016
> Can this issue be resolved by having a smaller block interval?
>
> Regards,
> Praveen
> On 18 Feb 2016 21:30, "praveen S" <mylogi...@gmail.com> wrote:
>
>> Can having a smaller block interval only resolve this?
>>
>> Regards,
>> Praveen
Backpressure won't help you with the first batch, you'd need
spark.streaming.kafka.maxRatePerPartition
for that
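A sketch of how the two settings combine (the values are examples, not recommendations):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // hard cap per partition, applies to the first batch too
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
      // adapts the rate from the second batch onward
      .set("spark.streaming.backpressure.enabled", "true")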
On Thu, Feb 18, 2016 at 9:40 AM, praveen S wrote:
> Have a look at
>
> spark.streaming.backpressure.enabled
> Property
>
> Regards,
> Praveen
> On 18 Feb 2016
You can print whatever you want wherever you want; it's just a question of
whether it's going to show up in the driver's logs or the various executors' logs.
On Wed, Feb 17, 2016 at 5:50 AM, Cyril Scetbon
wrote:
> I don't think we can print an integer value in a spark streaming
Just use a kafka rdd in a batch job or two, then start your streaming job.
On Wed, Feb 17, 2016 at 12:57 AM, Abhishek Anand
wrote:
> I have a spark streaming application running in production. I am trying to
> find a solution for a particular use case when my
You could use sc.parallelize... but the offsets are already available at
the driver, and they're a (hopefully) small enough amount of data that it's
probably more straightforward to just use the normal cassandra client
to save them from the driver.
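Roughly, that driver-side pattern looks like this (saveOffset is a hypothetical helper wrapping your cassandra client):

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      // this block runs on the driver; offsetRanges is a small array
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach(o => saveOffset(o.topic, o.partition, o.untilOffset))
      // ... then process rdd as usual
    }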
On Tue, Feb 16, 2016 at 1:15 AM, Abhishek
> "500");
> props.put("zookeeper.sync.time.ms", "250");
> props.put("auto.commit.interval.ms", "1000");
>
>
> How can I do the same for the receiver inside spark-streaming for Spark V1.3.1
>
>
> Thanks
>
> Nipun
>
>
>
> On Wed, F
Please don't change the behavior of DirectKafkaInputDStream.
Returning an empty rdd is (imho) the semantically correct thing to do, and
some existing jobs depend on that behavior.
If it's really an issue for you, you can either override
DirectKafkaInputDStream, or just check isEmpty as the first step.
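The isEmpty check is the simpler option, e.g. (process stands in for whatever your job actually does):

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        process(rdd) // skip batches that contain no messages
      }
    }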
It's a pair because there's a key and value for each message.
If you just want a single topic, put a single topic in the map of topic ->
number of partitions.
See
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
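In Scala, the single-topic receiver-based version is roughly this (zookeeper address, group id, and topic are placeholders):

    import org.apache.spark.streaming.kafka.KafkaUtils

    // the stream yields (key, value) pairs; keep just the value
    val lines = KafkaUtils
      .createStream(ssc, "zkhost:2181", "mygroup", Map("mytopic" -> 1))
      .map(_._2)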
On
>> messages processed per event in
>> sparkstreaming web UI. Also I am counting the messages inside
>> foreachRDD.
>> Removed the settings for backpressure but still the same.
>>
>>
>>
>>
>>
>> Sent from Samsung Mobile.
>>
If you're using the direct stream, you have 0 receivers. Do you mean you
have 1 executor?
Can you post the relevant call to createDirectStream from your code, as
well as any relevant spark configuration?
On Thu, Feb 4, 2016 at 8:13 PM, Diwakar Dhanuskodi <
diwakar.dhanusk...@gmail.com> wrote:
2 things:
- you're only attempting to read from a single TopicAndPartition. Since
your topic has multiple partitions, this probably isn't what you want
- you're getting an offset out of range exception because the offset you're
asking for doesn't exist in kafka.
Use the other
Partition=100" --driver-memory 2g
> --executor-memory 1g --class com.tcs.dime.spark.SparkReceiver --files
> /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml,/etc/hadoop/conf/mapred-site.xml,/etc/hadoop/conf/yarn-site.xml,/etc/hive/conf/hive-site.xml
> --jars
> I am counting the messages inside
> foreachRDD.
> Removed the settings for backpressure but still the same.
>
>
>
>
>
> Sent from Samsung Mobile.
>
>
> ---- Original message
> From: Cody Koeninger <c...@koeninger.org>
> Date:06/02/2016 00
own issue?
>
> On Wed, Jan 27, 2016 at 11:52 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> Have you tried spark 1.5?
>>
>> On Wed, Jan 27, 2016 at 11:14 AM, vimal dinakaran <vimal3...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
KafkaRDD will use the standard kafka configuration parameter
refresh.leader.backoff.ms if it is set in the kafkaParams map passed to
createDirectStream.
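For example (broker address and backoff value are illustrative):

    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092",
      // wait 2s before refreshing leader metadata after a failure
      "refresh.leader.backoff.ms" -> "2000")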
On Tue, Feb 2, 2016 at 9:10 PM, Chen Song wrote:
> For Kafka direct stream, is there a way to set the time between
It's possible you could (ab)use updateStateByKey or mapWithState for this.
But honestly it's probably a lot more straightforward to just choose a
reasonable batch size that gets you a reasonable file size for most of your
keys, then use filecrush or something similar to deal with the hdfs small files problem.
That indicates a problem in network communication between the executor and
the kafka broker. Have you done any network troubleshooting?
On Mon, Feb 1, 2016 at 9:59 AM, SRK wrote:
> Hi,
>
> I see the following error in Spark Streaming with Kafka Direct. I think
>
The kafka direct stream doesn't do any explicit caching. I haven't looked
through the underlying simple consumer code in the kafka project in detail,
but I doubt it does either.
Honestly, I'd recommend not using auto created topics (it makes it too easy
to pollute your topics if someone
Have you tried spark 1.5?
On Wed, Jan 27, 2016 at 11:14 AM, vimal dinakaran
wrote:
> Hi,
> I am using spark 1.4 with direct kafka api . In my streaming ui , I am
> able to see the events listed in UI only if add stream.print() statements
> or else event rate and input
Should be socket.timeout.ms on the map of kafka config parameters. The
lack of retry is probably due to the differences between running spark in
local mode vs standalone / mesos / yarn.
On Mon, Jan 25, 2016 at 1:19 PM, Supreeth wrote:
> We are running a Kafka Consumer
Where are you calling checkpointing? Metadata checkpointing for a kafka
direct stream should just be the offsets, not the data.
TD can better speak to reduceByKeyAndWindow behavior when restoring from a
checkpoint, but ultimately the only available choices would be replay the
prior window data
Yes, you should query Kafka if you want to know the latest available
offsets.
There's code to make this straightforward in KafkaCluster.scala, but the
interface isn't public. There's an outstanding pull request to expose the
api at
https://issues.apache.org/jira/browse/SPARK-10963
but frankly
Offsets are stored in the checkpoint. If you want to manage offsets
yourself, don't restart from the checkpoint, specify the starting offsets
when you create the stream.
Have you read / watched the materials linked from
https://github.com/koeninger/kafka-exactly-once
Regarding the small files
6 kafka partitions will result in 6 spark partitions, not 6 spark rdds.
The question of whether you will have a backlog isn't just a matter of
having 1 executor per partition. If a single executor can process all of
the partitions fast enough to complete a batch in under the required time,
you
If you can share an isolated example I'll take a look. Not something I've
run into before.
On Wed, Jan 20, 2016 at 3:53 PM, jpocalan wrote:
> Hi,
>
> I have an application which creates a Kafka Direct Stream from 1 topic
> having 5 partitions.
> As a result each batch is
Looks like this response did go to the list.
As far as OffsetOutOfRange goes, right now that's an unrecoverable error,
because it breaks the underlying invariants (e.g. that the number of
messages in a partition is deterministic once the RDD is defined)
If you want to do some hacking for your
Reading commands from kafka and triggering a redshift copy is sufficiently
simple it could just be a bash script. But if you've already got a spark
streaming job set up, may as well use it for consistency's sake. There's
definitely no need to mess around with akka.
On Fri, Jan 15, 2016 at 6:25
You can't really use spark batches as the basis for any kind of reliable
time aggregation. Time of batch processing in general has nothing to do
with time of event.
You need to filter / aggregate by the time interval you care about, in your
own code, or use a data store that can do the
; my use case) until the following code is executed
> stream.transform { rdd =>
> val wrapped = YourWrapper(cp, rdd)
> wrapped.join(reference)
> }
> In which case it will run through the partitioner of the wrapped RDD when
> it arrives in the cluster for the first time, i.e. no shuffle.
If two rdds have an identical partitioner, joining should not involve a
shuffle.
You should be able to override the partitioner without calling partitionBy.
Two ways I can think of to do this:
- subclass or modify the direct stream and kafkardd. They're private, so
you'd need to rebuild just
> allow Spark to know where the data for each
> partition resides in the cluster.
>
> Thanks,
> Dave.
>
>
> On 13/01/16 16:21, Cody Koeninger wrote:
>
> If two rdds have an identical partitioner, joining should not involve a
> shuffle.
>
> You should be able to override
> offsets using the createDirectStream method. Now
> here I want to get the exact offset that is being picked up by the
> createDirectStream method at the begining of the batch. I need this to
> create an initialRDD.
>
> Please let me know if anything is unclear.
>
> Thanks !!!
>
I'm not 100% sure what you're asking.
If you're asking if it's possible to start a stream at a particular set of
offsets, yes, one of the createDirectStream methods takes a map from
topicpartition to starting offset.
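A sketch of that variant (topic name and offsets are placeholders; this overload also requires a messageHandler):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val fromOffsets = Map(
      TopicAndPartition("mytopic", 0) -> 1234L,
      TopicAndPartition("mytopic", 1) -> 5678L)
    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))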
If you're asking if it's possible to query Kafka for the offset
corresponding
Have you read
http://kafka.apache.org/documentation.html#compaction
On Wed, Jan 6, 2016 at 8:52 AM, Julien Naour wrote:
> Context: Process data coming from Kafka and send back results to Kafka.
>
> Issue: Each events could take several seconds to process (Work in progress
kind of user id. I want
> to process only the last event for each user id, i.e. skip intermediate events by
> user id.
> I have only one Kafka topic with all these events.
>
> Regards,
>
> Julien Naour
>
> Le mer. 6 janv. 2016 à 16:13, Cody Koeninger <c...@koeninger.org>
> Regards,
>
> Julien
>
> Le mer. 6 janv. 2016 à 17:35, Cody Koeninger <c...@koeninger.org> a
> écrit :
>
>> if you don't have hot users, you can use the user id as the hash key for
>> publishing into kafka.
>> That will put all events for a given user in the same partition.
Read the documentation
spark.apache.org/docs/latest/streaming-kafka-integration.html
If you still have questions, read the resources linked from
https://github.com/koeninger/kafka-exactly-once
On Wed, Dec 23, 2015 at 7:24 AM, Vyacheslav Yanuk
wrote:
> Colleagues
>
Honestly it's a lot easier to deal with this using transactions.
Someone else would have to speak to the possibility of getting task
failures added to listener callbacks.
On Sat, Dec 19, 2015 at 5:44 PM, Neelesh wrote:
> Hi,
> I'm trying to build automatic Kafka watermark
in a single stream, the
>> processing delay is as bad as the slowest task in the number of tasks
>> created. Even though the topics are unrelated to each other, the RDD at time
>> "t1" has to wait until the RDD at "t0" is fully executed, even if most
>> cores
> driver to submit the job?
>
> Thanks!
>
> On Mon, Dec 21, 2015 at 8:05 AM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> Spark streaming by default wont start the next batch until the current
>> batch is completely done, even if only a few cores are still working.
Compared to what alternatives?
Honestly, if someone actually read the kafka docs, yet is still having
trouble getting a single test node up and running, the problem is probably
them. Kafka's docs are pretty good.
On Mon, Dec 21, 2015 at 11:31 AM, Andy Davidson <
a...@santacruzintegration.com>
If you're really doing a daily batch job, have you considered just using
KafkaUtils.createRDD rather than a streaming job?
On Fri, Dec 18, 2015 at 5:04 AM, kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:
> Hi All,
>
> Imagine I have a Production spark streaming kafka (direct connection)
., I rely on kafka offsets
> for incremental data, am I right? So no duplicate data will be returned.
>
>
> Thanks
> Sri
>
>
>
>
>
> On Fri, Dec 18, 2015 at 2:41 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> If you're really doing a daily bat
Using spark.streaming.concurrentJobs for this probably isn't a good idea,
as it allows the next batch to start processing before current one is
finished, which may have unintended consequences.
Why can't you use a single stream with all the topics you care about, or
multiple streams if you're
> I am not sure joining them would
> be really efficient, unless you know something that I don't.
>
> As I really don't need any interaction between those streams, I think I
> might end up running 3 different streaming apps instead of one.
>
> Thanks again!
>
> On
I'm a little confused as to why you have fake events rather than just doing
foreachRDD or foreachPartition on your kafka stream and updating the
accumulator there. I'd expect that to run each batch even if the batch had
0 kafka messages in it.
On Thu, Dec 10, 2015 at 2:05 PM, AliGouta
Kafka provides buffering, ordering, decoupling of producers from multiple
consumers. So pretty much any time you have requirements for asynchronous
processing, fault tolerance, and/or a common view of the order of events
across multiple consumers, kafka is worth a look.
Spark provides a much richer
> issue. Any idea what the fix
> version is?
>
> On Wed, Dec 9, 2015 at 11:10 AM Cody Koeninger <c...@koeninger.org> wrote:
>
>> Which version of spark are you on? I thought that was added to the spark
>> UI in recent versions.
>>
>> Direct api doesn't have an
Which version of spark are you on? I thought that was added to the spark
UI in recent versions.
Direct api doesn't have any inherent interaction with zookeeper. If you
need number of messages per batch and aren't on a recent enough version of
spark to see them in the ui, you can get them
Personally, for jobs that I care about I store offsets in transactional
storage rather than checkpoints, which eliminates that problem (just
enforce whatever constraints you want when storing offsets).
Regarding the question of communication of errors back to the
streamingListener, there is an
Just to be clear, spark checkpoints have nothing to do with zookeeper,
they're stored in the filesystem you specify.
On Sun, Dec 6, 2015 at 1:25 AM, manasdebashiskar
wrote:
> When you enable check pointing your offsets get written in zookeeper. If
> you
> program dies or
spark/streaming/kafka/KafkaUtils.scala#L395-L423
>
> Thanks,
> - Alan
>
> On Tue, Dec 1, 2015 at 8:12 AM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> I actually haven't tried that, since I tend to do the offset lookups if
>> necessary.
>>
>> It's
ease consumer rate..or do both ..
>
> What I am trying to say is, the streaming job should not fail in any case.
>
> Dibyendu
>
> On Thu, Dec 3, 2015 at 9:40 AM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> I believe that what differentiates reliable system
There's a 1:1 relationship between Kafka partitions and Spark partitions.
Have you read
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
A direct stream job will use up to spark.executor.cores number of cores.
If you have fewer partitions than cores, there probably won't be
> probably just do some if/else logic and use the basic
> createDirectStream and the more advanced
> createDirectStream(...fromOffsets...) if the offsets for my topic name
> exists in the database. Any reason that wouldn't work? If I don't include
> an offset range for a particular partition
>> If we are OK
>> with at least once processing and use receiver based approach which uses
>> ZooKeeper but not query Kafka directly, would these errors(Couldn't find
>> leader offsets for
>> Set([test_stream,5])))be avoided?
>>
>> On Tue, Dec 1, 2015 at 3:40 PM, Cod
Use the direct stream. You can put multiple topics in a single stream, and
differentiate them on a per-partition basis using the offset range.
On Wed, Dec 2, 2015 at 2:13 PM, dutrow wrote:
> I found the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2388
>
> It
>
> Dibyendu
>
> On Thu, Dec 3, 2015 at 8:26 AM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> No, silently restarting from the earliest offset in the case of offset
>> out of range exceptions during a streaming job is not the "correct way of
>> recovery".
ach...@gmail.com> wrote:
> This consumer which I mentioned does not silently throw away data. If the
> offset is out of range it starts from the earliest offset, and that is the correct
> way of recovery from this error.
>
> Dibyendu
> On Dec 2, 2015 9:56 PM, "Cody Koeninger" <c...@koeninger.or
requirements.
>
>
> 2. Catch that exception and somehow force things to "reset" for that
> partition. And how would it handle the offsets already calculated in the
> backlog (if there is one)?
>
> On Tue, Dec 1, 2015 at 6:51 AM, Cody Koeninger <c...@koeninger.org
Yes, there is a version of createDirectStream that lets you specify
fromOffsets: Map[TopicAndPartition, Long]
On Mon, Nov 30, 2015 at 7:43 PM, Alan Braithwaite
wrote:
> Is there any mechanism in the kafka streaming source to specify the exact
> partition id that we want a
If you're consistently getting offset out of range exceptions, it's
probably because messages are getting deleted before you've processed them.
The only real way to deal with this is give kafka more retention, consume
faster, or both.
If you're just looking for a quick "fix" for an infrequent
of the offsets.
On Tue, Dec 1, 2015 at 9:58 AM, Alan Braithwaite <a...@cloudflare.com>
wrote:
> Neat, thanks. If I specify something like -1 as the offset, will it
> consume from the latest offset or do I have to instrument that manually?
>
> - Alan
>
> On Tue, Dec 1, 2015 at
If you had exactly 1 message in the 0th topicpartition, to read it you
would use
OffsetRange("topicname", 0, 0, 1)
Kafka's simple shell consumer in that case would print
next offset = 1
So trying to consume
OffsetRange("topicname", 0, 1, 2)
instead shouldn't be expected to work.
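To make that concrete, reading that single message as a batch rdd would look like this (kafkaParams assumed defined):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // fromOffset is inclusive, untilOffset exclusive: exactly offset 0
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, Array(OffsetRange("topicname", 0, 0, 1)))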
On Sat,
Can you post the relevant code?
On Fri, Nov 27, 2015 at 4:25 AM, u...@moosheimer.com
wrote:
> Hi,
>
> we have some strange behavior with KafkaUtils DirectStream and the size of
> the MapPartitionsRDDs.
>
> We use a permanent direct stream where we consume about 8.500 json
>
Starting from the checkpoint using getOrCreate should be sufficient if all
you need is at-least-once semantics
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
On Mon, Nov 30, 2015 at 9:38 AM, Guillermo Ortiz
wrote:
> Hello,
>
> I have
>
> On Thu, Nov 12, 2015 at 9:07 AM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> To be blunt, if you care about being able to recover from weird
>> situations, you should be tracking offsets yourself and specifying offsets
>> on job start, not relying on
The direct stream shouldn't silently lose data in the case of a leader
loss. Loss of a leader is handled like any other failure, retrying
up to spark.task.maxFailures
times.
But really if you're losing leaders and taking that long to rebalance you
should figure out what's wrong with your
am partition 88 start
> 221563725. This should not happen, and indicates that messages may have
> been
> lost
>
> On Tue, Nov 24, 2015 at 6:31 AM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> No, the direct stream only communicates with Kafka brokers, not Zookeeper
>>
What exactly do you mean by kafka consumer reporting?
I'd log the offsets in your spark job and try running
kafka-simple-consumer-shell.sh --partition $yourbadpartition --print-offsets
at the same time your spark job is running
On Mon, Nov 23, 2015 at 7:37 PM, swetha
If you really want to just not process the bad topicpartitions, you can use
the version of createDirectStream that takes
fromOffsets: Map[TopicAndPartition, Long]
and exclude the broken topicpartitions from the map.
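Schematically (the topic, offsets, and set of broken partitions are placeholders):

    import kafka.common.TopicAndPartition

    val broken = Set(TopicAndPartition("mytopic", 3))
    val fromOffsets = Map(
      TopicAndPartition("mytopic", 0) -> 100L,
      TopicAndPartition("mytopic", 3) -> 100L
    ).filterKeys(tp => !broken.contains(tp))
    // pass the filtered fromOffsets to createDirectStream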
On Mon, Nov 23, 2015 at 4:54 PM, Hudong Wang wrote:
>
No, that means that at the time the batch was scheduled, the kafka leader
reported the ending offset was 221572238, but during processing, kafka
stopped returning messages before reaching that ending offset.
That probably means something got screwed up with Kafka - e.g. you lost a
leader and lost
> partitioned after mapToPair
> function. It would be great if you could briefly explain (or send me some
> document, I couldn't find it) how shuffle works on the mapToPair function.
>
> Thank you very much.
> On Nov 23, 2015 12:26 AM, "Cody Koeninger" <c...@koening
You're confused about which parts of your code are running on the driver vs
the executor, which is why you're getting serialization errors.
Read
http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
On Fri, Nov 20, 2015 at 1:07 PM, Saiph
Not sure what you mean by "no documentation regarding ways to achieve
effective communication between the 2", but the docs on integrating with
kafka are at
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
As far as custom partitioners go, Learning Spark from O'Reilly has a
Sure, just call count on each rdd and track it in your driver however you
want.
If count is called directly on a kafkardd (e.g. createDirectStream, then
foreachRDD before doing any other transformations), it should just be using
the beginning and ending offsets rather than doing any real work.
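A sketch of counting from the offset ranges alone:

    import org.apache.spark.streaming.kafka.HasOffsetRanges

    stream.foreachRDD { rdd =>
      val count = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        .map(o => o.untilOffset - o.fromOffset).sum
      println(s"batch contained $count messages") // driver-side log
    }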
Are you using the direct stream? Each batch should contain all of the
unprocessed messages for each topic, unless you're doing some kind of rate
limiting.
On Tue, Nov 17, 2015 at 3:07 AM, Antony Mayi
wrote:
> Hi,
>
> I have two streams coming from two different
Ordering would be on a per-partition basis, not global ordering.
You typically want to acquire resources inside the foreachpartition
closure, just before handling the iterator.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
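The usual shape of that pattern (createConnection and send are hypothetical stand-ins for your own client):

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { iter =>
        val conn = createConnection() // opened on the executor, once per partition
        try {
          iter.foreach(record => conn.send(record)) // per-partition order preserved
        } finally {
          conn.close()
        }
      }
    }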
On Mon, Nov
> checkpoint does not have rdd stored beyond 2 hrs, which is my window
> duration. Because of this my job takes more time than usual.
>
> Is there a way or some configuration parameter which would help avoid
> repartitioning twice?
>
> I am attaching the snapshot for the same.
>
>
Unless you change maxRatePerPartition, a batch is going to contain all of
the offsets from the last known processed to the highest available.
Offsets are not time-based, and Kafka's time-based api currently has very
poor granularity (it's based on filesystem timestamp of the log segment).
There's
> checkpoint folder to help the job
> recover? E.g. Go 2 steps back, hoping that kafka has those offsets.
>
> -adrian
>
> From: swetha kasireddy
> Date: Monday, November 9, 2015 at 10:40 PM
> To: Cody Koeninger
> Cc: "user@spark.apache.org"
> Subject: Re: Kafka Direct does
The direct stream will fail the task if there is a problem with the kafka
broker. Spark will retry failed tasks automatically, which should handle
broker rebalances that happen in a timely fashion. spark.task.maxFailures
controls the maximum number of retries before failing the job. Direct
stream
> for monitoring like every 5 minutes and then send an email alert and
> automatically restart the Streaming job by deleting the Checkpoint
> directory. Would that help?
>
>
>
> Thanks!
>
> On Mon, Nov 9, 2015 at 11:09 AM, Cody Koeninger <c...@koeninger.org>
> wrote: