Hi,
I am trying to read data from Confluent Kafka using the Avro schema registry.
Messages are always empty and the stream always shows empty records. Any
suggestions on this, please?
Thanks,
Asmath
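For reference, a rough Scala sketch of two things that commonly make a Confluent Avro stream look "empty" (broker, topic and schema below are made up, and it assumes the spark-sql-kafka-0-10 and spark-avro packages are on the classpath): the default startingOffsets of "latest", and the 5-byte Confluent header that the plain from_avro function cannot parse.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}
import org.apache.spark.sql.avro.functions.from_avro  // Spark 3.x; in 2.4 it lives in org.apache.spark.sql.avro

val spark = SparkSession.builder.appName("ConfluentAvroRead").getOrCreate()

// Normally fetched from the schema registry REST API; inlined here for the sketch.
val avroSchemaJson =
  """{"type":"record","name":"Example","fields":[{"name":"id","type":"string"}]}"""

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "mytopic")
  .option("startingOffsets", "earliest")  // default is "latest", which shows nothing until new messages arrive
  .load()

// Confluent-serialized values carry a 5-byte prefix (magic byte + schema id)
// that Spark's plain from_avro does not understand, so strip it first.
val parsed = raw
  .select(expr("substring(value, 6, length(value) - 5)").as("payload"))
  .select(from_avro(col("payload"), avroSchemaJson).as("data"))

parsed.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/avro-demo-chk")
  .start()
  .awaitTermination()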
Hi,
this is my Word Count demo. https://github.com/kevincmchen/wordcount
MohitAbbi wrote on Wed, Nov 4, 2020 at 3:32 AM:
> Hi,
>
> Can you please share the correct versions of JAR files which you used to
> resolve the issue. I'm also facing the same issue.
>
> Thanks
Hi,
Can you please share the correct versions of JAR files which you used to
resolve the issue. I'm also facing the same issue.
Thanks
Hi folks,
Thought I would ask here because it's somewhat confusing. I'm using Spark
2.4.5 on EMR 5.30.1 with Amazon MSK.
The version of Scala used is 2.11.12. I'm using this version of the
libraries spark-streaming-kafka-0-8_2.11-2.4.5.jar
Now I'm wanting to read from Kafka topics using Python
Hi
I was able to correct the issue. It was caused by the wrong version of a JAR
file. I removed those JAR files, copied in the correct versions, and the error
has gone away.
Regards
I can't reproduce this. Could you please make sure you're running spark-shell
with the official Spark 3.0.0 distribution? Please try changing the directory
and using a relative path like "./spark-shell".
On Thu, Jul 2, 2020 at 9:59 PM dwgw wrote:
Hi
I am trying to stream a Kafka topic from spark-shell but I am getting the
following error.
I am using *spark 3.0.0/scala 2.12.10* (Java HotSpot(TM) 64-Bit Server VM,
*Java 1.8.0_212*)
*[spark@hdp-dev ~]$ spark-shell --packages
org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0*
Ivy Default Cache
Hi,
Why is the `value` column in a streamed dataframe obtained from a Kafka topic
natively of binary type (see the table at
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html)
if in fact it holds a string with the message's data and we CAST it as string
anyway?
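The Kafka source just hands back whatever bytes the broker stored; Kafka itself has no notion of the payload's encoding (it could be Avro, protobuf, JSON, ...), so Spark leaves deserialization to the query and only exposes key/value as binary. A minimal sketch of the usual cast (broker and topic names are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("cast-demo").getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "topic1")
  .load()

// The cast is only meaningful because *we* know the payload is UTF-8 text;
// for Avro/JSON payloads you would use from_avro / from_json instead.
val strings = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")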
When I terminate a Spark streaming application and restart it, it always gets
stuck at this step:
>
> Revoking previously assigned partitions [] for group [mygroup]
> (Re-)joining group [mygroup]
If I use a new group id it works fine, but then I may lose the data from the
last time I read the topic.
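Not sure what is blocking the (re-)join itself, but if the worry is losing your position when switching group ids: with the spark-streaming-kafka-0-10 direct stream you can commit each batch's offsets back to Kafka yourself, so a restart with the same group.id resumes where it left off. A rough sketch (broker, topic and group names are made up):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val ssc = new StreamingContext(new SparkConf().setAppName("resume-demo"), Seconds(5))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "mygroup",                        // keep the same group across restarts
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("mytopic"), kafkaParams))

stream.foreachRDD { rdd =>
  // capture this batch's offsets before any transformation
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process rdd and write the results out ...

  // commit back to Kafka once the output is done, so a restart with the
  // same group.id resumes from here instead of losing its position
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

ssc.start()
ssc.awaitTermination()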
The expected setting to clean these files is:
- spark.sql.streaming.minBatchesToRetain
More info on structured streaming settings:
https://github.com/jaceklaskowski/spark-structured-streaming-book/blob/master/spark-sql-streaming-properties.adoc
Hi Spark Users ! :)
I come to you with a question about checkpoints.
I have a streaming application that consumes from and produces to Kafka.
The computation requires a window and watermarking.
Since this is a streaming application with a Kafka output, a checkpoint is
expected.
The application runs
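For what it's worth, a minimal shape of such a job (topic, broker and path names below are made up); the checkpointLocation on the Kafka sink is what the recovery relies on:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, window}

val spark = SparkSession.builder.appName("windowed-kafka-to-kafka").getOrCreate()

val words = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS word", "timestamp")

// Windowed count; the watermark bounds how long state is kept for late data.
val counts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes"), col("word"))
  .agg(count(col("word")).as("cnt"))

// The Kafka sink needs a `value` column and a checkpoint location for recovery.
counts.selectExpr("to_json(struct(*)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "word_counts")
  .option("checkpointLocation", "/path/to/checkpoint")
  .outputMode("update")
  .start()
  .awaitTermination()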
How can I handle a Kafka error with my DirectStream (network issue, ZooKeeper
or broker going down)? For example, when the consumer fails to connect to Kafka
(at startup) I only get a DEBUG log (not even an ERROR) and no exception is
thrown...
I'm using Spark 2.1.1 and spark-streaming
On 26 Apr 2017, at 19:26, Cody Koeninger <c...@koeninger.org> wrote:

> What is it you're actually trying to accomplish?
>
> You can get topic, partition, and offset bounds from an offset range like
> http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#obtaining-offsets

> Timestamp isn't really a meaningful idea for a range of offsets.
>
> On Tue, Apr 25, 2017 at 2:43 PM, Dominik Safaric
> <dominiksafa...@gmail.com> wrote:
Hi all,
Because the Spark Streaming direct Kafka consumer maps offsets for a given
Kafka topic and partition internally while having enable.auto.commit set to
false, how can I retrieve the offset of each of the consumer's poll calls using
the offset ranges of an RDD? More precisely
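The pattern from the "obtaining offsets" section of that page, roughly: each batch's RDD carries one OffsetRange per Kafka partition describing exactly what that poll covered. A sketch (here `stream` stands for the DStream returned by createDirectStream):

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.TaskContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

// `stream` is the InputDStream returned by KafkaUtils.createDirectStream.
def logOffsets(stream: InputDStream[ConsumerRecord[String, String]]): Unit =
  stream.foreachRDD { rdd =>
    // One OffsetRange per Kafka partition in this batch.
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd.foreachPartition { _ =>
      val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
      println(s"${o.topic} ${o.partition} offsets ${o.fromOffset} -> ${o.untilOffset}")
    }
  }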
Ok,
Thanks for your answers
On 3/22/17, 1:34 PM, "Cody Koeninger" wrote:
If you're talking about reading the same message multiple times in a
failure situation, see
https://github.com/koeninger/kafka-exactly-once
If you're talking about producing the same message multiple times in a
failure situation, keep an eye on
You have to handle de-duplication upstream or downstream. It might
technically be possible to handle this in Spark but you'll probably have a
better time handling duplicates in the service that reads from Kafka.
On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart
wrote:
>
Hi,
We are trying to build a Spark streaming solution that subscribes to and pushes
to Kafka.
But we are running into the problem of duplicate events.
Right now, I am doing a "forEachRdd", looping over the messages of each
partition and sending those messages to Kafka.
Is there any good way of solving this?
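For the sending side, the usual shape is one producer per partition per batch, something like the sketch below (broker and serializer settings are illustrative). Note it is still at-least-once -- a retried task can resend the same records -- which is why, as said above, de-duplication has to live upstream or downstream:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.streaming.dstream.DStream

// Writes each (key, value) record of the stream to Kafka, one producer per
// partition per batch. A failed/retried task can send the same records again.
def writeToKafka(stream: DStream[(String, String)], topic: String): Unit =
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      val props = new Properties()
      props.put("bootstrap.servers", "broker:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      try {
        records.foreach { case (k, v) => producer.send(new ProducerRecord(topic, k, v)) }
      } finally {
        producer.close()  // flushes pending sends before the task finishes
      }
    }
  }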
Thanks TD.
On Tue, Mar 14, 2017 at 4:37 PM, Tathagata Das wrote:
> This setting allows multiple spark jobs generated through multiple
> foreachRDD to run concurrently, even if they are across batches. So output
> op2 from batch X, can run concurrently with op1 of batch X+1
Hi,
You can enable backpressure to handle this.
spark.streaming.backpressure.enabled
spark.streaming.receiver.maxRate
Thanks,
Edwin
On Mar 18, 2017, 12:53 AM -0400, sagarcasual . <sagarcas...@gmail.com>, wrote:
Hi, we have Spark 1.6.1 streaming from a Kafka (0.10.1) topic using the direct
approach. The streaming part works fine, but when we initially start the job we
have to deal with a really huge Kafka message backlog, millions of messages,
and that first batch runs for over 40 hours, and after 12 hours or so
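A sketch of the knobs usually combined for this (values are illustrative): for the direct approach it is the per-partition rate cap, rather than the receiver maxRate, that bounds that first giant batch, and backpressure then adjusts the later batches.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-backlog")
  // let Spark adapt the ingestion rate once it has seen a few batches
  .set("spark.streaming.backpressure.enabled", "true")
  // hard cap per Kafka partition (records/sec) so the first batch stays bounded
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

val ssc = new StreamingContext(conf, Seconds(10))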
This setting allows multiple spark jobs generated through multiple
foreachRDD to run concurrently, even if they are across batches. So output
op2 from batch X, can run concurrently with op1 of batch X+1
This is not safe because it breaks the checkpointing logic in subtle ways.
Note that this was
Thanks TD for the response. Can you please provide more explanation? I have
multiple streams in the Spark streaming application (Spark 2.0.2 using
DStreams). I know many people use this setting, so your explanation will help a
lot of people.
Thanks
On Fri, Mar 10, 2017 at 6:24 PM,
That config is not safe. Please do not use it.
On Mar 10, 2017 10:03 AM, "shyla deshpande"
wrote:
I have a spark streaming application which processes 3 kafka streams and
has 5 output operations.
Not sure what should be the setting for spark.streaming.concurrentJobs.
1. If the concurrentJobs setting is 4 does that mean 2 output operations
will be run sequentially?
2. If I had 6 cores what
If you haven't looked at the offset ranges in the logs for the time period
in question, I'd start there.
On Jan 24, 2017 2:51 PM, "Hakan İlter" wrote:
Sorry for the misunderstanding. When I said that, I meant there is no lag in
the consumer. Kafka Manager shows each consumer's
> After starting the job, everything works fine at first (like 700 req/sec)
> but after a while (a couple of days or a week) it starts processing only
> some part of the data (like 350 req/sec). When I check the kafka topics,
> I can see that there are still 700 req/sec coming to the topics. I don't
> see any errors, exceptions or any other problem. The job works fine when
> I start the same code with just a single kafka topic.
>
> Do you have any idea or a clue to understand the problem?
>
> Thanks.
Timur Shenkao <t...@timshenkao.su> wrote:

> Hi,
>
> Usual general questions are:
> -- what is your Spark version?
> -- what is your Kafka version?
> -- do you use "standard" Kafka consumer or try to implement something
> custom (your own multi-threaded consumer)?
>
> The freshest docs
> https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
>
> AFAIK, yes, you should use unique group id for each stream (KAFKA 0.10 !!!)
>
> kafkaParams.put("group.id", "use_a_separate_group_id_for_each_stream");

On Sun, Dec 11, 2016 at 5:51 PM, Anton Okolnychyi
<anton.okolnyc...@gmail.com> wrote:
Hi,
I am experimenting with Spark Streaming and Kafka. I will appreciate if
someone can say whether the following assumption is correct.
If I have multiple computations (each with its own output) on one stream
(created as KafkaUtils.createDirectStream), then there is a chance to have
Hi Tariq and Jon,
At first, thanks for the quick response. I really appreciate it.
Well, I would like to start from the very beginning of using Kafka with Spark.
For example, in the Spark distribution, I found an example using Kafka with
Spark streaming that demonstrates a Direct Kafka Word Count
Karim, Md. Rezaul <rezaul.ka...@insight-centre.org> wrote:
Hi All,
I am completely new to Kafka. I was wondering if somebody could provide me some
guidelines on how to develop real-time streaming applications using the Spark
Streaming API with Kafka.
I am aware of the Spark Streaming and Kafka integration [1]. However, a
real-life example would be a better place to start.
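In case a concrete starting point helps, a minimal direct-stream word count in Scala (broker, topic and group id are placeholders; it assumes the spark-streaming-kafka-0-10 artifact is on the classpath), roughly along the lines of the Direct Kafka Word Count example mentioned earlier in the thread:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// add .setMaster("local[2]") when running outside spark-submit
val conf = new SparkConf().setAppName("DirectKafkaWordCount")
val ssc = new StreamingContext(conf, Seconds(2))

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "wordcount-demo",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("mytopic"), kafkaParams))

// classic word count over the message values of each micro-batch
stream.map(record => record.value)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .print()

ssc.start()
ssc.awaitTermination()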
[...] to it.
I wanted to follow that example:
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html
I've added these dependencies:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.0.1</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core</artifactId>
    ...
</dependency>
[...] other executors will do additional processing. The question of which
executor works on which tasks is up to the scheduler (and
getPreferredLocations, which only matters if you're running spark on the same
nodes as kafka)

On Tue, Mar 1, 2016 at 2:36 AM, Jatin Kumar
<jku...@rocketfuelinc.com.invalid> wrote:

> Hello all,
>
> I see that there are as of today 3 ways one can read from Kafka in spark
> streaming
If by smaller block interval you mean the value in seconds passed to the
streaming context constructor, no. You'll still get everything from the
starting offset until now in the first batch.
On Thu, Feb 18, 2016 at 10:02 AM, praveen S wrote:
> Sorry.. Rephrasing :
> Can
Sorry.. Rephrasing :
Can this issue be resolved by having a smaller block interval?
Regards,
Praveen
On 18 Feb 2016 21:30, "praveen S" wrote:
> Can having a smaller block interval only resolve this?
>
> Regards,
> Praveen
> On 18 Feb 2016 21:13, "Cody Koeninger"
Can having a smaller block interval only resolve this?
Regards,
Praveen
On 18 Feb 2016 21:13, "Cody Koeninger" wrote:
> Backpressure won't help you with the first batch, you'd need
> spark.streaming.kafka.maxRatePerPartition
> for that
>
> On Thu, Feb 18, 2016 at 9:40 AM,
Backpressure won't help you with the first batch, you'd need
spark.streaming.kafka.maxRatePerPartition
for that
On Thu, Feb 18, 2016 at 9:40 AM, praveen S wrote:
> Have a look at
>
> spark.streaming.backpressure.enabled
> Property
>
> Regards,
> Praveen
> On 18 Feb 2016
Have a look at
spark.streaming.backpressure.enabled
Property
Regards,
Praveen
On 18 Feb 2016 00:13, "Abhishek Anand" wrote:
> I have a spark streaming application running in production. I am trying to
> find a solution for a particular use case when my application has
You can print whatever you want wherever you want, it's just a question of
whether it's going to show up on the driver or the various executors logs
On Wed, Feb 17, 2016 at 5:50 AM, Cyril Scetbon
wrote:
> I don't think we can print an integer value in a spark streaming
I don't think we can print an integer value in a Spark streaming process, as
opposed to a Spark job. I think I can print the content of an RDD but not debug
messages. Am I wrong?
Cyril Scetbon
> On Feb 17, 2016, at 12:51 AM, ayan guha wrote:
>
> Hi
>
> You can always
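To illustrate the point above about where the output lands, a sketch (`stream` here is any DStream of strings):

import org.apache.spark.streaming.dstream.DStream

def debugPrint(stream: DStream[String]): Unit =
  stream.foreachRDD { rdd =>
    // runs on the driver: shows up on the driver's stdout/log
    println(s"batch size = ${rdd.count()}")
    // runs on the executors: shows up in each executor's log, not on the driver
    rdd.foreach(record => println(record))
  }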
Just use a kafka rdd in a batch job or two, then start your streaming job.
On Wed, Feb 17, 2016 at 12:57 AM, Abhishek Anand
wrote:
> I have a spark streaming application running in production. I am trying to
> find a solution for a particular use case when my
Hi
You can always use RDD properties, which already include partition information.
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/performance_optimization/how_many_partitions_does_an_rdd_have.html
On Wed, Feb 17, 2016 at 2:36 PM, Cyril Scetbon
wrote:
Your understanding is the right one (having re-read the documentation). Still
wondering how I can verify that 5 partitions have been created. My job is
reading from a topic in Kafka that has 5 partitions and sends the data to E/S.
I can see that when there is one task to read from Kafka there
I have a slightly different understanding.
Direct stream generates 1 RDD per batch, however, number of partitions in
that RDD = number of partitions in kafka topic.
On Wed, Feb 17, 2016 at 12:18 PM, Cyril Scetbon
wrote:
> Hi guys,
>
> I'm making some tests with Spark and
Hi guys,
I'm making some tests with Spark and Kafka using a Python script. I use the
second method that doesn't need any receiver (Direct Approach). It should adapt
the number of RDDs to the number of partitions in the topic. I'm trying to
verify it. What's the easiest way to verify it? I
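One low-tech way to check, per batch (shown in Scala, but the PySpark RDD has getNumPartitions() for the same purpose): for the direct approach the partition count of each batch's RDD should match the number of partitions of the Kafka topic, 5 in this case.

import org.apache.spark.streaming.dstream.DStream

def showPartitionCounts[T](stream: DStream[T]): Unit =
  stream.foreachRDD(rdd => println(s"batch RDD has ${rdd.partitions.size} partitions"))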
Thanks Sebastian.
I was indeed trying out FAIR scheduling with a high value for
concurrentJobs today.
It does improve the latency seen by the non-hot partitions, even if it does
not provide complete isolation. So it might be an acceptable middle ground.
On 12 Feb 2016 12:18, "Sebastian Piu"
Have you tried using the fair scheduler and queues?
On 12 Feb 2016 4:24 a.m., "p pathiyil" wrote:
> With this setting, I can see that the next job is being executed before
> the previous one is finished. However, the processing of the 'hot'
> partition eventually hogs all the
Hi,
I am looking at a way to isolate the processing of messages from each Kafka
partition within the same driver.
Scenario: A DStream is created with the createDirectStream call by passing
in a few partitions. Let us say that the streaming context is defined to
have a time duration of 2 seconds.
With this setting, I can see that the next job is being executed before the
previous one is finished. However, the processing of the 'hot' partition
eventually hogs all the concurrent jobs. If there was a way to restrict
jobs to be one per partition, then this setting would provide the
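A sketch of the fair-scheduler side of that suggestion (file path and pool name are made up); as noted above it gives weighted sharing between concurrently running jobs rather than strict per-partition isolation:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("fair-demo")
  .set("spark.scheduler.mode", "FAIR")
  // pools (weights, minShare, scheduling mode) are defined in this XML file
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

// Later, before kicking off work that should run in a particular pool:
// sc.setLocalProperty("spark.scheduler.pool", "hot_partition_pool")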
All,
I'm new to Spark and I'm having a hard time doing a simple join of two DFs
Intent:
- I'm receiving data from Kafka via direct stream and would like to
enrich the messages with data from Cassandra. The Kafka messages
(Protobufs) are decoded into DataFrames and then joined with a
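A rough sketch of the join itself (written against the Spark 2.x API for brevity, assuming the spark-cassandra-connector is on the classpath; keyspace, table and join column names are illustrative):

import org.apache.spark.sql.{DataFrame, SparkSession}

// `kafkaBatch` stands for the DataFrame decoded from the Kafka messages of the
// current micro-batch; `users` is the Cassandra table used for enrichment.
def enrich(spark: SparkSession, kafkaBatch: DataFrame): DataFrame = {
  val users = spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "my_ks", "table" -> "users"))
    .load()
  kafkaBatch.join(users, Seq("user_id"), "left_outer")
}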
From: bernh...@chapter7.ch
Sent: Tuesday, February 9, 2016 10:05 PM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

Hi Mohammed

Thanks for hint, I should probably do that :)
As for the DF

Mohammed
Author: Big Data Analytics with Spark

-Original Message-
From: bernh...@chapter7.ch
Sent: Tuesday, February 9, 2016 10:47 PM
To: Mohammed Guller
Cc: user@spark.apache.org
Subject: Re: [Spark Streaming] Joining Kafka and Cassandra DataFrames

Hi Mohammed

I'm aware of that documentation, what
Hello Spark users,
We are setting up our first batch of Spark streaming pipelines, and I am
running into an issue which I am not sure how to resolve, but which seems like
it should be fairly trivial.
I am using the receiver-mode Kafka consumer that comes with Spark, and running
in standalone mode. I've set up
Please ignore this question, as I've figured out what my problem was.
In case anyone else runs into something similar, the problem was on the Kafka
side. I was using the console producer to generate the messages going into the
Kafka logs. This producer will send all of the messages to
It's possible you could (ab)use updateStateByKey or mapWithState for this.
But honestly it's probably a lot more straightforward to just choose a
reasonable batch size that gets you a reasonable file size for most of your
keys, then use filecrush or something similar to deal with the hdfs small
Hi,
Are there any ways to store DStreams/RDDs read from Kafka in memory to be
processed at a later time? What we need to do is read data from Kafka, process
it so that it is keyed by some attribute present in the Kafka messages, and
write out the data related to each key when we have
Would you mind posting the relevant code snippet?
Thanks
Best Regards
On Wed, Dec 23, 2015 at 7:33 PM, Vyacheslav Yanuk
wrote:
> Hi.
> I have very strange situation with direct reading from Kafka.
> For example.
> I have 1000 messages in Kafka.
> After submitting my
Hi.
I have a very strange situation with direct reading from Kafka.
For example:
I have 1000 messages in Kafka.
After submitting my application, I read this data and process it.
While I am processing the data, 10 new entries accumulate.
On the next read from Kafka I read only 3 records, not 10!!!
Colleagues,
The documentation written about createDirectStream says that
"This does not use Zookeeper to store offsets. The consumed offsets are
tracked by the stream itself. For interoperability with Kafka monitoring
tools that depend on Zookeeper, you have to update Kafka/Zookeeper yourself
from the
Read the documentation
spark.apache.org/docs/latest/streaming-kafka-integration.html
If you still have questions, read the resources linked from
https://github.com/koeninger/kafka-exactly-once
On Wed, Dec 23, 2015 at 7:24 AM, Vyacheslav Yanuk
wrote:
> Colleagues
>
might be a question more fit for a scala mailing list
> but google is failing me at the moment for hints on the interoperability of
> scala and java generics.
>
> [1] The original createDirectStream:
> https://github.com/apache/spark/blob/branch-1.5/external/kafka/src/main/scala/org/apache/
Hi,
Our processing times in Spark Streaming with the Kafka direct approach seem to
have increased considerably with the increase in site traffic. Would increasing
the number of Kafka partitions decrease the processing times? Any suggestions
on tuning to reduce the processing times would
On 04.12.2015 at 22:21, SRK <swethakasire...@gmail.com> wrote:
>
> Hi,
>
> Our processing times in Spark Streaming with kafka Direct approach seems to
> have increased considerably with increase in the Site traffic. Would
> increasing the number of kafka partitions decrease
/spark/streaming/kafka/KafkaUtils.scala#L395-L423
Thanks,
- Alan
On Tue, Dec 1, 2015 at 8:12 AM, Cody Koeninger <c...@koeninger.org> wrote:
> I actually haven't tried that, since I tend to do the offset lookups if
> necessary.
>
> It's possible that it will work, try it and l
Neat, thanks. If I specify something like -1 as the offset, will it
consume from the latest offset or do I have to instrument that manually?
- Alan
On Tue, Dec 1, 2015 at 6:43 AM, Cody Koeninger wrote:
> Yes, there is a version of createDirectStream that lets you specify
>
Yes, there is a version of createDirectStream that lets you specify
fromOffsets: Map[TopicAndPartition, Long]
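For reference, a rough sketch of that overload in the 0.8 direct API (broker, topic, partition and offset values are made up); passing only the partitions you care about in fromOffsets effectively pins the stream to them:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("from-offsets"), Seconds(5))
val kafkaParams = Map("metadata.broker.list" -> "broker:9092")

// start partition 3 of "mytopic" at offset 12345; only the partitions listed
// here are consumed, which is one way to restrict a stream to a single partition
val fromOffsets = Map(TopicAndPartition("mytopic", 3) -> 12345L)

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))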
On Mon, Nov 30, 2015 at 7:43 PM, Alan Braithwaite
wrote:
> Is there any mechanism in the kafka streaming source to specify the exact
> partition id that we want a
I actually haven't tried that, since I tend to do the offset lookups if
necessary.
It's possible that it will work, try it and let me know.
Be aware that if you're doing a count() or take() operation directly on the
rdd it'll definitely give you the wrong result if you're using -1 for one
of the
Is there any mechanism in the kafka streaming source to specify the exact
partition id that we want a streaming job to consume from?
If not, is there a workaround besides writing our own custom receiver?
Thanks,
- Alan
Hi All,
Just wanted to find out if there are any benefits to installing Kafka brokers
and Spark nodes on the same machine.
Is it possible for Spark to pull data from Kafka locally, i.e. when the broker
or partition is on the same machine?
Thanks,
Ashish