Hi,
I changed auto.offset.reset to largest. The results were 30, 50, 40, 40, 35, 30
seconds... instead of 10 seconds. It looks like an attempt to react to
backpressure, but a very slow one. In any case it is far from any real-time task,
including soft real time. And ok, I agreed with Spark usage with data
> Yes, and I sent you the results. It is appropriate only with known parameters of
> the input data stream.
No, as far as I can tell from your posts in this thread and your
linked project, you only tested with auto.offset.reset smallest and a
large backlog. That's not what I advised you to do. Don't
>
> If you aren't processing messages as fast as you receive them, you're
> going to run out of kafka retention regardless of whether you're using
> Spark or Flink. Again, physics. It's just a question of what
> compromises you choose.
Yes. I wrote about it. But in case of Flink you will have
If you aren't processing messages as fast as you receive them, you're
going to run out of kafka retention regardless of whether you're using
Spark or Flink. Again, physics. It's just a question of what
compromises you choose.
If by "growing of a processing window time of Spark" you mean a
Ok, thanks. But really this is not a full solution. If processing time grows,
the window time grows too. So with Spark I have two points where latency can
grow. The first is the delay of processing messages in the Kafka queue due to
the physical limitations of the computer system.
And
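The "physics" point quoted above can be sketched with simple arithmetic: whenever messages arrive faster than they are processed, the backlog grows without bound, regardless of framework. A minimal illustration (both rates are assumed example values, not measurements):

```python
# Illustration only: assumed example rates, not measurements.
arrival_rate = 150   # messages/sec arriving in Kafka (assumed)
capacity = 100       # messages/sec the app can process (assumed)

backlog = 0
for second in range(60):
    backlog += arrival_rate - capacity   # backlog grows every second

print(backlog)   # unprocessed messages after one minute
```

No rate limit or window setting changes this; the only choices are to process faster, drop data, or let latency grow.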
Then double the upper limit you have set until the processing time
approaches the batch time.
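That doubling procedure can be sketched as follows. The partition count, batch interval, and per-message cost here are assumed example values, and the processing-time function is a stand-in for actually running the job and measuring a batch:

```python
# Sketch of the empirical tuning loop: double maxRatePerPartition until the
# (here: simulated) processing time approaches the batch time.
BATCH_INTERVAL_MS = 10_000   # assumed batch interval
NUM_PARTITIONS = 4           # assumed topic partition count
COST_PER_MSG_MS = 0.5        # assumed per-message processing cost

def processing_time_ms(max_rate_per_partition):
    # Stand-in for a measured run at this maxRatePerPartition.
    msgs = max_rate_per_partition * NUM_PARTITIONS * BATCH_INTERVAL_MS // 1000
    return msgs * COST_PER_MSG_MS

rate = 1
# Double until the next step would push processing time near the batch time.
while processing_time_ms(rate * 2) < 0.8 * BATCH_INTERVAL_MS:
    rate *= 2

print(rate, processing_time_ms(rate))   # → 256 5120.0
```

In practice the measurement is the "Processing time" Spark reports per batch, not a cost model.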
On Wed, Jul 6, 2016 at 11:06 AM, rss rss wrote:
> Ok, with:
>
> .set("spark.streaming.backpressure.enabled","true")
> .set("spark.streaming.receiver.maxRate", "1")
>
Ok, with:
.set("spark.streaming.backpressure.enabled","true")
.set("spark.streaming.receiver.maxRate", "1")
.set("spark.streaming.kafka.maxRatePerPartition", "1")
I have something like
***
Processing time: 5626
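For the direct stream, maxRatePerPartition caps each batch at roughly rate × partitions × batch interval in seconds; a minimal sketch (the partition count and batch interval are assumed example values):

```python
# Upper bound on records the direct stream will pull into one batch.
def max_messages_per_batch(max_rate_per_partition, partitions, batch_interval_s):
    return max_rate_per_partition * partitions * batch_interval_s

# Assumed example: 4 partitions, 10-second batches, maxRatePerPartition=1.
print(max_messages_per_batch(1, 4, 10))   # → 40
```

So with maxRatePerPartition set to 1 the batch is tiny, and a multi-second processing time is dominated by fixed overhead rather than message volume.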
The configuration you set is spark.streaming.receiver.maxRate. The
direct stream is not a receiver. As I said in my first message in
this thread, and as the pages at
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
and
Ok, thanks. I tried setting the minimum max rate as a starting point. However, in
general I don't know the initial throughput. BTW, it would be very useful to
explain this in
https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
And really with
The direct stream determines batch sizes on the driver, in advance of
processing. If you haven't specified a maximum batch size, how would
you suggest the backpressure code determine how to limit the first
batch? It has no data on throughput until at least one batch is
completed.
Again, this is
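The point about needing a completed batch can be illustrated with a simplified, proportional-only rate estimator. This is only an illustration, not Spark's code: Spark's actual backpressure uses a PID-based rate estimator with integral and derivative terms, and the numbers below are assumed example values.

```python
# Simplified illustration of estimating the next ingestion rate from the
# batch that just completed. NOT Spark's PIDRateEstimator, just the idea.
def next_rate(latest_rate, elements, processing_delay_s, kp=1.0):
    """Step the per-second rate toward the measured throughput."""
    processing_rate = elements / processing_delay_s   # measured throughput
    error = latest_rate - processing_rate             # how far off we were
    return max(1.0, latest_rate - kp * error)

# Until a first batch completes there are no (elements, delay) measurements,
# so the very first batch needs an explicit cap such as maxRatePerPartition.
print(next_rate(latest_rate=10000, elements=5000, processing_delay_s=10))  # → 500.0
```

Every input to the estimator comes from a finished batch, which is exactly why the backpressure code cannot limit the first batch on its own.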
Hello,
thanks, I tried .set("spark.streaming.backpressure.enabled","true") but the
result is negative. Therefore I have prepared a small test:
https://github.com/rssdev10/spark-kafka-streaming
How to run:
* git clone https://github.com/rssdev10/spark-kafka-streaming.git
Test by producing messages into kafka at a rate comparable to what you
expect in production.
Test with backpressure turned on, it doesn't require you to specify a
fixed limit on number of messages and will do its best to maintain
batch timing. Or you could empirically determine a reasonable
Hi, thanks.
I know about the possibility to limit the number of messages. But the problem is
that I don't know how many messages the system is able to process. It
depends on the data. The example is very simple. I need a strict response after a
specified time. Something like soft real time. In case of Flink
If you're talking about limiting the number of messages per batch to
try and keep from exceeding batch time, see
http://spark.apache.org/docs/latest/configuration.html
look for backpressure and maxRatePerPartition
But if you're only seeing zeros after your job runs for a minute, it
sounds like
Hello,
I'm trying to organize processing of messages from Kafka. And there is a
typical case when the number of messages in Kafka's queue is more than the
Spark app is able to process. But I need a strict time limit to
prepare a result for at least a part of the data.
Code example: