Re: Spark streaming. Strict discretizing by time

2016-07-07 Thread rss rss
Hi, I changed auto.offset.reset to largest. The result is 30, 50, 40, 40, 35, 30 seconds... instead of 10 seconds. It looks like an attempt to react to backpressure, but a very slow one. In any case it is far from any real-time task, including soft real time. And ok, I agree with using Spark with data
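A minimal sketch of the change described above, assuming the Spark 1.6-era direct-stream Kafka parameters (the broker address is a placeholder): "largest" starts consumption from the newest offsets, so the job no longer begins with the whole retained backlog.

    // Direct-stream Kafka params; "largest" skips the existing backlog
    val kafkaParams = Map(
      "metadata.broker.list" -> "localhost:9092",
      "auto.offset.reset"    -> "largest"
    )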

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
> Yes, and I sent you results. It is appropriate only with known parameters of
> the input data stream.

No, as far as I can tell from your posts in this thread and your linked project, you only tested with auto.offset.reset smallest and a large backlog. That's not what I advised you to do. Don't

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread rss rss
> If you aren't processing messages as fast as you receive them, you're
> going to run out of kafka retention regardless of whether you're using
> Spark or Flink. Again, physics. It's just a question of what
> compromises you choose.

Yes, I wrote about that. But in the case of Flink you will have

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
If you aren't processing messages as fast as you receive them, you're going to run out of kafka retention regardless of whether you're using Spark or Flink. Again, physics. It's just a question of what compromises you choose. If by "growing of a processing window time of Spark" you mean a

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread rss rss
Ok, thanks. But really this is not a complete solution. If the processing time grows, the window time will grow too. That is, with Spark I really have two points of latency growth. The first is the delay in processing messages from the Kafka queue due to the physical limitations of the computer system. And

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
Then double the upper limit you have set until the processing time approaches the batch time.

On Wed, Jul 6, 2016 at 11:06 AM, rss rss wrote:
> Ok, with:
>
> .set("spark.streaming.backpressure.enabled","true")
> .set("spark.streaming.receiver.maxRate", "1")
>
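A sketch of the empirical procedure being suggested; measuredProcessingTimeMs is a hypothetical helper (e.g. read from the streaming UI or a listener after a test run), not a Spark API.

    // Keep doubling the per-partition cap until the measured processing
    // time approaches (here, 90% of) the batch interval.
    var cap = 1
    val batchIntervalMs = 10000
    while (measuredProcessingTimeMs(cap) < 0.9 * batchIntervalMs) cap *= 2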

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread rss rss
Ok, with:

    .set("spark.streaming.backpressure.enabled","true")
    .set("spark.streaming.receiver.maxRate", "1")
    .set("spark.streaming.kafka.maxRatePerPartition", "1")

I have something like

*** Processing time: 5626
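A back-of-envelope check of what those settings imply, assuming maxRatePerPartition means records per second per partition for the direct stream, and a hypothetical partition count:

    val maxRatePerPartition = 1   // from the conf above, records/sec/partition
    val partitions = 4            // hypothetical topic partition count
    val batchSeconds = 10         // batch interval discussed in the thread
    // upper bound on records per batch: 1 * 4 * 10 = 40
    val maxRecordsPerBatch = maxRatePerPartition * partitions * batchSeconds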

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
The configuration you set is spark.streaming.receiver.maxRate. The direct stream is not a receiver. As I said in my first message in this thread, and as the pages at http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers and
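A sketch of the distinction being made, using the Spark 1.6-era APIs (topic name, group id, ZooKeeper quorum, and broker address are placeholders): only the receiver-based stream honors spark.streaming.receiver.maxRate; the direct stream is capped by spark.streaming.kafka.maxRatePerPartition.

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Receiver-based stream: spark.streaming.receiver.maxRate applies here
    val receiverStream = KafkaUtils.createStream(ssc, "zk1:2181", "groupId", Map("events" -> 1))

    // Direct stream: no receiver; capped by spark.streaming.kafka.maxRatePerPartition
    val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "localhost:9092"), Set("events"))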

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread rss rss
Ok, thanks. I tried setting the minimum max rate to begin with. However, in general I don't know the initial throughput. BTW, it would be very useful to explain this in https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning And really with

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread Cody Koeninger
The direct stream determines batch sizes on the driver, in advance of processing. If you haven't specified a maximum batch size, how would you suggest the backpressure code determine how to limit the first batch? It has no data on throughput until at least one batch is completed. Again, this is

Re: Spark streaming. Strict discretizing by time

2016-07-06 Thread rss rss
Hello, thanks. I tried .set("spark.streaming.backpressure.enabled","true") but the result is negative. Therefore I have prepared a small test: https://github.com/rssdev10/spark-kafka-streaming

How to run:
* git clone https://github.com/rssdev10/spark-kafka-streaming.git

Re: Spark streaming. Strict discretizing by time

2016-07-05 Thread Cody Koeninger
Test by producing messages into kafka at a rate comparable to what you expect in production. Test with backpressure turned on; it doesn't require you to specify a fixed limit on the number of messages and will do its best to maintain batch timing. Or you could empirically determine a reasonable

Re: Spark streaming. Strict discretizing by time

2016-07-05 Thread rss rss
Hi, thanks. I know about the possibility of limiting the number of messages. But the problem is that I don't know the number of messages the system is able to process. It depends on the data. The example is very simple: I need a strict response after a specified time, something like soft real time. In the case of Flink

Re: Spark streaming. Strict discretizing by time

2016-07-05 Thread Cody Koeninger
If you're talking about limiting the number of messages per batch to try and keep from exceeding the batch time, see http://spark.apache.org/docs/latest/configuration.html and look for backpressure and maxRatePerPartition. But if you're only seeing zeros after your job runs for a minute, it sounds like
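A minimal sketch of the two settings being pointed to, with a placeholder value to be tuned empirically:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")      // adapt rate from batch history
      .set("spark.streaming.kafka.maxRatePerPartition", "1000") // hard cap, records/sec/partition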

Spark streaming. Strict discretizing by time

2016-07-05 Thread rss rss
Hello, I'm trying to organize processing of messages from Kafka, and there is a typical case where the number of messages in Kafka's queue is more than the Spark app is able to process. But I need a strict time limit to prepare a result for at least part of the data. Code example:
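The original code example is truncated here; what follows is a minimal sketch of the kind of job described, assuming the Spark 1.6-era direct Kafka stream API (broker address, topic name, and batch interval are placeholders), not the author's actual code.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object StrictBatchDemo {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("strict-discretizing-by-time")
        val ssc = new StreamingContext(conf, Seconds(10)) // the 10-second budget from the thread

        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, Map("metadata.broker.list" -> "localhost:9092"), Set("events"))

        // Count records per batch; with a backlog larger than the app's throughput,
        // processing time exceeds the batch interval unless the rate is capped.
        stream.map(_._2).count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }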