Cody, sorry I didn’t get back to you sooner; I didn’t see the response come through. I was looking at the Spark UI. I’ll see if I can recreate the issue with version 1.5.2. Thanks.
From: Cody Koeninger <c...@koeninger.org>
Date: Friday, October 16, 2015 at 12:48
To: robert towne <robert.to...@webtrends.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: Problems w/YARN Spark Streaming app reading from Kafka

What do you mean by "the current documentation states it isn’t used"? http://spark.apache.org/docs/latest/configuration.html still lists the value and its meaning.

As far as the issue you're seeing, are you measuring records by looking at logs, the Spark UI, or actual downstream sinks of data? I don't think the backpressure code does any logging, but KafkaRDD will log at info level the offsets for each topic-partition that is computed (the message starts with "Computing topic").

On Fri, Oct 16, 2015 at 1:52 PM, Robert Towne <robert.to...@webtrends.com> wrote:

I have a Spark Streaming app that reads from Kafka using a receiver-less direct connection (KafkaUtils.createDirectStream) with a batch interval of 1 minute.

For about 15 hours it was running fine, with batch input sizes ranging from 16,836 to 3,861,758 events. Then, about 3 hours ago, every one-minute batch began bringing in exactly the same number of records: 5,760 (2 topics; topic 1 has 64 partitions, topic 2 has 32). I know there is more data than the 5,760 records being piped in, and eventually we’ll fall so far behind that our Kafka offsets will no longer be available.

It seems odd that 5,760 / 96 (partitions) = 60, which is my batch interval in seconds.

I do have spark.streaming.backpressure.enabled = true, and even though the current documentation states it isn’t used, I have a value set for spark.streaming.kafka.maxRatePerPartition.

Has anyone else seen this issue where the rate seems capped even though it should be pulling more data?

Thanks,
Robert
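The arithmetic behind Robert's observation can be sketched as follows. This is just a back-of-the-envelope check (not Spark code): if each batch delivers a fixed 5,760 records across 96 partitions on a 60-second interval, the implied throughput works out to exactly 1 record per second per partition, which is what makes the cap look suspicious.

```python
# Reproduce the arithmetic from the report: every 1-minute batch pulls
# a fixed 5,760 records across 96 Kafka partitions (64 + 32).
batch_interval_s = 60
partitions = 64 + 32            # topic 1 (64) + topic 2 (32)
records_per_batch = 5760

records_per_partition = records_per_batch / partitions
rate_per_partition = records_per_partition / batch_interval_s

print(records_per_partition)    # 60.0 -- equal to the batch interval in seconds
print(rate_per_partition)       # 1.0  -- an effective cap of 1 record/sec/partition
```

That 1 record/sec/partition figure suggests the backpressure rate estimator has driven the ingest rate down to some floor value, rather than spark.streaming.kafka.maxRatePerPartition being the limit.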