Hi all,

Spark has had a backpressure implementation since 1.5 that helps stabilize a Spark Streaming application by keeping the processing time per batch under control and below the batch interval. That implementation leaves excess records in the source (Kafka, Flume, etc.) to be picked up in subsequent batches.
However, there are use cases where it would be useful to pull the whole batch of records from the source and randomly sample it down to a dynamically computed "desired" batch size. This lets the application keep up with the latest traffic, with the trade-off that some records may be dropped. I believe such a random sampling strategy was proposed in the design doc attached to the original backpressure JIRA (SPARK-7398) but has not been natively implemented yet.

I have written a blog post about implementing this technique at the application level, using the same PID logic as Spark's PIDRateEstimator to compute the desired batch size and randomly sampling each batch down to it:

Implementing a Dynamic Sampling Strategy in a Spark Streaming Application
<http://hubs.ly/H0824FD0>

Hope some people find it useful. Comments and discussion are welcome.

Thanks,
Nikunj
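P.S. For anyone who wants the gist before reading the full post, here is a minimal, self-contained sketch of the idea. Caveats: Spark's PIDRateEstimator is private[streaming], so the PID step is re-implemented below using Spark's default gains; the socket source, port, and rate floor are placeholders I chose for illustration, and the blog post may wire things up differently.

import java.util.concurrent.atomic.AtomicLong

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

object DynamicSamplingSketch {
  val batchIntervalSecs = 5L

  // Desired ingest rate in records/sec, shared between the listener (which
  // writes it) and the sampling transform (which reads it). Both run on the
  // driver, so a shared AtomicLong is sufficient.
  val desiredRate = new AtomicLong(1000000L)  // effectively "no limit" until the first update

  // One PID step, modeled on the logic in Spark's internal PIDRateEstimator.
  // Gains 1.0 (proportional) and 0.2 (integral) are Spark's backpressure
  // defaults; the derivative term is omitted since its default gain is 0.0.
  // The 100 rec/s floor mirrors spark.streaming.backpressure.pid.minRate.
  def pidUpdate(lastRate: Double, numRecords: Long,
                processingMs: Long, schedulingMs: Long): Double = {
    val processingRate = numRecords.toDouble / processingMs * 1000
    val error = lastRate - processingRate
    val historicalError = schedulingMs * processingRate / (batchIntervalSecs * 1000)
    math.max(lastRate - 1.0 * error - 0.2 * historicalError, 100.0)
  }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("dynamic-sampling-sketch"), Seconds(batchIntervalSecs))

    // After every completed batch, recompute the desired rate from that
    // batch's processing and scheduling delays.
    ssc.addStreamingListener(new StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        for {
          processingMs <- info.processingDelay
          schedulingMs <- info.schedulingDelay
          if info.numRecords > 0 && processingMs > 0
        } desiredRate.set(
          pidUpdate(desiredRate.get.toDouble, info.numRecords,
                    processingMs, schedulingMs).toLong)
      }
    })

    // Placeholder source; in practice this would be a Kafka or Flume stream.
    val stream = ssc.socketTextStream("localhost", 9999)

    // Pull the whole batch from the source, then randomly sample it down to
    // the desired size instead of leaving excess records behind.
    val sampled = stream.transform { rdd =>
      val target = desiredRate.get * batchIntervalSecs
      val count = rdd.count()  // extra pass over the batch; fine for a sketch
      if (count > target) rdd.sample(withReplacement = false, target.toDouble / count)
      else rdd
    }

    sampled.foreachRDD(rdd => println(s"batch size after sampling: ${rdd.count()}"))
    ssc.start()
    ssc.awaitTermination()
  }
}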