Hi all,

Spark has had a backpressure implementation since 1.5 that helps stabilize a Spark Streaming application by keeping the processing time per batch under control and below the batch interval. That implementation leaves excess records in the source (Kafka, Flume, etc.) to be picked up in subsequent batches.
However, there are use cases where it would be useful to pull the whole batch of records from the source and randomly sample it down to a dynamically computed "desired" batch size. This lets the application keep up with the latest traffic, with the trade-off that some records may be dropped. I believe such a random sampling strategy was proposed in the design doc attached to the original backpressure JIRA (SPARK-7398) but has not been natively implemented yet.

I have written a blog post about implementing this technique at the application level, using the same PID logic as Spark's PIDRateEstimator to compute the desired batch size and randomly sampling each batch down to it:

Implementing a Dynamic Sampling Strategy in a Spark Streaming Application
<http://hubs.ly/H0824FD0>

Hope some people find it useful. Comments and discussion are welcome.

Thanks,
Nikunj
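P.S. For anyone who wants the gist before reading the full post, here is a minimal, self-contained sketch of the idea. Caveats: Spark's PIDRateEstimator is private[streaming], so the PID step is re-implemented below using Spark's default gains; the socket source, port, and rate floor are placeholders I chose for illustration, and the blog post may wire things up differently.

import java.util.concurrent.atomic.AtomicLong

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

object DynamicSamplingSketch {
  val batchIntervalSecs = 5L

  // Desired ingest rate in records/sec, shared between the listener (which
  // writes it) and the sampling transform (which reads it). Both run on the
  // driver, so a shared AtomicLong is sufficient.
  val desiredRate = new AtomicLong(1000000L)  // effectively "no limit" until the first update

  // One PID step, modeled on the logic in Spark's internal PIDRateEstimator.
  // Gains 1.0 (proportional) and 0.2 (integral) are Spark's backpressure
  // defaults; the derivative term is omitted since its default gain is 0.0.
  // The 100 rec/s floor mirrors spark.streaming.backpressure.pid.minRate.
  def pidUpdate(lastRate: Double, numRecords: Long,
                processingMs: Long, schedulingMs: Long): Double = {
    val processingRate = numRecords.toDouble / processingMs * 1000
    val error = lastRate - processingRate
    val historicalError = schedulingMs * processingRate / (batchIntervalSecs * 1000)
    math.max(lastRate - 1.0 * error - 0.2 * historicalError, 100.0)
  }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("dynamic-sampling-sketch"), Seconds(batchIntervalSecs))

    // After every completed batch, recompute the desired rate from that
    // batch's processing and scheduling delays.
    ssc.addStreamingListener(new StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        for {
          processingMs <- info.processingDelay
          schedulingMs <- info.schedulingDelay
          if info.numRecords > 0 && processingMs > 0
        } desiredRate.set(
          pidUpdate(desiredRate.get.toDouble, info.numRecords,
                    processingMs, schedulingMs).toLong)
      }
    })

    // Placeholder source; in practice this would be a Kafka or Flume stream.
    val stream = ssc.socketTextStream("localhost", 9999)

    // Pull the whole batch from the source, then randomly sample it down to
    // the desired size instead of leaving excess records behind.
    val sampled = stream.transform { rdd =>
      val target = desiredRate.get * batchIntervalSecs
      val count = rdd.count()  // extra pass over the batch; fine for a sketch
      if (count > target) rdd.sample(withReplacement = false, target.toDouble / count)
      else rdd
    }

    sampled.foreachRDD(rdd => println(s"batch size after sampling: ${rdd.count()}"))
    ssc.start()
    ssc.awaitTermination()
  }
}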