[
https://issues.apache.org/jira/browse/SPARK-24707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-24707:
----------------------------------
Affects Version/s: (was: 3.0.0)
3.1.0
> Enable spark-kafka-streaming to maintain min buffer using async thread to
> avoid blocking kafka poll
> ---------------------------------------------------------------------------------------------------
>
> Key: SPARK-24707
> URL: https://issues.apache.org/jira/browse/SPARK-24707
> Project: Spark
> Issue Type: Improvement
> Components: DStreams
> Affects Versions: 3.1.0
> Reporter: Sidhavratha Kumar
> Priority: Major
> Attachments: 40_partition_topic_without_buffer.pdf
>
>
> Currently the Spark Kafka RDD blocks on the Kafka consumer poll. Especially in a
> Spark-Kafka-streaming job, this poll duration adds to the batch processing time,
> which results in:
> * Increased batch processing time (on top of the time taken to process the
> records)
> * Unpredictable batch processing time, depending on how long each poll takes.
> If we poll Kafka in a background thread and maintain a buffer for each
> partition, the poll time will not be added to the batch processing time, and
> processing time becomes more predictable (driven by the time taken to process
> each record rather than the extra time taken to poll records from the source).
> For example, we are facing issues where a Kafka poll sometimes takes ~30
> seconds and sometimes returns within a second. With backpressure enabled this
> slows our job down considerably. It also makes it difficult to scale our
> processing or to estimate resource requirements for future growth in record
> volume.
> Even for jobs that do not see varying Kafka poll times, it would still provide
> a performance improvement if a buffer were already maintained for each
> partition, so that each batch can concentrate purely on processing records.
> Example:
> Let's consider:
> * each Kafka poll takes 2 sec on average
> * batch duration is 10 sec
> * processing 100 records takes 10 sec
> * each Kafka poll returns 300 records
> ## Spark Job starts
> ## Batch-1 (100 records) (buffer = 0) (processing time = 10 sec + 2 sec) =>
> 12 sec processing time
> ## Batch-2 (100 records) (buffer = 200) (processing time = 10 sec) => 10 sec
> processing time
> ## Batch-3 (100 records) (buffer = 100) (processing time = 10 sec) => 10 sec
> processing time
> ## Batch-4 (100 records) (buffer = 0) (processing time = 10 sec + 2 sec) =>
> 12 sec processing time
> If we poll asynchronously and always maintain 500 records for each partition,
> only Batch-1 will take 12 sec. After that, every batch will complete in 10 sec
> (unless a rebalance/failure happens, in which case the buffer is cleared and
> the next batch takes 12 sec again).
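> As a hypothetical illustration using the numbers above (buffer target of 500,
> batches of 100 records), a batch would only drain the in-memory buffer and
> never block on poll(); props, the "events" topic, and process(...) below are
> assumed placeholders, not real Spark or Kafka identifiers:
> {code:scala}
> // Hypothetical batch-side usage of the AsyncPartitionBuffer sketch above.
> val buffer = new AsyncPartitionBuffer(props, Seq("events"), minBuffer = 500)
> val tp = new org.apache.kafka.common.TopicPartition("events", 0)
>
> // Each batch takes its 100 records from the in-memory buffer, so batch time
> // is dominated by processing, not by waiting on the Kafka poll.
> val batch = buffer.drain(tp, max = 100)
> batch.foreach(rec => process(rec.value()))
> {code}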
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]