[
https://issues.apache.org/jira/browse/SPARK-35312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350505#comment-17350505
]
Apache Spark commented on SPARK-35312:
--------------------------------------
User 'satishgopalani' has created a pull request for this issue:
https://github.com/apache/spark/pull/32653
> Introduce new Option in Kafka source to specify minimum number of records to
> read per trigger
> ---------------------------------------------------------------------------------------------
>
> Key: SPARK-35312
> URL: https://issues.apache.org/jira/browse/SPARK-35312
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.1.1
> Reporter: Satish Gopalani
> Priority: Major
>
> Kafka source currently provides options to set the maximum number of offsets
> to read per trigger.
> I would like to introduce a new option to specify the minimum number of
> offsets to read per trigger, i.e. *minOffsetsPerTrigger*.
> This new option will allow skipping a trigger/batch when the number of
> records available in Kafka is low. This is very useful when there is a sudden
> burst of data at certain intervals in a day and data volume is low for the
> rest of the day. Tuning such jobs is difficult: decreasing the trigger
> processing time increases the number of batches, and hence cluster resource
> usage, and adds to small file issues, while increasing the trigger processing
> time adds consumer lag. The new option will save cluster resources and also
> help with small file issues, as it runs fewer batches.
> Along with this, I would like to introduce a '*maxTriggerDelay*' option, which
> will help avoid an indefinite delay in scheduling a trigger: the trigger will
> fire irrespective of the number of records available once maxTriggerDelay has
> elapsed since the last trigger. It would be an optional parameter with a
> default value of 15 mins. _This option will only be applicable if
> minOffsetsPerTrigger is set._
> The *minOffsetsPerTrigger* option would be optional, of course, but once
> specified it would take precedence over *maxOffsetsPerTrigger*, which will be
> honored only after *minOffsetsPerTrigger* is satisfied.
>
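The interaction of the three options described above can be sketched in a few lines of plain Python. This is only an illustration of the proposed semantics, not code from the pull request: the names TriggerPolicy and offsets_to_read are hypothetical, the sketch assumes min_offsets <= max_offsets, and it treats "maxTriggerDelay exceeded" as "at least that many seconds have passed since the last batch".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriggerPolicy:
    """Hypothetical model of the proposed options; not Spark's own classes."""
    min_offsets: int                   # proposed minOffsetsPerTrigger
    max_offsets: Optional[int] = None  # existing maxOffsetsPerTrigger (optional)
    max_delay_s: float = 15 * 60       # proposed maxTriggerDelay, default 15 mins


def offsets_to_read(policy: TriggerPolicy,
                    available: int,
                    seconds_since_last_batch: float) -> int:
    """Return how many offsets the next batch would read, or 0 to skip it."""
    # Assumption: the delay counts as exceeded once it has fully elapsed.
    delay_exceeded = seconds_since_last_batch >= policy.max_delay_s

    # Too few records and the delay has not elapsed yet: skip this trigger.
    if available < policy.min_offsets and not delay_exceeded:
        return 0

    # The minimum is satisfied (or the delay forced a trigger), so the
    # maximum is honored next, if one is configured.
    if policy.max_offsets is not None:
        return min(available, policy.max_offsets)
    return available


if __name__ == "__main__":
    policy = TriggerPolicy(min_offsets=1000, max_offsets=5000)
    # Quiet period, burst, and a forced trigger after the delay has elapsed.
    for available, elapsed in [(200, 60), (8000, 60), (200, 900)]:
        print(available, elapsed, offsets_to_read(policy, available, elapsed))
```

In this sketch a quiet period produces no batch until either enough records accumulate or the delay elapses, which is the behavior the description asks for; how the real source should behave when the minimum is larger than the maximum is left open here.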