Satish Gopalani created SPARK-35312:
---------------------------------------
Summary: Introduce new Option in Kafka source to specify minimum
number of records to read per trigger
Key: SPARK-35312
URL: https://issues.apache.org/jira/browse/SPARK-35312
Project: Spark
Issue Type: Improvement
Components: Structured Streaming
Affects Versions: 3.1.1
Reporter: Satish Gopalani
The Kafka source currently provides an option, *maxOffsetsPerTrigger*, to set the maximum
number of offsets to read per trigger.
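For reference, a minimal sketch of how that existing cap is set on the Kafka source today (assumes a SparkSession named {{spark}}; broker address and topic name are placeholders):

{code:scala}
// Existing behaviour: cap how many offsets a single micro-batch may consume.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder broker
  .option("subscribe", "events")                      // placeholder topic
  .option("maxOffsetsPerTrigger", "100000")           // upper bound per trigger
  .load()
{code}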
I would like to introduce a new option, *minOffsetsPerTrigger*, to specify the minimum
number of offsets to read per trigger.
This new option will allow skipping a trigger/batch when the number of records
available in Kafka is low. This is very useful in cases where there is a sudden
burst of data at certain intervals in a day and the data volume is low for the
rest of the day. Tuning such jobs is difficult: decreasing the trigger
processing time increases the number of batches, and hence cluster resource
usage, and adds to small-file issues, while increasing the trigger processing
time adds consumer lag.
Skipping low-volume triggers will save cluster resources and also help with
small-file issues, as fewer batches are run.
Along with this, I would like to introduce a '*maxTriggerDelay*' option, which
helps avoid an indefinite delay in scheduling a trigger: a trigger will fire
irrespective of the number of records available once the time since the last
trigger exceeds maxTriggerDelay. It would be an optional parameter with a default
value of 15 mins. _This option will only be applicable if minOffsetsPerTrigger is set._
The *minOffsetsPerTrigger* option would of course be optional, but once specified
it would take precedence over *maxOffsetsPerTrigger*, which would be honored only
after *minOffsetsPerTrigger* is satisfied.
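For illustration, a sketch of how the proposed options might be combined on the same source. The option names *minOffsetsPerTrigger* and *maxTriggerDelay* are the ones proposed in this ticket and do not exist yet; broker address, topic, and values are placeholders:

{code:scala}
// Proposed behaviour (not yet in the Kafka source):
//  - skip the trigger while fewer than minOffsetsPerTrigger new offsets are available,
//  - unless more than maxTriggerDelay has passed since the last trigger,
//  - and cap each batch at maxOffsetsPerTrigger once the minimum is met.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder broker
  .option("subscribe", "events")                      // placeholder topic
  .option("minOffsetsPerTrigger", "50000")            // proposed: minimum offsets before a batch runs
  .option("maxTriggerDelay", "15m")                   // proposed: fire anyway after this long (default 15 mins)
  .option("maxOffsetsPerTrigger", "200000")           // honored only after the minimum is satisfied
  .load()
{code}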