[ 
https://issues.apache.org/jira/browse/SPARK-35312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Gopalani updated SPARK-35312:
------------------------------------
    Description: 
Kafka source currently provides options to set the maximum number of offsets to 
read per trigger.

I will like to introduce a new option to specify the minimum number of offsets 
to read per trigger i.e. *minOffsetsPerTrigger*.

This new option will allow skipping trigger/batch when the number of records 
available in Kafka is low. This is a very useful feature in cases where we have 
a sudden burst of data at certain intervals in a day and data volume is low for 
the rest of the day. Tunning such jobs is difficult as decreasing trigger 
processing time increasing the number of batches and hence cluster resource 
usage and adds to small file issues. Increasing trigger processing time adds 
consumer lag. This will save cluster resources and also help solve small file 
issues as it is running lesser batches.
 Along with this, I would like to introduce '*maxTriggerDelay*' option which 
will help to avoid cases of infinite delay in scheduling trigger and the 
trigger will happen irrespective of records available if the maxTriggerDelay 
time exceeds the last trigger. It would be an optional parameter with a default 
value of 15 mins. _This option will be only applicable if minOffsetsPerTrigger 
is set._

*minOffsetsPerTrigger* option would be optional of course, but once specified 
it would take precedence over *maxOffestsPerTrigger* which will be honored only 
after *minOffsetsPerTrigger* is satisfied.

 

  was:
Kafka source currently provides options to set the maximum number of offsets to 
read per trigger.

I will like to introduce a new option to specify the minimum number of offsets 
to read per trigger i.e. *minOffsetsPerTrigger*.

This new option will allow skipping trigger/batch when the number of records 
available in Kafka is low. This is a very useful feature in cases where we have 
a sudden burst of data at certain intervals in a day and data volume is low for 
the rest of the day. Tunning such jobs is difficult as decreasing trigger 
processing time increasing the number of batches and hence cluster resource 
usage and adds to small file issues. Increasing trigger processing time adds 
consumer lag. 

This will save cluster resources and also help solve small file issues as it is 
running lesser batches.
Along with this, I would like to introduce '*maxTriggerDelay*' option which 
will help to avoid infinite delay in scheduling trigger and the trigger will 
happen irrespective of records available if the maxTriggerDelay time exceeds 
the last trigger. It would be an optional parameter with a default value of 15 
mins. _This option will be only applicable if minOffsetsPerTrigger is set._

*minOffsetsPerTrigger* option would be optional of course, but once specified 
it would take precedence over *maxOffestsPerTrigger* which will be honored only 
after *minOffsetsPerTrigger* is satisfied.

 


> Introduce new Option in Kafka source to specify minimum number of records to 
> read per trigger
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-35312
>                 URL: https://issues.apache.org/jira/browse/SPARK-35312
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.1.1
>            Reporter: Satish Gopalani
>            Priority: Major
>
> Kafka source currently provides options to set the maximum number of offsets 
> to read per trigger.
> I will like to introduce a new option to specify the minimum number of 
> offsets to read per trigger i.e. *minOffsetsPerTrigger*.
> This new option will allow skipping trigger/batch when the number of records 
> available in Kafka is low. This is a very useful feature in cases where we 
> have a sudden burst of data at certain intervals in a day and data volume is 
> low for the rest of the day. Tunning such jobs is difficult as decreasing 
> trigger processing time increasing the number of batches and hence cluster 
> resource usage and adds to small file issues. Increasing trigger processing 
> time adds consumer lag. This will save cluster resources and also help solve 
> small file issues as it is running lesser batches.
>  Along with this, I would like to introduce '*maxTriggerDelay*' option which 
> will help to avoid cases of infinite delay in scheduling trigger and the 
> trigger will happen irrespective of records available if the maxTriggerDelay 
> time exceeds the last trigger. It would be an optional parameter with a 
> default value of 15 mins. _This option will be only applicable if 
> minOffsetsPerTrigger is set._
> *minOffsetsPerTrigger* option would be optional of course, but once specified 
> it would take precedence over *maxOffestsPerTrigger* which will be honored 
> only after *minOffsetsPerTrigger* is satisfied.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to