satishgopalani opened a new pull request #32434:
URL: https://github.com/apache/spark/pull/32434


   ### What changes were proposed in this pull request?
   This patch introduces two new options: minOffsetsPerTrigger, which specifies 
the minimum number of offsets to read per trigger, and maxTriggerDelay, which 
bounds how long a trigger can be delayed so that it never waits indefinitely.
   
   minOffsetsPerTrigger allows a trigger/batch to be skipped when the number of 
records available in Kafka is low. This is very useful when there is a sudden 
burst of data at certain intervals in a day and data volume is low for the rest 
of the day. 
   The 'maxTriggerDelay' option avoids an infinite delay in scheduling a 
trigger: the trigger will fire irrespective of the number of records available 
once the time since the last trigger exceeds maxTriggerDelay. It is an optional 
parameter with a default value of 15 mins, and it is only applicable if 
minOffsetsPerTrigger is set.
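
   As an illustration, the two options might be passed to the Kafka source like 
this. This is only a sketch: the broker address, topic name, threshold value and 
the `build_stream` helper are hypothetical, and the "15m" duration format for 
maxTriggerDelay is an assumption (the description above only states a default of 
15 mins).

```python
# Hypothetical broker address and topic name, used for illustration only.
kafka_options = {
    "kafka.bootstrap.servers": "broker1:9092",
    "subscribe": "events",
    # Skip a trigger while fewer than this many new offsets are available.
    "minOffsetsPerTrigger": "100000",
    # Assumed duration format; only meaningful when minOffsetsPerTrigger is set.
    "maxTriggerDelay": "15m",
}


def build_stream(spark):
    """Create a streaming DataFrame from Kafka using the options above.

    `spark` is expected to be an existing SparkSession.
    """
    return spark.readStream.format("kafka").options(**kafka_options).load()
```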
   
   The minOffsetsPerTrigger option is optional, but once specified it takes 
precedence over maxOffsetsPerTrigger, which is honored only after 
minOffsetsPerTrigger is satisfied.
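
   The interaction between the three options can be sketched as a small decision 
function. This is not the code in this patch, only a simplified model of the 
behavior described above; `TriggerOptions` and `offsets_to_read` are hypothetical 
names, and using a strict "greater than" comparison for the delay is an 
assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriggerOptions:
    """Hypothetical container for the options discussed in this description."""
    min_offsets_per_trigger: Optional[int] = None
    max_offsets_per_trigger: Optional[int] = None
    max_trigger_delay_s: float = 15 * 60  # default of 15 mins


def offsets_to_read(available: int,
                    seconds_since_last_trigger: float,
                    opts: TriggerOptions) -> int:
    """Return how many offsets a batch would read; 0 means the trigger is skipped."""
    if opts.min_offsets_per_trigger is not None:
        # Assumption: the delay counts as exceeded only when strictly greater.
        delay_exceeded = seconds_since_last_trigger > opts.max_trigger_delay_s
        if available < opts.min_offsets_per_trigger and not delay_exceeded:
            return 0  # too little data and still within the allowed delay
    # The minimum is satisfied (or the delay forced a trigger); apply the cap.
    if opts.max_offsets_per_trigger is not None:
        return min(available, opts.max_offsets_per_trigger)
    return available
```

   With a minimum of 1000 and a maximum of 5000 offsets, this model skips a 
trigger when only 200 offsets are available after one minute, reads those 200 
offsets once more than 15 minutes have passed, and caps a backlog of 8000 
offsets at 5000.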
   
   ### Why are the changes needed?
   There are many scenarios where there is a sudden burst of data at certain 
intervals in a day and data volume is low for the rest of the day. Tuning such 
jobs is difficult: decreasing the trigger processing time increases the number 
of batches, and hence cluster resource usage, and adds to the small-file 
problem, while increasing the trigger processing time adds consumer lag. This 
patch tries to address this issue.
   
   ### How was this patch tested?
   This patch was tested manually on a cluster where the job ran for one full 
day, with a data burst happening once a day.
   The following picture shows the data burst and the resulting consumer lag:
   <img width="1198" alt="Screenshot 2021-04-29 at 11 39 35 PM" 
src="https://user-images.githubusercontent.com/1044003/116997587-9b2ab180-acfa-11eb-91fd-524802ce3316.png">
   
   This is how the job behaved during the burst, running every 4.5 mins (which 
is the specified trigger interval): 
   <img width="1154" alt="Burst Time" 
src="https://user-images.githubusercontent.com/1044003/116997919-12f8dc00-acfb-11eb-9b0a-98387fc67560.png">
   
   This is the job's behavior outside the burst, where it skips 2 to 3 triggers 
and runs once every 9 to 13.5 mins:
   <img width="1154" alt="Non Burst Time" 
src="https://user-images.githubusercontent.com/1044003/116998244-8b5f9d00-acfb-11eb-8340-33d47149ef81.png">
   

