[PR] [SPARK-XXXX][SS] Add maxBytesPerTrigger threshold [spark]

via GitHub Mon, 08 Jan 2024 22:34:16 -0800


MaxNevermind opened a new pull request, #44636:
URL: https://github.com/apache/spark/pull/44636


   What changes were proposed in this pull request?
   This PR adds a maxBytesPerTrigger option to the [input streaming 
source](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources) 
for limiting the total size of the input files in a streaming batch. The 
semantics of maxBytesPerTrigger are very close to those of the existing 
maxFilesPerTrigger option.
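
   To illustrate how a byte-based limit differs from a file-count limit, here 
is a minimal Python sketch that does not use Spark. The function names, the 
file list, and the rule that the first file is always admitted even when it 
alone exceeds the limit are assumptions made for illustration, not the 
behaviour of the actual patch.

```python
from typing import List, Tuple

FileEntry = Tuple[str, int]  # (path, size in bytes); hypothetical representation


def take_until_max_bytes(files: List[FileEntry], max_bytes: int) -> List[FileEntry]:
    """Select a prefix of `files` whose total size stays within `max_bytes`.

    Assumption: the first file is always admitted so a batch is never empty,
    even if that single file is larger than the limit.
    """
    batch: List[FileEntry] = []
    total = 0
    for path, size in files:
        if batch and total + size > max_bytes:
            break
        batch.append((path, size))
        total += size
    return batch


def take_max_files(files: List[FileEntry], max_files: int) -> List[FileEntry]:
    """Select at most `max_files` files, regardless of their sizes."""
    return files[:max_files]


# Illustrative data only: sizes are arbitrary byte counts.
pending = [("a.json", 40), ("b.json", 70), ("c.json", 30), ("d.json", 10)]
by_bytes = take_until_max_bytes(pending, max_bytes=100)
by_count = take_max_files(pending, max_files=2)
print([p for p, _ in by_bytes], sum(s for _, s in by_bytes))
print([p for p, _ in by_count], sum(s for _, s in by_count))
```

   With the same pending files, the count-based selection admits a fixed 
number of files whatever their size, while the byte-based selection stops as 
soon as the next file would push the batch over the budget.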
   
   How was the feature implemented?
   Because maxBytesPerTrigger is semantically close to maxFilesPerTrigger, I 
treated every usage of maxFilesPerTrigger in the repository as a place that 
potentially requires changes. That includes:
   
   - Option parameter definition
   - Option-related logic
   - Option-related ScalaDoc and MD files
   - Option-related tests
   I went over all usages of maxFilesPerTrigger in FileStreamSourceSuite and 
implemented maxBytesPerTrigger in the same fashion, since the two options are 
very close in nature. From the structure and elements of ReadLimit, I concluded 
that the current design implies only one simple rule per ReadLimit, so I 
explicitly prohibited setting both maxFilesPerTrigger and maxBytesPerTrigger at 
the same time.
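
   The mutual-exclusion rule described above can be sketched as a small option 
check. This is a hypothetical Python model, not the Scala code in the patch: 
the function name, the error messages, and the requirement that the value be a 
positive integer are assumptions.

```python
from typing import Mapping, Optional, Tuple


def resolve_read_limit(options: Mapping[str, str]) -> Optional[Tuple[str, int]]:
    """Return the single configured limit as (option name, value), or None.

    Raises ValueError when both limits are set, mirroring the idea that a
    read limit carries only one simple rule.
    """
    max_files = options.get("maxFilesPerTrigger")
    max_bytes = options.get("maxBytesPerTrigger")

    if max_files is not None and max_bytes is not None:
        raise ValueError(
            "maxFilesPerTrigger and maxBytesPerTrigger cannot both be set"
        )

    for name, raw in (("maxFilesPerTrigger", max_files),
                      ("maxBytesPerTrigger", max_bytes)):
        if raw is not None:
            value = int(raw)
            # Assumption: a limit of zero or less is rejected.
            if value <= 0:
                raise ValueError(f"{name} must be a positive integer, got {raw}")
            return name, value

    # Neither option set: no limit applies.
    return None
```

   Rejecting the combination up front keeps the behaviour unambiguous: a 
caller never has to guess which of the two limits wins.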
   
   Why are the changes needed?
   This feature is useful for our team and our sister teams, and we expect it 
to find broad acceptance among Spark users. In a few of the Spark pipelines we 
support, we use the Available-now trigger for periodic processing with Spark 
Streaming. We currently use the maxFilesPerTrigger threshold, but this is not 
ideal: input file sizes can change over time, which requires periodically 
adjusting maxFilesPerTrigger. The computational complexity of the job depends 
on the event count and total size of the input, and maxBytesPerTrigger is a 
better predictor of that than maxFilesPerTrigger.
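
   A short arithmetic sketch of the argument above: with a fixed file-count 
limit, the batch volume grows with the average file size, whereas a fixed byte 
budget keeps the volume roughly constant and lets the file count vary. The 
numbers and function names below are made up for illustration.

```python
MB = 1024 * 1024


def batch_bytes_under_file_limit(max_files: int, avg_file_bytes: int) -> int:
    """Approximate batch volume when only the file count is capped."""
    return max_files * avg_file_bytes


def files_under_byte_limit(max_bytes: int, avg_file_bytes: int) -> int:
    """Approximate file count when only the batch volume is capped.

    Assumption: at least one file is always taken.
    """
    return max(1, max_bytes // avg_file_bytes)


# Illustrative scenario: average input file size grows from 10 MB to 50 MB.
for avg in (10 * MB, 50 * MB):
    volume = batch_bytes_under_file_limit(max_files=100, avg_file_bytes=avg)
    count = files_under_byte_limit(max_bytes=1000 * MB, avg_file_bytes=avg)
    print(f"avg file {avg // MB} MB: file limit -> {volume // MB} MB per batch, "
          f"byte limit -> {count} files per batch")
```

   Under the file-count limit the batch volume grows fivefold when files grow 
fivefold, which is the periodic retuning the description refers to.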
   
   Does this PR introduce any user-facing change?
   Yes
   
   How was this patch tested?
   New unit tests were added, or existing maxFilesPerTrigger tests were 
extended. I searched for maxFilesPerTrigger-related tests and added new tests 
or extended existing ones, trying to keep the changes minimal and simple.
   
   Was this patch authored or co-authored using generative AI tooling?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

