MaxNevermind opened a new pull request, #44636: URL: https://github.com/apache/spark/pull/44636
What changes were proposed in this pull request?

This PR adds a maxBytesPerTrigger option for [input streaming sources](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources) that limits the total size of input files in a streaming batch. Its semantics are very close to those of the existing maxFilesPerTrigger option.

How was the feature implemented?

Because maxBytesPerTrigger is semantically close to maxFilesPerTrigger, I treated every usage of maxFilesPerTrigger in the repository as a place that potentially requires changes. That includes:

- Option parameter definition
- Option-related logic
- Option-related ScalaDoc and MD files
- Option-related tests

I went over all usages of maxFilesPerTrigger in FileStreamSourceSuite and implemented maxBytesPerTrigger in the same fashion, since the two options are very similar in nature. From the structure and elements of ReadLimit, I concluded that the current design implies only one simple rule per ReadLimit, so I explicitly prohibited setting both maxFilesPerTrigger and maxBytesPerTrigger at the same time.

Why are the changes needed?

This feature is useful for our team and our sister teams, and we expect it to find broad acceptance among Spark users. In a few of the Spark pipelines we support, we use the Available-now trigger for periodic processing with Spark Streaming. We currently use the maxFilesPerTrigger threshold, but this is not ideal: input file sizes can change over time, which requires periodically readjusting maxFilesPerTrigger. The computational cost of the job depends on the event count and total size of the input, and maxBytesPerTrigger is a better predictor of that than maxFilesPerTrigger.

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

New unit tests were added, or existing maxFilesPerTrigger tests were extended.
I searched for maxFilesPerTrigger-related tests and added new tests or extended existing ones, trying to keep the changes minimal and simple.

Was this patch authored or co-authored using generative AI tooling?

No.
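
To make the semantics described above easier to picture, here is a minimal, self-contained Python sketch of a byte-limited batch selection with the mutual-exclusion rule. It is not the code from this PR: the names `select_batch` and `validate_limits`, the `(path, size)` file representation, and the error message are all hypothetical. The sketch also assumes that at least one file is always admitted, so that a single file larger than the limit cannot stall the stream; the description above does not state this.

```python
from typing import List, Optional, Tuple

# Hypothetical representation of a pending input file: (path, size in bytes).
FileEntry = Tuple[str, int]


def validate_limits(max_files: Optional[int], max_bytes: Optional[int]) -> None:
    """Reject configurations that set both limits at the same time."""
    if max_files is not None and max_bytes is not None:
        raise ValueError(
            "maxFilesPerTrigger and maxBytesPerTrigger cannot both be set"
        )


def select_batch(
    pending: List[FileEntry],
    max_files: Optional[int] = None,
    max_bytes: Optional[int] = None,
) -> List[FileEntry]:
    """Pick the files for the next batch from an ordered list of pending files."""
    validate_limits(max_files, max_bytes)

    if max_files is not None:
        # File-count limit: take the first max_files entries.
        return pending[:max_files]

    if max_bytes is None:
        # No limit configured: take everything that is pending.
        return list(pending)

    batch: List[FileEntry] = []
    total = 0
    for path, size in pending:
        # Assumption: the first file is always admitted, even if it alone
        # exceeds max_bytes, so the stream can make progress.
        if batch and total + size > max_bytes:
            break
        batch.append((path, size))
        total += size
    return batch
```

With pending files of 40, 50, and 30 bytes, `select_batch(pending, max_bytes=100)` under these assumptions admits the first two files (90 bytes) and defers the third. A file-count limit of 2 would admit the same two files here, but would keep admitting two files even if their sizes later grew tenfold, which is the drift the motivation section describes.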
