ragnarok56 opened a new pull request, #45362: URL: https://github.com/apache/spark/pull/45362
### What changes were proposed in this pull request?

This change adds configuration options for the streaming input File Source: `maxCachedFiles` and `discardCachedFilesRatio`. These values were originally introduced with https://github.com/apache/spark/pull/27620 but were hardcoded to 10,000 and 0.2, respectively.

### Why are the changes needed?

Under certain workloads with large `maxFilesPerTrigger` settings, the 10,000-file cap on cached input files can leave a cluster underutilized and make jobs take longer to finish when each batch is slow. For example, a job with `maxFilesPerTrigger` set to 100,000 would process 100,000 files in batch 1, then only 10,000 in batch 2, yet both batches could take just as long because some of the files cause skewed processing times. The cluster then spends nearly the same amount of time on batch 2 while processing only 1/10 of the files it could have.

### Does this PR introduce _any_ user-facing change?

Updated the documentation for structured streaming sources to describe the new configuration options.

### How was this patch tested?

New and existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No
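To make the batch-size pattern from "Why are the changes needed?" concrete, here is a minimal sketch in plain Python. It is a simplified model of the behavior described above, not Spark's actual implementation: the function `simulate_batches` and its parameter names are hypothetical, the 500,000-file backlog is an assumed figure, and the discard ratio is left out entirely. It assumes that a batch served from the cache does not trigger a fresh listing, and that only up to `max_cached_files` leftover files are kept after each listing.

```python
def simulate_batches(backlog, max_files_per_trigger, max_cached_files, num_batches):
    """Return the number of files processed in each micro-batch (simplified model)."""
    cached = 0  # files listed earlier and held over for the next batch
    sizes = []
    for _ in range(num_batches):
        if cached > 0:
            # Assumption: a non-empty cache is used instead of re-listing the source.
            available = cached
        else:
            # Cache is empty, so the whole remaining backlog is listed.
            available = backlog
        batch = min(available, max_files_per_trigger)
        leftover = available - batch
        # Only up to max_cached_files of the unprocessed files are kept.
        cached = min(leftover, max_cached_files)
        backlog -= batch
        sizes.append(batch)
    return sizes


# Hardcoded cap of 10,000: every second batch is starved.
print(simulate_batches(500_000, 100_000, 10_000, 4))
# -> [100000, 10000, 100000, 10000]

# Cap raised to match maxFilesPerTrigger: every batch is full.
print(simulate_batches(500_000, 100_000, 100_000, 4))
# -> [100000, 100000, 100000, 100000]
```

Under these assumptions, raising the cache cap to match `maxFilesPerTrigger` removes the alternating small batches, which is the situation the new options are meant to let users tune.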
