[
https://issues.apache.org/jira/browse/SPARK-44924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772845#comment-17772845
]
Mike K commented on SPARK-44924:
--------------------------------
User 'ragnarok56' has created a pull request for this issue:
https://github.com/apache/spark/pull/42623
> Add configurations for FileStreamSource cached files
> ----------------------------------------------------
>
> Key: SPARK-44924
> URL: https://issues.apache.org/jira/browse/SPARK-44924
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.1.0
> Reporter: kevin nacios
> Priority: Minor
>
> With https://issues.apache.org/jira/browse/SPARK-30866, caching of listed
> files was added to Structured Streaming to reduce the cost of relisting
> files from the filesystem on each batch. The settings that drive this
> behavior are currently hardcoded, and there is no way to change them.
>
> This impacts some of our workloads that process large datasets where it's
> unknown how "heavy" individual files are, so a single batch can take a long
> time. When we set maxFilesPerTrigger to 100k files, a subsequent batch that
> reads only the cached maximum of 10k files causes the job to take longer,
> since the cluster is capable of handling 100k files but is stuck doing 10%
> of the workload. The benefit of the caching doesn't outweigh the
> performance cost on the rest of the job.
>
> With config settings available for this, we could either absorb some
> increased driver memory usage for caching the next 100k files, or opt to
> disable caching entirely and just relist files each batch by setting the
> cache amount to 0.
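The interaction described above can be sketched as a toy model. This is not
Spark code and the function name is hypothetical; it only illustrates the
reported behavior, in which a batch served from the file-listing cache is
capped by the hardcoded cache size (10k) rather than by maxFilesPerTrigger:

```python
def next_batch_size(pending_files, max_files_per_trigger, cached_files_cap):
    """Toy model: files processed in the next micro-batch.

    If the previous listing left a non-empty cache, the next batch reads
    only from the cache (at most cached_files_cap files) instead of
    relisting the source.
    """
    cached = min(pending_files, cached_files_cap)
    if cached > 0:
        return min(cached, max_files_per_trigger)
    # Empty (or disabled) cache: relist the source and take up to
    # maxFilesPerTrigger files, as before SPARK-30866.
    return min(pending_files, max_files_per_trigger)

# With the hardcoded 10k cache, a cluster sized for 100k files per batch
# processes only 10k in the cached batch:
print(next_batch_size(500_000, 100_000, 10_000))   # 10000
# A configurable (larger) cache would restore full batches:
print(next_batch_size(500_000, 100_000, 100_000))  # 100000
# Setting the cache amount to 0 would fall back to relisting each batch:
print(next_batch_size(500_000, 100_000, 0))        # 100000
```

Under this model, the proposed configuration simply trades driver memory
(a larger cache) against either throttled batches or per-batch relisting cost.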
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]