kevin nacios created SPARK-44924:
------------------------------------
Summary: Add configurations for FileStreamSource cached files
Key: SPARK-44924
URL: https://issues.apache.org/jira/browse/SPARK-44924
Project: Spark
Issue Type: Improvement
Components: Structured Streaming
Affects Versions: 3.1.0
Reporter: kevin nacios
SPARK-30866 (https://issues.apache.org/jira/browse/SPARK-30866) added caching
of listed files to Structured Streaming, to reduce the cost of relisting files
from the filesystem on each batch. The settings that drive this caching are
currently hardcoded, and there is no way to change them.
This hurts some of our workloads, which process large datasets where it's
unknown how "heavy" individual files are, so a single batch can take a long
time. When we set maxFilesPerTrigger to 100k files, the subsequent batch uses
only the cached maximum of 10k files. That makes the job take longer overall:
the cluster can handle 100k files per batch but is stuck doing 10% of that
workload. For us, the benefit of the caching doesn't outweigh its cost to the
performance of the rest of the job.
With config settings available for this, we could either absorb some increased
driver memory usage to cache the next 100k files, or disable caching entirely
and relist files on each batch by setting the cache size to 0.
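To illustrate the batch-size effect described above, here is a rough Python
model of the behavior. It is a sketch, not Spark code: the function and the
max_cached_files parameter are hypothetical names, and the 0.2 discard ratio
and the "a non-empty cache is used instead of relisting" rule are my
assumptions about how the current FileStreamSource behaves.

```python
from typing import List


def simulate_batches(
    total_files: int,
    max_files_per_trigger: int,
    max_cached_files: int,
    discard_ratio: float = 0.2,  # assumed value, not taken from this report
) -> List[int]:
    """Return the number of files processed in each micro-batch.

    Simplified model: if the previous listing left cached files, the next
    batch reads only from that cache; otherwise it relists everything.
    """
    batch_sizes = []
    remaining = total_files  # files not yet processed
    cached = 0               # unread files carried over from the last listing
    while remaining > 0:
        # A non-empty cache replaces a fresh listing of the filesystem.
        available = cached if cached > 0 else remaining
        batch = min(available, max_files_per_trigger)
        unselected = available - batch
        # Drop a small leftover instead of caching it; otherwise cap it.
        if unselected < max_files_per_trigger * discard_ratio:
            cached = 0
        else:
            cached = min(unselected, max_cached_files)
        remaining -= batch
        batch_sizes.append(batch)
    return batch_sizes


# Hardcoded-style cap of 10k: every other batch is only 10% of the trigger size.
print(simulate_batches(300_000, 100_000, 10_000))
# Cap raised to match maxFilesPerTrigger, or caching disabled with 0.
print(simulate_batches(300_000, 100_000, 100_000))
print(simulate_batches(300_000, 100_000, 0))
```

In this model, a cap of 10k alternates full batches with 10k-file batches,
while a cap of 100k or 0 keeps every batch at the full trigger size. The
difference between those last two is only whether the driver holds the next
100k file entries in memory or relists on each batch.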