kevin nacios created SPARK-44924:
------------------------------------

             Summary: Add configurations for FileStreamSource cached files
                 Key: SPARK-44924
                 URL: https://issues.apache.org/jira/browse/SPARK-44924
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
    Affects Versions: 3.1.0
            Reporter: kevin nacios


With https://issues.apache.org/jira/browse/SPARK-30866, caching of listed files 
was added for structured streaming to reduce cost of relisting from filesystem 
each batch.  The settings that drive this are currently hardcoded and there is 
no way to change them.  

 

This impacts some of our workloads where we process large datasets where its 
unknown how "heavy" some files are, so a single batch can take a long period of 
time.  When we set maxFilesPerTrigger to 100k files, a subsequent batch using 
the cached max of 10k files is causing the job to take longer since the cluster 
is capable of handling the 100k files but is stuck doing 10% of the workload.  
The benefit of the caching doesn't outweigh the cost of the performance on the 
rest of the job.

 

With config settings available for this, we could either absorb some increased 
driver memory usage for caching the next 100k files, or opt to disable caching 
entirely and just relist files each batch by setting the cache amount to 0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to