[
https://issues.apache.org/jira/browse/SPARK-44924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772845#comment-17772845
]
Mike K commented on SPARK-44924:
--------------------------------
User 'ragnarok56' has created a pull request for this issue:
https://github.com/apache/spark/pull/42623
> Add configurations for FileStreamSource cached files
> ----------------------------------------------------
>
> Key: SPARK-44924
> URL: https://issues.apache.org/jira/browse/SPARK-44924
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.1.0
> Reporter: kevin nacios
> Priority: Minor
>
> With https://issues.apache.org/jira/browse/SPARK-30866, caching of listed
> files was added to Structured Streaming to reduce the cost of relisting
> files from the filesystem on each batch. The settings that drive this
> behavior are currently hardcoded, and there is no way to change them.
>
> This impacts some of our workloads that process large datasets where it's
> unknown how "heavy" individual files are, so a single batch can take a long
> time. When we set maxFilesPerTrigger to 100k files, a subsequent batch that
> reads only the cached maximum of 10k files causes the job to take longer,
> since the cluster is capable of handling 100k files but is stuck doing 10%
> of the workload. The benefit of the caching doesn't outweigh the
> performance cost on the rest of the job.
>
> With config settings available for this, we could either absorb some
> increased driver memory usage for caching the next 100k files, or opt to
> disable caching entirely and just relist files each batch by setting the
> cache amount to 0.
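The interaction described above can be sketched as a toy model. This is not
Spark code and the function name is hypothetical; it only illustrates the
reported behavior, in which a batch served from the file-listing cache is
capped by the hardcoded cache size (10k) rather than by maxFilesPerTrigger:

```python
def next_batch_size(pending_files, max_files_per_trigger, cached_files_cap):
    """Toy model: files processed in the next micro-batch.

    If the previous listing left a non-empty cache, the next batch reads
    only from the cache (at most cached_files_cap files) instead of
    relisting the source.
    """
    cached = min(pending_files, cached_files_cap)
    if cached > 0:
        return min(cached, max_files_per_trigger)
    # Empty (or disabled) cache: relist the source and take up to
    # maxFilesPerTrigger files, as before SPARK-30866.
    return min(pending_files, max_files_per_trigger)

# With the hardcoded 10k cache, a cluster sized for 100k files per batch
# processes only 10k in the cached batch:
print(next_batch_size(500_000, 100_000, 10_000))   # 10000
# A configurable (larger) cache would restore full batches:
print(next_batch_size(500_000, 100_000, 100_000))  # 100000
# Setting the cache amount to 0 would fall back to relisting each batch:
print(next_batch_size(500_000, 100_000, 0))        # 100000
```

Under this model, the proposed configuration simply trades driver memory
(a larger cache) against either throttled batches or per-batch relisting cost.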
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]