HeartSaVioR commented on pull request #27620: URL: https://github.com/apache/spark/pull/27620#issuecomment-650932618
After looking at a couple more issues with the file stream source, I feel we also need an upper bound on the cache, since the file stream source already contributes to memory usage on the driver and this adds a (possibly) unbounded amount of memory. I guess 10,000 entries is good enough, as it covers 100 batches when maxFilesPerTrigger is set to 100, and 10 batches when maxFilesPerTrigger is set to 1000. Once we find that a higher value is OK for memory usage and helpful for the majority of workloads, we can make it configurable with a higher default value.
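
For illustration, here is a minimal Scala sketch of what a size-bounded, insertion-ordered cache with a 10,000-entry cap could look like. The class name, API, and eviction policy are assumptions made for this sketch only; they are not the actual implementation in this PR.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

/**
 * Hypothetical sketch of a size-bounded cache for file entries kept on the
 * driver. The name, default cap, and eviction policy are illustrative
 * assumptions, not the code proposed in this PR.
 */
class BoundedFileEntryCache[K, V](maxEntries: Int = 10000) {

  // LinkedHashMap with accessOrder = false keeps insertion order, so the
  // oldest inserted entry is evicted first once the cap is exceeded.
  private val underlying = new JLinkedHashMap[K, V](16, 0.75f, false) {
    override protected def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
      size() > maxEntries
  }

  def put(key: K, value: V): Unit = underlying.put(key, value)
  def get(key: K): Option[V] = Option(underlying.get(key))
  def size: Int = underlying.size()
}

object BoundedFileEntryCacheExample extends App {
  val cache = new BoundedFileEntryCache[String, Long](maxEntries = 10000)

  // With maxFilesPerTrigger = 100, a 10,000-entry cap covers ~100 batches;
  // with maxFilesPerTrigger = 1000, it covers ~10 batches.
  (1 to 10500).foreach(i => cache.put(s"file_$i", i.toLong))

  assert(cache.size == 10000) // the oldest 500 entries were evicted
}
```

The batch arithmetic above follows directly from the cap: with 10,000 entries and maxFilesPerTrigger files per batch, the cache retains roughly 10,000 / maxFilesPerTrigger batches' worth of entries.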
