[jira] [Commented] (HADOOP-18291) SingleFilePerBlockCache does not have a limit

Daniel Carl Jones (Jira) Thu, 30 Jun 2022 05:23:04 -0700


    [ 
https://issues.apache.org/jira/browse/HADOOP-18291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17561018#comment-17561018
 ]


Daniel Carl Jones commented on HADOOP-18291:
--------------------------------------------

I propose making the disk caching mechanism work much like the in-memory 
prefetching, with a manager vending out a fixed amount of disk per 
<whatever we pick for prefetching>.

This fixed amount might be allocated per input stream, as it is now, or we may 
move to per S3A filesystem or per JVM. In any case, the logic should be the 
same; it only depends on how many caching managers/pools we create.
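
As a rough illustration of the "manager vending out a fixed amount of disk" 
idea, here is a minimal sketch. The class and method names (DiskCacheBudget, 
tryReserve, release, availableBytes) are hypothetical and are not part of the 
existing S3A prefetching code; the sketch only tracks a byte budget and leaves 
file creation and deletion to the caller.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical fixed-size disk budget. One instance could be created per
 * input stream, per filesystem, or per JVM; the logic is the same either way.
 */
final class DiskCacheBudget {
    private final long capacityBytes;
    private final AtomicLong usedBytes = new AtomicLong();

    DiskCacheBudget(long capacityBytes) {
        if (capacityBytes <= 0) {
            throw new IllegalArgumentException("capacityBytes must be positive");
        }
        this.capacityBytes = capacityBytes;
    }

    /** Reserves space for a block; returns false if the budget would be exceeded. */
    boolean tryReserve(long bytes) {
        if (bytes < 0) {
            throw new IllegalArgumentException("bytes must not be negative");
        }
        while (true) {
            long used = usedBytes.get();
            if (bytes > capacityBytes - used) {
                return false; // caller should evict something or skip caching
            }
            if (usedBytes.compareAndSet(used, used + bytes)) {
                return true;
            }
        }
    }

    /** Returns space to the budget after a block file has been deleted. */
    void release(long bytes) {
        usedBytes.updateAndGet(used -> Math.max(0, used - bytes));
    }

    long availableBytes() {
        return capacityBytes - usedBytes.get();
    }
}
```

A caller would invoke tryReserve(blockSize) before writing a block file and 
release(blockSize) after deleting it. Sharing one instance between several 
streams gives a per-filesystem or per-JVM limit without changing the logic.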

> SingleFilePerBlockCache does not have a limit
> ---------------------------------------------
>
>                 Key: HADOOP-18291
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18291
>             Project: Hadoop Common
>          Issue Type: Sub-task
>            Reporter: Ahmar Suhail
>            Priority: Major
>
> Currently there is no limit on the size of the disk cache. This means we 
> could have a large number of files on disk, especially for access patterns 
> that are very random and do not always read the block fully. 
>  
> e.g.:
> in.seek(5);
> in.read();
> in.seek(blockSize + 10); // block 0 gets saved to disk as it's not fully read
> in.read();
> in.seek(2 * blockSize + 10); // block 1 gets saved to disk
> // ... and so on
>  
> The in-memory cache is bounded, and by default has a limit of 72MB (9 
> blocks). When a block has been fully read and a seek is issued, it is released 
> [here|https://github.com/apache/hadoop/blob/feature-HADOOP-18028-s3a-prefetch/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/read/S3CachingInputStream.java#L109].
> We can also delete the on-disk file for the block here if it exists. 
>  
> We could also add an upper limit on disk space and, when this limit is 
> reached, delete the file which stores data of the block furthest from the 
> current block (similar to the in-memory cache). 
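
The "delete the block furthest from the current block" policy described in the 
issue could look roughly like the sketch below. FurthestBlockEvictor and its 
methods are hypothetical names, not existing Hadoop classes. The sketch only 
does the bookkeeping and returns the block numbers whose files the caller 
should delete. On a tie it evicts the lower-numbered block, which is an 
arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical bookkeeping for a size-limited on-disk block cache. */
final class FurthestBlockEvictor {
    private final long maxBytes;
    private long usedBytes;
    // Block number -> size on disk. Sorted keys make the furthest block
    // from any position either the first or the last key.
    private final TreeMap<Integer, Long> blockSizes = new TreeMap<>();

    FurthestBlockEvictor(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /**
     * Records a block written to disk and returns the block numbers that
     * should be deleted to get back under the limit. The new block itself
     * may be returned if it is the furthest from currentBlock.
     */
    List<Integer> put(int blockNumber, long sizeBytes, int currentBlock) {
        Long previous = blockSizes.put(blockNumber, sizeBytes);
        usedBytes += sizeBytes - (previous == null ? 0L : previous);

        List<Integer> evicted = new ArrayList<>();
        while (usedBytes > maxBytes && !blockSizes.isEmpty()) {
            int lowest = blockSizes.firstKey();
            int highest = blockSizes.lastKey();
            long lowDistance = Math.abs((long) lowest - currentBlock);
            long highDistance = Math.abs((long) highest - currentBlock);
            int victim = lowDistance >= highDistance ? lowest : highest;
            usedBytes -= blockSizes.remove(victim);
            evicted.add(victim);
        }
        return evicted;
    }

    boolean contains(int blockNumber) {
        return blockSizes.containsKey(blockNumber);
    }

    long usedBytes() {
        return usedBytes;
    }
}
```

With a limit of two blocks, caching blocks 0 and 1 and then block 5 while 
reading block 5 would evict block 0, since it is the furthest from the 
current position.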



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

