[jira] [Commented] (HADOOP-18291) S3A prefetch - Implement LRU cache for SingleFilePerBlockCache

ASF GitHub Bot (Jira) Fri, 30 Jun 2023 17:36:04 -0700


    [ 
https://issues.apache.org/jira/browse/HADOOP-18291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739255#comment-17739255
 ]


ASF GitHub Bot commented on HADOOP-18291:
-----------------------------------------

virajjasani commented on PR #5754:
URL: https://github.com/apache/hadoop/pull/5754#issuecomment-1615308837

   @mukund-thakur I tried using Guava's LoadingCache, but it does not evict 
cache entries consistently. Eviction appears to happen asynchronously with weak 
refs, which leads to an inconsistent number of entries.
   
   For instance, even when I set the max size to 1, I can see 8 entries in the 
map for more than 15s. Hence, maintaining consistency under concurrency seems 
really problematic with this implementation.
   There is an option to set the concurrency level too, but eviction is still 
not frequent enough. I suspect this might be because of this:
   ```
   An update to the map and recording
   of reads may not be immediately reflected on the algorithm's data structures.
   ```
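
For comparison, a strictly bounded LRU map can be sketched with the JDK's `LinkedHashMap` in access order, guarded by a single lock. This is only an illustrative sketch, not the implementation in the PR: the class name `StrictLruBlockCache`, its methods, and the use of a `String` path as the value are all hypothetical. A real disk cache would also need to delete the evicted block's file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch: an LRU map from block number to cache-file path.
 * Every method is synchronized, so the size never exceeds the limit
 * once put() has returned.
 */
final class StrictLruBlockCache {
  private final LinkedHashMap<Integer, String> entries;

  StrictLruBlockCache(final int maxEntries) {
    if (maxEntries < 1) {
      throw new IllegalArgumentException("maxEntries must be >= 1");
    }
    // accessOrder = true: iteration runs from least to most recently used.
    this.entries = new LinkedHashMap<Integer, String>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
        // Called by put() after insertion; returning true drops the eldest
        // entry. A disk cache would delete the eldest block's file here.
        return size() > maxEntries;
      }
    };
  }

  /** Returns the path for the block and marks it most recently used. */
  synchronized String get(int blockNumber) {
    return entries.get(blockNumber);
  }

  /** Adds or replaces a block, evicting the least recently used if full. */
  synchronized void put(int blockNumber, String path) {
    entries.put(blockNumber, path);
  }

  /** Checks membership without changing the access order. */
  synchronized boolean contains(int blockNumber) {
    return entries.containsKey(blockNumber);
  }

  synchronized int size() {
    return entries.size();
  }
}
```

The trade-off is that a single lock serializes all readers and writers, which is the cost of keeping the entry count exact at all times.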




> S3A prefetch - Implement LRU cache for SingleFilePerBlockCache
> --------------------------------------------------------------
>
>                 Key: HADOOP-18291
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18291
>             Project: Hadoop Common
>          Issue Type: Sub-task
>    Affects Versions: 3.4.0
>            Reporter: Ahmar Suhail
>            Assignee: Viraj Jasani
>            Priority: Major
>              Labels: pull-request-available
>
> Currently there is no limit on the size of the disk cache. This means we 
> could have a large number of files on disk, especially for access patterns 
> that are very random and do not always read the block fully. 
>  
> e.g.:
> in.seek(5);
> in.read();
> in.seek(blockSize + 10); // block 0 gets saved to disk as it's not fully read
> in.read();
> in.seek(2 * blockSize + 10); // block 1 gets saved to disk
> // ... and so on
>  
> The in-memory cache is bounded, and by default has a limit of 72MB (9 
> blocks). When a block is fully read and a seek is issued, it is released 
> [here|https://github.com/apache/hadoop/blob/feature-HADOOP-18028-s3a-prefetch/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/read/S3CachingInputStream.java#L109].
>  We can also delete the on-disk file for the block here if it exists. 
>  
> Also maybe add an upper limit on disk space, and delete the file which stores 
> data of the block furthest from the current block (similar to the in memory 
> cache) when this limit is reached. 
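
The "furthest from the current block" idea in the description above could be sketched roughly as follows. This is a hypothetical illustration only: `FurthestBlockPolicy` and `chooseVictim` are made-up names, not part of the S3A code, and the sketch assumes the cache can list the block numbers it currently holds on disk.

```java
import java.util.Collection;

/** Hypothetical sketch of a "furthest block first" eviction choice. */
final class FurthestBlockPolicy {
  private FurthestBlockPolicy() {
  }

  /**
   * Returns the cached block number with the largest distance from
   * currentBlock, or -1 if nothing is cached. Ties go to the block
   * encountered first in iteration order.
   */
  static int chooseVictim(Collection<Integer> cachedBlocks, int currentBlock) {
    int victim = -1;
    long maxDistance = -1;
    for (int block : cachedBlocks) {
      // Widen to long so the subtraction cannot overflow.
      long distance = Math.abs((long) block - currentBlock);
      if (distance > maxDistance) {
        maxDistance = distance;
        victim = block;
      }
    }
    return victim;
  }
}
```

Once the disk limit is reached, the caller would delete the file backing the returned block before caching a new one.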



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

