[
https://issues.apache.org/jira/browse/HADOOP-18291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739242#comment-17739242
]
ASF GitHub Bot commented on HADOOP-18291:
-----------------------------------------
virajjasani commented on PR #5754:
URL: https://github.com/apache/hadoop/pull/5754#issuecomment-1615259963
> Can we not use some already inbuilt cache rather than us implementing it from scratch ( this part is interesting for sure :))
> https://github.com/google/guava/blob/master/guava/src/com/google/common/cache/LocalCache.java
> guava cache supports maximumSize for eviction and it internally uses LRU. (see the Java docs of the class.)
Interesting, from the javadoc:
```
 * The page replacement algorithm's data structures are kept casually consistent
 * with the map. The ordering of writes to a segment is sequentially consistent.
 * An update to the map and recording of reads may not be immediately reflected
 * on the algorithm's data structures.
```
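For reference, a minimal sketch of how such a bounded guava cache could be wired up for this use case (the block number -> file path mapping and the size of 9 are assumptions for illustration, not the actual prefetch code):
```
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.nio.file.Path;
import java.nio.file.Paths;

public class GuavaBlockCacheSketch {
  public static void main(String[] args) {
    // Bounded cache of block number -> on-disk file path; beyond 9 entries,
    // guava evicts entries that are (approximately) least recently used.
    Cache<Integer, Path> blockCache = CacheBuilder.newBuilder()
        .maximumSize(9)
        .build();

    blockCache.put(0, Paths.get("/tmp/block-0"));
    // A read: per the javadoc above, its effect on the eviction ordering
    // may be recorded lazily rather than immediately.
    Path cached = blockCache.getIfPresent(0);
    System.out.println("cached: " + cached);
  }
}
```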
For a not-so-heavily loaded cache, perhaps our own implementation might be better, given that in our case even reads would be immediately reflected in the doubly linked list data structure. Let me explore a bit more though. Thanks for the reference!
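For illustration, a minimal sketch of that strictly-ordered alternative, using an access-ordered `LinkedHashMap` as a stand-in for an explicit doubly linked list (a hypothetical class, not the actual `SingleFilePerBlockCache` code):
```
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the actual SingleFilePerBlockCache: an
// access-ordered LinkedHashMap moves an entry to the most-recently-used
// position on every get(), so even reads are reflected in the eviction
// order immediately (unlike guava's lazy read recording).
public class StrictLruCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxEntries;

  public StrictLruCache(int maxEntries) {
    super(16, 0.75f, true); // accessOrder = true
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // Evict the least-recently-used entry once capacity is exceeded;
    // this is also where the on-disk block file could be deleted.
    return size() > maxEntries;
  }
}
```
A thread-safe variant would still need external synchronization (e.g. `Collections.synchronizedMap`) or an explicit lock around the linked-list updates.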
> S3A prefetch - Implement LRU cache for SingleFilePerBlockCache
> --------------------------------------------------------------
>
> Key: HADOOP-18291
> URL: https://issues.apache.org/jira/browse/HADOOP-18291
> Project: Hadoop Common
> Issue Type: Sub-task
> Affects Versions: 3.4.0
> Reporter: Ahmar Suhail
> Assignee: Viraj Jasani
> Priority: Major
> Labels: pull-request-available
>
> Currently there is no limit on the size of the disk cache. This means we could
> have a large number of files on disk, especially for access patterns that
> are very random and do not always read the block fully.
>
> e.g.:
> in.seek(5);
> in.read();
> in.seek(blockSize + 10); // block 0 gets saved to disk as it's not fully read
> in.read();
> in.seek(2 * blockSize + 10); // block 1 gets saved to disk
> .. and so on
>
> The in-memory cache is bounded and by default has a limit of 72 MB (9
> blocks). When a block is fully read and a seek is issued, it's released
> [here|https://github.com/apache/hadoop/blob/feature-HADOOP-18028-s3a-prefetch/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/read/S3CachingInputStream.java#L109].
> We can also delete the on-disk file for the block here, if it exists.
>
> We could also add an upper limit on disk space and, when this limit is
> reached, delete the file which stores the block furthest from the current
> block (similar to the in-memory cache).
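A minimal sketch of that distance-based disk eviction (a hypothetical helper; the `blockFiles` map from block number to cache file path is an assumption for illustration, not existing code):
```
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

public final class FurthestBlockEviction {
  private FurthestBlockEviction() {
  }

  // Hypothetical helper: once the disk cache exceeds its limit, delete the
  // file of the cached block whose index is furthest from the block being
  // read, mirroring the distance-based policy of the in-memory cache.
  public static void evictFurthest(Map<Integer, Path> blockFiles, int currentBlock)
      throws IOException {
    Integer furthest = null;
    for (Integer block : blockFiles.keySet()) {
      if (furthest == null
          || Math.abs(block - currentBlock) > Math.abs(furthest - currentBlock)) {
        furthest = block;
      }
    }
    if (furthest != null) {
      Files.deleteIfExists(blockFiles.remove(furthest));
    }
  }
}
```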