[jira] [Comment Edited] (HADOOP-18291) S3A prefetch - Implement LRU cache for SingleFilePerBlockCache

2023-07-14 Thread Quan Li (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17743347#comment-17743347
 ] 

Quan Li edited comment on HADOOP-18291 at 7/15/23 5:57 AM:
---

The tests are failing in our internal builds. I can't follow the code; there are 
tons of review comments, and it is not clear whether the reviewers effectively 
wrote the code through that volume of review comments.

[~mthakur] [~mehakmeet] [~ayushsaxena] [~hexiaoqiao] [~inigoiri] 

Can someone fix or revert this?

ticket -> let the reviewer fix it via review -> still breaks -> addendum

It is very tough to backport changes that follow this pattern.


was (Author: quanli):
The tests are failing in our internal builds. I can't follow the code; there are 
tons of review comments, and it is not clear whether the reviewers effectively 
wrote the code through that volume of review comments.

[~mthakur] [~mehakmeet] [~ayushsaxena] [~hexiaoqiao] [~inigoiri] 

Can someone fix or revert this?

> S3A prefetch - Implement LRU cache for SingleFilePerBlockCache
> --
>
> Key: HADOOP-18291
> URL: https://issues.apache.org/jira/browse/HADOOP-18291
> Project: Hadoop Common
>  Issue Type: Sub-task
>Affects Versions: 3.4.0
>Reporter: Ahmar Suhail
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.9
>
>
> Currently there is no limit on the size of the disk cache. This means we could 
> end up with a large number of files on disk, especially for access patterns that 
> are very random and do not always read a block fully. 
>  
> eg:
> in.seek(5);
> in.read(); 
> in.seek(blockSize + 10); // block 0 gets saved to disk as it's not fully read
> in.read();
> in.seek(2 * blockSize + 10); // block 1 gets saved to disk
> .. and so on
>  
> The in-memory cache is bounded, and by default has a limit of 72MB (9 
> blocks). When a block is fully read and a seek is issued, it is released 
> [here|https://github.com/apache/hadoop/blob/feature-HADOOP-18028-s3a-prefetch/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/read/S3CachingInputStream.java#L109].
>  We could also delete the on-disk file for the block at that point, if it exists. 
>  
> Also, maybe add an upper limit on disk space and, when that limit is reached, 
> delete the file storing the data of the block furthest from the current block 
> (similar to the in-memory cache). 
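
As a rough illustration of the eviction described above, a bounded, access-ordered map
can act as an LRU index over the on-disk block files, deleting the least recently used
file once the cap is exceeded. This is only a sketch under assumed names
(DiskBlockLruIndex, maxBlocks, put/get/remove); it is not the SingleFilePerBlockCache API.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative sketch only: an access-ordered LinkedHashMap used as an LRU
 * index over on-disk block files. All names here are hypothetical and are
 * not part of SingleFilePerBlockCache.
 */
public class DiskBlockLruIndex {

  private final int maxBlocks;
  private final Map<Integer, Path> blockFiles;

  public DiskBlockLruIndex(int maxBlocks) {
    this.maxBlocks = maxBlocks;
    // accessOrder=true moves entries to the tail on get()/put(),
    // so the eldest entry is always the least recently used block.
    this.blockFiles = new LinkedHashMap<Integer, Path>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<Integer, Path> eldest) {
        if (size() > DiskBlockLruIndex.this.maxBlocks) {
          deleteBestEffort(eldest.getValue());  // evict the LRU block file
          return true;
        }
        return false;
      }
    };
  }

  /** Record (or refresh) the cache file backing a block. */
  public synchronized void put(int blockNumber, Path file) {
    blockFiles.put(blockNumber, file);
  }

  /** Look up a cached block file, refreshing its LRU position. */
  public synchronized Path get(int blockNumber) {
    return blockFiles.get(blockNumber);
  }

  /** Drop a block explicitly, e.g. once it has been fully read. */
  public synchronized void remove(int blockNumber) {
    Path file = blockFiles.remove(blockNumber);
    if (file != null) {
      deleteBestEffort(file);
    }
  }

  private static void deleteBestEffort(Path file) {
    try {
      Files.deleteIfExists(file);
    } catch (IOException e) {
      // best effort: leave the stale file for a later cleanup pass
    }
  }
}
{code}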






[jira] [Comment Edited] (HADOOP-18291) S3A prefetch - Implement LRU cache for SingleFilePerBlockCache

2023-06-17 Thread Viraj Jasani (Jira)


[ 
https://issues.apache.org/jira/browse/HADOOP-18291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17721569#comment-17721569
 ] 

Viraj Jasani edited comment on HADOOP-18291 at 6/17/23 6:48 AM:


{quote}you'd maybe want a block cache - readers would lock their block before a 
read; unlock after. Use an LRU policy for recycling blocks, with unbuffer/close 
releasing all blocks of a caller.
{quote}
-if jobs using S3A prefetching get aborted without calling s3afs#close, and the 
prefetched block files are kept on EBS volumes that could be accessed again by a 
new VM instance or container that resumes the jobs, we might also want to 
consider deleting all old local block files as part of s3afs#initialize-


was (Author: vjasani):
{quote}you'd maybe want a block cache - readers would lock their block before a 
read; unlock after. Use an LRU policy for recycling blocks, with unbuffer/close 
releasing all blocks of a caller.
{quote}
if jobs using S3A prefetching get aborted without calling s3afs#close, and the 
prefetched block files are kept on EBS volumes that could be accessed again by a 
new VM instance or container that resumes the jobs, we might also want to 
consider deleting all old local block files as part of s3afs#initialize
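
To make the quoted idea concrete, the sketch below shows one way readers could lock a
block before a read and unlock after, with an LRU policy for recycling blocks and a
releaseAll() path for unbuffer()/close(). The names (LockingBlockCache, Block, acquire,
releaseAll) are assumptions for illustration only, not the actual S3A prefetching classes.

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Hypothetical sketch of the quoted idea: each cached block carries a
 * read/write lock; readers hold the read lock while using the block, and
 * eviction takes the write lock so an in-use block is never recycled.
 * Names are illustrative only, not the S3A prefetching code.
 */
public class LockingBlockCache {

  /** A cached block plus the lock readers hold while using it. */
  public static final class Block {
    final int number;
    final byte[] data;
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    Block(int number, byte[] data) {
      this.number = number;
      this.data = data;
    }

    public byte[] data() {
      return data;
    }

    /** Readers must call this promptly once they are done with the block. */
    public void release() {
      lock.readLock().unlock();
    }
  }

  private final int maxBlocks;
  // Access-ordered map: iteration starts at the least recently used block.
  private final LinkedHashMap<Integer, Block> blocks =
      new LinkedHashMap<>(16, 0.75f, true);

  public LockingBlockCache(int maxBlocks) {
    this.maxBlocks = maxBlocks;
  }

  /** Lock a block for reading; returns null if it is not cached. */
  public synchronized Block acquire(int blockNumber) {
    Block b = blocks.get(blockNumber);
    if (b != null) {
      b.lock.readLock().lock();   // held until the caller invokes release()
    }
    return b;
  }

  /** Add a block, recycling the least recently used one once full. */
  public synchronized void put(int blockNumber, byte[] data) {
    if (blocks.size() >= maxBlocks && !blocks.isEmpty()) {
      Map.Entry<Integer, Block> eldest = blocks.entrySet().iterator().next();
      Block victim = eldest.getValue();
      // Wait until no reader still holds the victim's read lock.
      victim.lock.writeLock().lock();
      try {
        blocks.remove(eldest.getKey());
      } finally {
        victim.lock.writeLock().unlock();
      }
    }
    blocks.put(blockNumber, new Block(blockNumber, data));
  }

  /** unbuffer()/close() path: drop every block held for this stream. */
  public synchronized void releaseAll() {
    blocks.clear();
  }
}
{code}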

> S3A prefetch - Implement LRU cache for SingleFilePerBlockCache
> --
>
> Key: HADOOP-18291
> URL: https://issues.apache.org/jira/browse/HADOOP-18291
> Project: Hadoop Common
>  Issue Type: Sub-task
>Affects Versions: 3.4.0
>Reporter: Ahmar Suhail
>Assignee: Viraj Jasani
>Priority: Major
>
> Currently there is no limit on the size of the disk cache. This means we could 
> end up with a large number of files on disk, especially for access patterns that 
> are very random and do not always read a block fully. 
>  
> eg:
> in.seek(5);
> in.read(); 
> in.seek(blockSize + 10); // block 0 gets saved to disk as it's not fully read
> in.read();
> in.seek(2 * blockSize + 10); // block 1 gets saved to disk
> .. and so on
>  
> The in-memory cache is bounded, and by default has a limit of 72MB (9 
> blocks). When a block is fully read and a seek is issued, it is released 
> [here|https://github.com/apache/hadoop/blob/feature-HADOOP-18028-s3a-prefetch/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/read/S3CachingInputStream.java#L109].
>  We could also delete the on-disk file for the block at that point, if it exists. 
>  
> Also, maybe add an upper limit on disk space and, when that limit is reached, 
> delete the file storing the data of the block furthest from the current block 
> (similar to the in-memory cache). 


