[ https://issues.apache.org/jira/browse/HADOOP-18291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739315#comment-17739315 ]

ASF GitHub Bot commented on HADOOP-18291:
-----------------------------------------

virajjasani commented on code in PR #5754:
URL: https://github.com/apache/hadoop/pull/5754#discussion_r1248909177


##########
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/prefetch/SingleFilePerBlockCache.java:
##########
@@ -103,13 +114,17 @@ private enum LockType {
       READ,
       WRITE
     }
+    private Entry previous;

Review Comment:
   we could also have used the `LinkedList` that Java provides, which is internally backed by a doubly linked list. The catch is that the only way to remove an element from the middle of that list is the `remove(Object o)` method, which traverses the list from the head and therefore costs O(n). By maintaining the prev/next pointers ourselves, we can unlink any element, anywhere in the list, in O(1).
   
   `LinkedList` javadoc: https://docs.oracle.com/javase/8/docs/api/java/util/LinkedList.html
   ```
       /**
        * Removes the first occurrence of the specified element from this list,
        * if it is present.  If this list does not contain the element, it is
        * unchanged.  More formally, removes the element with the lowest index
        * {@code i} such that
        * {@code (o==null ? get(i)==null : o.equals(get(i)))}
        * (if such an element exists).  Returns {@code true} if this list
        * contained the specified element (or equivalently, if this list
        * changed as a result of the call).
        *
        * @param o element to be removed from this list, if present
        * @return {@code true} if this list contained the specified element
        */
       public boolean remove(Object o) {
           if (o == null) {
               for (Node<E> x = first; x != null; x = x.next) {
                   if (x.item == null) {
                       unlink(x);
                       return true;
                   }
               }
           } else {
               for (Node<E> x = first; x != null; x = x.next) {
                   if (o.equals(x.item)) {
                       unlink(x);
                       return true;
                   }
               }
           }
           return false;
       }
   ```
   
   we could use `remove(int index)` instead, but then we would end up maintaining indexes, which is even more complicated: whenever an element is added to or removed from the list, we would have to update the indexes of all affected elements in our map, so it ends up being an O(n) operation anyway.
   
   hence, I decided to maintain the doubly linked list ourselves; the algorithm is quite simple and proved to be robust (based on a good amount of concurrency testing that I did).
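   
   For illustration, here is a minimal sketch of the hand-rolled approach. The `Node` and `DoublyLinkedList` names are hypothetical (not the actual PR code); the point is that, given a direct reference to a node, unlinking needs no traversal:
   ```
   // Illustrative sketch only; names are hypothetical, not the PR code.
   final class Node<E> {
     final E item;
     Node<E> prev;
     Node<E> next;

     Node(E item) {
       this.item = item;
     }
   }

   final class DoublyLinkedList<E> {
     private Node<E> head;
     private Node<E> tail;

     /** Appends a node at the tail in O(1). */
     void addLast(Node<E> node) {
       node.prev = tail;
       node.next = null;
       if (tail == null) {
         head = node;
       } else {
         tail.next = node;
       }
       tail = node;
     }

     /** Unlinks a node in O(1), given a direct reference to it. */
     void unlink(Node<E> node) {
       if (node.prev == null) {
         head = node.next;
       } else {
         node.prev.next = node.next;
       }
       if (node.next == null) {
         tail = node.prev;
       } else {
         node.next.prev = node.prev;
       }
       node.prev = null;
       node.next = null;
     }
   }
   ```
   
   Because the cache can keep a direct reference to each entry's node (e.g. in its map), removal is pure pointer manipulation; external synchronization (such as the cache's read/write locks) is still required for concurrent use.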





> S3A prefetch - Implement LRU cache for SingleFilePerBlockCache
> --------------------------------------------------------------
>
>                 Key: HADOOP-18291
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18291
>             Project: Hadoop Common
>          Issue Type: Sub-task
>    Affects Versions: 3.4.0
>            Reporter: Ahmar Suhail
>            Assignee: Viraj Jasani
>            Priority: Major
>              Labels: pull-request-available
>
> Currently there is no limit on the size of the disk cache. This means we could 
> end up with a large number of block files on disk, especially for access 
> patterns that are very random and do not always read the block fully. 
>  
> e.g.:
> in.seek(5);
> in.read();
> in.seek(blockSize + 10); // block 0 gets saved to disk as it's not fully read
> in.read();
> in.seek(2 * blockSize + 10); // block 1 gets saved to disk
> ... and so on
>  
> The in-memory cache is bounded, and by default has a limit of 72MB (9 
> blocks). When a block is fully read and a seek is issued, the block is released 
> [here|https://github.com/apache/hadoop/blob/feature-HADOOP-18028-s3a-prefetch/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/read/S3CachingInputStream.java#L109].
>  At that point we can also delete the block's on-disk file, if it exists. 
>  
> We could also add an upper limit on disk space and, when that limit is reached, 
> delete the file which stores the data of the block furthest from the current 
> block (similar to the in-memory cache). 
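
For context, here is a minimal sketch of the bounded-LRU eviction behaviour the issue describes, using java.util.LinkedHashMap in access order. The class name, the Integer block-number key, and the delete-on-evict hook are assumptions for illustration only; the actual PR instead maintains its own doubly linked list inside SingleFilePerBlockCache, as discussed in the review comment above.

```
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative only: caps the number of cached block files and, when the
 * cap is exceeded, evicts the least-recently-used block and deletes its
 * backing file from disk.
 */
class BoundedDiskBlockCache extends LinkedHashMap<Integer, Path> {

  private final int maxBlocks;

  BoundedDiskBlockCache(int maxBlocks) {
    // accessOrder=true: iteration order becomes LRU -> MRU.
    super(16, 0.75f, true);
    this.maxBlocks = maxBlocks;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<Integer, Path> eldest) {
    if (size() <= maxBlocks) {
      return false;
    }
    try {
      // Best effort: remove the evicted block's backing file from disk.
      Files.deleteIfExists(eldest.getValue());
    } catch (IOException e) {
      // Swallowed to keep the sketch simple; real code should log this.
    }
    return true;
  }
}
```

Note that this sketch is not thread-safe on its own; a real implementation (like the PR's) needs locking around lookup and eviction.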


