[jira] [Work logged] (HADOOP-18028) improve S3 read speed using prefetching & caching

ASF GitHub Bot (Jira) Wed, 01 Dec 2021 14:28:05 -0800


     [ 
https://issues.apache.org/jira/browse/HADOOP-18028?focusedWorklogId=688946&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-688946
 ]


ASF GitHub Bot logged work on HADOOP-18028:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 01/Dec/21 22:27
            Start Date: 01/Dec/21 22:27
    Worklog Time Spent: 10m 
      Work Description: rbalamohan commented on a change in pull request #3736:
URL: https://github.com/apache/hadoop/pull/3736#discussion_r760618253



##########
File path: hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/common/BoundedResourcePool.java
##########
@@ -0,0 +1,179 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hadoop.fs.common;
+
+import java.util.Collections;
+import java.util.IdentityHashMap;
+import java.util.Set;
+import java.util.concurrent.ArrayBlockingQueue;
+
+/**
+ * Manages a fixed pool of resources.
+ *
+ * Avoids creating a new resource if a previously created instance is already available.
+ */
+public abstract class BoundedResourcePool<T> extends ResourcePool<T> {
+  // The size of this pool. Fixed at creation time.
+  private final int size;
+
+  // Items currently available in the pool.
+  private ArrayBlockingQueue<T> items;
+
+  // Items that have been created so far (regardless of whether they are currently available).
+  private Set<T> createdItems;
+
+  /**
+   * Constructs a resource pool of the given size.
+   *
+   * @param size the size of this pool. Cannot be changed post creation.
+   */
+  public BoundedResourcePool(int size) {
+    Validate.checkPositiveInteger(size, "size");
+
+    this.size = size;
+    this.items = new ArrayBlockingQueue<T>(size);
+
+    // The created items are identified based on their object reference.
+    this.createdItems = Collections.newSetFromMap(new IdentityHashMap<T, Boolean>());
+  }
+
+  /**
+   * Acquires a resource, blocking if necessary until one becomes available.
+   */
+  @Override
+  public T acquire() {
+    return this.acquireHelper(true);
+  }
+
+  /**
+   * Acquires a resource without blocking if one is immediately available. Otherwise returns null.
+   */
+  @Override
+  public T tryAcquire() {
+    return this.acquireHelper(false);
+  }
+
+  /**
+   * Releases a previously acquired resource.
+   */
+  @Override
+  public void release(T item) {
+    Validate.checkNotNull(item, "item");
+
+    synchronized (this.createdItems) {
+      if (!this.createdItems.contains(item)) {
+        throw new IllegalArgumentException("This item is not a part of this pool");
+      }
+    }
+
+    // Return if this item was released earlier.
+    // We cannot use this.items.contains() because that check is not based on reference equality.
+    for (T entry : this.items) {
+      if (entry == item) {
+        return;
+      }
+    }
+
+    while (true) {
+      try {
+        this.items.put(item);

Review comment:
       Is the while loop needed? Doesn't ArrayBlockingQueue.put() inherently wait for
space to become available?
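
For reference on the question above, here is a minimal Java sketch (not code from the pull request). `ArrayBlockingQueue.put` does block until space is available, but it also throws `InterruptedException` if the waiting thread is interrupted. The diff quoted here is cut off at the `put` call, so the purpose of the loop is not visible. One plausible reason, assumed here, is to retry after an interrupt. The names `BlockingPutSketch`, `putUninterruptibly` and `putBlocksUntilSpace` are hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical sketch; not part of the pull request under review.
final class BlockingPutSketch {

  // Puts an item, retrying if the thread is interrupted while waiting.
  // The interrupt status is restored before returning so callers can still see it.
  static <T> void putUninterruptibly(ArrayBlockingQueue<T> queue, T item) {
    boolean interrupted = false;
    try {
      while (true) {
        try {
          queue.put(item);   // blocks until the queue has free space
          return;
        } catch (InterruptedException e) {
          interrupted = true; // remember the interrupt and retry the put
        }
      }
    } finally {
      if (interrupted) {
        Thread.currentThread().interrupt();
      }
    }
  }

  // Returns true if put() stayed blocked while the queue was full
  // and completed once a slot was freed.
  static boolean putBlocksUntilSpace() {
    ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
    try {
      queue.put("first");                 // queue is now full
      Thread producer = new Thread(() -> putUninterruptibly(queue, "second"));
      producer.start();
      producer.join(200);                 // producer cannot finish: no free slot
      boolean blockedWhileFull = producer.isAlive();
      queue.take();                       // free one slot
      producer.join(2000);                // producer should now complete
      return blockedWhileFull && !producer.isAlive() && "second".equals(queue.peek());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("put blocked until space was freed: " + putBlocksUntilSpace());
  }
}
```

If the loop exists only to wait for space, a single `put` call is enough. If it exists to survive interrupts, restoring the interrupt status afterwards, as in the sketch, avoids silently swallowing it.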




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.



Issue Time Tracking
-------------------

    Worklog Id:     (was: 688946)
    Time Spent: 1h 40m  (was: 1.5h)

> improve S3 read speed using prefetching & caching
> -------------------------------------------------
>
>                 Key: HADOOP-18028
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18028
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>            Reporter: Bhalchandra Pandit
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> I work for Pinterest. I developed a technique for vastly improving read 
> throughput when reading from the S3 file system. It not only helps the 
> sequential read case (like reading a SequenceFile) but also significantly 
> improves read throughput of a random access case (like reading Parquet). This 
> technique has been very useful in significantly improving efficiency of the 
> data processing jobs at Pinterest. 
>  
> I would like to contribute that feature to Apache Hadoop. More details on 
> this technique are available in this blog I wrote recently:
> [https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
