[jira] [Commented] (HADOOP-18028) High performance S3A input stream with prefetching & caching

ASF GitHub Bot (Jira) Mon, 17 Apr 2023 05:33:05 -0700


    [ 
https://issues.apache.org/jira/browse/HADOOP-18028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17713065#comment-17713065
 ]


ASF GitHub Bot commented on HADOOP-18028:
-----------------------------------------

ahmarsuhail commented on PR #5559:
URL: https://github.com/apache/hadoop/pull/5559#issuecomment-1511252967

   Looks good so far. Not sure if this is helpful, but the patches that came after 
this big commit are (listed in the order they were committed to trunk): 
   
   - ITestS3ACannedACLs failure; not in a span: 
[JIRA](https://issues.apache.org/jira/browse/HADOOP-18385), 
[PR](https://github.com/apache/hadoop/pull/4736)
   - fs.s3a.prefetch.block.size to be read through longBytesOption: 
[JIRA](https://issues.apache.org/jira/browse/HADOOP-18380), 
[PR](https://github.com/apache/hadoop/pull/4762)
   - s3a prefetching to use SemaphoredDelegatingExecutor for submitting work: 
[JIRA](https://issues.apache.org/jira/browse/HADOOP-18186), 
[PR](https://github.com/apache/hadoop/pull/4796)
   - hadoop-aws maven build to add a prefetch profile to run all tests with 
prefetching: [JIRA](https://issues.apache.org/jira/browse/HADOOP-18377), 
[PR](https://github.com/apache/hadoop/pull/4914)
   - s3a prefetching Executor should be closed: 
[JIRA](https://issues.apache.org/jira/browse/HADOOP-18455), 
[PR](https://github.com/apache/hadoop/pull/4879) & 
[PR](https://github.com/apache/hadoop/pull/4926)
   - Implement readFully(long position, byte[] buffer, int offset, int length) 
- [JIRA](https://issues.apache.org/jira/browse/HADOOP-18378), 
[PR](https://github.com/apache/hadoop/pull/4955)
   - S3PrefetchingInputStream to support status probes when closed - 
[JIRA](https://issues.apache.org/jira/browse/HADOOP-18189), 
[PR](https://github.com/apache/hadoop/pull/5036)
   - assertion failure in ITestS3APrefetchingInputStream - 
[JIRA](https://issues.apache.org/jira/browse/HADOOP-18531), 
[PR](https://github.com/apache/hadoop/pull/5149)
   - Remove lower limit on s3a prefetching/caching block size - 
[JIRA](https://issues.apache.org/jira/browse/HADOOP-18246), 
[PR](https://github.com/apache/hadoop/pull/5120)
   - S3A prefetching: Error logging during reads - 
[JIRA](https://issues.apache.org/jira/browse/HADOOP-18351), 
[PR](https://github.com/apache/hadoop/pull/5274)
   
   Patch available, but not merged yet:
   
   - SingleFilePerBlockCache to use LocalDirAllocator for file allocation: 
[JIRA](https://issues.apache.org/jira/browse/HADOOP-18399), 
[PR](https://github.com/apache/hadoop/pull/5054)




> High performance S3A input stream with prefetching & caching
> ------------------------------------------------------------
>
>                 Key: HADOOP-18028
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18028
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>            Reporter: Bhalchandra Pandit
>            Assignee: Bhalchandra Pandit
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 14.5h
>  Remaining Estimate: 0h
>
> I work for Pinterest. I developed a technique for vastly improving read 
> throughput when reading from the S3 file system. It not only helps the 
> sequential read case (like reading a SequenceFile) but also significantly 
> improves read throughput of a random access case (like reading Parquet). This 
> technique has been very useful in significantly improving efficiency of the 
> data processing jobs at Pinterest. 
>  
> I would like to contribute that feature to Apache Hadoop. More details on 
> this technique are available in this blog I wrote recently:
> [https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
