[ https://issues.apache.org/jira/browse/HADOOP-18028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17556348#comment-17556348 ]

Ahmar Suhail commented on HADOOP-18028:
---------------------------------------

[[email protected]] On memory usage: the stream uses a bounded buffer pool, so the 
maximum memory used by a single input stream at any point is 72MB. Do you have 
any data on how many input streams an application typically opens 
simultaneously, and how high that number gets in practice? We should probably 
tune the defaults accordingly. The other concern is disk usage; for now, rather 
than enforcing a limit, it should be enough to delete each cache file once it 
has been read completely. 
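
To make the sizing question concrete, here is a back-of-the-envelope sketch of 
the worst case. The 8MB block size and the 9-buffer pool bound are assumptions 
chosen to reproduce the 72MB per-stream figure above, and the stream count is 
purely hypothetical:

{code:java}
// Worst-case memory arithmetic for the bounded prefetch buffer pool.
// All three numbers below are assumptions, not the shipped defaults.
public final class PrefetchMemoryEstimate {
  public static void main(String[] args) {
    long blockSize = 8L * 1024 * 1024; // assumed prefetch block size: 8MB
    int buffersPerStream = 9;          // assumed bounded-pool size per stream
    int openStreams = 100;             // hypothetical concurrent input streams

    long perStream = blockSize * buffersPerStream; // 72MB, the figure above
    long aggregate = perStream * openStreams;      // 7200MB across the app
    System.out.printf("per-stream max: %dMB, aggregate: %dMB%n",
        perStream >> 20, aggregate >> 20);
  }
}
{code}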

Also opened a couple of PRs:
 * [https://github.com/apache/hadoop/pull/4458] - Adds IOStatistics support (a 
short sketch of reading those statistics is below). 
 * [https://github.com/apache/hadoop/pull/4469] - Updates documentation and 
disables prefetching by default. 
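
To illustrate what the iostats PR gives users, here is a minimal sketch of 
retrieving the statistics from an S3A input stream. retrieveIOStatistics() and 
ioStatisticsToPrettyString() are the existing Hadoop IOStatistics APIs; the 
bucket and object path are made up:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.statistics.IOStatistics;

import static org.apache.hadoop.fs.statistics.IOStatisticsLogging.ioStatisticsToPrettyString;
import static org.apache.hadoop.fs.statistics.IOStatisticsSupport.retrieveIOStatistics;

public class PrefetchStatsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("s3a://example-bucket/data/part-00000.parquet"); // hypothetical object
    try (FileSystem fs = path.getFileSystem(conf);
         FSDataInputStream in = fs.open(path)) {
      byte[] buf = new byte[1 << 20];
      // Drain the stream so the prefetch/read counters accumulate.
      while (in.read(buf) > 0) {
        // discard the data; we only care about the statistics
      }
      IOStatistics stats = retrieveIOStatistics(in);
      System.out.println(ioStatisticsToPrettyString(stats));
    }
  }
}
{code}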

I think the feature is well documented now, but we should probably state 
somewhere that it is not yet stable. Is the best place to do that the 
[prefetching 
docs|https://github.com/apache/hadoop/blob/feature-HADOOP-18028-s3a-prefetch/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/prefetching.md]?
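
If we do mark it unstable there, it may also help to show the opt-in 
explicitly once prefetching is off by default. A minimal sketch; the property 
names are assumptions based on the feature branch and may change before the 
feature stabilizes:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class PrefetchOptIn {
  // Returns a Configuration with the experimental prefetch stream enabled.
  // All three property names below are assumptions, not committed keys.
  public static Configuration withPrefetching() {
    Configuration conf = new Configuration();
    conf.setBoolean("fs.s3a.prefetch.enabled", true);
    conf.setLong("fs.s3a.prefetch.block.size", 8L * 1024 * 1024);
    conf.setInt("fs.s3a.prefetch.block.count", 8);
    return conf;
  }
}
{code}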

> High performance S3A input stream with prefetching & caching
> ------------------------------------------------------------
>
>                 Key: HADOOP-18028
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18028
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>            Reporter: Bhalchandra Pandit
>            Assignee: Bhalchandra Pandit
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 13h 50m
>  Remaining Estimate: 0h
>
> I work for Pinterest. I developed a technique that vastly improves read 
> throughput when reading from the S3 file system. It not only helps the 
> sequential read case (like reading a SequenceFile) but also significantly 
> improves read throughput in the random access case (like reading Parquet). 
> This technique has significantly improved the efficiency of data processing 
> jobs at Pinterest. 
>  
> I would like to contribute that feature to Apache Hadoop. More details on 
> the technique are available in a blog post I wrote recently:
> [https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0]
>  



