[jira] [Work logged] (HADOOP-18028) improve S3 read speed using prefetching & caching

ASF GitHub Bot (Jira) Thu, 17 Feb 2022 17:36:08 -0800


     [ 
https://issues.apache.org/jira/browse/HADOOP-18028?focusedWorklogId=729381&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-729381
 ]


ASF GitHub Bot logged work on HADOOP-18028:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 18/Feb/22 01:35
            Start Date: 18/Feb/22 01:35
    Worklog Time Spent: 10m 
      Work Description: bhalchandrap commented on pull request #3736:
URL: https://github.com/apache/hadoop/pull/3736#issuecomment-1043710783


   Hi Steve,
   Thanks for your earlier reviews. Take your time; I do not want to add to
   your workload. I am happy to work with you using any process that you
   recommend. Let me know what you recommend and I will adopt it.
   
   On Thu, Feb 17, 2022 at 12:40 PM Steve Loughran ***@***.***>
   wrote:
   
   > I'm not deliberately ignoring you, just so overloaded with my own work
   > that I've not been able to review anything.
   >
   > We've put Mukund's vectored IO patch into a feature branch, the idea
   > being that we still do the normal review process for every patch, but we
   > can get that short chain of patches lined up before the big merge into
   > trunk. We can also rebase and interactively rebase as needed.
   >
   > Would that work for you too? Patch 1 would be "take what there is", with
   > patch 2+ being the changes we want in before shipping.
   >
   > The risk is always that it gets forgotten, so we still need to push hard
   > to get it into a state where it can be used as an option, the goal being
   > "no side effects if not used", including nothing extra on the classpath.
   >
   > Meanwhile, have a look at #2584
   > <https://github.com/apache/hadoop/pull/2584>.
   >
   > It's a big patch, but a key feature is that you can declare your read
   > policy, with whole-file being an option as well as random, sequential and
   > vectored.
   >
   > distcp and the CLI tools all declare their read plans this way.
   >
   > I'd like this in before both the vectored IO and your stream, so you can
   > use it to help decide whether to cache etc., and to support custom options
   > as well as parse the newly defined standard ones.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 729381)
    Time Spent: 11h  (was: 10h 50m)

> improve S3 read speed using prefetching & caching
> -------------------------------------------------
>
>                 Key: HADOOP-18028
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18028
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>            Reporter: Bhalchandra Pandit
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 11h
>  Remaining Estimate: 0h
>
> I work for Pinterest. I developed a technique for vastly improving read 
> throughput when reading from the S3 file system. It helps not only the 
> sequential read case (such as reading a SequenceFile) but also significantly 
> improves read throughput in the random access case (such as reading Parquet). 
> This technique has significantly improved the efficiency of data processing 
> jobs at Pinterest. 
>  
> I would like to contribute this feature to Apache Hadoop. More details on 
> the technique are available in a blog post I wrote recently:
> [https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

