[jira] [Work logged] (HADOOP-18028) improve S3 read speed using prefetching & caching

ASF GitHub Bot (Jira) Fri, 25 Mar 2022 13:17:06 -0700


     [ 
https://issues.apache.org/jira/browse/HADOOP-18028?focusedWorklogId=747970&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-747970
 ]


ASF GitHub Bot logged work on HADOOP-18028:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 25/Mar/22 20:16
            Start Date: 25/Mar/22 20:16
    Worklog Time Spent: 10m 
      Work Description: steveloughran opened a new pull request #4109:
URL: https://github.com/apache/hadoop/pull/4109


   ### Description of PR
   
   This is PR #3736 applied to a dedicated feature branch, with some minor 
pre-merge tuning; all subsequent changes will be their own PRs.
   
   * rename test classes, have AbstractHadoopTestBase as the base
   * package info files for new packages
   * import ordering
   * move to intercept() for assertions; ExceptionAsserts now invokes it and 
can be removed in future.
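
For context on the intercept() style of assertion named in the list above, here is a minimal, self-contained sketch of the pattern. It is a simplified stand-in, not the Hadoop implementation: the class name `InterceptSketch` and the failure messages are hypothetical, and the real helper in Hadoop's test utilities (`LambdaTestUtils.intercept`) has more overloads and richer diagnostics.

```java
import java.util.concurrent.Callable;

/** Hypothetical stand-in illustrating an intercept()-style assertion. */
public class InterceptSketch {

  /**
   * Evaluate the callable and require that it throws an exception of the
   * given class. The caught exception is returned so the test can make
   * further assertions on it.
   */
  static <T, E extends Throwable> E intercept(Class<E> clazz, Callable<T> eval) {
    T result;
    try {
      result = eval.call();
    } catch (Throwable t) {
      if (clazz.isInstance(t)) {
        // The expected failure: hand it back to the caller.
        return clazz.cast(t);
      }
      // Assumption of this sketch: any other failure is reported as an
      // AssertionError with the original exception as the cause.
      throw new AssertionError(
          "Expected " + clazz.getName() + " but caught " + t, t);
    }
    // The callable returned normally, so the assertion fails.
    throw new AssertionError(
        "Expected " + clazz.getName() + " but call returned: " + result);
  }

  public static void main(String[] args) {
    IllegalStateException e = intercept(IllegalStateException.class, () -> {
      throw new IllegalStateException("stream is closed");
    });
    System.out.println("caught: " + e.getMessage());
  }
}
```

Returning the caught exception keeps the expected-failure check and any follow-up checks on the message or cause in one place, which is why a wrapper class around it adds little.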
   
   This adds a dependency on a Twitter library which looks like Scala code. 
That MUST be cut before we can merge.
   
   ### How was this patch tested?
   
   Tested against S3 London with 
`-Dparallel-tests -DtestsThreadCount=8 -Dmarkers=keep -Dscale`
   
   There are some failing integration tests which will need to be fixed before 
the feature branch is merged to trunk.
   
   ### For code changes:
   
   - [X] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [X] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 747970)
    Time Spent: 12h  (was: 11h 50m)

> improve S3 read speed using prefetching & caching
> -------------------------------------------------
>
>                 Key: HADOOP-18028
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18028
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>            Reporter: Bhalchandra Pandit
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 12h
>  Remaining Estimate: 0h
>
> I work for Pinterest. I developed a technique for vastly improving read 
> throughput when reading from the S3 file system. It not only helps the 
> sequential read case (like reading a SequenceFile) but also significantly 
> improves read throughput of a random access case (like reading Parquet). This 
> technique has been very useful in significantly improving efficiency of the 
> data processing jobs at Pinterest. 
>  
> I would like to contribute that feature to Apache Hadoop. More details on 
> this technique are available in this blog I wrote recently:
> [https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

