[ 
https://issues.apache.org/jira/browse/HADOOP-16202?focusedWorklogId=521118&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-521118
 ]

ASF GitHub Bot logged work on HADOOP-16202:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Dec/20 12:12
            Start Date: 07/Dec/20 12:12
    Worklog Time Spent: 10m 
      Work Description: steveloughran commented on pull request #2168:
URL: https://github.com/apache/hadoop/pull/2168#issuecomment-739880426


   The latest patch rounds things off. This thing is ready to go in. 
   * We now have the option to specify the start and end of splits; the input 
formats in the MR client do this.
   * Everywhere in the code where we explicitly download sequential datasets, 
we now request sequential IO. (Actually, I've just realised `hadoop fs -head <path>` 
should request random IO as well as declare split lengths... we don't want a 
full GET.)
   
   It's important that FS implementations don't rely on split length to set the 
max file length, because splits are allowed to overrun to ensure a whole 
record/block is read. Apps which pass split info down to worker processes 
(Hive &c) need to pass in the file size too if they want to save the HEAD 
request. The split length could still be used by the input streams if they can 
think of a way:
   
   1. For sequential IO: end of the initial request's range = 
min(split-end, file-length).
   2. For random IO: treat the split end as the initial EOF.
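
   The two rules above can be sketched in a few lines of plain Java. This is 
only an illustration, not code from the patch: the class `SplitHints`, its 
method names and the `UNKNOWN = -1` convention are all hypothetical, and the 
fallback when one value is missing is an assumption.

```java
// Hypothetical helper, not part of Hadoop: illustrates how a stream might
// combine a declared split end with an (optionally known) file length.
final class SplitHints {
    /** Marker for "value not declared"; an assumed convention for this sketch. */
    static final long UNKNOWN = -1L;

    private SplitHints() {}

    /**
     * Sequential IO: end offset of the first ranged request,
     * min(split-end, file-length), falling back to whichever value is known.
     */
    static long sequentialRequestEnd(long splitEnd, long fileLength) {
        if (fileLength < 0) return splitEnd;
        if (splitEnd < 0) return fileLength;
        return Math.min(splitEnd, fileLength);
    }

    /**
     * Random IO: provisional EOF. A declared file length wins; otherwise the
     * split end is only a first guess, because a reader may overrun the split
     * to finish a record, so it must not be treated as a hard limit.
     */
    static long initialEof(long splitEnd, long fileLength) {
        return fileLength >= 0 ? fileLength : splitEnd;
    }

    public static void main(String[] args) {
        System.out.println(sequentialRequestEnd(128, 1000));   // 128
        System.out.println(sequentialRequestEnd(1200, 1000));  // 1000
        System.out.println(initialEof(128, UNKNOWN));          // 128
    }
}
```

   A real stream would still need to revise the provisional EOF once a read 
runs past the split end.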
   
   Because openFile() declares that FNFEs can be delayed until reads, we could 
also see if we could do an async HEAD request while processing that first 
GET/HEAD, and so have the final file length without blocking. That would make 
streams more complex, but at least now we have the option.
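
   As a rough sketch of that idea, the length probe could be held in a 
`CompletableFuture` that only blocks when a caller actually needs the value. 
Everything here is an assumption for illustration: `LazyFileLength` is a 
hypothetical class, and the `Supplier` stands in for whatever HEAD request a 
real store client would issue.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.function.Supplier;

// Hypothetical sketch, not Hadoop code: the file length is probed in the
// background so that opening the stream does not block on it.
final class LazyFileLength {
    private final CompletableFuture<Long> length;

    /** headProbe stands in for a store-specific HEAD request (assumed). */
    LazyFileLength(Supplier<Long> headProbe) {
        this.length = CompletableFuture.supplyAsync(headProbe);
    }

    /** True once the probe has finished successfully. */
    boolean isKnown() {
        return length.isDone() && !length.isCompletedExceptionally();
    }

    /** Blocks until the probe completes; a failed probe surfaces here. */
    long awaitLength() {
        return length.join();
    }

    public static void main(String[] args) {
        LazyFileLength len = new LazyFileLength(() -> 4096L);
        System.out.println(len.awaitLength()); // 4096

        LazyFileLength missing = new LazyFileLength(() -> {
            throw new IllegalStateException("simulated missing file");
        });
        try {
            missing.awaitLength();
        } catch (CompletionException e) {
            // The failure is only seen when the length is first needed.
            System.out.println("deferred failure: " + e.getCause().getMessage());
        }
    }
}
```

   The deferred failure mirrors the way openFile() allows a missing file to be 
reported at read time rather than at open time.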


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 521118)
    Time Spent: 3h 40m  (was: 3.5h)

> Stabilize openFile() and adopt internally
> -----------------------------------------
>
>                 Key: HADOOP-16202
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16202
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs, fs/s3, tools/distcp
>    Affects Versions: 3.3.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> The {{openFile()}} builder API lets us add new options when reading a file.
> Add an option {{"fs.s3a.open.option.length"}} which takes a long and allows 
> the length of the file to be declared. If set, *no check for the existence of 
> the file is issued when opening the file*.
> Also: withFileStatus() to take any FileStatus implementation, rather than 
> only S3AFileStatus, and not check that the path matches the path being 
> opened. Needed to support viewFS-style wrapping and mounting.
> Also adopt where appropriate to stop clusters with S3A reads switched to 
> random IO from killing download/localization:
> * fs shell copyToLocal
> * distcp
> * IOUtils.copy



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
