[ 
https://issues.apache.org/jira/browse/HADOOP-17414?focusedWorklogId=521367&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-521367
 ]

ASF GitHub Bot logged work on HADOOP-17414:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 07/Dec/20 20:26
            Start Date: 07/Dec/20 20:26
    Worklog Time Spent: 10m 
      Work Description: steveloughran opened a new pull request #2530:
URL: https://github.com/apache/hadoop/pull/2530


   …tten collected by spark
   
   This is a PoC which, having implemented it, I don't think is viable.
   
   Yes, we can fix up getFileStatus so it reads the header. It even knows
   to always bypass S3Guard (no inconsistencies to worry about any more).
   
   But: the blast radius of the change is too big. I'm worried about
   distcp or any other code which goes
   len = getFileStatus(path).getLen()
   open(path).readFully(0, len, dest)
   
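   You'll get an EOF here, because the marker object holds zero bytes while
   getFileStatus reports the length from the header. Find the file through a
   listing and you'll be OK, provided S3Guard hasn't been updated with that
   getFileStatus result, which I have seen happen.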
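
   A minimal sketch of that failure mode, simulated in plain Java with no
   Hadoop classes. The class and method names are hypothetical; the byte
   array stands in for the zero-byte marker object and `reportedLen` for the
   length a patched getFileStatus would return from the header.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Simulation only: illustrates reading with a length the object does not have. */
class MarkerLengthMismatch {

  /**
   * Tries to read reportedLen bytes from an object whose real content is
   * stored. Returns true if the read completed, false if it hit EOF.
   */
  static boolean readFullyUsingReportedLength(byte[] stored, int reportedLen) {
    byte[] dest = new byte[reportedLen];
    try (DataInputStream in =
             new DataInputStream(new ByteArrayInputStream(stored))) {
      // Same pattern as the caller above: trust the reported length.
      in.readFully(dest, 0, reportedLen);
      return true;
    } catch (EOFException e) {
      // The stream ran out before reportedLen bytes were delivered.
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] marker = new byte[0];   // the zero-byte marker object
    int headerLength = 1024;       // assumed length taken from the header

    // Length from a listing (0) matches the stored bytes: read succeeds.
    System.out.println(readFullyUsingReportedLength(marker, 0));
    // Length from the header does not: the read fails with EOF.
    System.out.println(readFullyUsingReportedLength(marker, headerLength));
  }
}
```

   Any caller that sizes its read from the reported length hits the second
   case, which is why the fixup is risky outside the committer itself.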
   
   The ordering of probes in 
ITestMagicCommitProtocol.validateTaskAttemptPathAfterWrite
   needs to be list before getFileStatus, so the S3Guard table is updated from
   the list.
   
   Overall: danger. Even without S3Guard there's risk.
   
   Anyway, this shows it can be done. And I think there's merit in a leaner
   patch which attaches the marker but doesn't do any fixup. This would let us
   add an API call "getObjectHeaders(path) -> Future<Map<String, String>>" and
   then use that to do the lookup. We can implement the probe for
   ABFS and S3, add a hasPathCapability probe for it as well as an interface
   the FS can implement (which passthrough filesystems would need to do).
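
   A rough sketch of what such an API could look like, in plain Java rather
   than against the Hadoop FileSystem classes. Everything here is an
   assumption for illustration: the interface name, the in-memory store, the
   header name `x-example-final-length` and the sample path are all
   hypothetical, and only the `getObjectHeaders(path)` shape comes from the
   proposal above.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

/** Hypothetical interface a filesystem could implement to expose headers. */
interface ObjectHeaderSource {
  Future<Map<String, String>> getObjectHeaders(String path);
}

/** In-memory stand-in for an object store; not a real S3 or ABFS client. */
class InMemoryHeaderStore implements ObjectHeaderSource {
  private final Map<String, Map<String, String>> headers =
      new ConcurrentHashMap<>();

  void putHeaders(String path, Map<String, String> objectHeaders) {
    headers.put(path, Map.copyOf(objectHeaders));
  }

  @Override
  public Future<Map<String, String>> getObjectHeaders(String path) {
    // Unknown paths yield an empty header map rather than an error.
    return CompletableFuture.completedFuture(
        headers.getOrDefault(path, Map.of()));
  }
}

class ObjectHeaderDemo {
  /** Hypothetical header carrying the eventual length of the output. */
  static final String LENGTH_HEADER = "x-example-final-length";

  /**
   * Returns the length from the marker header if present, otherwise the
   * length the caller already has from getFileStatus. The file status
   * itself is never altered, so existing readers are unaffected.
   */
  static long bytesWritten(ObjectHeaderSource fs, String path, long statusLen) {
    try {
      String value = fs.getObjectHeaders(path).get().get(LENGTH_HEADER);
      return value == null ? statusLen : Long.parseLong(value);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    InMemoryHeaderStore store = new InMemoryHeaderStore();
    store.putHeaders("dest/part-0000", Map.of(LENGTH_HEADER, "4096"));

    System.out.println(bytesWritten(store, "dest/part-0000", 0L)); // header wins
    System.out.println(bytesWritten(store, "dest/other", 0L));     // falls back
  }
}
```

   Keeping the lookup separate from getFileStatus means only callers that
   opt in, such as a statistics collector, ever see the header length.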
   
   Change-Id: If56213c0c5d8ab696d2d89b48ad52874960b0920
   
   ## NOTICE
   
   Please create an issue in ASF JIRA before opening a pull request,
   and you need to set the title of the pull request which starts with
   the corresponding JIRA issue number. (e.g. HADOOP-XXXXX. Fix a typo in YYY.)
   For more details, please see 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

            Worklog Id:     (was: 521367)
    Remaining Estimate: 0h
            Time Spent: 10m

> Magic committer files don't have the count of bytes written collected by spark
> ------------------------------------------------------------------------------
>
>                 Key: HADOOP-17414
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17414
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.2.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The Spark statistics tracking doesn't correctly assess the size of the 
> uploaded files, as it only calls getFileStatus on the zero-byte objects, not 
> the yet-to-manifest files.
> Everything works with the staging committer purely because it's measuring the 
> length of the files staged to the local FS, not the unmaterialized output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
