[
https://issues.apache.org/jira/browse/HADOOP-17414?focusedWorklogId=521367&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-521367
]
ASF GitHub Bot logged work on HADOOP-17414:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 07/Dec/20 20:26
Start Date: 07/Dec/20 20:26
Worklog Time Spent: 10m
Work Description: steveloughran opened a new pull request #2530:
URL: https://github.com/apache/hadoop/pull/2530
HADOOP-17414. Magic committer files don't have the count of bytes written collected by spark
This is a PoC which, having implemented it, I don't think is viable.
Yes, we can fix up getFileStatus so it reads the header. It even knows
to always bypass S3Guard (no inconsistencies to worry about any more).
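For illustration only, here's a minimal sketch of that header probe, assuming the
AWS SDK v1 client S3A uses and an illustrative user-metadata header name (neither
taken from the actual patch):

    // assumes an AmazonS3 client "s3" plus bucket/key strings are in scope
    ObjectMetadata md = s3.getObjectMetadata(bucket, key);
    // the zero-byte magic marker would carry the final length as user metadata
    String declared = md.getUserMetaDataOf("x-hadoop-s3a-magic-data-length");
    long len = declared != null
        ? Long.parseLong(declared)      // length of the yet-to-manifest data
        : md.getContentLength();        // fall back to the marker's own length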
But: the blast radius of the change is too big. I'm worried about
distcp or any other code which goes
len = getFileStatus(path).getLen()
open(path).readFully(0, dest, 0, len)
You'll get an EOF there. Find the file through a listing and you'll be OK,
provided S3Guard isn't updated with that getFileStatus result, which I
have seen happen.
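For contrast, a hedged sketch of a copy loop which would survive, because it streams
to EOF rather than sizing the read from getFileStatus() (the fs/destFs/destPath names
are illustrative):

    try (FSDataInputStream in = fs.open(path);
         FSDataOutputStream out = destFs.create(destPath)) {
      byte[] buffer = new byte[8192];
      int read;
      while ((read = in.read(buffer)) != -1) {   // read until EOF, ignore reported length
        out.write(buffer, 0, read);
      }
    }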
The ordering of probes in
ITestMagicCommitProtocol.validateTaskAttemptPathAfterWrite
needs to be list before getFileStatus, so that the S3Guard table is updated from
the list.
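Roughly, the ordering that test needs (paths and assertions here are placeholders,
not the real test code):

    // list first: the listing is what updates the S3Guard table
    RemoteIterator<LocatedFileStatus> entries = fs.listFiles(taskAttemptPath, true);
    while (entries.hasNext()) {
      LocatedFileStatus status = entries.next();
      // ... assertions against the listed entries ...
    }
    // only then the getFileStatus probe, whose header-derived length must not
    // be written back into S3Guard
    FileStatus marker = fs.getFileStatus(markerPath);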
Overall: danger. Even without S3Guard there's risk.
Anyway, this shows it can be done. And I think there's merit in a leaner patch
which attaches the marker but doesn't do any fixup. This would let us add
an API call "getObjectHeaders(path) -> Future<Map<String, String>>" and
then use that to do the lookup. We can implement the probe for
ABFS and S3, add a hasPathCapability probe for it as well as an interface
the FS can implement (which pass-through filesystems would need to do).
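A rough sketch of that leaner API; the interface and capability names below are just
placeholders, not a settled design:

    import java.io.IOException;
    import java.util.Map;
    import java.util.concurrent.Future;
    import org.apache.hadoop.fs.Path;

    /** Optional interface an FS (S3A, ABFS, pass-through wrappers) could implement. */
    public interface ObjectHeaderSource {
      /** Asynchronously fetch the headers/user metadata of the object at a path. */
      Future<Map<String, String>> getObjectHeaders(Path path) throws IOException;
    }

    // callers would probe first, e.g.
    //   fs.hasPathCapability(path, "fs.capability.object.headers")
    // before casting the FileSystem to ObjectHeaderSource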
Change-Id: If56213c0c5d8ab696d2d89b48ad52874960b0920
Issue Time Tracking
-------------------
Worklog Id: (was: 521367)
Remaining Estimate: 0h
Time Spent: 10m
> Magic committer files don't have the count of bytes written collected by spark
> ------------------------------------------------------------------------------
>
> Key: HADOOP-17414
> URL: https://issues.apache.org/jira/browse/HADOOP-17414
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.2.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The Spark statistics tracking doesn't correctly assess the size of the
> uploaded files, as it only calls getFileStatus on the zero-byte objects, not
> the yet-to-manifest files.
> Everything works with the staging committer purely because it's measuring the
> length of the files staged to the local FS, not the unmaterialized output.