steveloughran opened a new pull request #2530:
URL: https://github.com/apache/hadoop/pull/2530


   …tten collected by spark
   
   This is a PoC which, having implemented it, I don't think is viable.
   
   Yes, we can fix up getFileStatus so it reads the header. It even knows
   to always bypass S3Guard (no inconsistencies to worry about any more).
   
   But: the blast radius of the change is too big. I'm worried about
   distcp or any other code which does

       len = getFileStatus(path).getLen()
       open(path).readFully(0, len, dest)
   
   You'll get an EOF here. Find the file through a listing and you'll be OK,
   provided S3Guard hasn't been updated with that getFileStatus result, which
   I have seen happen.
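
   To make that failure mode concrete, here is a minimal sketch in plain
   Java with no Hadoop dependency. FakeObject and copyUsingReportedLength
   are hypothetical names, and it assumes the object holds fewer bytes than
   the length a patched getFileStatus would report. It only imitates the
   caller pattern above; it is not the S3A code path.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

public class MagicLengthMismatch {

  /** Hypothetical stand-in for a stored object plus the length reported for it. */
  static final class FakeObject {
    final byte[] stored;     // bytes actually held by the store
    final int reportedLen;   // length a (patched) status call would return

    FakeObject(byte[] stored, int reportedLen) {
      this.stored = stored;
      this.reportedLen = reportedLen;
    }
  }

  /**
   * Imitates: len = getFileStatus(path).getLen(); open(path).readFully(0, len, dest).
   * Returns true if the read completed, false if it hit end of stream first.
   */
  static boolean copyUsingReportedLength(FakeObject obj) {
    byte[] dest = new byte[obj.reportedLen];
    try (DataInputStream in =
             new DataInputStream(new ByteArrayInputStream(obj.stored))) {
      // readFully throws EOFException if fewer than dest.length bytes exist
      in.readFully(dest, 0, dest.length);
      return true;
    } catch (EOFException e) {
      return false;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Assumed scenario: empty object, but a reported length of 1024 bytes
    FakeObject marker = new FakeObject(new byte[0], 1024);
    System.out.println("copy completed: " + copyUsingReportedLength(marker));
  }
}
```

   Any caller which trusts the reported length this way fails, which is why
   the blast radius is hard to bound.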
   
   The ordering of probes in
   ITestMagicCommitProtocol.validateTaskAttemptPathAfterWrite
   needs to be list before getFileStatus, so the S3Guard table is updated
   from the listing.
   
   Overall: danger. Even without S3Guard there's risk.
   
   Anyway, this shows it can be done. And I think there's merit in a leaner
   patch which attaches the marker but doesn't do any fixup. This would let
   us add an API call "getObjectHeaders(path) -> Future<Map<String, String>>"
   and then use that to do the lookup. We can implement the probe for
   ABFS and S3, add a hasPathCapability probe for it, as well as an interface
   the FS can implement (which passthrough filesystems would need to do).
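
   A rough sketch of what that API shape could look like, in plain Java
   without Hadoop types. Everything here is an assumption for illustration:
   the ObjectHeaderSource interface, the InMemoryStore stand-in, the
   declaredLength helper and the "x-example-data-length" key are
   hypothetical names, and paths are plain strings rather than Path.

```java
import java.io.FileNotFoundException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

public class ObjectHeadersSketch {

  /** Hypothetical header key; the real key is not named in this description. */
  static final String LENGTH_HEADER = "x-example-data-length";

  /** Hypothetical interface a filesystem (or a passthrough wrapper) could implement. */
  interface ObjectHeaderSource {
    Future<Map<String, String>> getObjectHeaders(String path);
  }

  /** In-memory stand-in for an object store, for illustration only. */
  static final class InMemoryStore implements ObjectHeaderSource {
    private final Map<String, Map<String, String>> objects = new HashMap<>();

    void put(String path, Map<String, String> headers) {
      objects.put(path, new HashMap<>(headers));
    }

    @Override
    public Future<Map<String, String>> getObjectHeaders(String path) {
      CompletableFuture<Map<String, String>> result = new CompletableFuture<>();
      Map<String, String> headers = objects.get(path);
      if (headers == null) {
        // Missing object: fail the future rather than return an empty map
        result.completeExceptionally(new FileNotFoundException(path));
      } else {
        result.complete(Collections.unmodifiableMap(headers));
      }
      return result;
    }
  }

  /** Caller-side lookup: the declared length if the header is present. */
  static Optional<Long> declaredLength(ObjectHeaderSource fs, String path) {
    try {
      String value = fs.getObjectHeaders(path).get().get(LENGTH_HEADER);
      return value == null
          ? Optional.<Long>empty()
          : Optional.of(Long.parseLong(value));
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } catch (ExecutionException e) {
      throw new IllegalStateException(e.getCause());
    }
  }

  public static void main(String[] args) {
    InMemoryStore store = new InMemoryStore();
    store.put("/job/part-0000",
        Collections.singletonMap(LENGTH_HEADER, "1024"));
    System.out.println(declaredLength(store, "/job/part-0000"));
  }
}
```

   The point of this shape is that getFileStatus stays untouched: only
   callers which explicitly ask for the headers see the declared length.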
   
   Change-Id: If56213c0c5d8ab696d2d89b48ad52874960b0920
   

