Steve Loughran created HADOOP-13654:
---------------------------------------

             Summary: S3A create() to support asynchronous check of dest & 
parent paths
                 Key: HADOOP-13654
                 URL: https://issues.apache.org/jira/browse/HADOOP-13654
             Project: Hadoop Common
          Issue Type: Sub-task
          Components: fs/s3
    Affects Versions: 2.7.3
            Reporter: Steve Loughran


One source of delays in S3A is the need to check whether the destination path 
exists in create(); this makes sure the operation isn't trying to overwrite a 
directory.

# This is slow: 1-4 HTTPS requests.
# The code doesn't seem to check the entire parent path to make sure there 
isn't a file as a parent (which raises the question: shouldn't we have a 
contract test for this?)
# Even with the create overwrite=false check, the new object isn't created 
until the output stream is close()'d, so the check is open to race 
conditions.

Instead of doing a synchronous check in create(), we could do an asynchronous 
check of the parent directory tree. If any error surfaced, this could be cached 
and then thrown on the next call to: write(), flush() or close(); that is, the 
failure of a create due to path problems would not surface immediately on the 
create() call, *but before any writes were committed*.
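To make the deferred-failure idea concrete, here is a minimal sketch of a stream whose path check runs in the background. It is not S3A code: PathCheck and DeferredCheckOutputStream are hypothetical names, the executor stands in for whatever thread pool would be used, and cleanup of the wrapped stream on failure is left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;

/** Hypothetical check: throws IOException if the destination or a parent is invalid. */
interface PathCheck {
  void run() throws IOException;
}

/** Sketch of an output stream whose create-time path check runs asynchronously. */
class DeferredCheckOutputStream extends OutputStream {
  private final OutputStream out;
  private final CompletableFuture<Void> check;

  DeferredCheckOutputStream(OutputStream out, PathCheck pathCheck, Executor executor) {
    this.out = out;
    // Start the check without blocking the caller of create().
    this.check = CompletableFuture.runAsync(() -> {
      try {
        pathCheck.run();
      } catch (IOException e) {
        throw new CompletionException(e);
      }
    }, executor);
  }

  /** Rethrow a failure the check has already recorded; never block. */
  private void throwIfCheckFailed() throws IOException {
    if (check.isCompletedExceptionally()) {
      awaitCheck();
    }
  }

  /** Block until the check finishes, rethrowing its failure if there is one. */
  private void awaitCheck() throws IOException {
    try {
      check.join();
    } catch (CompletionException e) {
      if (e.getCause() instanceof IOException) {
        throw (IOException) e.getCause();
      }
      throw new IOException(e.getCause());
    }
  }

  @Override
  public void write(int b) throws IOException {
    throwIfCheckFailed();
    out.write(b);
  }

  @Override
  public void flush() throws IOException {
    throwIfCheckFailed();
    out.flush();
  }

  @Override
  public void close() throws IOException {
    awaitCheck(); // nothing is committed until the check has passed
    out.close();
  }
}
```

In this sketch write() and flush() only report a failure that is already known, while close() waits for the check to finish, so a bad path is always reported before the data is committed.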

The full directory tree can/should be checked, and its results remembered. This 
would allow for the post-commit cleanup to issue delete() requests purely for 
those paths (if any) which referred to directories.
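A rough sketch of such a tree check follows. It assumes keys are relative, "/"-separated strings with no leading slash; Kind, ParentTreeCheck and the probe function are hypothetical stand-ins for the real per-path status lookup.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** What a probe of a single key found. Hypothetical; not an S3A type. */
enum Kind { ABSENT, FILE, DIRECTORY }

class ParentTreeCheck {
  /**
   * Probe every ancestor of dest (for "a/b/c.txt": "a", then "a/b").
   * Fails if an ancestor is a file; returns the ancestors found to be
   * directories, which are the only candidates for post-commit delete().
   */
  static List<String> check(String dest, Function<String, Kind> probe)
      throws IOException {
    List<String> directories = new ArrayList<>();
    int slash = dest.indexOf('/');
    while (slash >= 0) {
      String parent = dest.substring(0, slash);
      Kind kind = probe.apply(parent);
      if (kind == Kind.FILE) {
        throw new IOException("Parent of " + dest + " is a file: " + parent);
      }
      if (kind == Kind.DIRECTORY) {
        directories.add(parent);
      }
      slash = dest.indexOf('/', slash + 1);
    }
    return directories;
  }
}
```

The returned list is the remembered result: ancestors that were absent never need a delete() request.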

As well as the need to use the AWS thread pool, there's a bit of complexity 
with cancelling multipart uploads: the output stream needs to know that the 
request failed, and that the multipart should be aborted.
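One possible shape for that hand-off, as a sketch only: PendingUpload is a hypothetical stand-in for an in-progress multipart upload (it is not an AWS SDK or S3A type), and the path check is assumed to be exposed as a CompletableFuture.

```java
import java.io.IOException;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

/** Hypothetical stand-in for an in-progress multipart upload. */
interface PendingUpload {
  void complete() throws IOException;
  void abort();
}

class UploadFinisher {
  /** Commit the upload only if the background path check passed; otherwise abort it. */
  static void finish(CompletableFuture<Void> pathCheck, PendingUpload upload)
      throws IOException {
    try {
      pathCheck.join();
    } catch (CompletionException | CancellationException e) {
      // The check failed or was cancelled: discard the uploaded parts.
      upload.abort();
      Throwable cause = e.getCause() != null ? e.getCause() : e;
      throw cause instanceof IOException
          ? (IOException) cause
          : new IOException(cause);
    }
    upload.complete();
  }
}
```

Here the abort happens at the point where close() would otherwise complete the upload, so no object becomes visible if the check failed.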

If the complexity of the asynchronous calls can be coped with, *and client code 
is happy to accept errors in any IO call to the output stream*, then the 
initial overhead at file creation could be skipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
