[ 
https://issues.apache.org/jira/browse/HADOOP-19072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842628#comment-17842628
 ] 

ASF GitHub Bot commented on HADOOP-19072:
-----------------------------------------

steveloughran commented on PR #6543:
URL: https://github.com/apache/hadoop/pull/6543#issuecomment-2088443377

   I'm going to propose we change how the options are done, and do something 
similar for ABFS.
   I think we need something like C/C++ optimisers where you pass in a -O list 
of things to optimise.
   This lets you turn on everything you can but turn off those which break 
your code.
   
   This avoids us having to deal with regressions where suddenly something 
breaks and the only fix is to turn off all optimisation.
   
   It also gives us the ability to add some very aggressive optimisations, such 
as disabling probing for and recreating parent directories on delete and 
rename. harshit has tried this and some things break.
   If we make this one of the flags, those deployments with applications which 
know they are robust can turn it on.
   
   
   * we add specific optimisation flags for different behaviours, which we can 
explicitly turn on and off.
   * we add a single "fs.s3a.performance.options" option which takes a list of 
these and parses it into an S3APerformanceFlags object containing the flags. We 
can move this parsing out of the S3AFileSystem code into the 
S3APerformanceFlags class, which assists testing.
   * unknown flags are logged once at info.
   * the performance flags are available from StoreContext.
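   
   A minimal sketch of how the parsing in the list above might look. Everything 
here is an assumption for illustration: the class name 
S3APerformanceFlagsSketch, the enum and the log wording are hypothetical, and 
the only flag names used are "create" and "mkdirs", taken from this comment.
   
   ```java
   import java.util.EnumSet;
   import java.util.Locale;
   import java.util.Set;
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.logging.Logger;

   /** Hypothetical stand-in for the proposed S3APerformanceFlags class. */
   final class S3APerformanceFlagsSketch {

     /** Flag names assumed from this comment; the real set is undecided. */
     enum Flag { CREATE, MKDIRS }

     private static final Logger LOG =
         Logger.getLogger(S3APerformanceFlagsSketch.class.getName());

     /** Unknown names already reported, so each is logged only once. */
     private static final Set<String> REPORTED = ConcurrentHashMap.newKeySet();

     private final Set<Flag> enabled;

     private S3APerformanceFlagsSketch(Set<Flag> enabled) {
       this.enabled = enabled;
     }

     /** Parse a comma-separated option value such as "create,mkdirs". */
     static S3APerformanceFlagsSketch parse(String value) {
       EnumSet<Flag> flags = EnumSet.noneOf(Flag.class);
       if (value != null) {
         for (String raw : value.split(",")) {
           String name = raw.trim().toLowerCase(Locale.ROOT);
           if (name.isEmpty()) {
             continue;
           }
           try {
             flags.add(Flag.valueOf(name.toUpperCase(Locale.ROOT)));
           } catch (IllegalArgumentException e) {
             // Unknown flags are ignored and logged once at info.
             if (REPORTED.add(name)) {
               LOG.info("Ignoring unknown performance flag: " + name);
             }
           }
         }
       }
       return new S3APerformanceFlagsSketch(flags);
     }

     boolean enabled(Flag flag) {
       return enabled.contains(flag);
     }
   }
   ```
   
   Keeping the parser as a static method with no filesystem dependency is what 
would let it be unit tested in isolation.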
   
   It would still be good to wire this up to hasPathCapability.
   Proposed: S3APerformanceFlags adds a hasCapability(string) method, and if 
s3aFS.hasPathCapability is probed for a feature beginning 
"fs.s3a.performance.options." then the rest of the string is passed in to that 
method for the check.
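   
   The prefix handling could be sketched as below. This is not the S3A 
implementation: CapabilityProbeSketch and probePathCapability are hypothetical 
names, the enabled flags are held as plain strings, and the Path argument that 
the real FileSystem.hasPathCapability takes is left out for brevity.
   
   ```java
   import java.util.HashSet;
   import java.util.Locale;
   import java.util.Set;

   /** Hypothetical sketch of routing a capability probe to the flags object. */
   final class CapabilityProbeSketch {

     /** Prefix proposed in this comment for performance capabilities. */
     static final String PREFIX = "fs.s3a.performance.options.";

     private final Set<String> enabledFlags;

     CapabilityProbeSketch(Set<String> enabledFlags) {
       this.enabledFlags = new HashSet<>(enabledFlags);
     }

     /** The proposed hasCapability(string): true if the named flag is on. */
     boolean hasCapability(String flag) {
       return enabledFlags.contains(flag.toLowerCase(Locale.ROOT));
     }

     /** Stand-in for the filesystem probe: strip the prefix, then delegate. */
     boolean probePathCapability(String capability) {
       if (capability.startsWith(PREFIX)) {
         return hasCapability(capability.substring(PREFIX.length()));
       }
       // Anything else would fall through to the existing capability checks.
       return false;
     }
   }
   ```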
   
   The create.performance flag now complicates things here. Too bad we have 
already shipped it.
   Proposed: S3APerformanceFlags still looks for that flag value, and also 
accepts "create" as one of the options
   
   that is
   ```
   fs.s3a.performance.options=create,mkdirs
   ```
   would cover both that and this new change.
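   
   One way the already-shipped flag could be folded in, as a sketch only. A 
plain Map stands in for the Hadoop Configuration, LegacyCreateFlagSketch and 
resolve() are hypothetical names, and treating the old boolean as equivalent to 
listing "create" is my reading of the proposal above.
   
   ```java
   import java.util.Locale;
   import java.util.Map;
   import java.util.Set;
   import java.util.TreeSet;

   /** Hypothetical sketch: merge the shipped boolean into the new list. */
   final class LegacyCreateFlagSketch {

     static final String OPTIONS_KEY = "fs.s3a.performance.options";
     static final String LEGACY_CREATE_KEY = "fs.s3a.create.performance";

     /** Returns the effective flag names for the given settings. */
     static Set<String> resolve(Map<String, String> conf) {
       Set<String> flags = new TreeSet<>();
       for (String raw : conf.getOrDefault(OPTIONS_KEY, "").split(",")) {
         String name = raw.trim().toLowerCase(Locale.ROOT);
         if (!name.isEmpty()) {
           flags.add(name);
         }
       }
       // Assumption: the old boolean behaves as if "create" had been listed.
       if (Boolean.parseBoolean(conf.getOrDefault(LEGACY_CREATE_KEY, "false"))) {
         flags.add("create");
       }
       return flags;
     }
   }
   ```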
   
   I think we can and should lay down a policy here then which is "the 
semantics of a specific optimisation flag MUST NOT CHANGE" but that new flags 
MAY be added to tune that behaviour further.
   
   The most radical would be if we copy the presto connector and declare that 
all paths which don't return a file or a non-empty listing are a directory.
   They can downgrade mkdirs() to a no-op, and have delete dir, rm dir etc. 
always report a parent dir existing, so their code is happy.
   
   They can get away with this because they know the exact semantics of the 
code, and that it does not break with this change. We lack that luxury across 
the broad pool of applications using our library. That doesn't mean we can't 
allow a select few applications to take advantage of optimisations we have 
written and tested with them.




> S3A: expand optimisations on stores with "fs.s3a.create.performance"
> --------------------------------------------------------------------
>
>                 Key: HADOOP-19072
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19072
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.4.0
>            Reporter: Steve Loughran
>            Assignee: Viraj Jasani
>            Priority: Major
>              Labels: pull-request-available
>
> on an s3a store with fs.s3a.create.performance set, speed up other operations
> *  mkdir to skip parent directory check: just do a HEAD to see if there's a 
> file at the target location



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
