[ https://issues.apache.org/jira/browse/HADOOP-19072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842628#comment-17842628 ]
ASF GitHub Bot commented on HADOOP-19072:
-----------------------------------------

steveloughran commented on PR #6543:
URL: https://github.com/apache/hadoop/pull/6543#issuecomment-2088443377

I'm going to propose we change how the options are done, and do something similar for ABFS.

I think we need something like C/C++ optimisers, where you pass in a -O list of things to optimise. This lets you turn on everything you can but turn off those which break your code. It saves us from having to deal with regressions where something suddenly breaks and the only fix is to turn off all optimisation.

It also gives us the ability to add some very aggressive optimisations, such as disabling probing for and recreating parent directories on delete and rename. Harshit has tried this and some things break. If we make this one of the flags, those deployments with applications which know they are robust can turn it on.

* we add specific optimisation flags for different behaviours, which we can explicitly turn on and off.
* we add a single "fs.s3a.performance.options" which takes a list of these and parses it to an S3APerformanceFlags object containing the flags. We can move this parsing out of the S3AFileSystem code into the S3APerformanceFlags class, which assists testing.
* unknown flags are logged once at info.
* the performance flags are available from StoreContext.

It would still be good to wire this up to hasPathCapability. Proposed: S3APerformanceFlags adds a hasCapability(string) method, and if s3aFS.hasPathCapability is probed for a feature beginning "fs.s3a.performance.options." then the rest of the string is passed in for the check.

The create.performance flag now complicates things here. Too bad we have already shipped it. Proposed: S3APerformanceFlags also looks for that flag value, but also adds "create" as one of the options, that is,

```
fs.s3a.performance.options=create,mkdirs
```

would cover both that and this new change.
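
A minimal sketch of how the parsing and capability probe proposed above could look. This is an assumption-laden illustration, not the Hadoop implementation: the class name `PerformanceFlagsSketch`, its enum and its methods are hypothetical, only the `create` and `mkdirs` flag names and the `fs.s3a.performance.options.` prefix come from the comment, and `System.err` stands in for a log-once-at-info call.

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the proposed flag parsing; not Hadoop code. */
final class PerformanceFlagsSketch {

  /** Illustrative flags: only "create" and "mkdirs" are named in the proposal. */
  enum Flag { CREATE, MKDIRS }

  /** Capability prefix taken from the proposal. */
  static final String CAPABILITY_PREFIX = "fs.s3a.performance.options.";

  /** Unknown names already reported, so each one is reported only once. */
  private static final Set<String> REPORTED = ConcurrentHashMap.newKeySet();

  private final EnumSet<Flag> flags;

  private PerformanceFlagsSketch(EnumSet<Flag> flags) {
    this.flags = flags;
  }

  /** Parse a comma-separated option value such as "create,mkdirs". */
  static PerformanceFlagsSketch parse(String value) {
    EnumSet<Flag> parsed = EnumSet.noneOf(Flag.class);
    if (value != null) {
      for (String raw : value.split(",")) {
        String name = raw.trim().toLowerCase(Locale.ROOT);
        if (name.isEmpty()) {
          continue;
        }
        Flag flag = lookup(name);
        if (flag != null) {
          parsed.add(flag);
        } else if (REPORTED.add(name)) {
          // Stand-in for a real logger: report each unknown flag once.
          System.err.println("INFO: ignoring unknown performance flag: " + name);
        }
      }
    }
    return new PerformanceFlagsSketch(parsed);
  }

  /** Map a lower-case name to a flag, or null if it is not known. */
  private static Flag lookup(String name) {
    for (Flag f : Flag.values()) {
      if (f.name().toLowerCase(Locale.ROOT).equals(name)) {
        return f;
      }
    }
    return null;
  }

  boolean enabled(Flag flag) {
    return flags.contains(flag);
  }

  /** True if the capability is the prefix followed by an enabled flag name. */
  boolean hasCapability(String capability) {
    if (capability == null || !capability.startsWith(CAPABILITY_PREFIX)) {
      return false;
    }
    String name = capability.substring(CAPABILITY_PREFIX.length());
    Flag flag = lookup(name.toLowerCase(Locale.ROOT));
    return flag != null && flags.contains(flag);
  }
}
```

Keeping the parsing in its own class, as suggested, means it can be unit tested with plain strings and no filesystem instance. How the already-shipped create.performance option would be merged into the flag set is left out of this sketch.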

I think we can and should lay down a policy here, which is "the semantics of a specific optimisation flag MUST NOT CHANGE", but that new flags MAY be added to tune that behaviour further.

The most radical would be if we copy the presto connector and declare that all paths which don't return a file or a non-empty listing are a directory. They can downgrade mkdirs() to a no-op, and have delete dir, rm dir etc. always report a parent dir as existing, so their code is happy. They can get away with this because they know the exact semantics of the code, and that it does not break with this change. We lack that luxury across the broad pool of applications using our library. That doesn't mean we can't allow a select few applications to take advantage of optimisations we have written and tested with them.

> S3A: expand optimisations on stores with "fs.s3a.create.performance"
> --------------------------------------------------------------------
>
>                 Key: HADOOP-19072
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19072
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.4.0
>            Reporter: Steve Loughran
>            Assignee: Viraj Jasani
>            Priority: Major
>              Labels: pull-request-available
>
> on an s3a store with fs.s3a.create.performance set, speed up other operations
> * mkdir to skip parent directory check: just do a HEAD to see if there's a
> file at the target location