[ 
https://issues.apache.org/jira/browse/HADOOP-17833?focusedWorklogId=781261&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-781261
 ]

ASF GitHub Bot logged work on HADOOP-17833:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 14/Jun/22 14:31
            Start Date: 14/Jun/22 14:31
    Worklog Time Spent: 10m 
      Work Description: steveloughran commented on code in PR #3289:
URL: https://github.com/apache/hadoop/pull/3289#discussion_r896902900


##########
hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdataoutputstreambuilder.md:
##########
@@ -182,3 +182,58 @@ see `FileSystem#create(path, ...)` and `FileSystem#append()`.
     result = FSDataOutputStream
 
 The result is `FSDataOutputStream` to be used to write data to filesystem.
+
+
+## <a name="s3a"></a> S3A-specific options
+
+Here are the custom options which the S3A Connector supports.
+
+| Name                        | Type      | Meaning                                |
+|-----------------------------|-----------|----------------------------------------|
+| `fs.s3a.create.performance` | `boolean` | create a file with maximum performance |
+| `fs.s3a.create.header`      | `string`  | prefix for user supplied headers       |
+
+### `fs.s3a.create.performance`
+
+Prioritize file creation performance over safety checks for filesystem consistency.
+
+This:
+1. Skips the `LIST` call which makes sure a file is not being created over a directory.
+   Risk: a file is created over a directory.
+1. Ignores the overwrite flag.
+1. Never issues a `DELETE` call to delete parent directory markers.
+
+It is possible to probe an S3A Filesystem instance for this capability through
+the `hasPathCapability(path, "fs.s3a.create.performance")` check.
+
+Creating files with this option over existing directories is likely
+to make S3A filesystem clients behave inconsistently.
+
+Operations optimized for directories (e.g. listing calls) are likely
+to see the directory tree, not the file; operations optimized for
+files (`getFileStatus()`, `isFile()`) are more likely to see the file.
+The exact form of the inconsistencies, and which operations/parameters
+trigger this are undefined and may change between even minor releases.
+
+Using this option is the equivalent of pressing and holding down the
+"Electronic Stability Control"
+button on a rear-wheel drive car for five seconds: the safety checks are off.
+Things will be faster if the driver knows what they are doing.
+If they don't, the fact they held the button down will
+be used as evidence at the inquest as proof that they made a
+conscious decision to choose speed over safety and
+that the outcome was their own fault.
+
+Accordingly: *Use if and only if you are confident that the conditions are met.*

Review Comment:
   We aren't actually that much more vulnerable than when someone creates a file 
under a file, *which they can do today*. 





Issue Time Tracking
-------------------

    Worklog Id:     (was: 781261)
    Time Spent: 12h  (was: 11h 50m)

> Improve Magic Committer Performance
> -----------------------------------
>
>                 Key: HADOOP-17833
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17833
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>    Affects Versions: 3.3.1
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 12h
>  Remaining Estimate: 0h
>
> Magic committer tasks can be slow because every file created with 
> overwrite=false triggers a HEAD (verify there's no file) and a LIST (that 
> there's no dir). And because of delayed manifestations, it may not behave as 
> expected.
> ParquetOutputFormat is one example of a library which does this.
> We could fix parquet to use overwrite=true, but (a) there may be surprises in 
> other uses, (b) it'd still leave the list and (c) it would do nothing for 
> other formats.
> Proposed: createFile() under a magic path to skip all probes for file/dir at 
> end of path
> Only a single task attempt will be writing to that directory and it should 
> know what it is doing. If there are conflicting file names and parts across 
> tasks, that won't even get picked up at this point. Oh, and none of the 
> committers ever check for this: you'll get the last file manifested (s3a) or 
> renamed (file).
> If we skip the checks we will save 2 HTTP requests/file.
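
As a rough illustration of the arithmetic in the description above, the following Java sketch models which probes precede each file creation. It is a toy model, not S3A code: the class `CreateProbeModel` and its methods are hypothetical names, and the only behaviour encoded is what the description states (overwrite=false costs a HEAD and a LIST, overwrite=true still leaves the LIST, and the proposed path skips both).

```java
import java.util.List;

/** Toy model of the pre-create probes described in the issue; hypothetical, not S3A code. */
class CreateProbeModel {

    /**
     * Probes issued before the upload of one file.
     * "performance" stands for the proposed skip-all-probes mode.
     */
    static List<String> probes(boolean overwrite, boolean performance) {
        if (performance) {
            return List.of();            // proposed: no probes at all
        }
        if (overwrite) {
            return List.of("LIST");      // overwrite=true still leaves the directory check
        }
        return List.of("HEAD", "LIST");  // overwrite=false: file check plus directory check
    }

    /** Requests avoided across a number of files by switching to the skip-all-probes mode. */
    static long requestsSaved(long files, boolean overwrite) {
        return files * probes(overwrite, false).size();
    }

    public static void main(String[] args) {
        System.out.println(probes(false, false));        // prints [HEAD, LIST]
        System.out.println(requestsSaved(1000, false));  // prints 2000
    }
}
```

For 1000 files created with overwrite=false, the model gives 2000 avoided requests, which matches the "2 HTTP requests/file" figure; with overwrite=true only the LIST would be saved.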



