[ https://issues.apache.org/jira/browse/HADOOP-19251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17873898#comment-17873898 ]

Steve Loughran commented on HADOOP-19251:
-----------------------------------------

Rename. Joy.

Way way back I did try to export the existing protected 
FileSystem.rename(source, dest, options) method as a public API -this is the 
one which FileContext invokes but which defaults to being non-atomic (the 
exists probes, see). What I love about this one is that it actually fails 
meaningfully rather than just returning false, leaving callers to invoke it as

{code}
if (!fs.rename(src, dest)) throw new IOException("rename failed but we have no idea why");
{code}

This is of course the good invocation; the bad one is where they don't check 
the result at all. Either way: pretty useless.
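
For contrast, FileContext.rename() already goes through that rename/3 path and 
raises meaningful exceptions instead of a boolean; a minimal sketch (the API is 
real, the paths are illustrative):

{code}
// raises FileNotFoundException, FileAlreadyExistsException etc. on failure,
// rather than returning false and leaving the caller to guess why
FileContext fc = FileContext.getFileContext(conf);
fc.rename(src, dest, Options.Rename.NONE);
{code}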



HADOOP-11452 make rename/3 public
https://github.com/apache/hadoop/pull/2735

I think I got distracted by other stuff but also by some of the implications. 
And the fact that even that wasn't enough for my needs.


You might also want to look at 
AzureBlobFileSystem.commitSingleFileByRename(src, dest, etag), which implements 
a fault-tolerant rename by recovering from load-related failures (503 returned 
but the rename worked, so a retry fails with 404). It also throws exceptions 
and returns some information about whether the rename was recovered and how 
long it took, adding to the manifest committer's statistics in _SUCCESS. Oh, 
and it is rate limited, because it is often that renaming which generates heavy 
load across the entire storage account and so impacts other applications.

All this convinced me that the way to do it would actually be to have a 
builder-based rename the way we do for openFile()/createFile(); atomic rename 
would be one of the options (along with etag/version id). 


{code}
// atomic rename, src etag.
CompletableFuture<RenameOutcome> r = filesystem.initiateRename(source, dest)
  .opt("fs.rename.src.etag", "abb2a")
  .must("fs.rename.atomic", true)
  .build();

RenameOutcome o = FutureIO.awaitFuture(r);
{code}

RenameOutcome would implement IOStatisticsSource; all failures raised as 
exceptions of some kind (awaitFuture() unwraps these)
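
Picking up the statistics would then just be the standard IOStatisticsSource 
pattern; a sketch against the hypothetical RenameOutcome above (the 
IOStatisticsLogging helper is real, LOG is assumed to be a class logger):

{code}
RenameOutcome o = FutureIO.awaitFuture(r);
// pretty-print whatever statistics the rename collected (duration, retries...)
LOG.info("rename: {}",
    IOStatisticsLogging.ioStatisticsToPrettyString(o.getIOStatistics()));
{code}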

{code}
// rename may/may not be atomic. if slower, provide some progress callbacks
CompletableFuture<RenameOutcome> r = filesystem.initiateRename(source, dest)
  // version id/etag can be picked up from the status; source path doesn't
  // need to match status.path, source is normative.
  .withStatus(sourceFileStatus)
  .withProgress(callback)
  .build();

// here we may have a slower rename which could be cancelled, or chained
// with other operations
r.cancel(true);
{code}

See? This is exactly what we would want for object storage. Options to specify 
constraints, a file status to skip a HEAD request, and asynchronous completion 
with intermediate progress callbacks.

But it gets really complicated really fast -and it will become a commitment to 
get right and maintain. I'm not saying that isn't the right thing to do -just 
that it was going to take too much time on something which wasn't actually 
going to work properly on GCS and S3 anyway. I had more important things to do.

One thing which is a lot more tractable is to define PathCapabilities probes 
for rename semantics, which filesystems can be queried for. 

fs.rename.file.atomic: file rename is atomic (hdfs, file, abfs, gcs)
fs.rename.directory.atomic: same for dirs; false for GCS
fs.rename.file.fast: O(1) performance independent of file size; false for AWS 
S3, true for most others
fs.rename.directory.fast: O(1) performance for dir rename, independent of 
atomicity; false for S3 and GCS
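
For illustration, callers would then check these through the existing 
FileSystem.hasPathCapability() probe; the call is real, but the capability 
name is just the proposal above, nothing is defined yet:

{code}
// sketch: refuse to commit via rename if the store can't do it atomically.
// "fs.rename.file.atomic" is the proposed probe name, not a shipping one.
FileSystem fs = dest.getFileSystem(conf);
if (!fs.hasPathCapability(dest, "fs.rename.file.atomic")) {
  throw new PathIOException(dest.toString(),
      "store does not offer atomic file rename");
}
{code}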

Doing that with some contract tests to probe the behaviour may be good. We'd 
declare true for all "real" filesystems; for the object stores we'd have to 
check and update the connectors, either internally or as PRs to their external 
repos (gcs).

Assuming you are trying to commit work through rename and want to know whether 
the semantics match your requirements, that should be enough. If you want to 
take that on we can help supervise. It is low on code; understanding what the 
stores do is the important thing.

There is another strategy, which is to use the Abortable interface which S3A 
output streams implement: it lets you write to the destination, but back off if 
you don't want to commit. Problem here: S3 doesn't have a no-overwrite flag the 
way some other stores do, so you still cannot use it for an atomic write.
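
A sketch of that pattern, assuming the destination is an S3A path whose output 
stream supports Abortable (the paths and the commit flag are illustrative):

{code}
// write the data, then either publish it (close) or discard it (abort);
// with S3A nothing is visible at dest until close() completes the upload.
FSDataOutputStream out = fs.create(dest, true);
try {
  out.write(payload);
  if (shouldCommit) {
    out.close();     // completes the multipart upload; object becomes visible
  } else {
    out.abort();     // Abortable: abandon the upload, nothing is written
  }
} catch (IOException e) {
  out.abort();
  throw e;
}
{code}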

Meanwhile, if you are worrying about object stores, how about you take a look 
at https://github.com/apache/hadoop/pull/6938 ? We have encountered this in the 
wild -it looks rare enough that the fact the AWS SDK can't recover has never 
been spotted by their team.

> Add Options.Rename.THROW_NON_ATOMIC
> -----------------------------------
>
>                 Key: HADOOP-19251
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19251
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 3.3.6
>            Reporter: Alkis Evlogimenos
>            Priority: Major
>
> I propose we add an option `Options.Rename.THROW_NON_ATOMIC` to change 
> `rename()` behavior to throw when the underlying filesystem's rename 
> operation is not atomic.
> This would be useful for callers that expect to perform an atomic op but want 
> to fail when an atomic rename is not possible.
>  
> At first this might seem like something that can be done by querying the 
> capabilities of the filesystem, but that would only work on real filesystems. A motivating 
> example would be a virtual filesystem for which paths can resolve to any 
> concrete filesystem (s3, etc). If `rename()` is called with two virtual paths 
> that resolve to different filesystems (s3 and gcs for example) then obviously 
> the operation can't be atomic since bytes must be copied from one fs to 
> another.
>  
> What do you think [~steve_l] ?


