[
https://issues.apache.org/jira/browse/HADOOP-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15802888#comment-15802888
]
Sanjay Radia commented on HADOOP-11452:
---------------------------------------
Steve suggested:
bq. note that we could consider adding a new enum operation
Rename.ATOMIC_REQUIRED which will fail if atomicity is not supported
We have considered such things (including this specific one) several times in
the past, in the context of S3 and also the local file system, and not just for
rename but for other methods too. Neither the local FS nor S3 has exactly the
same semantics as HDFS for each method. *Here is the main issue:* file systems
like LocalFileSystem are used for testing apps, and for a long time S3 was used
only for testing or for non-critical work in the cloud. Folks were willing to
live with the occasional inconsistency, or with the performance overhead of,
say, copy-delete for rename on S3. If applications like Hive or Spark used
Rename.ATOMIC_REQUIRED, then the app would just fail on S3, and those use
cases (testing, non-critical, or willing to live with the performance overhead)
would not be supported and their users would be unhappy.
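To make the failure mode concrete, here is a minimal sketch of how such a flag
would behave. It is a toy model, not Hadoop code: ToyFs, ToyHdfs, ToyS3 and the
RenameOption enum are hypothetical names invented for illustration, and the
only assumption carried over from the discussion is that the S3-like store
implements rename as copy-delete.

```java
import java.util.Set;

// Hypothetical options, loosely modelled on the flag proposed in this issue.
enum RenameOption { NONE, OVERWRITE, ATOMIC_REQUIRED }

// Toy stand-in for a file system; this is NOT org.apache.hadoop.fs.FileSystem.
abstract class ToyFs {
    abstract String name();
    abstract boolean renameIsAtomic();

    // Returns a description of what was done. Throws if the caller demanded
    // atomicity and this store cannot provide it.
    String rename(String src, String dst, Set<RenameOption> opts) {
        if (opts.contains(RenameOption.ATOMIC_REQUIRED) && !renameIsAtomic()) {
            throw new UnsupportedOperationException(
                name() + ": atomic rename is not supported");
        }
        String how = renameIsAtomic() ? "atomic rename" : "copy+delete";
        return how + " " + src + " -> " + dst;
    }
}

// Assumed to rename atomically, as the issue description says HDFS does.
class ToyHdfs extends ToyFs {
    String name() { return "hdfs"; }
    boolean renameIsAtomic() { return true; }
}

// Assumed to emulate rename with copy-delete, as described for S3 above.
class ToyS3 extends ToyFs {
    String name() { return "s3"; }
    boolean renameIsAtomic() { return false; }
}
```

With this model, an app that always passes ATOMIC_REQUIRED works on ToyHdfs and
throws on ToyS3, even for the testing and non-critical cases where copy-delete
would have been acceptable.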
Now that users want to run production apps on cloud storage like S3, apps like
Hive are being modified to run well on S3 by changing how they commit (say,
via the metastore or a manifest file instead of a rename).
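As a rough illustration of the manifest-file idea, the sketch below commits a
job by writing one manifest object instead of renaming a directory. Everything
in it is hypothetical: ToyObjectStore is an in-memory map, not an S3 client,
the "_manifest" name is made up, and the sketch assumes a single put() becomes
visible all at once. It is not how Hive actually implements its commit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy object store; a single put() is assumed to become visible atomically.
class ToyObjectStore {
    private final Map<String, String> objects = new ConcurrentHashMap<>();
    void put(String key, String body) { objects.put(key, body); }
    String get(String key) { return objects.get(key); }
}

class ManifestCommitter {
    static final String MANIFEST = "_manifest"; // hypothetical file name
    private final ToyObjectStore store;
    private final String tableDir;
    private final List<String> written = new ArrayList<>();

    ManifestCommitter(ToyObjectStore store, String tableDir) {
        this.store = store;
        this.tableDir = tableDir;
    }

    // Tasks write straight to the final location; readers ignore these files
    // until a manifest names them.
    void write(String fileName, String body) {
        store.put(tableDir + "/" + fileName, body);
        written.add(fileName);
    }

    // The commit is a single put of the manifest; no rename is involved.
    void commit() {
        store.put(tableDir + "/" + MANIFEST, String.join("\n", written));
    }

    // Readers trust only the files listed in the manifest.
    static List<String> visibleFiles(ToyObjectStore store, String tableDir) {
        List<String> files = new ArrayList<>();
        String manifest = store.get(tableDir + "/" + MANIFEST);
        if (manifest == null || manifest.isEmpty()) {
            return files; // nothing committed yet
        }
        for (String line : manifest.split("\n")) {
            files.add(line);
        }
        return files;
    }
}
```

The design choice here is that visibility depends on one small write rather
than on the semantics of rename, so the same commit logic would work on any
store that can publish a single object atomically.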
So adding the Rename.ATOMIC_REQUIRED flag is easy. But is it going to be
useful? Please articulate how it will be used. For example, if we were to
change Hive to use Rename.ATOMIC_REQUIRED, then Hive would just fail on S3.
So I think we should continue to make progress on getting Hive, Spark and
others to run first class on S3. I don't think Rename.ATOMIC_REQUIRED helps. I
believe it makes sense to have an FS.whatFeaturesDoYouSupport() API so that an
app like Hive could be implemented to run first class on HDFS, S3,
AzureBlobStorage etc. by querying the FS features and then using a different
implementation for, say, committing the output of a job. In some cases it may
be better to use a totally different approach that works on all FSs, such as a
manifest file, or to depend on the Hive Metastore to commit. (It turns out Hive
needs to be able to commit multiple tables, and hence even rename-dir is not
good enough.)
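A feature-query API of the kind suggested above might be used roughly as
follows. This is only a sketch: FeatureAwareFs, supportedFeatures(), FsFeature,
CommitStrategy and CommitPlanner are all hypothetical names standing in for
FS.whatFeaturesDoYouSupport(); none of them exist in Hadoop.

```java
import java.util.Set;

// Hypothetical feature flags a file system could advertise.
enum FsFeature { ATOMIC_FILE_RENAME, ATOMIC_DIR_RENAME }

// Hypothetical stand-in for FS.whatFeaturesDoYouSupport().
interface FeatureAwareFs {
    Set<FsFeature> supportedFeatures();
}

// Commit implementations the application can choose between.
enum CommitStrategy { RENAME_DIR, MANIFEST_FILE }

class CommitPlanner {
    // Use the rename-based commit only where the store says a directory
    // rename is atomic; otherwise fall back to a manifest-style commit
    // instead of failing.
    static CommitStrategy choose(FeatureAwareFs fs) {
        return fs.supportedFeatures().contains(FsFeature.ATOMIC_DIR_RENAME)
            ? CommitStrategy.RENAME_DIR
            : CommitStrategy.MANIFEST_FILE;
    }
}
```

Unlike a flag that makes rename fail, the query lets the app pick a working
path on every store: a store advertising ATOMIC_DIR_RENAME gets RENAME_DIR, and
one advertising nothing gets MANIFEST_FILE.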
> Revisit FileSystem.rename(path, path, options)
> ----------------------------------------------
>
> Key: HADOOP-11452
> URL: https://issues.apache.org/jira/browse/HADOOP-11452
> Project: Hadoop Common
> Issue Type: Task
> Components: fs
> Affects Versions: 2.7.3
> Reporter: Yi Liu
> Assignee: Steve Loughran
> Attachments: HADOOP-11452-001.patch, HADOOP-11452-002.patch
>
>
> Currently in {{FileSystem}}, {{rename}} with _Rename options_ is protected
> and annotated as _deprecated_, and the default implementation is not
> atomic.
> So this method cannot be used by outside callers. On the other hand, HDFS has
> a good, atomic implementation. (Also, interestingly, in {{DFSClient}} the
> _deprecated_ annotations for these two methods are the opposite way round.)
> It makes sense to make {{rename}} with _Rename options_ public, since
> it is atomic for rename+overwrite and also saves RPC calls if the user wants
> rename+overwrite.