[
https://issues.apache.org/jira/browse/HADOOP-17981?focusedWorklogId=673066&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-673066
]
ASF GitHub Bot logged work on HADOOP-17981:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 02/Nov/21 10:28
Start Date: 02/Nov/21 10:28
Worklog Time Spent: 10m
Work Description: steveloughran commented on pull request #3597:
URL: https://github.com/apache/hadoop/pull/3597#issuecomment-957310383
> This mechanism becomes very FileSystem specific. Implemented by Azure
> right now.
I agree, which is why use of the API is restricted to mr-client-core
only.
ABFS is the only store which needs it for correctness under load,
and I'm not worried about that specificity.
Can I point out how much of the Hadoop FS API is HDFS-only, and those parts
are public?
> Other users of rename will not see the benefits without changing
> interfaces, which in turn requires shimming etc.
Please don't try and use this particular interface in Hive.
> Would it be better for AzureFileSystem rename itself to add a config
> parameter which can lookup the src etag (at the cost of a performance hit for
> consistency), so that downstream components / any users of the rename operation
> can benefit from this change without having to change interfaces.
We are going straight from a listing (1 request/500 entries) to that rename.
Doing a HEAD first cuts the throughput in half, so no.
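To put rough numbers on that, here is a back-of-envelope sketch in Java. It assumes throughput is bounded by the number of store requests (as it is under throttling) and that every request costs the same; the class and method names are hypothetical and not part of Hadoop.

```java
/** Hypothetical request-count model; an illustration only, not Hadoop code. */
final class RenameRequestCount {

    /** List requests needed for the given number of files, one page per request. */
    static long listRequests(long files, long pageSize) {
        return (files + pageSize - 1) / pageSize; // ceiling division
    }

    /** Total requests: paged listing plus one rename (and optionally one HEAD) per file. */
    static long totalRequests(long files, long pageSize, boolean headBeforeRename) {
        long perFile = headBeforeRename ? 2 : 1; // optional HEAD + rename
        return listRequests(files, pageSize) + files * perFile;
    }

    public static void main(String[] args) {
        long files = 10_000;
        System.out.printf("listing then rename:     %d requests%n",
                totalRequests(files, 500, false));
        System.out.printf("HEAD before each rename: %d requests%n",
                totalRequests(files, 500, true));
    }
}
```

For 10,000 files at 500 entries per list request, that is 10,020 requests going straight from listing to rename, against 20,020 with a HEAD in front of each rename. The listing cost is negligible, so the per-file probe roughly doubles the request count.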
> Also, if the performance penalty is a big problem - Abfs could create very
> short-lived caches for FileStatus objects, and handle errors on discrepancies
> with the cached copy.
Possible but convoluted.
> Essentially - don't force usage of the new interface to get the benefits.
I understand the interests of the Hive team, but this fix is not the place
to do a better API.
Briefly caching the source FS entries is something to consider though. Not
this week.
What I could do with is some help getting #2735 in, then we can start on a
public rename() builder API which will take a file status, as openFile does.
> Side note: The fs.getStatus within ResilientCommitByRenameHelper for
> FileSystems where this new functionality is not supported will lead to a
> performance penalty for the other FileSystems (performing a getFileStatus on
> src).
There is an option to say "I know it is not there"; this skips the check.
The committer passes this option down because it issues a delete call first.
FWIW the manifest committer will make that pre-rename commit optional,
saving that IO request. I am curious as to how well that will work when
executed on well-formed tables.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
Issue Time Tracking
-------------------
Worklog Id: (was: 673066)
Time Spent: 2h 20m (was: 2h 10m)
> Support etag-assisted renames in FileOutputCommitter
> ----------------------------------------------------
>
> Key: HADOOP-17981
> URL: https://issues.apache.org/jira/browse/HADOOP-17981
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs, fs/azure
> Affects Versions: 3.4.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
> Labels: pull-request-available
> Time Spent: 2h 20m
> Remaining Estimate: 0h
>
> To deal with some throttling/retry issues in object stores,
> pass the FileStatus entries retrieved during listing
> into a private interface ResilientCommitByRename which filesystems
> may implement to use extra attributes in the listing (etag, version)
> to constrain and validate the operation.
> Although targeting Azure, GCS and others could use it. No point in S3A as they
> shouldn't use this committer.
> # And we are not going to do any changes to FileSystem as there are explicit
> guarantees of public use and stability.
> I am not going to make that change, as Hive will then suddenly start
> expecting it to work forever.
> # I'm not planning to merge this in, as the manifest committer is going to
> include this and more (MAPREDUCE-7341).
> However, I do need to get this in on a branch, so am doing this work on trunk
> for dev & test and for others to review.
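
The idea in the description, passing the listing entry into the rename so its etag can validate the outcome, can be sketched with a toy in-memory store. This is a minimal illustration under stated assumptions, not the interface in the pull request: `ListedEntry`, `EtagRenameSketch` and `commitByRename` are hypothetical names, and the recovery rule (treat a missing source as success when the destination carries the listed etag) is an assumption about how a store might use the etag after a retried request. The map lookups stand in for what the store itself would report; no extra probe request is implied.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a listing entry; real code would carry a FileStatus. */
final class ListedEntry {
    final String path;
    final String etag;

    ListedEntry(String path, String etag) {
        this.path = path;
        this.etag = etag;
    }
}

/** Toy in-memory "store" mapping path to etag; not the ABFS implementation. */
final class EtagRenameSketch {
    private final Map<String, String> etags = new HashMap<>();

    void put(String path, String etag) { etags.put(path, etag); }

    String etagOf(String path) { return etags.get(path); }

    /**
     * Rename source to dest, using the etag from the listing to validate.
     * Throws IllegalStateException when the outcome cannot be confirmed.
     */
    void commitByRename(ListedEntry source, String dest) {
        String current = etags.get(source.path);
        if (current == null) {
            // Source missing: assume an earlier attempt may have succeeded
            // but its response was lost. The listed etag lets us check.
            if (source.etag != null && source.etag.equals(etags.get(dest))) {
                return; // destination already holds the listed object
            }
            throw new IllegalStateException("source not found: " + source.path);
        }
        if (!current.equals(source.etag)) {
            throw new IllegalStateException("etag changed since listing: " + source.path);
        }
        etags.remove(source.path);
        etags.put(dest, current);
    }

    public static void main(String[] args) {
        EtagRenameSketch store = new EtagRenameSketch();
        store.put("/job/_temporary/part-0000", "etag-1");
        ListedEntry listed = new ListedEntry("/job/_temporary/part-0000", "etag-1");
        store.commitByRename(listed, "/job/part-0000");
        store.commitByRename(listed, "/job/part-0000"); // simulated retry: no error
        System.out.println(store.etagOf("/job/part-0000"));
    }
}
```

Because the etag arrives with the listing, the caller needs no extra getFileStatus call on the source before the rename, and a retried rename whose first attempt actually succeeded is not reported as a failure.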
--
This message was sent by Atlassian Jira
(v8.3.4#803005)