[
https://issues.apache.org/jira/browse/HADOOP-17981?focusedWorklogId=676063&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-676063
]
ASF GitHub Bot logged work on HADOOP-17981:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 04/Nov/21 01:55
Start Date: 04/Nov/21 01:55
Worklog Time Spent: 10m
Work Description: sidseth commented on pull request #3597:
URL: https://github.com/apache/hadoop/pull/3597#issuecomment-958658416
> > This mechanism becomes very FileSystem specific. Implemented by Azure
right now.
>
> I agree, which is why use of the API is restricted to mr-client-core
only. abfs is the only one which needs it for correctness under load, and
I'm not worried about that specificity. Can I point to how much of the hadoop fs
API is hdfs-only, and yet public?
>
> > Other users of rename will not see the benefits without changing
interfaces, which in turn requires shimming etc.
>
> Please don't try and use this particular interface in Hive.
>
Was referring to any potential usage - including Hive.
> > Would it be better for AzureFileSystem rename itself to add a config
parameter which can lookup the src etag (at the cost of a performance hit for
consistency), so that downstream components / any users of the rename operation
can benefit from this change without having to change interfaces.
>
> We are going straight from a listing (1 request/500 entries) to that
rename. Doing a HEAD first cuts the throughput in half. So no.
>
Only in the scenario where this is encountered. It would not be the default
behaviour, and it limits the change to Abfs. There could also be a less consistent
version which is not etag based, and responds only on failures. Again, limited
to Abfs.
> > Also, if the performance penalty is a big problem - Abfs could create
very short-lived caches for FileStatus objects, and handle errors on
discrepancies with the cached copy.
>
> Possible but convoluted.
>
Agree. Quite convoluted. Tossing in potential options - to avoid a new
public API.
> > Essentially - don't force usage of the new interface to get the benefits.
>
> I understand the interests of the hive team, but this fix is not the place
to do a better API.
>
> Briefly caching the source FS entries is something to consider though.
Not this week.
>
> What I could do with is some help getting #2735 in, then we can start on a
public rename() builder API which will take a file status, as openFile does.
>
This particular change would be FSImpl agnostic, and potentially remove the
need for the new interface here?
> > Side note: The fs.getStatus within ResilientCommitByRenameHelper for
FileSystems where this new functionality is not supported will lead to a
performance penalty for the other FileSystems (performing a getFileStatus on
src).
>
> There is an option to say "I know it is not there"; this skips the check.
The committer passes this option down because it issues a delete call first.
>
EOD - this ends up being a new API (almost on the FileSystem), which is used
by the committer first; then someone discovers it and decides to make use of it.
> FWIW the manifest committer will make that pre-rename commit optional,
saving that IO request. I am curious as to how well that will work when
executed on well-formed tables.
Issue Time Tracking
-------------------
Worklog Id: (was: 676063)
Time Spent: 8h 10m (was: 8h)
> Support etag-assisted renames in FileOutputCommitter
> ----------------------------------------------------
>
> Key: HADOOP-17981
> URL: https://issues.apache.org/jira/browse/HADOOP-17981
> Project: Hadoop Common
> Issue Type: New Feature
> Components: fs, fs/azure
> Affects Versions: 3.4.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
> Labels: pull-request-available
> Time Spent: 8h 10m
> Remaining Estimate: 0h
>
> To deal with some throttling/retry issues in object stores,
> pass the FileStatus entries retrieved during listing
> into a private interface ResilientCommitByRename which filesystems
> may implement to use extra attributes in the listing (etag, version)
> to constrain and validate the operation.
> Although targeting Azure, GCS and others could use it. There is no point in
> S3A, as S3A users shouldn't use this committer.
> # We are not going to make any changes to FileSystem, as there are explicit
> guarantees of public use and stability.
> I am not going to make that change, as Hive would suddenly start
> expecting it to work forever.
> # I'm not planning to merge this in, as the manifest committer is going to
> include this and more (MAPREDUCE-7341)
> However, I do need to get this in on a branch, so am doing this work on trunk
> for dev & test and for others to review
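
A minimal sketch of the idea in the description above: the committer hands the
etag it already obtained from the listing to the rename, and the store is only
probed again if the rename fails. Everything here is hypothetical and for
illustration only. `EtagRenameSketch`, `Entry`, `ToyStore` and `commitByRename`
are not Hadoop or ABFS APIs, the store is an in-memory map, and the sketch
assumes that a rename preserves the object's etag.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names are Hadoop or ABFS APIs. */
class EtagRenameSketch {

    /** Stand-in for a listing entry that carries an etag (assumed shape). */
    static final class Entry {
        final String path;
        final String etag;

        Entry(String path, String etag) {
            this.path = path;
            this.etag = etag;
        }
    }

    /** Toy in-memory store mapping path to etag; rename keeps the etag. */
    static final class ToyStore {
        final Map<String, String> objects = new HashMap<>();
        // When true, the next rename is applied and then reported as failed,
        // imitating a response lost to throttling or a timeout.
        boolean failNextRenameAfterApplying = false;

        void put(String path, String etag) {
            objects.put(path, etag);
        }

        void rename(String src, String dest) {
            String etag = objects.remove(src);
            if (etag == null) {
                throw new IllegalStateException("source not found: " + src);
            }
            objects.put(dest, etag);
            if (failNextRenameAfterApplying) {
                failNextRenameAfterApplying = false;
                throw new IllegalStateException("simulated failure after rename was applied");
            }
        }

        String etagOf(String path) {
            return objects.get(path);
        }
    }

    /**
     * Renames source to dest. No extra request is made on the success path;
     * only after a failure is the destination probed, and the commit counts
     * as done if the destination carries the etag seen in the listing.
     */
    static boolean commitByRename(ToyStore store, Entry source, String dest) {
        try {
            store.rename(source.path, dest);
            return true;
        } catch (IllegalStateException e) {
            String destEtag = store.etagOf(dest);
            return source.etag != null && source.etag.equals(destEtag);
        }
    }

    public static void main(String[] args) {
        ToyStore store = new ToyStore();
        store.put("/job/_temporary/part-0000", "etag-1");
        store.failNextRenameAfterApplying = true;
        Entry listed = new Entry("/job/_temporary/part-0000", "etag-1");
        System.out.println(commitByRename(store, listed, "/job/part-0000"));
    }
}
```

The point of the design, as argued in the review thread, is that the extra
request is paid only on the failure path, rather than issuing a HEAD before
every rename.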