[ 
https://issues.apache.org/jira/browse/HADOOP-17981?focusedWorklogId=676063&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-676063
 ]

ASF GitHub Bot logged work on HADOOP-17981:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 04/Nov/21 01:55
            Start Date: 04/Nov/21 01:55
    Worklog Time Spent: 10m 
      Work Description: sidseth commented on pull request #3597:
URL: https://github.com/apache/hadoop/pull/3597#issuecomment-958658416


   > > This mechanism becomes very FileSystem specific. Implemented by Azure
   > > right now.
   > 
   > I agree, which is why use of the API is restricted to mr-client-core
   > only, as abfs is the only one which needs it for correctness under load.
   > And I'm not worried about that specificity. Can I point to how much of the
   > hadoop fs API is hdfs-only, and yet public.
   > 
   > > Other users of rename will not see the benefits without changing
   > > interfaces, which in turn requires shimming etc.
   > 
   > Please don't try and use this particular interface in Hive.
   > 
   I was referring to any potential usage, including Hive.
   > > Would it be better for AzureFileSystem rename itself to add a config
   > > parameter which can look up the src etag (at the cost of a performance
   > > hit for consistency), so that downstream components / any users of the
   > > rename operation can benefit from this change without having to change
   > > interfaces?
   > 
   > We are going straight from a listing (1 request/500 entries) to that
   > rename. Doing a HEAD first cuts the throughput in half. So no.
   > 
   Only in the scenario where this is encountered. It would not be the default
   behaviour, and it limits the change to Abfs. Abfs could also have the less
   consistent version, which is not etag based and responds only on failures.
   Again, limited to Abfs.
   > > Also, if the performance penalty is a big problem - Abfs could create
   > > very short-lived caches for FileStatus objects, and handle errors on
   > > discrepancies with the cached copy.
   > 
   > Possible but convoluted.
   > 
   Agree. Quite convoluted. Tossing in potential options - to avoid a new
   public API.
   > > Essentially - don't force usage of the new interface to get the benefits.
   > 
   > I understand the interests of the hive team, but this fix is not the place
   > to do a better API.
   > 
   > Briefly caching the source FS entries is something to consider though.
   > Not this week.
   > 
   > What I could do with is some help getting #2735 in, then we can start on a
   > public rename() builder API which will take a file status, as openFile does.
   > 
   This particular change would be FSImpl agnostic, and potentially remove the
   need for the new interface here?
   > > Side note: The fs.getStatus within ResilientCommitByRenameHelper for
   > > FileSystems where this new functionality is not supported will lead to a
   > > performance penalty for the other FileSystems (performing a getFileStatus
   > > on src).
   > 
   > There is an option to say "I know it is not there"; this skips the check.
   > The committer passes this option down because it issues a delete call first.
   > 
   EOD - this ends up being a new API (almost on the FileSystem), which is used
   by the committer first; then someone discovers it and decides to make use of it.
   > FWIW the manifest committer will make that pre-rename commit optional,
   > saving that IO request. I am curious as to how well that will work when
   > executed on well formed tables.
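
The "very short-lived caches for FileStatus objects" option raised in the comment above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the class name ShortLivedStatusCache, the use of a bare etag string in place of a FileStatus, and the injected clock are hypothetical and are not part of ABFS or the pull request.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical short-lived cache of listing results; not an ABFS class. */
class ShortLivedStatusCache {

  /** Cached etag plus the time at which it stops being trusted. */
  private static final class Timed {
    final String etag;
    final long expiresAtMillis;

    Timed(String etag, long expiresAtMillis) {
      this.etag = etag;
      this.expiresAtMillis = expiresAtMillis;
    }
  }

  private final Map<String, Timed> entries = new HashMap<>();
  private final long ttlMillis;
  private final LongSupplier clock;

  /** The clock is injected so that expiry can be exercised without sleeping. */
  ShortLivedStatusCache(long ttlMillis, LongSupplier clock) {
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  /** Record the etag seen for a path, for example while listing a directory. */
  void remember(String path, String etag) {
    entries.put(path, new Timed(etag, clock.getAsLong() + ttlMillis));
  }

  /** Return the cached etag, or null if nothing is cached or the entry expired. */
  String lookup(String path) {
    Timed cached = entries.get(path);
    if (cached == null) {
      return null;
    }
    if (clock.getAsLong() >= cached.expiresAtMillis) {
      entries.remove(path); // stale entries are dropped on access
      return null;
    }
    return cached.etag;
  }
}
```

A rename inside the store could consult lookup() for the source path and fall back to a HEAD request only on a miss, which is where the "convoluted" part of the trade-off comes in: the cache has to be kept consistent with writes and deletes.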
   
   




Issue Time Tracking
-------------------

    Worklog Id:     (was: 676063)
    Time Spent: 8h 10m  (was: 8h)

> Support etag-assisted renames in FileOutputCommitter
> ----------------------------------------------------
>
>                 Key: HADOOP-17981
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17981
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure
>    Affects Versions: 3.4.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> To deal with some throttling/retry issues in object stores,
> pass the FileStatus entries retrieved during listing
> into a private interface ResilientCommitByRename which filesystems
> may implement to use extra attributes in the listing (etag, version)
> to constrain and validate the operation.
> Although this targets Azure, GCS and others could use it. There is no point
> in S3A, as S3A users shouldn't use this committer.
> # And we are not going to do any changes to FileSystem as there are explicit 
> guarantees of public use and stability.
> I am not going to make that change, as Hive would suddenly start
> expecting it to work forever.
> # I'm not planning to merge this in, as the manifest committer is going to 
> include this and more (MAPREDUCE-7341)
> However, I do need to get this in on a branch, so am doing this work on trunk 
> for dev & test and for others to review
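
As a rough illustration of the idea in the description, the sketch below passes the etag obtained from a listing into the rename, then uses it to decide whether a failed rename had in fact completed. It is a toy in-memory model under stated assumptions: EtagRenameSketch, Entry, RenameFailed and the loseNextResponse switch are all hypothetical names, and none of this is the ResilientCommitByRename interface or the ABFS client.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory store used to illustrate etag-assisted rename recovery. */
class EtagRenameSketch {

  /** Listing entry carrying the etag; stands in for a FileStatus from a listing. */
  static final class Entry {
    final String path;
    final String etag;

    Entry(String path, String etag) {
      this.path = path;
      this.etag = etag;
    }
  }

  /** Hypothetical failure type for this sketch. */
  static final class RenameFailed extends RuntimeException {
    RenameFailed(String message) {
      super(message);
    }
  }

  /** path -> etag; in this model a rename moves the etag with the object. */
  private final Map<String, String> objects = new HashMap<>();

  /** When true, the next rename is applied but reported to the caller as failed. */
  boolean loseNextResponse;

  void put(String path, String etag) {
    objects.put(path, etag);
  }

  String etagOf(String path) {
    return objects.get(path);
  }

  /** Raw rename: may complete on the store and still throw, as with a lost response. */
  void rename(String src, String dest) {
    String etag = objects.remove(src);
    if (etag == null) {
      throw new RenameFailed("source not found: " + src);
    }
    objects.put(dest, etag);
    if (loseNextResponse) {
      loseNextResponse = false;
      throw new RenameFailed("simulated timeout after renaming " + src);
    }
  }

  /**
   * Rename using the etag from the listing to resolve an ambiguous failure.
   * Returns true if recovery was needed, false if the rename succeeded cleanly.
   */
  boolean commitByRename(Entry source, String dest) {
    try {
      rename(source.path, dest);
      return false;
    } catch (RenameFailed e) {
      boolean sourceGone = etagOf(source.path) == null;
      boolean destMatches = source.etag != null && source.etag.equals(etagOf(dest));
      if (sourceGone && destMatches) {
        return true; // the failed attempt had already moved the object
      }
      throw e; // genuinely failed: surface the original error
    }
  }
}
```

The point of taking the Entry from the listing is that no extra HEAD request on the source is needed before the rename; the only additional check happens on the failure path.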



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
