[
https://issues.apache.org/jira/browse/HADOOP-17682?focusedWorklogId=642890&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-642890
]
ASF GitHub Bot logged work on HADOOP-17682:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 27/Aug/21 15:57
Start Date: 27/Aug/21 15:57
Worklog Time Spent: 10m
Work Description: steveloughran commented on pull request #2975:
URL: https://github.com/apache/hadoop/pull/2975#issuecomment-907304674
> eTag is a parameter that is sent with every read request to the server today. The server ensures that the file is indeed of the expected eTag before proceeding with the read operation. If the file was recreated between the time the inputStream was created and the read was issued, the eTag check will fail and so will the read; otherwise the read would end up wrongly getting data from the new file. This is the reason we are mandating that the eTag be present and, in the else case, falling back to triggering a GetFileStatus. We will be changing the log type to debug in the else case, as per your comment.
I do know about etags. The S3A connector picks it up on the HEAD or first
GET and caches it, to at least ensure that things are unchanged for the
duration of the stream.
> Noticed an example where the S3 eTag is passed as an optional key (referring to link), for the Hive case where FileStatus is wrapped; is this how the eTag gets passed? And I was also wondering how we can get various Hadoop workloads to send it without being aware of which is the currently active FileSystem, considering the key name is specific to it.
Not in use at all by Hive, even in PoCs. I'm not worrying about changes between the list and the open, just about consistency once a file is opened. If a file was changed between Hive query planning and work execution then, as it is invariably of a different length, the workers will notice soon enough. Anyway, HDFS doesn't worry about this, so why should the other stores?
When you skip the HEAD, the first GET will return the etag. All you have to
do is skip setting the If-Match header on the first request, pick the etag from
the response and use it after.
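To make that concrete, here is a minimal sketch of that lazy-etag pattern at the HTTP level, using plain `java.net.http` rather than the real ABFS client classes (which this is not): the first ranged GET goes out without `If-Match`, the `ETag` from its response is cached, and every later read pins that value so an overwritten file fails with 412 instead of silently serving bytes from the new object.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Illustrative only: lazy etag pickup for a read stream.
 * The first GET is sent without If-Match; the etag returned by the
 * server is cached and pinned on every subsequent ranged read, so a
 * file recreated mid-stream fails with 412 Precondition Failed rather
 * than returning bytes from the new object.
 */
class LazyEtagReader {
  private final HttpClient client = HttpClient.newHttpClient();
  private final URI blobUri;
  private String etag;            // unknown until the first response arrives

  LazyEtagReader(URI blobUri) {
    this.blobUri = blobUri;
  }

  byte[] readRange(long start, long end) throws IOException, InterruptedException {
    HttpRequest.Builder builder = HttpRequest.newBuilder(blobUri)
        .header("Range", "bytes=" + start + "-" + end)
        .GET();
    if (etag != null) {
      // every read after the first must match the version we started with
      builder.header("If-Match", etag);
    }
    HttpResponse<byte[]> response =
        client.send(builder.build(), HttpResponse.BodyHandlers.ofByteArray());
    if (response.statusCode() == 412) {
      throw new IOException("File changed while being read: etag mismatch");
    }
    if (etag == null) {
      // first request: remember the version the server actually served
      etag = response.headers().firstValue("ETag").orElse(null);
    }
    return response.body();
  }
}
```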
> I did have a look at #2584 which provides the means to pass mandatory and
optional keys through OpenFileParameters.
This was *already* in the openFile() API call from the outset. Indeed, abfs
already supports one, `fs.azure.buffered.pread.disable`.
What the new PR adds is
* standard options for read policies, split start/end, all of which can be
used to optimize reads.
* tuning of the `withFileStatus()` based on initial use in format libraries,
attempts to use in hive, etc.
* tests for this stuff
No API change, other than `withFileStatus(null)` required to be a no-op, and
requirements in the API spec relaxed to say "only check path.getName()" for a
match.
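For illustration, here is a sketch of the caller side once those options are in play. `openFile()`, `withFileStatus()` and `opt()` are the existing builder API; the `fs.option.openfile.read.policy` key is one of the standard option names added by #2584 and `fs.azure.buffered.pread.disable` is the abfs option mentioned above, so treat the key strings as examples rather than a definitive list.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of a caller handing a listing's FileStatus back to openFile()
// so the connector can skip its own HEAD/GetFileStatus, plus one
// connector-specific optional key.
public class OpenWithStatusExample {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);
    FileSystem fs = path.getFileSystem(new Configuration());

    // usually this comes straight from a directory listing
    FileStatus status = fs.getFileStatus(path);

    FSDataInputStream in = fs.openFile(path)
        .withFileStatus(status)                              // no second HEAD needed
        .opt("fs.option.openfile.read.policy", "sequential") // standard key from #2584
        .opt("fs.azure.buffered.pread.disable", "true")      // abfs-specific option
        .build()
        .get();                                              // future completes with the stream
    try {
      byte[] buffer = new byte[8192];
      in.read(buffer);    // read as normal
    } finally {
      in.close();
    }
  }
}
```

Because these are set with `opt()` rather than `must()`, a store that doesn't recognise a key simply ignores it.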
> Also wanted to check if it would be right to add eTag as a field in the base FileStatus class. Probably as a sub-structure within FileStatus, so that it can be expanded in future to hold properties that cloud storage systems commonly depend on.
Way too late for that: FileStatus is used everywhere, gets serialized, etc. What you could do is add some interface `EtagSource` with a `getEtag()` method, and implement that interface for the ABFS and S3A connectors. It's still going to be lost when things like Hive replace the custom FileStatus with the standard one.
Where this could be useful is in distcp applications, where the etag of an uploaded file could be cached and compared when rescanning a directory. If a store implements the getFileChecksum() method you can serve the etag that way, but (a) distcp can't handle it and (b) if you could get it on every file from a list() call, you save a lot of IO.
So please, have someone implement this in hadoop-common, with specification,
tests etc.
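A rough sketch of what that could look like; the names follow the suggestion above and are not an existing hadoop-common API.

```java
// Sketch of the proposed hadoop-common addition: a small interface that
// store-specific FileStatus subclasses can implement, so callers can pull
// an etag out of a listing without knowing which connector produced it.
public interface EtagSource {
  /** @return the etag of the file/object, or null if none is known. */
  String getEtag();
}

// Example of a connector-side FileStatus carrying the etag.
class StoreFileStatus extends org.apache.hadoop.fs.FileStatus implements EtagSource {
  private final String etag;

  StoreFileStatus(org.apache.hadoop.fs.FileStatus source, String etag)
      throws java.io.IOException {
    super(source);         // copy path, length, times, permissions, ...
    this.etag = etag;
  }

  @Override
  public String getEtag() {
    return etag;
  }
}

// Caller side: probe for the interface rather than for a specific connector.
class EtagProbe {
  static String etagOf(org.apache.hadoop.fs.FileStatus status) {
    return (status instanceof EtagSource)
        ? ((EtagSource) status).getEtag()
        : null;                               // store doesn't expose one
  }
}
```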
> The OpenFile() change is one of the key performance improvements that we are looking at adopting. The aspect of the FileStatus being passed down in itself reduces the HEAD request count by half, and we look forward to adopting the new set of read policies too. We will work on understanding how to map the various read policies to the current optimizations that the driver has for different read patterns. I think that would translate to a change equivalent to PrepareToOpenFile for the ABFS driver.
I've just tried to let apps declare how they want to read things; there's generally a straightforward map to sequential vs random vs adaptive... but I hope that the ORC and Parquet options would provide even more opportunity to tweak behaviour, e.g. by knowing that there will be an initial footer read sequence, then stripes will be read.
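As a sketch of that mapping from the application side (the policy names beyond sequential/random/adaptive, and the comma-separated fallback list, are part of the #2584 proposal, so purely illustrative here):

```java
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of how a reader could declare its access pattern up front.
// Stores that don't recognise a policy are expected to fall back to the
// next one in the list that they do understand.
class ReadPolicyExamples {

  // A log processor scanning the whole file from start to finish.
  static FSDataInputStream openForScan(FileSystem fs, Path path) throws Exception {
    return fs.openFile(path)
        .opt("fs.option.openfile.read.policy", "sequential")
        .build().get();
  }

  // A columnar reader: footer first, then selected stripes/row groups,
  // so a format-specific or random policy is the better hint.
  static FSDataInputStream openForColumnar(FileSystem fs, Path path) throws Exception {
    return fs.openFile(path)
        .opt("fs.option.openfile.read.policy", "parquet, random, adaptive")
        .build().get();
  }
}
```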
> Would it be ok if we make this change once #2584 checks in ?
yes, but I do want to at least have the azure impl not reimplement the same
brittleness around `withFileStatus()` as s3a before it goes into 3.3. That way,
consistency.
> Currently we are on a bit of a tight schedule and short-staffed, as we aim to complete the feature work tracked in HADOOP-17853 and another customer requirement that is in the feasibility analysis stage.
I really want to get a Hadoop 3.3.2 out the door with this API in it. It's going to happen before '17853, as that is a big piece of work which is going to need lots of review time by others, myself included.
Issue Time Tracking
-------------------
Worklog Id: (was: 642890)
Time Spent: 7.5h (was: 7h 20m)
> ABFS: Support FileStatus input to OpenFileWithOptions() via OpenFileParameters
> ------------------------------------------------------------------------------
>
> Key: HADOOP-17682
> URL: https://issues.apache.org/jira/browse/HADOOP-17682
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/azure
> Reporter: Sumangala Patki
> Assignee: Sumangala Patki
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 7.5h
> Remaining Estimate: 0h
>
> ABFS open methods require certain information (contentLength, eTag, etc.) to
> create an InputStream for the file at the given path. This information is
> retrieved via a GetFileStatus request to the backend.
> However, client applications may often have access to the FileStatus prior to
> invoking the open API. Providing this FileStatus to the driver through the
> OpenFileParameters argument of openFileWithOptions() can help avoid the call
> to the store for the FileStatus.
> This PR adds handling for the FileStatus instance (if any) provided via the
> OpenFileParameters argument.
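For context on what the driver does with that argument, here is a simplified sketch (not the actual ABFS implementation) of the decision the issue describes: trust a supplied FileStatus when its path name matches, otherwise fall back to a GetFileStatus call. The `StatusFetcher` hook is hypothetical, standing in for the connector's own store call.

```java
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.impl.OpenFileParameters;

// Simplified sketch only; a real openFileWithOptions() implementation does
// more (option parsing, extracting the etag from its own FileStatus type).
class OpenFileStatusHandling {

  static FileStatus resolveStatus(Path path, OpenFileParameters parameters,
      StatusFetcher fetcher) throws java.io.IOException {
    FileStatus supplied = parameters.getStatus();
    if (supplied != null
        && supplied.getPath().getName().equals(path.getName())) {
      // caller already has the listing entry: skip the GetFileStatus round trip
      return supplied;
    }
    // nothing usable supplied: one GetFileStatus against the store
    return fetcher.fetch(path);
  }

  /** Hypothetical hook standing in for the connector's GetFileStatus call. */
  interface StatusFetcher {
    FileStatus fetch(Path path) throws java.io.IOException;
  }
}
```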