[
https://issues.apache.org/jira/browse/HADOOP-17682?focusedWorklogId=642890&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-642890
]
ASF GitHub Bot logged work on HADOOP-17682:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 27/Aug/21 15:57
Start Date: 27/Aug/21 15:57
Worklog Time Spent: 10m
Work Description: steveloughran commented on pull request #2975:
URL: https://github.com/apache/hadoop/pull/2975#issuecomment-907304674
> eTag is a parameter that is sent with every read request to the server today. The server ensures that the file is indeed of the expected eTag before proceeding with the read operation. If the file was recreated between the time the inputStream was created and the read was issued, the eTag check will fail and so will the read; otherwise the read would end up wrongly getting data from the new file. This is the reason we are mandating that the eTag be present and, in the else case, falling back to triggering a GetFileStatus. We will be changing the log type to debug in the else case, as per your comment.
I do know about etags. The S3A connector picks it up on the HEAD or first
GET and caches it, to at least ensure that things are unchanged for the
duration of the stream.
> Noticed an example where the S3 eTag is passed as an optional key (referring to link), for the Hive case where FileStatus is wrapped; is this how the eTag gets passed? And I was also wondering how we can get various Hadoop workloads to send it without being aware of which is the currently active FileSystem, considering the key name is specific to it.
Not in use at all by Hive, even in PoCs. I'm not worrying about changes between the list and the open, just about consistency once a file is opened. If a file was changed between Hive query planning and work execution then, as it is invariably of a different length, the workers will notice soon enough. Anyway, HDFS doesn't worry about this, so why should the other stores?
When you skip the HEAD, the first GET will return the etag. All you have to
do is skip setting the If-Match header on the first request, pick the etag from
the response and use it after.
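To make that concrete, here is a minimal sketch of that lazy-etag pattern at the HTTP level, using plain `java.net.http` rather than the real ABFS client classes (which this is not): the first ranged GET goes out without `If-Match`, the `ETag` from its response is cached, and every later read pins that value so an overwritten file fails with 412 instead of silently serving bytes from the new object.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/**
 * Illustrative only: lazy etag pickup for a read stream.
 * The first GET is sent without If-Match; the etag returned by the
 * server is cached and pinned on every subsequent ranged read, so a
 * file recreated mid-stream fails with 412 Precondition Failed rather
 * than returning bytes from the new object.
 */
class LazyEtagReader {
  private final HttpClient client = HttpClient.newHttpClient();
  private final URI blobUri;
  private String etag;            // unknown until the first response arrives

  LazyEtagReader(URI blobUri) {
    this.blobUri = blobUri;
  }

  byte[] readRange(long start, long end) throws IOException, InterruptedException {
    HttpRequest.Builder builder = HttpRequest.newBuilder(blobUri)
        .header("Range", "bytes=" + start + "-" + end)
        .GET();
    if (etag != null) {
      // every read after the first must match the version we started with
      builder.header("If-Match", etag);
    }
    HttpResponse<byte[]> response =
        client.send(builder.build(), HttpResponse.BodyHandlers.ofByteArray());
    if (response.statusCode() == 412) {
      throw new IOException("File changed while being read: etag mismatch");
    }
    if (etag == null) {
      // first request: remember the version the server actually served
      etag = response.headers().firstValue("ETag").orElse(null);
    }
    return response.body();
  }
}
```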
> I did have a look at #2584 which provides the means to pass mandatory and
optional keys through OpenFileParameters.
This was *already* in the openFile() API call from the outset. Indeed, abfs
already supports one, `fs.azure.buffered.pread.disable`.
What the new PR adds is
* standard options for read policies, split start/end, all of which can be
used to optimize reads.
* tuning of the `withFileStatus()` based on initial use in format libraries,
attempts to use in hive, etc.
* tests for this stuff
No API change, other than `withFileStatus(null)` required to be a no-op, and
requirements in the API spec relaxed to say "only check path.getName()" for a
match.
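For illustration, here is a sketch of the caller side once those options are in play. `openFile()`, `withFileStatus()` and `opt()` are the existing builder API; the `fs.option.openfile.read.policy` key is one of the standard option names added by #2584 and `fs.azure.buffered.pread.disable` is the abfs option mentioned above, so treat the key strings as examples rather than a definitive list.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of a caller handing a listing's FileStatus back to openFile()
// so the connector can skip its own HEAD/GetFileStatus, plus one
// connector-specific optional key.
public class OpenWithStatusExample {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);
    FileSystem fs = path.getFileSystem(new Configuration());

    // usually this comes straight from a directory listing
    FileStatus status = fs.getFileStatus(path);

    FSDataInputStream in = fs.openFile(path)
        .withFileStatus(status)                              // no second HEAD needed
        .opt("fs.option.openfile.read.policy", "sequential") // standard key from #2584
        .opt("fs.azure.buffered.pread.disable", "true")      // abfs-specific option
        .build()
        .get();                                              // future completes with the stream
    try {
      byte[] buffer = new byte[8192];
      in.read(buffer);    // read as normal
    } finally {
      in.close();
    }
  }
}
```

Because these are set with `opt()` rather than `must()`, a store that doesn't recognise a key simply ignores it.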
> Also wanted to check if it would be right to add eTag as a field in the base FileStatus class. Probably as a sub-structure within FileStatus, so that it can be expanded in future to hold properties that cloud storage systems commonly depend on.
Way too late for that: FileStatus is used everywhere, gets serialized, etc. What you could do is add some interface `EtagSource` with a `getEtag()` method, and implement that interface for the ABFS and S3A connectors. It's still going to be lost when things like Hive replace the custom FileStatus with the standard one.
Where this could be useful is in distcp applications, where the etag of an uploaded file could be cached and compared when rescanning a directory. If a store implements the getFileChecksum() method you can serve the etag that way, but (a) distcp can't handle it and (b) if you could get it on every file from a list() call, you save a lot of IO.
So please, have someone implement this in hadoop-common, with specification,
tests etc.
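A rough sketch of what that could look like; the names follow the suggestion above and are not an existing hadoop-common API.

```java
// Sketch of the proposed hadoop-common addition: a small interface that
// store-specific FileStatus subclasses can implement, so callers can pull
// an etag out of a listing without knowing which connector produced it.
public interface EtagSource {
  /** @return the etag of the file/object, or null if none is known. */
  String getEtag();
}

// Example of a connector-side FileStatus carrying the etag.
class StoreFileStatus extends org.apache.hadoop.fs.FileStatus implements EtagSource {
  private final String etag;

  StoreFileStatus(org.apache.hadoop.fs.FileStatus source, String etag)
      throws java.io.IOException {
    super(source);         // copy path, length, times, permissions, ...
    this.etag = etag;
  }

  @Override
  public String getEtag() {
    return etag;
  }
}

// Caller side: probe for the interface rather than for a specific connector.
class EtagProbe {
  static String etagOf(org.apache.hadoop.fs.FileStatus status) {
    return (status instanceof EtagSource)
        ? ((EtagSource) status).getEtag()
        : null;                               // store doesn't expose one
  }
}
```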
> The OpenFile() change is one of the key performance improvements that we are looking at adopting. The aspect of the FileStatus being passed down in itself reduces the HEAD request count by half, and we look forward to adopting the new set of read policies too. We will work on understanding how to map the various read policies to the current optimizations that the driver has for different read patterns. I think that would translate to a change equivalent to PrepareToOpenFile for the ABFS driver.
I've just tried to let apps declare how they want to read things; there's generally a straightforward map to sequential vs random vs adaptive... but I hope that the ORC and Parquet options would provide even more opportunity to tweak behaviour, e.g. by knowing that there will be an initial footer read sequence, then stripes will be read.
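As a sketch of that mapping from the application side (the policy names beyond sequential/random/adaptive, and the comma-separated fallback list, are part of the #2584 proposal, so purely illustrative here):

```java
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of how a reader could declare its access pattern up front.
// Stores that don't recognise a policy are expected to fall back to the
// next one in the list that they do understand.
class ReadPolicyExamples {

  // A log processor scanning the whole file from start to finish.
  static FSDataInputStream openForScan(FileSystem fs, Path path) throws Exception {
    return fs.openFile(path)
        .opt("fs.option.openfile.read.policy", "sequential")
        .build().get();
  }

  // A columnar reader: footer first, then selected stripes/row groups,
  // so a format-specific or random policy is the better hint.
  static FSDataInputStream openForColumnar(FileSystem fs, Path path) throws Exception {
    return fs.openFile(path)
        .opt("fs.option.openfile.read.policy", "parquet, random, adaptive")
        .build().get();
  }
}
```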
> Would it be ok if we make this change once #2584 checks in ?
yes, but I do want to at least have the azure impl not reimplement the same
brittleness around `withFileStatus()` as s3a before it goes into 3.3. That way,
consistency.
> Currently we are on a bit of a tight schedule and short-staffed, as we aim to complete the feature work tracked in HADOOP-17853 and another customer requirement that is in the feasibility analysis stage.
I really want to get a Hadoop 3.3.2 out the door with this API in it. It's going to happen before '17853, as that is a big piece of work which is going to need lots of review time by others, myself included.
Issue Time Tracking
-------------------
Worklog Id: (was: 642890)
Time Spent: 7.5h (was: 7h 20m)
> ABFS: Support FileStatus input to OpenFileWithOptions() via OpenFileParameters
> ------------------------------------------------------------------------------
>
> Key: HADOOP-17682
> URL: https://issues.apache.org/jira/browse/HADOOP-17682
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/azure
> Reporter: Sumangala Patki
> Assignee: Sumangala Patki
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 7.5h
> Remaining Estimate: 0h
>
> ABFS open methods require certain information (contentLength, eTag, etc.) to
> create an InputStream for the file at the given path. This information is
> retrieved via a GetFileStatus request to the backend.
> However, client applications may often have access to the FileStatus prior to
> invoking the open API. Providing this FileStatus to the driver through the
> OpenFileParameters argument of openFileWithOptions() can help avoid the call
> to the store for the FileStatus.
> This PR adds handling for the FileStatus instance (if any) provided via the
> OpenFileParameters argument.
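For context on what the driver does with that argument, here is a simplified sketch (not the actual ABFS implementation) of the decision the issue describes: trust a supplied FileStatus when its path name matches, otherwise fall back to a GetFileStatus call. The `StatusFetcher` hook is hypothetical, standing in for the connector's own store call.

```java
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.impl.OpenFileParameters;

// Simplified sketch only; a real openFileWithOptions() implementation does
// more (option parsing, extracting the etag from its own FileStatus type).
class OpenFileStatusHandling {

  static FileStatus resolveStatus(Path path, OpenFileParameters parameters,
      StatusFetcher fetcher) throws java.io.IOException {
    FileStatus supplied = parameters.getStatus();
    if (supplied != null
        && supplied.getPath().getName().equals(path.getName())) {
      // caller already has the listing entry: skip the GetFileStatus round trip
      return supplied;
    }
    // nothing usable supplied: one GetFileStatus against the store
    return fetcher.fetch(path);
  }

  /** Hypothetical hook standing in for the connector's GetFileStatus call. */
  interface StatusFetcher {
    FileStatus fetch(Path path) throws java.io.IOException;
  }
}
```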