[
https://issues.apache.org/jira/browse/HADOOP-17682?focusedWorklogId=642750&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-642750
]
ASF GitHub Bot logged work on HADOOP-17682:
-------------------------------------------
Author: ASF GitHub Bot
Created on: 27/Aug/21 09:12
Start Date: 27/Aug/21 09:12
Worklog Time Spent: 10m
Work Description: snvijaya commented on pull request #2975:
URL: https://github.com/apache/hadoop/pull/2975#issuecomment-907053394
> Hi,
>
> Only just seen this... feel free to let me know when you are doing big
> bits of work, especially with unstable bits of the API that I've worked
> on.
>
> FWIW, I'll be able to use this in the manifest committer #2971, as it
> will let me go straight from the listing of the directory of manifests
> to opening each manifest without the HEAD requests. I also have an
> internal PoC parquet lib which does the same - and it showed up the
> bugs in the s3a FS there.
>
> I have some concerns with the patch as is, which I'd like addressed in
> a "stabilize abfs openFile.withStatus" JIRA which, once it is in, can
> go into 3.3 back to back with this.
>
> In particular, the openFile code on L705 warns if the FileStatus is of
> the wrong type. This turns out to cause problems in Hive or any other
> FS which wraps the inner FS and creates its own FileStatus objects.
> Why does Hive do this? Because FileStatus.getPath() returns the full
> path of the reference; Hive has to replace this with a new FileStatus
> carrying its mapped path.
>
> _Printing WARN if the type is wrong is going to fill up Hive logs and
> remove all the speedup options._
>
> In #3289 I soften the requirement for the path to == the object; I'm
> only checking that the final elements are equal. This allows parquet
> through Hive to use the API.
>
>
> https://github.com/apache/hadoop/pull/3289/files#diff-4050f95b7e3912145415b6e2f9cd3b0760fcf2ce96bf0980c6c30a6edad2d0fbR5402
>
> In #2584, which is my real "stabilize openfile" patch, I do it
> properly:
>
> https://github.com/apache/hadoop/blob/700f99835f21571d9f29f2194327894776abe13f/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/PrepareToOpenFile.java
>
> * only check name
> * reject directories
> * handle LocatedFileStatus
> * and any other FS subclass, only picking up the length value
> * and as this has got complex, factored out into its own class
>
> Really, it's only the length which is important; other than that, it
> is just a declaration of "this is a file, skip the HEAD".
>
> So for ABFS to work, I'm going to require the equivalent of that
> PrepareToOpenFile: support for more FileStatus types. It doesn't need
> to be factored out, though that turned out to work well for testing.
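The validation steps listed above might be sketched roughly as follows. This is an illustrative stand-alone sketch, not the actual PrepareToOpenFile code; `SimpleStatus` and every name below are invented stand-ins for Hadoop's real FileStatus types.

```java
import java.io.FileNotFoundException;

public class OpenFileValidationSketch {
    // Stand-in for Hadoop's FileStatus; only the fields the sketch needs.
    static class SimpleStatus {
        final String path;      // full path, e.g. "s3a://bucket/dir/file"
        final boolean isDir;
        final long length;
        SimpleStatus(String path, boolean isDir, long length) {
            this.path = path; this.isDir = isDir; this.length = length;
        }
    }

    /**
     * Validate a caller-supplied status against the path being opened:
     * compare only the final path element (so wrapped/remapped statuses
     * from Hive-style filesystems still pass), reject directories, and
     * keep only the length -- the one value the open actually needs.
     */
    static long validate(String openPath, SimpleStatus st)
            throws FileNotFoundException {
        String wantName = openPath.substring(openPath.lastIndexOf('/') + 1);
        String gotName = st.path.substring(st.path.lastIndexOf('/') + 1);
        if (!wantName.equals(gotName)) {
            throw new IllegalArgumentException(
                "status filename " + gotName + " != " + wantName);
        }
        if (st.isDir) {
            throw new FileNotFoundException(openPath + " is a directory");
        }
        return st.length;   // any FileStatus subclass: only length is used
    }
}
```

Note the filename-only comparison: a status whose path was remapped by a wrapping filesystem still validates, which is the softened check described above.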
>
> This makes clear that I need to get #2584 in; it's been neglected.
>
> That patch allows for a seek policy to be specified; all the
> whole-file reads ask for sequential IO, so when Hive clusters are
> deployed with the default seek policy set to random, things like
> distcp and hadoop fs -put don't fail. Critically, it defines a
> standard set of seek policies, which Azure can implement too. I've
> realized we need one more, "whole-file", which says "no need for lazy
> seek, just do the GET".
>
> Does anyone want to take that PR up? I keep getting distracted by
> other high-importance tasks.
>
> Meanwhile, the manifest committer can be the first user of this; I
> will verify that it delivers some speedup. It should: job commit with
> 1000 tasks will go from one LIST, 1000 HEAD, and 1000 GET requests to
> one LIST and the 1000 GETs.
>
> FWIW, this is the code in the magic committer patch which becomes safe
> to use with the filename fix there:
>
> https://github.com/apache/hadoop/pull/3289/files#diff-3da795bf4618a7e504e21d3c53f9f7a42c52a75fe232906feb4e66f98794d41eR94
>
> ...I'll do the same with the manifest committer; if I do it this week,
> its new tests will help stress the feature.
Hi @steveloughran,
Thanks a lot for taking a look at this change. We will make updates to
address the comments provided.
There is one comment on the eTag that I would like to check with you.
Today, the eTag is a parameter sent with every read request to the
server. The server ensures that the file is indeed of the expected eTag
before proceeding with the read operation. If the file was recreated
between the time the inputStream was created and the read was issued,
the eTag check will fail and so will the read; without the check, the
read would wrongly return data from the new file. This is why we mandate
that the eTag be present, and otherwise fall back to triggering a
GetFileStatus. As per your comment, we will change the log level to
debug in that fallback case.
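The conditional-read behaviour described above can be illustrated with a minimal self-contained sketch. All names here (`Blob`, `conditionalRead`) are invented for the illustration; the real protocol uses HTTP If-Match headers and a 412 Precondition Failed response.

```java
import java.util.Objects;

public class ETagCheckSketch {
    // Simulates a stored object: recreating the file changes its eTag.
    static final class Blob {
        String eTag;
        byte[] data;
        Blob(String eTag, byte[] data) { this.eTag = eTag; this.data = data; }
    }

    // A read carrying the eTag captured at open time; it succeeds only
    // if the object's current eTag still matches.
    static byte[] conditionalRead(Blob blob, String expectedETag) {
        if (!Objects.equals(blob.eTag, expectedETag)) {
            // the real server answers 412 Precondition Failed here
            throw new IllegalStateException("412: eTag mismatch");
        }
        return blob.data;
    }

    public static void main(String[] args) {
        Blob blob = new Blob("v1", new byte[]{1, 2, 3});
        String openedWith = blob.eTag;          // eTag captured at open time
        conditionalRead(blob, openedWith);      // succeeds: file unchanged
        blob.eTag = "v2";                       // file recreated concurrently
        try {
            conditionalRead(blob, openedWith);  // fails, instead of silently
                                                // reading the new file's data
        } catch (IllegalStateException e) {
            System.out.println("read rejected: " + e.getMessage());
        }
    }
}
```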
I did have a look at [#2584](https://github.com/apache/hadoop/pull/2584),
which provides the means to pass mandatory and optional keys through
OpenFileParameters. I noticed an example where the S3 eTag is passed as
an optional key (see this
[link](https://github.com/apache/hadoop/pull/2584/files#diff-a4a5b4990d0a979db924e2f338eab5421d9909ccbafee3846bf76b3e816f31d2R49)).
For the Hive case where the FileStatus is wrapped, is this how the eTag
gets passed? I was also wondering how we can get various Hadoop
workloads to send it without being aware of which FileSystem is
currently active, considering the key name is specific to it.
I also wanted to check whether it would be right to add the eTag as a
field in the base FileStatus class, probably as a sub-structure within
FileStatus, so that it can be expanded in future to hold properties that
cloud storage systems commonly depend on.
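As a purely hypothetical shape for that suggestion (every name below is invented for illustration and is not Hadoop's actual API), the sub-structure might look like:

```java
public class FileStatusEtagSketch {
    // Hypothetical sub-structure grouping cloud-store properties,
    // expandable later without widening FileStatus itself.
    static class StoreAttributes {
        final String eTag;          // used for version checks on each read
        StoreAttributes(String eTag) { this.eTag = eTag; }
    }

    // Stand-in for a FileStatus extended with the optional sub-structure.
    static class ExtendedStatus {
        final String path;
        final long length;
        final StoreAttributes attrs;  // null when the store has no eTag
        ExtendedStatus(String path, long length, StoreAttributes attrs) {
            this.path = path; this.length = length; this.attrs = attrs;
        }
        boolean hasEtag() { return attrs != null && attrs.eTag != null; }
    }
}
```

A store without eTags simply leaves the sub-structure null, so existing FileStatus consumers are unaffected.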
The openFile() change is one of the key performance improvements we are
looking at adopting. Passing the FileStatus down in itself halves the
HEAD request count, and we look forward to adopting the new set of read
policies too. We will work on understanding how to map the various read
policies to the optimizations the driver currently has for different
read patterns; I think that would translate to a change equivalent to
PrepareToOpenFile for the ABFS driver. Would it be ok if we make this
change once [#2584](https://github.com/apache/hadoop/pull/2584) is
checked in?
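As a rough illustration of what that policy-to-optimization mapping might look like (the policy strings follow the discussion above, but the enum and its mode names are invented for this sketch, not the driver's actual constants):

```java
public class ReadPolicyMappingSketch {
    // Stand-ins for the ABFS driver's internal read modes (invented names).
    enum AbfsReadMode { PREFETCH_SEQUENTIAL, LAZY_SEEK_RANDOM, SINGLE_GET }

    /**
     * Map a standard openFile read-policy name to a driver read mode:
     * "sequential", "random", and the proposed "whole-file" policy.
     * Unknown policies fall back to a safe default (sequential
     * prefetching, in this sketch).
     */
    static AbfsReadMode forPolicy(String policy) {
        switch (policy) {
            case "random":     return AbfsReadMode.LAZY_SEEK_RANDOM;
            case "whole-file": return AbfsReadMode.SINGLE_GET; // no lazy seek, just GET
            case "sequential":
            default:           return AbfsReadMode.PREFETCH_SEQUENTIAL;
        }
    }
}
```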
Currently we are on a bit of a tight schedule and short-staffed, as we
aim to complete the feature work tracked in
[HADOOP-17853](https://issues.apache.org/jira/browse/HADOOP-17853) and
another customer requirement that is in the feasibility-analysis stage.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 642750)
Time Spent: 7h 20m (was: 7h 10m)
> ABFS: Support FileStatus input to OpenFileWithOptions() via OpenFileParameters
> ------------------------------------------------------------------------------
>
> Key: HADOOP-17682
> URL: https://issues.apache.org/jira/browse/HADOOP-17682
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/azure
> Reporter: Sumangala Patki
> Assignee: Sumangala Patki
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0
>
> Time Spent: 7h 20m
> Remaining Estimate: 0h
>
> ABFS open methods require certain information (contentLength, eTag, etc.)
> to create an InputStream for the file at the given path. This information is
> retrieved via a GetFileStatus request to the backend.
> However, client applications may often have access to the FileStatus prior to
> invoking the open API. Providing this FileStatus to the driver through the
> OpenFileParameters argument of openFileWithOptions() can help avoid the call
> to the store for the FileStatus.
> This PR adds handling for the FileStatus instance (if any) provided via the
> OpenFileParameters argument.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]