[
https://issues.apache.org/jira/browse/MAPREDUCE-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640794#comment-17640794
]
Steve Loughran commented on MAPREDUCE-7401:
-------------------------------------------
The current PR isn't going in, for the reasons described in the PR.
listFiles(path, recursive) is the existing API; the challenge is to allow apps
to use it without breaking the current APIs, which is really, really hard.
-1, sorry.
Do not go near this unless you can show that the current `listFiles(path,
recursive)` is inadequate, which I do not believe it is.
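For reference, the existing API being pointed at here is FileSystem.listFiles(Path, boolean), which returns a RemoteIterator rather than an array. A minimal usage sketch (the path argument and the summing logic are illustrative, not from the PR):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class DeepListing {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path root = new Path(args[0]);          // e.g. an s3a:// or hdfs:// path
    FileSystem fs = root.getFileSystem(conf);

    // recursive = true: every file under root, no directories,
    // streamed page by page rather than materialized as an array.
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);
    long files = 0;
    long bytes = 0;
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      files++;
      bytes += status.getLen();
    }
    System.out.println(files + " files, " + bytes + " bytes");
  }
}
```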
If you can make that case, then you have to look very closely at the javadocs
at the top of FileSystem and at any recent changes to the API to see how they
are managed -Vectored IO, for example. Also look at HADOOP-16898 to see its
listing changes, including my unhappiness about something going in without more
publicity across the different teams.
Any change in that API is public facing and has to be maintained forever. It
needs to be supported effectively in HDFS and in cloud storage. That means
you're going to have to do a full API specification, write contract tests,
implement those contract tests in hadoop-aws and hadoop-azure, and ideally
anywhere else (Google GCS), then make sure that you don't break the external
libs named in the javadocs.
Assume that I will automatically veto any new list method returning an array.
It hits scale problems on HDFS -lock duration, size of responses to marshall-
and prevents us from doing things in the object stores, including prefetching,
IOStatistics collection and supporting close(). It also rules out using builder
APIs and returning a CompletableFuture.
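The scaling concern can be illustrated with a toy paged lister in plain Java (class and field names here are hypothetical, not Hadoop code): an iterator that fetches one page per backend call keeps client memory, and the server's per-call work, bounded by the page size, whereas an array-returning method forces the entire listing to be built and marshalled in one response.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

// Toy sketch, not Hadoop code: a lazily paged listing iterator.
// Only one page of entries is held in memory at a time, so a listing
// with millions of entries stays O(pageSize) in memory on the client,
// and each backend call covers one page, not the whole listing.
public class PagedListing implements Iterator<String> {
  private final IntFunction<List<String>> fetchPage; // pageIndex -> entries
  private List<String> page = new ArrayList<>();
  private int pageIndex = 0;
  private int pos = 0;
  private boolean done = false;
  int pagesFetched = 0; // exposed so the demo can count backend calls

  public PagedListing(IntFunction<List<String>> fetchPage) {
    this.fetchPage = fetchPage;
  }

  @Override
  public boolean hasNext() {
    if (pos < page.size()) {
      return true;
    }
    if (done) {
      return false;
    }
    page = fetchPage.apply(pageIndex++); // one backend call per page
    pagesFetched++;
    pos = 0;
    if (page.isEmpty()) {                // empty page terminates the listing
      done = true;
      return false;
    }
    return true;
  }

  @Override
  public String next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return page.get(pos++);
  }
}
```

The same shape is what RemoteIterator gives Hadoop: the caller never sees a page boundary, but the store is free to size and schedule pages as it likes.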
Look at the s3a and abfs listing code to see how to implement listFiles, and at
the s3a and manifest committers to see how they are used effectively. We kick
off operations (treewalk, file loading) while waiting for the next page of
responses to come in, ideally swallowing the entire latency of each list call.
Note also that because listFiles only returns files, not directories, we can do
O(files/page size) deep list calls against S3.
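The latency-hiding pattern described above -doing work on page N while the fetch of page N+1 is already in flight- can be sketched in plain Java (a toy, not the actual s3a code) with a CompletableFuture prefetch:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import java.util.function.IntFunction;

// Toy sketch of overlapped paged listing: the fetch of the next page is
// started before the current page is processed, so processing time can
// absorb most of the per-page list-call latency.
public class OverlappedListing {
  public static void forEachEntry(IntFunction<List<String>> fetchPage,
                                  Consumer<String> process) {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    try {
      CompletableFuture<List<String>> next =
          CompletableFuture.supplyAsync(() -> fetchPage.apply(0), pool);
      int pageIndex = 1;
      while (true) {
        List<String> page = next.join();      // wait for the current page
        if (page.isEmpty()) {
          break;                              // empty page terminates
        }
        final int i = pageIndex++;
        next = CompletableFuture.supplyAsync( // kick off the next fetch...
            () -> fetchPage.apply(i), pool);
        page.forEach(process);                // ...while we process this one
      }
    } finally {
      pool.shutdown();
    }
  }
}
```

If processing a page takes at least as long as a list call, the listing latency after the first page disappears entirely from the critical path.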
If the justification is that we need path filtering, see HADOOP-16673 _Add
filter parameter to FileSystem>>listFiles_ to see why that doesn't work in
cloud and hence was closed as WONTFIX.
I think a more manageable focus for this work would be to see how
FileInputFormat could be sped up using the existing APIs, with all work done
knowing that many external libraries subclass it -for example Parquet, Avro
and ORC. Any incompatible change will stop them upgrading, and we cannot do
that.
Am I being very negative here? Yes, I am. If you do want to change the APIs,
then you need to start talking about it on the HDFS and common lists, show
that it delivers tangible benefit on-prem and in cloud, and undertake the
extensive piece of work needed to implement it in the primary cloud stores to
show it is performant.
> Optimize liststatus for better performance by using recursive listing
> ---------------------------------------------------------------------
>
> Key: MAPREDUCE-7401
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7401
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Affects Versions: 3.3.3
> Reporter: Ashutosh Gupta
> Assignee: Ashutosh Gupta
> Priority: Major
> Labels: pull-request-available
> Time Spent: 40m
> Remaining Estimate: 0h
>
> This change adds recursive listing APIs to FileSystem. The purpose is to
> enable different FileSystem implementations to optimize their listStatus
> calls where they can. A default implementation is provided for the base
> FileSystem, which does level-by-level listing of each directory.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]