[ https://issues.apache.org/jira/browse/MAPREDUCE-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640794#comment-17640794 ]
Steve Loughran commented on MAPREDUCE-7401:
-------------------------------------------

The current PR isn't going in, for the reasons described in the PR. listFiles(path, recursive) is the existing API; the challenge is to allow apps to use it without breaking the current APIs, which is really, really hard. -1, sorry.

Do not go near this unless you can show that the current `listFiles(path, recursive)` is inadequate. Which I do not believe it is. If you can make that case, then you have to look very closely at the Javadocs at the top of FileSystem and at any recent changes to the API to see how they are managed; Vectored IO, for example. Also look at HADOOP-16898 to see its listing changes, including my unhappiness about something going in without more publicity across the different teams.

Any change in that API is public facing and has to be maintained forever. It needs to be supported effectively in HDFS and in cloud storage. That means you're going to have to do a full API specification, write contract tests, implement those contract tests in hadoop-aws and hadoop-azure, and ideally anywhere else (Google GCS), then make sure that you don't break the external libs named in the Javadocs.

Assume that I will automatically veto any new list method returning an array. It hits scale problems on HDFS (lock duration, size of responses to marshal) and prevents us from doing things in the object stores, including prefetching, IOStatistics collection and supporting close(). Also using builder APIs and returning a CompletableFuture.

Look at the s3a and abfs listing code to see how they implement listFiles, and the s3a and manifest committers to see how they are effectively used: we kick off operations (treewalk, file loading) while waiting for the next page of responses to come in, ideally swallowing the entire latency of each list call. Note also that because listFiles only returns files, not directories, we can do O(files/page size) deep list calls against S3.

If the justification is that we need path filtering, see HADOOP-16673 _Add filter parameter to FileSystem>>listFiles_ to see why that doesn't work in cloud, and hence was closed as WONTFIX.

I think a more manageable focus of this work would be to see how FileInputFormat could be sped up by using the existing APIs, with all work done knowing that many external libraries subclass it; for example, Parquet, Avro and ORC. Any incompatible change will stop them upgrading, and we cannot do that.

Am I being very negative here? Yes I am. If you do want to change the APIs then you need to start talking about it on the HDFS and common lists, show that it delivers tangible benefit on-prem and in cloud, and undertake the extensive piece of work needed to implement it in the primary cloud stores to show it is performant.
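For reference, a minimal sketch of the existing API in use; the path argument and printed fields are illustrative only:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public final class DeepListing {
  public static void main(String[] args) throws Exception {
    Path root = new Path(args[0]);
    FileSystem fs = root.getFileSystem(new Configuration());
    // listFiles(path, true) returns a RemoteIterator: result pages are
    // fetched lazily, so callers never hold the whole tree in memory
    // and object stores can prefetch the next page.
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      // start per-file work here rather than accumulating an array first
      System.out.println(status.getPath() + " " + status.getLen());
    }
  }
}
{code}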
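On "builder APIs and returning a CompletableFuture": the existing openFile() call (Hadoop 3.3+) shows the shape such APIs take. A minimal sketch; the method name firstByte and its use of the stream are mine, not anything in the FileSystem contract:

{code:java}
import java.util.concurrent.CompletableFuture;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class OpenFileExample {
  // openFile() returns a builder; build() returns a future, so the store
  // can start the open asynchronously before the caller blocks on it.
  static int firstByte(FileSystem fs, Path path) throws Exception {
    CompletableFuture<FSDataInputStream> future = fs.openFile(path).build();
    // ...other setup work can overlap with the open here...
    try (FSDataInputStream in = future.get()) {
      return in.read();
    }
  }
}
{code}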
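The overlap pattern described above, sketched with a plain executor. The pool size and the processFile() body are placeholders, not anything taken from the s3a or manifest committer code:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public final class OverlappedListing {
  public static void main(String[] args) throws Exception {
    Path root = new Path(args[0]);
    FileSystem fs = root.getFileSystem(new Configuration());
    ExecutorService pool = Executors.newFixedThreadPool(8);
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);
    while (it.hasNext()) {
      // next() may trigger the fetch of a new result page; the submitted
      // tasks keep running while that round trip is in flight.
      LocatedFileStatus status = it.next();
      pool.submit(() -> processFile(status));
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }

  private static void processFile(LocatedFileStatus status) {
    // placeholder for per-file work such as split calculation
    System.out.println(status.getPath() + " " + status.getLen());
  }
}
{code}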
> Optimize liststatus for better performance by using recursive listing
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7401
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7401
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 3.3.3
>            Reporter: Ashutosh Gupta
>            Assignee: Ashutosh Gupta
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> This change adds recursive listing APIs to FileSystem. The purpose is to
> enable different FileSystem implementations to optimize the listStatus calls
> if they can. A default implementation is provided for normal FileSystem
> implementations, which does level-by-level listing for each directory.
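For context on the quoted description: a minimal sketch (mine, not the patch under review) of what a level-by-level default built on listStatus() looks like. Note that it issues one listStatus() round trip per directory and accumulates everything into a list, exactly the array-style pattern the comment above warns against at scale:

{code:java}
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class LevelByLevelListing {
  /** One listStatus() call per directory: O(directories) round trips. */
  public static List<FileStatus> listRecursively(FileSystem fs, Path root)
      throws IOException {
    List<FileStatus> files = new ArrayList<>();
    Deque<Path> pending = new ArrayDeque<>();
    pending.push(root);
    while (!pending.isEmpty()) {
      for (FileStatus status : fs.listStatus(pending.pop())) {
        if (status.isDirectory()) {
          pending.push(status.getPath());  // descend one level deeper
        } else {
          files.add(status);
        }
      }
    }
    return files;
  }
}
{code}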