[ https://issues.apache.org/jira/browse/MAPREDUCE-7401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640794#comment-17640794 ]
Steve Loughran commented on MAPREDUCE-7401:
-------------------------------------------

The current PR isn't going in, for the reasons described in the PR. listFiles(path, recursive) is the existing API; the challenge is to allow apps to use it without breaking the current APIs, which is really, really hard. -1, sorry.

Do not go near this unless you can show that the current `listFiles(path, recursive)` is inadequate. Which I do not believe it is. If you can make that case, then you have to look very closely at the Javadocs at the top of FileSystem and at any recent changes to the API to see how they are managed; Vectored IO, for example. Also look at HADOOP-16898 to see its listing changes, including my unhappiness about something going in without more publicity across the different teams.

Any change in that API is public facing and has to be maintained forever. It needs to be supported effectively in HDFS and in cloud storage. That means you're going to have to do a full API specification, write contract tests, implement those contract tests in hadoop-aws and hadoop-azure, and ideally anywhere else (Google GCS), then make sure that you don't break the external libs named in the Javadocs.

Assume that I will automatically veto any new list method returning an array. It hits scale problems on HDFS (lock duration, size of responses to marshal) and prevents us from doing things in the object stores, including prefetching, IOStatistics collection and supporting close(). Also using builder APIs and returning a CompletableFuture.

Look at the s3a and abfs listing code to see how they implement listFiles, and the s3a and manifest committers to see how they are effectively used: we kick off operations (treewalk, file loading) while waiting for the next page of responses to come in, ideally swallowing the entire latency of each list call. Note also that because listFiles only returns files, not directories, we can do O(files/page size) deep list calls against S3.

If the justification is that we need path filtering, see HADOOP-16673 _Add filter parameter to FileSystem>>listFiles_ to see why that doesn't work in cloud, and hence was closed as WONTFIX.

I think a more manageable focus of this work would be to see how FileInputFormat could be sped up by using the existing APIs, with all work done knowing that many external libraries subclass it; for example, Parquet, Avro and ORC. Any incompatible change will stop them upgrading, and we cannot do that.

Am I being very negative here? Yes I am. If you do want to change the APIs then you need to start talking about it on the HDFS and common lists, show that it delivers tangible benefit on-prem and in cloud, and undertake the extensive piece of work needed to implement it in the primary cloud stores to show it is performant.
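For reference, a minimal sketch of the existing API in use; the path argument and printed fields are illustrative only:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public final class DeepListing {
  public static void main(String[] args) throws Exception {
    Path root = new Path(args[0]);
    FileSystem fs = root.getFileSystem(new Configuration());
    // listFiles(path, true) returns a RemoteIterator: result pages are
    // fetched lazily, so callers never hold the whole tree in memory
    // and object stores can prefetch the next page.
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      // start per-file work here rather than accumulating an array first
      System.out.println(status.getPath() + " " + status.getLen());
    }
  }
}
{code}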
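On "builder APIs and returning a CompletableFuture": the existing openFile() call (Hadoop 3.3+) shows the shape such APIs take. A minimal sketch; the method name firstByte and its use of the stream are mine, not anything in the FileSystem contract:

{code:java}
import java.util.concurrent.CompletableFuture;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class OpenFileExample {
  // openFile() returns a builder; build() returns a future, so the store
  // can start the open asynchronously before the caller blocks on it.
  static int firstByte(FileSystem fs, Path path) throws Exception {
    CompletableFuture<FSDataInputStream> future = fs.openFile(path).build();
    // ...other setup work can overlap with the open here...
    try (FSDataInputStream in = future.get()) {
      return in.read();
    }
  }
}
{code}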
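The overlap pattern described above, sketched with a plain executor. The pool size and the processFile() body are placeholders, not anything taken from the s3a or manifest committer code:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public final class OverlappedListing {
  public static void main(String[] args) throws Exception {
    Path root = new Path(args[0]);
    FileSystem fs = root.getFileSystem(new Configuration());
    ExecutorService pool = Executors.newFixedThreadPool(8);
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);
    while (it.hasNext()) {
      // next() may trigger the fetch of a new result page; the submitted
      // tasks keep running while that round trip is in flight.
      LocatedFileStatus status = it.next();
      pool.submit(() -> processFile(status));
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }

  private static void processFile(LocatedFileStatus status) {
    // placeholder for per-file work such as split calculation
    System.out.println(status.getPath() + " " + status.getLen());
  }
}
{code}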
> Optimize liststatus for better performance by using recursive listing
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7401
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7401
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 3.3.3
>            Reporter: Ashutosh Gupta
>            Assignee: Ashutosh Gupta
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> This change adds recursive listing APIs to FileSystem. The purpose is to
> enable different FileSystem implementations to optimize the listStatus calls
> if they can. A default implementation is provided for normal FileSystem
> implementations, which does level-by-level listing for each directory.
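For context on the quoted description: a minimal sketch (mine, not the patch under review) of what a level-by-level default built on listStatus() looks like. Note that it issues one listStatus() round trip per directory and accumulates everything into a list, exactly the array-style pattern the comment above warns against at scale:

{code:java}
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class LevelByLevelListing {
  /** One listStatus() call per directory: O(directories) round trips. */
  public static List<FileStatus> listRecursively(FileSystem fs, Path root)
      throws IOException {
    List<FileStatus> files = new ArrayList<>();
    Deque<Path> pending = new ArrayDeque<>();
    pending.push(root);
    while (!pending.isEmpty()) {
      for (FileStatus status : fs.listStatus(pending.pop())) {
        if (status.isDirectory()) {
          pending.push(status.getPath());  // descend one level deeper
        } else {
          files.add(status);
        }
      }
    }
    return files;
  }
}
{code}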