[ https://issues.apache.org/jira/browse/HADOOP-16673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962267#comment-16962267 ]

Steve Loughran edited comment on HADOOP-16673 at 10/29/19 5:52 PM:
-------------------------------------------------------------------

 This wouldn't work. The path filtering you have described will only work 
efficiently if you are actually doing a tree walk. For S3 we are issuing LIST 
path/ commands and getting pages of results back. There is no way we could do 
this filtering except by doing exactly what you are trying to do yourself: list 
everything, then discard the stuff that is not needed. It won't be any more 
efficient; it will only set unrealistic expectations about performance.
 
 Have a look at {{org.apache.hadoop.fs.s3a.Listing.ProvidedFileStatusIterator}} 
to see what to do. You just need to write an iterator which wraps the current 
one and does the filtering; you can then iterate over that.
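 As a minimal sketch (the class and field names here are illustrative, not 
anything that exists in the S3A code), such a wrapper could look like:

{code:java}
import java.io.IOException;
import java.util.NoSuchElementException;

import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.fs.RemoteIterator;

/** Wraps a RemoteIterator and yields only the entries the filter accepts. */
class FilteringRemoteIterator implements RemoteIterator<LocatedFileStatus> {
  private final RemoteIterator<LocatedFileStatus> source;
  private final PathFilter filter;
  private LocatedFileStatus lookahead; // next accepted entry, if buffered

  FilteringRemoteIterator(RemoteIterator<LocatedFileStatus> source,
      PathFilter filter) {
    this.source = source;
    this.filter = filter;
  }

  @Override
  public boolean hasNext() throws IOException {
    // advance the source until an accepted entry is buffered or it runs out
    while (lookahead == null && source.hasNext()) {
      LocatedFileStatus candidate = source.next();
      if (filter.accept(candidate.getPath())) {
        lookahead = candidate;
      }
    }
    return lookahead != null;
  }

  @Override
  public LocatedFileStatus next() throws IOException {
    if (!hasNext()) {
      throw new NoSuchElementException("end of listing");
    }
    LocatedFileStatus result = lookahead;
    lookahead = null;
    return result;
  }
}
{code}

You would construct it as {{new FilteringRemoteIterator(fs.listFiles(path, true), filter)}}; 
the cost is exactly the cost of listing everything and discarding, as above, 
but the caller just sees a filtered iterator.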
 
 Closing as a WONTFIX, sorry.

 I'll look at the Hive problem. 

+[~gabor.bota]



> Add filter parameter to FileSystem>>listFiles
> ---------------------------------------------
>
>                 Key: HADOOP-16673
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16673
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs, fs/s3
>    Affects Versions: 3.2.2
>            Reporter: Attila Magyar
>            Priority: Major
>
> Currently, getting a filtered list of files in a directory recursively is 
> clumsy, because filtering has to happen afterwards on the result list.
> Imagine we want to list all non-hidden files recursively.
> The non-hidden file filter is defined as: 
> {code:java}
> !name.startsWith("_") && !name.startsWith(".") {code}
> Then we can do:
> {code:java}
> RemoteIterator<LocatedFileStatus> remoteIterator =
>     fs.listFiles(path, true /* recursive */);
> while (remoteIterator.hasNext()) {
>   LocatedFileStatus each = remoteIterator.next();
>   // noHiddenElements() applies the filter to every path element;
>   // see the sketch after the examples below
>   if (noHiddenElements(each.getPath())) {
>     result.add(each);
>   }
> }
> {code}
> For example, each of these paths should be skipped:
>  * /.a/b/c
>  * /a/.b/c
>  * /a/b/.c/
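> As a sketch, a helper that applies the filter to every element of the path 
> (the name {{noHiddenElements}} is illustrative) could be:
> {code:java}
> static boolean noHiddenElements(Path path) {
>   // walk from the leaf back up to the root, rejecting any hidden component
>   for (Path p = path; p != null; p = p.getParent()) {
>     String name = p.getName();
>     if (name.startsWith("_") || name.startsWith(".")) {
>       return false;
>     }
>   }
>   return true;
> }
> {code}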
> It would be a lot better to have a filter parameter on listFiles. This is 
> needed to solve HIVE-22411 effectively. 


