JFI, HADOOP-12502 introduced a RemoteIterator at the client side, which has not been committed yet, though.

--Brahma Reddy Battula
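For context, a minimal sketch (not from the thread) of what consuming a listing through the RemoteIterator-based API looks like on the client side; the class name and default path below are just placeholders, and whether the underlying FileSystem actually fetches entries in batches over the wire depends on the implementation and release:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class IterativeListing {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder path; point this at any large directory.
    Path dir = new Path(args.length > 0 ? args[0] : "/user/example/big-dir");
    FileSystem fs = FileSystem.get(dir.toUri(), conf);

    // listStatusIterator returns a RemoteIterator, so entries can be
    // consumed as they arrive instead of materializing a full
    // FileStatus[] for the whole directory up front.
    RemoteIterator<FileStatus> it = fs.listStatusIterator(dir);
    long count = 0;
    while (it.hasNext()) {
      FileStatus status = it.next();
      count++;
      System.out.println(status.getPath());
    }
    System.out.println("Listed " + count + " entries");
  }
}
{code}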
-----Original Message-----
From: Andrew Wang [mailto:andrew.w...@cloudera.com]
Sent: 20 October 2016 05:48
To: Zhe Zhang
Cc: Xiao Chen; hdfs-dev@hadoop.apache.org
Subject: Re: Listing large directories via WebHDFS

If the issue is just "hadoop fs -ls -R /", one thing we can look into is
making the Globber use the listStatus API that returns a RemoteIterator
rather than a FileStatus[]. That'll use the client-side pagination Xiao
mentioned for WebHDFS/HttpFS (though this is currently not in a 2.x
release).

The general case is still hard, for the reason you mentioned.

Best,
Andrew

On Wed, Oct 19, 2016 at 2:40 PM, Zhe Zhang <z...@apache.org> wrote:

> Thanks Xiao!
>
> Seems like server-side throttling is still vulnerable to abusive users
> issuing large listing requests. Once such a request is scheduled, it
> will keep listing potentially millions of files without having to go
> through the IPC/RPC queue again. It does have to compete for the fsn
> lock, though, thanks to this server-side throttling logic.
>
> On Wed, Oct 19, 2016 at 2:33 PM Xiao Chen <x...@cloudera.com> wrote:
>
> > Hi Zhe,
> >
> > Per my understanding, the runner in WebHDFS goes to
> > NamenodeWebHdfsMethods
> > <https://github.com/apache/hadoop/blob/e9c4616b5e47e9c616799abc532269572ab24e6e/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/web/resources/NamenodeWebHdfsMethods.java#L972>,
> > which eventually calls FSNameSystem#getListing. So it's still
> > throttled on the NN side. Up for discussion on the DDoS part...
> >
> > Also, Andrew did some pagination features for WebHDFS/HttpFS via
> > https://issues.apache.org/jira/browse/HDFS-10784 and
> > https://issues.apache.org/jira/browse/HDFS-10823, to provide better
> > control.
> >
> > Best,
> >
> > -Xiao
> >
> > On Wed, Oct 19, 2016 at 2:08 PM, Zhe Zhang <z...@apache.org> wrote:
> >
> > Hi,
> >
> > The regular HDFS client (DistributedFileSystem) throttles the workload
> > of listing large directories by dividing the work into batches,
> > something like below:
> > {code}
> > // fetch the first batch of entries in the directory
> > DirectoryListing thisListing = dfs.listPaths(
> >     src, HdfsFileStatus.EMPTY_NAME);
> > ......
> > if (!thisListing.hasMore()) { // got all entries of the directory
> >   FileStatus[] stats = new FileStatus[partialListing.length];
> > {code}
> >
> > However, WebHDFS doesn't seem to have this batching logic.
> > {code}
> > @Override
> > public FileStatus[] listStatus(final Path f) throws IOException {
> >   final HttpOpParam.Op op = GetOpParam.Op.LISTSTATUS;
> >   return new FsPathResponseRunner<FileStatus[]>(op, f) {
> >     @Override
> >     FileStatus[] decodeResponse(Map<?,?> json) {
> >       ....
> >     }
> >   }.run();
> > }
> > {code}
> >
> > Am I missing anything? So a user can DDoS by {{hadoop fs -ls -R /}}
> > via WebHDFS?
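For completeness, a rough sketch of the batching loop that surrounds the DistributedFileSystem snippet quoted above; DFSClient#listPaths and DirectoryListing are private Hadoop APIs, and the helper name and structure here are illustrative only, not the actual implementation:

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hdfs.DFSClient;
import org.apache.hadoop.hdfs.protocol.DirectoryListing;
import org.apache.hadoop.hdfs.protocol.HdfsFileStatus;

/** Illustrative helper: drain a directory listing batch by batch. */
public class BatchedListing {
  static List<HdfsFileStatus> listAll(DFSClient dfs, String src)
      throws IOException {
    List<HdfsFileStatus> all = new ArrayList<>();
    // First batch: start from the beginning of the directory.
    DirectoryListing listing = dfs.listPaths(src, HdfsFileStatus.EMPTY_NAME);
    while (listing != null) {
      for (HdfsFileStatus entry : listing.getPartialListing()) {
        all.add(entry);
      }
      if (!listing.hasMore()) {
        break; // got all entries of the directory
      }
      // Fetch the next batch, resuming after the last name returned.
      // Each batch is a separate RPC, so it re-enters the NameNode's
      // RPC queue and re-acquires the namesystem lock, which is the
      // throttling effect discussed in the thread.
      listing = dfs.listPaths(src, listing.getLastName());
    }
    return all;
  }
}
{code}

The point of the per-batch RPC is that a single huge listing cannot hold the NameNode continuously; the WebHDFS listStatus shown above returns everything in one response, which is what the original question is about.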