JFYI, HADOOP-12502 introduced a RemoteIterator on the client side, but it has 
not been committed yet.



--Brahma Reddy Battula

-----Original Message-----
From: Andrew Wang [mailto:andrew.w...@cloudera.com] 
Sent: 20 October 2016 05:48
To: Zhe Zhang
Cc: Xiao Chen; hdfs-dev@hadoop.apache.org
Subject: Re: Listing large directories via WebHDFS

If the issue is just "hadoop fs -ls -R /", one thing we can look into is making 
the Globber use the listStatus API that returns a RemoteIterator rather than a 
FileStatus[]. That'll use the client-side pagination Xiao mentioned for 
WebHDFS/HttpFS (though this is currently not in a 2.x release).
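
As a sketch only (not the actual Globber patch), a recursive listing built on the 
iterator-returning API could look like the code below. FileSystem#listStatusIterator 
is the real API; whether entries are actually fetched in batches depends on the 
underlying FileSystem implementation (DistributedFileSystem batches, and 
WebHDFS/HttpFS will once the pagination work mentioned below lands).

{code}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class IterativeLsR {
  // Recursively lists 'dir', pulling entries incrementally instead of
  // materializing one giant FileStatus[] per directory.
  static void lsR(FileSystem fs, Path dir) throws IOException {
    RemoteIterator<FileStatus> it = fs.listStatusIterator(dir);
    while (it.hasNext()) {
      FileStatus st = it.next();
      System.out.println(st.getPath());
      if (st.isDirectory()) {
        lsR(fs, st.getPath());
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Uses the default FileSystem; any URI (hdfs://, webhdfs://) works.
    FileSystem fs = FileSystem.get(conf);
    lsR(fs, new Path(args.length > 0 ? args[0] : "/"));
  }
}
{code}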

The general case is still hard, for the reason you mentioned.

Best,
Andrew

On Wed, Oct 19, 2016 at 2:40 PM, Zhe Zhang <z...@apache.org> wrote:

> Thanks Xiao!
>
> Seems like server-side throttling is still vulnerable to abusive users
> issuing large listing requests. Once such a request is scheduled, it
> will keep listing potentially millions of files without having to go
> through the IPC/RPC queue again. It does have to compete for the fsn
> lock though, thanks to this server-side throttling logic.
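>
> For contrast, here is roughly the loop the regular DistributedFileSystem
> path runs (a simplified sketch of the DFSClient listing code quoted later
> in this thread; dfs and src are assumed from that context). Each
> listPaths() call is its own getListing RPC, so every batch re-enters the
> NN call queue:
>
> {code}
> // Simplified sketch: each listPaths() call is an individual getListing
> // RPC to the NN, so batches are queued and throttled one at a time.
> DirectoryListing thisListing =
>     dfs.listPaths(src, HdfsFileStatus.EMPTY_NAME);
> for (HdfsFileStatus s : thisListing.getPartialListing()) {
>   // process entry s ...
> }
> while (thisListing.hasMore()) {
>   // resume after the last entry of the previous batch; this is a new
>   // RPC, so it goes through the NN call queue again
>   thisListing = dfs.listPaths(src, thisListing.getLastName());
>   for (HdfsFileStatus s : thisListing.getPartialListing()) {
>     // process entry s ...
>   }
> }
> {code}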
>
> On Wed, Oct 19, 2016 at 2:33 PM Xiao Chen <x...@cloudera.com> wrote:
>
> > Hi Zhe,
> >
> > Per my understanding, the runner in webhdfs goes to NamenodeWebHdfsMethods
> > <https://github.com/apache/hadoop/blob/e9c4616b5e47e9c616799abc532269572ab24e6e/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/web/resources/NamenodeWebHdfsMethods.java#L972>,
> > which eventually calls FSNamesystem#getListing. So it's still throttled
> > on the NN side. Up for discussion on the DDoS part...
> >
> > Also, Andrew added pagination features for webhdfs/httpfs via
> > https://issues.apache.org/jira/browse/HDFS-10784 and 
> > https://issues.apache.org/jira/browse/HDFS-10823, to provide better 
> > control.
> >
> > Best,
> >
> > -Xiao
> >
> > On Wed, Oct 19, 2016 at 2:08 PM, Zhe Zhang <z...@apache.org> wrote:
> >
> > Hi,
> >
> > The regular HDFS client (DistributedFileSystem) throttles the workload
> > of listing large directories by dividing the work into batches,
> > something like below:
> > {code}
> >     // fetch the first batch of entries in the directory
> >     DirectoryListing thisListing = dfs.listPaths(
> >         src, HdfsFileStatus.EMPTY_NAME);
> >      ......
> >     if (!thisListing.hasMore()) { // got all entries of the directory
> >       FileStatus[] stats = new FileStatus[partialListing.length];
> > {code}
> >
> > However, WebHDFS doesn't seem to have this batching logic.
> > {code}
> >   @Override
> >   public FileStatus[] listStatus(final Path f) throws IOException {
> >     final HttpOpParam.Op op = GetOpParam.Op.LISTSTATUS;
> >     return new FsPathResponseRunner<FileStatus[]>(op, f) {
> >       @Override
> >       FileStatus[] decodeResponse(Map<?,?> json) {
> >           ....
> >       }
> >     }.run();
> >   }
> > {code}
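> >
> > To illustrate (host, port, and path below are just placeholders), the
> > listing here is a single LISTSTATUS request whose response carries the
> > entire directory as one JSON document, which the client then decodes
> > into a single FileStatus[]:
> >
> > {code}
> > import java.io.InputStream;
> > import java.net.HttpURLConnection;
> > import java.net.URL;
> >
> > public class OneShotListStatus {
> >   public static void main(String[] args) throws Exception {
> >     // Host, port, and path are placeholders for a real NN and directory.
> >     URL url = new URL(
> >         "http://nn.example.com:50070/webhdfs/v1/big/dir?op=LISTSTATUS");
> >     HttpURLConnection conn = (HttpURLConnection) url.openConnection();
> >     long bytes = 0;
> >     try (InputStream in = conn.getInputStream()) {
> >       byte[] buf = new byte[8192];
> >       for (int n; (n = in.read(buf)) != -1; ) {
> >         bytes += n; // grows linearly with the number of children
> >       }
> >     } finally {
> >       conn.disconnect();
> >     }
> >     System.out.println("LISTSTATUS response bytes: " + bytes);
> >   }
> > }
> > {code}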
> >
> > Am I missing anything? Otherwise, can a user DDoS the NN simply by
> > running {{hadoop fs -ls -R /}} via WebHDFS?
> >

