[
https://issues.apache.org/jira/browse/HDFS-14663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16903839#comment-16903839
]
Siyao Meng edited comment on HDFS-14663 at 8/9/19 12:06 PM:
------------------------------------------------------------
I spent some time on this, attached a debugger and set a few breakpoints. I
found that the root cause is indeed that
*FSOperations.FSListStatusBatch.WrappedFileSystem#listStatusBatch()* calls
*FileSystem#listStatusBatch()*, whereas we expect it to call
*HttpFSFileSystem#listStatusBatch()*. The difference is that the former doesn't
support batching at all, which is why specifying startAfter has no effect.
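To illustrate the dispatch problem described above, here is a minimal, self-contained sketch. It does not use the real Hadoop classes: BaseFs, BatchingFs, FilterFs, WrappedFs and listBatch() are hypothetical stand-ins, and the assumption that the filter layer forwards plain listing but not the batch call is part of the model, not something verified against the Hadoop source here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a base file system class (NOT org.apache.hadoop.fs.FileSystem).
class BaseFs {
    List<String> listAll() { return new ArrayList<>(); }

    // Default implementation: ignores startAfter and returns everything in one batch.
    protected List<String> listBatch(String startAfter) { return listAll(); }
}

// Hypothetical file system that really supports batching.
class BatchingFs extends BaseFs {
    static final int BATCH_SIZE = 2;
    private final List<String> names; // assumed to be sorted

    BatchingFs(List<String> names) { this.names = names; }

    @Override
    List<String> listAll() { return names; }

    @Override
    protected List<String> listBatch(String startAfter) {
        List<String> out = new ArrayList<>();
        for (String n : names) {
            if (startAfter != null && n.compareTo(startAfter) <= 0) continue; // skip up to the token
            if (out.size() == BATCH_SIZE) break;                              // batch is full
            out.add(n);
        }
        return out;
    }
}

// Hypothetical filter layer: forwards listAll() to the wrapped instance,
// but (by assumption in this model) does not forward listBatch().
class FilterFs extends BaseFs {
    protected final BaseFs fs;

    FilterFs(BaseFs fs) { this.fs = fs; }

    @Override
    List<String> listAll() { return fs.listAll(); }
}

// Wrapper that only widens visibility. super.listBatch() resolves to the
// BaseFs default, so the wrapped BatchingFs override is never reached.
class WrappedFs extends FilterFs {
    WrappedFs(BaseFs fs) { super(fs); }

    @Override
    public List<String> listBatch(String startAfter) { return super.listBatch(startAfter); }
}

class DispatchDemo {
    public static void main(String[] args) {
        BaseFs inner = new BatchingFs(Arrays.asList("a", "b", "c", "d"));
        System.out.println(new WrappedFs(inner).listBatch("b")); // prints [a, b, c, d]
        System.out.println(inner.listBatch("b"));                // prints [c, d]
    }
}
```

In this model the wrapper would have to call fs.listBatch(startAfter) on the wrapped instance, rather than super.listBatch(startAfter), for startAfter to take effect.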
The
[WrappedFileSystem|https://github.com/apache/hadoop/blob/43a91f820a5fce75ea69f78a62331bdc58e09a37/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/fs/http/server/FSOperations.java#L811]
is designed to expose the listStatusBatch() inside *HttpFSFileSystem*, but it
somehow fails to do so. I am not sure why yet.
The type of the wrapped
[fs|https://github.com/apache/hadoop/blob/43a91f820a5fce75ea69f78a62331bdc58e09a37/hadoop-hdfs-project/hadoop-hdfs-httpfs/src/main/java/org/apache/hadoop/fs/http/server/FSOperations.java#L825]
is *DistributedFileSystem*, as expected.
Interestingly, WebHDFS's implementation of *LISTSTATUS_BATCH* doesn't use
*WebHdfsFileSystem#listStatusBatch()*. Instead, it just uses *getListing()* to
serve the request. I wonder if we could/should do the same for HttpFS.
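As a rough sketch of what serving batches from a getListing()-style call could look like, the example below pages through a sorted directory using a startAfter token. Listing, PagedDirectory, the limit parameter and the remaining counter are all hypothetical names for illustration; this is not the actual WebHDFS or HttpFS code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical partial listing: one batch plus the number of entries still to come.
class Listing {
    final List<String> entries;
    final int remaining;

    Listing(List<String> entries, int remaining) {
        this.entries = entries;
        this.remaining = remaining;
    }
}

// Hypothetical directory that serves at most 'limit' entries per call.
class PagedDirectory {
    private final List<String> sorted;
    private final int limit;

    PagedDirectory(List<String> names, int limit) {
        this.sorted = new ArrayList<>(names);
        Collections.sort(this.sorted);
        this.limit = limit;
    }

    // Returns the next batch of entries that sort strictly after startAfter
    // (or the first batch when startAfter is null).
    Listing getListing(String startAfter) {
        int from = 0;
        if (startAfter != null) {
            while (from < sorted.size() && sorted.get(from).compareTo(startAfter) <= 0) {
                from++;
            }
        }
        int to = Math.min(from + limit, sorted.size());
        return new Listing(new ArrayList<>(sorted.subList(from, to)), sorted.size() - to);
    }
}

class PagingDemo {
    // Client-side loop: keep asking for the batch after the last entry seen.
    static List<List<String>> fetchAll(PagedDirectory dir) {
        List<List<String>> batches = new ArrayList<>();
        String startAfter = null;
        while (true) {
            Listing page = dir.getListing(startAfter);
            batches.add(page.entries);
            if (page.remaining == 0 || page.entries.isEmpty()) {
                break;
            }
            startAfter = page.entries.get(page.entries.size() - 1);
        }
        return batches;
    }

    public static void main(String[] args) {
        PagedDirectory dir = new PagedDirectory(Arrays.asList("a", "b", "c", "d", "e"), 2);
        System.out.println(fetchAll(dir)); // prints [[a, b], [c, d], [e]]
    }
}
```

The point of the sketch is that the server only needs a "list after this name, up to a limit" primitive; the batching itself falls out of the client passing the last name it saw as the next startAfter.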
> HttpFS: LISTSTATUS_BATCH does not return batches
> ------------------------------------------------
>
> Key: HDFS-14663
> URL: https://issues.apache.org/jira/browse/HDFS-14663
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: httpfs
> Affects Versions: 3.3.0
> Reporter: Stephen O'Donnell
> Assignee: Siyao Meng
> Priority: Major
>
> The webhdfs protocol supports a LISTSTATUS_BATCH operation where it can
> retrieve the file listing for a large directory in chunks.
> When using the webhdfs service embedded in the namenode, this works as
> expected, but when using HTTPFS, any call to LISTSTATUS_BATCH simply returns
> the entire listing rather than batches, working effectively like LISTSTATUS
> instead.
> This seems to be because HTTPFS falls back to using the method
> org.apache.hadoop.fs.FileSystem#listStatusBatch, which is intended to be
> overridden, but the implementation used in HTTPFS has not done that, leading
> to this limitation.
> This feature (LISTSTATUS_BATCH) was added to HTTPFS by HDFS-10823, but based
> on my testing it does not work as intended. I suspect it is because the
> listStatusBatch operation was added to the WebHdfsFileSystem and
> HttpFSFileSystem as part of the above Jira, but behind the scenes HTTPFS
> seems to use DistributedFileSystem and hence it falls back to the default
> implementation "org.apache.hadoop.fs.FileSystem#listStatusBatch" which
> returns all entries in a single batch.