[ https://issues.apache.org/jira/browse/HADOOP-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12510929 ]
Doug Cutting commented on HADOOP-1563:
--------------------------------------
A couple of thoughts:
1. If, for performance, we find we must cache FileStatus in most
FileSystem#listPaths implementations, then the FileSystem API itself is
inappropriate. In that case, we should replace FileSystem#listPaths() and
#getFileStatus() with a single new method (a usage sketch follows these
notes):
public abstract Map<Path,FileStatus> listStatus(Path path) throws IOException;
2. If we find that an HTML-based implementation of HttpFileSystem is
insufficient for HDFS (e.g., in order to efficiently support #listStatus),
then we should not implement other directory formats by subclassing.
Rather, HttpFileSystem should use plugins for the various formats. That
fits better with the existing FileSystem extension mechanism, which
dispatches on protocol only.
The plugin interface might look like (a sample implementation is sketched
after these notes):
public interface HttpFileServer {
  /** Set connection properties prior to connect, typically
      authentication headers. */
  void prepareConnection(HttpURLConnection connection);
  /** Parse directory content into per-path status entries. */
  Map<Path,FileStatus> parseDirectoryContent(byte[] content) throws IOException;
}
HttpFileSystem would pick an HttpFileServer implementation based on the
hostname, the content type, or some other signal. Content-type would be
elegant, but is probably insufficient, since, e.g., S3 returns a
content-type of application/xml. Hostname would require reconfiguration
for each site. Perhaps we can use the "Server" header: that would work
for S3, and we could set it for HDFS.
> Create FileSystem implementation to read HDFS data via http
> -----------------------------------------------------------
>
> Key: HADOOP-1563
> URL: https://issues.apache.org/jira/browse/HADOOP-1563
> Project: Hadoop
> Issue Type: New Feature
> Components: fs
> Affects Versions: 0.14.0
> Reporter: Owen O'Malley
> Assignee: Chris Douglas
> Attachments: httpfs.patch
>
>
> There should be a FileSystem implementation that can read from a Namenode's
> http interface. This would have a couple of useful abilities:
> 1. Copy using distcp between different versions of HDFS.
> 2. Use map/reduce inputs from a different version of HDFS.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.