Here's a thought: implement a simple read-only HttpFileSystem that works
for MapReduce input.  It couldn't perform directory enumeration, so job
inputs would have to be listed explicitly, not as directories or glob
patterns.
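
Roughly, the read-only version might look like this (just a sketch: the
class and package names are made up, and the exact set of abstract
FileSystem methods and constructor signatures varies between Hadoop
versions):

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URI;
    import java.net.URL;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.fs.permission.FsPermission;
    import org.apache.hadoop.util.Progressable;

    /** Read-only FileSystem over plain HTTP: open and status only. */
    public class HttpFileSystem extends FileSystem {

      private URI uri;

      public void initialize(URI uri, Configuration conf) throws IOException {
        super.initialize(uri, conf);
        this.uri = uri;
        setConf(conf);
      }

      public URI getUri() { return uri; }

      /** Open a file with an HTTP GET. */
      public FSDataInputStream open(Path f, int bufferSize) throws IOException {
        return new FSDataInputStream(new HttpInputStream(f.toUri().toURL()));
      }

      /** Get length and modification time with an HTTP HEAD. */
      public FileStatus getFileStatus(Path f) throws IOException {
        HttpURLConnection c = (HttpURLConnection) f.toUri().toURL().openConnection();
        c.setRequestMethod("HEAD");
        if (c.getResponseCode() != HttpURLConnection.HTTP_OK)
          throw new FileNotFoundException(f.toString());
        long len = c.getContentLength();
        return new FileStatus(len, false, 1, len, c.getLastModified(), f);
      }

      /** No enumeration over plain HTTP, so inputs must be listed explicitly. */
      public FileStatus[] listStatus(Path f) throws IOException {
        throw new IOException("http: cannot enumerate directories");
      }

      // Mutating operations are all unsupported in the read-only version.
      public FSDataOutputStream create(Path f, FsPermission p, boolean overwrite,
          int bufferSize, short replication, long blockSize, Progressable progress)
          throws IOException { throw new IOException("http: read-only"); }
      public FSDataOutputStream append(Path f, int bufferSize, Progressable progress)
          throws IOException { throw new IOException("http: read-only"); }
      public boolean rename(Path src, Path dst) throws IOException {
        throw new IOException("http: read-only"); }
      public boolean delete(Path f, boolean recursive) throws IOException {
        throw new IOException("http: read-only"); }
      public boolean mkdirs(Path f, FsPermission p) throws IOException {
        throw new IOException("http: read-only"); }
      public void setWorkingDirectory(Path dir) {}
      public Path getWorkingDirectory() { return new Path("/"); }

      /** Seekable wrapper; a seek reopens the connection with a Range header. */
      private static class HttpInputStream extends FSInputStream {
        private final URL url;
        private InputStream in;
        private long pos;

        HttpInputStream(URL url) throws IOException {
          this.url = url;
          this.in = url.openStream();
        }
        public synchronized void seek(long newPos) throws IOException {
          in.close();
          HttpURLConnection c = (HttpURLConnection) url.openConnection();
          c.setRequestProperty("Range", "bytes=" + newPos + "-");
          in = c.getInputStream();
          pos = newPos;
        }
        public synchronized long getPos() { return pos; }
        public boolean seekToNewSource(long target) { return false; }
        public synchronized int read() throws IOException {
          int b = in.read();
          if (b >= 0) pos++;
          return b;
        }
        public synchronized void close() throws IOException { in.close(); }
      }
    }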

For raw S3, one could make a subclass that adds directory enumeration,
since that's possible with S3, but still throws exceptions for renames,
etc.  (One could also add support for write and delete.)
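
For instance, reusing the HttpFileSystem sketch above (and glossing over
request signing, proper XML parsing, and paging of listings longer than
1000 keys):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;

    /** Raw-file S3: like HttpFileSystem, plus directory enumeration. */
    public class S3RawFileSystem extends HttpFileSystem {

      private static final Pattern KEY = Pattern.compile("<Key>([^<]+)</Key>");

      /** Enumerate a "directory" with S3's list-objects call (GET ?prefix=). */
      public FileStatus[] listStatus(Path f) throws IOException {
        URI u = f.toUri();
        String p = u.getPath();
        String prefix = p.length() > 0 ? p.substring(1) : "";
        if (prefix.length() > 0 && !prefix.endsWith("/"))
          prefix += "/";
        URL listing = new URL("http", u.getHost(), "/?prefix=" + prefix);

        // Assumes a publicly listable bucket; a real implementation would
        // also read the <Size> elements rather than issuing a HEAD per key.
        List<FileStatus> result = new ArrayList<FileStatus>();
        BufferedReader in =
            new BufferedReader(new InputStreamReader(listing.openStream()));
        try {
          String line;
          while ((line = in.readLine()) != null) {
            Matcher m = KEY.matcher(line);
            while (m.find())
              result.add(getFileStatus(
                  new Path("http://" + u.getHost() + "/" + m.group(1))));
          }
        } finally {
          in.close();
        }
        return result.toArray(new FileStatus[result.size()]);
      }

      // rename() stays unsupported (inherited from HttpFileSystem), since S3
      // has no rename; write and delete could be layered on with PUT/DELETE.
    }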

CopyFiles could then use HTTP URIs directly, so it wouldn't need a
separate mapper for HTTP inputs and would be further simplified.
Processing Hadoop log files should also be possible using an HttpFileSystem.
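
For example, a job over a day or two of public log files would just
register the scheme and name each input explicitly (the class, property
value and file names here are made up):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class HttpInputExample {
      public static void main(String[] args) {
        JobConf job = new JobConf(HttpInputExample.class);
        // Map the http: scheme to the FileSystem sketched above.
        job.set("fs.http.impl", "org.example.HttpFileSystem");
        // No globbing or directory expansion over HTTP, so every input
        // file is listed explicitly.
        FileInputFormat.setInputPaths(job,
            new Path("http://logs.example.com/2006-10-01.log"),
            new Path("http://logs.example.com/2006-10-02.log"));
        // ... set mapper, reducer and output path, then JobClient.runJob(job).
      }
    }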

One could even extend HttpFileSystem to work for basic MapReduce output,
using HTTP PUT to store files, passing a configured authorization
header.  File deletion could be implemented with DELETE.  One could
adopt a convention that HTTP URIs ending in slashes indicate
directories.  Directory enumeration could then work by parsing the
returned HTML directory listing, providing a reasonably complete
FileSystem implementation.
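
Concretely, the write path might look something like this (sketch only:
error handling and the trailing-slash/HTML-listing conventions are left
out, and the fs.http.authorization property name is made up):

    import java.io.FilterOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;
    import org.apache.hadoop.util.Progressable;

    /** HttpFileSystem extended with PUT-based writes and DELETE deletes. */
    public class WritableHttpFileSystem extends HttpFileSystem {

      private String authorization;              // e.g. "Basic dXNlcjpwYXNz"

      public void initialize(URI uri, Configuration conf) throws IOException {
        super.initialize(uri, conf);
        authorization = conf.get("fs.http.authorization");
      }

      /** Create a file by streaming its bytes to the server with HTTP PUT. */
      public FSDataOutputStream create(Path f, FsPermission p, boolean overwrite,
          int bufferSize, short replication, long blockSize, Progressable progress)
          throws IOException {
        final HttpURLConnection c =
            (HttpURLConnection) f.toUri().toURL().openConnection();
        c.setRequestMethod("PUT");
        c.setDoOutput(true);
        c.setChunkedStreamingMode(bufferSize);   // don't buffer the whole file
        if (authorization != null)
          c.setRequestProperty("Authorization", authorization);
        OutputStream out = new FilterOutputStream(c.getOutputStream()) {
          public void close() throws IOException {
            super.close();                       // finish sending the body
            int rc = c.getResponseCode();
            if (rc / 100 != 2)
              throw new IOException("PUT failed: HTTP " + rc);
          }
        };
        return new FSDataOutputStream(out, statistics);
      }

      /** Delete a file with HTTP DELETE. */
      public boolean delete(Path f, boolean recursive) throws IOException {
        HttpURLConnection c = (HttpURLConnection) f.toUri().toURL().openConnection();
        c.setRequestMethod("DELETE");
        if (authorization != null)
          c.setRequestProperty("Authorization", authorization);
        return c.getResponseCode() / 100 == 2;
      }
    }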


This sounds like a good plan. I wonder whether the existing
block-based s3 scheme should be renamed (to s3block or similar) so that
s3 is the scheme that stores raw files as you describe?

Tom
