Dennis,

Thanks for the detailed response. I need to play with the SequenceFile format a bit; I found the documentation for it on the wiki. I think I could build on top of the format to handle storage of very large documents. The vast majority of documents will fit in RAM and within a standard HDFS block (64 MB, which I might raise to 128 MB). For very large documents, I can split each one into consecutive records in the SequenceFile, overloading the key to be a combination of a "real" key and a record number. It shouldn't be too hard to extend SequenceFile to do this.
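
Roughly what I have in mind for the key scheme, as a minimal sketch in plain Python rather than against the actual Hadoop API. The names (make_key, parse_key, split_document, reassemble), the 4-byte record-number suffix, and the chunk size are all assumptions for illustration, not anything SequenceFile defines:

```python
import struct
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical chunk size; in practice this would be chosen to stay
# comfortably below the HDFS block size.
CHUNK_SIZE = 64 * 1024 * 1024


def make_key(real_key: bytes, record_no: int) -> bytes:
    """Composite key: the "real" key followed by a 4-byte big-endian record number."""
    return real_key + struct.pack(">I", record_no)


def parse_key(composite: bytes) -> Tuple[bytes, int]:
    """Split a composite key back into (real key, record number)."""
    (record_no,) = struct.unpack(">I", composite[-4:])
    return composite[:-4], record_no


def split_document(
    real_key: bytes, body: bytes, chunk_size: int = CHUNK_SIZE
) -> Iterator[Tuple[bytes, bytes]]:
    """Yield consecutive (composite key, chunk) records for one document."""
    if not body:
        # An empty document still gets one record so it is not lost.
        yield make_key(real_key, 0), b""
        return
    for record_no, start in enumerate(range(0, len(body), chunk_size)):
        yield make_key(real_key, record_no), body[start:start + chunk_size]


def reassemble(
    records: Iterable[Tuple[bytes, bytes]]
) -> Iterator[Tuple[bytes, bytes]]:
    """Rebuild documents, assuming each document's records are consecutive and in order."""
    current_key: Optional[bytes] = None
    parts: List[bytes] = []
    for composite, value in records:
        real_key, _ = parse_key(composite)
        if real_key != current_key:
            if current_key is not None:
                yield current_key, b"".join(parts)
            current_key, parts = real_key, []
        parts.append(value)
    if current_key is not None:
        yield current_key, b"".join(parts)
```

A small document comes out as a single record with record number 0, so the common case costs only the four extra key bytes. The reader just concatenates values until the "real" part of the key changes.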

Much obliged,
John