Dennis,

Thanks for the detailed response. I need to play with the SequenceFile
format a bit -- I found the documentation for it on the wiki. I think
I could build on top of the format to handle storage of very large
documents. The vast majority of documents will fit into RAM and in a
standard HDFS block (64MB, or 128MB if I raise it). For very large
documents, I can split them into consecutive records in the
SequenceFile. I can overload the key to be a combination of a "real"
key and a record number... Shouldn't be too hard to extend
SequenceFile to do this.
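
A minimal sketch of the composite-key idea, in plain Python rather than against the real SequenceFile API. The names split_document, reassemble and CHUNK_SIZE are hypothetical, the records are ordinary tuples, and the 64MB chunk size simply mirrors the block size mentioned above. It assumes all records for one document are written consecutively and in order:

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple

# Hypothetical chunk size, chosen to match one 64MB HDFS block.
CHUNK_SIZE = 64 * 1024 * 1024

Record = Tuple[Tuple[str, int], bytes]


def split_document(key: str, data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[Record]:
    """Yield ((real_key, record_number), chunk) records covering data in order."""
    if not data:
        # An empty document still gets a single record so its key is not lost.
        yield (key, 0), b""
        return
    for record_number, offset in enumerate(range(0, len(data), chunk_size)):
        yield (key, record_number), data[offset:offset + chunk_size]


def reassemble(records: Iterable[Record]) -> Iterator[Tuple[str, bytes]]:
    """Join consecutive records that share the same real key back into one document."""
    for real_key, group in groupby(records, key=lambda rec: rec[0][0]):
        yield real_key, b"".join(chunk for _, chunk in group)
```

A small document yields exactly one record with record number 0, so the common case costs nothing beyond the extra key field. Note that reassemble buffers a whole document in memory; for documents too large for RAM, a reader would presumably stream the chunks instead.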

Much obliged,

John
