Doug Cutting wrote:
Johan Oskarsson wrote:
I'm considering serving data directly from the sequence file output of Hadoop jobs, since that would let me skip the conversion step from sequence file to another file format.

To do this efficiently I would need the data to be in one file.

I think it should be more efficient to keep things in separate files. If you use MapFileOutputFormat, there are methods to randomly access entries from job output:

http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/MapFileOutputFormat.html
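For example, against the old org.apache.hadoop.mapred API, a lookup could look like the sketch below. Text keys/values and the default HashPartitioner are assumptions on my part; substitute whatever types and partitioner your job actually used:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class MapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path outputDir = new Path(args[0]);   // the job's output directory
    Text key = new Text(args[1]);
    Text value = new Text();

    // Open one MapFile.Reader per part-NNNNN output directory.
    MapFile.Reader[] readers =
        MapFileOutputFormat.getReaders(FileSystem.get(conf), outputDir, conf);

    // getEntry picks the right reader with the same partitioner the job
    // used, then does an indexed lookup inside that MapFile.
    Text result = (Text) MapFileOutputFormat.getEntry(
        readers, new HashPartitioner<Text, Text>(), key, value);

    System.out.println(result == null ? "not found" : result.toString());
  }
}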

SequenceFileOutputFormat will also let you open all readers, but there's no random access, since a SequenceFile has no index.

http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/SequenceFileOutputFormat.html
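Scanning that output amounts to opening each reader in turn and reading records until the key you want appears, something like this (again assuming Text keys and values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class SequenceFileScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    SequenceFile.Reader[] readers =
        SequenceFileOutputFormat.getReaders(conf, new Path(args[0]));
    Text wanted = new Text(args[1]);
    Text key = new Text();
    Text value = new Text();
    for (SequenceFile.Reader reader : readers) {
      // next() returns false at end of file.
      while (reader.next(key, value)) {
        if (key.equals(wanted)) {
          System.out.println(value);
        }
      }
      reader.close();
    }
  }
}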

Will these suffice?

Doug

You're probably right that the best approach is to just leave the files as they are. I was mostly worried about hitting limits on the number of open files, but a quick calculation shows I had overestimated how many files we would have. I think we'd run into other problems before open files became an issue.

I have considered using MapFiles, but the key I'd look up on when serving would often differ from the key needed when calculating the data and when using it as input to other Hadoop programs. For example, if the key Writable is called UserResource, I might have to do serving-time lookups on just the user id. I was planning to build something similar to a MapFile, with the addition that I can specify which parts of the key to index on. And, just like a MapFile, it would be readable as a SequenceFile when used as input to other Hadoop programs.
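Roughly what I have in mind, sketched with plain Text keys of the form "userId/resource" instead of a real UserResource writable (the file layout and names here are only illustrative, not an existing Hadoop API): the data file stays an ordinary SequenceFile, and a side MapFile maps each user id to the byte offset of that user's first record, taken from SequenceFile.Writer.getLength().

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PartialKeyIndex {

  /** Write records sorted by user id, indexing each user's first record. */
  public static void write(Configuration conf, FileSystem fs, Path data,
      Path index, String[][] sortedRecords) throws IOException {
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, data, Text.class, Text.class);
    MapFile.Writer indexWriter = new MapFile.Writer(
        conf, fs, index.toString(), Text.class, LongWritable.class);
    String lastUser = null;
    for (String[] rec : sortedRecords) {   // rec = {userId, resource, value}
      String user = rec[0];
      if (!user.equals(lastUser)) {
        // getLength() returns a position that Reader.seek() accepts.
        indexWriter.append(new Text(user),
            new LongWritable(writer.getLength()));
        lastUser = user;
      }
      writer.append(new Text(user + "/" + rec[1]), new Text(rec[2]));
    }
    indexWriter.close();
    writer.close();
  }

  /** Look up all records for one user id at serving time. */
  public static void lookup(Configuration conf, FileSystem fs, Path data,
      Path index, String userId) throws IOException {
    MapFile.Reader indexReader = new MapFile.Reader(fs, index.toString(), conf);
    LongWritable pos = new LongWritable();
    if (indexReader.get(new Text(userId), pos) == null) {
      return;                              // unknown user id
    }
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    reader.seek(pos.get());                // jump to the user's first record
    Text key = new Text();
    Text value = new Text();
    // Records for one user are contiguous because the input was sorted.
    while (reader.next(key, value)
        && key.toString().startsWith(userId + "/")) {
      System.out.println(key + " = " + value);
    }
    reader.close();
    indexReader.close();
  }
}

Other jobs would simply read the data file as a normal SequenceFile and ignore the index.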

Currently we output everything as text in one big file and index that for serving. It's a simple fixed-width index that we use to look up the start position of the data for a user id.
This is of course a big waste of disk space and bandwidth/time.
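For reference, the lookup side of that index is essentially a binary search over fixed-width entries, something like the following (the field widths and entry layout here are made up for illustration):

import java.io.IOException;
import java.io.RandomAccessFile;

public class FixedWidthIndex {
  static final int ID_WIDTH = 12;                // zero-padded user id
  static final int OFFSET_WIDTH = 16;            // zero-padded byte offset
  static final int ENTRY_WIDTH = ID_WIDTH + OFFSET_WIDTH + 1;  // + newline

  /**
   * Binary-search the index for a user id (already zero-padded to
   * ID_WIDTH) and return the start offset in the data file, or -1.
   */
  public static long findOffset(RandomAccessFile index, String userId)
      throws IOException {
    long lo = 0, hi = index.length() / ENTRY_WIDTH - 1;
    byte[] entry = new byte[ENTRY_WIDTH];
    while (lo <= hi) {
      long mid = (lo + hi) / 2;
      index.seek(mid * ENTRY_WIDTH);             // entries are fixed width
      index.readFully(entry);
      String id = new String(entry, 0, ID_WIDTH, "US-ASCII");
      int cmp = id.compareTo(userId);
      if (cmp == 0) {
        return Long.parseLong(
            new String(entry, ID_WIDTH, OFFSET_WIDTH, "US-ASCII"));
      } else if (cmp < 0) {
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return -1;                                   // user id not in index
  }
}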

Thanks for taking the time to answer my questions.

/Johan
