Doug Cutting wrote:
Johan Oskarsson wrote:
I'm considering serving data directly from the sequence file output of Hadoop jobs, since that would let me skip the conversion step from sequence file to another file format.

To do this efficiently I would need the data to be in one file.

I think it should be more efficient to keep things in separate files. If you use MapFileOutputFormat, there are methods to randomly access entries from job output:

http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/MapFileOutputFormat.html
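For example, against the old org.apache.hadoop.mapred API, a lookup could look like the sketch below. Text keys/values and the default HashPartitioner are assumptions on my part; substitute whatever types and partitioner your job actually used:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class MapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path outputDir = new Path(args[0]);   // the job's output directory
    Text key = new Text(args[1]);
    Text value = new Text();

    // Open one MapFile.Reader per part-NNNNN output directory.
    MapFile.Reader[] readers =
        MapFileOutputFormat.getReaders(FileSystem.get(conf), outputDir, conf);

    // getEntry picks the right reader with the same partitioner the job
    // used, then does an indexed lookup inside that MapFile.
    Text result = (Text) MapFileOutputFormat.getEntry(
        readers, new HashPartitioner<Text, Text>(), key, value);

    System.out.println(result == null ? "not found" : result.toString());
  }
}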

SequenceFileOutputFormat will also let you open all readers, but there's no random access, since a SequenceFile has no index.

http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/SequenceFileOutputFormat.html
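Scanning that output amounts to opening each reader in turn and reading records until the key you want appears, something like this (again assuming Text keys and values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class SequenceFileScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    SequenceFile.Reader[] readers =
        SequenceFileOutputFormat.getReaders(conf, new Path(args[0]));
    Text wanted = new Text(args[1]);
    Text key = new Text();
    Text value = new Text();
    for (SequenceFile.Reader reader : readers) {
      // next() returns false at end of file.
      while (reader.next(key, value)) {
        if (key.equals(wanted)) {
          System.out.println(value);
        }
      }
      reader.close();
    }
  }
}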

Will these suffice?

Doug

You're probably right that the best approach is to just leave the files as they are. I was mostly worried about hitting limits on the number of open files, but a quick calculation shows I had overestimated how many files we would have. I think we'd run into other problems before open files became an issue.

I have considered using MapFiles, but the key I'd look up on when serving would often differ from the key needed when calculating the data and when using it as input to other Hadoop programs. For example, if the key Writable is called UserResource, I might have to do serving-time lookups on just the user id. I was planning to build something similar to a MapFile, with the addition that I can specify which parts of the key to index on. And, just like a MapFile, it would be readable as a SequenceFile when used as input to other Hadoop programs.
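Roughly what I have in mind, sketched with plain Text keys of the form "userId/resource" instead of a real UserResource writable (the file layout and names here are only illustrative, not an existing Hadoop API): the data file stays an ordinary SequenceFile, and a side MapFile maps each user id to the byte offset of that user's first record, taken from SequenceFile.Writer.getLength().

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PartialKeyIndex {

  /** Write records sorted by user id, indexing each user's first record. */
  public static void write(Configuration conf, FileSystem fs, Path data,
      Path index, String[][] sortedRecords) throws IOException {
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, data, Text.class, Text.class);
    MapFile.Writer indexWriter = new MapFile.Writer(
        conf, fs, index.toString(), Text.class, LongWritable.class);
    String lastUser = null;
    for (String[] rec : sortedRecords) {   // rec = {userId, resource, value}
      String user = rec[0];
      if (!user.equals(lastUser)) {
        // getLength() returns a position that Reader.seek() accepts.
        indexWriter.append(new Text(user),
            new LongWritable(writer.getLength()));
        lastUser = user;
      }
      writer.append(new Text(user + "/" + rec[1]), new Text(rec[2]));
    }
    indexWriter.close();
    writer.close();
  }

  /** Look up all records for one user id at serving time. */
  public static void lookup(Configuration conf, FileSystem fs, Path data,
      Path index, String userId) throws IOException {
    MapFile.Reader indexReader = new MapFile.Reader(fs, index.toString(), conf);
    LongWritable pos = new LongWritable();
    if (indexReader.get(new Text(userId), pos) == null) {
      return;                              // unknown user id
    }
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
    reader.seek(pos.get());                // jump to the user's first record
    Text key = new Text();
    Text value = new Text();
    // Records for one user are contiguous because the input was sorted.
    while (reader.next(key, value)
        && key.toString().startsWith(userId + "/")) {
      System.out.println(key + " = " + value);
    }
    reader.close();
    indexReader.close();
  }
}

Other jobs would simply read the data file as a normal SequenceFile and ignore the index.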

Currently we output everything as text in one big file and index that for serving. It's a simple fixed-width index that we use to look up the start position of the data for a user id.
This is of course a big waste of disk space and bandwidth/time.
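For reference, the lookup side of that index is essentially a binary search over fixed-width entries, something like the following (the field widths and entry layout here are made up for illustration):

import java.io.IOException;
import java.io.RandomAccessFile;

public class FixedWidthIndex {
  static final int ID_WIDTH = 12;                // zero-padded user id
  static final int OFFSET_WIDTH = 16;            // zero-padded byte offset
  static final int ENTRY_WIDTH = ID_WIDTH + OFFSET_WIDTH + 1;  // + newline

  /**
   * Binary-search the index for a user id (already zero-padded to
   * ID_WIDTH) and return the start offset in the data file, or -1.
   */
  public static long findOffset(RandomAccessFile index, String userId)
      throws IOException {
    long lo = 0, hi = index.length() / ENTRY_WIDTH - 1;
    byte[] entry = new byte[ENTRY_WIDTH];
    while (lo <= hi) {
      long mid = (lo + hi) / 2;
      index.seek(mid * ENTRY_WIDTH);             // entries are fixed width
      index.readFully(entry);
      String id = new String(entry, 0, ID_WIDTH, "US-ASCII");
      int cmp = id.compareTo(userId);
      if (cmp == 0) {
        return Long.parseLong(
            new String(entry, ID_WIDTH, OFFSET_WIDTH, "US-ASCII"));
      } else if (cmp < 0) {
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return -1;                                   // user id not in index
  }
}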

Thanks for taking the time to answer my questions.

/Johan
