Dennis Kubes wrote:
John Howland wrote:
I've been reading up on Hadoop for a while now and I'm excited that I'm finally getting my feet wet with the examples + my own variations. If anyone could answer any of the following questions, I'd greatly appreciate it.

1. I'm processing document collections, with the number of documents ranging from 10,000 - 10,000,000. What is the best way to store this data for effective processing?
AFAIK Hadoop can handle a large number of small files, but it doesn't do well with them. So it would be better to read in the documents and store them in SequenceFile or MapFile format, similar to the way the Fetcher works in Nutch. 10M documents in a sequence/map file on DFS is comparatively small and can be handled efficiently.
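Here is a minimal sketch of that kind of ingest, using the SequenceFile API directly; the path and the choice of Text for both the key (document id) and the value (body) are just assumptions for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DocIngest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // One big SequenceFile of (docId, body) records instead of millions of small files.
    Path out = new Path("/data/docs.seq");   // hypothetical location on DFS
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      // In a real ingest you would loop over your source documents here.
      writer.append(new Text("doc-0001"), new Text("example body text"));
    } finally {
      writer.close();
    }
  }
}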
- The bodies of the documents usually range from 1K-100KB in size, but some outliers can be as big as 4-5GB.
I would say store your document bodies as Text objects. I'm not sure if Text has a max size; I think it does, but I'm not sure what it is. If it does, you can always store the body as a BytesWritable, which is just an array of bytes. But you are going to have memory issues reading in and writing out records that large.
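If you do go the BytesWritable route, it is just a wrapper around a byte array, so the whole value still has to fit in memory when it is read or written. A tiny sketch (where the bytes come from is up to you; the local file here is made up):

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.io.BytesWritable;

public class BytesExample {
  public static void main(String[] args) throws Exception {
    // Wrap the raw document bytes; the entire array is held in memory,
    // which is exactly the concern for the multi-GB outliers.
    byte[] body = Files.readAllBytes(new File("/tmp/some-doc.bin").toPath());
    BytesWritable value = new BytesWritable(body);
    System.out.println("value length = " + value.getLength());
  }
}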
- I will also need to store some metadata for each document, which I figure could be stored as JSON or XML.
- I'll typically filter on the metadata and then do standard operations on the bodies, like word frequency and searching.
It is possible to create an OutputFormat that writes out multiple
files. You could also use a MapWritable as the value to store the
document and associated metadata.
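A rough sketch of the MapWritable idea; the field names ("body", "author", "date") are made up for illustration:

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

public class DocRecord {
  // Build one value that carries the body plus its metadata fields together,
  // so a map task sees everything about a document in a single record.
  public static MapWritable toRecord(String body, String author, String date) {
    MapWritable record = new MapWritable();
    record.put(new Text("body"), new Text(body));
    record.put(new Text("author"), new Text(author));
    record.put(new Text("date"), new Text(date));
    return record;
  }
}

Your mapper can then pull out the metadata fields it wants to filter on and only touch the body when a record passes the filter.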
Is there a canned FileInputFormat that makes sense? Should I roll my own? How can I access the bodies as streams so I don't have to read them into RAM all at once?
A Writable is read into RAM, so even treating it like a stream doesn't get around that.

One thing you might want to consider is to tar up, say, X documents at a time and store each tar as a file in DFS. You would have many of these files. Then keep an index of the files, their offsets, and their keys (document ids). That index can be passed as input into a MR job, which can then go to DFS and stream the file out as it needs it. The job will be slower doing it this way, but it is one way to handle such large documents as streams.
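A sketch of the streaming side of that idea; the index record layout (bundle path plus byte offset) is an assumption about how you build the index, not an existing Hadoop facility:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TarStreamer {
  // Given a (bundle path, offset) pair from the index, stream the document
  // starting at that offset instead of materializing it as one huge Writable.
  public static void streamDocument(Configuration conf, String bundle, long offset)
      throws Exception {
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(new Path(bundle));
    try {
      in.seek(offset);                 // jump to this document inside the tar
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) > 0) {
        // process buf[0..n) here; stop once you've consumed the document's length
      }
    } finally {
      in.close();
    }
  }
}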
Am I right in thinking that I should treat each document as a record and map across them, or do I need to be more creative in what I'm mapping across?
2. Some of the tasks I want to run are pure map operations (no reduction), where I'm calculating new metadata fields on each document. To end up with a good result set, I'll need to copy the entire input record + new fields into another set of output files. Is there a better way? I haven't wanted to go down the HBase road because it can't handle very large values (for the bodies), and it seems to make the most sense to keep the document bodies together with the metadata, to allow for the greatest locality of reference on the datanodes.
If you don't specify a reducer, the IdentityReducer is run, which simply passes the map output through. You can also set the number of reducers to zero, in which case no reduce phase takes place at all.
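A minimal map-only job sketch with the JobConf API; the job name, paths, and the use of IdentityMapper (swap in your own mapper that adds the new fields) are placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class AddMetadataJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(AddMetadataJob.class);
    conf.setJobName("add-metadata");
    conf.setMapperClass(IdentityMapper.class);  // replace with a mapper that copies the record + new fields
    conf.setNumReduceTasks(0);                  // map-only: no sort, shuffle, or reduce phase
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path("/data/docs.seq"));
    FileOutputFormat.setOutputPath(conf, new Path("/data/docs-with-metadata"));
    JobClient.runJob(conf);
  }
}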
3. I'm sure this is not a new idea, but I haven't seen anything regarding it... I'll need to run several MR jobs as a pipeline... is there any way for the map tasks in a subsequent stage to begin processing data from a previous stage's reduce task before that reducer has fully finished?
Yup, just use FileOutputFormat.getOutputPath(previousJobConf);
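For what it's worth, a sketch of chaining two jobs that way, run one after the other; the job names and paths are placeholders, and the mapper/reducer setup for each stage is omitted:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class Pipeline {
  public static void main(String[] args) throws Exception {
    JobConf first = new JobConf(Pipeline.class);
    first.setJobName("stage-1");
    FileInputFormat.setInputPaths(first, new Path("/data/docs.seq"));
    FileOutputFormat.setOutputPath(first, new Path("/data/stage-1-out"));
    JobClient.runJob(first);   // blocks until stage 1 completes

    JobConf second = new JobConf(Pipeline.class);
    second.setJobName("stage-2");
    // Feed stage 2 from wherever stage 1 wrote its output.
    FileInputFormat.setInputPaths(second, FileOutputFormat.getOutputPath(first));
    FileOutputFormat.setOutputPath(second, new Path("/data/stage-2-out"));
    JobClient.runJob(second);
  }
}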
Dennis
Whatever insight folks could lend me would be a big help in crossing the
chasm from the Word Count and associated examples to something more
"real".
A whole heap of thanks in advance,
John