Dennis Kubes wrote:
John Howland wrote:
I've been reading up on Hadoop for a while now and I'm excited that I'm finally getting my feet wet with the examples + my own variations. If anyone could answer any of the following questions, I'd greatly appreciate it.

1. I'm processing document collections, with the number of documents ranging from 10,000 - 10,000,000. What is the best way to store this data for effective processing?
AFAIK Hadoop can handle a large number of small files, but it doesn't do well with them. So it would be better to read in the documents and store them in SequenceFile or MapFile format, similar to the way the Fetcher works in Nutch. 10M documents in a sequence/map file on DFS is comparatively small and can be handled efficiently.
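Here is a minimal sketch of that kind of ingest, using the SequenceFile API directly; the path and the choice of Text for both the key (document id) and the value (body) are just assumptions for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DocIngest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // One big SequenceFile of (docId, body) records instead of millions of small files.
    Path out = new Path("/data/docs.seq");   // hypothetical location on DFS
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      // In a real ingest you would loop over your source documents here.
      writer.append(new Text("doc-0001"), new Text("example body text"));
    } finally {
      writer.close();
    }
  }
}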
- The bodies of the documents usually range from 1K-100KB in size, but some outliers can be as big as 4-5GB.
I would say store your document bodies as Text objects. I'm not sure if Text has a max size; I think it does, but I'm not sure what it is. If it does, you can always store the body as a BytesWritable, which is just an array of bytes. But you are going to have memory issues reading in and writing out records that large.
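If you do go the BytesWritable route, it is just a wrapper around a byte array, so the whole value still has to fit in memory when it is read or written. A tiny sketch (where the bytes come from is up to you; the local file here is made up):

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.io.BytesWritable;

public class BytesExample {
  public static void main(String[] args) throws Exception {
    // Wrap the raw document bytes; the entire array is held in memory,
    // which is exactly the concern for the multi-GB outliers.
    byte[] body = Files.readAllBytes(new File("/tmp/some-doc.bin").toPath());
    BytesWritable value = new BytesWritable(body);
    System.out.println("value length = " + value.getLength());
  }
}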
- I will also need to store some metadata for each document, which I figure could be stored as JSON or XML.
- I'll typically filter on the metadata and then do standard operations on the bodies, like word frequency and searching.
It is possible to create an OutputFormat that writes out multiple
files. You could also use a MapWritable as the value to store the
document and associated metadata.
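A rough sketch of the MapWritable idea; the field names ("body", "author", "date") are made up for illustration:

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

public class DocRecord {
  // Build one value that carries the body plus its metadata fields together,
  // so a map task sees everything about a document in a single record.
  public static MapWritable toRecord(String body, String author, String date) {
    MapWritable record = new MapWritable();
    record.put(new Text("body"), new Text(body));
    record.put(new Text("author"), new Text(author));
    record.put(new Text("date"), new Text(date));
    return record;
  }
}

Your mapper can then pull out the metadata fields it wants to filter on and only touch the body when a record passes the filter.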
Is there a canned FileInputFormat that makes sense? Should I roll my own? How can I access the bodies as streams so I don't have to read them into RAM all at once?
A Writable is read into RAM, so even treating it like a stream doesn't get around that.

One thing you might want to consider is to tar up, say, X documents at a time and store each tar as a file in DFS. You would have many of these files. Then keep an index of the files, their offsets, and their keys (document ids). That index can be passed as input into a MR job, which can then go to DFS and stream the file out as it needs it. The job will be slower doing it this way, but it is one way to handle such large documents as streams.
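A sketch of the streaming side of that idea; the index record layout (bundle path plus byte offset) is an assumption about how you build the index, not an existing Hadoop facility:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TarStreamer {
  // Given a (bundle path, offset) pair from the index, stream the document
  // starting at that offset instead of materializing it as one huge Writable.
  public static void streamDocument(Configuration conf, String bundle, long offset)
      throws Exception {
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(new Path(bundle));
    try {
      in.seek(offset);                 // jump to this document inside the tar
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) > 0) {
        // process buf[0..n) here; stop once you've consumed the document's length
      }
    } finally {
      in.close();
    }
  }
}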
Am I right in thinking that I should treat each document as a record and map across them, or do I need to be more creative in what I'm mapping across?
2. Some of the tasks I want to run are pure map operations (no reduction), where I'm calculating new metadata fields on each document. To end up with a good result set, I'll need to copy the entire input record + new fields into another set of output files. Is there a better way? I haven't wanted to go down the HBase road because it can't handle very large values (for the bodies), and it seems to make the most sense to keep the document bodies together with the metadata, to allow for the greatest locality of reference on the datanodes.
If you don't specify a reducer, the IdentityReducer is run, which simply passes the map output through. You can also set the number of reducers to zero, in which case no reduce phase takes place at all.
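A minimal map-only job sketch with the JobConf API; the job name, paths, and the use of IdentityMapper (swap in your own mapper that adds the new fields) are placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class AddMetadataJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(AddMetadataJob.class);
    conf.setJobName("add-metadata");
    conf.setMapperClass(IdentityMapper.class);  // replace with a mapper that copies the record + new fields
    conf.setNumReduceTasks(0);                  // map-only: no sort, shuffle, or reduce phase
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path("/data/docs.seq"));
    FileOutputFormat.setOutputPath(conf, new Path("/data/docs-with-metadata"));
    JobClient.runJob(conf);
  }
}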
3. I'm sure this is not a new idea, but I haven't seen anything regarding it... I'll need to run several MR jobs as a pipeline... is there any way for the map tasks in a subsequent stage to begin processing data from a previous stage's reduce task before that reducer has fully finished?
Yup, just use FileOutputFormat.getOutputPath(previousJobConf);
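For what it's worth, a sketch of chaining two jobs that way, run one after the other; the job names and paths are placeholders, and the mapper/reducer setup for each stage is omitted:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class Pipeline {
  public static void main(String[] args) throws Exception {
    JobConf first = new JobConf(Pipeline.class);
    first.setJobName("stage-1");
    FileInputFormat.setInputPaths(first, new Path("/data/docs.seq"));
    FileOutputFormat.setOutputPath(first, new Path("/data/stage-1-out"));
    JobClient.runJob(first);   // blocks until stage 1 completes

    JobConf second = new JobConf(Pipeline.class);
    second.setJobName("stage-2");
    // Feed stage 2 from wherever stage 1 wrote its output.
    FileInputFormat.setInputPaths(second, FileOutputFormat.getOutputPath(first));
    FileOutputFormat.setOutputPath(second, new Path("/data/stage-2-out"));
    JobClient.runJob(second);
  }
}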
Dennis
Whatever insight folks could lend me would be a big help in crossing the
chasm from the Word Count and associated examples to something more
"real".
A whole heap of thanks in advance,
John