David Pollak wrote:
On Nov 8, 2006, at 12:03 PM, Dennis Kubes wrote:
David Pollak wrote:
On Nov 8, 2006, at 7:41 AM, Dennis Kubes wrote:
I don't know if I completely understand what you are asking but let
me try to answer your questions.
David Pollak wrote:
Howdy,
Is there a way to store "by-product" data someplace where it can
be read? For example, as I'm iterating over a collection of
documents, I want to generate some statistics about the
collection, put those stats "someplace" that can be accessed
during future map-reduce cycles. Should I simply run a "faux"
map-reduce cycle to count the information and store it in a known
location in the DFS?
Usually you would run a MapReduce job to store intermediate results
and then another job to process aggregated or final results.
Sometimes this can be done in a single job, sometimes not. Take a
look at the Grep or WordCount examples that ship with Hadoop.
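For instance, chaining two jobs is really just two JobConf/JobClient.runJob
calls back to back; roughly something like the sketch below, where the
mapper/reducer classes and the paths are only placeholders (and the exact
input/output setters have moved around a bit between early releases):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    // First job: extract and count, writing intermediate results to the DFS.
    JobConf first = new JobConf(ChainedJobs.class);
    first.setJobName("extract-counts");
    first.setMapperClass(ExtractMapper.class);      // placeholder mapper
    first.setReducerClass(CountReducer.class);      // placeholder reducer
    first.setOutputKeyClass(Text.class);
    first.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(first, new Path("documents"));  // placeholder paths
    FileOutputFormat.setOutputPath(first, new Path("counts"));
    JobClient.runJob(first);       // blocks until the first job finishes

    // Second job: read the first job's output and produce the final results.
    JobConf second = new JobConf(ChainedJobs.class);
    second.setJobName("aggregate");
    second.setMapperClass(AggregateMapper.class);   // placeholder mapper
    second.setReducerClass(AggregateReducer.class); // placeholder reducer
    second.setOutputKeyClass(Text.class);
    second.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(second, new Path("counts"));
    FileOutputFormat.setOutputPath(second, new Path("final"));
    JobClient.runJob(second);
  }
}

Since runJob blocks, the second job only starts once the intermediate
output is sitting in the DFS where it can be read.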
Yep. I'm able to chain jobs together. In one case, I am counting
URLs and Noun Phrases for documents retrieved during a certain run.
In order to normalize the URLs and NP counts, I want to divide by
the total number of URLs or NPs for that time period. I seem to
have 2 choices:
1 - I can aggregate the counts during the Map/Reduce task that culls
the URLs and NPs
2 - I can run another Map/Reduce task on the URL and NP sets to
count the number of documents.
If I do the latter, it's another iteration over the data set, which
seems expensive. Is #2 the best choice?
If you can aggregate them in a single job I would do that. On a
second job it would have to split, copy, process (even if it is only
an IdentityMapper), and more importantly sort. Sorting time,
depending on the amount of data, could be substantial. Usually it is
not; it's just something to consider. I would use the second-job
approach if you needed to aggregate with other data to perform the
counts. For example, you count words in one job, noun phrases in
another, and then aggregate both by URL in a third.
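A rough sketch of the single-job route: have the map emit (URL or NP, 1)
and let a combiner pre-sum on the map side so there is far less data to
sort and copy, with the reducer producing the per-key totals in the same
pass. The mapper class and paths below are placeholders; LongSumReducer is
the stock summing reducer in the mapred lib package.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;

public class SingleJobCounts {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(SingleJobCounts.class);
    job.setJobName("url-np-counts");
    job.setMapperClass(UrlNpMapper.class);        // placeholder: emits (url or NP, 1)
    job.setCombinerClass(LongSumReducer.class);   // pre-sum map output, less to sort/copy
    job.setReducerClass(LongSumReducer.class);    // final per-key totals
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(job, new Path("segments"));  // placeholder paths
    FileOutputFormat.setOutputPath(job, new Path("counts"));
    JobClient.runJob(job);
  }
}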
I think I have a solution. If I have a "count" key, each reduce task
can aggregate the document-count information under that key. If I
write the information out to a MapFile, I can then look up the value
for the key. If I give the key a value that will not occur in
"nature" (e.g., "\0count\0") then there's no likelihood of mixing
with a URL or NP. Does this sound reasonable
or am I smoking crack?
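Roughly, the read-back I have in mind would be something like this,
assuming the counting job wrote its output with MapFileOutputFormat and
stored the total as a LongWritable (the path here is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class CountLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Each reduce output is a MapFile directory when MapFileOutputFormat is used.
    MapFile.Reader reader = new MapFile.Reader(fs, "counts/part-00000", conf);
    try {
      // Look up the sentinel key, which should never collide with a real URL or NP.
      LongWritable total = new LongWritable();
      if (reader.get(new Text("\0count\0"), total) != null) {
        System.out.println("total = " + total.get());
      }
    } finally {
      reader.close();
    }
  }
}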
Well, each split is processed independently of the others, so I don't
know how you would get the count key to begin with; or at least your
count keys would have duplicates, assuming each split starts at 0 or 1
and you use a MapRunner to pass a shared variable into each Map task.
That being said, you may want the key to be the same URL or NP if you
are aggregating it later. To answer your question, though: yes, a value
that is not the same as a URL or NP would not mix in terms of the
reduce values, but you probably wouldn't want to use those same keys in
a job that has URLs or NPs as keys.
Is there a way to map a collection of words or documents to
associated numbers so that indexing could be based on the word
number and/or document number rather than the actual word and actual
URL? Because the reduce tasks take place in separate processes,
it seems that there's no way to coordinate the ordinal counting.
If you are talking about index document ids, then you would need to
read the index and map URL to document id, and then a second job
would map id to whatever else by URL. If you want to count each word
globally across all tasks and splits, you can coordinate within a
split by using a MapRunner, but across splits I don't know of a way
to do that.
Yep... I'm looking to generate a unique ID across all the MR tasks.
Basically, I want a file that looks like:
apple 1
beta 2
cat 3
dog 4
moose 5
....
Is there a final merge task that merges all the reductions
together? If so, perhaps I could do the count in the final merge.
Any idea if the final merge is accessible?
Thanks,
David
You might be able to do something like that, but it would have to be
split-independent because the ordinal processing would be unknown, or
rather parallel. There isn't a final merge task, but you could use
the output of the merges as the input to another job and set the
number of reduce tasks to 1. Maybe something like mapping word and
processing time, using a string-formatted processing time like
YYYYMMDDHHMISS as the key of the first output; in the second job the
records would be ordered by that key and you could output a number in
the final reduce.
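Wiring the second job up would look roughly like this; NumberingReducer
is a placeholder for whatever emits the word/number pairs, and I'm
assuming the first job wrote a SequenceFile of (time, word) pairs so an
IdentityMapper can pass them straight through:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class NumberingJob {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(NumberingJob.class);
    job.setJobName("number-words");
    // Input is the output of the first job: (YYYYMMDDHHMISS, word) pairs.
    job.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(job, new Path("word-times"));   // placeholder path
    FileOutputFormat.setOutputPath(job, new Path("word-ids"));    // placeholder path
    job.setMapperClass(IdentityMapper.class);      // nothing to do on the map side
    job.setReducerClass(NumberingReducer.class);   // hypothetical: emits (word, number)
    job.setNumReduceTasks(1);                      // single reduce task, single part-00000
    job.setMapOutputKeyClass(Text.class);          // the time string the sort runs on
    job.setMapOutputValueClass(Text.class);        // the word
    job.setOutputKeyClass(Text.class);             // the word
    job.setOutputValueClass(LongWritable.class);   // its assigned number
    JobClient.runJob(job);
  }
}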
What if I set NumReduceTasks to 1? Would that mean there would be
1 reduce task for the whole data set and I could rely on an instance
variable in the reduce class for incrementing?
Thanks,
David
Setting reduce tasks to one in the final job just means that there is a
single split and a single output file, part-00000, as opposed to
part-00000 through part-xxxxx; but no, you couldn't rely on an instance
variable in the reduce class. Each split (TaskRunner) runs in its own
VM. If you were doing the processing inside the map tasks of the
final job, assuming 1 reduce task and hence 1 split, you could use an
instance variable on a custom MapRunner that is passed into each
Map task. It would have to be a reference (an object), though, for the
changes to be seen by other Map tasks in the same split.
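Very roughly, and written against the classic mapred API as it looks in
later releases (all the class names here are made up, and the count is
still only unique within its split), the idea is something like:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical runner that owns a shared counter object and hands the same
// reference to the mapper it drives, so every map() call in this split sees
// the updates. The count is only unique within the split, not across the job.
public class SharedCounterRunner implements MapRunnable<LongWritable, Text, Text, LongWritable> {

  // Hypothetical mapper that numbers records using the shared counter.
  public static class OrdinalMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {

    private long[] counter;

    void setCounter(long[] counter) { this.counter = counter; }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      counter[0]++;                       // visible to every other call in this split
      output.collect(new Text(value.toString()), new LongWritable(counter[0]));
    }
  }

  private final long[] counter = new long[] { 0 };   // the shared reference
  private JobConf job;

  public void configure(JobConf job) { this.job = job; }

  public void run(RecordReader<LongWritable, Text> input,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    OrdinalMapper mapper = new OrdinalMapper();
    mapper.configure(job);
    mapper.setCounter(counter);           // same object for the life of the split
    LongWritable key = input.createKey();
    Text value = input.createValue();
    try {
      while (input.next(key, value)) {
        mapper.map(key, value, output, reporter);
      }
    } finally {
      mapper.close();
    }
  }
}

You would wire the runner in with job.setMapRunnerClass(SharedCounterRunner.class).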
Dennis