On Nov 8, 2006, at 7:41 AM, Dennis Kubes wrote:
I don't know if I completely understand what you are asking, but let
me try to answer your questions.
David Pollak wrote:
Howdy,
Is there a way to store "by-product" data someplace where it can
be read? For example, as I'm iterating over a collection of
documents, I want to generate some statistics about the
collection and put those stats "someplace" that can be accessed
during future map-reduce cycles. Should I simply run a "faux" map-
reduce cycle to count the information and store it in a known
location in the DFS?
Usually you would run a MapReduce job to store intermediate results
and then another job to process aggregated or final results.
Sometimes this can be done in a single job, sometimes not. Take a
look at the Hadoop Grep or WordCount examples for sample jobs.
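A bare-bones driver that chains two jobs, writing the first job's
output to a known DFS path and reading it back in the second, might
look roughly like this (paths and job names are made up, and the
path-setting calls have shifted a bit between Hadoop releases):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Rough sketch of chaining two jobs: the first writes its counts to a
// known DFS path, the second reads that path back as its input.
public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    Path docs = new Path("docs");              // raw documents (made-up path)
    Path counts = new Path("counts");          // intermediate "by-product" data
    Path normalized = new Path("normalized");  // final output

    JobConf countJob = new JobConf(ChainedJobs.class);
    countJob.setJobName("count");
    // countJob.setMapperClass(...);  countJob.setReducerClass(...);
    FileInputFormat.setInputPaths(countJob, docs);
    FileOutputFormat.setOutputPath(countJob, counts);
    JobClient.runJob(countJob);                // blocks until the first job finishes

    JobConf normalizeJob = new JobConf(ChainedJobs.class);
    normalizeJob.setJobName("normalize");
    // normalizeJob.setMapperClass(...);  normalizeJob.setReducerClass(...);
    FileInputFormat.setInputPaths(normalizeJob, counts);
    FileOutputFormat.setOutputPath(normalizeJob, normalized);
    JobClient.runJob(normalizeJob);
  }
}

JobClient.runJob blocks, so the second job only starts once the
intermediate counts are sitting in DFS.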
Yep. I'm able to chain jobs together. In one case, I am counting
URLs and Noun Phrases for documents retrieved during a certain run.
In order to normalize the URL and NP counts, I want to divide by the
total number of URLs or NPs for that time period. I seem to have 2
choices:
1 - I can aggregate the counts during the Map/Reduce task that culls
the URLs and NPs (roughly sketched below)
2 - I can run another Map/Reduce task on the URL and NP sets to count
the number of documents.
If I do the latter, that's another iteration over the data set, which
seems expensive. Is #2 the best choice?
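For choice 1, what I have in mind is roughly this (class and key
names made up, assuming the input is one URL per line): the mapper
emits a marker key alongside each URL, so the reducer that sums the
per-URL counts also produces the grand total in the same pass.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of choice 1: every URL also contributes to a hypothetical
// marker key, so the total comes out of the same job as the counts.
public class UrlCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final Text TOTAL = new Text("__TOTAL_URLS__");  // made-up marker key
  private static final LongWritable ONE = new LongWritable(1);

  public void map(LongWritable offset, Text url,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    output.collect(url, ONE);      // per-URL count
    output.collect(TOTAL, ONE);    // contributes to the overall total
  }
}

The reducer would just sum the values for each key, and the
__TOTAL_URLS__ record would give me the divisor for normalization.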
Is there a way to map a collection of words or documents to
associated numbers so that indexing could be based on the word
number and/or document number rather than the actual word and actual
URL? Because the reduce tasks take place in separate processes,
it seems that there's no way to coordinate the ordinal counting.
If you are talking about the index document id, then you would need
to read the index and map url to document id, and then a second job
would map id to whatever else by url. If you want to count each word
globally across all tasks and splits, you can coordinate it within a
split by using a MapRunner, but across splits I don't know of a way
to do that.
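For the within-split case, a MapRunnable along these lines (names
made up, written against a later mapred API than you may be running)
keeps one counter across every record in a split; it gets plugged in
with conf.setMapRunnerClass(OrdinalMapRunner.class):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Sketch: a MapRunnable drives the whole split in one loop, so a local
// counter covers every record in the split.  The ordinals are only
// unique within this split, not across splits.
public class OrdinalMapRunner
    implements MapRunnable<LongWritable, Text, Text, LongWritable> {

  public void configure(JobConf job) {
  }

  public void run(RecordReader<LongWritable, Text> input,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    long ordinal = 0;                          // counter shared across the split
    LongWritable key = input.createKey();
    Text value = input.createValue();
    while (input.next(key, value)) {
      output.collect(new Text(value.toString()), new LongWritable(++ordinal));
      reporter.progress();                     // keep the task from timing out
    }
  }
}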
Yep... I'm looking to generate IDs that are unique across all the MR
tasks.
Basically, I want a file that looks like:
apple 1
beta 2
cat 3
dog 4
moose 5
....
Is there a final merge task that merges all the reductions together?
If so, perhaps I could do the count in the final merge. Any idea if
the final merge is accessible?
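If there is no such hook, the fallback I can picture (purely an
untested sketch, names made up) is forcing a single reduce task with
conf.setNumReduceTasks(1), so one reducer sees every distinct word in
sorted order and can hand out sequential IDs:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch: with a single reduce task, one reducer sees every distinct
// word in sorted order, so a running counter yields globally unique
// ordinals (apple 1, beta 2, cat 3, ...).
public class OrdinalReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  private long nextId = 0;   // lives for the whole (single) reduce task

  public void reduce(Text word, Iterator<LongWritable> counts,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    output.collect(word, new LongWritable(++nextId));
  }
}

The obvious downside is that the single reducer becomes a bottleneck
for a large vocabulary.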
Thanks,
David
Dennis
Thanks,
David