On Nov 8, 2006, at 7:41 AM, Dennis Kubes wrote:
I don't know if I completely understand what you are asking, but let
me try to answer your questions.
David Pollak wrote:
Howdy,
Is there a way to store "by-product" data someplace where it can
be read? For example, as I'm iterating over a collection of
documents, I want to generate some statistics about the
collection and put those stats "someplace" that can be accessed
during future map-reduce cycles. Should I simply run a "faux" map-
reduce cycle to count the information and store it in a known
location in the DFS?
Usually you would run a MapReduce job to store intermediate results
and then another job to process aggregated or final results.
Sometimes this can be done in a single job, sometimes not. Take a
look at the Hadoop Grep or WordCount examples for sample jobs.
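A bare-bones driver that chains two jobs, writing the first job's
output to a known DFS path and reading it back in the second, might
look roughly like this (paths and job names are made up, and the
path-setting calls have shifted a bit between Hadoop releases):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Rough sketch of chaining two jobs: the first writes its counts to a
// known DFS path, the second reads that path back as its input.
public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    Path docs = new Path("docs");              // raw documents (made-up path)
    Path counts = new Path("counts");          // intermediate "by-product" data
    Path normalized = new Path("normalized");  // final output

    JobConf countJob = new JobConf(ChainedJobs.class);
    countJob.setJobName("count");
    // countJob.setMapperClass(...);  countJob.setReducerClass(...);
    FileInputFormat.setInputPaths(countJob, docs);
    FileOutputFormat.setOutputPath(countJob, counts);
    JobClient.runJob(countJob);                // blocks until the first job finishes

    JobConf normalizeJob = new JobConf(ChainedJobs.class);
    normalizeJob.setJobName("normalize");
    // normalizeJob.setMapperClass(...);  normalizeJob.setReducerClass(...);
    FileInputFormat.setInputPaths(normalizeJob, counts);
    FileOutputFormat.setOutputPath(normalizeJob, normalized);
    JobClient.runJob(normalizeJob);
  }
}

JobClient.runJob blocks, so the second job only starts once the
intermediate counts are sitting in DFS.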
Yep. I'm able to chain jobs together. In one case, I am counting
URLs and Noun Phrases for documents retrieved during a certain run.
In order to normalize the URL and NP counts, I want to divide by the
total number of URLs or NPs for that time period. I seem to have 2
choices:
1 - I can aggregate the counts during the Map/Reduce task that culls
the URLs and NPs (roughly sketched below)
2 - I can run another Map/Reduce task on the URL and NP sets to count
the number of documents.
If I do the latter, that's another iteration over the data set, which
seems expensive. Is #2 the best choice?
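For choice 1, what I have in mind is roughly this (class and key
names made up, assuming the input is one URL per line): the mapper
emits a marker key alongside each URL, so the reducer that sums the
per-URL counts also produces the grand total in the same pass.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of choice 1: every URL also contributes to a hypothetical
// marker key, so the total comes out of the same job as the counts.
public class UrlCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final Text TOTAL = new Text("__TOTAL_URLS__");  // made-up marker key
  private static final LongWritable ONE = new LongWritable(1);

  public void map(LongWritable offset, Text url,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    output.collect(url, ONE);      // per-URL count
    output.collect(TOTAL, ONE);    // contributes to the overall total
  }
}

The reducer would just sum the values for each key, and the
__TOTAL_URLS__ record would give me the divisor for normalization.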
Is there a way to map a collection of words or documents to
associated numbers so that indexing could be based on the word
number and/or document number rather than the actual word and actual
URL? Because the reduce tasks take place in separate processes,
it seems that there's no way to coordinate the ordinal counting.
If you are talking about the index document id, then you would need
to read the index and map url to document id, and then a second job
would map id to whatever else by url. If you want to count each word
globally across all tasks and splits, you can coordinate it within a
split by using a MapRunner, but across splits I don't know of a way
to do that.
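For the within-split case, a MapRunnable along these lines (names
made up, written against a later mapred API than you may be running)
keeps one counter across every record in a split; it gets plugged in
with conf.setMapRunnerClass(OrdinalMapRunner.class):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Sketch: a MapRunnable drives the whole split in one loop, so a local
// counter covers every record in the split.  The ordinals are only
// unique within this split, not across splits.
public class OrdinalMapRunner
    implements MapRunnable<LongWritable, Text, Text, LongWritable> {

  public void configure(JobConf job) {
  }

  public void run(RecordReader<LongWritable, Text> input,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    long ordinal = 0;                          // counter shared across the split
    LongWritable key = input.createKey();
    Text value = input.createValue();
    while (input.next(key, value)) {
      output.collect(new Text(value.toString()), new LongWritable(++ordinal));
      reporter.progress();                     // keep the task from timing out
    }
  }
}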
Yep... I'm looking to generate IDs that are unique across all the MR
tasks.
Basically, I want a file that looks like:
apple 1
beta 2
cat 3
dog 4
moose 5
....
Is there a final merge task that merges all the reductions together?
If so, perhaps I could do the count in the final merge. Any idea if
the final merge is accessible?
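If there is no such hook, the fallback I can picture (purely an
untested sketch, names made up) is forcing a single reduce task with
conf.setNumReduceTasks(1), so one reducer sees every distinct word in
sorted order and can hand out sequential IDs:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch: with a single reduce task, one reducer sees every distinct
// word in sorted order, so a running counter yields globally unique
// ordinals (apple 1, beta 2, cat 3, ...).
public class OrdinalReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  private long nextId = 0;   // lives for the whole (single) reduce task

  public void reduce(Text word, Iterator<LongWritable> counts,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    output.collect(word, new LongWritable(++nextId));
  }
}

The obvious downside is that the single reducer becomes a bottleneck
for a large vocabulary.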
Thanks,
David
Dennis
Thanks,
David