David Pollak wrote:
On Nov 8, 2006, at 12:03 PM, Dennis Kubes wrote:
David Pollak wrote:
On Nov 8, 2006, at 7:41 AM, Dennis Kubes wrote:
I don't know if I completely understand what you are asking but let
me try to answer your questions.
David Pollak wrote:
Howdy,
Is there a way to store "by-product" data someplace where it can
be read? For example, as I'm iterating over a collection of
documents, I want to generate some statistics about the
collection, put those stats "someplace" that can be accessed
during future map-reduce cycles. Should I simply run a "faux"
map-reduce cycle to count the information and store it in a known
location in the DFS?
Usually you would run a MapReduce job to store intermediate results
and then another job to process aggregated or final results.
Sometimes this can be done in a single job, sometimes not. Take a
look at the Grep or WordCount examples that ship with Hadoop.
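For instance, chaining two jobs is really just two JobConf/JobClient.runJob
calls back to back; roughly something like the sketch below, where the
mapper/reducer classes and the paths are only placeholders (and the exact
input/output setters have moved around a bit between early releases):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ChainedJobs {
  public static void main(String[] args) throws Exception {
    // First job: extract and count, writing intermediate results to the DFS.
    JobConf first = new JobConf(ChainedJobs.class);
    first.setJobName("extract-counts");
    first.setMapperClass(ExtractMapper.class);      // placeholder mapper
    first.setReducerClass(CountReducer.class);      // placeholder reducer
    first.setOutputKeyClass(Text.class);
    first.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(first, new Path("documents"));  // placeholder paths
    FileOutputFormat.setOutputPath(first, new Path("counts"));
    JobClient.runJob(first);       // blocks until the first job finishes

    // Second job: read the first job's output and produce the final results.
    JobConf second = new JobConf(ChainedJobs.class);
    second.setJobName("aggregate");
    second.setMapperClass(AggregateMapper.class);   // placeholder mapper
    second.setReducerClass(AggregateReducer.class); // placeholder reducer
    second.setOutputKeyClass(Text.class);
    second.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(second, new Path("counts"));
    FileOutputFormat.setOutputPath(second, new Path("final"));
    JobClient.runJob(second);
  }
}

Since runJob blocks, the second job only starts once the intermediate
output is sitting in the DFS where it can be read.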
Yep. I'm able to chain jobs together. In one case, I am counting
URLs and Noun Phrases for documents retrieved during a certain run.
In order to normalize the URLs and NP counts, I want to divide by
the total number of URLs or NPs for that time period. I seem to
have 2 choices:
1 - I can aggregate the counts during the Map/Reduce task that culls
the URLs and NPs
2 - I can run another Map/Reduce task on the URL and NP sets to
count the number of documents.
If I do the latter, it's another iteration over the data set, which
seems expensive. Is #2 the best choice?
If you can aggregate them in a single job I would do that. On a
second job it would have to split, copy, process (even if it is only
an IdentityMapper), and more importantly sort. Sorting time,
depending on the amount of data, could be substantial. Usually it is
not; it's just something to consider. I would use the second-job
approach if you needed to aggregate with other data to perform the
counts. For example, you count words in one job, noun phrases in
another, and then aggregate both by URL in a third.
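A rough sketch of the single-job route: have the map emit (URL or NP, 1)
and let a combiner pre-sum on the map side so there is far less data to
sort and copy, with the reducer producing the per-key totals in the same
pass. The mapper class and paths below are placeholders; LongSumReducer is
the stock summing reducer in the mapred lib package.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;

public class SingleJobCounts {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(SingleJobCounts.class);
    job.setJobName("url-np-counts");
    job.setMapperClass(UrlNpMapper.class);        // placeholder: emits (url or NP, 1)
    job.setCombinerClass(LongSumReducer.class);   // pre-sum map output, less to sort/copy
    job.setReducerClass(LongSumReducer.class);    // final per-key totals
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(job, new Path("segments"));  // placeholder paths
    FileOutputFormat.setOutputPath(job, new Path("counts"));
    JobClient.runJob(job);
  }
}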
I think I have a solution. If I have a "count" key, each reduce task
can aggregate the document-count information under that key. If I
write the information out to a MapFile, I can then look up the value
for the key. If I give the key a value that will not occur in
"nature" (e.g., "\0count\0") then there's no likelihood of mixing
with a URL or NP. Does this sound reasonable
or am I smoking crack?
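Roughly, the read-back I have in mind would be something like this,
assuming the counting job wrote its output with MapFileOutputFormat and
stored the total as a LongWritable (the path here is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class CountLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Each reduce output is a MapFile directory when MapFileOutputFormat is used.
    MapFile.Reader reader = new MapFile.Reader(fs, "counts/part-00000", conf);
    try {
      // Look up the sentinel key, which should never collide with a real URL or NP.
      LongWritable total = new LongWritable();
      if (reader.get(new Text("\0count\0"), total) != null) {
        System.out.println("total = " + total.get());
      }
    } finally {
      reader.close();
    }
  }
}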
Well, each split is processed independently of the others, so I don't
know how you would get the count key to begin with; or at least your
count keys would have duplicates, assuming each split starts at 0 or 1
and you use a MapRunner to pass a shared variable into each Map task.
That being said, you may want the key to be the same URL or NP if you
are aggregating it later. To answer your question, though: yes, a value
that is not the same as a URL or NP would not mix in terms of the
reduce values, but you probably wouldn't want to use those same keys in
a job that has URLs or NPs as keys.
Is there a way to map a collection of words or documents to
associated numbers so that indexing could be based on the word
number and/or document number rather than the actual word and actual
URL? Because the reduce tasks take place in separate processes,
it seems that there's no way to coordinate the ordinal counting.
If you are talking about index document ids, then you would need to
read the index and map URL to document id, and then a second job
would map id to whatever else by URL. If you want to count each word
globally across all tasks and splits, you can coordinate within a
split by using a MapRunner, but across splits I don't know of a way
to do that.
Yep... I'm looking to generate a unique ID across all the MR tasks.
Basically, I want a file that looks like:
apple 1
beta 2
cat 3
dog 4
moose 5
....
Is there a final merge task that merges all the reductions
together? If so, perhaps I could do the count in the final merge.
Any idea if the final merge is accessible?
Thanks,
David
You might be able to do something like that, but it would have to be
split-independent because the ordinal processing would be unknown, or
rather parallel. There isn't a final merge task, but you could use
the output of the merges as the input to another job and set the
number of reduce tasks to 1. Maybe something like mapping word and
processing time, using a string-formatted processing time like
YYYYMMDDHHMISS as the key of the first output; in the second job the
records would be ordered by that key and you could output a number in
the final reduce.
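Wiring the second job up would look roughly like this; NumberingReducer
is a placeholder for whatever emits the word/number pairs, and I'm
assuming the first job wrote a SequenceFile of (time, word) pairs so an
IdentityMapper can pass them straight through:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class NumberingJob {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(NumberingJob.class);
    job.setJobName("number-words");
    // Input is the output of the first job: (YYYYMMDDHHMISS, word) pairs.
    job.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(job, new Path("word-times"));   // placeholder path
    FileOutputFormat.setOutputPath(job, new Path("word-ids"));    // placeholder path
    job.setMapperClass(IdentityMapper.class);      // nothing to do on the map side
    job.setReducerClass(NumberingReducer.class);   // hypothetical: emits (word, number)
    job.setNumReduceTasks(1);                      // single reduce task, single part-00000
    job.setMapOutputKeyClass(Text.class);          // the time string the sort runs on
    job.setMapOutputValueClass(Text.class);        // the word
    job.setOutputKeyClass(Text.class);             // the word
    job.setOutputValueClass(LongWritable.class);   // its assigned number
    JobClient.runJob(job);
  }
}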
What if I set NumReduceTasks to 1? Would that mean there would be
1 reduce task for the whole data set and I could rely on an instance
variable in the reduce class for incrementing?
Thanks,
David
Setting reduce tasks to one in the final job just means that there is a
single split and a single output file, part-00000, as opposed to
part-00000 through part-xxxxx; but no, you couldn't rely on an instance
variable in the reduce class. Each split (TaskRunner) runs in its own
VM. If you were doing the processing inside the map tasks of the
final job, assuming 1 reduce task and hence 1 split, you could use an
instance variable on a custom MapRunner that is passed into each
Map task. It would have to be a reference (an object), though, for the
changes to be seen by other Map tasks in the same split.
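Very roughly, and written against the classic mapred API as it looks in
later releases (all the class names here are made up, and the count is
still only unique within its split), the idea is something like:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical runner that owns a shared counter object and hands the same
// reference to the mapper it drives, so every map() call in this split sees
// the updates. The count is only unique within the split, not across the job.
public class SharedCounterRunner implements MapRunnable<LongWritable, Text, Text, LongWritable> {

  // Hypothetical mapper that numbers records using the shared counter.
  public static class OrdinalMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {

    private long[] counter;

    void setCounter(long[] counter) { this.counter = counter; }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      counter[0]++;                       // visible to every other call in this split
      output.collect(new Text(value.toString()), new LongWritable(counter[0]));
    }
  }

  private final long[] counter = new long[] { 0 };   // the shared reference
  private JobConf job;

  public void configure(JobConf job) { this.job = job; }

  public void run(RecordReader<LongWritable, Text> input,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    OrdinalMapper mapper = new OrdinalMapper();
    mapper.configure(job);
    mapper.setCounter(counter);           // same object for the life of the split
    LongWritable key = input.createKey();
    Text value = input.createValue();
    try {
      while (input.next(key, value)) {
        mapper.map(key, value, output, reporter);
      }
    } finally {
      mapper.close();
    }
  }
}

You would wire the runner in with job.setMapRunnerClass(SharedCounterRunner.class).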
Dennis