David Pollak wrote:

On Nov 8, 2006, at 12:03 PM, Dennis Kubes wrote:

David Pollak wrote:

On Nov 8, 2006, at 7:41 AM, Dennis Kubes wrote:

I don't know if I completely understand what you are asking but let me try to answer your questions.

David Pollak wrote:
Howdy,

Is there a way to store "by-product" data someplace where it can be read? For example, as I'm iterating over a collection of documents, I want to generate some statistics about the collection, put those stats "someplace" that can be accessed during future map-reduce cycles. Should I simply run a "faux" map-reduce cycle to count the information and store it in a known location in the DFS?
Usually you would run a MapReduce job to store intermediate results and then another job to process aggregated or final results. Sometimes this can be done in a single job, sometimes not. Take a look at the Hadoop Grep or WordCount examples to see how these jobs are put together.
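
For reference, the heart of the WordCount example looks roughly like this (this is the org.apache.hadoop.mapred API; the exact class and method signatures have shifted a bit between Hadoop releases, so treat it as a sketch rather than something to paste in as-is):

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Map: emit (word, 1) for every word in the input line.
  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word.
  public static class ReduceClass extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}

The Grep example is structured the same way; only the map and reduce logic differ.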

Yep. I'm able to chain jobs together. In one case, I am counting URLs and Noun Phrases for documents retrieved during a certain run. In order to normalize the URL and NP counts, I want to divide by the total number of URLs or NPs for that time period. I seem to have two choices:

1 - I can aggregate the counts during the Map/Reduce task that culls the URLs and NPs.
2 - I can run another Map/Reduce task on the URL and NP sets to count the number of documents.

It seems that if I do the latter, it's another iteration over the data set, which seems expensive. Is #2 the best choice?
If you can aggregate them in a single job I would do that. A second job would have to split, copy, process (even if it is only an IdentityMapper), and, more importantly, sort. Sorting time, depending on the amount of data, could be substantial. Usually it is not; it's just something to consider. I would use the second-job approach if you needed to aggregate with other data to perform the counts. For example, you count words in one job, noun phrases in another, and then aggregate both by url in a third.
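
If you do go with a second job, the driver side is just two runJob calls back to back, something like this (all the paths and the commented-out class names are purely illustrative, and the path-setting calls have moved around between Hadoop versions):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoPassDriver {
  public static void main(String[] args) throws Exception {
    // First job: cull the URLs/NPs out of the documents and count them per key.
    JobConf first = new JobConf(TwoPassDriver.class);
    first.setJobName("per-key-counts");
    FileInputFormat.setInputPaths(first, new Path("docs"));        // illustrative path
    FileOutputFormat.setOutputPath(first, new Path("tmp/per-key")); // illustrative path
    first.setOutputKeyClass(Text.class);
    first.setOutputValueClass(LongWritable.class);
    // first.setMapperClass(UrlNpMapper.class);   // your mapper
    // first.setReducerClass(SumReducer.class);   // your reducer
    JobClient.runJob(first);

    // Second job: read the first job's output and aggregate the totals.
    JobConf second = new JobConf(TwoPassDriver.class);
    second.setJobName("totals");
    FileInputFormat.setInputPaths(second, new Path("tmp/per-key"));
    FileOutputFormat.setOutputPath(second, new Path("tmp/totals"));
    second.setOutputKeyClass(Text.class);
    second.setOutputValueClass(LongWritable.class);
    second.setNumReduceTasks(1);                   // single output file
    // second.setReducerClass(TotalReducer.class); // your aggregation reducer
    JobClient.runJob(second);
  }
}

The second job just reads whatever the first job wrote to the DFS, so nothing has to be carried in memory between them.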


I think I have a solution. If I have a "count" key, I can aggregate the document count information under that key, and each reduce task will add its counts to that key. If I write the information out to a mapped file, I can look up the value of the key. If I give the key a value that will not occur in "nature" (e.g., "\0count\0"), then there's no likelihood of it mixing with a URL or NP. Does this sound reasonable or am I smoking crack?
Well, each split is independent of the others in terms of processing, so I don't know how you would get the count key to begin with; or, at the very least, your count keys would have duplicates, assuming each split starts at 0 or 1 and you use a MapRunner to pass a shared variable into each Map task. That being said, you may want the key to be the same URL or NP if you are aggregating on it later. To answer your question, though: yes, a key that cannot be a URL or NP would not mix in with the reduce values, but you probably wouldn't want to use those same keys in a job that has URLs or NPs as keys.
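
Just to make the sentinel-key idea concrete, the map side of what you are describing would look roughly like this (the record parsing and the field layout are made up; only the reserved key matters):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Rough sketch of the sentinel-key idea: every record also contributes a 1
// under a reserved key that no real URL can collide with, so the reducer for
// that key ends up holding the total document count.
public class SentinelCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final Text COUNT_KEY = new Text("\0count\0");
  private static final LongWritable ONE = new LongWritable(1);

  public void map(LongWritable key, Text value,
      OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    // Made-up parsing: assume the record's first tab-separated field is the URL.
    String url = value.toString().split("\t")[0];
    output.collect(new Text(url), ONE);   // per-URL count
    output.collect(COUNT_KEY, ONE);       // total document count
  }
}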



Is there a way to map a collection of words or documents to associated numbers so that indexing could be based on the word number and/or document number rather than the actual word and actual URL? Because the reduce tasks run in separate processes, it seems that there's no way to coordinate the ordinal counting.
If you are talking about the index document id, then you would need to read the index and map url to document id, and then a second job would map id to whatever else by url. If you want to count each word globally across all tasks and splits, you can coordinate within a split by using a MapRunner, but across splits I don't know of a way to do that.
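
If it helps, the url-to-docid part is just a scan over the index with an IndexReader, roughly like this (the "url" field name is the Nutch convention, and the rest is only a sketch against the Lucene API of that era):

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

// Illustrative: dump "url <tab> docid" lines for every live document in an index.
public class DumpDocIds {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(args[0]);   // path to the index directory
    try {
      for (int id = 0; id < reader.maxDoc(); id++) {
        if (reader.isDeleted(id)) continue;
        Document doc = reader.document(id);
        System.out.println(doc.get("url") + "\t" + id);
      }
    } finally {
      reader.close();
    }
  }
}

You could write that mapping out to a file and use it as a side input to the later jobs.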


Yep... I'm looking to generate a unique ID across all the MR tasks. Basically, I want a file that looks like:
apple 1
beta 2
cat 3
dog 4
moose 5
....

Is there a final merge task that merges all the reductions together? If so, perhaps I could do the count in the final merge. Any idea if the final merge is accessible?

Thanks,

David

You might be able to do something like that, but it would have to be split-independent, because the ordinal processing would be unknown, or rather parallel. There isn't a final merge task, but you could use the reduce outputs as the input to another job and set the number of reduce tasks to 1. Maybe something like mapping word and processing time, using a string-formatted processing time like YYYYMMDDHHMISS as the key in the first job's output; then in the second job everything would be ordered by that key and you could output a number in the final reduce.
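
Very roughly, the first job's output side of that idea might look like the mapper below (the timestamp format, and the assumption of one word per line of input, are only for illustration):

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Illustrative mapper for the first job: key each word by a sortable
// processing-time string so the second job's sort orders records by time.
public class TimeKeyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss");

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String word = value.toString().trim();          // assumes one word per line
    String stamp = fmt.format(new Date());          // "processing time" for this record
    output.collect(new Text(stamp), new Text(word));
  }
}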

What if I set NumReduceTasks to 1? Would that mean there would be one reduce task for the whole data set, and that I could rely on an instance variable in the reduce class for incrementing?

Thanks,

David

Setting reduce tasks to one in the final job just means that there is a single split and a single output file (part-00000 as opposed to part-00000 through part-xxxxx), but no, you couldn't rely on an instance variable in the reduce class. Each split (TaskRunner) is in its own VM. If you were doing the processing inside the map tasks of the final job, assuming 1 reduce task and hence 1 split, you could use an instance variable on a custom MapRunner that would be passed into each Map task. It would have to be a reference (an object), though, for the changes to be seen by the other Map tasks in the same split.
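
Sketched out, the MapRunner idea looks something like this (all the class names here are made up, and the MapRunnable/RecordReader signatures differ a bit between Hadoop versions, so take it only as an illustration of the shared-reference trick):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// A custom MapRunnable drives every map() call for one split, so an object it
// creates can be handed to the Mapper and shared by all map() calls in that
// split. Nothing is shared across splits, since each split runs in its own VM.
// Wired into the job with job.setMapRunnerClass(SharedCounterMapRunner.class).
public class SharedCounterMapRunner
    implements MapRunnable<LongWritable, Text, Text, LongWritable> {

  // Hypothetical mapper that numbers records using the counter it was handed.
  public static class NumberingMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    private final long[] counter;   // reference shared with the MapRunner

    public NumberingMapper(long[] counter) {
      this.counter = counter;
    }

    public void map(LongWritable key, Text value,
        OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      counter[0]++;                 // ordinal within this split only
      output.collect(new Text(value.toString()), new LongWritable(counter[0]));
    }
  }

  private JobConf job;

  public void configure(JobConf job) {
    this.job = job;
  }

  public void run(RecordReader<LongWritable, Text> input,
      OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    long[] counter = new long[] { 0 };                 // the shared reference
    NumberingMapper mapper = new NumberingMapper(counter);
    mapper.configure(job);
    try {
      LongWritable key = input.createKey();
      Text value = input.createValue();
      while (input.next(key, value)) {
        mapper.map(key, value, output, reporter);
      }
    } finally {
      mapper.close();
    }
  }
}

The counter lives in the MapRunner's run() method, so every map() call in that split sees the same array; a second split gets its own VM and its own counter starting at zero.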

Dennis
