There is currently no way to disable sort/shuffle. You can do many things to alleviate the issues you are having with it, though, one of which you mentioned below. Is there a reason you are allowing each of your keys to be unique? If it is truly because you do not care about the key, then create an even distribution of keys so that many values aggregate under each one.
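One way to create that even distribution is to hash each record's natural key into a fixed number of buckets in the mapper, so the shuffle has far fewer unique keys to sort. This is only a sketch of the idea: the bucket count and the choice of the whole input line as the hashed value are illustrative assumptions, not details from the thread.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Sketch: collapse an unbounded set of unique keys into NUM_BUCKETS
 * synthetic keys so the sort/shuffle has something to aggregate.
 */
public class BucketingMapper
        extends Mapper<LongWritable, Text, IntWritable, Text> {

    private static final int NUM_BUCKETS = 100; // assumption: tune to reducer count

    private final IntWritable bucket = new IntWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Mask off the sign bit so the bucket index is always non-negative,
        // then take the remainder to pick one of NUM_BUCKETS keys.
        bucket.set((line.toString().hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS);
        context.write(bucket, line);
    }
}
```

With 100 buckets and 100 reducers, each reducer sees roughly one bucket's worth of data and writes one reasonably sized output file.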
On a side note, what is the actual stack trace you are getting when the reducers fail, and what is the reducer doing? I think a reduce phase is the best way to go for your use case, as long as the job time meets your SLA, so we need to figure out why the job is failing.

Matt

-----Original Message-----
From: Peng, Wei [mailto:[email protected]]
Sent: Tuesday, September 20, 2011 10:44 AM
To: [email protected]
Subject: RE: how to set the number of mappers with 0 reducers?

The input is 9010 files (each 500MB), and I would estimate the output to be around 50GB. My Hadoop job failed with an out-of-memory error (with 66 reducers). I guess that the key from each mapper output is unique, so the sorting would be memory-intensive. Although I can set a different key to reduce the number of unique keys, I am curious whether there is a way to disable sorting/shuffling.

Thanks,
Wei

-----Original Message-----
From: GOEKE, MATTHEW (AG/1000) [mailto:[email protected]]
Sent: Tuesday, September 20, 2011 8:34 AM
To: [email protected]
Subject: RE: how to set the number of mappers with 0 reducers?

Amusingly, this is almost the same question that was asked the other day :)

<quote from Owen O'Malley>
There isn't currently a way of getting a collated, but unsorted, list of key/value pairs. For most applications, the in-memory sort is fairly cheap relative to the shuffle and other parts of the processing.
</quote>

If you know that you will be filtering out a significant amount of data, to the point where the shuffle becomes trivial, then the impact of a reduce phase using an identity reducer should be minimal. It is either that, or aggregate as much data as you feel comfortable with into each split and have one file per map. How much data, as a percentage of the input, are you assuming will be output from each of these maps?
Matt

-----Original Message-----
From: Peng, Wei [mailto:[email protected]]
Sent: Tuesday, September 20, 2011 10:22 AM
To: [email protected]
Subject: RE: how to set the number of mappers with 0 reducers?

Thank you all for the quick reply! I think I was wrong: it has nothing to do with the number of mappers, because each input file is 500MB, which is not too small relative to the 64MB block size. The problem is that the output from each mapper is too small. Is there a way to combine the output of several mappers? Setting the number of reducers to 1 might produce a very large file. Can I set the number of reducers to 100, but skip sorting, shuffling, etc.?

Wei

-----Original Message-----
From: Soumya Banerjee [mailto:[email protected]]
Sent: Tuesday, September 20, 2011 2:06 AM
To: [email protected]
Subject: Re: how to set the number of mappers with 0 reducers?

Hi,

If you want all your map outputs in a single file, you can use an IdentityReducer and set the number of reducers to 1. This ensures that all your mapper output goes to that one reducer, which writes it into a single file.

Soumya

On Tue, Sep 20, 2011 at 2:04 PM, Harsh J <[email protected]> wrote:
> Hello Wei!
>
> On Tue, Sep 20, 2011 at 1:25 PM, Peng, Wei <[email protected]> wrote:
> (snip)
> > However, the output from the mappers results in many small files (each is
> > ~50k, while the block size is 64M, so it wastes a lot of space).
> >
> > How can I set the number of mappers (say 100)?
>
> What you're looking for is to 'pack' several files per mapper, if I
> get it right.
>
> In that case, you need to check out the CombineFileInputFormat. It can
> pack several files per mapper (with some degree of locality).
>
> Alternatively, pass a list of files (as a text file) as your input,
> and have your Mapper logic read them one by one. This way, if you
> divide 50k filenames over 100 files, you will get 100 mappers as you
> want - but at the cost of losing almost all locality.
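The CombineFileInputFormat suggestion above can be sketched as follows, using `CombineTextInputFormat` (the concrete text-file subclass available in newer Hadoop releases) rather than subclassing the abstract `CombineFileInputFormat` directly. The 512MB split cap is an illustrative assumption, not a value from the thread.

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

/**
 * Sketch: pack many small text files into fewer, larger input splits so
 * each mapper processes multiple files instead of one tiny file each.
 */
public class CombinedInputConfig {
    public static void configure(Job job) {
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at ~512MB; Hadoop will group small files
        // (preferring same-node, then same-rack files) up to this limit.
        CombineTextInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);
    }
}
```

Lowering the max split size yields more mappers; raising it yields fewer, larger ones, which is how you would steer toward a target of roughly 100 map tasks.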
> > If there is no way to set the number of mappers, the only way to solve
> > it is to "cat" some files together?
>
> Concatenating is an alternative, if affordable - yes. You can lower
> the file count (down from 50k) this way.
>
> --
> Harsh J
